Asrul's Blog

Imbalanced Data Set

Actually, what is a data set?

What is a balanced data set? And…

What is an imbalanced data set?

Why is it important? Last…

What is the problem?

1) A data set consists of a number of samples (or data points), each with a number of inputs (variables) that are either categorical or numerical. The output in the data set is usually called a class (or label) and carries some meaning. Data sets can be divided into two categories: small and large.

2) A balanced data set occurs when every class holds the same share of the samples (class1:class2 = 0.5:0.5).

3) In some data sets, however, one class (the majority class) has many more samples than the other (the minority class). The class distribution in that case is not balanced and can be highly imbalanced (class1:class2 = 0.95:0.05). In this area, the majority class is less significant than the minority class.
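To make the two ratios above concrete, here is a minimal sketch in Python that measures the class distribution of a list of labels. The function name `class_distribution` and the label values are only assumptions for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: 95 samples of class1 (majority), 5 of class2 (minority).
labels = ["class1"] * 95 + ["class2"] * 5
dist = class_distribution(labels)
print(dist)
```

A result close to 0.5:0.5 means the data set is balanced; a result like the one here means it is highly imbalanced.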

4) Sorry for confusing you all about data sets. So why is this important? Why do we need to study this kind of data? It matters in areas such as pattern recognition, prediction and classification. It also needs more research attention for multidisciplinary applications in medicine, manufacturing, engineering, business, security and communication. In medicine, suppose the available data is a collection of patients who had a medical check-up for cancer. For most of them no cancer was found, and only a few of them have cancer. The ratio of patients with cancer to patients without cancer is 50:1000. If 10 patients come to a doctor for a cancer check-up, the doctor can use software that is able to predict how many of them have cancer and how many are safe. With this process, the doctor gets an early indication while waiting for the real analysis from the lab. In this scenario, patients who have cancer fall into the minority class and the others into the majority class. So we need to develop a classifier that can predict fairly for both classes, and most importantly for the minority class.
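The 50:1000 ratio also shows why plain accuracy is not enough to judge such a classifier. The sketch below is only an illustration: the labels "cancer" and "healthy" and the helper functions are assumptions, and the "classifier" is a trivial one that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the true `positive` samples that were predicted as `positive`."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return found / actual if actual else 0.0


# 50 patients with cancer (minority) and 1000 without (majority).
y_true = ["cancer"] * 50 + ["healthy"] * 1000

# A trivial classifier that always predicts the majority class.
y_pred = ["healthy"] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="cancer")
print(f"accuracy = {acc:.3f}, recall on cancer class = {rec:.3f}")
```

The accuracy looks high because 1000 of the 1050 patients are in the majority class, but not a single cancer patient is detected. This is why the classifier has to be judged on the minority class as well.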

5) So, what is the problem? Several problems with imbalanced data sets are listed in the literature, and almost all researchers agree on them. It is difficult to learn from an imbalanced data set because the classifier's output tends to favor the majority class over the minority class. A conventional classifier (e.g. neural network, Bayesian network, fuzzy set, support vector machine, probabilistic method, decision tree, etc.) needs some modification to its structure or internal algorithm, or to the data itself through a re-sampling technique or another relevant technique.
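As one example of modifying the data itself, here is a minimal sketch of random oversampling, a simple re-sampling technique that duplicates minority samples at random until every class is as large as the biggest one. It is only one possible approach, and the function name `random_oversample`, the `seed` parameter and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of the smaller classes until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw extra samples with replacement; the largest class draws none.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: 95 majority samples and 5 minority samples.
samples = list(range(100))
labels = ["majority"] * 95 + ["minority"] * 5

new_samples, new_labels = random_oversample(samples, labels)
balanced_counts = Counter(new_labels)
print(balanced_counts)
```

Oversampling only repeats existing minority samples, so it adds no new information and can lead to overfitting. It should be applied to the training data only, never to the data used for testing.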


8 January, 2010 - Posted by | Engineering
