Best practices?

artvolk
edited November 5 in Community Q&A
Good day!

I've finally found RapidMiner -- the software I was looking for. The declarative approach to editing configurations is great! I'm mostly a programmer, but I want to brush up on my math :)

I'd like to ask whether there are some best practices for the case I have:

I have 116 examples, each with one numeric label (or nominal class) and 67 attributes (float values).
The label is a country rating derived from expert opinion, and the attributes describe the country's
activities, such as 'Extent of business Internet use', 'Internet users', etc.

The goal I want to achieve:
1. Train a model on that data and get some error estimates.
2. Query the model with hand-generated values to answer questions
  like: what should the rating be if we increase the attribute
  'Extent of business Internet use' and decrease some other one?

Is this really possible?

I performed cross-validation with kNN and SVM learners and got an accuracy of around 66% (for the nominal labels).
How can I try to achieve better accuracy? Maybe by feature selection or some other data preprocessing?

Are there any best practices for such a case?

Answers

  • steffen
    Hello

    Uuh, this data situation is quite hard. Here are some thoughts and ideas:
    1. For validation I recommend not running CV just once, but at least 10 times with different random seeds, to get a more reliable estimate (see the sketch below).
    2. As far as I understand your goal, you want to learn more about the influence of the different attributes on the label rather than classify new data (is this correct?). For this task I suggest:
    • Use a so-called symbolic classifier like a Decision Tree or a Rule Learner. The resulting model is much easier for humans to understand.
    • Calculate attribute weights (I recommend InfoGainRatio) to learn how much "decision power" each attribute has with regard to the label. But be careful not to remove too many features: since you have such a small amount of data, you risk overfitting, i.e. the model will generalize poorly to new, unseen data from the same domain.
    • Try some clustering algorithms to see whether there are structures in the data you have not noticed before. Splitting the data by cluster and analysing each part separately may lead to new insights.
    I am just curious: in my answer I have assumed that your label has around 10 different values. Is this correct? How many different values does the label have?
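    If you want to try point 1 outside of RapidMiner, here is a minimal scikit-learn sketch of the same idea (repeating stratified 10-fold CV with different seeds and looking at the mean and spread of the accuracy). The data below is just a random placeholder with roughly the shape you describe:

    # Sketch: repeat 10-fold cross-validation with different random seeds
    # and report the mean accuracy and its spread (assumes scikit-learn).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder data with roughly the shape described in the question.
    X, y = make_classification(n_samples=116, n_features=67, n_classes=5,
                               n_informative=10, random_state=0)

    scores = []
    for seed in range(10):  # 10 repetitions, each with a different shuffling
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(KNeighborsClassifier(n_neighbors=5),
                                      X, y, cv=cv))

    print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")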

    hope this was helpful

    Steffen
  • artvolk
    Thank you for the quick response!
    steffen wrote:

    As far as I understand your goal, you want to learn more about the influence of the different attributes on the label rather than classify new data (is this correct?)
    Yes, that's what I mean. More details about my data: I have the one example set I described above
    and multiple sets of nominal classifications and numeric labels for it.

    But for now I want to achieve some results for at least one nominal label set and one numeric one. So I have two cases:

    nominal label + 67 numeric attributes
    numeric label + 67 numeric attributes

    In both cases the attributes are the same.
    steffen wrote:
    • Use a so-called symbolic classifier like a Decision Tree or a Rule Learner. The resulting model is much easier for humans to understand.
    Will it work well with my 67 float attributes? What is the preferred learner if the label is numeric?
    steffen wrote:

    Since you have such a small amount of data, you risk overfitting, i.e. the model will generalize poorly to new, unseen data from the same domain.
    Yes, you are completely right -- is there any way to estimate that risk?
    steffen wrote:

    I am just curious: in my answer I have assumed that your label has around 10 different values. Is this correct? How many different values does the label have?
    For the case where the label is nominal I have 5 classes (distinct label values). Is this too few?
  • steffen
    Hello again
    Will it work well with my 67 float attributes? What is the preferred learner if the label is numeric?
    I think in this case it is rather likely that SVM, kNN or regression models will outperform any tree. I mainly suggested trees to get a more understandable model, not primarily to create a good classifier. As you mentioned, you want to learn more about the dataset, so I suggested "descriptive" models rather than predictive ones. Besides this:
    Among trees I prefer the classic C4.5 algorithm, called "Decision Tree" in RapidMiner and J-48 in Weka (which is also available within RapidMiner). For a numeric label, try regression models => the operator "ClassificationByRegression" with a regression operator of your choice as the inner operator (Linear, Logistic, ...). With such a small amount of data, I would simply try which one works best.
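    If you want to play with both cases outside of RapidMiner, here is a minimal scikit-learn sketch (note that scikit-learn's tree is CART rather than C4.5, and the data below is a random placeholder, not your example set):

    # Sketch: a shallow tree for the nominal label and a linear model for
    # the numeric label (scikit-learn; X, y_nominal, y_numeric are
    # placeholders standing in for the 116 x 67 example set).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(116, 67))                 # 67 numeric attributes
    y_nominal = np.repeat(np.arange(5), 24)[:116]  # 5 rating classes
    y_numeric = rng.normal(size=116)               # numeric rating

    # Nominal label: a shallow tree is easier to read and less prone to overfit.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_nominal)
    print(export_text(tree))                       # human-readable rules

    # Numeric label: ordinary linear regression, scored (R^2) via 10-fold CV.
    print(cross_val_score(LinearRegression(), X, y_numeric, cv=10).mean())
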
    Yes, you are completely right -- is there any way to estimate that risk?
    The estimate is obtained by validation. First, the higher the variance of the accuracy estimated via CV, the weaker the generalisation power of the model is likely to be. Second, another method of estimating the generalisation power goes like this:
    Sample a part of the data before doing anything, for example 20 examples. Then use the rest of the data to create whatever fancy model you like, perform feature selection et cetera, and validate via CV to gain a first estimate of accuracy. Then, after you have created the model you think is best, apply it to the held-out sample. This gives an estimate of how accurate your model will be on unseen data.
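    A minimal scikit-learn sketch of that procedure (hold out a small sample first, do all tuning on the rest, and score the holdout exactly once); the data is again a random placeholder:

    # Sketch: a one-shot holdout estimate of generalisation (scikit-learn).
    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(116, 67))
    y = np.repeat(np.arange(5), 24)[:116]    # 5 balanced placeholder classes

    # Set aside ~20 examples; they must not be used for any tuning decision.
    X_dev, X_hold, y_dev, y_hold = train_test_split(
        X, y, test_size=20, stratify=y, random_state=0)

    # Build and validate the model on the development part only.
    model = SVC(kernel="rbf", C=1.0, gamma="scale")
    print("CV estimate:", cross_val_score(model, X_dev, y_dev, cv=10).mean())

    # Final estimate on truly unseen data, done exactly once at the end.
    model.fit(X_dev, y_dev)
    print("holdout estimate:", model.score(X_hold, y_hold))
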
    Besides this, I think there are methods to calculate the error directly for regression models, but I do not know how, or whether this is possible in RapidMiner.
    For the case where the label is nominal I have 5 classes (distinct label values). Is this too few?
    No, no. I was just guessing the number. Indeed, a smaller number of classes is better (as long as there are still enough examples for every class), since it increases the amount of information available for each class.

    Some words at the end:
    As I mentioned above, since you want to learn more about the current data set, I recommend that you...
    • calculate attribute weights (a small sketch of this and of the correlation matrix follows the list)
    • look at each class separately to check the distributions of the important attributes
    • calculate the Correlation Matrix
    • perform cluster analysis to see if there are more structures...
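    A rough sketch of the first and third points with scikit-learn and pandas; note that I use mutual information as a stand-in for InfoGainRatio, and the data frame below is a random placeholder:

    # Sketch: rank attributes by mutual information with the label and
    # compute the correlation matrix (scikit-learn + pandas).
    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    data = pd.DataFrame(rng.normal(size=(116, 67)),
                        columns=[f"attr_{i}" for i in range(67)])
    label = pd.Series(rng.integers(0, 5, size=116), name="rating")

    # "Decision power" of each attribute with respect to the label.
    weights = pd.Series(mutual_info_classif(data, label, random_state=0),
                        index=data.columns).sort_values(ascending=False)
    print(weights.head(10))

    # Pairwise correlations between the numeric attributes.
    print(data.corr().round(2))
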
    Long text, gotta go to sleep

    Hope I could help you; in fact I am still a struggling student, but I try to share what I already know :)
    Sometimes I feel like the little doorman, catching the less complicated questions and keeping the visitors busy until the great Dark Masters of Data Mining (aka Ingo and Tobias) arrive bringing real wisdom  :D

    greetings

    Steffen
  • artvolk
    Thank you for pointing out Decision Trees and cluster analysis!

    I finally figured out that for my task (finding hidden relations in the data) the train-and-test cycle is not the only method.

    Another stupid question: does this procedure make sense? (A rough sketch of what I mean follows the list.)

    - I take an unlabeled dataset of some integral data about countries (some economic and ecological ratings) -- about 4 numeric attributes
    - perform cluster analysis (for example using KMeans)
    - give the clusters descriptive names (for example, 'the countries I like', 'the countries I would like to live in', 'bad countries', etc.)
    - use these cluster names as labels for another, unlabeled dataset. This dataset contains the data about the countries I would like to investigate (the countries are the same as for the ratings!)
    - perform Decision Tree learning to get information gain ratios and a visualization, to see which attributes are important in making a country 'good'
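    Something like this is what I have in mind (a rough scikit-learn/pandas sketch; the two data frames, their column names and the cluster names are made-up placeholders):

    # Rough sketch of the procedure above: cluster one dataset, name the
    # clusters, use the names as labels for a second dataset, fit a tree.
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    countries = [f"country_{i}" for i in range(116)]

    # Dataset 1: ~4 integral ratings per country, used only for clustering.
    ratings = pd.DataFrame(rng.normal(size=(116, 4)), index=countries,
                           columns=["economic", "ecological", "social", "overall"])

    # Dataset 2: the 67 objective attributes to be explained.
    details = pd.DataFrame(rng.normal(size=(116, 67)), index=countries,
                           columns=[f"attr_{i}" for i in range(67)])

    # Steps 1-2: cluster the ratings (scaled first, KMeans is distance-based).
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        StandardScaler().fit_transform(ratings))

    # Step 3: give the clusters descriptive names.
    names = {0: "countries I like", 1: "countries to live in", 2: "bad countries"}
    labels = pd.Series(clusters, index=countries).map(names)

    # Steps 4-5: use the cluster names as labels for the second dataset
    # and fit a shallow decision tree to see which attributes matter.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(details, labels)
    importances = pd.Series(tree.feature_importances_, index=details.columns)
    print(importances.sort_values(ascending=False).head(5))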


  • steffen
    Hello

    As far as I can see, yes!

    However, did I get this right:
    artvolk wrote:

    use these cluster names as labels for another, unlabeled dataset. This dataset contains the data about the countries I would like to investigate (the countries are the same as for the ratings!)
    You want to create a classifier using the cluster names as labels for training? This is a widely used strategy; you are on the right track!

    I want to remark:
    artvolk wrote:

    perform cluster analysis (for example using KMeans)
    Be careful with KMeans: it finds exactly the number of clusters you tell it to look for. Not fewer, not more.
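    One common way to sanity-check the chosen number of clusters is to compare, for example, silhouette scores for several candidate values of k; a small scikit-learn sketch with placeholder data:

    # Sketch: compare silhouette scores for several candidate values of k.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(116, 4))             # placeholder for the ratings

    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))   # higher is better
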
    artvolk wrote:

    Another stupid question: does this procedure make sense?
    There are no stupid questions!  :)

    greetings

    Steffen
  • artvolk
    steffen wrote:

    However, did I get this right: you want to create a classifier using the cluster names as labels for training? This is a widely used strategy; you are on the right track!
    Just to clarify: I want to separate the countries into clusters using one unlabeled dataset, i.e. obtain class information. Then I'd like to investigate the relations between these classes (clusters) and ANOTHER dataset.

    In other words: I'd like to determine which countries I should consider good using the experts' ratings, and then try to answer the question 'why are they so good or so bad' using objective numerical data as attributes and the labels (cluster names) obtained in the first stage. I'm considering using decision trees (thanks again! :)). As far as I understand, they use information gain ratio, so the most significant attributes will be closer to the root.

    Does that sound meaningful? :)
    steffen wrote:

    I want to remark: Be careful with KMeans: it finds exactly the number of clusters you tell it to look for. Not fewer, not more.
    Yes, that's a gotcha. Is it common to reduce the attribute dimensionality to obtain a visualization? Will it help in understanding the data?
    steffen wrote:

    There are no stupid questions!  :)
    I just feel like a complete beginner in Data Mining, and I don't really like being one  :)
  • steffen
    Hello
    artvolk wrote:

    Does that sound meaningful? :)
    I think so.

    Yes, that's a gotcha. Is it common to reduce the attribute dimensionality to obtain a visualization? Will it help in understanding the data?
    Visualization is always helpful for getting a better understanding. Besides the standard visualizations (selecting 1-3 attributes and plotting them in different ways), and since you have only numerical attributes, I suggest my favourite algorithm for such a task: the ESOM
    http://databionic-esom.sourceforge.net/
    If you get annoyed by converting your data to the lrn format (http://databionic-esom.sourceforge.net/user.html#Data_files____lrn_), there is another implementation in RapidMiner that does the same thing (called SOM). (I cannot recommend that one, since the ESOM was created at my home university  :P)
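    If the ESOM tooling is too much hassle for a first look, an even simpler (and cruder) option is a plain PCA projection to 2-D, coloured by class; a small scikit-learn/matplotlib sketch with placeholder data:

    # Sketch: project the 67 numeric attributes to 2-D with PCA and colour
    # the points by class.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(116, 67))            # placeholder attributes
    y = rng.integers(0, 5, size=116)          # placeholder classes

    coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.colorbar(label="class")
    plt.show()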

    Regarding cluster validation: here is a discussion about this (Rapid-i: Universal Cluster Validation).

    hope this was helpful

    Steffen


  • artvolk
    Thank you a lot; I keep struggling with my data. Your link to the topic on approaches to comparing two clusterings helped me a lot!

    I can't find one feature in RapidMiner -- I can't attach descriptive text labels to the points being visualized in 3D. Is this possible at all?
  • IngoRM
    Hi,

    sorry for not contributing to this great discussion, but I simply have to go off-topic for a second:

    @Steffen:
    Do you know Fabian Mörchen then?

    Cheers,
    Ingo
  • artvolk
    I've performed the procedure I described and found that when I label my second dataset with the labels obtained from clustering, the classification accuracy is low -- around 66%. Does that mean my classes are divided the wrong way?

    Another off-topic question: if I'm interested in discovering dependencies between attributes and class labels, what options can I try besides calculating information gain ratios and regression methods?
  • steffen
    Hello again

    I've performed the procedure I described and found that when I label my second dataset with the labels obtained from clustering, the classification accuracy is low -- around 66%. Does that mean my classes are divided the wrong way?
    Not necessarily. Maybe the classifier you used is not able to learn the concept. Did you try other classifiers?

    Another off-topic question: if I'm interested in discovering dependencies between attributes and class labels, what options can I try besides calculating information gain ratios and regression methods?
    At this stage...
    • A covariance/correlation matrix to compare numerical attributes
    • A TransitionMatrix to compare nominal values (discretize the interesting attributes first)
    • Different standard plots (scatter plots, histograms), colored by label, to look manually for patterns (since you now know which attributes are primarily interesting); see the sketch below
    If you have some domain knowledge (or at least know what the different attributes mean, combined with some common sense), it will ease the process.
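    A small pandas/matplotlib sketch of the second and third points (the data and the attribute name "attr_0" are placeholders):

    # Sketch: cross-tabulate a discretized attribute against the label and
    # plot per-class histograms (pandas + matplotlib).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(116, 67)),
                      columns=[f"attr_{i}" for i in range(67)])
    df["label"] = rng.integers(0, 5, size=116)

    # Discretize one interesting attribute into three bins, then cross-tabulate.
    bins = pd.qcut(df["attr_0"], q=3, labels=["low", "mid", "high"])
    print(pd.crosstab(bins, df["label"]))

    # Histograms of the same attribute, one panel per class.
    df.hist(column="attr_0", by="label", bins=10)
    plt.show()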

    hope this was helpful

    Steffen
  • artvolk
    I've tried SVM with parameter optimization; the accuracy is higher -- 75%. I will keep working on it.
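    For reference, the kind of parameter optimization I ran corresponds roughly to this scikit-learn sketch (grid search over C and gamma with cross-validation; the data here is just a placeholder):

    # Sketch: grid search over SVM parameters with 10-fold cross-validation.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(116, 67))            # placeholder attributes
    y = np.repeat(np.arange(5), 24)[:116]     # 5 placeholder classes

    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(pipe, grid, cv=10).fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))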

    Thank you for all the help!