Strategy to model, then predict / impute with very sparse target attribute?
ben_h
New Altair Community Member
Please excuse the vague title. I am currently using an unsupervised SOM (self-organizing map) clustering approach to try to determine values for a target attribute that is mostly missing. I am using a SOM for several reasons I won't go into now, but I'm also open to other suggestions.
I have ~8000 observations of 10 attributes, the last of which (the target) is about 99.8% missing. It has only about 17 observed values, spread far apart. The other attributes are mostly complete, and I think I can handle their missing values simply with means and medians.
The 'typical' workflow I am aware of from Wikipedia (!) is to split the data into training (66%) and test sets, train the SOM with the training set, and then map or predict with the test set on the trained SOM. In my case I am putting the entire data set into the SOM minus the target attribute (because it's mostly missing values), and then I don't know what to do from there.
I may be on the wrong track here, but if I have <20 observations with which to 'calibrate' my model, how do I follow this strategy?
I am not a statistician and am finding it difficult to follow answers to other questions here and elsewhere, so please dumb down any response.
Answers
Hi,
your task is somewhere between quite hard and impossible from a data mining point of view: you are effectively trying to build a model from only 17 observations, which most probably won't deliver good results.
The general procedure in a case like this is the following:
- split your data into a training set with labelled data and a set with unlabelled data. In your case, the training set consists of the examples with non-missing values for the target
- in the training set, declare the target attribute as label (use Set Role)
- train a model and validate it (with X-Validation)
- apply the model on the rest of the data that does not have a value for the target
- you're done
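For readers who prefer to see the steps as code, here is a minimal sketch of the same procedure in plain Python, outside RapidMiner. Everything in it is an assumption made for illustration: the data is synthetic (8000 rows, 9 predictors, 17 labelled targets), a simple k-nearest-neighbour average stands in for "a model", and leave-one-out validation stands in for X-Validation because 17 rows are too few for larger folds. The helper names `knn_predict` and `leave_one_out_mae` are hypothetical.

```python
import math
import random
import statistics


def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training rows closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return statistics.fmean([train_y[i] for i in order[:k]])


def leave_one_out_mae(X, y, k=3):
    """Predict each labelled row from the other labelled rows; return the mean absolute error."""
    errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        errors.append(abs(knn_predict(rest_X, rest_y, X[i], k) - y[i]))
    return statistics.fmean(errors)


# Synthetic stand-in for the real data (assumption): predictors already on a
# common 0..1 scale. Real attributes should be standardised first, because
# k-NN distances are sensitive to scale.
rng = random.Random(0)
rows = [[rng.random() for _ in range(9)] for _ in range(8000)]
target = [None] * len(rows)
for i in rng.sample(range(len(rows)), 17):
    target[i] = sum(rows[i]) + rng.gauss(0, 0.1)

# Step 1: split into labelled (training) and unlabelled rows.
labelled = [i for i, t in enumerate(target) if t is not None]
unlabelled = [i for i, t in enumerate(target) if t is None]
X_lab = [rows[i] for i in labelled]
y_lab = [target[i] for i in labelled]

# Steps 2-3: "train" and validate on the labelled rows only.
model_mae = leave_one_out_mae(X_lab, y_lab, k=3)

# Baseline for comparison: always predict the mean of the other labelled rows.
baseline_mae = statistics.fmean([
    abs(statistics.fmean(y_lab[:i] + y_lab[i + 1:]) - y_lab[i])
    for i in range(len(y_lab))
])
print(f"leave-one-out MAE: model={model_mae:.3f}, mean baseline={baseline_mae:.3f}")

# Step 4: apply the model to the rows whose target is missing.
imputed = list(target)
for i in unlabelled:
    imputed[i] = knn_predict(X_lab, y_lab, rows[i], k=3)
```

Comparing the model's error with the mean baseline is a cheap sanity check: if the model does not clearly beat "always predict the average", the imputed values carry little information.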
However, as stated before, with only 17 observations you will get a model, but you won't get a good model. You should really try to get more training data!
Best regards,
Marius