Strategy to model, then predict / impute with very sparse target attribute?

Please excuse the vague title. I am currently using an unsupervised SOM (self-organizing map) clustering approach to try to determine values for a target attribute that is mostly missing. I am using a SOM for several reasons I won't go into now, but I'm also open to other suggestions.

I have ~8000 observations of 10 attributes, the last of which (the target) is about 99.8% missing: it has only about 17 observed values, quite spread apart. The other attributes are mostly complete, and I think I can manage their missing values simply with means and medians.

The 'typical' workflow I am aware of from Wikipedia (!) is to split the data into training (66%) and test sets, train the SOM on the training set, and then map the test set onto the trained SOM to predict. In my case I am putting the entire data set into the SOM minus the target attribute (because it's mostly missing values), and then I don't know what to do from there.

I may be on the wrong track here, but if I have <20 observations with which to 'calibrate' my model, how do I follow this strategy?

I am not a statistician and am finding it difficult to follow answers to other questions here and elsewhere, so please keep any response as simple as possible.

MariusHelf

Hi,

Your task is somewhere between very hard and impossible from a data mining point of view: basically you want to build a model from only 17 observations, which most probably won't deliver good results.

The general procedure in a case like this is the following:
- split your data into a training set of labelled data and a set of unlabelled data. In your case, the training set consists of the examples with non-missing values for the target
- in the training set, declare the target attribute as the label (use Set Role)
- train a model (and validate it with X-Validation)
- apply the model to the rest of the data, which has no value for the target
- you're done
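
Outside RapidMiner, the same steps can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the data is synthetic, the target is assumed to be a number stored in the last column with `None` for missing values, and a simple k-nearest-neighbour average stands in for "a model", since the steps above do not name one. The helper names (`split_by_label`, `knn_predict`, `leave_one_out_mae`) are made up for this sketch. With only 17 labelled rows, leave-one-out validation is used here instead of a 66/34 split, so that every labelled row is used for testing exactly once.

```python
import math
import random
import statistics

TARGET = 9  # assumption: the target is the last of 10 columns


def split_by_label(rows, target_idx):
    """Separate rows with a known target from rows where it is missing (None)."""
    labelled = [r for r in rows if r[target_idx] is not None]
    unlabelled = [r for r in rows if r[target_idx] is None]
    return labelled, unlabelled


def knn_predict(train, features, k, target_idx):
    """Average the target of the k labelled rows closest to `features`."""
    nearest = sorted(train, key=lambda r: math.dist(r[:target_idx], features))[:k]
    return statistics.fmean(r[target_idx] for r in nearest)


def leave_one_out_mae(labelled, k, target_idx):
    """Hold out each labelled row once, predict it from the others,
    and return the mean absolute error over all held-out rows."""
    errors = []
    for i, held_out in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        pred = knn_predict(rest, held_out[:target_idx], k, target_idx)
        errors.append(abs(pred - held_out[target_idx]))
    return statistics.fmean(errors)


# Synthetic stand-in for the real data: 8000 rows, 9 complete attributes,
# and a target that is known for only 17 randomly chosen rows.
random.seed(0)
rows = [[random.random() for _ in range(TARGET)] + [None] for _ in range(8000)]
for i in random.sample(range(8000), 17):
    rows[i][TARGET] = 2.0 * rows[i][0] + random.gauss(0, 0.1)

# Step 1: labelled rows form the training set, the rest is to be predicted.
labelled, unlabelled = split_by_label(rows, TARGET)

# Steps 2-3: "train" and validate (k-NN has no separate training step).
mae = leave_one_out_mae(labelled, k=3, target_idx=TARGET)

# Step 4: apply the model to every row without a target value.
predictions = [knn_predict(labelled, r[:TARGET], 3, TARGET) for r in unlabelled]

print(f"labelled rows: {len(labelled)}, unlabelled rows: {len(unlabelled)}")
print(f"leave-one-out mean absolute error: {mae:.3f}")
```

The error estimate itself rests on only 17 held-out predictions, so it should be read as a rough indication rather than a reliable measure of model quality.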

However, as stated before, with only 17 observations you will get a model, but you won't get a good model. You should really try to get more training data!

Best regards,
Marius