Strategy to model, then predict / impute with very sparse target attribute?

User: "ben_h"
New Altair Community Member
Updated by Jocelyn
Please excuse the vague title. I am currently using an unsupervised SOM clustering approach to try to determine values for a target attribute that is mostly missing. I am using SOM for several reasons I won't go into now, but I'm also open to other suggestions.

I have ~8000 observations of 10 attributes, the last of which (the target) is about 99.8% missing: it has only about 17 observed values, spread widely across the data set. The other attributes are mostly complete, and I think I can handle their missing values simply with means and medians.

The 'typical' workflow I am aware of from Wikipedia (!) is to split the data into training (66%) and test sets, train the SOM with the training set, and then map or predict with the test set on the trained SOM. In my case I am putting the entire data set into the SOM minus the target attribute (because it's mostly missing values), and then I don't know what to do from there.

I may be on the wrong track here, but if I have <20 observations with which to 'calibrate' my model, how do I follow this strategy?

I am not a statistician, and am finding it difficult to follow answers to other questions here and elsewhere, so please dumb down any response :)

    User: "MariusHelf"
    New Altair Community Member
    Hi,

    Your task is very hard, if not impossible, from a data mining point of view: basically you want to build a model from only 17 observations, which most probably won't deliver good results.

    The general procedure in a case like this is the following:
    - split your data into a training set with labelled data and a set with unlabelled data. In your case, the training set consists of your examples with non-missing values for the target
    - in the training set, declare the target attribute as label (use Set Role)
    - train a model (and validate it with the X-Validation operator)
    - apply the model on the rest of the data that does not have a value for the target
    - you're done :)
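
The steps above can be sketched in code. This is only a minimal illustration in plain Python, not the RapidMiner process itself (Set Role and X-Validation are operators there). It assumes a numeric target, swaps in a simple nearest-neighbour model with leave-one-out validation as stand-ins, and uses made-up toy data. All function and variable names are hypothetical.

```python
import math
import statistics


def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k labelled rows closest to x."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], x),
    )[:k]
    return statistics.mean(y for _, y in nearest)


def leave_one_out_error(X, y, k=3):
    """Mean absolute error when each labelled row is predicted from the others."""
    errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        errors.append(abs(knn_predict(rest_X, rest_y, X[i], k) - y[i]))
    return statistics.mean(errors)


# Hypothetical toy data: (attributes, target); None marks a missing target.
rows = [
    ([1.0, 2.0], 10.0),
    ([1.1, 2.1], 11.0),
    ([5.0, 6.0], 50.0),
    ([5.2, 6.1], 52.0),
    ([1.02, 2.0], None),
    ([5.1, 6.0], None),
]

# Step 1: split into labelled (training) rows and unlabelled rows.
labelled = [(x, y) for x, y in rows if y is not None]
unlabelled = [x for x, y in rows if y is None]

# Step 2: the known target values act as the label.
train_X = [x for x, _ in labelled]
train_y = [y for _, y in labelled]

# Step 3: validate on the labelled rows only (leave-one-out, since there are so few).
cv_error = leave_one_out_error(train_X, train_y, k=1)

# Step 4: apply the model to the rows whose target is missing.
predictions = [knn_predict(train_X, train_y, x, k=1) for x in unlabelled]

print("leave-one-out mean absolute error:", cv_error)
print("imputed targets:", predictions)
```

With only a handful of labelled rows, the validation error is itself a very rough estimate, so treat the imputed values with caution. A distance-based model like this one would also normally need the attributes scaled to comparable ranges first.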

    However, as stated before, with only 17 observations you will get a model, but you won't get a good model. You should really try to get more training data!

    Best regards,
    Marius