Predictions based on US baby names data?

guy_davis New Altair Community Member
edited November 2024 in Community Q&A

Good day,

I'm new to RapidMiner and predictive analytics.  I'm trying to move beyond the tutorials (which are great!) by using the US baby names (state-by-state) data set found on Kaggle.  I'm able to load a random sample (1000 records) of the state-by-state data with these attributes:

  • id (ID type)
  • name (nominal type)
  • gender (binominal type)
  • state (nominal type)
  • year (integer type)
  • count (weight type)

Then I use another random selection to get 20 records without the state attribute.  I'd like to make a prediction of birth state based on name, gender, and birth year.  I'm sure this is a contrived example, but I thought I'd give it a try.  Alternatively, I'd like to predict birth year given name, gender, and state.  What would be some interesting models to try in this case?

I've tried using Decision Tree to generate a model from the training data and Apply Model to score the random test data.  As best I can tell, Decision Tree is only splitting on year and gender, ignoring name.  Is there any way to get this model to consider name?  Perhaps the issue is that I can't train on more than 1000 records due to licensing?

 

[Image: process.png] Process so far...
[Image: decision_tree.png] Decision tree on year, then sometimes gender.
[Image: results.png]

Thanks in advance,

Guy

Answers

  • MartinLiebig
    Altair Employee

    Hi guy_davis,

    Welcome to the community! Seems like a fun project to work on. It could also be some kind of marketing for us :).


    A few things:


    First of all, the tree is not considering the names because they are not statistically significant. Most likely a split on a specific name is simply not "big" enough to count as significant: in a sample of 1000 records, a single name probably covers only a few rows. You might want to reduce min_gain to let the tree grow deeper. Be aware that this might lead to overfitting. I could imagine that using the Namsor Extension to get the origin of a name could be helpful.
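
To make the "not big enough" point concrete, here is a minimal Python sketch, not RapidMiner's actual implementation. It assumes plain information gain as the split criterion (RapidMiner's Decision Tree may be configured with a different one), and the toy rows, the `information_gain` helper, and the `MIN_GAIN` value are all made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, label_key, predicate):
    """Entropy reduction from splitting rows into predicate-true / predicate-false."""
    parent = [r[label_key] for r in rows]
    left = [r[label_key] for r in rows if predicate(r)]
    right = [r[label_key] for r in rows if not predicate(r)]
    if not left or not right:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


# Invented toy data: every name occurs once, year separates the states cleanly.
rows = [
    {"name": "Emma", "year": 1975, "state": "CA"},
    {"name": "Liam", "year": 1976, "state": "CA"},
    {"name": "Noah", "year": 1977, "state": "CA"},
    {"name": "Olivia", "year": 1978, "state": "CA"},
    {"name": "Ava", "year": 1990, "state": "TX"},
    {"name": "Mia", "year": 1991, "state": "TX"},
    {"name": "Lucas", "year": 1992, "state": "TX"},
    {"name": "Ethan", "year": 1993, "state": "TX"},
]

MIN_GAIN = 0.2  # hypothetical threshold, not a RapidMiner default

year_gain = information_gain(rows, "state", lambda r: r["year"] < 1980)
name_gain = information_gain(rows, "state", lambda r: r["name"] == "Emma")

for label, gain in [("year < 1980", year_gain), ("name == Emma", name_gain)]:
    verdict = "kept" if gain >= MIN_GAIN else "rejected"
    print(f"{label}: gain={gain:.3f} -> {verdict}")
```

A split on one name isolates a single row, so its gain stays well below the year split. Lowering the threshold would let such splits through, which is exactly where the overfitting risk comes from.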

     

    Another thing is that it will be very hard to predict each of the 50 states. I would boil it down to broader regions like West Coast, East Coast, South, Midwest, or something similar. This makes the problem much easier.
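
As a rough sketch of that relabeling step in Python (in RapidMiner you would do this with a mapping operator instead), assuming the state column holds two-letter abbreviations. The `STATE_TO_REGION` table is deliberately partial and the region names are only one possible grouping, so treat both as placeholders.

```python
# Illustrative, incomplete mapping; extend it to cover all states you need.
STATE_TO_REGION = {
    "CA": "West Coast", "OR": "West Coast", "WA": "West Coast",
    "NY": "East Coast", "MA": "East Coast", "NJ": "East Coast",
    "TX": "South", "GA": "South", "AL": "South",
    "IL": "Midwest", "OH": "Midwest", "MN": "Midwest",
}


def to_region(state, default="Other"):
    """Map a two-letter state code to a coarse region label."""
    return STATE_TO_REGION.get(state.strip().upper(), default)


def add_region(rows):
    """Return copies of the rows with a 'region' label derived from 'state'."""
    return [{**row, "region": to_region(row["state"])} for row in rows]


# Invented example rows, shaped like the attributes in the question.
sample = [
    {"name": "Emma", "gender": "F", "state": "CA", "year": 1990},
    {"name": "Liam", "gender": "M", "state": "oh", "year": 2001},
    {"name": "Noah", "gender": "M", "state": "HI", "year": 2010},
]

for row in add_region(sample):
    print(row["name"], row["state"], "->", row["region"])
```

The model is then trained on region instead of state, so it has to separate a handful of classes rather than 50.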

     

    ~Martin