"Training with multiple CSVs"

User: "TimF"
New Altair Community Member
Updated by Jocelyn
Hi all!

Very sorry if this is a head-slappingly basic question. I have tried to find an answer in the manual but I probably just don't know what I'm looking for!

I am using data from a series of races. I think I need to train my model on multiple races before I can try to predict a winner. But how do I set up my data so that RapidMiner knows that each race should be analysed as one event with one winner, rather than as a series of unrelated records containing winners and losers? Should I use one CSV with a label or ID on each row that belongs to the same race, or should I have a separate CSV for each race? And if so, how do I use multiple CSVs as input?

    User: "wessel"
    New Altair Community Member
    You should join all your training data into 1 single table.
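
A minimal sketch of the single-table idea, written in plain Python rather than as a RapidMiner process. It assumes one CSV per race, all with the same header; the function name `combine_csvs` and the file names in the usage note are hypothetical.

```python
import csv


def combine_csvs(paths):
    """Read several CSV files that share a header into one list of row dicts."""
    rows = []
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                # A trailing comma in the header produces an empty column name; drop it.
                row.pop("", None)
                rows.append(row)
    return rows
```

Calling `combine_csvs(["race1.csv", "race2.csv"])` would return every runner from both files in one list, which can then be written back out as a single CSV for import.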

    You need to encode your inputs in a way that makes learning as easy as possible.
    As a domain expert, you know that an arbitrary identifier like race-id is not predictive of the outcome of a race.
    Therefore, a race-id should not be included as a predictive variable.
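
One way to picture this, as a hedged Python sketch: keep identifier columns for grouping but leave them out of the model inputs. The column names `Race_ID`, `Runner` and `Won` follow the sample posted later in this thread; treating `Runner` as an identifier too is an assumption, and `split_roles` is a hypothetical helper name.

```python
ID_COLUMNS = {"Race_ID", "Runner"}  # assumed identifiers: kept for grouping, not used as inputs
LABEL_COLUMN = "Won"                # the column the model should learn to predict


def split_roles(row):
    """Split one row dict into (ids, features, label)."""
    ids = {k: v for k, v in row.items() if k in ID_COLUMNS}
    features = {
        k: v for k, v in row.items()
        if k not in ID_COLUMNS and k != LABEL_COLUMN
    }
    return ids, features, row[LABEL_COLUMN]
```

The identifiers stay attached to each row, so rows can still be grouped by race afterwards, but only `features` would be handed to a learner.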

    Sometimes it can be quite a bit of work to mangle your data into the exact format you need.
    For example, you may want to include the outcome of the previous race as a predictive variable for the current race.
    If you are not handy with manipulating data this can be a bit of work.
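
A rough Python sketch of that "previous race" idea, under two assumptions: the rows are already sorted oldest race first, and the columns are called `Runner` and `Won` as in the sample posted later in this thread. The function name `add_previous_result` and the `Prev_Won` column are hypothetical.

```python
def add_previous_result(rows):
    """Return copies of the rows with a 'Prev_Won' column for each runner.

    Assumes rows are in chronological order (oldest race first).
    """
    last_result = {}  # runner name -> 'Won' value from that runner's most recent race
    enriched = []
    for row in rows:
        new_row = dict(row)
        # Runners with no earlier race in the data get a placeholder value.
        new_row["Prev_Won"] = last_result.get(row["Runner"], "UNKNOWN")
        enriched.append(new_row)
        last_result[row["Runner"]] = row["Won"]
    return enriched
```

The lookup happens before the runner's current result is stored, so a row never sees its own outcome as an input.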
    User: "TimF"
    New Altair Community Member
    OP
    Thank you for the reply!
    If all of my training data is in a single table with a participant on each row, I don't understand how the model can work. I thought that each race should be read as one data point, since the chance of one participant winning is affected by the strength of the other participants. Is this where I would use the 'batch' data role?
    User: "wessel"
    New Altair Community Member
    No, probably not.

    Post a few example data rows and maybe I can figure out how to manipulate your data.

    User: "TimF"
    New Altair Community Member
    OP
    Thank you for taking a look. This is a bit of the data I pulled out to learn with, comma separated. The 'Won' column is what I am trying to train my model to predict, 'Race_ID' is a unique text string for each race that was run, and 'Runner' is a text string for the name of each runner in that race. The other columns are various performance history or demographic data for that runner in the race.

    Won,Race_ID,Runner,FAV,STS,WIN%,API,AGE,RLEN,Rating,
    NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
    YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
    NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
    YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
    NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
    NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
    NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
    NO,TOD425,ROMP TO FAME,Y,10,20,7.4,5,3.5,51.6,
    NO,TOD425,DRIVE WEST,N,15,6.7,2.7,4,2.4,51.3,
    NO,TOD425,DUGITE,N,48,12.5,1.1,7,2.6,51.9,
    YES,TOD425,FINNEGANS GOLD,N,29,10.3,1.8,6,0,54.2,

    And here is a CSV version of the same: https://dl.dropbox.com/u/17535287/october%202012%20results.csv
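
A hedged Python sketch of how a one-row-per-runner table can still carry information about the rest of the field: group the sample rows above by `Race_ID`, check how many winners each race has, and express each runner's `Rating` relative to the average of its race. The relative-rating feature is only one possible encoding and is an assumption, not something proposed in the thread; the helper names and the `Rating_vs_Field` column are hypothetical.

```python
import csv
import io
from collections import defaultdict

# The sample rows posted above, including the trailing commas.
SAMPLE = """\
Won,Race_ID,Runner,FAV,STS,WIN%,API,AGE,RLEN,Rating,
NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
NO,TOD425,ROMP TO FAME,Y,10,20,7.4,5,3.5,51.6,
NO,TOD425,DRIVE WEST,N,15,6.7,2.7,4,2.4,51.3,
NO,TOD425,DUGITE,N,48,12.5,1.1,7,2.6,51.9,
YES,TOD425,FINNEGANS GOLD,N,29,10.3,1.8,6,0,54.2,
"""


def load_rows(text):
    """Parse CSV text into row dicts, dropping the empty column from trailing commas."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("", None)
        rows.append(row)
    return rows


def winners_per_race(rows):
    """Count the rows marked Won == 'YES' in each race."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["Race_ID"]] += 1 if row["Won"] == "YES" else 0
    return dict(counts)


def add_race_relative_rating(rows):
    """Add 'Rating_vs_Field': a runner's Rating minus the mean Rating of its race."""
    ratings = defaultdict(list)
    for row in rows:
        ratings[row["Race_ID"]].append(float(row["Rating"]))
    race_mean = {race: sum(vals) / len(vals) for race, vals in ratings.items()}
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row["Rating_vs_Field"] = float(row["Rating"]) - race_mean[row["Race_ID"]]
        enriched.append(new_row)
    return enriched
```

With `rows = load_rows(SAMPLE)`, `winners_per_race(rows)` reports exactly one winner for each of the three races, and `add_race_relative_rating(rows)` gives each runner a value that depends on who else ran in the same race, while `Race_ID` itself stays out of the predictors.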