How to select the right data for prediction?

Hi All,

I have about 2 years of historical data which I can probably use to predict responses.

For example, if I have to predict my response rate for Jan 2020, how can I tell how much data would be enough to come close to the actual rate?

----- Should I look at how my data performed in Jan 2018, Jan 2019, and maybe the last 4 months of 2019?

----- Or should it be the last four months of 2019 and Jan 2019?

----- Or maybe use everything I have, which I am not comfortable with because of the many outliers?

When I compared actual and predicted values for the past few months, they didn't seem close at all, because the prediction was done manually (on a piece of paper).

How do I select the right data?

Thank you.

Accepted answers

PaulMSimpson

Let me help you split your data on a date, as many months back as you prefer. I'm fairly new to RapidMiner, having done most of my data science work in R previously. Therefore, I don't know if what I'm about to show you is the simplest or best way to split a dataset on a date, but it does work.

First, you need to create a third column that combines your month column, "/1/", and your year column, so that every record has an actual date value, such as 5/1/2018. I recommend using the Generate Attributes operator: click Edit List, add an attribute named "myDate", and in the function expressions field put this: date_parse([yourMonthCol] + "/1/" + [yourYearCol]), using the names of your own month column and year column, of course.

Second, after your Retrieve operator, place a single Filter Examples operator. You only need one, because you will pipe the "unm" port, which carries all unmatched records, to be your test data. I used the "expression" condition class; note what I put into the parameter expression, using the date_before() function. The first parameter is the name of your date field, and the second is a date_parse() call that converts a string representing your chosen split date into a date data type.

Image: https://us.v-cdn.net/6030995/uploads/editor/1x/qfzkj1f3x3a1.jpg
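For readers who want to see the same idea outside RapidMiner, here is a minimal Python sketch of the two steps above: build a date from separate month and year columns, then split on a cutoff date. The field names (`month`, `year`, `rr`, `myDate`), the sample rows, and the cutoff are all made up for illustration; this is not the RapidMiner process itself.

```python
from datetime import date

# Hypothetical records with separate month and year columns plus a response rate.
records = [
    {"month": 11, "year": 2018, "rr": 0.041},
    {"month": 5, "year": 2019, "rr": 0.038},
    {"month": 7, "year": 2019, "rr": 0.044},
    {"month": 10, "year": 2019, "rr": 0.040},
]

def add_date(row):
    """Return a copy of the row with myDate set to the first day of its month."""
    return {**row, "myDate": date(row["year"], row["month"], 1)}

def split_on_date(rows, cutoff):
    """Rows dated before the cutoff become training data; the rest become test data."""
    dated = [add_date(r) for r in rows]
    train = [r for r in dated if r["myDate"] < cutoff]
    test = [r for r in dated if r["myDate"] >= cutoff]
    return train, test

# Assumed split point: everything before July 2019 is used for training.
train, test = split_on_date(records, date(2019, 7, 1))
print(len(train), "training rows;", len(test), "test rows")
```

The `test` list plays the role of the "unm" output: whatever does not match the "before the cutoff" condition.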

All comments

PaulMSimpson

Since this data has date/time marks, you are looking at it the right way. I suggest you begin by using, say, the first 18 months of existing data to train, then test your model on the most recent 6 months of existing data. Then, compare the accuracy of that model to using the first 22 months of existing data to train, then test on the final 2 months of existing data. Whichever way gives better accuracy is what I would then do to predict January 2020. That is, either use the 18 months preceding Jan 2020 to do your predictions, or the 22 months preceding Jan 2020 to do your predictions. The reason the 18 months "may" be more accurate is that things change, processes change, something may change that influences the data. Simply experiment with different training data time lengths.
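A rough Python sketch of that experiment, under stated assumptions: the 24 monthly rates are synthetic, `mean_model` is only a placeholder for whatever model you actually train, and mean absolute error stands in for your preferred accuracy measure.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_model(train_values):
    """Placeholder model: always predicts the mean of the training values."""
    mean = sum(train_values) / len(train_values)
    return lambda n_months: [mean] * n_months

def evaluate_split(monthly_rates, n_train):
    """Train on the first n_train months and score on the remaining months."""
    train, test = monthly_rates[:n_train], monthly_rates[n_train:]
    predict = mean_model(train)
    return mean_absolute_error(test, predict(len(test)))

# Synthetic data: 24 months, oldest first, with a higher level in the first 6 months.
monthly_rates = [
    0.040 + 0.002 * ((i * 7) % 5 - 2) + (0.020 if i < 6 else 0.0)
    for i in range(24)
]

# Compare the 18/6, 20/4 and 22/2 splits.
results = {n: evaluate_split(monthly_rates, n) for n in (18, 20, 22)}
for n_train, err in results.items():
    print(f"train on first {n_train} months, test on last {24 - n_train}: MAE = {err:.4f}")

best_n = min(results, key=results.get)
print("lowest error with", best_n, "training months")
```

Keep in mind that each split is scored on a different set of test months, so small differences between the errors may not mean much.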

User111113

@SChillman

Thank you for your response. I will try both ways. Which method would be better for testing accuracy in this case?

For validation I usually use cross validation or split validation. In this case I would use cross validation, but any other suggestions are welcome.

User111113

I ran my model on the first 18 months of data and predicted the next 4 months instead of 6, just to see if it is effective.

I ran a performance test against the original data: I predicted the response rate for 4 months (July-Oct), for which I already have the actual values, and fed those in to see how much the predictions deviate from the originals. I got a root mean squared error of 0.016,

which isn't bad. What do you think?
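One way to judge whether an RMSE is "not bad" is to compare it with the size of the rates being predicted. A small Python sketch of that check follows; the four actual and predicted values are hypothetical, not the numbers from this post.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical response rates for July through October.
actual = [0.040, 0.050, 0.045, 0.055]
predicted = [0.050, 0.040, 0.060, 0.045]

error = rmse(actual, predicted)
mean_rate = sum(actual) / len(actual)

print(f"RMSE = {error:.4f}")
print(f"RMSE as a share of the mean actual rate = {error / mean_rate:.0%}")
```

The same RMSE can be small against a response rate of 0.50 and large against one of 0.05, so the ratio is often more telling than the raw number.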

PaulMSimpson

To respond to your earlier post today, I would not recommend using cross validation, since we are using earlier data to train the model and later data to test it, and cross validation would mix the two. Just split it 18 months oldest/6 months newest or 20 months/4 months, even 22 months/2 months, and build and test the model that way. Also, look at accuracy, the true positive rate, and the true negative rate. Sometimes an F1 score is the best metric to use to compare models. It depends on how evenly distributed your labeled 1s and 0s are. Then go ahead and try it with a different split point in time.
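As a sketch of those metrics, assuming the label is a 0/1 response flag, the following Python function computes accuracy, true positive rate, true negative rate, and F1 from actual and predicted labels. The function name and the sample labels are made up for illustration.

```python
def binary_metrics(actual, predicted):
    """Compute accuracy, TPR, TNR and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0        # true positive rate (recall)
    tnr = tn / (tn + fp) if tn + fp else 0.0        # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr, "f1": f1}

# Hypothetical, imbalanced test set: only 2 of 10 records are responders.
actual = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

metrics = binary_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.9 even though half of the responders are missed, which is why F1 or the true positive rate can be the better comparison when the 1s are rare.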

User111113

I am not able to split my data. I have 2 separate columns, one for month and one for year, and no date column, so I couldn't figure that out.

Another approach I thought of is to add a status column before loading the data into RM, which I did, dividing the records into old/new. But the split operator still only takes standard values like a ratio and other default options. How do I split using the status column from my data?

Also, I left the RR column blank where status is new, because that will be my test data.

Kindly help. Thank you.

User111113

I used a filter on the status column to split the data. Do you think that's the right approach? I couldn't do it with split validation. Please see the attached picture below.

Image: https://us.v-cdn.net/6030995/uploads/editor/wp/gwfc3ppfu8z0.jpg

PaulMSimpson