"Time Series Questions"
blurngr
New Altair Community Member
I've gone through the tutorials and the documentation, as well as the videos I found at Neural Markets. Overall, it continues to be a very cool product. I'm new to data mining, so please be gentle :-)
I have some time series data (sales data). I've already (using Excel) ETL'd some of this data so that I have these columns:
dayOfYear (the id), month, dayOfMonth, monthOfYear, dayOfWeek, weekOfYear, year, salesOfDay
I've split it up this way, so that I can see, for example, if sales occur on Mondays more often, or perhaps on the 10th day of the month, etc... Is this necessary, or does one of the operators already do this?
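For reference, calendar columns like the ones listed above can also be derived directly from a date instead of by hand in Excel. The following is a minimal sketch in Python using only the standard library. It is not a RapidMiner operator; the function name `calendar_features` and the dictionary keys are made up for illustration.

```python
from datetime import date

def calendar_features(d):
    """Derive calendar columns from a date (illustrative names only)."""
    return {
        "dayOfYear": d.timetuple().tm_yday,  # 1..366
        "dayOfMonth": d.day,                 # 1..31
        "monthOfYear": d.month,              # 1..12
        "dayOfWeek": d.isoweekday(),         # 1 = Monday .. 7 = Sunday
        "weekOfYear": d.isocalendar()[1],    # ISO 8601 week number
        "year": d.year,
    }

# 1 January 2007 fell on a Monday.
features = calendar_features(date(2007, 1, 1))
```

Whether all of these columns are actually useful depends on the learner; the sketch only shows how they relate to the underlying date.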
I've normalized / scaled the values so that they are all between 0 and 1, simply by dividing each column by the largest value in that column. [This step seems to be required by libSVM from the command line. Is it required in RapidMiner, or is there an operator that does this as well (which would be really handy!!!)?]
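The scaling described here (divide each column by its largest value) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what RapidMiner or libSVM do internally; the function name `scale_by_column_max` and the sample rows are invented, and the sketch assumes non-negative values with a positive maximum in every column.

```python
def scale_by_column_max(rows):
    """Divide every value by the largest value in its column.

    Assumes all values are non-negative and every column maximum is > 0.
    """
    col_max = [max(col) for col in zip(*rows)]
    return [[v / m for v, m in zip(row, col_max)] for row in rows]

# Hypothetical rows: dayOfWeek, dayOfMonth, salesOfDay
rows = [
    [1, 10, 200.0],
    [7, 31, 400.0],
    [4, 20, 100.0],
]
scaled = scale_by_column_max(rows)
```

Note that this maps the largest value of each column to 1 but does not map the smallest value to 0; a min-max scaling would use (value - min) / (max - min) instead.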
However, I'm having a hard time using any of the learners, as almost all of them seem to require labels. I understand which one is the id field. However, the eventual goal is to predict, say, the next 30 salesOfDay periods. Are those columns labels, or attributes?
I'm sure I'll have more questions as we get through this ....
--
Anthony
Answers
Hi Anthony,
thanks for your kind words. Unfortunately, time series predictions are not the easiest processes to set up and hence not the best way to start with RapidMiner. But I will try to sketch a few points you will have to consider:
1) For regression tasks, you should binominalize nominal attributes. This also applies to the date columns. Since you included a lot of different date values, you should think carefully about which ones you really need, because binominalization creates one new attribute per attribute value, indicating whether the original attribute has that value.
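To illustrate what binominalization does to a nominal attribute, here is a minimal sketch in Python. It is not RapidMiner code; the function name `binominalize` and the naming scheme for the new columns are assumptions made for this example.

```python
def binominalize(values, name):
    """Turn one nominal column into one 0/1 column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{name} = {c}": [1 if v == c else 0 for v in values]
        for c in categories
    }

day_of_week = ["Mon", "Tue", "Mon", "Sun"]
encoded = binominalize(day_of_week, "dayOfWeek")
# encoded has three columns: "dayOfWeek = Mon", "dayOfWeek = Sun", "dayOfWeek = Tue"
```

A nominal attribute with k distinct values becomes k attributes, so a nominal dayOfYear column could expand into as many as 366 columns. That is the reason to be selective about the date attributes.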
2) Of course, if you deal with supervised learning (as you do with regressions, and time series regressions in particular), you will always need a label. In time series regressions, you generally regress future values on past values of the same time series. Hence you have to convert the time series contained in a single attribute into data where several attributes contain past values of the series and the label contains a future value of the same series. This can be done by the [tt]MultivariateSeries2WindowExamples[/tt] operator.
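The conversion described in point 2 can be sketched as follows. This is a plain Python illustration of the windowing idea, not the implementation of the operator; the function name and the parameters `window_size` and `horizon` are hypothetical and do not correspond to the operator's actual parameter names.

```python
def series_to_windowed_examples(series, window_size, horizon=1):
    """Build (past values, future value) pairs from a single series.

    Each example holds `window_size` consecutive values as attributes and
    the value `horizon` steps after the end of the window as the label.
    """
    examples = []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        window = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        examples.append((window, label))
    return examples

# Made-up daily sales figures
sales = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]
examples = series_to_windowed_examples(sales, window_size=3)
```

With six values and a window of three, this yields three examples; the first one uses 10.0, 12.0, and 11.0 as attributes and 15.0 as the label. Any learner that supports a numerical label can then be trained on such examples.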
Note that these points only cover the basic transformations that are part of time series regressions. Some other points would have to be addressed as well, but that would exceed what is possible in this forum; I would rather have to write a book ...
If you would like to gain knowledge in that area very fast, maybe one of our training courses with a special focus on time series predictions would be interesting for you?
Regards,
Tobias
Thanks very much, I'll try and spend a few hours working on this today - as well as working around some of the other examples I found on the forums.
I can certainly understand that time series isn't the easiest way to start with RapidMiner -- but all of the potential problems that I'd like to apply data mining to are time series related.
Regarding the training series -- absolutely I'm interested. Unfortunately, I'm in the US, and getting to Germany would require a week or more + probably closer to around $8k for the course, by the time I've paid for the course / travel / etc... And with our lovely economy having a few problems .... that's not really feasible. Any plans for US training courses?
Or, possibly even better, recording them .... and selling them as DVDs or online training?
Thanks!
Hi Anthony,
actually, we have been to the US this year and given some training courses there. But we won't be back before next year at the earliest. Regarding video or online training, we are definitely planning to launch things like that. But this will also take some time ... if such things become available, we will surely announce this on the forum as well.
Speaking of the forum, you will certainly get some ideas on how to do time series predictions from the forum. If you have specific questions, you are of course welcome to ask.
Regards,
Tobias
I've followed the Google / YouTube series on data mining (David Mease), and was actually able to get our data set up and going inside of R, using either an SVM (e1071) or a RandomForest learner. Specifically, using columns for:
dayofyear, dayofweek, monthofyear, weekday, sales
ie:
1,1,1,1,345.78
Would be Jan 1, on a Monday, with sales of 345.78
The RandomForest in R is using regression trees. Both learners are able to produce nice valid results that are within an acceptable limit of what the "real" values are .... ie, this process appears to be working OK.
I'm trying to now do the same thing inside of RapidMiner, but running into a number of problems. Apparently, the RandomForest implementation cannot handle a numeric label (and/or regression)? I've set the above columns to be:
dayofyear = nominal
dayofweek = nominal
monthofyear = nominal
weekday = nominal
sales = real (the label)
So, I've tried using the PolynomialRegression .... and that doesn't work with the above set up [perhaps it's not supposed to]. (Of interest, using the above set up, and running the PolynomialRegression ... the error is that "polynomial attributes not supported.")
If I do change them to all be numeric, PolynomialRegression does give me some results, but I'm not really sure what to do with them (it appears to be a formula I could use to predict new sales, which I suppose is useful).
Assuming that the PolynomialRegression is what I want .... how do I (or can I) actually get it to "predict" something (I assume I just take the formula and fill it in)?
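If the result is read as a formula, "filling it in" means plugging the attribute values of a new example into that formula. Below is a minimal Python sketch of that idea. It assumes a model of the form offset + sum of coefficient * attribute^degree; the coefficients, degrees, and offset are invented for illustration and are not output from RapidMiner.

```python
def predict(attributes, coefficients, degrees, offset):
    """Evaluate offset + sum(coefficient * value ** degree) for one example."""
    return offset + sum(
        c * (x ** d) for x, c, d in zip(attributes, coefficients, degrees)
    )

# Invented model: sales = 2.0 * dayofweek^2 + 0.5 * monthofyear^1 + 100.0
coefficients = [2.0, 0.5]
degrees = [2, 1]
offset = 100.0

# New example with dayofweek = 3 and monthofyear = 4
prediction = predict([3, 4], coefficients, degrees, offset)
```

For this made-up model the prediction is 2.0 * 9 + 0.5 * 4 + 100.0 = 120.0. This only works with numeric attributes, which matches the observation that the learner produced results once the columns were numeric.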
Using the above set up, is there a better way to do this?
Thanks!