RapidMiner Data Modeling CHALLENGE - $200 in cash and prizes
Hello RapidMiners -
I thought it would be fun (and useful) if we had some Kaggle-like challenges here on the community forum. So I am sponsoring the very first RapidMiner Data Modeling Challenge. This is a real training data set that is in need of a good model. It is not like the classic iris data set; it has missing data, errors, etc. Welcome to the real world. Here's the challenge:
Goal: produce a model in RapidMiner 7.5 that will predict the label attribute given prior data in the series of the attached training set "RMChallengeDataSet" with the highest accuracy. This will be verified via the SLIDING WINDOW VALIDATION operator. As it is a series of dates over an 18+ year span and no one wants to sit and watch their computer spin forever, I suggest the following parameters:
training window width: 1000 (about three years' worth)
training window step size: 3 (to cut down on iterations)
test window width: 1 (I only want one day at a time)
horizon: 1 (I want the next day)
cumulative training: yes
average performances only: yes
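For anyone who wants to sanity-check how many iterations these settings imply, here is a minimal Python sketch of the window arithmetic. It is an illustration only, not RapidMiner's implementation: the function name `sliding_windows` is hypothetical, and the interpretation of horizon and cumulative training (training always starts at the first example; the test window starts `horizon` examples after the last training example) is an assumption.

```python
def sliding_windows(n_examples, train_width=1000, step=3,
                    test_width=1, horizon=1, cumulative=True):
    """Yield (train_range, test_range) index pairs in time order.

    Assumed semantics (not taken from RapidMiner's source):
    - the training window ends at `train_end` (exclusive) and grows by `step`
    - with cumulative=True the training window always starts at index 0
    - the test window starts `horizon` examples after the last training example
    """
    train_end = train_width
    while True:
        test_start = train_end - 1 + horizon
        test_end = test_start + test_width
        if test_end > n_examples:
            break
        train_start = 0 if cumulative else train_end - train_width
        yield range(train_start, train_end), range(test_start, test_end)
        train_end += step


windows = list(sliding_windows(6726))
print(len(windows))  # 1909 under the assumptions above
```

Under these assumptions, 6726 examples give 1909 windows, so each validation run means on the order of 1,900 model fits.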
It is a SERIES - every day from 1968 to 1986 - with 6726 examples and 262 numerical attributes. The label is an A/B/C selection. You are welcome to do any feature selection, adding of attributes, etc., and use any model(s) as long as they are within RapidMiner and its publicly available extensions. No scripting or APIs allowed. The data are 1:1 hashed to protect the identity of the source - please do not try to reverse-engineer.
Winner: the winner of the competition is the one who can produce the highest accuracy % ≥ 60 as shown with the standard Performance operator within the cross-validation. Why 60? Because that's the highest I have gotten so far [honest disclaimer: I actually only got 60% accuracy with A/B labels but I know you are all smarter than I am...]
Submission: all submissions for this challenge must be in THIS THREAD so it is open for all to see. All you need to do is submit your process XML as a reply to this message (please use the "insert code" item so it does not get long) AND a screenshot of your performance. You can post as many submissions as you want (within reason).
Determination of winner: Hopefully the community will all agree on the winner (all submissions are public) but in case of some drama, I will be the sole judge and will verify the winner's submission. If there is more than one identical (and highest) accuracy, the one which was submitted first will be the winner.
Who can enter: anyone who is a registered user on the RapidMiner Community Forum. Yes even you, @IngoRM!
Due date: all entries must be posted in this forum by June 15, 2017 at 21:00 EST.
Notification: I will give myself three days to independently verify the winner and then post to this thread. I will then PM the winner to get a mailing address and mail a check for $100!
Good luck!
Scott
Answers
-
Hey RapidMiners,
First of all, let's thank Scott for his initiative here! This is really appreciated and will be a fun challenge! And this indeed is a challenge: I tried some first models in the last 15 minutes and I am still very far from the 60% accuracy threshold - but I will get there :smileytongue:
RapidMiner and I personally would like to support this initiative. Therefore, we will match the $100 prize money with a $100 Amazon Gift Card. So now we have a total of $200 in prizes in the pool. So better fire up your RapidMiner and show us your modeling skills :smileyvery-happy:
Much success to all participants, and let us know from time to time where you are and if you have questions or ideas.
Have fun,
Ingo
-
Since RapidMiner is now donating a prize as well, I can't really participate any longer. Or maybe I could use an anonymous account :smileywink:. Anyway, I will still try a bit to see where I can get to...
So here is a quick update: I am now at 45% accuracy which is still far away from your 60%. This is a good challenge indeed! I did not really optimize the model itself but focused on feature selection first... Let's see what else we can do :-)
-
Well done @IngoRM! Especially since I admitted that I got 60% from an A/B label, rather than A/B/C here. And I've been working on it on-and-off for three weeks. Tell you what - if an RM employee wins, I will ship a quart of my homemade maple syrup to the office where s/he works. Fair?
-
Oh, I missed the A/B vs. A/B/C part. Then I don't feel too bad with my 45% :-)
You probably don't know it but I am a huge fan of maple syrup. So I will support the people in the Boston office (sorry London, Dortmund, Budapest :-)) You got a deal here!
-
Yes, 60% accuracy is much easier to achieve when trying to predict 2 classes versus 3 classes! ;-)
I'm wondering, since there is no holdout/test set, how much "sample tuning" is allowed here. For instance, after some EDA it is obvious that some attributes are missing for large portions of the date ranges. Is it acceptable then to partition the examples into date ranges and build different models on different attribute subsets based on date range availability?
-
Yay, a RapidMiner Challenge! As a former RapidMiner employee, do I qualify for maple syrup?
As this is time series data, you would like to be allowed to fill missing values with the latest known value, or, e.g., for attributes 30-49, use the weekly value.
This is impossible to do inside the cross validation if you use shuffled sampling (which is the default of the X-Validation for classification problems if set to automatic). On the other hand, if you do it before the X-Validation, you leak information from the test data to the training data. Maybe we should change the rules to use Sliding Windows validation? Or do you want to get predictions day by day, without using information from previous data points?
Assuming we can use values of earlier days to fill missing values etc. before the validation, 60% accuracy on A/B/C is possible to achieve when the performance is estimated via shuffled-sampling cross validation.
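To make the leakage point concrete: a last-known-value fill only ever looks backwards in time, so as long as the examples stay in date order it never uses information from a later (test) day. A minimal Python sketch, for illustration only (the challenge itself is RapidMiner-only; `forward_fill` is a hypothetical helper and `None` stands in for a missing value):

```python
def forward_fill(values):
    """Replace each missing value (None) with the most recent known value.

    Values before the first known observation stay None, because there is
    no earlier information to carry forward.
    """
    filled = []
    last_known = None
    for value in values:
        if value is not None:
            last_known = value
        filled.append(last_known)
    return filled


print(forward_fill([None, 1.0, None, None, 2.0]))  # [None, 1.0, 1.0, 1.0, 2.0]
```

With shuffled sampling the same fill is problematic, because a "previous" day may sit in the test fold while the day that inherits its value sits in the training fold.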
-
All good points, Marius. Here are my thoughts...
- Anyone who wants maple syrup in lieu of the $100 prize is welcome to it.
- When I did the modeling, I imputed the missing series data before the validation. I did not think about the "leaking" factor, nor about using Sliding Windows validation instead of X-Validation. As the goal is to be able to predict the labeled column and use all previous data points, I would say YES, we need to change the rules to Sliding Windows validation.
I look forward to seeing your submission, Marius!
Scott
-
Hi Brian -
All good points and yes, there are large gaps of missing examples for all sorts of reasons. But I would say no, it is not ok to partition. As it is a series, the goal is to predict the label with dates moving forward using the historical information (e.g. predict the label for a date, given all prior data in the series). As Marius pointed out, I believe a more valid way to show performance is with the Sliding Window validation instead of X-Validation. This is my error but I think he's right.
Make sense? GAME ON!
Scott
-
**NOTE** I have pondered this a bit and suggest the following changes to the rules:
- Goal: produce a model in RapidMiner 7.5 that will predict the label attribute given prior data in the series of the attached training set "RMChallengeDataSet" with the highest accuracy. This will be verified via the SLIDING WINDOW VALIDATION operator. As it is a series of dates over an 18+ year span and no one wants to sit and watch their computer spin forever, I suggest the following parameters:
training window width: 1000 (about three years' worth)
training window step size: 3 (to cut down on iterations)
test window width: 1 (I only want one day at a time)
horizon: 1 (I want the next day)
cumulative training: yes
average performances only: yes
These are rather unusual parameters but I think they make sense (at least to me). One thing I have found immediately is that the Sliding Window validation is not parallelized - it takes a while to go through the iterations given these parameters with most models.
As this is a FUN competition, and a great way to learn from one another, please give feedback if there is a better way to do this. If people concur, I will make the edits in the initial post.
Ok off to bed!
Scott
-
Here's what I would start with:
Impute Missing with k-nn optimization
Sliding Window Validation
SVM with an RBF kernel
Optimize on Training/Testing widths, gamma, and C
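For reference, the RBF kernel in that plan is usually written K(x, z) = exp(-gamma * ||x - z||^2): gamma controls how quickly similarity decays with distance, and C (not shown here) is the SVM's penalty for training errors. A small Python sketch of the kernel itself; the function name and sample points are illustrative, not anything from RapidMiner:

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Identical points have similarity 1; similarity shrinks as gamma grows.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))  # 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.1))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=10.0))
```

A large gamma makes the model very local (prone to overfitting), a small gamma makes it smoother, which is why gamma and C are normally tuned together.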
-
Dear Scott,
I've tried to download your zip but I cannot open it. I always get errors. Any chance you could re-upload it somewhere?
~Martin
-
Yep, I'm also in the fifties with corrected validation oO
I'm curious if anyone comes up with something >60. Currently everything seems to level out at ~55...
-
I of course do not know what your preprocessing or your models look like, but I will say (as the only one who knows the data set) that I was getting a boost in performance when I created attributes that split out the date: day of the week and month seemed to help, quarter did not. Other tinkering with the dates may be helpful too...
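To illustrate the idea outside RapidMiner, here is a minimal Python sketch of splitting a date into day-of-week and month attributes. The helper name `date_parts` and the example date are hypothetical; inside RapidMiner the same thing would be done with its own date operators.

```python
from datetime import date


def date_parts(d):
    """Return day of week (Monday=0 ... Sunday=6) and month (1-12) for a date."""
    return {"day_of_week": d.weekday(), "month": d.month}


print(date_parts(date(1968, 1, 1)))  # {'day_of_week': 0, 'month': 1}
```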
Scott
-
And although you are likely doing something similar, creating lag series attributes helped as well.
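For anyone unfamiliar with lag attributes, the idea is simply to copy earlier values of a series onto the current row. A minimal Python sketch under assumed names (`add_lags`, `lag_1`, `lag_2` are illustrative; `None` marks days with no earlier value available):

```python
def add_lags(series, lags=(1, 2)):
    """Build one row per time step with the current value and its lagged values."""
    rows = []
    for i, value in enumerate(series):
        row = {"value": value}
        for k in lags:
            # Only look backwards in time; early rows have no lagged value.
            row[f"lag_{k}"] = series[i - k] if i - k >= 0 else None
        rows.append(row)
    return rows


for row in add_lags([10, 20, 30]):
    print(row)
```

Because every lag looks strictly backwards, these attributes stay valid under a sliding window validation.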
Scott
-
Hey D,
very nice. I do have a model with 63% but on a different validation. The validation @sgenzer proposes takes how many steps? 500? Even with simple models it takes an hour for me. Are you sure on comm. training? This makes parameter tuning fairly hard.
Best,
Martin
-
Well done Dan! Can't wait to see what you did.
Martin - yes it's a lot of steps. When I tried some simple modeling (e.g. Naive Bayes) it didn't take too long but yes anything else took a while. I am assuming that it's slow because the sliding window validation is not parallelized. I always keep a keen eye on my CPU/memory usage when I'm running big models like that, and this validation does not push my 6-core like X-validation does. Feature request?
Scott
-
Hi Martin,
Yes I am using the sliding windows validation parameters @sgenzer proposed. It runs about 1925 iterations for me in about 14-17 minutes and about 10-12 minutes on my 24 core 64GB RAM desktop. Was toying with the idea of running on my hadoop cluster for more speed but probably overkill. Not sure what you mean by 'comm training'? Perhaps I missed something. I am getting up to 67.8 on my grid search parameter tuning but only using 80/20 split so some overfitting.
Will post my full code before I go on vacation next Saturday so others can recreate it.
Dan
-
Here is my best model so far. I've posted my code for others to use.
My first big hint: use the new Gradient Boosted Trees algorithm from H2O (similar to the popular XGBoost package). It's wicked fast and frankly is my go-to algo these days.
My code was too big for the insert code window so I attached a docx file.
I'm sure someone will have a good idea how to reduce the size of the preprocessing part (it needs 2 loops that I couldn't quite get right).
Dan
-
Dear Dan,
Thanks for your insights. I have not looked at your process, because that feels like cheating at the moment.
I am indeed also using H2O, but GLM at the moment. A validation with a bigger step size yielded this:
When I tried to run it with Scott's proposed step size, it crashed after 3 hours. The reason is that I pipe a lot of attributes into the GLM. I am working on fixing this at the moment.
In any case, we might create an ensemble model of both our solutions afterwards. Just to give Scott the best model.
~Martin
-
Ok,
got it. I don't believe my results; possibly I did something wrong somewhere:
Edit: found my issue. I am now at 61 as well.