Workflow to predict on a complex dataset - Best Practices
Hello community,
What are the best practices for exploring a complex, unknown dataset and accurately predicting a numeric value? By "complex" I mean that the dataset contains more than 100 columns, including integer attributes, real numbers, and at least 10 polynominal columns.
>>> I have created a repository and loaded the trainning_data and test_data, setting the correct data types for the columns (integer, real, polynominal, and label).
>>> I am using the Sample operator to reduce the amount of data to process and save some time while I am modeling. Which other techniques can be used to be more productive when dealing with large datasets that require a lot of time to run?
>>> Then I started trying the learners and realized that I don't know which is the most applicable. It is especially difficult because of the polynominal attributes. When I tried to convert some of them to binominal (Nominal to Binominal), I ran out of memory.
>>> Knowing that converting the polynominal attributes to binominal runs out of memory, I split the data (using Select Attributes) to use one part with learners that work with polynominal attributes, and the rest with a different learner - which is definitely not the correct way!
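Outside RapidMiner, the same prototyping trick — work on a small random sample, run on the full data only at the end — can be sketched in a few lines of pandas. All column names below are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for a large training table (hypothetical columns).
df = pd.DataFrame({
    "x1": rng.normal(size=100_000),
    "category": rng.choice(["a", "b", "c"], size=100_000),
    "label": rng.normal(size=100_000),
})

# Work on 5% of the rows while designing the process; a fixed
# random_state keeps the sample reproducible between runs.
sample = df.sample(frac=0.05, random_state=42)
print(len(sample))  # 5000
```

Other options in the same spirit: cache intermediate results, drop clearly irrelevant attributes as early as possible, and tune parameters on the sample before one final full-data run.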
My *dream* plan is:
--> Load database
------> Set variable types
----------> Run some kind of correlation matrix (but there are also polynominal fields) and weighting
---------------> Select the most relevant and important attributes for learning
------------------> Use the Sample operator to increase performance when modeling
---------------------> Include a Validation operator
-------------------------> Use a Performance operator to improve parameters
-----------------------------> Predict
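Translated outside RapidMiner, the plan above can be sketched end-to-end with scikit-learn. This is a minimal illustration on synthetic numeric data; univariate f_regression scoring stands in for the "correlation matrix and weighting" step, and cross-validation plays the role of the Validation operator:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 1) "Load database" -- synthetic data standing in for the real table.
X, y = make_regression(n_samples=2000, n_features=50, n_informative=8,
                       noise=10.0, random_state=0)

# 2) Weight attributes by correlation with the label and keep the best,
# 3) then evaluate the learner with 5-fold cross-validation.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=10)),
    ("model", Ridge(alpha=1.0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Putting the selection step inside the pipeline means each cross-validation fold re-fits the weighting on its own training split, which mirrors placing the whole process inside RapidMiner's validation operator and avoids leaking test data into the attribute selection.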
Answers
-
Your dream plan looks good to me.
Have a look at the Weight By operators. In particular, Weight by Gini Index and Weight by Information Gain might be helpful for your polynominal values.
~Martin
1 -
Thanks Martin!
Good to know that my plan is ok!
I have checked out the Weight By __ operators that you suggested, but both cannot handle a numeric label. In my tests, only Weight by Relief seemed to work for weighting numerical and polynominal attributes with a numeric label.
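For comparison outside RapidMiner: Relief itself isn't in scikit-learn, but mutual information is a related information-based weighting that does accept a numeric label and handles both numerical and (integer-coded) nominal attributes. A sketch with made-up data, substituting mutual_info_regression for Relief:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_num = rng.normal(size=n)                 # informative numeric attribute
x_noise = rng.normal(size=n)              # pure noise
x_nom = rng.integers(0, 4, size=n)        # integer codes for a 4-value nominal
y = 3.0 * x_num + 0.5 * x_nom + rng.normal(scale=0.1, size=n)

X = np.column_stack([x_num, x_noise, x_nom])
# discrete_features marks which columns are nominal codes.
weights = mutual_info_regression(X, y, discrete_features=[2], random_state=0)
print(weights.argmax())  # the informative numeric attribute should score highest
```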
0 -
You can also use Weight by Correlation for numerical attributes and Gini Index for polynominal attributes. You can use Select Attributes with value_type as the filter option to split between numerical and polynominal attributes.
~Martin
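The value_type split has a direct analogue in pandas: select columns by dtype and weight each branch separately. A small sketch with invented column names, using absolute Pearson correlation for the numeric branch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "income": rng.normal(50_000, 10_000, size=500),
    "city": rng.choice(["rio", "sp", "bh"], size=500),
    "label": rng.normal(size=500),
})

# Equivalent of Select Attributes with value_type = numeric / nominal.
features = df.drop(columns="label")
numeric = features.select_dtypes(include="number")
nominal = features.select_dtypes(include="object")

# Correlation weights for the numeric branch (absolute Pearson r vs. label).
num_weights = numeric.apply(lambda col: abs(np.corrcoef(col, df["label"])[0, 1]))
print(list(nominal.columns))  # ['city']
```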
0 -
I have followed the workflow I planned plus your suggestions, yet my predictions are far from "acceptable" (judging by R²).
The last thing I tried was:
0) Sample the data
1) Split the data into nominal and numeric attributes
- for numerical --> Weight by Correlation --> filter numerical attributes by weight
- for nominal --> select 2 (of 8*) attributes --> convert to binominal --> convert to numerical --> Weight by Relief --> filter nominal attributes by weight
2) Join the "most relevant" attributes in a new table
- I manually tried different setups to define the "most relevant" attributes based on performance tests; I also tried different weight operators
3) Connect this new table to the Forward Selection operator
- Inside it, I split my data into 70% for modeling/learning and 30% for performance testing
4) Change parameters and test different regression operators.
5) Get bad predictions =(
* I selected the ones with fewer than 200 distinct values. There are 6 other polynominal attributes that I don't know how to take advantage of for predicting a numeric label. They have hundreds of distinct values, and converting them to binominal demands more memory and processing power than I have =/
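As a point of comparison for step 3, greedy forward selection exists outside RapidMiner too: scikit-learn's SequentialFeatureSelector wraps any learner and adds attributes one at a time, scoring each candidate with cross-validation rather than a single 70/30 split. A sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Greedily add attributes, keeping the ones that most improve
# cross-validated R^2, until 5 have been selected.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward", cv=5, scoring="r2")
sfs.fit(X, y)
print(sfs.get_support().sum())  # 5 attributes kept
```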
How could I take advantage of these 6 extra nominal fields to predict a number, given the memory limitation? What improvements/changes should I make to the process? Should I start from scratch (again)?
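One memory-friendly alternative to a binominal (one-hot) conversion for high-cardinality nominals is target (mean) encoding: replace each nominal value with the average label observed for it, so the attribute becomes a single numeric column regardless of how many distinct values it has. A minimal pandas sketch with invented names; in a real process the encoding must be fit on the training split only (ideally inside the validation) to avoid label leakage:

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["rio", "sp", "rio", "bh", "sp", "rio"],
    "price": [10.0, 20.0, 12.0, 8.0, 22.0, 11.0],   # numeric label
})
test = pd.DataFrame({"city": ["sp", "bh", "recife"]})

# Mean label per nominal value, learned on the training split only.
means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Unseen values (e.g. "recife") fall back to the global mean.
test["city_enc"] = test["city"].map(means).fillna(global_mean)
print(test["city_enc"].tolist())  # [21.0, 8.0, 13.833...]
```

Feature hashing is another option in the same vein when even one pass over the value statistics is too expensive.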
Thanks
0 -
Hi,
have you tried Gradient Boosted Trees as learners? They are pretty nice because you do not need a Nominal to Numerical step to do regression.
~Martin
0 -
@mschmitz wrote: have you tried the gradient boosted trees as learners? (...)
Not really..
I was modeling with RapidMiner 5.4 (my favorite) and 6.x (which often crashes on OS X), both of which I had installed on my computer for years.
I am downloading the latest version of RapidMiner Studio right now to check it out. It looks like Gradient Boosted Trees were released in 7.x, right? I hope it (or another new learner) helps me get "acceptable" predictions.
Thank you Martin!
Rafael
0 -
Hi,
Yes, Gradient Boosted Trees were added in RapidMiner 7.2, along with some other nice new learners (incl. Deep Learning, a new Logistic Regression, and Generalized Linear Models). They have all delivered very good results in the projects we have used them for.
Best,
Ingo
0