Hello community,
What are the best practices for exploring a complex, unknown dataset and accurately predicting a numeric value? By "complex" I mean that the dataset contains more than 100 columns, including integer attributes, real numbers, and at least 10 polynominal (categorical) columns.
>>> I have created a repository and loaded the trainning_data and test_data, setting the correct data type for each column (integer, real, polynominal) and marking the label.
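To be concrete, this is roughly what I mean by that step, sketched in pandas outside of RapidMiner (the file and column names below are just placeholders, not my real ones):

```python
import pandas as pd

# Placeholder file and column names -- adjust to the real repository entries.
dtypes = {
    "age": "int64",        # integer attribute
    "income": "float64",   # real attribute
    "region": "category",  # polynominal (categorical) attribute
}

train = pd.read_csv("training_data.csv", dtype=dtypes)
test = pd.read_csv("test_data.csv", dtype=dtypes)

y_train = train.pop("target")  # the numeric label to predict
```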
>>> I am using the Sample operator to reduce the amount of data to process and save time while modeling. Which other techniques can be used to be more productive with large datasets that take a long time to run?
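(Outside of RapidMiner, the equivalent of the Sample operator is a one-liner; the 10% fraction here is just an arbitrary example:)

```python
import pandas as pd

train = pd.read_csv("training_data.csv")           # placeholder file name

# Design and debug the process on a small random sample,
# then rerun the final model on the full data.
sample = train.sample(frac=0.1, random_state=42)   # keep ~10% of the rows
```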
>>> Then I started trying the learners and realized that I don't know which one is most applicable. It is especially difficult because of the polynominal attributes: when I tried to convert them to binominal (Nominal to Binominal), the process ran out of memory.
>>> Since converting the polynominal attributes to binominal runs out of memory, I have split the attributes (using Select Attributes) so that part of them goes to learners that work with polynominal attributes and the rest goes to a different learner - which is definitely not the right way to do it!
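As I understand it, the memory problem comes from the dense dummy columns that the conversion creates. A sparse one-hot encoding (or a learner that handles nominal attributes natively, like a decision tree) would avoid it. Here is a scikit-learn sketch of what I mean, with made-up column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up column names -- my real data has 100+ columns.
categorical_cols = ["region", "product_type"]

# OneHotEncoder outputs a sparse matrix by default, so the dummy
# columns for the polynominal attributes do not blow up memory.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",  # numeric columns are passed through unchanged
)

model = make_pipeline(preprocess, Ridge())  # Ridge accepts sparse input
# model.fit(train, y_train)                 # train / y_train as loaded above
```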
My *dream* plan is (there is a rough sketch of the whole pipeline after the list):
--> Load database
------> Set variable types
---------->Run some kind of correlation matrix (but there are also polynominal fields) and attribute weighting
---------------> Select the most relevant and important attributes for learning
------------------> Use the Sample operator to increase performance when modeling
---------------------> Include a Validation Operator
------------------------->Use a Performance operator to optimize the parameters.
----------------------------->Predict
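To make the plan more concrete, here is a rough scikit-learn sketch of what I have in mind. Everything specific below (file and column names, Ridge as the learner, f_regression for weighting, the parameter grid, the 20% sample) is just a placeholder for illustration, not something I am sure about:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Placeholder file and column names.
train = pd.read_csv("training_data.csv")
test = pd.read_csv("test_data.csv")    # assumed to have the same columns minus the label
y = train.pop("target")

categorical_cols = train.select_dtypes(include="object").columns.tolist()

pipe = Pipeline([
    # Sparse dummy coding for the polynominal columns, numerics passed through.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough")),
    # Attribute weighting / selection: keep the k most informative columns.
    ("select", SelectKBest(score_func=f_regression, k=50)),
    ("model", Ridge()),
])

# Parameter optimization wrapped around cross-validation, run on a sample for speed.
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [20, 50, 100], "model__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
sample = train.sample(frac=0.2, random_state=42)
search.fit(sample, y.loc[sample.index])

# Refit the best configuration on the full training data and predict.
best = search.best_estimator_.fit(train, y)
predictions = best.predict(test)
```

In RapidMiner I imagine the same chain as Retrieve, Set Role, Weight by Correlation plus Select by Weights, Sample, then Optimize Parameters (Grid) containing a Cross Validation with the learner and a Performance operator, and finally Apply Model on the test set - is that roughly right?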