Replicating RapidMiner RandomForest Results in Python

User: "B00100719"
New Altair Community Member
Updated by Jocelyn
Hi, I have a Random Forest binary classification model with 13 variables after dimensionality reduction. Most are numeric, but I also have a date and a couple of polynominal attributes (e.g. SIC code). I am getting an accuracy of almost 75%, which I am reasonably pleased with given the complexity of the problem.

However, I would now like to try to replicate the RapidMiner results in Python. To do so, I would like to understand a little better how RapidMiner handles string data in its calculations. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes look numeric, but I treat them as polynominal so that the algorithm does not try to assign them an order, which wouldn't make sense.

With attributes like these, I don't know how RapidMiner uses them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest techniques such as one-hot encoding for converting non-numeric data to numeric. However, there are over 800 unique SIC codes in my data, so one-hot encoding is not practical, and the SIC code appears to be a very important attribute that I cannot simply remove.
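To make the encoding problem concrete, here is a minimal sketch of one alternative to one-hot encoding: mapping each SIC code string to an integer ID and each date to an ordinal day number, so that every row becomes purely numeric. This is only an illustration and is not a description of what RapidMiner does internally. The sample rows and the helper names `build_code_index` and `encode_row` are hypothetical. Note the compromise: integer IDs do impose an arbitrary order on the codes, which is exactly what treating them as polynominal was meant to avoid.

```python
from datetime import date


def build_code_index(codes):
    """Map each distinct category string to an integer, in order of first appearance."""
    index = {}
    for code in codes:
        if code not in index:
            index[code] = len(index)
    return index


def encode_row(sic_code, row_date, numeric_values, code_index, unknown=-1):
    """Return one all-numeric row: [category id, date as ordinal day, other numerics...]."""
    # Codes not seen when the index was built get the `unknown` placeholder.
    sic_id = code_index.get(sic_code, unknown)
    return [sic_id, row_date.toordinal(), *numeric_values]


# Hypothetical sample rows: (SIC code kept as a string, a date, other numeric attributes)
rows = [
    ("6021", date(2020, 1, 15), [1.5, 200.0]),
    ("7372", date(2020, 3, 1), [0.7, 150.0]),
    ("6021", date(2021, 6, 30), [2.1, 310.0]),
]

code_index = build_code_index(r[0] for r in rows)
X = [encode_row(sic, d, nums, code_index) for sic, d, nums in rows]
```

A list of rows like `X` is in the all-numeric form that sklearn's Random Forest expects, but results should not be assumed to match RapidMiner's, since the two tools may treat the category column differently.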

Is RapidMiner performing one-hot encoding in the background here?
Which Python library should I use to behave most like RapidMiner, allowing polynominal attributes and dates?