Build a decision tree using Python and embed it in RapidMiner
Hi guys,
I am doing a project where I need to create a decision tree using Python and then embed it in RapidMiner using the Execute Python operator.
These are screenshots of my process:
[Screenshot: Subprocess in Cross Validation]
This is my code for the decision tree:
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
# rm_main is a mandatory function,
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
    #import data
    file = '04_Class_4.1_german-credit-decoded.xlsx'
    xl = pd.ExcelFile(file)
    print(xl.sheet_names)
    #load a sheet into a DataFrame
    gr_raw = xl.parse('RapidMiner Data')
    #create arrays for the features, X, and response, y, variable
    y = gr_raw['Credit Rating=Good'].values
    X = gr_raw.drop('Credit Rating=Good', axis=1).values
    #split data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)
    #build decision tree classifier using gini index
    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini
When executed, it gives me an error. I am not sure which part of this code I should remove for a successful execution.
Would appreciate any advice or help on this!
Thank you.
Regards,
Azmir F
Here's the building block I use for XValidation with Python. I have one that also works with the Compare Models operator, but that is very complex.
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="BDT (sklearn)" width="90" x="112" y="34">
<parameter key="script" value="import pandas as pd&#10;from sklearn.ensemble import GradientBoostingClassifier&#10;&#10;# This script creates a GradientBoostingClassifier from SKLearn on RM data&#10;# It can be used as a generic template for other sklearn classifiers or regressors&#10;# Author: mschmitz&#10;&#10;def rm_main(data):&#10;    metadata = data.rm_metadata&#10;    # Get the list of regular attributes and the label&#10;    df = pd.DataFrame(metadata).T&#10;    label = df[df[1]==&quot;label&quot;].index.values&#10;    regular = df[df[1] != df[1]].index.values&#10;    # Create the Tree, for more options see&#10;    # For details see:&#10;    clf = GradientBoostingClassifier(n_estimators=10, max_features=&quot;sqrt&quot;)&#10;    # learn it&#10;    clf.fit(data[regular], data[label])&#10;    # Return also the list of regulars and labels for later application&#10;    return (clf,regular,label[0]), data&#10;"/>
</operator>
<connect from_port="training" to_op="BDT (sklearn)" to_port="input 1"/>
<connect from_op="BDT (sklearn)" from_port="output 1" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model (2)" width="90" x="45" y="34">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(clfinfo, data):&#10;    clf = clfinfo[0]&#10;    regular = clfinfo[1]&#10;    label = clfinfo[2]&#10;    meta = data.rm_metadata&#10;    predictions = clf.predict(data[regular])&#10;    confidences = clf.predict_proba(data[regular])&#10;&#10;    predictions = pd.DataFrame(predictions, columns=[&quot;prediction(&quot;+label+&quot;)&quot;])&#10;    confidences = pd.DataFrame(confidences, columns=[&quot;confidence(&quot; + str(c) + &quot;)&quot; for c in clf.classes_])&#10;&#10;    data = data.join(predictions)&#10;    data = data.join(confidences)&#10;    data.rm_metadata = meta&#10;    data.rm_metadata[&quot;prediction(&quot;+label+&quot;)&quot;] = (&quot;nominal&quot;,&quot;prediction&quot;)&#10;    for c in clf.classes_:&#10;        data.rm_metadata[&quot;confidence(&quot;+str(c)+&quot;)&quot;] = (&quot;numerical&quot;,&quot;confidence_&quot;+str(c))&#10;&#10;    return data, clf&#10;"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="input 1"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="input 2"/>
<connect from_op="Apply Model (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Thanks guys for the solutions you have provided. I have managed to come up with my own solution.
I did not know that Python needs numerical data to apply the model, so I modified my process and used the Execute Python operator twice: once in Training and once in Testing. I used the Numerical to Binominal operator after the second Execute Python operator.
Note that I have renamed the two operators to Build Model and Apply Model.
This is my updated process:
[Screenshot: Cross Validation Subprocess]
My Python script for Build Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(data):
    # build decision tree on the numerical attributes
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income',
              'Credit Amount', 'Present residence since', 'Number of existing credits',
              'Number of dependents']]
    # select the label as a 1-D Series, which is the shape fit() expects for y
    y = data['Credit Rating']
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    clf.fit(X, y)
    return clf
My Python script for Apply Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(model, data):
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income',
              'Credit Amount', 'Present residence since', 'Number of existing credits',
              'Number of dependents']]
    data['prediction'] = model.predict(X)
    #set role of prediction attribute to prediction
    data.rm_metadata['prediction'] = (None, 'prediction')
    return data
Let me know if you have any other relevant solution or a better script to produce a more stable model.
Thank you.
Regards,
Azmir F
I think the process is correct; there have been similar processes with R in the forum.
As a side note, can I ask why you need to use the Python decision tree? By using the Execute Python operator several times (twice per CV fold) you are generating a huge overhead and also interfering with the parallelization features of RapidMiner. I would say the smarter thing to do would be to use the Decision Tree operator, or to do the CV inside the Execute Python operator.
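For the second option, a minimal sketch of cross-validation inside a single Execute Python operator could look like the script below. It assumes the label column is called 'Credit Rating' (as in Azmir's script) and that every other column arriving at the input port is already numerical; the `LABEL` constant and the returned column names are only illustrative.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: name of the label column in the incoming ExampleSet
LABEL = 'Credit Rating'

def rm_main(data):
    # Assumption: all non-label columns are numerical
    X = data.drop(columns=[LABEL])
    y = data[LABEL]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    # 10-fold cross-validation in one call, so Python is started only once
    scores = cross_val_score(clf, X, y, cv=10, scoring='accuracy')
    # Return one row per fold as a DataFrame so RapidMiner receives an ExampleSet
    return pd.DataFrame({'fold': list(range(1, len(scores) + 1)),
                         'accuracy': scores})
```

The per-fold accuracies can then be averaged in RapidMiner (for example with Aggregate) or directly in the script.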
@JEdward Thanks for sharing, your sample code is going to be a life saver for me!!
Hi Azmir
1. I think it is impossible to build only the model in Python inside the "Cross Validation" operator, because the "Apply Model" operator (in the testing part) expects a RapidMiner model as input but receives a Python object, and so the process fails.
Maybe someone has a solution to this problem (if not, see point 2). However, I have corrected some points in the process (I worked with the same dataset a few weeks ago):
- added a "Nominal to Numerical" operator (Python needs numerical values to build the model)
- built the model with the entire dataset (you performed a split validation inside a cross validation, which I don't think is relevant)
- removed the data import from your "Execute Python" operator (the parameter "data" of the Python function is in fact the dataset that enters the Python operator).
Here is the process:
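As a rough sketch of the three corrections above (not necessarily the script in the attached process), a Build Model script might look like this. It does the nominal-to-numerical step in Python with `pd.get_dummies` instead of the RapidMiner operator, which is a substitution on my side. The label name 'Credit Rating' and the dictionary keys are assumptions, and returning a dict so that it travels as a single Python object is also an assumption about how the operator handles return values.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assumption: name of the label column in the incoming ExampleSet
LABEL = 'Credit Rating'

def rm_main(data):
    # 'data' is the ExampleSet at the input port, so no Excel import is needed
    # One-hot encode nominal attributes; numerical ones pass through unchanged
    X = pd.get_dummies(data.drop(columns=[LABEL]))
    y = data[LABEL]
    clf = DecisionTreeClassifier(criterion='gini', random_state=50,
                                 max_depth=10, min_samples_leaf=5)
    # Fit on all rows received; the outer Cross Validation does the splitting
    clf.fit(X, y)
    # Keep the training columns so the apply step can align its own dummies
    return {'model': clf, 'columns': list(X.columns)}
```

In the apply script, the test data would need the same columns, for example via `pd.get_dummies(test).reindex(columns=info['columns'], fill_value=0)`, because a fold may not contain every nominal value.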
2. I think the solution is to perform the whole subprocess (building/applying/cross-validation/performance) with "Execute Python" operators
(only the data preprocessing is done with RapidMiner operators).
In the process below, in addition to the modifications described in point 1, I have created an applying/cross-validation/performance "Execute Python" operator that outputs:
- the y_prediction (from applying the decision tree model to the training dataset), which is added to the dataset as the last column
- the associated accuracy (~70%)
- the feature importance
Here is the process:
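A minimal sketch of such an all-in-one script, under my own assumptions rather than a copy of the attached process, could be the following. It assumes a label column named 'Credit Rating' and numerical attributes only. Unlike the description above, the predictions here are out-of-fold predictions from `cross_val_predict`, not predictions on the training data; the output column names are illustrative.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Assumption: name of the label column in the incoming ExampleSet
LABEL = 'Credit Rating'

def rm_main(data):
    # Assumption: all non-label columns are numerical
    X = data.drop(columns=[LABEL])
    y = data[LABEL]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    # Each row is predicted by a tree that did not see it during training
    data['y_prediction'] = cross_val_predict(clf, X, y, cv=10)
    accuracy = pd.DataFrame({'accuracy': [accuracy_score(y, data['y_prediction'])]})
    # Feature importances come from one tree fitted on all rows
    clf.fit(X, y)
    importance = pd.DataFrame({'feature': list(X.columns),
                               'importance': clf.feature_importances_})
    # Three outputs: scored data, accuracy table, importance table
    return data, accuracy, importance
```

The operator would need three output ports connected to receive all three tables.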
I hope this will be helpful,
Regards,
Lionel