Build a decision tree using Python and embed it in RapidMiner
Hi guys,
I am doing a project where I need to create a decision tree using Python and then embed it in RapidMiner using the Execute Python operator.
These are screenshots of my process:
Subprocess in Cross Validation
This is my code for the decision tree:
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
# rm_main is a mandatory function,
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
    #import data
    file = '04_Class_4.1_german-credit-decoded.xlsx'
    xl = pd.ExcelFile(file)
    print(xl.sheet_names)
    #load a sheet into a DataFrame
    gr_raw = xl.parse('RapidMiner Data')
    #create arrays for the features, X, and response, y, variable
    y = gr_raw['Credit Rating=Good'].values
    X = gr_raw.drop('Credit Rating=Good', axis=1).values
    #split data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)
    #build decision tree classifier using gini index
    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)
    return clf_gini
When I execute it, I get an error. I am not sure which part of this code I should leave out for a successful execution.
Would appreciate any advice or help on this!
Thank you.
Regards,
Azmir F
Best Answer
-
Thanks guys for the solutions you have provided. I have managed to come up with my own solution.
I did not know that Python needs numerical data to apply the model. So I modified my process and used the Execute Python operator twice, once in Training and once in Testing. I used the Numerical to Binominal operator after the second Execute Python operator.
Note that I have renamed the two operators to Build Model and Apply Model.
This is my updated process:
Cross Validation Subprocess
My Python script for Build Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(data):
    # build decision tree
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income', 'Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    y = data[['Credit Rating']]
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)
    clf.fit(X, y)
    return clf

My Python script for Apply Model is as below:
from sklearn.tree import DecisionTreeClassifier

def rm_main(model, data):
    X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income', 'Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
    data['prediction'] = model.predict(X)
    # set role of prediction attribute to prediction
    data.rm_metadata['prediction'] = (None, 'prediction')
    return data

Let me know if you have another relevant solution or a better script to produce a more stable model.
Thank you.
Regards,
Azmir F
2
Answers
-
Hi Azmir
1. I think it is impossible to build only the model in Python inside the Cross Validation operator, because the Apply Model operator (in the testing part) expects a RapidMiner model as input but receives a Python object, so the process fails.
Maybe someone has a solution to this problem (if not, see point 2). However, I have corrected some points in the process (I worked with the same dataset a few weeks ago):
- added a Nominal to Numerical operator (Python needs numerical values to build the model)
- built the model with the entire dataset (you performed a split validation inside a cross validation, which I do not think is relevant)
- removed the data import from your Execute Python script (the parameter "data" of the Python function is in fact the dataset that enters the Python operator).
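To illustrate the first correction outside RapidMiner: scikit-learn trees only accept numeric inputs, so nominal columns have to be encoded before fitting. The sketch below uses pandas dummy coding as a rough stand-in for what a Nominal to Numerical step does. The rows and the category values are made up for illustration; only the column names are taken from the dataset in this thread.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up rows for illustration; only the column names come from the thread.
data = pd.DataFrame({
    "Age": [23, 45, 31, 52, 38, 27],
    "Housing": ["rent", "own", "own", "free", "rent", "own"],
    "Credit Rating": [0, 1, 1, 1, 0, 1],
})

y = data["Credit Rating"]
X_raw = data.drop("Credit Rating", axis=1)

# Dummy-code the nominal column; numeric columns pass through unchanged.
X = pd.get_dummies(X_raw, columns=["Housing"])

clf = DecisionTreeClassifier(random_state=50)
clf.fit(X, y)  # fitting on X_raw instead would raise a ValueError because of the strings
predictions = clf.predict(X)
print(list(X.columns))
```

Dummy coding is only one possible choice; the second process in this answer uses the "unique integers" coding type of Nominal to Numerical instead.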
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
<parameter key="imported_cell_range" value="A1:U1001"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
<parameter key="1" value="Duration in month.true.integer.attribute"/>
<parameter key="2" value="Credit History.true.polynominal.attribute"/>
<parameter key="3" value="Purpose.true.polynominal.attribute"/>
<parameter key="4" value="Credit Amount.true.integer.attribute"/>
<parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
<parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
<parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
<parameter key="8" value="Personal Status.true.polynominal.attribute"/>
<parameter key="9" value="Other debtors.true.polynominal.attribute"/>
<parameter key="10" value="Present residence since.true.integer.attribute"/>
<parameter key="11" value="Property.true.polynominal.attribute"/>
<parameter key="12" value="Age.true.integer.attribute"/>
<parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
<parameter key="14" value="Housing.true.polynominal.attribute"/>
<parameter key="15" value="Number of existing credits.true.integer.attribute"/>
<parameter key="16" value="Job type.true.polynominal.attribute"/>
<parameter key="17" value="Number of dependents.true.integer.attribute"/>
<parameter key="18" value="Telephone.true.binominal.attribute"/>
<parameter key="19" value="Foreign worker.true.binominal.attribute"/>
<parameter key="20" value="Credit Rating.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="OldValue.true.polynominal.attribute"/>
<parameter key="1" value="NewValue.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Checking Account Status"/>
<parameter key="attributes" value="|Property|Other installment plans"/>
<parameter key="invert_selection" value="true"/>
<parameter key="from_attribute" value="OldValue"/>
<parameter key="to_attribute" value="NewValue"/>
</operator>
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="OldValue.true.polynominal.attribute"/>
<parameter key="1" value="NewValue.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Checking Account Status"/>
<parameter key="attributes" value="|Property|Other installment plans"/>
<parameter key="from_attribute" value="OldValue"/>
<parameter key="to_attribute" value="NewValue"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="345">
<parameter key="attribute_name" value="Credit Rating"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Credit Rating"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="min" value="1.0"/>
<parameter key="max" value="1.0"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Credit Rating"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="514" y="34">
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
<parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    #create arrays for the features, X, and response, y, variable&#10;    y = data['Credit Rating'].values&#10;    X = data.iloc[:,1:]&#10;    #split data into training and testing set&#10;    #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;    #build decision tree classifier using gini index&#10;    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;    #clf_gini.fit(X_train, y_train)&#10;    clf_gini.fit(X, y)&#10;    return clf_gini"/>
</operator>
<connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
<connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
<connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
<connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
2. I think the solution is to perform the whole subprocess (building, applying, cross-validation, performance) with Execute Python operators (only the data preprocessing is done with RapidMiner operators).
In the process below, in addition to the modifications described in point 1, I have created an applying/cross-validation/performance Execute Python operator that outputs:
- the y_prediction (the decision tree model applied to the training dataset), which is added to the dataset as the last column
- the associated accuracy (~70%)
- the feature importances
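As a standalone illustration of the last output, here is a minimal sketch of pairing each input column with the fitted tree's importance score. The helper name importance_table and the rows are made up for this example and are not part of the process; feature_importances_ is the scikit-learn attribute of a fitted tree.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def importance_table(model, X):
    """Pair each column of X with the fitted model's importance score (hypothetical helper)."""
    table = pd.DataFrame({
        "features": list(X.columns),
        "feature importances": model.feature_importances_,
    })
    # Most important feature first
    return table.sort_values("feature importances", ascending=False).reset_index(drop=True)

# Made-up numeric rows; only the column names come from the thread.
X = pd.DataFrame({
    "Age": [23, 45, 31, 52, 38, 27, 60, 35],
    "Duration in month": [12, 24, 6, 36, 48, 12, 18, 24],
})
y = pd.Series([0, 1, 0, 1, 1, 0, 1, 0], name="Credit Rating")

model = DecisionTreeClassifier(random_state=50).fit(X, y)
table = importance_table(model, X)
print(table)
```

Keeping names and scores in one table avoids relying on row order when the two are returned on separate output ports.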
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
<parameter key="imported_cell_range" value="A1:U1001"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
<parameter key="1" value="Duration in month.true.integer.attribute"/>
<parameter key="2" value="Credit History.true.polynominal.attribute"/>
<parameter key="3" value="Purpose.true.polynominal.attribute"/>
<parameter key="4" value="Credit Amount.true.integer.attribute"/>
<parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
<parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
<parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
<parameter key="8" value="Personal Status.true.polynominal.attribute"/>
<parameter key="9" value="Other debtors.true.polynominal.attribute"/>
<parameter key="10" value="Present residence since.true.integer.attribute"/>
<parameter key="11" value="Property.true.polynominal.attribute"/>
<parameter key="12" value="Age.true.integer.attribute"/>
<parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
<parameter key="14" value="Housing.true.polynominal.attribute"/>
<parameter key="15" value="Number of existing credits.true.integer.attribute"/>
<parameter key="16" value="Job type.true.polynominal.attribute"/>
<parameter key="17" value="Number of dependents.true.integer.attribute"/>
<parameter key="18" value="Telephone.true.binominal.attribute"/>
<parameter key="19" value="Foreign worker.true.binominal.attribute"/>
<parameter key="20" value="Credit Rating.true.integer.attribute"/>
</list>
</operator>
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="OldValue.true.polynominal.attribute"/>
<parameter key="1" value="NewValue.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Checking Account Status"/>
<parameter key="attributes" value="|Property|Other installment plans"/>
<parameter key="invert_selection" value="true"/>
<parameter key="from_attribute" value="OldValue"/>
<parameter key="to_attribute" value="NewValue"/>
</operator>
<operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="OldValue.true.polynominal.attribute"/>
<parameter key="1" value="NewValue.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Checking Account Status"/>
<parameter key="attributes" value="|Property|Other installment plans"/>
<parameter key="from_attribute" value="OldValue"/>
<parameter key="to_attribute" value="NewValue"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="340">
<parameter key="attribute_name" value="Credit Rating"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Credit Rating"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="min" value="1.0"/>
<parameter key="max" value="1.0"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Credit Rating"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="coding_type" value="unique integers"/>
<list key="comparison_groups"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="581" y="187"/>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Build model" width="90" x="782" y="34">
<parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    #create arrays for the features, X, and response, y, variable&#10;    y = data['Credit Rating']&#10;    X = data.drop('Credit Rating', axis=1)&#10;    #split data into training and testing set&#10;    #X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;    #build decision tree classifier using gini index&#10;    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;    #clf_gini.fit(X_train, y_train)&#10;    clf_gini.fit(X, y)&#10;    return clf_gini"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Apply/cross_validation/performance" width="90" x="849" y="289">
<parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import cross_val_score&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model,data):&#10;    #create arrays for the features, X, and response, y, variable (the same as the training set)&#10;    y = data['Credit Rating']&#10;    X = data.drop('Credit Rating', axis=1)&#10;    feature = list(X)&#10;    #Apply the model&#10;    y_pred = model.predict(X)&#10;    #perform the cross validation and calculate the mean and std of accuracy&#10;    accuracy_mean = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).mean()&#10;    accuracy_std = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).std()&#10;    accuracy = str(accuracy_mean) + &quot; +/- &quot; + str(accuracy_std)&#10;    #Calculation of feature importance&#10;    feat_importance = model.feature_importances_&#10;&#10;    #Write the results&#10;    accuracy = pd.DataFrame(data = [accuracy],columns = ['accuracy'])&#10;    y_prediction = pd.DataFrame(data = y_pred,columns = ['Credit Rating (prediction)'])&#10;    feature_importance = pd.DataFrame(data = feat_importance,columns = ['feature importances'])&#10;    features = pd.DataFrame(data = feature,columns = ['features'])&#10;    data = data.join(y_prediction)&#10;    features = features.join(feature_importance)&#10;&#10;    return data,accuracy,feature_importance,features"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
<connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
<connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
<connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
<connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Build model" to_port="input 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Apply/cross_validation/performance" to_port="input 2"/>
<connect from_op="Build model" from_port="output 1" to_op="Apply/cross_validation/performance" to_port="input 1"/>
<connect from_op="Apply/cross_validation/performance" from_port="output 1" to_port="result 1"/>
<connect from_op="Apply/cross_validation/performance" from_port="output 2" to_port="result 2"/>
<connect from_op="Apply/cross_validation/performance" from_port="output 3" to_port="result 3"/>
<connect from_op="Apply/cross_validation/performance" from_port="output 4" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
I hope this will be helpful,
Regards,
Lionel
3 -
Here's the building block I use for XValidation with Python. I have one that also works with the Compare Models operator, but that is very complex.
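The apply step of this building block adds one prediction column and one confidence column per class. A minimal standalone sketch of that idea follows, runnable outside RapidMiner. The helper name add_predictions, the rows, and the plain roles dict (a stand-in for the rm_metadata that only exists inside Execute Python) are assumptions made for this example.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def add_predictions(clf, data, regular, label):
    """Return a copy of data with prediction/confidence columns, plus a dict of roles."""
    out = data.copy()
    pred_name = "prediction(" + label + ")"
    out[pred_name] = clf.predict(data[regular])
    proba = clf.predict_proba(data[regular])
    # Stand-in for data.rm_metadata: column name -> (type, role)
    roles = {pred_name: ("nominal", "prediction")}
    # predict_proba columns follow the order of clf.classes_
    for i, c in enumerate(clf.classes_):
        conf_name = "confidence(" + str(c) + ")"
        out[conf_name] = proba[:, i]
        roles[conf_name] = ("numerical", "confidence_" + str(c))
    return out, roles

# Made-up rows; only the column names come from the thread.
data = pd.DataFrame({
    "Age": [23, 45, 31, 52, 38, 27, 60, 35],
    "Credit Amount": [1200, 5000, 800, 7000, 4500, 1500, 6500, 2000],
    "Credit Rating": ["bad", "good", "bad", "good", "good", "bad", "good", "bad"],
})
regular, label = ["Age", "Credit Amount"], "Credit Rating"

clf = GradientBoostingClassifier(n_estimators=10, max_features="sqrt", random_state=0)
clf.fit(data[regular], data[label])
scored, roles = add_predictions(clf, data, regular, label)
print(scored.columns.tolist())
```

Assigning the columns directly instead of joining separate data frames keeps the row index aligned, which matters when a cross-validation fold passes in a non-contiguous test set.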
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="BDT (sklearn)" width="90" x="112" y="34">
<parameter key="script" value="&#10;import pandas as pd&#10;from sklearn.ensemble import GradientBoostingClassifier&#10;&#10;# This script creates a GradientBoostingClassifier from SKLearn on RM data&#10;# It can be used as a generic template for other sklearn classifiers or regressors&#10;# Author: mschmitz&#10;&#10;def rm_main(data):&#10;    metadata = data.rm_metadata&#10;    # Get the list of regular attributes and the label&#10;    df = pd.DataFrame(metadata).T&#10;    label = df[df[1]==&quot;label&quot;].index.values&#10;    regular = df[df[1] != df[1]].index.values&#10;    # Create the Tree, for more options see&#10;    # For details see:&#10;    clf = GradientBoostingClassifier(n_estimators=10, max_features=&quot;sqrt&quot;)&#10;    # learn it&#10;    clf.fit(data[regular], data[label])&#10;    # Return also the list of regulars and labels for later application&#10;    return (clf,regular,label[0]), data&#10;"/>
</operator>
<connect from_port="training" to_op="BDT (sklearn)" to_port="input 1"/>
<connect from_op="BDT (sklearn)" from_port="output 1" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model (2)" width="90" x="45" y="34">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(clfinfo, data):&#10;    clf = clfinfo[0]&#10;    regular = clfinfo[1]&#10;    label = clfinfo[2]&#10;    meta = data.rm_metadata&#10;    predictions = clf.predict(data[regular])&#10;    confidences = clf.predict_proba(data[regular])&#10;&#10;    predictions = pd.DataFrame(predictions, columns=[&quot;prediction(&quot;+label+&quot;)&quot;])&#10;    confidences = pd.DataFrame(confidences, columns=[&quot;confidence(&quot; + str(c) + &quot;)&quot; for c in clf.classes_])&#10;&#10;    data = data.join(predictions)&#10;    data = data.join(confidences)&#10;    data.rm_metadata = meta&#10;    data.rm_metadata[&quot;prediction(&quot;+label+&quot;)&quot;] = (&quot;nominal&quot;,&quot;prediction&quot;)&#10;    for c in clf.classes_:&#10;        data.rm_metadata[&quot;confidence(&quot;+str(c)+&quot;)&quot;] = (&quot;numerical&quot;,&quot;confidence_&quot;+str(c))&#10;    return data, clf&#10;"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model (2)" to_port="input 1"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="input 2"/>
<connect from_op="Apply Model (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
3 -
I think the process is correct; there were similar processes with R in the forum.
As a side note, may I ask why you need to use the Python decision tree? By using the Execute Python operator several times (twice per CV fold), you are generating a huge overhead and also interfering with the parallelization features of RapidMiner. I would say that the smarter thing to do would be to use the Decision Tree operator or to do the CV inside the Execute Python operator.
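For the second option, a minimal sketch of what doing the CV inside a single Execute Python operator could look like. It assumes the label column is called Credit Rating and all other columns are already numeric. The tree parameters are copied from the Build Model script above; the rows and the layout of the performance table are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def rm_main(data):
    # Assumption: 'Credit Rating' is the label and every other column is numeric
    y = data["Credit Rating"]
    X = data.drop("Credit Rating", axis=1)

    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10, random_state=99)

    # 10-fold cross validation in one call instead of two operators per fold
    scores = cross_val_score(clf, X, y, scoring="accuracy", cv=10)
    performance = pd.DataFrame({
        "accuracy_mean": [scores.mean()],
        "accuracy_std": [scores.std()],
        "folds": [len(scores)],
    })

    # Final model trained on all rows, applied back to the same data
    clf.fit(X, y)
    scored = data.copy()
    scored["prediction"] = clf.predict(X)
    return scored, performance

# Made-up rows to try the function outside RapidMiner
ages = list(range(20, 60))
data = pd.DataFrame({
    "Age": ages,
    "Duration in month": [6 + (a % 7) * 6 for a in ages],
    "Credit Rating": [1 if a >= 40 else 0 for a in ages],
})
scored, performance = rm_main(data)
print(performance)
```

The two returned data frames would go to two output ports of the operator. The trade-off is that the validation no longer shows up as a RapidMiner performance vector.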
1 -
It is for our assignment to introduce the functionality of Execute Python in RapidMiner.
Thanks for the info!
1 -
@JEdward Thanks for sharing, your sample code is going to be a life saver for me!!
1