Build a decision tree using Python and embed it in RapidMiner

10383721
10383721 New Altair Community Member
edited November 2024 in Community Q&A

Hi guys, 

 

I am doing a project where I need to create a decision tree using Python and then embed it in RapidMiner using the Execute Python operator.

These are screenshots of my process:

[Screenshot: Screen Shot 2017-12-12 at 11.14.02.png]

[Screenshot: Screen Shot 2017-12-12 at 11.14.16.png, Subprocess in Cross Validation]

 

 

This is my code for the decision tree:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

# rm_main is a mandatory function,
# the number of arguments has to be the number of input ports (can be none)
def rm_main(data):
    # import data
    file = '04_Class_4.1_german-credit-decoded.xlsx'
    xl = pd.ExcelFile(file)
    print(xl.sheet_names)

    # load a sheet into a DataFrame
    gr_raw = xl.parse('RapidMiner Data')

    # create arrays for the features, X, and response, y, variable
    y = gr_raw['Credit Rating=Good'].values
    X = gr_raw.drop('Credit Rating=Good', axis=1).values

    # split data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

    # build decision tree classifier using gini index
    clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)
    clf_gini.fit(X_train, y_train)

    return clf_gini

When executed, it gives me an error. I am not sure which part of this code I should leave out for a successful execution.

Would appreciate any advice or help on this! 

Thank you. 

 

Regards, 

Azmir F

Best Answer

  • 10383721
    10383721 New Altair Community Member
    Answer ✓

    Thanks guys for the solutions you have provided. I have managed to come up with my own solution. 

    I did not know that Python needs numerical data to apply the model. So I have modified my process and used the Execute Python operator twice, once in Training and once in Testing. I used the Numerical to Binominal operator after the second Execute Python operator.

    Note that I have renamed the two operators to Build Model and Apply Model.

     

    This is my updated process:

    [Screenshot: Screen Shot 2017-12-14 at 14.42.32.png]

    [Screenshot: Screen Shot 2017-12-14 at 14.42.45.png, Cross Validation Subprocess]

     

     

    My Python script for Build Model is as below:

    from sklearn.tree import DecisionTreeClassifier

    def rm_main(data):
        # build decision tree
        X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income','Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
        y = data[['Credit Rating']]
        clf = DecisionTreeClassifier(min_samples_split = 20, max_depth = 10, random_state = 99)
        clf.fit(X, y)

        return clf

    My Python script for Apply model is as below:

    from sklearn.tree import DecisionTreeClassifier

    def rm_main(model, data):
        X = data[['Age', 'Duration in month', 'Installment rate in % of disposable income','Credit Amount', 'Present residence since', 'Number of existing credits', 'Number of dependents']]
        data['prediction'] = model.predict(X)

        # set role of prediction attribute to prediction
        data.rm_metadata['prediction'] = (None, 'prediction')
        return data

    Let me know if you have another relevant solution or a better script to produce a more stable model.

    Thank you. 

     

    Regards,

    Azmir F

Answers

  • lionelderkrikor
    lionelderkrikor New Altair Community Member

    Hi Azmir

     

    1. I think it is impossible to build only the model in Python inside the "Cross Validation" operator, because the "Apply Model" operator (in the testing part) expects a RapidMiner model as input but receives a Python object, so the process fails.

    Maybe someone has a solution to this problem (if not, see point 2). However, I have corrected some points in the process (I worked with the same dataset a few weeks ago):

     - added a "Nominal to Numerical" operator (Python needs numerical values to build the model)

     - built the model with the entire dataset (you performed a split validation inside a cross validation, which does not seem relevant to me)

     - removed the data import from your "Execute Python" operator (the parameter "data" of the Python function is in fact the dataset that enters the Python operator)
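The first and third points above can also be handled inside the script itself. The sketch below is only an illustration under assumptions, not the process that follows: it works on the `data` argument instead of reading the Excel file, and it one-hot encodes nominal columns with pandas `get_dummies` as a stand-in for the Nominal to Numerical operator. The helper name `encode_features` and the `LABEL` constant are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

LABEL = "Credit Rating"  # assumed name of the label column


def encode_features(data, label):
    """Return (X, y) with nominal feature columns one-hot encoded."""
    y = data[label]
    X = data.drop(columns=[label])
    # get_dummies leaves numeric columns untouched and expands
    # object/category columns into 0/1 indicator columns.
    X = pd.get_dummies(X, dtype=int)
    return X, y


def rm_main(data):
    # `data` is the ExampleSet connected to the first input port,
    # delivered as a pandas DataFrame, so no file has to be read here.
    X, y = encode_features(data, LABEL)
    clf = DecisionTreeClassifier(criterion="gini", random_state=50,
                                 max_depth=10, min_samples_leaf=5)
    clf.fit(X, y)
    return clf
```

Because the columns produced by `get_dummies` depend on the values present in the data, the same encoding has to be applied to any data the model scores later.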

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
    <parameter key="imported_cell_range" value="A1:U1001"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
    <parameter key="1" value="Duration in month.true.integer.attribute"/>
    <parameter key="2" value="Credit History.true.polynominal.attribute"/>
    <parameter key="3" value="Purpose.true.polynominal.attribute"/>
    <parameter key="4" value="Credit Amount.true.integer.attribute"/>
    <parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
    <parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
    <parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
    <parameter key="8" value="Personal Status.true.polynominal.attribute"/>
    <parameter key="9" value="Other debtors.true.polynominal.attribute"/>
    <parameter key="10" value="Present residence since.true.integer.attribute"/>
    <parameter key="11" value="Property.true.polynominal.attribute"/>
    <parameter key="12" value="Age.true.integer.attribute"/>
    <parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
    <parameter key="14" value="Housing.true.polynominal.attribute"/>
    <parameter key="15" value="Number of existing credits.true.integer.attribute"/>
    <parameter key="16" value="Job type.true.polynominal.attribute"/>
    <parameter key="17" value="Number of dependents.true.integer.attribute"/>
    <parameter key="18" value="Telephone.true.binominal.attribute"/>
    <parameter key="19" value="Foreign worker.true.binominal.attribute"/>
    <parameter key="20" value="Credit Rating.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="OldValue.true.polynominal.attribute"/>
    <parameter key="1" value="NewValue.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Checking Account Status"/>
    <parameter key="attributes" value="|Property|Other installment plans"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="from_attribute" value="OldValue"/>
    <parameter key="to_attribute" value="NewValue"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="OldValue.true.polynominal.attribute"/>
    <parameter key="1" value="NewValue.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Checking Account Status"/>
    <parameter key="attributes" value="|Property|Other installment plans"/>
    <parameter key="from_attribute" value="OldValue"/>
    <parameter key="to_attribute" value="NewValue"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="345">
    <parameter key="attribute_name" value="Credit Rating"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Credit Rating"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="min" value="1.0"/>
    <parameter key="max" value="1.0"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Credit Rating"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.0.001" expanded="true" height="145" name="Cross Validation" width="90" x="514" y="34">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
    <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable&#10;&#9;y = data['Credit Rating'].values&#10;&#9;X = data.iloc[:,1:]&#10;&#10;&#9;#split data into training and testing set&#10;&#9;#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;&#10;&#9;#build decision tree classifier using gini index&#10;&#9;clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;&#9;#clf_gini.fit(X_train, y_train)&#10;&#9;clf_gini.fit(X, y)&#10;&#10;&#9;return clf_gini"/>
    </operator>
    <connect from_port="training set" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
    <connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
    <connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
    <connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
    <connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    2. I think the solution is to perform the whole subprocess (building, applying, cross-validation, performance) with "Execute Python" operators

    (only the data preprocessing is done with RapidMiner operators).

    In the process below, in addition to the modifications described in 1., I have created an applying/cross-validation/performance "Execute Python" operator that outputs:

     - the y_prediction (obtained by applying the decision tree model to the training dataset), which is added to the dataset as the last column

     - the associated accuracy (~70%)

     - the feature importance

     

    Here is the process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="68" name="Read Excel" width="90" x="45" y="30">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-raw.xlsx"/>
    <parameter key="imported_cell_range" value="A1:U1001"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Checking Account Status.true.polynominal.attribute"/>
    <parameter key="1" value="Duration in month.true.integer.attribute"/>
    <parameter key="2" value="Credit History.true.polynominal.attribute"/>
    <parameter key="3" value="Purpose.true.polynominal.attribute"/>
    <parameter key="4" value="Credit Amount.true.integer.attribute"/>
    <parameter key="5" value="Savings Account/Bonds.true.polynominal.attribute"/>
    <parameter key="6" value="Present Employment since.true.polynominal.attribute"/>
    <parameter key="7" value="Installment rate in % of disposable income.true.integer.attribute"/>
    <parameter key="8" value="Personal Status.true.polynominal.attribute"/>
    <parameter key="9" value="Other debtors.true.polynominal.attribute"/>
    <parameter key="10" value="Present residence since.true.integer.attribute"/>
    <parameter key="11" value="Property.true.polynominal.attribute"/>
    <parameter key="12" value="Age.true.integer.attribute"/>
    <parameter key="13" value="Other installment plans.true.polynominal.attribute"/>
    <parameter key="14" value="Housing.true.polynominal.attribute"/>
    <parameter key="15" value="Number of existing credits.true.integer.attribute"/>
    <parameter key="16" value="Job type.true.polynominal.attribute"/>
    <parameter key="17" value="Number of dependents.true.integer.attribute"/>
    <parameter key="18" value="Telephone.true.binominal.attribute"/>
    <parameter key="19" value="Foreign worker.true.binominal.attribute"/>
    <parameter key="20" value="Credit Rating.true.integer.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="120">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="OldValue.true.polynominal.attribute"/>
    <parameter key="1" value="NewValue.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (Dictionary)" width="90" x="179" y="75">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Checking Account Status"/>
    <parameter key="attributes" value="|Property|Other installment plans"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="from_attribute" value="OldValue"/>
    <parameter key="to_attribute" value="NewValue"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="6.0.003" expanded="true" height="68" name="Read CSV (2)" width="90" x="45" y="255">
    <parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\04_Class_4.1_german-credit-value-modification-chk-acc.csv"/>
    <parameter key="column_separators" value=","/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="OldValue.true.polynominal.attribute"/>
    <parameter key="1" value="NewValue.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="replace_dictionary" compatibility="7.1.001" expanded="true" height="103" name="Replace (2)" width="90" x="179" y="300">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Checking Account Status"/>
    <parameter key="attributes" value="|Property|Other installment plans"/>
    <parameter key="from_attribute" value="OldValue"/>
    <parameter key="to_attribute" value="NewValue"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="340">
    <parameter key="attribute_name" value="Credit Rating"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="numerical_to_binominal" compatibility="6.0.003" expanded="true" height="82" name="Numerical to Binominal" width="90" x="447" y="345">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Credit Rating"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="min" value="1.0"/>
    <parameter key="max" value="1.0"/>
    </operator>
    <operator activated="true" class="nominal_to_numerical" compatibility="8.0.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Credit Rating"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="coding_type" value="unique integers"/>
    <list key="comparison_groups"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="8.0.001" expanded="true" height="103" name="Multiply" width="90" x="581" y="187"/>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Build model" width="90" x="782" y="34">
    <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import train_test_split&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable&#10;&#9;y = data['Credit Rating']&#10;&#9;X = data.drop('Credit Rating', axis=1)&#10;&#10;&#9;#split data into training and testing set&#10;&#9;#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)&#10;&#10;&#9;#build decision tree classifier using gini index&#10;&#9;clf_gini = DecisionTreeClassifier(criterion='gini', random_state=50, max_depth=10, min_samples_leaf=5)&#10;&#9;#clf_gini.fit(X_train, y_train)&#10;&#9;clf_gini.fit(X, y)&#10;&#10;&#9;return clf_gini"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="145" name="Apply/cross_validation/performance" width="90" x="849" y="289">
    <parameter key="script" value="import numpy as np&#10;import pandas as pd&#10;from sklearn.model_selection import cross_val_score&#10;from sklearn.tree import DecisionTreeClassifier&#10;from sklearn.metrics import accuracy_score&#10;from sklearn import tree&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model,data):&#10;&#9;&#10;&#9;#create arrays for the features, X, and response, y, variable (the same as the training set)&#10; #y = data['Credit Rating']&#10; y = data['Credit Rating']&#10; X = data.drop('Credit Rating', axis=1)&#10;&#10; feature = list(X)&#10;&#10;&#9;#Apply the model&#10; y_pred = model.predict(X)&#10;&#10;&#9;#perform the cross validation and calculate the mean and std of accuracy&#10; accuracy_mean = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).mean()&#10; accuracy_std = 100*(cross_val_score(model,X,y,scoring = 'accuracy',cv = 10)).std()&#10; accuracy = str(accuracy_mean) + &quot; +/- &quot; + str(accuracy_std)&#10;&#10;&#9;#Calculation of feature importance&#10;&#10; feat_importance = model.feature_importances_&#9;&#10;&#9;&#10;&#9;#Write the results&#10;&#10; accuracy = pd.DataFrame(data = [accuracy],columns = ['accuracy'])&#10; y_prediction = pd.DataFrame(data = y_pred,columns = ['Credit Rating (prediction)']) &#10; feature_importance = pd.DataFrame(data = feat_importance,columns = ['feature importances']) &#10; features = pd.DataFrame(data = feature,columns = ['features'])&#10; &#10; data = data.join(y_prediction)&#10; features = features.join( feature_importance)&#10;&#10;&#9;&#10; return data,accuracy,feature_importance,features "/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Replace (Dictionary)" to_port="example set input"/>
    <connect from_op="Read CSV" from_port="output" to_op="Replace (Dictionary)" to_port="dictionary"/>
    <connect from_op="Replace (Dictionary)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
    <connect from_op="Read CSV (2)" from_port="output" to_op="Replace (2)" to_port="dictionary"/>
    <connect from_op="Replace (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
    <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Build model" to_port="input 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Apply/cross_validation/performance" to_port="input 2"/>
    <connect from_op="Build model" from_port="output 1" to_op="Apply/cross_validation/performance" to_port="input 1"/>
    <connect from_op="Apply/cross_validation/performance" from_port="output 1" to_port="result 1"/>
    <connect from_op="Apply/cross_validation/performance" from_port="output 2" to_port="result 2"/>
    <connect from_op="Apply/cross_validation/performance" from_port="output 3" to_port="result 3"/>
    <connect from_op="Apply/cross_validation/performance" from_port="output 4" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>
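For readability, the core of the `Apply/cross_validation/performance` script in the process above can be written as plain Python. This is a sketch rather than a copy: it imports `cross_val_score` from `sklearn.model_selection`, scores the folds once instead of twice, returns three tables instead of four (the importances are already joined to the feature names), and uses the hypothetical helper name `evaluate`.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

LABEL = "Credit Rating"  # label column name used in the process above


def evaluate(model, data, label=LABEL, folds=10):
    y = data[label]
    X = data.drop(columns=[label])

    # cross_val_score clones `model` and refits the clone on each fold,
    # so the scores are out-of-sample even if `model` saw every row.
    scores = 100 * cross_val_score(model, X, y, scoring="accuracy", cv=folds)
    accuracy = pd.DataFrame(
        {"accuracy": ["%.2f +/- %.2f" % (scores.mean(), scores.std())]})

    # One row per feature, paired with the fitted model's importance.
    features = pd.DataFrame({
        "features": list(X.columns),
        "feature importances": model.feature_importances_,
    })

    # Predictions of the already fitted model on the rows passed in.
    scored = data.copy()
    scored[label + " (prediction)"] = model.predict(X)
    return scored, accuracy, features


def rm_main(model, data):
    return evaluate(model, data)
```

The prediction column comes from the model fitted on the full dataset, so only the cross-validated accuracy should be read as an estimate of performance on unseen data.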

    I hope this will be helpful,

     

    Regards,

     

    Lionel

     

     

  • JEdward
    JEdward New Altair Community Member

    Here's the building block I use for XValidation with Python.  I have one that also works with the Compare Models operator, but that is very complex. 

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="x_validation" compatibility="7.6.001" expanded="true" height="124" name="Validation" width="90" x="380" y="34">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="BDT (sklearn)" width="90" x="112" y="34">
    <parameter key="script" value="&#10;import pandas as pd&#10;from sklearn.ensemble import GradientBoostingClassifier&#10;&#10;# This script creates a GradientBoostingClassifier from SKLearn on RM data&#10;# It can be used as a generic template for other sklearn classifiers or regressors&#10;&#10;# Author: mschmitz&#10;&#10;def rm_main(data):&#10; metadata = data.rm_metadata&#10;&#10; # Get the list of regular attributes and the label&#10; &#10; df = pd.DataFrame(metadata).T&#10; label = df[df[1]==&quot;label&quot;].index.values&#10; regular = df[df[1] != df[1]].index.values&#10; &#10; # Create the Tree, for more options see&#10; # For details see:&#10;&#10; clf = GradientBoostingClassifier(&#10; n_estimators=10,&#10; max_features=&quot;sqrt&quot;)&#10; &#10; # learn it&#10; clf.fit(data[regular], data[label])&#10;&#10; # Return also the list of regulars and labels for later application&#10; &#10; return (clf,regular,label[0]), data&#10;"/>
    </operator>
    <connect from_port="training" to_op="BDT (sklearn)" to_port="input 1"/>
    <connect from_op="BDT (sklearn)" from_port="output 1" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model (2)" width="90" x="45" y="34">
    <parameter key="script" value="import pandas as pd&#10;&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;&#10;&#10;def rm_main(clfinfo, data):&#10; clf = clfinfo[0]&#10; regular = clfinfo[1]&#10; label = clfinfo[2]&#10; meta = data.rm_metadata&#10; predictions = clf.predict(data[regular])&#10; confidences = clf.predict_proba(data[regular])&#10;&#10;&#10; predictions = pd.DataFrame(predictions, columns=[&quot;prediction(&quot;+label+&quot;)&quot;])&#10; confidences = pd.DataFrame(confidences,&#10; columns=[&quot;confidence(&quot; + str(c) + &quot;)&quot; for c in clf.classes_])&#10;&#10; data = data.join(predictions)&#10; data = data.join(confidences)&#10; data.rm_metadata = meta&#10; data.rm_metadata[&quot;prediction(&quot;+label+&quot;)&quot;] = (&quot;nominal&quot;,&quot;prediction&quot;)&#10;&#10; for c in clf.classes_:&#10; data.rm_metadata[&quot;confidence(&quot;+str(c)+&quot;)&quot;] = (&quot;numerical&quot;,&quot;confidence_&quot;+str(c))&#10;&#10; return data, clf&#10;"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (2)" to_port="input 1"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="input 2"/>
    <connect from_op="Apply Model (2)" from_port="output 1" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_averagable 1" spacing="0"/>
    <portSpacing port="sink_averagable 2" spacing="0"/>
    </process>
    </operator>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>
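The two scripts embedded in the process above are easier to follow as plain Python. The sketch below restates their pattern under some assumptions: `rm_metadata` maps each column name to a `(type, role)` tuple with role `None` for regular attributes, as the scripts above imply; the helper name `split_roles` and the function names are hypothetical; and a `DecisionTreeClassifier` is swapped in for the `GradientBoostingClassifier` to match the original question.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def split_roles(metadata):
    """Return (regular column names, label column name) from rm_metadata."""
    regular = [name for name, (_, role) in metadata.items() if role is None]
    labels = [name for name, (_, role) in metadata.items() if role == "label"]
    return regular, labels[0]


def build_model(data):
    # Training side: learn on the regular attributes only and pass the
    # column names along, so the apply step selects the same columns.
    regular, label = split_roles(data.rm_metadata)
    clf = DecisionTreeClassifier(max_depth=10, random_state=50)
    clf.fit(data[regular], data[label])
    return (clf, regular, label), data


def apply_model(clfinfo, data):
    # Testing side: add prediction and confidence columns and record
    # their roles so that a Performance operator can pick them up.
    clf, regular, label = clfinfo
    meta = dict(data.rm_metadata)
    scored = data.copy()

    pred_col = "prediction(" + label + ")"
    scored[pred_col] = clf.predict(data[regular])
    meta[pred_col] = ("nominal", "prediction")

    proba = clf.predict_proba(data[regular])
    for i, c in enumerate(clf.classes_):
        conf_col = "confidence(" + str(c) + ")"
        scored[conf_col] = proba[:, i]
        meta[conf_col] = ("numerical", "confidence_" + str(c))

    scored.rm_metadata = meta
    return scored, clf
```

In RapidMiner, each of the two functions would be the `rm_main` of its own Execute Python operator, one in the training subprocess and one in the testing subprocess.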
  • SGolbert
    SGolbert New Altair Community Member

    I think the process is correct; there were similar processes with R in the forum.

     

    As a side note, may I ask why you need to use the Python decision tree? By using the Execute Python operator several times (twice per CV fold), you are generating a huge overhead and also interfering with the parallelization features of RapidMiner. I would say that the smarter thing to do would be to use the Decision Tree operator or to do the CV inside the Execute Python operator.
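A rough sketch of the second suggestion, running the cross-validation inside a single Execute Python operator, is given below. It is an illustration under assumptions rather than a tested RapidMiner process: the label column name, the fold count, the helper name `cross_validate`, and the two-output layout are placeholders, and scikit-learn's `cross_val_predict` is used to obtain out-of-fold predictions. All feature columns are assumed to be numerical already.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

LABEL = "Credit Rating"  # assumed label column name
FOLDS = 10               # assumed number of folds


def cross_validate(data, label=LABEL, folds=FOLDS):
    y = data[label]
    X = data.drop(columns=[label])
    clf = DecisionTreeClassifier(min_samples_split=20, max_depth=10,
                                 random_state=99)

    # Each row is predicted by a model trained on the other folds.
    predictions = cross_val_predict(clf, X, y, cv=folds)

    scored = data.copy()
    scored["prediction"] = predictions
    performance = pd.DataFrame({"accuracy": [accuracy_score(y, predictions)]})
    return scored, performance


def rm_main(data):
    # One operator call replaces two Execute Python calls per fold.
    return cross_validate(data)
```

The first output is the dataset with an out-of-fold prediction column, and the second is a one-row table with the overall accuracy.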

  • 10383721
    10383721 New Altair Community Member

    It is for our assignment to introduce the functionality of Execute Python in Rapid Miner. 

    Thanks for the info!

  • Thomas_Ott
    Thomas_Ott New Altair Community Member

    @JEdward Thanks for sharing, your sample code is going to be a life saver for me!!