@varunm1 @sgenzer
It works, but cross-validation doesn't show accuracy or kappa.
Please help me solve it.
Thank you.

@varunm1
About "getting a label with a single sample" is possible in predicting of cancer because cancer cell is unique between cells
About "getting a label with a single sample" is possible in predicting of cancer because cancer cell is unique between cells
@varunm1 @sgenzer @mschmitz
Please look at the screenshot; it doesn't calculate kappa or accuracy.
Please help me solve that.
Hello @mbs
Simple mistake: you did not connect the "per" port of Cross Validation to the results. "Per" means performance.
Hi @varunm1
I will try it now
@varunm1
Yes, it works, thank you.
I applied all the points you told me about the data, but the results are funny: some of the algorithms' results improved and some did not. Logically, the results are odd.
Anyway, thank you very much, my kind friend.

One last thing: cross-validation results can be lower (worse) than your random-split results. The reason is that it trains and tests on all of the data and averages the performance. If there is some bad data, that will reduce the performance. For more details, read about cross-validation and you will get to know it. But cross-validation will give you reliable results.
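To make the averaging concrete, here is a minimal sketch in Python with scikit-learn (a stand-in for the RapidMiner Cross Validation operator; the dataset and learner are arbitrary illustrative picks, not from this thread):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Any labeled dataset and learner would do; these are illustrative choices.
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42)

# 10-fold cross-validation: 10 train/test rounds that together cover ALL the data.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("averaged accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
# A single lucky random split can beat this average; the average is
# lower but more reliable because every sample gets tested exactly once.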
Thanks
Hi @mbs, I'd recommend this from the Academy: https://academy.rapidminer.com/learn/video/validating-a-model
You could also try searching the community... there is a rather well-written article on this topic from March 4th.

Search for answers in this community or the Academy. Finally, Google is your best friend. Try searching until you find something you can understand, because we cannot know which resource is best for you. Read different things and you will learn easily. As our time is limited, we recommend you try hard first and then ask us questions in case you have any. This is the way we learn as well.
@varunm1
According to this link:
https://community.rapidminer.com/discussion/54621/cross-validation-and-its-outputs-in-rm-studio
because of the 2000 rows in my Excel file (large data), split data works better than cross-validation.
During testing I found that if I combine 3 or 4 algorithms and use cross-validation, the result is better than with split data.
Regards,
mbs
Yep, you can select whatever works in your case. If you ask me, 2000 samples is normal-sized data; I cross-validate data with 100,000 samples to get confident results. Again, this might be subjective. Getting good performance and getting reliable performance are two different things. Try different things and see what is good for your thesis.
@varunm1
Thank you for all the points that you mentioned.
With your perfect suggestions my thesis doesn't have any problems, and I'm sure I will pass it easily.
Regards,
mbs
For reason 2, you need to start with smaller networks and then build more complex networks based on the data and test performance. There is no point building networks with more hidden layers when a simple neural network can achieve your task.
For reason 3, use AUC values as the performance metric instead of accuracy.
Reason 2: Complex algorithms sometimes overfit (it depends on the data). A deep learning algorithm is one with more hidden layers. In my statement, I am saying to train and test a model with a single hidden layer first and note the performance parameters like accuracy, kappa, etc. Then you can build another model with more hidden layers and compare the performances. If your simple model gives the best performance, there is no need to use a complex model with multiple hidden layers.
@varunm1
These are your suggestions, but I couldn't understand them and they are important, so please make an example of them and share your XML.
Thank you very much,
mbs
Sorry @mbs, I am swamped working on a paper, but I can explain it to you. I know you got confused by "smaller network" and "complex network". A neural network can have multiple layers. So a simple neural network, in my view, has one hidden layer with a few neurons in it. If you increase the number of hidden layers, with different numbers of neurons and different activation functions, the network becomes more complex. You can build models with different numbers of layers using the Neural Net operator, the Deep Learning operator, or the Deep Learning extension in RapidMiner. I recommend you get a general understanding of neural networks and deep learning (deep neural networks) and experiment with the relevant operators in RapidMiner.
If you have any specific question or need clarification, I can help in that case, but building models takes time. I recommend you watch videos and tutorials from RapidMiner or any other source that helps you understand easily.
Hi
@varunm1
Following your previous help, please tell me: how can I use more than one algorithm, combine them, and then use cross-validation without using Group Models?
According to the points @varunm1 made, if we have data with a label, we don't need to separate the dataset into training and testing parts. Also, RM with cross-validation is able to separate it automatically into train and test parts, and for the testing part it will not use the label the way it does for the training part.
Are these points correct?
Thank you
Hello @mbs
Which models are you trying to combine? I am not sure there is a way to combine models without the Group Models operator.
Yes, the Cross Validation operator will divide your data into multiple subsets that are used for training and testing the algorithms. And yes, testing is done without labels, as the trained model tries to predict the output for a given sample.
Thanks
@varunm1
Thank you for your great answer again.
The algorithms are:
1. Deep Learning
2. J48
3. Random Forest
4. k-NN
5. Gradient Boosted Trees
6. Neural Net
7. SVM
Thank you for the time you spend on my questions.
@mbs
Are you trying to combine all these models into a single model, or are you trying to get the cross-validation performance of each model separately?
I have never tried combining this many models into a single model. You can try using Group Models, but I am not sure how it works.
@varunm1
Their results are perfect: their accuracy is around 99.5%. This is "ensemble learning".
Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking).
Please look at this link:
https://en.wikipedia.org/wiki/Ensemble_learning
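As a rough sketch of the idea (in Python/scikit-learn rather than RapidMiner, and with only three of the seven listed learners for brevity; the dataset is an arbitrary illustrative pick), a voting ensemble looks like this:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Combine three of the listed learners into one predictive model.
ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(random_state=42)),
    ("gbt", GradientBoostingClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
], voting="hard")  # majority vote across the member models

print("ensemble CV accuracy: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())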
Yep, ensemble models work, but you should be careful when analyzing very high performance. For this, you need to set aside some data for testing after the model has been trained and tested using cross-validation. If the performance on this hold-out dataset is also good, then your model might be good.
PS: Cross-validation reduces overfitting, but complex models tend to overfit even under cross-validation, so we should be careful when analyzing very good results.
Sorry, don't get confused. What I am describing is a validation process we perform when we observe high performances like 99 percent accuracy. We split the dataset in a 0.8 to 0.2 ratio and cross-validate the 0.8 portion of the dataset, then we connect the model output of the Cross Validation operator to test the 0.2 portion. Now we have a performance from cross-validation and from the hold-out (0.2) dataset as well.
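A minimal sketch of that recipe, assuming Python/scikit-learn in place of the RapidMiner operators (dataset and learner are illustrative picks):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 80/20 split; the 20% hold-out never touches training or cross-validation.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42)
cv_acc = cross_val_score(model, X_train, y_train, cv=10).mean()

model.fit(X_train, y_train)              # refit on the full 80%
hold_acc = model.score(X_hold, y_hold)   # score the untouched 20%

print("cross-validation accuracy: %.3f" % cv_acc)
print("hold-out accuracy:         %.3f" % hold_acc)
# If a 99%+ cross-validation accuracy collapses on the hold-out set,
# the model was probably overfitting despite cross-validation.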
If you think this is confusing, you can go with your current results.
@mbs
A simple debugging technique would be to disable all operators after the Nominal to Numerical operator, connect its output to the results, and run your process. Then you can check the statistics of your example set to see whether any attribute of your dataset has missing values.
Secondly, adding a Neural Net to a Deep Learning algorithm is redundant, as mentioned by @hughesfleming68. You can check the layer information in the Neural Net operator's parameters and add it as a new layer in the Deep Learning operator.
Then, if you want to use the same architecture, I think you can use the Group Models operator and try it.
It can take multiple models and combine them into a single model. A more detailed explanation is below:
https://docs.rapidminer.com/8.0/studio/operators/modeling/predictive/group_models.html
"When this combined model is applied, it is equivalent to applying the original models in their respective order."
Here is a working example; the XML is below. Analyze my inputs and outputs carefully, and how I am connecting them.
<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001"><context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
<parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
</operator>
<operator activated="true" class="nominal_to_numerical" compatibility="9.2.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="179" y="34">
<parameter key="return_preprocessing_model" value="false"/>
<parameter key="create_view" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="coding_type" value="dummy coding"/>
<parameter key="use_comparison_groups" value="false"/>
<list key="comparison_groups"/>
<parameter key="unexpected_value_handling" value="all 0 and warning"/>
<parameter key="use_underscore_in_name" value="false"/>
</operator>
<operator activated="true" class="split_data" compatibility="9.2.001" expanded="true" height="103" name="Split Data" width="90" x="313" y="34">
<enumeration key="partitions">
<parameter key="ratio" value="0.7"/>
<parameter key="ratio" value="0.3"/>
</enumeration>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.2.001" expanded="true" height="103" name="Multiply" width="90" x="447" y="136"/>
<operator activated="true" class="h2o:deep_learning" compatibility="9.2.000" expanded="true" height="82" name="Deep Learning" width="90" x="581" y="238">
<parameter key="activation" value="Rectifier"/>
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<parameter key="reproducible_(uses_1_thread)" value="false"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="epochs" value="10.0"/>
<parameter key="compute_variable_importances" value="false"/>
<parameter key="train_samples_per_iteration" value="-2"/>
<parameter key="adaptive_rate" value="true"/>
<parameter key="epsilon" value="1.0E-8"/>
<parameter key="rho" value="0.99"/>
<parameter key="learning_rate" value="0.005"/>
<parameter key="learning_rate_annealing" value="1.0E-6"/>
<parameter key="learning_rate_decay" value="1.0"/>
<parameter key="momentum_start" value="0.0"/>
<parameter key="momentum_ramp" value="1000000.0"/>
<parameter key="momentum_stable" value="0.0"/>
<parameter key="nesterov_accelerated_gradient" value="true"/>
<parameter key="standardize" value="true"/>
<parameter key="L1" value="1.0E-5"/>
<parameter key="L2" value="0.0"/>
<parameter key="max_w2" value="10.0"/>
<parameter key="loss_function" value="Automatic"/>
<parameter key="distribution_function" value="AUTO"/>
<parameter key="early_stopping" value="false"/>
<parameter key="stopping_rounds" value="1"/>
<parameter key="stopping_metric" value="AUTO"/>
<parameter key="stopping_tolerance" value="0.001"/>
<parameter key="missing_values_handling" value="MeanImputation"/>
<parameter key="max_runtime_seconds" value="0"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<operator activated="true" class="neural_net" compatibility="9.2.001" expanded="true" height="82" name="Neural Net" width="90" x="581" y="136">
<list key="hidden_layers"/>
<parameter key="training_cycles" value="200"/>
<parameter key="learning_rate" value="0.01"/>
<parameter key="momentum" value="0.9"/>
<parameter key="decay" value="false"/>
<parameter key="shuffle" value="true"/>
<parameter key="normalize" value="true"/>
<parameter key="error_epsilon" value="1.0E-4"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="group_models" compatibility="9.2.001" expanded="true" height="103" name="Group Models" width="90" x="715" y="340"/>
<operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="715" y="34">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="715" y="136">
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="false"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_op="Retrieve Titanic Training" from_port="output" to_op="Nominal to Numerical" to_port="example set input"/>
<connect from_op="Nominal to Numerical" from_port="example set output" to_op="Split Data" to_port="example set"/>
<connect from_op="Split Data" from_port="partition 1" to_op="Multiply" to_port="input"/>
<connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Multiply" from_port="output 1" to_op="Neural Net" to_port="training set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Deep Learning" to_port="training set"/>
<connect from_op="Deep Learning" from_port="model" to_op="Group Models" to_port="models in 2"/>
<connect from_op="Neural Net" from_port="model" to_op="Group Models" to_port="models in 1"/>
<connect from_op="Group Models" from_port="model out" to_op="Apply Model" to_port="model"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Sample screenshot

Hope this helps.
@mbs
Open a new process, copy this code, and paste it in the XML panel (View --> Show Panel --> XML), then click the green tick mark. You will see the process and can run it as well. I am also attaching the .rmp process for importing (File --> Import Process), but try the XML first so that you get familiar with the ways to use RapidMiner.
Did you check whether the Read Excel operator is getting your label column into RapidMiner? As I said earlier, you can check by disabling all other operators and connecting Read Excel to the output. If there is some issue, try importing into the RM repository and check. There might be some simple mistake.
@mbs, one more thing: you should set the label role after Read Excel in case you are splitting the data into train and test sets.
There can be many reasons:
1. How are you splitting the data? If it is a random split, are you setting the local random seed parameter, which gives you the same train and test sets every time?
2. Your algorithm might be overfitting due to a more complex neural net, which can happen with a neural net + deep learning algorithm.
3. Is your dataset balanced (a similar number of samples for each output label)? If not, accuracy is not a good performance measure.
For reason 1, I recommend you use cross-validation instead of randomly splitting into train and test datasets. This will give you much more reliable results.
For reason 2, you need to start with smaller networks and then build more complex networks based on the data and test performance. There is no point building networks with more hidden layers when a simple neural network can achieve your task.
For reason 3, use AUC and kappa values as performance metrics instead of accuracy.
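As a hedged illustration of reason 3 (in Python/scikit-learn, with made-up toy labels): a classifier that always predicts the majority class on a 90/10 dataset gets 90% accuracy but a useless AUC of 0.5.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 90 + [1] * 10)   # 90/10 class imbalance
majority_scores = np.zeros(100)          # "always predict class 0"

print("accuracy:", accuracy_score(y_true, majority_scores))  # 0.90, looks great
print("AUC:     ", roc_auc_score(y_true, majority_scores))   # 0.50, no skill at all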
For question 1 and reason 1: even if you use split validation, a label that has only one sample can be in either the training or the testing dataset, so I don't understand why data with a single sample is in the dataset.
The reason I am saying this is that your single-sample data can be in either the training or the test dataset. If it is in training, it is never useful in the test dataset for checking performance. If it is in the test set, it was never seen in training, which means it will be predicted wrong every time, as its label never existed while the algorithm trained. So I think labels with single samples might not be useful for measuring performance. You can use cross-validation with stratified sampling, which will try to keep samples of every class in all subsets.
If you use split validation, you should check "use local random seed" in the parameters of Split Data. This will always create the same train and test subsets even if you use different models.
Reason 2: Complex algorithms sometimes overfit (it depends on the data). A deep learning algorithm is one with more hidden layers. In my statement, I am saying to train and test a model with a single hidden layer first and note the performance parameters like accuracy, kappa, etc. Then you can build another model with more hidden layers and compare the performances. If your simple model gives the best performance, there is no need to use a complex model with multiple hidden layers.
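A small sketch of this "start simple, then grow" comparison, assuming Python/scikit-learn in place of the Neural Net and Deep Learning operators (layer sizes and dataset are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for layers in [(50,), (50, 50, 50)]:   # one hidden layer vs. a deeper network
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=layers,
                                      max_iter=2000, random_state=42))
    acc = cross_val_score(net, X, y, cv=5).mean()
    print(layers, "-> CV accuracy %.3f" % acc)
# If the single-hidden-layer model already matches the deeper one,
# prefer it: fewer layers, less risk of overfitting.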
Reason 3: A kappa value ranges from -1 to 1. A positive kappa between 0 and 1 is better the higher it is. A negative kappa between -1 and 0 means your algorithm is predicting exactly the opposite classes for the data. For example, suppose you have 20 samples, 10 labeled male and 10 labeled female. A kappa of zero means your algorithm is predicting all 20 samples as either male or female. A negative kappa means your algorithm is predicting the opposite classes: male samples are predicted as female and female samples as male. A positive kappa means it is tending to predict the correct classes for the given samples. Higher kappa means better predictions.
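The same ranges can be reproduced with the 20-sample male/female example (a sketch in Python; cohen_kappa_score is scikit-learn's kappa implementation):

from sklearn.metrics import cohen_kappa_score

y_true = ["m"] * 10 + ["f"] * 10                 # 10 male, 10 female
perfect  = y_true                                # every prediction correct
all_male = ["m"] * 20                            # one class for everything
opposite = ["f"] * 10 + ["m"] * 10               # every prediction flipped

for name, pred in [("perfect", perfect), ("all male", all_male),
                   ("opposite", opposite)]:
    print(name, "-> kappa %.1f" % cohen_kappa_score(y_true, pred))
# perfect -> 1.0, all male -> 0.0, opposite -> -1.0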
Hope this helps.
Your process seems correct. Instead of Split Data, try using cross-validation. You can use either, but try these things as well so that you can defend your thesis well, as you might get questions like "How reliable are your performances?"
From my understanding you can remove labels with a single sample. I am also not entirely sure about this, as this is the first time I have encountered a label with a single sample.
Dear friend, here is the famous Titanic dataset used for cross-validation and verification of results. You can change the process as you like.
Please be more specific about the graphs. Which graphs are you talking about?
Also, this thread has become quite long, I guess.
Hello @mbs
There are two performances you can look at in my process: one is the cross-validation performance and the other is the hold-out dataset performance. Can you report these two performances (both accuracies and kappas)?
They are separate. The Cross Validation operator I used in the process will give accuracy and kappa values, as it is applied to 80% of the data. Then you have test data (which I named hold-out data), the remaining 20%, used to check whether the model shows consistent performance; you will have performances for this 20% of the data as well (I named that performance operator "hold-out performance" in your process).
You are a perfect teacher.
I will try all the points that you mentioned.