Choose Predictive Models

dome
dome New Altair Community Member
edited November 2024 in Community Q&A
Hello,

I have a classification problem with a binary target attribute. All other attributes are numerical. RapidMiner has about 80 operators that can be used for classification. It is nearly impossible to try all of them...

I found ROC as a tool for choosing the operator to use. I just don't understand how it works and how it can provide a "perfect" model for my problem. For example, if I put 10 operators into the Compare ROCs operator, they are all in there with default settings. The result is a set of curves; the curve that comes closest to the top left is the best, and therefore that operator is the best. But what happens when I change the parameters of the 10 operators? Then I get a totally different ROC. So it's just trial and error, right?

Is there any method to find the best operator for my problem? Or does it all come down to putting one operator at a time inside Optimize Parameters and finding the operator with the best accuracy that way?

I hope I explained my question well...
Thanks!

Answers

  • varunm1
    varunm1 New Altair Community Member
    Answer ✓
    Hello @dome,

    Did you try Auto Model? Auto Model can suggest algorithms that are suitable for your data. I would definitely try that first and then look at more models to see which one might do a better job.

    As you said, there are a large number of predictive algorithms, which is why it is good to visualize your data using t-SNE to see whether any patterns can be identified from the data distribution. This is one way I narrow down my algorithms, but it needs some expertise.

    You can use ROC curves, but as you said, they may change depending on the models' settings.

    @IngoRM might suggest more on this.
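
For readers who want to see what an ROC comparison actually measures, here is a minimal plain-Python sketch. It is not RapidMiner's implementation; the function names `roc_points` and `auc` and the toy scores are invented for illustration, and it assumes both classes are present. It sweeps the decision threshold over a model's confidence scores, collects the (false positive rate, true positive rate) points, and sums the area under them. A curve that hugs the top left simply means the model ranks more positives above negatives.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) obtained by sweeping the decision threshold.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model confidence for the positive class (higher = more positive).
    Assumes at least one example of each class.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with tied scores cross the threshold together.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (made up for illustration): six examples, three per class.
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # ranks one negative above a positive
model_b = [0.9, 0.8, 0.2, 0.7, 0.3, 0.1]  # ranks every positive above every negative

print("AUC model A:", auc(roc_points(labels, model_a)))
print("AUC model B:", auc(roc_points(labels, model_b)))
```

Changing a learner's parameters changes its scores, which is why the curves move. The comparison is only as good as the settings and the validation behind each curve.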
  • dome
    dome New Altair Community Member
    Thank you very much. I can't use Auto Model since I only have 84 rows.
  • kypexin
    kypexin New Altair Community Member
    edited July 2019 Answer ✓
    Hi @dome

    I suggest you start with an analysis of the underlying problem. What exactly is the data you are working with? Which metric matters most from a practical (or business) point of view? Do misclassifications of the positive and the negative class carry different costs? All of this is very important for the model optimisation process.

    That said, a couple of pieces of advice:
    • Narrow down the list of candidate models with this online tool: http://mod.rapidminer.com/#app
    • If you are sure that AUC is the metric you want to optimise in the first place, use the COMPARE ROCS operator, which helps you compare different models. You said you tried it, and I think it is completely fine to keep default settings for all learners as a first step.
    • After you have chosen a final model, make sure you understand its most important parameters and then use the OPTIMIZE PARAMETERS operator to find the best combination. Usually there's no need to cycle through all of them; most models have just a few parameters that really matter.
    But again, without knowing the nature of the data and the specific problem behind it, it's not always easy to suggest the best solution based only on statistical performance metrics.
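
The question about different misclassification costs can be made concrete with a small plain-Python sketch. Everything here is hypothetical: the helper names `expected_cost` and `best_threshold`, the cost values, and the toy scores are assumptions for illustration, not part of RapidMiner. The point is that the same model can call for a different decision threshold depending on which error is more expensive, which AUC or accuracy alone will not show.

```python
def expected_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Total cost when predicting positive whenever score >= threshold."""
    cost = 0.0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp  # false positive
        elif predicted == 0 and label == 1:
            cost += cost_fn  # false negative
    return cost


def best_threshold(labels, scores, cost_fp, cost_fn):
    """Try every observed score as a threshold, plus one above the maximum
    (meaning 'never predict positive'), and return the cheapest one."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return min(candidates,
               key=lambda t: expected_cost(labels, scores, t, cost_fp, cost_fn))


# Toy data (made up for illustration).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

# Missing a positive is five times as costly as a false alarm.
t_fn_heavy = best_threshold(labels, scores, cost_fp=1, cost_fn=5)
# A false alarm is five times as costly as missing a positive.
t_fp_heavy = best_threshold(labels, scores, cost_fp=5, cost_fn=1)

print("threshold when false negatives are expensive:", t_fn_heavy)
print("threshold when false positives are expensive:", t_fp_heavy)
```

With these toy numbers the cheaper threshold is lower when false negatives are expensive and higher when false positives are expensive, so the "best" model setup depends on the cost structure of the problem.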

  • hughesfleming68
    hughesfleming68 New Altair Community Member
    edited July 2019 Answer ✓
    Keep in mind that you don't have a lot of data, so you will need to be very careful about how you validate your models. While you have a lot of choice as far as operators for binary classification go, you can narrow them down quite significantly. Look at linear models first (SVM, GLM) and then trees (Random Forest and Gradient Boosted Trees). You should be able to get a good feel for your data and its predictability from these four.
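
One common way to validate carefully on a small data set is repeated stratified cross-validation: every row is used for testing exactly once per repetition, each fold keeps roughly the same class balance, and repeating with different shuffles reduces the luck of a single split. The sketch below is plain Python with invented helper names (`stratified_folds`, `repeated_stratified_cv`); the 84 rows come from the thread, but the 30/54 class split is an assumption for illustration only.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal this class out round-robin, continuing where the previous
        # class stopped so that fold sizes stay balanced.
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds


def repeated_stratified_cv(labels, k, repeats):
    """Yield (train_indices, test_indices) pairs for several shufflings."""
    for r in range(repeats):
        folds = stratified_folds(labels, k, seed=r)
        for test in folds:
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, sorted(test)


# 84 rows as in the thread; the class split is a made-up example.
labels = [1] * 30 + [0] * 54
splits = list(repeated_stratified_cv(labels, k=6, repeats=5))

print(len(splits), "train/test splits")
print("test fold sizes:", sorted({len(test) for _, test in splits}))
print("positives per test fold:",
      sorted({sum(labels[i] for i in test) for _, test in splits}))
```

Each candidate model would be trained on `train` and scored on `test` for every split, and the scores averaged; the spread of those scores is as informative as the mean when there are only 84 rows.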
  • dome
    dome New Altair Community Member
    Thank you very much for the answers. This helped a lot.