Automated pipeline for attribute and model selection

zmk New Altair Community Member
edited November 2024 in Community Q&A

Hi there,

I am new to RapidMiner, but I have already tried out some models on my data and it worked perfectly.

I have the following data set:

40 examples, about 100 attributes (type: real), and 1 label (binominal).


My aim is to find a good model that predicts the label using just a few attributes, e.g. 4.

Following the tutorial “Finding the right Model” (https://www.youtube.com/watch?v=uN1I4yrNNuQ), I tried multiple models (Decision Tree, Naive Bayes, k-NN, Neural Net, Linear Regression, …) using “Compare ROCs”. This worked well and inspired this question.


I want to set up a pipeline that does the following tasks:

  1. Randomly, or based on weights, selects attribute combinations (e.g. only 4 attributes: 2 that I manually selected and 2 chosen at random)
  2. Forwards them to an X-Validation that runs multiple models on the data (Decision Tree, Naive Bayes, k-NN, Neural Net, Linear Regression, …)

At the end I would get a report for each tested attribute combination, listing the model used and its performance measurements (e.g. accuracy), ideally sorted by accuracy.
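
For concreteness, here is a rough sketch of the logic I have in mind, written in Python with scikit-learn (the file name "data.csv", the label column name, and the two fixed attribute names are just placeholders, and Logistic Regression stands in for Linear Regression, which does not fit a binominal label). What I am looking for is the RapidMiner equivalent of this loop:

    import random
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.linear_model import LogisticRegression

    data = pd.read_csv("data.csv")            # placeholder: 40 examples, ~100 real attributes
    y = data["label"]                         # the binominal label (placeholder column name)
    X = data.drop(columns=["label"])

    fixed = ["attr_a", "attr_b"]              # placeholder names for the 2 manually chosen attributes
    pool = [c for c in X.columns if c not in fixed]

    models = {
        "Decision Tree": DecisionTreeClassifier(),
        "Naive Bayes": GaussianNB(),
        "k-NN": KNeighborsClassifier(),
        "Neural Net": MLPClassifier(max_iter=2000),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }

    results = []
    for _ in range(50):                       # 50 random draws; adjust as needed
        combo = fixed + random.sample(pool, 2)    # 2 fixed + 2 random = 4 attributes
        for name, model in models.items():
            # 5-fold cross-validation; with only 40 examples, more folds may leave too few per class
            acc = cross_val_score(model, X[combo], y, cv=5).mean()
            results.append({"attributes": ", ".join(combo), "model": name, "accuracy": acc})

    report = pd.DataFrame(results).sort_values("accuracy", ascending=False)
    print(report.head(10))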

Is there such a pipeline?

Does anyone know what I have to put together to realize such a pipeline?


Thanks for your help.

Answers

  • Telcontar120 New Altair Community Member

    This certainly isn't anything that is already built into the software. But what you describe is something that could be built using loops and macros. It's probably more than a trivial effort, and if the reports you are talking about are outputs to external software packages, that might be tricky. But in principle everything you have requested is something that RapidMiner can do.


  • zmk New Altair Community Member

    Great. Sounds good.

    So can you give me some hints on how to tackle this problem?

    I will be more specific. This is what I need:

    A function that performs a loop: take n (e.g. 4) random features and forward them to a set of models, saving the names of the features and the result of each algorithm into a file (e.g. CSV).

    Stop when there are no more features to select.


    Or even simpler: using just one model (e.g. logistic regression), take 4 random features out of 100, test them on the model (X-Validation), and save the performance together with the features taken.
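
    To make that concrete, here is a minimal Python/scikit-learn sketch of the simpler variant (file and column names are placeholders). Exhaustively testing every 4-out-of-100 combination would mean about 3.9 million cross-validations, so I would cap the number of random draws rather than literally stopping when no combinations are left:

        import csv
        import random
        import pandas as pd
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        data = pd.read_csv("data.csv")                    # placeholder file name
        y = data["label"]                                 # placeholder label column
        X = data.drop(columns=["label"])

        with open("results.csv", "w", newline="") as f:   # performance per feature set
            writer = csv.writer(f)
            writer.writerow(["features", "mean_accuracy"])
            for _ in range(100):                          # capped number of random draws
                combo = random.sample(list(X.columns), 4) # 4 random features out of ~100
                acc = cross_val_score(LogisticRegression(max_iter=1000),
                                      X[combo], y, cv=5).mean()
                writer.writerow([" | ".join(combo), acc])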