Applying the Model to a New Dataset

I have built a decision tree model in RapidMiner and get an accuracy of 96.06%. I now have a new dataset and want to apply this decision tree model to it. How can I confirm that my accuracy is still at least 95%, with a confidence of at least 90%?
Please advise ASAP!

varunm1

Hello @JaspreetKaur

You need to store the trained model in your repository using the Store operator.

Then you can retrieve the stored model by dragging and dropping it into the process window, and connect the new dataset and this model to the Apply Model and Performance operators.

JaspreetKaur

Will I get the accuracy by doing so?

JaspreetKaur

The Alarm file is what I used to train and build the model. Now I have the New Alarm file, which does not have true labels, and I want to apply my model to this new dataset and check the accuracy.
Could you help me understand how I should do this?

Alarm File.xlsx

New Alarm unscored file.xlsx

varunm1

Hello @JaspreetKaur

You cannot get performance metrics without true labels. You can only make predictions on this new dataset with the trained model, using the Apply Model operator.

Simply connect the dataset to Apply Model, connect the trained model to the mod port of Apply Model, and make predictions on the new dataset.

JaspreetKaur

So, what should I do to get the accuracy results? Like a performance classification matrix?

JaspreetKaur

How will I know whether my model would still give me at least 95% accuracy?

varunm1

You need to rely on your validated model performance.

JaspreetKaur

Okay, wait, I have been given a hint here.
The new Alarm file contains 3464 records, with 453 true alarms. Now, can you help me with how I should proceed?

JaspreetKaur

But the trick is that I don't know which records are the 453 true ones. How do I find that out? PLEASE HELP!

BalazsBaranyRM

Hi @JaspreetKaur,

if you have labeled data, you can validate the model predictions.
If you have unlabeled data, there is no machine learning process to validate the predictions. They are often validated in the real world later.

In validation, you compare the model prediction to the actual label. If you don't have a label, you can't compare.

As @varunm1 mentioned, you already validate the model while building it. Experience shows that this validation result carries over to future predictions with the same model as long as the data doesn't change too much (e.g. there is no concept shift). If the data-generating process changes (e.g. new machines are introduced or the weather becomes warmer, depending on your scenario), the model starts to get worse. In that case you would retrain the model with recent data once you have the labels.

Best regards,

Balázs

JaspreetKaur

But my question is how will I get the accuracy on the new data set?

BalazsBaranyRM

If you have labels, apply your model to the new data set. You will then have a column with the prediction and one with the label. (Make sure they have the appropriate roles.) Then use Performance (or a more specific operator like Performance (Binominal Classification)) to calculate the accuracy.
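
To connect this to the original "at least 95% accuracy with at least 90% confidence" question: once labels are available, one possible check is a one-sided lower confidence bound on the accuracy. The following is only a minimal Python sketch outside RapidMiner, not what the Performance operator computes. The Wilson score bound is one common choice and is an assumption here, and the function names and the counts (961 correct out of 1000) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def accuracy(labels, predictions):
    """Fraction of examples whose prediction equals the true label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def wilson_lower_bound(correct, total, confidence=0.90):
    """One-sided Wilson score lower bound for a binomial proportion."""
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p = correct / total
    centre = p + z * z / (2 * total)
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / (1 + z * z / total)


# Hypothetical counts: 961 of 1000 labelled examples predicted correctly.
lower = wilson_lower_bound(correct=961, total=1000, confidence=0.90)
print(f"observed accuracy: {961 / 1000:.4f}")
print(f"90% lower bound:   {lower:.4f}")
print("claim 'accuracy >= 95%' supported:", lower >= 0.95)
```

With only a few labelled examples the lower bound falls well below the observed accuracy, so a point estimate of about 96% does not by itself establish the 95% claim.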

JaspreetKaur

I had uploaded the dataset earlier. I don't have the True labels. But I have been given this information that my new dataset contains 3464 records with 453 True values. Now, how should I find out which ones are the true values?

JaspreetKaur

@BalazsBarany @varunm1 I still haven't had my question answered. I have been asked to find the accuracy based on this information and, of course, the new dataset. Is there any other RapidMiner tool that could help me do so?
Given that the new dataset contains 453 true values, how can I use this information to find out which records those are?

lionelderkrikor

Hi @JaspreetKaur,

As mentioned before by @BalazsBarany and @varunm1, the usual methodology in a data science project is:
1/ Train and validate a model using a LABELLED dataset, which allows you to calculate the accuracy of the model.
2/ Then apply the validated model to the new UNLABELLED dataset to make predictions. BUT you cannot determine the exact accuracy of the model on this UNLABELLED dataset.
Anyway, I think there is a misunderstanding about the word "True": by "True" you mean the examples which have the value "True" for your predicted label ("Alarm"), right?
I applied this methodology: I trained a model (Decision Tree) on your LABELLED dataset (called "Alarm file"), applied it to your UNLABELLED dataset (called "New Alarm unscored file"), and obtained the predictions for your label "Alarm": 410 values equal to "True" (maybe these are the values you are talking about) and 3054 values equal to "False". These results were obtained with a Decision Tree model; with another model you might obtain 453 values equal to "True".
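
As a side note on what the two counts can and cannot tell you: knowing only that 453 records are actually "True" and that the model predicted 410 as "True" does not identify the records, but it does bound the possible accuracy. The arithmetic below is a small Python sketch, not part of the attached process. The function name is hypothetical, and the numbers are the ones from this thread.

```python
def accuracy_bounds(total, actual_true, predicted_true):
    """Worst- and best-case accuracy given only class counts.

    Errors = false positives + false negatives
           = actual_true + predicted_true - 2 * true_positives,
    so the bounds follow from the smallest and largest possible overlap.
    """
    max_tp = min(actual_true, predicted_true)
    min_tp = max(0, actual_true + predicted_true - total)
    min_errors = actual_true + predicted_true - 2 * max_tp
    max_errors = actual_true + predicted_true - 2 * min_tp
    return (total - max_errors) / total, (total - min_errors) / total


worst, best = accuracy_bounds(total=3464, actual_true=453, predicted_true=410)
print(f"accuracy is between {worst:.4f} and {best:.4f}")
```

For these counts the model makes at least 43 errors and at most 863, so the accuracy lies somewhere between roughly 75.1% and 98.8%. The counts are therefore consistent with an accuracy above 95% but cannot confirm it; that still requires the record-level labels.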

The attached file contains the process that, from my point of view, you need.

Hope it is clear for you now,

Regards,

Lionel

Process_Alarm.rmp

JaspreetKaur

Thanks so much @lionelderkrikor. This helps me understand the answer better.
Thank you @BalazsBarany and @varunm1 !