Applying New Dataset on the Model
JaspreetKaur
I have built a decision tree model in RapidMiner and get an accuracy of 96.06%. Now I have a new dataset and I want to apply this decision tree model to it. How should I do this to confirm that my accuracy is still at least 95%, with a confidence of at least 90%?
Please advise ASAP!
varunm1
Hello @JaspreetKaur,
You need to store the trained model in your repository using the Store operator.
Then you can retrieve the stored model by dragging and dropping it into the process window, and connect the new dataset and this model to the Apply Model and Performance operators.
JaspreetKaur
Will I get the accuracy by doing so?
JaspreetKaur
The Alarm file is what I used to train and build the model. Now I have the New Alarm file, which does not have true labels, and I want to apply my model to this new dataset and check the accuracy.
Could you help me understand how I should do this now?
Alarm File.xlsx
New Alarm unscored file.xlsx
varunm1
Hello @JaspreetKaur,
You cannot get performance metrics without true labels. You can only make predictions on this new dataset with the trained model, using the Apply Model operator.
Simply connect the dataset to Apply Model and the trained model to the mod port of Apply Model to make predictions on the new dataset.
JaspreetKaur
So, what should I do to get the accuracy results? Like a performance classification matrix?
JaspreetKaur
How will I know if my model still would give me at least 95% accuracy?
varunm1
You need to rely on your validated model performance.
JaspreetKaur
Okay, wait, I have been given a hint here.
The new Alarm file contains 3464 records, with 453 true alarms. Now, can you help me with how I should proceed?
JaspreetKaur
But, the trick is I don't know which records are the 453 true ones. How do I find that? PLEASE HELP!
BalazsBaranyRM
Hi @JaspreetKaur,
if you have labeled data, you can validate the model predictions.
If you have unlabeled data, there is no machine learning process to validate the predictions. They are often validated in real world later.
In validation, you compare the model prediction to the actual label. If you don't have a label, you can't compare.
As @varunm1 mentioned, you're doing a validation during model building. Experience shows that this validation result is applicable to future predictions with the same model if the data doesn't change too much (e.g. there is no concept shift). If the data-generating process changes (e.g. new machines are introduced, the weather becomes warmer; it depends on your scenario), the model starts to get worse. In this case you would retrain the model with recent data once you have the labels.
Best regards,
Balázs
JaspreetKaur
But my question is how will I get the accuracy on the new data set?
BalazsBaranyRM
If you have labels, apply your model to the new data set. You will then have a column with the prediction and one with the label. (Make sure they have the appropriate roles.) Then use Performance (or a more specific operator like Performance (Binominal Classification)) to calculate the accuracy.
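For readers who want to check the arithmetic outside RapidMiner, here is a minimal Python sketch of what this comparison computes once labels are available, plus one common way to attach a confidence statement to it. The function names, the example lists, and the choice of a one-sided Wilson score bound are assumptions made for illustration; they are not part of RapidMiner or of the process discussed in this thread.

```python
from math import sqrt
from statistics import NormalDist


def accuracy(labels, predictions):
    """Fraction of examples whose prediction equals the true label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def wilson_lower_bound(n_correct, n_total, confidence=0.90):
    """One-sided Wilson score lower bound on the true accuracy.

    Assumption: the labelled examples are an independent random sample
    from the same data the model will be used on.
    """
    z = NormalDist().inv_cdf(confidence)  # one-sided normal quantile
    p = n_correct / n_total
    centre = p + z * z / (2 * n_total)
    margin = z * sqrt(p * (1 - p) / n_total + z * z / (4 * n_total ** 2))
    return (centre - margin) / (1 + z * z / n_total)


if __name__ == "__main__":
    # Hypothetical label and prediction columns, for illustration only.
    labels = ["True", "False", "True", "False"]
    predictions = ["True", "False", "False", "False"]
    print("accuracy:", accuracy(labels, predictions))

    # Hypothetical counts: 960 correct predictions out of 1000 labelled examples.
    print("90% lower bound:", wilson_lower_bound(960, 1000, confidence=0.90))
```

If the lower bound comes out above 0.95, the labelled sample supports the claim "accuracy is at least 95% with 90% confidence" under the stated sampling assumption. Without labels, neither function can be evaluated.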
JaspreetKaur
I had uploaded the dataset earlier. I don't have the True labels. But I have been given this information that my new dataset contains 3464 records with 453 True values. Now, how should I find out which ones are the true values?
JaspreetKaur
@BalazsBarany @varunm1 I still didn't get my question answered. I have been asked to find the accuracy based on this information and, of course, the new dataset. Is there any other RapidMiner tool that could help me do so?
Given that I have 453 true values in the new dataset, how can I use this information to find out which records are the 453 true ones?
lionelderkrikor
Hi @JaspreetKaur,
As mentioned before by @BalazsBarany and @varunm1, the usual methodology in a data science project is:
1/ Train and validate a model using a LABELLED dataset, which allows you to calculate the accuracy of the model.
2/ Then apply the validated model to the new UNLABELLED dataset to make predictions. BUT you cannot determine the exact accuracy of the model on this UNLABELLED dataset.
Anyway, I think there is a misunderstanding about the word "True": by "True" you mean the examples that have the value "True" for your predicted label ("Alarm"), right?
So I applied this methodology: I trained a model (Decision Tree) on your LABELLED dataset (called "Alarm file"), then applied it to your UNLABELLED dataset (called "New Alarm unscored file") and obtained the predictions for your label "Alarm". There are 410 values equal to "True" (maybe these are the values you are talking about) and 3054 values equal to "False". These results were obtained with a Decision Tree model; with another model you may obtain 453 values equal to "True".
The attached file contains the process that, in my view, you need.
Hope it is clear for you now,
Regards,
Lionel
Process_Alarm.rmp
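The counts in this thread (3464 records, 453 actual true alarms, 410 predicted "True") do not reveal which records are the true ones, but they do bound the accuracy the model could possibly have on the new file. The following is a rough Python sketch of that arithmetic; the function name is hypothetical, and the bounds assume nothing about the model beyond those three counts.

```python
def accuracy_bounds(n_records, n_actual_true, n_predicted_true):
    """Lowest and highest accuracy consistent with the given counts.

    Only the totals are known, not which records are truly positive,
    so the true-positive count is varied over its feasible range.
    """
    # Best case: the predicted "True" set overlaps the actual one as much as possible.
    max_tp = min(n_actual_true, n_predicted_true)
    # Worst case: the two sets overlap as little as the record count allows.
    min_tp = max(0, n_actual_true + n_predicted_true - n_records)

    def acc(true_positives):
        false_positives = n_predicted_true - true_positives
        false_negatives = n_actual_true - true_positives
        return (n_records - false_positives - false_negatives) / n_records

    return acc(min_tp), acc(max_tp)


if __name__ == "__main__":
    # Counts taken from the thread: 3464 records, 453 true alarms, 410 predicted "True".
    lowest, highest = accuracy_bounds(3464, 453, 410)
    print(f"accuracy lies between {lowest:.4f} and {highest:.4f}")
```

With these counts, at least 453 - 410 = 43 records must be misclassified, so the accuracy can be at most 3421/3464 (about 98.8%); in the worst case none of the 410 predicted alarms is real, giving 2601/3464 (about 75.1%). The range is wide, so the counts alone cannot confirm the 95% target.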
JaspreetKaur
Thanks so much @lionelderkrikor. This helps me better understand the answer.
Thank you @BalazsBarany and @varunm1!