Not normally distributed data

Hi,
I'm trying to find a model that predicts the execution time of a process step. I have data from over 200 different recurring process steps from the past 2 years (160,000 rows in an Excel sheet). When I plot the execution-time data per event, the data is not normally distributed but looks more like a Poisson distribution. Just loading the data into RapidMiner Studio and applying the models does not give a good fit. What can I do? (For data pre-processing in Python or R I would need a step-by-step guide, because I'm pretty new to all of this.)
Some help would really be appreciated!
Best regards
Jeroen

lionelderkrikor

Hi @jeroenheijlen,

Have you tried submitting your data to Auto Model (RapidMiner's AutoML tool)?

Regards,

Lionel

jeroenheijlen

Hi @lionelderkrikor , thanks for your reply.
Yes, I tried Auto Model, but even after I seriously reduced the variation in the input data, no model does a good job on my data:

Image: https://us.v-cdn.net/6030995/uploads/editor/n3/cx6t4k887wx5.png

Image: https://us.v-cdn.net/6030995/uploads/editor/60/zmxdgwaig5km.png

lionelderkrikor

Hi @jeroenheijlen,

Maybe there is no relationship between your independent features and your label (your target).
In that case, it is impossible to find a good model and machine learning is of no use...
In the meantime, you can try to:
- enable feature selection / feature generation in the Auto Model options
- tune the hyperparameters of your best models to try to increase the accuracy / decrease the error rate.

Regards,

Lionel

jeroenheijlen

Hi @lionelderkrikor,
I'm indeed afraid that the variation within each process step is too large, and that therefore no model can find a correlation or a good prediction fit.
Thanks for your advice.
I will try a few more things (automatic feature selection fails), such as starting with a smaller dataset (data from only a few of the process steps, with more of the outliers removed, although the data will still never be normally distributed) and also recasting the target as a binary outcome (for example, more than 2 hours versus less than 2 hours).
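
For anyone who wants to try the same two steps outside RapidMiner, here is a minimal Python sketch, not a tested recipe. It assumes each row is a (process step, execution time in hours) pair; the sample rows, the function names, and the 1.5 x IQR outlier rule are illustrative assumptions, and only the 2-hour cut-off comes from the plan above.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical sample rows: (process_step, execution_time_hours).
rows = [
    ("step_a", 0.5), ("step_a", 0.7), ("step_a", 0.9),
    ("step_a", 1.1), ("step_a", 1.3), ("step_a", 40.0),
    ("step_b", 1.8), ("step_b", 2.1), ("step_b", 2.4),
    ("step_b", 2.6), ("step_b", 3.0), ("step_b", 95.0),
]

THRESHOLD_HOURS = 2.0  # cut-off for the binary target


def remove_outliers_per_step(rows, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], computed per process step.

    The IQR rule is an assumed choice; it does not require normal data.
    """
    times_by_step = defaultdict(list)
    for step, hours in rows:
        times_by_step[step].append(hours)

    # Compute the acceptance bounds once per process step.
    bounds = {}
    for step, times in times_by_step.items():
        q1, _, q3 = quantiles(times, n=4, method="inclusive")
        iqr = q3 - q1
        bounds[step] = (q1 - k * iqr, q3 + k * iqr)

    return [(s, h) for s, h in rows if bounds[s][0] <= h <= bounds[s][1]]


def add_binary_label(rows, threshold=THRESHOLD_HOURS):
    """Append 'long' or 'short' depending on the threshold in hours."""
    return [(s, h, "long" if h > threshold else "short") for s, h in rows]


cleaned = remove_outliers_per_step(rows)
labeled = add_binary_label(cleaned)

for row in labeled:
    print(row)
```

The labeled rows could then be used as input for a classification model instead of a regression model; whether that gives a usable prediction depends on the data.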

If I ever succeed, I will post the outcome ;-).
Best regards
Jeroen

lionelderkrikor

You're welcome, @jeroenheijlen.

Good luck!

regards,

Lionel