"Can Rapid Miner be used to do predictions using aggregated data"

Question

Hi,

I would like to ask whether RapidMiner can use aggregated data to develop failure prediction models.

I have aggregated vehicle test data for cars that were tested in 2016 by the national vehicle testing authority.

I would like to analyse the aggregated data to see if I can produce a prediction model that will allow me to predict the failure rates (and reasons for failures) for vehicles based on the manufacturer (brand), model and year of manufacture.

For example, what is the likelihood that a 2008 Toyota Camry will fail the national vehicle test and, if it does fail, what are the reasons that it will fail, e.g. Brakes, Lights, Emissions, etc.

I have aggregated test data that shows the total number tested, the number that passed, the number that failed and the reasons for failure.

See sample data below.

                                                  Test result     Reason for failure
Manufacturer  Model  Year of Manuf  Total tested  Pass   Fail     Brakes  Lights  Electrical  Emissions  etc.
Toyota        Camry  2010           1600          1000   600      100     600     250         120
Toyota        Camry  2009           2000          800    1200     500     1200    200         100

Vehicles can fail for multiple reasons. Whichever reason produces the highest failure rate determines the overall failure rate.

For example, in the table above, for test year 2016, a total of 2,000 Camrys were tested. These vehicles were manufactured in 2009, i.e. they were 7 years old.

Of the 2,000 that were tested, the highest failure rate was for lighting, where 1,200 Camrys failed.

This means that overall, 1,200 (60%) of the 2,000 Camrys tested failed.
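The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the counts are copied from the 2009 Camry row of the sample data, the variable names are my own, and each reason's rate is expressed as a share of all vehicles tested.

```python
# Counts copied from the 2009 Toyota Camry row of the sample data.
total_tested = 2000
failed = 1200
reason_counts = {"Brakes": 500, "Lights": 1200, "Electrical": 200, "Emissions": 100}

# Overall failure rate: vehicles that failed / vehicles tested.
overall_failure_rate = failed / total_tested

# Failure rate per reason, as a share of all vehicles tested.
# A vehicle can fail for several reasons, so these do not sum to the overall rate.
reason_rates = {reason: count / total_tested for reason, count in reason_counts.items()}

# The reason with the largest count (Lights in this row).
top_reason = max(reason_counts, key=reason_counts.get)

print(f"Overall failure rate: {overall_failure_rate:.0%}")
for reason, rate in reason_rates.items():
    print(f"{reason}: {rate:.1%}")
print(f"Most common reason: {top_reason}")
```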

Would anyone be able to advise or assist in developing a RapidMiner process that would allow me to use the aggregated test results to predict the failure rates (and reasons) for vehicles that will be tested in future?

Regards

Tom

Telcontar120 · Answer

I think we're just using the terms differently.  If that is all the data you have, as you suggest, you can easily use RapidMiner to calculate the failure rates by reason for every combination of year, make and model.  Then your prediction is basically just a giant lookup table to retrieve that specific data (not substantially different than what you could accomplish using a spreadsheet or a database table).  Since there is no learning algorithm involved, I wouldn't really call that a predictive model, but if you don't have any additional data, I don't think you'll be able to do much more than that.

To generate the relevant failure rates in RapidMiner, you'll want to utilize the "Generate Aggregation" operator, which allows you to do calculations across attributes (columns) of data.  You can check out that operator's tutorial process to see a simple example and then modify it to suit your data file.  It should be very straightforward.
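To make the "giant lookup table" idea concrete, here is a minimal sketch in plain Python rather than RapidMiner. The two rows are the sample rows from the question; the names ROWS, build_lookup and the dictionary keys are hypothetical, chosen only for illustration.

```python
# Sample rows from the question:
# (manufacturer, model, year of manufacture, total tested, passed, failed, reason counts)
ROWS = [
    ("Toyota", "Camry", 2010, 1600, 1000, 600,
     {"Brakes": 100, "Lights": 600, "Electrical": 250, "Emissions": 120}),
    ("Toyota", "Camry", 2009, 2000, 800, 1200,
     {"Brakes": 500, "Lights": 1200, "Electrical": 200, "Emissions": 100}),
]


def build_lookup(rows):
    """Map (manufacturer, model, year) to observed failure rates."""
    table = {}
    for make, model, year, total, _passed, failed, reasons in rows:
        table[(make, model, year)] = {
            "fail_rate": failed / total,
            "reason_rates": {reason: n / total for reason, n in reasons.items()},
        }
    return table


lookup = build_lookup(ROWS)

# The "prediction" is just retrieval of the historical rate for that combination.
entry = lookup.get(("Toyota", "Camry", 2009))
print(entry["fail_rate"], entry["reason_rates"])

# A combination that was never tested has no entry at all.
print(lookup.get(("Toyota", "Corolla", 2009)))
```

The second lookup shows the limitation: with no learning algorithm involved, there is nothing to say about a combination of make, model and year that does not appear in the data.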

hawthorn_33 · Answer

Thanks for your response.

I agree it is not a classical prediction problem.

Ultimately, what I am trying to do is build a model whereby I can select the Vehicle Manufacturer, Vehicle Model and Year of Manufacture and use this information to predict the chances of the vehicle failing the national vehicle test which is mandatory for all vehicles over four years of age.

When predicting the failure, I also wish to predict the reasons for failure, e.g. brake failure, lighting, electrical, etc.

In relation to the data that's available, I have an aggregated data file of vehicles tested each year, comprising over 8,000 lines representing over 750,000 vehicles. I have selected 2016 as my analysis year for this exercise.

Each line of data includes the following:

Vehicle Manufacturer

Vehicle Model

Year of manufacture

Total number of vehicles tested (for each combination of Manufacturer, Vehicle Model, Year of manufacture)

Total that passed the test

Total that failed the test

For the vehicles that failed (above), I have the number that failed under each test category, e.g. Brakes, Lighting, Emissions, etc., so I can work out the failure rates by reason type (for each combination of Manufacturer, Vehicle Model, Year of manufacture).

Perhaps I should use a different modelling tool for this application, or would you have a suggestion as to how I could use RapidMiner to do it?

Thanks

Telcontar120 · Answer

Aggregated data is no problem for RapidMiner.  I think what you have here is more of a predictive data timeframe problem.  Typically for a predictive model, you use data from some prior period (the observation period) to predict the outcome in a subsequent period (the performance period).  But your dataset (at least the part you have shown here) only contains information on the test outcomes.  So in this framework, what would you use to predict failure?  Do you have any other attributes on the cars?

You could easily use RapidMiner to summarize the average failure rate for each car make and model (presumably aggregating across year) but that's not really a predictive model--it's just a summary of past performance.  If you had no additional information to use, that would represent a better prediction than the overall average failure rate, but it's not a predictive model in the classical sense.
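As a rough illustration of that summary (again in plain Python rather than RapidMiner, with hypothetical names), the failure rate per make and model can be pooled across years by summing the counts before dividing. That weights each year by the number of vehicles tested, which a simple mean of the yearly rates would not do.

```python
from collections import defaultdict

# Sample rows from the question:
# (manufacturer, model, year of manufacture, total tested, failed)
ROWS = [
    ("Toyota", "Camry", 2010, 1600, 600),
    ("Toyota", "Camry", 2009, 2000, 1200),
]


def pooled_failure_rates(rows):
    """Failure rate per (manufacturer, model), pooled across years of manufacture."""
    totals = defaultdict(lambda: [0, 0])  # (make, model) -> [tested, failed]
    for make, model, _year, tested, failed in rows:
        totals[(make, model)][0] += tested
        totals[(make, model)][1] += failed
    return {key: failed / tested for key, (tested, failed) in totals.items()}


rates = pooled_failure_rates(ROWS)

# Baseline for comparison: the overall failure rate across every row.
overall_rate = sum(row[4] for row in ROWS) / sum(row[3] for row in ROWS)

print(rates[("Toyota", "Camry")])  # 1800 / 3600 = 0.5
print(overall_rate)
```

With only one make and model in the sample, the pooled rate and the overall rate coincide. On the full file they would differ, which is the sense in which the per-model summary is a better guess than the overall average.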