Processing high volumes

Hello fellows.

We need to process a considerable volume of data, about 1 million retail ticket lines per day. Although this is a high volume, maybe it does not actually deserve to be considered a 'big data' scenario.

Can anyone confirm or refute this assumption? And if this should be considered big data, what would be the recommended approach using RapidMiner?

Thanks


MartinLiebig

Hi!

There are different points to consider:

1. What is the actual data size? Smaller than 32 GB?
2. What do you want to do with it? Aggregate? Or learn on 1 million examples?

If the data set is smaller than your RAM, everything should be fine, as long as the actual number of examples is low enough for reasonable runtimes. Otherwise you might simply sample beforehand.
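
As a rough sanity check on the RAM question, here is a minimal back-of-the-envelope sketch in Python. The 200 bytes per ticket line is purely an assumed figure (nobody in this thread stated a row size), and the function name is made up for illustration. The in-memory footprint inside a mining tool will usually differ from the raw size, so treat the result as an order of magnitude only.

```python
def days_until_ram_full(lines_per_day: int, bytes_per_line: int, ram_bytes: int) -> float:
    """Number of days of raw data that fit into ram_bytes.

    bytes_per_line is an assumption you have to measure on your own data.
    """
    bytes_per_day = lines_per_day * bytes_per_line
    return ram_bytes / bytes_per_day


# 1 million ticket lines per day (from the question),
# 200 bytes per line (ASSUMED), 32 GB of RAM counted as 32 * 10**9 bytes.
days = days_until_ram_full(1_000_000, 200, 32 * 10**9)
print(f"Roughly {days:.0f} days of raw data fit into RAM")
```

With these assumed numbers that is about 160 days of raw data, so whether the full history fits depends heavily on the real row size and on how many days you need.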

Cheers,
Martin

peleitor

Hello, thanks for your reply.

1. We might take representative samples that could fit into 32 GB. The full data set size largely exceeds that.

2. Aggregation could be solved directly in SQL, since this is a relational database. But we also have mining purposes in mind: association detection like market basket analysis (MBA), or other predictive methods like decision trees or linear/logistic regression.

The big question here is whether we would need some big data processing architecture (e.g. Hadoop-based) standing between the RDBMS and the mining software.

Regards

MartinLiebig

Hi,

There are a few ways to handle this. Since the total data size is most likely larger than your RAM, you need a special infrastructure.

Way 1: Use a Hadoop cluster, sample your data, learn on the sampled data in-memory and apply the model in Hadoop.
Way 2: Use a Hadoop cluster and learn directly in Hadoop. Radoop currently supports quite a few algorithms (Decision Tree, Naive Bayes, Logistic Regression) and some more are to come.
Way 3: Use either a Hadoop cluster or some SQL DWH and work only on aggregates / representatives.
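
The sampling step in Way 1 could be sketched as follows. This is only an illustration in plain Python using reservoir sampling (Algorithm R), a technique swapped in here for the example; it is not how Radoop or RapidMiner samples internally. The function name `reservoir_sample` and its parameters are hypothetical. The point is that you can draw a fixed-size uniform sample from a stream of rows (for example a database cursor) without ever holding the full data set in RAM.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of at most k rows from a stream.

    Only k rows are kept in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1),
            # replacing a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Usage: range() stands in for a stream of ticket lines.
subset = reservoir_sample(range(1_000_000), k=1_000)
print(len(subset))
```

The resulting subset is small enough to learn on in memory, and the fitted model can then be applied to the full data wherever it lives.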

I think Way 3 might not be suited for you. Since this is about Radoop, I would ask you to contact our sales team (e.g. here: https://rapidminer.com/contact-sales-request-demo/ ). Then I (or one of my colleagues) might set up a Webex or similar to talk about it.

Cheers,
Martin

peleitor

Thanks Martin!

Do you think solving this via Hadoop/Radoop is a typical situation in the retail industry? (E.g. one retail store with 20 branches and 2M potential customers.)

Regards

MartinLiebig

Hi,

Since I am a consultant in Germany, I can hardly speak about the non-German market. What I have experienced is that more and more companies are shifting towards such an infrastructure. However, in Germany it really is a "still shifting" situation. It is visible that the use of data is becoming more and more a requirement instead of a nice-to-have.
From what I have heard, U.S. companies are faster in the process of adapting.

Cheers,
Martin

peleitor

Thanks for the feedback!