Processing high volumes

peleitor New Altair Community Member
edited November 2024 in Community Q&A
Hello fellows.

We need to process a considerable volume of data, about 1 million retail ticket lines per day. Although this is a high volume, maybe it does not actually deserve to be considered a 'big data' scenario.

Can anyone confirm or refute this assumption? And if this should be considered big data, what would be the recommended approach using RapidMiner?

Thanks

Answers

  • MartinLiebig
    Altair Employee
    Hi!

    There are different points to consider:

    1. What is the actual data size? Smaller than 32 GB?
    2. What do you want to do with it? Aggregate? Or learn on 1 million examples?

    If the data set is smaller than your RAM, everything should be fine, as long as the actual number of examples is low enough for reasonable runtimes. Otherwise you might simply sample beforehand.
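
    The "does it fit in RAM?" check above can be sketched as back-of-envelope arithmetic. This is only an illustration: the 1 million lines per day and the 32 GB figure come from this thread, while the 200 bytes per row, the retention periods, and the 50% headroom factor are assumptions to replace with your own measurements.

```python
def estimated_size_gb(rows_per_day: int, bytes_per_row: int, days: int) -> float:
    """Rough raw data size in GB (10**9 bytes), ignoring any in-memory overhead."""
    return rows_per_day * bytes_per_row * days / 1e9


def fits_in_ram(size_gb: float, ram_gb: float = 32.0, headroom: float = 0.5) -> bool:
    """True if the data uses at most `headroom` of the RAM.

    The headroom factor is an assumption: modelling usually needs working
    memory on top of the raw data.
    """
    return size_gb <= ram_gb * headroom


ROWS_PER_DAY = 1_000_000  # from the question
BYTES_PER_ROW = 200       # assumption: measure this on your own table

for days in (30, 365, 5 * 365):  # hypothetical retention periods
    size = estimated_size_gb(ROWS_PER_DAY, BYTES_PER_ROW, days)
    print(f"{days:>5} days: ~{size:.0f} GB, fits in 32 GB RAM: {fits_in_ram(size)}")
```

    Under these assumed numbers, a single month is unproblematic, while a year or more of history no longer fits, which is where sampling comes in.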

    Cheers,
    Martin
  • peleitor
    peleitor New Altair Community Member
    Hello, thanks for your reply.

    1. We might take representative samples that could fit into 32 GB. The full data set is much larger than that.

    2. Aggregation could be solved directly in SQL, since this is a relational database. But we also have mining purposes in mind: association detection such as market basket analysis (MBA), or predictive methods such as decision trees or linear/logistic regression.

    The big question here is whether we would need some big data processing architecture (e.g. Hadoop-based) standing between the RDBMS and the mining software.

    Regards
  • MartinLiebig
    Altair Employee
    Hi,

    There are a few ways to handle this. Since the total data size is most likely larger than your RAM, you need a special infrastructure.

    Way 1: Use a Hadoop cluster, sample your data, learn on the sampled data in memory, and apply the model in Hadoop.
    Way 2: Use a Hadoop cluster and learn directly in Hadoop. Radoop currently supports quite a few algorithms (Decision Tree, Naive Bayes, Logistic Regression), and more are to come.
    Way 3: Use either a Hadoop cluster or some SQL DWH and work only on aggregates / representatives.
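
    The sampling step in Way 1 can be sketched as follows. This is a minimal illustration in plain Python, not the RapidMiner or Radoop way of doing it: it swaps in reservoir sampling (Algorithm R), which draws a uniform sample of fixed size from a stream that never has to fit in memory. The function name, the sample size, and the synthetic ticket lines are all hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k rows in a single pass.

    Only k rows are held in memory at any time, so `rows` can be a
    database cursor or file iterator over data much larger than RAM.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep row i with probability k / (i + 1), evicting a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stand-in for a stream of ticket lines: (ticket_id, branch_id).
ticket_lines = ((ticket_id, ticket_id % 20) for ticket_id in range(100_000))
sample = reservoir_sample(ticket_lines, k=1_000, seed=42)
print(len(sample))
```

    A model trained on such a sample in memory could then be applied to the full data on the cluster, as Way 1 describes. Note that sampling individual ticket lines breaks up baskets, so for association analysis one would more likely sample whole tickets.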

    I think Way 3 might not be suited for you. Since this concerns Radoop, I would ask you to contact our sales team (e.g. here: https://rapidminer.com/contact-sales-request-demo/ ). Then I (or one of my colleagues) could set up a Webex or similar to discuss it.

    Cheers,
    Martin
  • peleitor
    peleitor New Altair Community Member
    Thanks Martin!

    Do you think solving this via Hadoop/Radoop is a typical situation in the retail industry? (E.g. one retail store with 20 branches and 2M potential customers)

    Regards
  • MartinLiebig
    Altair Employee
    Hi,

    Since I am a consultant in Germany, I can hardly speak about the non-German market. What I have experienced is that more and more companies are shifting towards such an infrastructure. However, in Germany the shift is really still in progress. It is visible that using data is becoming a requirement rather than a nice-to-have.
    From what I have heard, U.S. companies are faster in adopting it.

    Cheers,
    Martin
  • peleitor
    peleitor New Altair Community Member
    Thanks for the feedback!