Mining large data?

borgg
Hi there,

I'm going to analyze a big set of data: roughly 13.000.000 features on about 1200 samples (positive and negative). When I put everything into one big matrix, that makes about 16.000.000.000 values (as 4-byte floats around 64 GB, which from my point of view is a lot ;) ). I want to run some machine learning algorithms on it to find a classification into +/-. How does RapidMiner behave on data this big? Can it handle it without grinding to a halt? I'd just like to know before I start working my way into this software.
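
Just to put rough numbers on that estimate (a quick sketch, assuming 4-byte floats; with doubles it would be about twice as much):

```python
# Back-of-the-envelope memory estimate for the dense matrix described above.
n_features = 13_000_000
n_samples = 1_200

n_values = n_features * n_samples               # ~15.6 billion cells
print(f"values : {n_values:,}")
print(f"float32: {n_values * 4 / 1e9:.1f} GB")  # ~62 GB, roughly the 64 GB above
print(f"float64: {n_values * 8 / 1e9:.1f} GB")  # ~125 GB with 8-byte doubles
```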

Does anyone have experience with datasets that are... let's say "much larger than the RAM"?

Thanks

Answers

  • land
    Hi,
    at first sight this seems to be impossible, for two reasons: you could either store all the data in RAM, which obviously won't work (at least on normal computers), or store the data in a database and use cached database access, which always keeps only a small subset of the data in main memory. But the second option will also fail, because today's databases cannot store more than around 1000 columns, and it seems to me that 13.000.000 exceeds this limit a bit.

    But despite that, it might still be possible, depending on the nature of your data. For example, if your data is something like time series and the samples are independent of each other, one could split the data and extract a smaller number of attributes that carry the information needed to discriminate the two classes; then it would work.

    So for a final decision on whether data mining is applicable here at all, I need more information about your data and the task at hand.

    Greetings,
      Sebastian
  • borgg
    Well, what is this data about...

    I have 400 images showing an object that I would like to detect. The number of negative samples is up to my choice; they are easy to get ;). There are a lot of features I can use to describe such an object in an image (Gabor wavelets, Haar-like features, etc.). Each of them is weak, because it only describes a small piece of the object in the image. Combined into a cascade and/or decision tree they work very well (if you are interested, see http://research.microsoft.com/en-us/um/people/viola/pubs/detect/violajones_cvpr2001.pdf ; they used Haar-like features, for example).
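
    Just to illustrate what one of these weak features looks like, here is a minimal sketch of a two-rectangle Haar-like feature evaluated via an integral image (toy code, not my actual implementation; window size and coordinates are made up):

    ```python
    import numpy as np

    def integral_image(img):
        """Cumulative sums over rows and columns, as used by Viola-Jones."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, x, y, w, h):
        """Sum of the pixels in the rectangle at (x, y) with size (w, h), via 4 lookups."""
        a = ii[y + h - 1, x + w - 1]
        b = ii[y - 1, x + w - 1] if y > 0 else 0.0
        c = ii[y + h - 1, x - 1] if x > 0 else 0.0
        d = ii[y - 1, x - 1] if (x > 0 and y > 0) else 0.0
        return a - b - c + d

    def two_rect_feature(ii, x, y, w, h):
        """Left half minus right half: one weak, edge-like Haar feature."""
        half = w // 2
        return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

    window = np.random.rand(24, 24)   # a 24x24 detection window, as in the paper
    ii = integral_image(window)
    print(two_rect_feature(ii, x=2, y=3, w=8, h=6))
    ```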

    Now I want to find out which kinds or classes of features (among all published ones, as far as I can find them) are best at describing my object class in images. I want to compare different feature classes and see which ones are chosen most often by a machine learning algorithm when it can choose from all of them.

    All features, at all scales, at all possible positions: that is what results in my huge number of features. I know this is not very practical, but this work has more of an academic background than a practical one ;).

    My option 2 is to build my own learning framework that I can run on a grid computer (approx. 60 X2 Opterons with 2 GB RAM each). But for that I would have to spend a lot of time implementing everything and worrying about MPI (communication and load balancing between the machines in the grid). There, of course, I won't suffer from the RAM problem ;). But I would be coding my machine learning / decision-tree building etc. from scratch. I have already done a lot of work on this, but it is far from finished. Now I found this lovely toolbox here and asked myself whether it could be my solution.

    But from your comments it seems I'd better carry on with my "option 2".  :-\
  • land
    Hi,
    ok, I think I understand your problem now and why you have so many features. But there are a few issues I'm not quite sure you are aware of. On the one hand, your extremely large attribute space will make the selection of useful features extremely fragile, because of overfitting on your training data. The chances are not too bad that one or two attributes out of those 13.000.000, in combination, will discriminate 1000 examples perfectly. But then you fetch your 1001st example and it doesn't fit this scheme. So you will have to be extremely careful.
    The second thing you might have overlooked is the runtime of the usual data mining techniques. SVMs, for example, have a runtime that is cubic in the number of examples, multiplied by the number of attributes. Tree learners are only quadratic in the number of examples, but suffer from a high number of attributes. In general it is difficult to parallelize learning algorithms, so your grid framework will pose completely different problems, possibly shifting your scientific focus more and more towards the data mining tasks themselves.

    If you would like a simpler solution, I would rather suggest extending RapidMiner in a way that solves exactly your problem. I would start by writing a disk-cached ExampleSet and a special operator that loads all your data by returning such an example set. As long as no other operator creates a new example set, this will work. It might be slow, but it should work, and it is rather straightforward to implement.
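
    Just to illustrate the disk-caching idea in general (a Python sketch with a made-up file name, not RapidMiner's Java API): a memory-mapped file lets you address the whole matrix while the operating system keeps only a small window of it in RAM.

    ```python
    import numpy as np

    # Toy shape so the example stays small; the real matrix would be
    # 1200 x 13,000,000 float32 values (~62 GB on disk).
    n_samples, n_features = 1_200, 10_000
    X = np.memmap("features.dat", dtype=np.float32, mode="w+",
                  shape=(n_samples, n_features))

    # Rows can be written one at a time as the features are extracted ...
    X[0, :] = np.random.rand(n_features)
    X.flush()

    # ... and later read back chunk by chunk, so the full matrix is
    # never resident in main memory at once.
    chunk_means = X[:100].mean(axis=1)
    print(chunk_means[:5])
    ```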

    One question I still have: have you already extracted all of these features?

    Greetings,
      Sebastian
  • borgg
    Hey thanks for your answer,

    I am aware of the overfitting problem, and I am really concerned about it. Actually I have about 600 labeled positive samples (not only 400), and I will always keep a control set of 200 in the background to make sure I get a real generalization effect. Furthermore, I must/should limit the number of features to counter the overfitting problem; in the end I would like to end up with a small set of features anyway. I may also create many more negative samples, but I'm not sure whether that will really help. Still, you are perfectly right: there is a high chance of ending up with features that discriminate my set only "by fortune". I will inspect the finally chosen features carefully (in computer vision I can easily visualize what is going on and where; fortunately this is not abstract data I am dealing with) and check them for plausibility.
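
    Roughly what I mean by the control set, as a toy sketch (random stand-in data; the point is only that the held-out 200 samples are never touched during feature selection or training):

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Random stand-in data: 600 samples, 500 candidate features, random labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 500))
    y = rng.integers(0, 2, size=600)

    # Hold 200 samples back as the control set.
    X_train, X_ctrl, y_train, y_ctrl = train_test_split(
        X, y, test_size=200, stratify=y, random_state=0)

    clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print("train accuracy  :", clf.score(X_train, y_train))
    print("control accuracy:", clf.score(X_ctrl, y_ctrl))
    # With random labels the control accuracy stays near 0.5 while the train
    # accuracy is clearly higher: exactly the "by fortune" effect to watch for.
    ```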

    The learning time might indeed be a problem (particularly when I want to play around with different parameters), but only on a single machine. When I use my cluster, this problem is "solved" simply by mass ;)

    My implementation will use the (Gentle) AdaBoost algorithm. The stages will not only contain single majority-voting features (decision stumps) but decision trees (I'm new to CARTs and will spend a lot of time, and have some fun, arranging the attributes to get this working properly). In general this is quite nice to parallelize. I will use one supervisor node and x worker nodes. The supervisor knows all features and all samples but never builds the full learning matrix. The workers know all samples but only a subset of the features (to keep each matrix no larger than the RAM). In each "round" of the boosting, every worker proposes one candidate feature to be chosen next (along with its resulting error etc.). The supervisor node picks the best candidate and organizes the reweighting of the samples. The supervisor only evaluates the features of the assembled full stages, to compute the quality of a complete stage over all features. In the paper I mentioned above they go further and build cascades of such boosted classifiers. I already had a very simple implementation of this running on my cluster, and it seems to work fine. I'm now about to write "nice code" and implement all my features.
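
    A single-process sketch of one such setup (toy data, a plain discrete AdaBoost with stumps standing in for the Gentle AdaBoost + CART version, and no MPI, just the supervisor/worker control flow):

    ```python
    import numpy as np

    def best_stump(X_cols, y, w, feature_ids):
        """Worker step: best decision stump on this worker's feature slice."""
        best = (np.inf, None, None, None)   # (weighted error, feature, threshold, polarity)
        for j, f in enumerate(feature_ids):
            for thr in np.quantile(X_cols[:, j], [0.25, 0.5, 0.75]):
                for pol in (1.0, -1.0):
                    pred = np.where(pol * (X_cols[:, j] - thr) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if err < best[0]:
                        best = (err, f, thr, pol)
        return best

    # Toy data: 200 samples, 40 features, labels in {-1, +1} driven by feature 3.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 40))
    y = np.where(X[:, 3] + 0.3 * rng.normal(size=200) > 0, 1, -1)

    w = np.full(len(y), 1.0 / len(y))                    # sample weights, updated every round
    workers = np.array_split(np.arange(X.shape[1]), 4)   # each worker owns a feature slice

    for rnd in range(5):
        # Each worker proposes one candidate from its own slice ...
        candidates = [best_stump(X[:, ids], y, w, ids) for ids in workers]
        # ... and the supervisor picks the best one and reweights the samples.
        err, f, thr, pol = min(candidates, key=lambda c: c[0])
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = np.where(pol * (X[:, f] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        print(f"round {rnd}: picked feature {f}, weighted error {err:.3f}")
    ```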

    This also answers your last question: I have not extracted the features yet.

    Thanks for your offer regarding an extra ExampleSet. But the longer I think about my masses of data, the more I tend towards using the cluster here. I may spend more time on the implementation, but I will get more opportunities (in terms of time consumption) to play around with different settings and parameters.

    Greetings, Axel
  • land
    Hi,
    then all I can do is wish you all the best. I think you will need it, facing such a huge problem. But the bigger the problem, the greater the reward, isn't it? :)

    Greetings,
      Sebastian
  • wessel
    Interesting, could you give us an update on your progress?

    Is it possible to do fast PCA?
    Or instead of pixel intensity use HOG or LRF?
    Or some other form of feature aggregation?
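
    (For the PCA question, what I have in mind by "fast" is something out-of-core, e.g. mini-batch PCA via scikit-learn's IncrementalPCA; the shapes below are made up, and with millions of raw features you would still want to aggregate or patch the images first.)

    ```python
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    # Toy stand-in: stream the samples in mini-batches so the full
    # matrix never has to sit in memory at once.
    rng = np.random.default_rng(0)
    ipca = IncrementalPCA(n_components=50)

    for _ in range(12):                      # 12 batches of 100 samples each
        batch = rng.normal(size=(100, 10_000))
        ipca.partial_fit(batch)

    X_new = ipca.transform(rng.normal(size=(5, 10_000)))
    print(X_new.shape)                       # (5, 50): the aggregated features
    ```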

    Btw, MATLAB's .mat format seems to be a lot smaller than ASCII .csv.


    Edit: Axel, are you using some external software to do the dimensionality reduction?
    You were talking about a cluster; PCA is a sort of clustering, I guess.