Problem loading data of the Data Mining Contest (prudsys.com)

MartinKoch
MartinKoch New Altair Community Member
edited November 5 in Community Q&A
Hello,

last year I successfully used RapidMiner 4.4 to participate in the data mining contest from prudsys. Two days ago the task for the current DMC was released (http://www.data-mining-cup.de/en/dmc-competition/task/).

The task contains about 3 MB of data in a .csv file. My problem is that the process which imports the file into a repository seems to freeze.

When I strip the data down to a few hundred lines, the import works well, but when I try to load the whole data set the process does not finish. I let RapidMiner "import the data" for about 5 hours on my computer (DualCore AthlonXP with 4 GB RAM) and there was no progress.

RapidMiner 4.6 can read the data without problems, but I need support for date attributes, and AFAIK there is no such support in 4.6.

Now I wonder what the problem is.

Could it be a problem with the encoding of the text file?

Is it a bug?

Is it just my impatience that makes me think something is wrong? Should I give RM more time to complete the import task?


Yours faithfully
Martin

Answers

  • MartinKoch
    MartinKoch New Altair Community Member
    Hello again,

    I narrowed the problem down to the "Read CSV" operator. Maybe somebody from Rapid-I can make use of the following description.

    I think that importing a CSV file into the repository uses the same code (most likely the same operator) as the "Read CSV" operator.

    The CSV file I tried to import was about 3 or 4 MB big with a total of 32429 lines. After stripping the size down to 2.5 MB or 22171 lines the importing worked. The error has to be in the region of 2.5 MB since importing a file of 22554 lines still didn't work for me. So much for using the functionality of importing to repository.

    If I do the same with the csv operator I get the same results but with one important difference: I get an Exception.

    Apr 17, 2010 1:57:12 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
    Apr 17, 2010 1:57:12 PM SEVERE: Here:          Process[1] (Process)
              subprocess 'Main Process'
                +- Read CSV[1] (Read CSV)
          ==>  +- Read CSV (2)[1] (Read CSV)
    Apr 17, 2010 1:57:12 PM SEVERE: java.lang.NullPointerException
    The exception is only thrown if I activate the "parse numbers" option. If it is not activated the bigger files are imported as well. Maybe there is a memory leak or something in the number parsing logic??

    Since I can now work by combining two trimmed down data sets the problem is solved for me but maybe a programmer can make some use of it.

    Sorry if there is more information I could have provided about the exception. I never had any problems with RM before, so I never had to check for additional log files and so on.
  • haddock
    haddock New Altair Community Member
    Greetings!
    "Maybe there is a memory leak or something in the number parsing logic??"
    Could be in the way dates get parsed - I've seen "0000-00-00" in the last date column.

  • MartinKoch
    MartinKoch New Altair Community Member
    Hi.

    First of all, it seems that the behaviour I described does not depend on a fixed size, i.e. I couldn't import some of the data again although the file was much smaller. So I suspected a parsing error again. I split a file that failed to import into two files to look for input lines that couldn't be parsed correctly, but there were none. A file of 5000 lines did not get imported; after splitting it into two files of 2500 lines each the import worked, and after concatenating them again the import failed again.
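
The split-and-retry search described above can be sketched in Python. This is only an illustration and not part of RapidMiner: the function name `split_with_header` is hypothetical, and the sketch assumes that the first line is a header row and that no quoted field contains a line break.

```python
from typing import Iterator, List


def split_with_header(lines: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield chunks of at most chunk_size data rows, each prefixed with the header.

    Assumption: lines[0] is the header row and every other entry is one record.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not lines:
        return
    header, rows = lines[0], lines[1:]
    for start in range(0, len(rows), chunk_size):
        # Repeat the header so every chunk can be imported on its own.
        yield [header] + rows[start:start + chunk_size]


# Small in-memory demonstration with made-up data.
example = ["id;value", "1;10", "2;20", "3;30"]
for number, chunk in enumerate(split_with_header(example, 2), start=1):
    print(f"chunk {number}: {chunk}")
```

Each chunk could then be written to its own file and imported separately; halving the chunk size on every failure gives a simple bisection.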

    @Haddock
    You're right. It wasn't the problem though, because after deleting every zero-date the import still failed. But I noticed that the date 0000-00-00 gets parsed as 0002-11-30.

    So I have to thank you for your comment. You probably saved me a lot of error searching later on. Thank you :)
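
Since the zero dates are silently turned into 0002-11-30, one option is to blank them out before the import so that the importer has a chance to treat them as missing values. A minimal Python sketch, assuming a semicolon delimiter (the real file's delimiter may differ); the helper name `blank_zero_dates` is hypothetical:

```python
import csv
import io

ZERO_DATE = "0000-00-00"


def blank_zero_dates(text: str, delimiter: str = ";") -> str:
    """Return the CSV text with every cell equal to '0000-00-00' emptied.

    Assumption: the delimiter default is a guess, adjust it to the actual file.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        # Only exact matches are replaced; other cells are passed through.
        writer.writerow(["" if cell == ZERO_DATE else cell for cell in row])
    return out.getvalue()


# Small in-memory demonstration with made-up data.
sample = "id;date\n1;0000-00-00\n2;2008-04-17\n"
print(blank_zero_dates(sample))
```

For a real file, the text would be read from disk first and the result written to a new file, leaving the original untouched.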
  • haddock
    haddock New Altair Community Member
    Hi there,

    Not quite sure where you got to with this. I've just run the following, and all 32k rows load fine in 1 second, just as long as you don't parse those pesky numbers!

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="391" width="915">
          <operator activated="true" class="read_csv" expanded="true" height="60" name="Read CSV (2)" width="90" x="45" y="165">
            <parameter key="file_name" value="C:\Documents and Settings\Alien\My Documents\Prolog\dmc2010_train.csv"/>
            <parameter key="parse_numbers" value="false"/>
          </operator>
          <connect from_op="Read CSV (2)" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MartinKoch
    MartinKoch New Altair Community Member
    Thanks again ;)

    It's not what I wanted in the first place, but thanks to you I tried the "Guess Types" operator once again. Although I'm pretty sure I tried it before, now everything works as expected, i.e. "Date" type was guessed correctly which didn't happen last time I used it. Maybe the last time the operator recognised my dates as nominal values because the 0000-00-00 date was still in my csv file.

    I don't know, I think I will just work with this now, because I feel like I'm data mining this problem and not the task I should be mining.

    But I also do not want to switch to another solution, because last year I successfully competed against about 40 teams with RapidMiner. Please don't think I won or anything; I finished in 22nd place. But for working alone the whole month, I'm proud of myself :)
  • haddock
    haddock New Altair Community Member
    Nice one,

    Go Marts, Go  ;D
  • land
    land New Altair Community Member
    Hi Martin,
    are you participating in the DMC?

    Greetings,
      Sebastian
  • MartinKoch
    MartinKoch New Altair Community Member
    Hi Sebastian,

    yes, I'm participating in it for the second time now. Why do you ask? :)
  • land
    land New Altair Community Member
    Hi,
    well, I should have read the subject. :) I was just curious. I participated two years ago and I know what you mean about being on your own there. I don't know if they have changed it, but back then the first 20 places went to members of just two groups working together. Not very surprising that groups of 10 people can achieve more than just one person with one computer...

    I wish you all the best. It's always fun if someone wins the cup of our competitor using our software :) What's the cup about this year? Again customer data?

    Greetings,
      Sebastian
  • IngoRM
    IngoRM New Altair Community Member
    Hello Martin,

    I also wish you all the best! Working alone can be much harder than working in a team. Maybe this motivates: The third place in last year's DMC was made by a single person with RapidMiner. The distribution of used tools in the TOP 10 was the following:
    • RapidMiner: 3x
    • R: 3x
    • SPSS PASW (Clementine): 3x
    • Own code: 1x
    Although I actually believe that good ideas (often good preprocessing or additionally extracted features) are more important than the selected tool, I am of course really happy to see that so many people use RapidMiner for the cup. Maybe other people here in the community also want to participate, and we could add a discussion thread where we can share ideas and thoughts about the best processes, help people find other participants, and maybe even build teams. How does that sound?

    Cheers,
    Ingo
  • MartinKoch
    MartinKoch New Altair Community Member
    Hi Sebastian and Ingo, (sorry if it's inappropriate addressing you with your first name)

    actually the situation with the DMC has changed. Since last year only students can participate in the DMC, and only two students per university can actually register themselves at prudsys. These two students then have to act as team leaders for every student willing to participate. Since we're a small university of applied sciences in Schmalkalden, Germany, there are as good as zero students who are willing to take part in the competition.

    At the moment I am trying to motivate some students to participate, though. But these are primarily students who are currently taking a course on data mining (sadly the only one for bachelor students; I'm a diploma student by the way :)), so they aren't really into the subject.

    I also told them that it's not about the program but about good ideas, background knowledge and common sense.

    Last year I chose RM because I wanted features like automatic parameter optimization, cross-validation techniques and flexible visualization working out of the box. Although I'm an experienced programmer, I didn't want to do a lot of coding, and Clementine, for example, required some SPSS coding because it did not support automatic parameter optimization for e.g. the support vector machine. (Correct me if I'm wrong, please.)

    This year's task is about classifying customers, like two years ago.

    This year I am getting some support from a research group at our university as well. I'm able to use a home-brewed grid computing system (in the widest sense) for doing an exhaustive search for the optimal parameters of a support vector machine.

    Since we do not have that many lectures on data mining, I mainly participate to gain experience in the field (in data mining and in managing a team), and I'll leave the winning to the students who are more experienced ;)

    Greetings,
    Martin
  • IngoRM
    IngoRM New Altair Community Member
    Hi Martin,

    first name is fine  :)

    I thought that only up to two teams per university were allowed but that there was no restriction on the team size. But maybe they have changed this again. As you said, participating in the competition is a great way to gain experience and also some practice for real-world data mining tasks. I believe that this is really a chance, and I hope that you manage to convince some of your student colleagues to participate in your team.

    However, the idea of creating a "market place" to help students find their teams and share their thoughts will of course not work if only university teams are allowed. Well, maybe this is just the idea of a competition like that  ;)

    Anyway, I wish you all the best and feel free to ask if you have a problem somewhere.

    Cheers,
    Ingo
  • MartinKoch
    MartinKoch New Altair Community Member
    Hi Ingo,

    thanks for your "help offer". I will definitely come back to your offer. You're doing a great job here with your forum by the way :)

    Just one correction to my previous explanation: you can have unlimited students per university, but only two students can register themselves officially at prudsys. So the team size is not limited. Sorry if I didn't explain it clearly enough in my last post.

    Greetings,
    Martin

  • IngoRM
    IngoRM New Altair Community Member
    Hi Martin,

    thanks for the kind words  :D

    Then it is the same process as last year: back then, too, only two teams of arbitrary size were allowed.

    Cheers,
    Ingo