How can I start???

jangabe
jangabe New Altair Community Member
edited November 5 in Community Q&A
Hello. First of all, my English is not very good, so please forgive any mistakes. I hope you can understand me.
My problem is that I want to apply association rules to a supermarket database, but the data has to be cleaned first (data preprocessing) and I don't know how to do this.
To give you an idea, the database has the following fields:

numero(Bill_Number) fecha(Date) codigo(ProductCode) costo(Cost) precio(Price) tarifa_iva(Value-added Tax Rate) ->

precio_sin_iva(Price_without_Value-added tax) valor_iva(Value_Value-added tax) cantidad(Amount) valor_item(Value_item) ->

bodega(StoreRoom) nombre(ProductName) nclasifica1(ClassificationName) clasifica1(ClassificationCode) ->

nclasifica2(ClassificationName2) clasifica2(ClassificationCode2)

A sample row of the data looks like this:

216932 31/08/2008 4756 2531.66 3200.00 16.00 2758.62 441.38 10.000 3200.00 1 MARGARINA PRACTIS X 400 GR LLOREDA DISTRIBUCIONES   70   MARGARINAS       90

Please, could you tell me what I have to apply to clean the data, or how to get it into the correct form, so that I can apply FP-Growth and then generate association rules?
Thank you by the attention.
Answers

  • land
    land New Altair Community Member
    Hi,
    FP-Growth needs a transaction format: for each possible item, it has to be specified whether the item is contained in a transaction or not (true/false). Hence your data set must consist of binominal attributes.
    You could use the Nominal2Binominal operator but, as its name states, it only transforms nominal attributes. Numerical ones must be converted to nominal beforehand, for example with a discretization operator.
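    To make the target layout concrete, here is a minimal Python sketch that runs outside RapidMiner. It turns (bill_number, product_code) pairs into one true/false row per bill. The function names and the sample rows are hypothetical; only the pair (216932, 4756) comes from the data line posted above.

```python
from collections import defaultdict

def to_transactions(rows):
    """Group (bill_number, product_code) pairs into one item set per bill."""
    baskets = defaultdict(set)
    for bill, product in rows:
        baskets[bill].add(product)
    return dict(baskets)

def to_binominal_table(baskets):
    """Return (sorted item list, {bill: [True/False per item]})."""
    items = sorted({p for basket in baskets.values() for p in basket})
    table = {bill: [item in basket for item in items]
             for bill, basket in baskets.items()}
    return items, table

# Hypothetical sample rows, invented for illustration.
rows = [(216932, 4756), (216932, 5430), (216933, 5430),
        (216933, 4756), (216934, 4995)]
baskets = to_transactions(rows)
items, table = to_binominal_table(baskets)

print(items)
for bill in sorted(table):
    print(bill, table[bill])
```

    Each row of the printed table is one transaction and each column is one product, which is the shape FP-Growth expects.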

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi Sebastian,
    I have been trying to build a matrix from my data using only the bill_number and the product_code, with the bill_number as column and the product_code as row. I would like to know the maximum amount of data that RM supports, because my matrix is 56126 (columns) x 1481 (rows). Is it possible to work with this, or do I have to reduce the matrix or change some parameter in the RM configuration so that it accepts this data? I ask because after I ran the tool I got the following log file:


    P Sep 25, 2009 4:23:24 PM: Initialising process setup
    P Sep 25, 2009 4:23:24 PM: [NOTE] No filename given for result file, using stdout for logging results!
    P Sep 25, 2009 4:23:24 PM: Checking properties...
    P Sep 25, 2009 4:23:24 PM: Properties are ok.
    P Sep 25, 2009 4:23:24 PM: Checking process setup...
    P Sep 25, 2009 4:23:24 PM: Inner operators are ok.
    P Sep 25, 2009 4:23:24 PM: Checking i/o classes...
    P Sep 25, 2009 4:23:24 PM: i/o classes are ok. Process output: AssociationRules.
    P Sep 25, 2009 4:23:24 PM: Process ok.
    P Sep 25, 2009 4:23:24 PM: Process initialised
    P Sep 25, 2009 4:23:24 PM: [NOTE] Process starts
    P Sep 25, 2009 4:23:24 PM: Process:
      Root[1] (Process)
      +- CSVExampleSource[1] (CSVExampleSource)
      +- Numerical2Binominal[1] (Numerical2Binominal)
      +- FPGrowth[1] (FPGrowth)
      +- AssociationRuleGenerator[1] (AssociationRuleGenerator)
    P Sep 25, 2009 4:24:47 PM: [NOTE] Numerical2Binominal: Breakpoint reached
    G Sep 25, 2009 4:25:04 PM: [Warning] Cannot plot all data points, using only a sample of 5000 rows. You can increase the number of values in the properties dialog from the tools menu, the property name is 'rapidminer.gui.plotter.rows.maximum'
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'Scatter Matrix': Data table must have between 0 and 50 columns, was 1482.
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'Survey': Data table must have between 0 and 100 columns, was 1482.
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'Andrews Curves': Data table must have between 0 and 1000 columns, was 1482.
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'Quartile Color Matrix': Data table must have between 0 and 100 columns, was 1482.
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'RadViz': Data table must have between 0 and 1000 columns, was 1482.
    G Sep 25, 2009 4:25:04 PM: [NOTE] Cannot use plotter 'Surface 3D': Data table must have between 0 and 50 rows, was 5000.
    P Sep 25, 2009 4:37:48 PM: [NOTE] FPGrowth: Breakpoint reached
    G Sep 25, 2009 4:38:08 PM: [Warning] Cannot plot all data points, using only a sample of 5000 rows. You can increase the number of values in the properties dialog from the tools menu, the property name is 'rapidminer.gui.plotter.rows.maximum'
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'Scatter Matrix': Data table must have between 0 and 50 columns, was 1482.
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'Survey': Data table must have between 0 and 100 columns, was 1482.
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'Andrews Curves': Data table must have between 0 and 1000 columns, was 1482.
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'Quartile Color Matrix': Data table must have between 0 and 100 columns, was 1482.
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'RadViz': Data table must have between 0 and 1000 columns, was 1482.
    G Sep 25, 2009 4:38:08 PM: [NOTE] Cannot use plotter 'Surface 3D': Data table must have between 0 and 50 rows, was 5000.
    P Sep 25, 2009 4:40:48 PM: [NOTE] AssociationRuleGenerator: Breakpoint reached
    P Sep 25, 2009 4:42:10 PM: Process:
      Root[1] (Process)
      +- CSVExampleSource[1] (CSVExampleSource)
      +- Numerical2Binominal[1] (Numerical2Binominal)
      +- FPGrowth[1] (FPGrowth)
      +- AssociationRuleGenerator[1] (AssociationRuleGenerator)
    P Sep 25, 2009 4:42:10 PM: Produced output:
    IOContainer (1 objects):
    Association Rules
    [4995] --> [5430] (confidence: 1.000)
    [4550] --> [5430] (confidence: 1.000)
    [6049] --> [5430] (confidence: 1.000)
    [9523] --> [5430] (confidence: 1.000)
    (created by AssociationRuleGenerator)
    P Sep 25, 2009 4:42:10 PM: [NOTE] Process finished successfully after 18:45


    Could you also help me understand what is happening? Practically no association rules are obtained, and none of the few that are obtained are useful to me.
    Thank you by the attention.
  • steffen
    steffen New Altair Community Member
    Hello jangabe

    Don't worry, RapidMiner can handle this amount of data. The log file says that some of the plotters are configured to handle only a certain number of points. You can change that behaviour in the settings (located in the menu bar). Note that these defaults were chosen to reduce the computation time of the plotters.

    @association-rules: You have to convert the data into transaction format, as Sebastian said. Then the association rules will be more useful. If you do not understand what he means, grab a data mining book in which the procedure is explained. This is recommended anyway, so that you can judge the quality and the behavior of the outcome.

    regards,

    Steffen
  • jangabe
    jangabe New Altair Community Member
    Good afternoon,
    Thanks for answering my question. Regarding the transaction format: I wrote a small program that fills the matrix with 0s and 1s. I also changed 'rapidminer.gui.plotter.rows.maximum' as you suggested, but I don't know why this does not give good results when the amount of data is big.
    Can you tell me whether I am doing something wrong? Maybe it is because I use a code that represents each name instead of the name itself.
    If you want, write to me and I will send you a sample file...

    Thank you by the attention.
  • jangabe
    jangabe New Altair Community Member
    Error in: CSVExampleSource (CSVExampleSource) Could not read file 'C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 NombreProd.csv': Number of columns in line 1 was unexpected, was: 1482, expected: 244. The given file could not be read. Please make sure that the file exists and that the RapidMiner process has sufficient privileges.

    In another attempt I got the error message above. Where can I configure it for 1482 columns??
  • steffen
    steffen New Altair Community Member
    Hello again

    Okay, this is what I understood: you think that the parameter "rapidminer.gui.plotter.rows.maximum" limits the number of rows used in learning (association rules), but that is wrong. This parameter only affects the number of rows used for plotting.

    Regarding:

    Error in: CSVExampleSource (CSVExampleSource) Could not read file 'C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 NombreProd.csv': Number of columns in line 1 was unexpected, was: 1482, expected: 244. The given file could not be read. Please make sure that the file exists and that the RapidMiner process has sufficient privileges.
    This means that your CSV file is somehow messed up. There are many possible reasons for this, but my first idea is that the operator infers the number of columns from the first line, where not all column names are specified. If you could post the process (operator tab, copy and paste the XML code to this forum), that would be helpful.
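    One way to check this outside RapidMiner is to count the fields per line for a few candidate separators. The following is only a sketch in Python: the sample text and the list of separators are assumptions, since the real file's separator is not known from this thread.

```python
import csv
import io
from collections import Counter

def column_counts(text, delimiter):
    """Count how many non-empty lines have each number of fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return Counter(len(row) for row in reader if row)

# Hypothetical miniature file: semicolon-separated, with a header
# that names fewer columns than the data rows contain.
sample = "C_Factura;4;22\n140992;0;0;0\n141191;0;1;0\n"

for name, delim in [("comma", ","), ("semicolon", ";"), ("tab", "\t")]:
    print(name, dict(column_counts(sample, delim)))
```

    A separator that yields exactly one field per line is probably the wrong one, and more than one distinct field count for the right separator points to a header or data row with missing entries. For a real file, the text could be read with `open(path, newline="")` first.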

    regards,

    Steffen
  • jangabe
    jangabe New Altair Community Member
    Hello and thanks again.
    About the parameter: I also changed "rapidminer.gui.attributeeditor.rowlimit", but the result is the same.
    About the other topic: I don't understand what you mean by 'file is somehow messed up'. As I told you, the first row contains the product_code and not the name. The error happened because I thought that replacing the product_code with the real product_name would give the expected results. Anyway, the XML that is generated is the following:


    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 Lleno.csv"/>
            <parameter key="label_name" value="C_Factura"/>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal">
        </operator>
        <operator name="FPGrowth" class="FPGrowth">
            <parameter key="min_support" value="0.4"/>
        </operator>
        <operator name="AssociationRuleGenerator" class="AssociationRuleGenerator">
            <parameter key="min_confidence" value="0.4"/>
        </operator>
    </operator>

    I have tested min_support and min_confidence with other values and the results are the same:

    [4995] --> [5430] (confidence: 1.000)
    [4550] --> [5430] (confidence: 1.000)
    [6049] --> [5430] (confidence: 1.000)
    [9523] --> [5430] (confidence: 1.000)

    I am really surprised, because the amount of data is large to obtain only those rules. Or is that possible?
    How can I know whether that result is the only one I am going to obtain???
    Thank you by the attention.
  • steffen
    steffen New Altair Community Member
    Hello jangabe

    First of all: I have never flamed anyone for his / her skills in English (I am not a native speaker either), but frankly, your sentences are giving me a headache. Please try to form shorter sentences.

    @your problem: I cannot figure out from your last posts what your data looks like NOW. Please post the first rows of the CSV file you are loading ("C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 Lleno.csv"), including the first line.

    Then we will see ... I have a vague idea what could be wrong ...

    regards,

    Steffen
  • jangabe
    jangabe New Altair Community Member
    Hello, and sorry for my bad English...

    Well, the first rows of the CSV file are the following:


    C_Factura 4 22 23 36 37 39 41 46 54 61 64.......

    140992 0 0 0 0 0 0 0 0 0 0 0.......
    141191 0 0 0 0 0 0 0 0 0 0 0.......
    141278 0 0 0 0 0 0 0 0 0 0 0.......
        .              .
        .              .
        .              .
        .              .
        .              .

    I didn't include more because there are 1481 product_codes and 56126 bill_codes, but basically that's the format. The product_codes are sorted in ascending order, and so are the bill_codes.

    Thank you by the attention.

  • steffen
    steffen New Altair Community Member
    No problem

    Now we are getting somewhere ...

    I copied the data into a text file and played with it a little. The CSVExampleSource operator (as you posted it) reads everything into a single attribute column. I suggest trying this setup:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="SimpleExampleSource" class="SimpleExampleSource">
            <parameter key="filename" value="/home/steffen/Desktop/check.txt"/>
            <parameter key="read_attribute_names" value="true"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal">
        </operator>
        <operator name="FPGrowth" class="FPGrowth">
            <parameter key="min_support" value="0.4"/>
        </operator>
        <operator name="AssociationRuleGenerator" class="AssociationRuleGenerator">
            <parameter key="min_confidence" value="0.4"/>
        </operator>
    </operator>

    Another tip: if you double-click on an operator (or right-click -> Breakpoint after) you can set (guess what) a breakpoint. This allows you to see the result of the selected operator. If you then click "resume" (the arrow) in the top task bar, the process continues. This is extremely helpful when debugging a process.

    Oh, and I think that "thank you for the / your attention" is correct ;)

    kind regards,

    Steffen
  • jangabe
    jangabe New Altair Community Member
    Hello,
    I did exactly what you wrote and I obtained the same result...
    The XML code was:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="SimpleExampleSource" class="SimpleExampleSource">
            <parameter key="filename" value="C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 Lleno.csv"/>
            <parameter key="read_attribute_names" value="true"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal" breakpoints="before">
        </operator>
        <operator name="FPGrowth" class="FPGrowth" breakpoints="before">
            <parameter key="min_support" value="0.4"/>
        </operator>
        <operator name="AssociationRuleGenerator" class="AssociationRuleGenerator" breakpoints="before">
            <parameter key="min_confidence" value="0.4"/>
        </operator>
    </operator>


    So, could it be that those are the only association rules in that DB??
    Could you give me your e-mail address, so that I can send you the complete file and you could run some tests??

    Thank you for your attention.
  • land
    land New Altair Community Member
    Hi,
    although I didn't read the whole thread, just a little remark: did you lower the min_support parameter of FP-Growth and the min_confidence parameter of the rule generator? It might be that there are no more rules until the support and confidence get really low...

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi,
    This is the XML code of the last try:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="C:\Users\JANGABE\Desktop\DATOS PRUEBA CLASIFICADOS 1 mes 65000 Lleno.csv"/>
            <parameter key="id_column" value="1"/>
        </operator>
        <operator name="Numerical2Binominal" class="Numerical2Binominal">
        </operator>
        <operator name="FPGrowth" class="FPGrowth">
            <parameter key="min_support" value="0.1"/>
        </operator>
        <operator name="AssociationRuleGenerator" class="AssociationRuleGenerator">
            <parameter key="min_confidence" value="0.1"/>
        </operator>
    </operator>


    As you can see, min_support and min_confidence are at their lowest value, and the results are the same:

    Association Rules
    [4995] --> [5430] (confidence: 1.000)
    [4550] --> [5430] (confidence: 1.000)
    [6049] --> [5430] (confidence: 1.000)
    [9523] --> [5430] (confidence: 1.000)
  • land
    land New Altair Community Member
    Yes,
    I see. Obviously there isn't any lower value between 0 and 1 than 0.1...
  • jangabe
    jangabe New Altair Community Member
    Hi,
    I have downloaded the latest RM version.
    Curiously, I ran the test with this version using the same parameters and I didn't obtain any association rules.
    That's strange, because before the result was:

    Association Rules
    [4995] --> [5430] (confidence: 1.000)
    [4550] --> [5430] (confidence: 1.000)
    [6049] --> [5430] (confidence: 1.000)
    [9523] --> [5430] (confidence: 1.000)

    So, what happened in this case??

    Thank you for your attention.
  • land
    land New Altair Community Member
    Hi,
    sorry, but I'm not a magician. I cannot even guess what's causing your problems, because I don't have your data; I do not even have your process. You didn't even say which RapidMiner version you are using, and since there are two most recent versions, 4.6 and 5.0beta, I can only guess. And last but not least, if you don't follow my advice to lower your support and confidence thresholds (BELOW 0.1! For example, just take 0.0000001 to test whether it works at all), I cannot help you at all.

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi,
    From this link you can download the file that contains my data:
    http://www.4shared.com/file/137659584/22782223/DATOS_PRUEBA_CLASIFICADOS_1_mes_65000_Lleno.html

    The RM version is 4.6. I tried with confidence and support at 0.0000001 and I obtained 32 association rules.

    Something strange happened to me. I always try with a new process and don't obtain any result. Just now I did it again, but this time with a saved process, and amazingly the result came immediately (it always takes around 18-20 minutes) and was positive (between 20 and 32 rules, varying the confidence and the support). But when I made a new try, the result was NO RULES FOUND. What could have happened??

    Thank you for your attention.
  • land
    land New Altair Community Member
    Ok, thank you very much for this. Now I have hope of being able to reproduce your problems. I will check this as soon as I can, but I doubt I will find the time before next week.

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi,
    Thanks for your help. I want to ask you something:
    Why do I have to reduce min_confidence and min_support to obtain more rules??
    Is there some relation between min_support, min_confidence and the amount of data??

    Thank you very much...:)
    Best Regards.
  • land
    land New Altair Community Member
    Hi,
    both thresholds specify in how many examples an item set or a rule has to occur before it is called frequent. So if you have 1000 examples and a support of 0.1, then the item set must be contained in at least 100 examples; otherwise the set is discarded.
    The level of support needed to obtain some rules depends on your data. You will always find rules, but with lower support these rules are worth less, because they are less general, describing only a small number of transactions (or only one).
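    As a small worked example, the usual textbook definitions of support and confidence can be sketched in Python like this. It is not RapidMiner's implementation, and the toy baskets are hypothetical; they merely reuse product codes that appear in this thread.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, premise, conclusion):
    """support(premise and conclusion together) / support(premise)."""
    both = set(premise) | set(conclusion)
    return support(transactions, both) / support(transactions, premise)

# Hypothetical toy baskets, invented for illustration.
transactions = [{4995, 5430}, {5430}, {4550, 5430}, {4756}, {4756, 5430}]

print(support(transactions, {5430}))             # 4 of 5 baskets
print(support(transactions, {4995, 5430}))       # 1 of 5 baskets
print(confidence(transactions, {4995}, {5430}))  # every 4995 basket has 5430
```

    In this toy data the rule [4995] --> [5430] has confidence 1.0 but a support of only 0.2, so a min_support of 0.4 would discard it before any rule is generated.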

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi,
    On the last try I got this error message:
    Process failed
    IllegalArgumentException caught:
    Duplicate attribute name.
    But I don't have any attribute in a row or column with a duplicate name.
    So, what does it mean?? And what about the data that I gave you??
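    A quick way to double-check the header for repeated names, sketched in Python outside RapidMiner. The separator and the sample header are assumptions, and stripping spaces is only a guess at how a reader might normalise names.

```python
from collections import Counter

def duplicate_names(header_line, delimiter=";"):
    """Return header names that occur more than once, ignoring surrounding spaces."""
    names = [name.strip() for name in header_line.split(delimiter)]
    return sorted(name for name, n in Counter(names).items() if n > 1)

# Hypothetical header; the real file's separator and names may differ.
header = "C_Factura;4;22;23;22 ;36"
print(duplicate_names(header))
```

    An empty list would mean that no name repeats under these assumptions.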

    Thank you for your attention.
    Regards. :)
  • land
    land New Altair Community Member
    Hi,
    your problem is on my to-do list and I will take a look at it as soon as I can. If you want a guaranteed response time, you will have to check the enterprise solutions at rapid-i.com. Problems from the community will be solved as soon as one of us has the time. I think that's fair, since you didn't pay anything for this fine piece of software...

    Greetings,
      Sebastian
  • jangabe
    jangabe New Altair Community Member
    Hi,
    I got another error message:

    This process would need more than the maximum amount of available memory.
    You can either leave the process as it is and use a computer with more memory, reduce the amount of data by one of the sampling operators, optimize the process by using other learning or preprocessing schemes, or directly work on database systems, e.g. by using the cached database example surce operators.

    I'm doing this on a laptop with 3 GB of RAM (dual core 2.0 and 250 GB hard disk). Isn't that enough??
    Sampling operators??
    Which other learning or preprocessing schemes can I use??
    How does working with the cached database example source operators work??

    I'm so confused and I don't know what to do...

    Thanks a lot. :)
  • land
    land New Altair Community Member
    Hi,
    if you have a large data set and lower the support during frequent item set generation with FPGrowth, the memory consumption of the FPTree used will explode. Unfortunately this isn't avoidable with FPGrowth. If you are doing something other than building frequent item sets, you should let me know.
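    To illustrate why lower thresholds cost more, the following Python sketch counts how many item sets pass the threshold as min_support drops. It is a brute-force enumeration on hypothetical toy data, not FP-Growth and not a measurement of FPTree memory; it only shows how the number of frequent item sets grows.

```python
from itertools import combinations

def count_frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force count of item sets (up to max_size items) meeting min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    count = 0
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            hits = sum(1 for t in transactions if set(candidate) <= t)
            if hits / n >= min_support:
                count += 1
    return count

# Hypothetical toy baskets, invented for illustration.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3, 4}, {2, 4}, {1, 2, 3, 4}]

for threshold in (0.6, 0.4, 0.2):
    print(threshold, count_frequent_itemsets(transactions, threshold))
```

    On these six baskets the count goes from 3 to 7 to 12 as the threshold falls from 0.6 to 0.2. With 1481 products the number of candidate combinations is vastly larger, so a very low threshold can plausibly exhaust memory.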


    Greetings,
      Sebastian