ignore '?' value

olandesino
edited November 5 in Community Q&A
Hi,

I would like to ask whether there is a preprocessing filter that ignores individual fields (not the entire attribute) containing a particular value such as '?'.
That way RM would accept the file as input (without missing values), but I could still ignore this placeholder value during my process.

for example:

3 45  sss
? ?    qqq
? ?    rrr


In this case I want the tool to keep all 3 attributes but without the '?' values (otherwise they will compromise my data!).

Thank you in advance for your feedback!

A.

Answers

  • IngoRM
    Hi,

    I must admit that I did not get the point. Could you provide an example process where this would make any sense and describe in detail how the values should be ignored - without being ignored? I think I missed something crucial here  ;)

    Cheers,
    Ingo
  • olandesino
    I know... the problem is that I received a horrible data set, and fixing it for RM is quite hard!
    From the file you can see that I have 3 attributes. The attributes "id" and "verdict" appear only every n rows (at the start of a sequence of length n, where n varies), while the minimal_event attribute appears in every row.

    1. If I replace the blank fields (which are not allowed in RM) with '?', can I then tell RM to ignore this sign? Otherwise, will it compromise my data?

    2. After this, I know how to extract sequences from this file, but how can I associate the verdict value with the related sequence?

    example of desired structure:

    <Sequence1>
    id
    minimal_event1
    minimal_event2
    |
    |
    minimal_eventn
    verdict (related)
    <end sequence1>

    ...and so on.
    I know that RM doesn't like sequences of different lengths :(
    Is there a smart solution for this?
    I hope that my 2 problems are clearer now.

    Thanks
    A.

    [attachment deleted by admin]
  • IngoRM
    Hi again,

    this format is, well, a real pain. Is there any chance that the data source can deliver a slightly improved data set? For example, it would be much easier if the sequence events would not have been divided by new lines. Alternatively (additionally?), it would also be much easier if you did not simply have whitespace as the separator between the columns. As you can easily see, this will always lead to problems (how should the columns be identified?). A semicolon, for example, would be much easier.

    Having said this, I currently see only a single option (at least only a single easy one): write your own example source operator (it should not be too difficult).

    Cheers,
    Ingo
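
A preprocessing script outside RM is another way to get the one-line-per-sequence, semicolon-separated layout suggested above. The following is only a minimal sketch under stated assumptions: each row has exactly three whitespace-separated fields (id, verdict, minimal_event), blank id and verdict fields have already been replaced by '?' as in the question's example, and a row with a real id starts a new sequence. The function name `rows_to_sequences` and the fourth sample row are made up for illustration.

```python
from typing import Iterable, List

PLACEHOLDER = "?"  # assumed marker for "no value in this field"


def rows_to_sequences(lines: Iterable[str]) -> List[str]:
    """Group event rows into one 'id; verdict; e1, e2, ...' line per sequence."""
    sequences = []  # each entry: [id, verdict, [events...]]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        seq_id, verdict, event = fields
        if seq_id != PLACEHOLDER:
            # A real id marks the first row of a new sequence.
            sequences.append([seq_id, verdict, [event]])
        elif sequences:
            # Placeholder id: the event belongs to the current sequence.
            sequences[-1][2].append(event)
        else:
            raise ValueError("file starts with a row that has no id")
    header = "ID; LABEL; VALUES"
    body = [f"{i}; {v}; {', '.join(ev)}" for i, v, ev in sequences]
    return [header] + body


# First three rows are the example from the question; the last row is invented.
raw = [
    "3 45 sss",
    "? ?  qqq",
    "? ?  rrr",
    "4 46 ttt",
]
converted = rows_to_sequences(raw)
print("\n".join(converted))
```

Because the verdict is stored together with the id on the first row of each sequence, it stays attached to that sequence's events after the conversion.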
  • olandesino
    "if the sequence events would not have been divided by new lines"
    Well, I can put all the operations related to one sequence on one line. But
    I don't know how RM would interpret that kind of input format, since it treats
    every column as an attribute... Could you explain this in more detail?
    With a script I converted it into ARFF format (every row is a sequence), but the problem remains: "different lengths of sequences" is something that the tool cannot handle.
    Thanks for your time.
    A.
  • IngoRM
    Hi again,

    you could use the Split operator for this purpose, as in the following process:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="SimpleExampleSource" class="SimpleExampleSource">
            <parameter key="filename" value="sequence_data.txt"/>
            <parameter key="read_attribute_names" value="true"/>
            <parameter key="column_separators" value=";\s*"/>
        </operator>
        <operator name="Split" class="Split">
            <parameter key="attributes" value="VALUES"/>
            <parameter key="split_pattern" value=",\s*"/>
        </operator>
    </operator>

    The attached file has the format:

    ID;  LABEL;  VALUES
    id1; label1; v1, v2, v3
    id2; label2; v5
    id3; label3; v2, v3, v5, v6
    As you can see, the value sequences do not have the same length. The resulting example set will have the format:

    ID   LABEL   VALUE1  VALUE2  VALUE3  VALUE4
    id1  label1  v1      v2      v3      ?
    id2  label2  v5      ?       ?       ?
    id3  label3  v2      v3      v5      v6
    This should be pretty much what you are looking for.

    Hope that helps,
    Ingo
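
For readers who want to see the intended result without running the process, here is a rough Python sketch of the same split-and-pad idea. It is not how the Split operator is implemented; it only reuses the two separator patterns from the process above (`;\s*` and `,\s*`) and pads shorter sequences on the right with '?'. The names `split_and_pad` and `sequence_of` are hypothetical.

```python
import re
from typing import List

MISSING = "?"  # how a missing value is displayed in the tables of this thread


def split_and_pad(lines: List[str]) -> List[List[str]]:
    """Split 'ID; LABEL; v1, v2, ...' lines and pad short sequences with MISSING."""
    _header, *data = lines  # the first line holds the attribute names
    parsed = []
    for line in data:
        seq_id, label, values = re.split(r";\s*", line.strip(), maxsplit=2)
        parsed.append((seq_id, label, re.split(r",\s*", values)))
    width = max((len(vals) for _, _, vals in parsed), default=0)
    table = [["ID", "LABEL"] + [f"VALUE{i}" for i in range(1, width + 1)]]
    for seq_id, label, vals in parsed:
        # Pad on the right only; existing values are never moved or copied.
        table.append([seq_id, label] + vals + [MISSING] * (width - len(vals)))
    return table


def sequence_of(row: List[str]) -> List[str]:
    """Recover the variable-length sequence by ignoring the MISSING cells."""
    return [value for value in row[2:] if value != MISSING]


lines = [
    "ID;  LABEL;  VALUES",
    "id1; label1; v1, v2, v3",
    "id2; label2; v5",
    "id3; label3; v2, v3, v5, v6",
]
table = split_and_pad(lines)
for row in table:
    print("  ".join(row))
print(sequence_of(table[2]))
```

The '?' cells only mark "no value here"; dropping them again, as `sequence_of` does, gives back the original sequence for each row.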



    [attachment deleted by admin]
  • olandesino
    ID   LABEL   VALUE1  VALUE2  VALUE3  VALUE4
    id1  label1  v1      v2      v3      v6
    id2  label2  v5      v2      v3      v6
    id3  label3  v2      v3      v5      v6
    I ran the same example with the same data, and this is what I get on my screen.
    It manipulates my data in order to "always have the same length" :(
    Thanks anyway for your help; I know that somewhere there is a solution.

    Regards,
    A.
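
The difference between the expected and the observed table can be made explicit with a small check. This is only a sketch: `unexpected_values` is a hypothetical helper, the `sequences` are the value lists from the example input file, and the `observed` rows are copied from the table in this post.

```python
from typing import List, Tuple

MISSING = "?"


def unexpected_values(
    sequences: List[List[str]], value_rows: List[List[str]]
) -> List[Tuple[int, int, str, str]]:
    """Return (row, column, expected, observed) for every cell that differs."""
    problems = []
    for r, (seq, row) in enumerate(zip(sequences, value_rows)):
        for c, cell in enumerate(row):
            # Beyond the end of the source sequence the cell should be missing.
            expected = seq[c] if c < len(seq) else MISSING
            if cell != expected:
                problems.append((r, c, expected, cell))
    return problems


sequences = [["v1", "v2", "v3"], ["v5"], ["v2", "v3", "v5", "v6"]]
observed = [
    ["v1", "v2", "v3", "v6"],
    ["v5", "v2", "v3", "v6"],
    ["v2", "v3", "v5", "v6"],
]
problems = unexpected_values(sequences, observed)
for row, col, expected, got in problems:
    print(f"row {row}, VALUE{col + 1}: expected {expected!r}, got {got!r}")
```

Every reported cell lies past the end of its source sequence, which matches the complaint that short sequences were filled up with values from other rows instead of '?'.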

  • IngoRM
    Hi,

    sorry, but I did not notice that there was a bug in the Split operator for the ordered split mode, which leads to the wrong values. This bug has been fixed in the latest developer branch. Since we are currently moving our CVS servers to Subversion, access is, however, not as easy as usual. But of course this bug will also be fixed in the next update of the Enterprise Edition and later also in the next community release.

    Cheers,
    Ingo