ignore '?' value

olandesino · May 2009

Hi,

I would like to ask whether there is a preprocessing filter that ignores individual fields (not the entire attribute) holding a particular value such as '?'.
That way RM would accept the file as input (without missing values), but I could still ignore this fake value during my process.

For example:

3 45 sss
? ? qqq
? ? rrr

In this case I want the tool to take all 3 attributes, but without the '?' values (they would compromise my data!).

Thank you in advance for your feedback!

A.

IngoRM · May 2009

Hi,

I must admit that I did not get the point. Could you provide an example process where this would make sense, and describe in detail how the values should be ignored without actually being ignored? I think I missed something crucial here.

Cheers,
Ingo

olandesino · May 2009

I know... the problem is that I received a horrible data set, and fixing it for RM is quite hard!
From the file you can see that I have 3 attributes. The attributes "id" and "verdict" appear only every n rows (at the start of a sequence of length n, where n varies), while the minimal_event attribute appears in every row.

1. If I replace the blank fields (which are not allowed in RM) with '?', can I then tell RM to ignore this sign? Or will it compromise my data?

2. After this, I know how to extract sequences from this file, but how can I associate the verdict value with the related sequence?

Example of the desired structure:

<Sequence1>
id
minimal_event1
minimal_event2
|
|
minimal_eventn
verdict (related)
<end sequence1>

...and so on.
I know that RM doesn't like sequences of different lengths.

Is there a smart solution for this?
I hope that my 2 problems are clearer now.

Thanks
A.

[attachment deleted by admin]

IngoRM · May 2009

Hi again,

this format is, well, a real pain. Any chance that the data source is able to deliver a slightly improved data set? For example, it would be much easier if the sequence events were not divided by new lines. Alternatively (additionally?), it would also be much easier if the columns were not separated by plain whitespace. As you can easily see, this will always lead to problems (how to identify the columns?). A semicolon, for example, would be much easier.

Having said that, I currently see only a single option (at least only a single easy one): write your own example source operator (it should not be too difficult).

Cheers,
Ingo

olandesino · May 2009

"if the sequence events would not have been divided by new lines"
Well, I can put all the operations related to one sequence on one line. But
I don't know how RM would interpret that kind of input format, since it treats
every column as an attribute... Could you explain this in more detail?
With a script I converted it into ARFF format (every row is a sequence), but the problem remains: "different lengths of sequences" is something that the tool cannot handle.
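
For anyone wanting to do the same kind of conversion, here is a minimal sketch in Python. It is only an illustration under assumptions: the rows are taken to be whitespace-separated in the order id, verdict, minimal_event, with '?' already filled in for blank fields (as in the `3 45 sss` example from the first post), and the output writes one sequence per line with semicolons between columns and commas between events. The function names and the row `4 12 ttt` are made up for the example.

```python
# Hypothetical helpers, not part of RapidMiner.
# Assumed input: rows of "id verdict minimal_event", where rows that
# continue a sequence carry '?' in the id and verdict columns.

def rows_to_sequences(lines, missing="?"):
    """Collect one (id, verdict, [events]) tuple per sequence."""
    sequences = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        seq_id, verdict, event = fields
        if seq_id != missing:
            # a real id marks the start of a new sequence
            sequences.append((seq_id, verdict, [event]))
        elif sequences:
            # '?' id: the event belongs to the sequence currently open
            sequences[-1][2].append(event)
    return sequences


def format_sequences(sequences):
    """Render one line per sequence: id; verdict; e1, e2, ..."""
    return [
        f"{seq_id}; {verdict}; {', '.join(events)}"
        for seq_id, verdict, events in sequences
    ]


# The last row is invented to show where a second sequence would start.
rows = ["3 45 sss", "? ? qqq", "? ? rrr", "4 12 ttt"]
for line in format_sequences(rows_to_sequences(rows)):
    print(line)
```

This keeps the verdict attached to its sequence, but the sequences still have different lengths, so it does not solve that part of the problem by itself.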
Thanks for your time.
A.

IngoRM · May 2009

Hi again,

you could use the Split operator for this purpose, as in the following process:


<operator name="Root" class="Process" expanded="yes">
    <operator name="SimpleExampleSource" class="SimpleExampleSource">
        <parameter key="filename"	value="sequence_data.txt"/>
        <parameter key="read_attribute_names"	value="true"/>
        <parameter key="column_separators"	value=";\s*"/>
    </operator>
    <operator name="Split" class="Split">
        <parameter key="attributes"	value="VALUES"/>
        <parameter key="split_pattern"	value=",\s*"/>
    </operator>
</operator>

The attached file has the format:


ID;  LABEL;  VALUES
id1; label1; v1, v2, v3
id2; label2; v5
id3; label3; v2, v3, v5, v6

As you can see, the value sequences do not have the same length. The resulting example set will have the format:


ID   LABEL   VALUE1  VALUE2  VALUE3  VALUE4
id1  label1  v1      v2      v3      ?
id2  label2  v5      ?       ?       ?
id3  label3  v2      v3      v5      v6

This should be pretty much what you are looking for.
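
To make the intended result concrete, here is a minimal sketch in plain Python of an ordered split that pads shorter sequences with '?'. It is only an illustration of the expected output, not RapidMiner's implementation; the function name `split_and_pad` is made up, and the pattern is the same `,\s*` used for `split_pattern` above.

```python
import re


def split_and_pad(value_strings, split_pattern=r",\s*", missing="?"):
    """Split each string on the pattern, keep the order of the parts,
    and pad shorter rows with the missing marker up to the longest row."""
    parts = [re.split(split_pattern, value.strip()) for value in value_strings]
    width = max((len(p) for p in parts), default=0)
    return [p + [missing] * (width - len(p)) for p in parts]


# The VALUES column from the example file above.
values = ["v1, v2, v3", "v5", "v2, v3, v5, v6"]
for row in split_and_pad(values):
    print("\t".join(row))
```

Each row keeps its own values in their original positions, and only the trailing gaps are filled with '?'.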

Hope that helps,
Ingo

P.S.: Please consider voting at KDnuggets. Read more at http://rapid-i.com/rapidforum/index.php/topic,884.msg3302.html

[attachment deleted by admin]

olandesino · May 2009

ID   LABEL   VALUE1  VALUE2  VALUE3  VALUE4
id1  label1  v1      v2      v3      v6
id2  label2  v5      v2      v3      v6
id3  label3  v2      v3      v5      v6

I ran the same example with the same data, and this is what I get on my screen.
It manipulates my data so that the sequences "always have the same length".

Thanks anyway for your help. I know that there is a solution somewhere.

Regards,
A.

IngoRM · May 2009

Hi,

sorry, I did not notice that there was a bug in the Split operator for the ordered split mode which leads to the wrong values. This bug has been fixed in the latest developer branch. Since we are currently moving our CVS servers to Subversion, access is not as easy as usual, however. But of course this bug will also be fixed in the next update of the Enterprise Edition and later in the next community release.

Cheers,
Ingo
