ignore '?' value
olandesino
New Altair Community Member
Hi,
I would like to ask whether there is a preprocessing filter that ignores individual fields (not the entire attribute) containing a particular value such as '?'.
The idea is that RM accepts the file as input (with no empty fields), but I can still ignore this placeholder value during my process.
for example:
3 45 sss
? ? qqq
? ? rrr
In this case I want the tool to use all 3 attributes, but without the '?' values (otherwise they will compromise my data!).
Thank you in advance for your feedback!
A.
Answers
-
Hi,
I must admit that I did not get the point. Could you provide an example process where this would make sense, and describe in detail how the values should be ignored without being ignored? I think I missed something crucial here.
Cheers,
Ingo0 -
I know... the problem is that I received a horrible data set, and fixing it for RM is quite hard!
As you can see from the file, I have 3 attributes. The attributes "id" and "verdict" appear only every n rows (at the start of a sequence of variable length n), while the minimal_event attribute appears in every row.
1. If I replace the blank fields (which are not allowed in RM) with '?', can I then tell RM to ignore this sign? Or will it compromise my data?
2. After this, I know how to extract sequences from this file, but how can I associate the verdict value with the related sequence?
example of desired structure:
<Sequence1>
id
minimal_event1
minimal_event2
|
|
minimal_eventn
verdict (related)
<end sequence1>
...and so on.
I know that RM doesn't like sequences of different lengths.
Is there a smart solution for this?
I hope that my 2 problems are clearer now.
Thanks
A.
[attachment deleted by admin]0 -
Hi again,
this format is, well, a real pain. Any chance that the data source is able to deliver a slightly improved data set? For example, it would be much easier if the events of a sequence were not divided by new lines. Alternatively (additionally?), it would also be much easier if the columns were not separated by plain whitespace. As you can easily see, this will always lead to problems (how to identify the columns?). A semicolon, for example, would be much easier.
Having said this, I currently see only a single option (at least only a single easy one): write your own example source operator (it should not be too difficult).
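The reshaping suggested above (one line per sequence, with an explicit separator) could also be done by a small script before the file reaches RapidMiner. The following is only a minimal sketch in Python, not part of RapidMiner. It assumes that every row has three whitespace-separated fields in the order id, verdict, minimal_event, and that '?' fills the id and verdict fields on continuation rows, as in the example from the first post. The function names and the last input row are made up for illustration.

```python
def group_sequences(lines, missing="?"):
    """Collapse rows of (id, verdict, event) into one record per sequence.

    Assumption: a row whose id field is not the missing marker starts a
    new sequence; rows with a missing id continue the current sequence.
    """
    sequences = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        seq_id, verdict, event = fields
        if seq_id != missing:
            sequences.append({"id": seq_id, "verdict": verdict, "events": [event]})
        elif sequences:
            sequences[-1]["events"].append(event)
    return sequences


def to_delimited(sequences):
    """Render one line per sequence: 'id; verdict; e1, e2, ...'."""
    lines = ["ID; LABEL; VALUES"]
    for seq in sequences:
        lines.append(f'{seq["id"]}; {seq["verdict"]}; {", ".join(seq["events"])}')
    return lines


# First three rows are from the first post; the last row is invented
# to show where a second sequence would start.
rows = ["3 45 sss", "? ? qqq", "? ? rrr", "4 46 ttt"]
output = to_delimited(group_sequences(rows))
print("\n".join(output))
```

With this layout the verdict stays attached to its sequence, and the sequences may have different lengths because all events of one sequence share a single field.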
Cheers,
Ingo0 -
"if the sequence events would not have been divided by new lines"
Well, I can put all the operations related to one sequence on one line. But
I don't know how RM would interpret that kind of input format, since it treats
every column as an attribute... Could you explain this in more detail?
With a script I converted it into ARFF format (every row is a sequence), but the problem remains: "different length of sequences" is something the tool cannot handle.
Thanks for your time.
A.0 -
Hi again,
you could use the Split operator for this purpose like in
<operator name="Root" class="Process" expanded="yes">
    <operator name="SimpleExampleSource" class="SimpleExampleSource">
        <parameter key="filename" value="sequence_data.txt"/>
        <parameter key="read_attribute_names" value="true"/>
        <parameter key="column_separators" value=";\s*"/>
    </operator>
    <operator name="Split" class="Split">
        <parameter key="attributes" value="VALUES"/>
        <parameter key="split_pattern" value=",\s*"/>
    </operator>
</operator>
The attached file has the format:
ID; LABEL; VALUES
id1; label1; v1, v2, v3
id2; label2; v5
id3; label3; v2, v3, v5, v6
As you can see, the value sequences do not have the same length. The resulting example set will have the format:
ID LABEL VALUE1 VALUE2 VALUE3 VALUE4
id1 label1 v1 v2 v3 ?
id2 label2 v5 ? ? ?
id3 label3 v2 v3 v5 v6
This should be pretty much what you are looking for.
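For readers who want to check the intended result of such an ordered split outside RapidMiner, here is a minimal sketch in Python. It is not the Split operator's implementation, only an illustration of the expected behaviour: each VALUES field is split on commas, and shorter sequences are padded with '?' up to the longest one. The function name `ordered_split` is hypothetical.

```python
def ordered_split(rows, missing="?"):
    """Split the VALUES field of each row and pad to a common width.

    rows: iterable of (id, label, values) tuples, where values is a
    comma-separated string such as "v1, v2, v3".
    """
    split_rows = [
        (row_id, label, [v.strip() for v in values.split(",")])
        for row_id, label, values in rows
    ]
    # The widest sequence determines how many VALUE columns are needed.
    width = max((len(vals) for _, _, vals in split_rows), default=0)
    return [
        [row_id, label] + vals + [missing] * (width - len(vals))
        for row_id, label, vals in split_rows
    ]


rows = [
    ("id1", "label1", "v1, v2, v3"),
    ("id2", "label2", "v5"),
    ("id3", "label3", "v2, v3, v5, v6"),
]
table = ordered_split(rows)
for row in table:
    print(" ".join(row))
```

Padding with the missing marker, rather than with values from another row, is what keeps the shorter sequences unaltered.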
Hope that helps,
Ingo
P.S.: Please consider voting at KDnuggets. Read more at http://rapid-i.com/rapidforum/index.php/topic,884.msg3302.html
[attachment deleted by admin]0 -
I ran the same example with the same data, and this is what I get on my screen:
ID LABEL VALUE1 VALUE2 VALUE3 VALUE4
id1 label1 v1 v2 v3 v6
id2 label2 v5 v2 v3 v6
id3 label3 v2 v3 v5 v6
So it alters my data just to make the sequences "always the same length".
Thanks anyway for your help, I know that somewhere there is a solution.
Regards,
A.
0 -
Hi,
sorry, I had not noticed that there was a bug in the ordered split mode of the Split operator, which leads to the wrong values. This bug has been fixed in the latest developer branch. Since we are currently moving our CVS servers to Subversion, access is not as easy as usual, however. But of course the fix will also be included in the next update of the Enterprise Edition and later in the next community release.
Cheers,
Ingo0