Removing duplicates does nothing
SquirrelX
New Altair Community Member
Hi,
I've just carried out the following experiment: I took an Excel table with a few thousand records and a few hundred attributes, pressed the 'Remove Duplicates' option, and ended up with 500 fewer rows. I don't want to use Excel for this though, so I tried the same in RapidMiner: I saved the worksheet as CSV and loaded it into RapidMiner. I checked manually that there are indeed a number of duplicate rows. Then I used the Remove Duplicates operator in RapidMiner, and no rows were removed.
I was thinking about the cause and I think it's because the dataset contains missing data at various places (for various examples, and attributes of various types).
Is there any way to remove duplicates by considering the missing values as 'equally missing'? Or is it a bug somewhere? I couldn't figure out the solution so far.
Thanks in advance.
Answers
Hi,
I have also run into this problem: missing numerical values are never counted as equal. It works with missing nominal values, but not with numerical ones. I have posted a fix for this (and also a faster implementation of this operator) to the bug tracker:
http://bugs.rapid-i.com/show_bug.cgi?id=438
Hopefully it will get into the next bugfix release of RapidMiner.
Best, Zoltan
Hi all,
the fix will be included in the upcoming RapidMiner version, with the slight change that there will be a switch to turn the equality of unknown values on and off. Otherwise the behavior would not be consistent with older process versions.
Nevertheless, you could use a trick to work around this until 5.1 is released:
- Replace missings by a non existing value
- Remove Duplicates
- Declare the value used above as Missing.
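To illustrate why the three steps above help, here is a minimal sketch in plain Python rather than RapidMiner. It assumes that missing numerical values behave like NaN, which never compares equal to itself; the `SENTINEL` value and all function names are made up for this example.

```python
import math

# Hypothetical placeholder; it must not occur anywhere in the real data.
SENTINEL = -999999.0


def is_missing(value):
    """Treat a float NaN as a missing value."""
    return isinstance(value, float) and math.isnan(value)


def rows_equal(a, b):
    """Compare two rows value by value with ==, so NaN never matches NaN."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def dedupe(rows):
    """Keep the first occurrence of each row."""
    kept = []
    for row in rows:
        if not any(rows_equal(row, other) for other in kept):
            kept.append(row)
    return kept


def dedupe_with_sentinel(rows, sentinel=SENTINEL):
    """Replace missings, remove duplicates, then declare the sentinel missing again."""
    filled = [tuple(sentinel if is_missing(v) else v for v in row) for row in rows]
    unique = dedupe(filled)
    return [tuple(float("nan") if v == sentinel else v for v in row) for row in unique]


rows = [
    (1.0, float("nan"), "a"),
    (1.0, float("nan"), "a"),  # same as the first row, including the missing value
    (2.0, 3.0, "b"),
]

naive = dedupe(rows)                # the two rows with a missing value both survive
cleaned = dedupe_with_sentinel(rows)  # they collapse into one row
```

The sentinel only works if it really is absent from the data, so it is worth checking for it before replacing anything.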
Greetings,
Sebastian
Thanks both of you.
I was thinking about solving it this way: first read the file without setting the value types (as most of my missing values are marked as NULL in the original Excel file), remove the duplicates, save the result, and then load it again, this time setting all the attribute types as needed.
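A rough sketch of that idea in plain Python (not RapidMiner), assuming the file is a CSV in which missing cells are written as the text NULL. The sample data, column names, and helper functions are invented for illustration.

```python
import csv
import io

# Invented sample data; missing cells are marked with the text NULL.
RAW = """id,score,seen
1,NULL,2010-05-01
1,NULL,2010-05-01
2,3.5,NULL
"""


def read_unique_rows(text):
    """Read every cell as a plain string and keep the first copy of each row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    seen = set()
    unique = []
    for row in reader:
        key = tuple(row)
        if key not in seen:  # "NULL" == "NULL" as strings, so such rows match
            seen.add(key)
            unique.append(row)
    return header, unique


def to_number(cell, missing_marker="NULL"):
    """Convert a cell to float only after deduplication; None stands for missing."""
    return None if cell == missing_marker else float(cell)


header, unique = read_unique_rows(RAW)
scores = [to_number(row[1]) for row in unique]
```

Because the comparison happens on the raw text, two rows that differ only in formatting (for example 3.5 and 3.50) would not be treated as duplicates.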
Anyway, I'm looking forward to the next release, particularly because of this: when I load an Excel file and use the wizard to mark the first row as names, the names are shown in the preview window (which sometimes seems to freeze, though I can still press the Finish button), but I end up with attribute_0, attribute_1, and so on.
Best,
SX
Sebastian Land wrote:
- Replace missings by a non existing value
I'm trying this for a date type attribute (selected as single), but whichever option I use, nothing changes in the dataset, so I still have those missings. Any simple workaround for this?
Hi,
you mean that nothing happens when you apply Replace Missing Values to the date attribute? Could it be that the attribute is special but you didn't check "include specials"?
Well, unless you send me a small and executable sample process, I can't say much about this.
Greetings,
Sebastian