Removing duplicates does nothing

User: "SquirrelX"
New Altair Community Member
Updated by Jocelyn
Hi,

I've just carried out the following experiment. I took an Excel table with a few thousand records and a few hundred attributes, used the 'Remove Duplicates' option, and ended up with 500 fewer rows. I don't want to use Excel for this, though, so I tried the same in RapidMiner: I saved the worksheet as CSV and loaded it into RapidMiner. I checked manually that the data does indeed contain a number of duplicate rows. Then I applied Remove Duplicates in RapidMiner, and no rows were removed.
I've been thinking about the cause, and I suspect it's because the dataset contains missing values in various places (in various examples, and in attributes of various types).

Is there any way to remove duplicates while treating missing values as 'equally missing'? Or is this a bug somewhere? I haven't been able to figure out a solution so far.
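
To illustrate the suspicion in a tool-independent way, here is a minimal Python sketch. It is not RapidMiner code and makes no claim about how RapidMiner works internally. It only assumes that missing cells are represented as NaN (or None), and the helper names (`is_missing`, `row_key`, `remove_duplicates`) are made up for this example. It shows how two otherwise identical rows can fail a naive equality check because NaN never equals NaN, and how mapping every missing cell to one shared sentinel makes them count as 'equally missing'.

```python
import math

# One shared sentinel object that stands for "missing" in comparison keys.
_MISSING = object()

def is_missing(value):
    """Return True for None or a float NaN (assumed encodings of a missing cell)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def row_key(row):
    """Build a hashable key in which all missing cells compare as equal."""
    return tuple(_MISSING if is_missing(v) else v for v in row)

def remove_duplicates(rows):
    """Keep the first occurrence of each row, treating missing cells as equal."""
    seen = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Two rows that look identical, each with its own missing (NaN) cell.
rows = [
    (1, "a", float("nan")),
    (1, "a", float("nan")),
    (2, "b", 3.0),
]

# Naive comparison: the NaN cells are never equal, so nothing is dropped.
naive = [r for i, r in enumerate(rows) if r not in rows[:i]]

# Sentinel-based comparison: the two NaN rows collapse into one.
deduplicated = remove_duplicates(rows)
```

In this sketch `naive` keeps all three rows, while `deduplicated` keeps two. Whether the same effect explains the behaviour seen in RapidMiner is only an assumption here.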

Thanks in advance.
