[SOLVED] Remove duplicates selecting which examples must remain

New Altair Community Member

Jan 16, 2014

Updated Nov 5, 2024 by Jocelyn

Hello.

I have a large customer dataset that contains duplicate records. To be more specific: the dataset has over 50 attributes for more than 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.

I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.

How can I do it?

Thanks

RalfKlinkenberg

New Altair Community Member

Jan 16, 2014

Hello Arthur Gouveia,

you can use the RapidMiner operator called Filter Examples.

Cheers,
Ralf

MariusHelf

New Altair Community Member

Jan 17, 2014

Hi Arthur,

in addition to Filter Examples for filtering on is_valid_status, you can use the Remove Duplicates operator. It lets you select which attributes are considered when finding duplicates. You should try the operators on a smaller subset first to get a feel for their settings.
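
For readers who want to see the logic outside RapidMiner, here is a minimal Python sketch of the same two steps: filter on is_valid_status, then drop duplicates on a chosen subset of attributes. The sample rows, the attribute names client_id and contract_id, and the function name filter_then_dedupe are hypothetical; keeping the first row of each duplicate group is an assumption of this sketch, not a statement about how the operator behaves.

```python
# Hypothetical sample data; only is_valid_status is a name taken from the thread.
rows = [
    {"client_id": 1, "contract_id": "A", "is_valid_status": True},
    {"client_id": 1, "contract_id": "A", "is_valid_status": False},
    {"client_id": 1, "contract_id": "A", "is_valid_status": True},
    {"client_id": 2, "contract_id": "B", "is_valid_status": False},
]


def filter_then_dedupe(rows, key_attrs):
    """Keep rows with a valid status, then keep the first row per key combination."""
    seen = set()
    kept = []
    for row in rows:
        # Step 1 (Filter Examples analogue): skip rows with an invalid status.
        if not row["is_valid_status"]:
            continue
        # Step 2 (Remove Duplicates analogue): compare only the selected attributes.
        key = tuple(row[attr] for attr in key_attrs)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


result = filter_then_dedupe(rows, ["client_id", "contract_id"])
print(result)
```

Filtering before deduplicating matters here: if duplicates were removed first, the surviving row of a group could be one with an invalid status.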

Best regards,
Marius

arthurgouveia

New Altair Community Member

Jan 17, 2014

Thank you! It worked almost perfectly. I can't believe I didn't think of the solution you gave me.

But now I have another problem. I found several contracts with valid statuses, and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that, but I can't figure out how.

Is there any way to remove duplicates keeping only the most recent data?

RalfKlinkenberg

New Altair Community Member

Jan 17, 2014

Hello Arthur Gouveia,

this requires several steps:

You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
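
The four steps above can be sketched in plain Python as follows. This is only an illustration of the logic, not RapidMiner code: the sample rows, the attribute name client_id, and the function name keep_most_recent are hypothetical, while date_processing and max_date follow the names used in this post.

```python
from datetime import datetime

# Hypothetical sample data for illustration.
rows = [
    {"client_id": 1, "date_processing": datetime(2014, 1, 10), "contract_status": "active"},
    {"client_id": 1, "date_processing": datetime(2014, 1, 15), "contract_status": "renewed"},
    {"client_id": 2, "date_processing": datetime(2014, 1, 12), "contract_status": "active"},
]


def keep_most_recent(rows, group_attr="client_id", date_attr="date_processing"):
    """Keep, for each group, only the rows carrying the latest date."""
    # Aggregate + Rename analogue: maximum date per group, stored as max_date.
    max_date = {}
    for row in rows:
        group = row[group_attr]
        if group not in max_date or row[date_attr] > max_date[group]:
            max_date[group] = row[date_attr]
    # Join analogue: attach max_date to every original row.
    joined = [dict(row, max_date=max_date[row[group_attr]]) for row in rows]
    # Filter Examples analogue: keep rows where date_processing >= max_date.
    return [row for row in joined if row[date_attr] >= row["max_date"]]


latest = keep_most_recent(rows)
for row in latest:
    print(row["client_id"], row["contract_status"], row["date_processing"])
```

Note that if two rows of the same client share the latest date, both survive the filter, so a final duplicate removal on the client ID may still be needed.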

Cheers,
Ralf

arthurgouveia

New Altair Community Member

Jan 17, 2014

It didn't work. Almost everything went OK, but I couldn't filter on max_date. I get an error saying it cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z

What is surprising is that when I look at the meta data view, the max_date attribute has type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, and Guess Types, but all of them either don't list max_date in the attribute list or don't make Filter Examples work.

arthurgouveia

New Altair Community Member

Jan 17, 2014

I found a solution! I used Generate Attributes to create a date_dif attribute using the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.

It's working! Thanks!
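
For comparison, a minimal Python sketch of the same workaround: compute the difference between max_date and date_processing as its own attribute, then keep the rows where that difference is zero. The sample rows and the attribute name client_id are hypothetical, and the Python date_dif helper only mirrors the idea of the date_diff call; it is not the RapidMiner function.

```python
from datetime import datetime, timedelta

# Hypothetical rows that already carry the joined max_date column.
rows = [
    {"client_id": 1, "date_processing": datetime(2014, 1, 10), "max_date": datetime(2014, 1, 15)},
    {"client_id": 1, "date_processing": datetime(2014, 1, 15), "max_date": datetime(2014, 1, 15)},
    {"client_id": 2, "date_processing": datetime(2014, 1, 12), "max_date": datetime(2014, 1, 12)},
]


def date_dif(max_date, date_processing):
    """Return the elapsed time between the two dates as a timedelta."""
    return max_date - date_processing


# Generate Attributes analogue: add the difference as a new attribute.
with_dif = [dict(row, date_dif=date_dif(row["max_date"], row["date_processing"])) for row in rows]

# Filter Examples analogue: keep rows whose difference is exactly zero.
most_recent = [row for row in with_dif if row["date_dif"] == timedelta(0)]
print([(row["client_id"], row["date_processing"]) for row in most_recent])
```

Comparing a derived numeric difference against zero sidesteps the need to enter a date value in the filter condition, which is where the parsing error appeared.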
