[SOLVED] Remove duplicates selecting which examples must remain
arthurgouveia
New Altair Community Member
Hello.
I have a large customer dataset that contains duplicate entries. Let me try to make myself clear: the dataset has over 50 attributes and over 200k contracts. One of these attributes is contract_status. Some of these statuses are valid and some are invalid. I've created a boolean attribute named is_valid_status.
I'd like to remove duplicates based on a subset of attributes and keep only the examples where is_valid_status is true.
How can I do it?
Thanks
Answers
-
Hello Arthur Gouveia,
you can use the RapidMiner operator called Filter Examples.
Cheers,
Ralf
-
Hi Arthur,
in addition to Filter Examples for filtering on is_valid_status, you can use the Remove Duplicates operator. It lets you select which attributes are considered when looking for duplicates. You should try the operators on a smaller subset first to get a feeling for their settings.
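For anyone who wants to see the logic outside RapidMiner, here is a minimal Python sketch of the same two steps: filter on is_valid_status, then remove duplicates on a subset of attributes. The key attributes customer_id and product and the sample rows are made up for illustration, and keeping the first occurrence of each duplicate is an assumption of this sketch, not a statement about the operator's behavior.

```python
# Illustrative records; customer_id and product are hypothetical key attributes.
contracts = [
    {"customer_id": 1, "product": "A", "contract_status": "active", "is_valid_status": True},
    {"customer_id": 1, "product": "A", "contract_status": "cancelled", "is_valid_status": False},
    {"customer_id": 1, "product": "A", "contract_status": "active", "is_valid_status": True},
    {"customer_id": 2, "product": "B", "contract_status": "expired", "is_valid_status": False},
]


def filter_then_dedupe(rows, key_attributes):
    """Keep valid rows only, then keep the first row per key combination."""
    seen = set()
    result = []
    for row in rows:
        if not row["is_valid_status"]:  # step 1: same idea as Filter Examples
            continue
        key = tuple(row[name] for name in key_attributes)
        if key in seen:  # step 2: same idea as Remove Duplicates
            continue
        seen.add(key)
        result.append(row)
    return result


deduped = filter_then_dedupe(contracts, ["customer_id", "product"])
```

Only the attributes listed in key_attributes decide what counts as a duplicate, so two rows that differ elsewhere still collapse into one.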
Best regards,
Marius
-
Thank you! It worked almost perfectly. I can't believe I didn't think about the solution you guys gave me.... ::)
But now I have another problem. I found several contracts with valid statuses, and I'd like to remove duplicates but keep only the most recent status. I have an attribute named date_processing that I can use to achieve that, but I can't figure out how.
Is there any way to remove duplicates while keeping only the most recent data?
-
Hello Arthur Gouveia,
this requires several steps:
- You can use the RapidMiner operator Aggregate to determine the maximum of date_processing for each client (group by client ID).
- The operator Rename can be used to rename the new attribute max(date_processing) to max_date.
- With Join you can add the max_date column to the original data table (select the client ID as ID for both tables).
- With Filter Examples you keep only the data lines where date_processing >= max_date and you are done.
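The steps above can be sketched in plain Python to show the intended logic. This is only an illustration under assumptions: the attribute name client_id and the sample rows are hypothetical, and the code mirrors the aggregate, join, and filter idea rather than RapidMiner itself.

```python
from datetime import datetime

# Illustrative rows; client_id is a hypothetical name for the client ID attribute.
rows = [
    {"client_id": 1, "contract_status": "active", "date_processing": datetime(2012, 1, 5)},
    {"client_id": 1, "contract_status": "renewed", "date_processing": datetime(2012, 3, 9)},
    {"client_id": 2, "contract_status": "active", "date_processing": datetime(2012, 2, 1)},
]


def keep_most_recent(rows, id_attribute="client_id", date_attribute="date_processing"):
    # Aggregate: maximum date per client (the max_date column).
    max_date = {}
    for row in rows:
        key = row[id_attribute]
        if key not in max_date or row[date_attribute] > max_date[key]:
            max_date[key] = row[date_attribute]
    # Join + Filter: look up each row's max_date and keep rows that reach it.
    return [row for row in rows if row[date_attribute] >= max_date[row[id_attribute]]]


latest = keep_most_recent(rows)
```

If two rows of the same client share the latest date, both are kept, so a final duplicate removal on the client ID may still be needed.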
Ralf
-
It didn't work. Almost everything went OK, but I couldn't filter on max_date. The error says it cannot parse value 'max_date' with date pattern yyyy-MM-dd HH:mm:ss Z
What is surprising is that when I look at the meta data view, the max_date attribute has type date_time. I've tried to change the type using Date to Nominal, Date to Numerical, Numerical to Date, Nominal to Date, and Guess Types, but all of them either don't list max_date in the attribute list or don't make Filter Examples work.
-
I found a solution! I used Generate Attributes to create a date_dif attribute with the function date_diff(max_date,date_processing). Then I just had to filter the examples where date_dif=0.
It's working! Thanks!
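The error message suggests the filter read max_date as a literal date string rather than as an attribute name, which would explain why computing the difference first gets around it. Here is a small Python sketch of that workaround, assuming the rows already carry the joined max_date column. The sample data and the client_id attribute are invented for illustration, and the difference is measured in seconds here, which may not match the unit RapidMiner's date_diff returns.

```python
from datetime import datetime

# Illustrative rows after the join; client_id and the dates are made up.
joined = [
    {"client_id": 1, "date_processing": datetime(2012, 1, 5), "max_date": datetime(2012, 3, 9)},
    {"client_id": 1, "date_processing": datetime(2012, 3, 9), "max_date": datetime(2012, 3, 9)},
    {"client_id": 2, "date_processing": datetime(2012, 2, 1), "max_date": datetime(2012, 2, 1)},
]


def keep_zero_diff(rows):
    """Add a date_dif value per row and keep rows where it equals zero."""
    kept = []
    for row in rows:
        # Same idea as Generate Attributes with date_diff(max_date, date_processing).
        date_dif = (row["max_date"] - row["date_processing"]).total_seconds()
        if date_dif == 0:
            kept.append({**row, "date_dif": date_dif})
    return kept


most_recent = keep_zero_diff(joined)
```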