Duplicate Data but different value in target

Hi All,
I am working with a small dataset of 120 rows and 5 features, with a binary target of Valid or Not Valid. I have some duplicate rows where all the input features are the same but the target value is different, as you can see below (sample data, not the original; a small pandas check for these conflicts follows the table). How will the model treat those rows? Is this ambiguous data? I ran the model and it was not able to classify the Not Valid cases: I have only 32 Not Valid cases out of 120, and most of them are duplicates that also have a Valid record with the same inputs. What should I do?
Att1 Att2 Att3 Target
F3 G929 P2 Valid
F3 G929 P2 Not Valid
F2 G929 P3 Not Valid
F2 G929 P3 Valid
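For reference, here is a rough pandas sketch of how such conflicts can be listed (file name and column names are placeholders, not my real data):

    import pandas as pd

    # Placeholder file; in practice the data would be loaded however you prefer.
    df = pd.read_csv("data.csv")          # columns: Att1, Att2, Att3, ..., Target

    features = [c for c in df.columns if c != "Target"]

    # For every row, count how many distinct labels its feature combination carries.
    n_labels = df.groupby(features)["Target"].transform("nunique")

    # Rows whose feature combination appears with more than one label are the conflicts.
    conflicts = df[n_labels > 1]
    print(conflicts.sort_values(features))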
Regards,
Vishnu
Answers
-
Given that you have Valid and Not Valid flags for the same combination of attribute values, how can you expect the model to learn from and consequently identify those cases?
The model needs to find patterns in order to make a prediction. If you are not providing a pattern, then there is no real result to be expected. You should go through the data and make sure you have only one label for each combination of inputs. So you want to use a remove-duplicates step. You probably need to sort the data first, in order to keep the "right" label (Valid or Not Valid) after the filtering, or you can do it manually given your small data set.
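The same sort-then-remove-duplicates idea, sketched in pandas (assuming a DataFrame df like the sample above, and assuming, purely for illustration, that Not Valid is the label you want to keep when both occur):

    features = [c for c in df.columns if c != "Target"]

    # Lower priority value = kept; here "Not Valid" wins ties (swap the values for the opposite).
    priority = {"Not Valid": 0, "Valid": 1}

    deduped = (
        df.assign(_prio=df["Target"].map(priority))
          .sort_values("_prio")                            # conflicting rows: "Not Valid" comes first
          .drop_duplicates(subset=features, keep="first")  # keep one row per feature combination
          .drop(columns="_prio")
    )

Sorting on an explicit priority column keeps the choice of "right" label deliberate instead of relying on whatever order the rows happen to be in.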
2 -
Actually, in these ambiguous cases, you might be better off removing BOTH of the conflicting input records. It somewhat depends on the data and the use case, but the consequence of removing only one duplicate and leaving the other in is that you are teaching the model to associate a particular pattern with one particular outcome that is actually ambiguous in real life. If one outcome is much more important to you than the other, this may be sensible (e.g., in fraud detection), but for other types of outcomes it may lead to undesirable results. So if you have a large enough sample and your misclassification costs are roughly symmetrical, I would recommend omitting them all.
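A sketch of that alternative in pandas, under the same assumptions about df as above:

    features = [c for c in df.columns if c != "Target"]

    # Keep only feature combinations that appear with a single label,
    # then collapse any remaining exact duplicates.
    n_labels = df.groupby(features)["Target"].transform("nunique")
    cleaned = df[n_labels == 1].drop_duplicates()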
3 -
Hi All,
I just want to confirm one thing regarding the duplicates. If I have 10 records that are all duplicates, and 9 of them have the target label Pass and 1 has Fail, then if I remove the duplicates I will end up with 2 records where all the input features are the same but the target is different (one Pass and one Fail), which is ambiguous. And if I don't remove those duplicates, am I giving more weight to those 9 records than to the last record? Is that correct?
Regards,
Vishnu
0 -
Correct. And if you remove all the ambiguous records (per my suggestion) then you are not giving weight to either side.
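A toy pandas illustration of the three options for the 10-record example above (9 Pass, 1 Fail, all with the same inputs; the values are made up):

    import pandas as pd

    # Hypothetical 10 identical-input records: 9 Pass, 1 Fail.
    toy = pd.DataFrame({"Att1": ["F3"] * 10,
                        "Att2": ["G929"] * 10,
                        "Att3": ["P2"] * 10,
                        "Target": ["Pass"] * 9 + ["Fail"]})
    features = ["Att1", "Att2", "Att3"]

    # Keep everything: the model sees Pass weighted 9 to 1 over Fail.
    print(toy["Target"].value_counts())

    # Remove duplicates only: one Pass and one Fail row remain, a pure conflict.
    print(len(toy.drop_duplicates()))                     # -> 2

    # Remove the whole ambiguous group: nothing remains, so neither label gets weight.
    n_labels = toy.groupby(features)["Target"].transform("nunique")
    print(len(toy[n_labels == 1]))                        # -> 0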
1 -
@Telcontar120 Is there any official page or book where the same information is mentioned? My manager actually asked me to show a proper reference for this explanation.
Regards,
Vishnu
0 -
Sorry, this is a community support forum, not an academic research journal! And I'm an experienced data scientist but not an academic myself, so this type of thinking is actually somewhat mystifying to me. There is much about current best practice in data science that you would have a hard time finding specific academic references to substantiate.
1