How to handle empty fields problems (Not missing data) in a data set
MasoudG
New Altair Community Member
Hello guys.
I have a data set that I collected from 35 companies. One of my attributes is "do they have this type of plan", with the values "Yes" and "No", and my second attribute is "how much is the price of this plan". For the companies whose first attribute is "Yes", the value is a number like 30 euros, but for the companies whose first attribute is "No", this field is empty.
I want to do clustering, but because of the empty fields I can't proceed. I don't want to remove this attribute or any example, or fill in these fields with any missing-data technique, because the values are not missing.
Is there any technique in RapidMiner to define: if the first attribute is "No", then ignore the second attribute for that example?
Thank you very much
Best Answer
-
I agree with @David_A. You can replace those missing values with something meaningful, e.g. 0 for missing (but meaningful) numerical values (I assume that if the price is not there it can be interpreted as zero) and "undefined" for nominal attributes (so that you can treat these in a special way).
If you are concerned that those extra zeroes are going to upset your statistics, e.g. during your cluster analysis, this means that in your mind you want these cases to be treated separately. If that is the case and you want to do segmentation analysis, conduct your clustering in two different processes (filter these cases out for one and in for the other) and interpret each separately.
If you want to use the cluster attribute for building a predictive model, you could then rename these cluster attributes C1 and C2 (create dummy attributes C2 and C1 in each process, with some specific values, in a sense putting those examples all in a separate cluster) and append all examples back, generating two extra columns, for further processing.
Jacob
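For readers who want to see the "cluster separately, then append back" idea outside RapidMiner, here is a rough Python sketch. Everything in it is an assumption for illustration: the company records are made up, and `two_means_1d` is a hypothetical stand-in for whatever clustering operator you would actually use. It only shows the shape of the approach, not a RapidMiner process.

```python
# Hypothetical records: (company, has_plan, price).
# None marks "no such plan", not a value that was lost.
companies = [
    ("A", "Yes", 30.0),
    ("B", "No", None),
    ("C", "Yes", 32.0),
    ("D", "Yes", 80.0),
    ("E", "No", None),
    ("F", "Yes", 85.0),
]


def two_means_1d(values, iterations=10):
    """Tiny 1-D k-means with k=2, seeded with the smallest and largest value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearest centre.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Recompute each centre as the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


# Split the examples instead of filling the empty price fields.
with_plan = [c for c in companies if c[1] == "Yes"]
without_plan = [c for c in companies if c[1] == "No"]

# Cluster only the companies that actually have a price.
labels = two_means_1d([price for _, _, price in with_plan])

# Append both groups back; companies without the plan get their own label.
result = {
    name: f"plan_cluster_{lab}"
    for (name, _, _), lab in zip(with_plan, labels)
}
result.update({name: "no_plan" for name, _, _ in without_plan})

for name in sorted(result):
    print(name, result[name])
```

No company is dropped and no price is invented: the "No" companies simply end up in a group of their own.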
Answers
-
Hi @MasoudG,
you have different options here, depending on what you actually want to cluster and how you want to proceed afterwards.
You could use the Replace Missing Values operator to replace the Value field with something useful (for example 0 or the average price).
The other option is to first use Filter Examples and filter either for "Yes" in the relevant attribute or for "no_missing_attributes".
Best,
David
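To make the options in this reply concrete, here is a small Python sketch of what each one does to a price column. The numbers are invented for illustration, and the plain lists are only an analogue of the Replace Missing Values and Filter Examples operators, not their actual implementation.

```python
from statistics import mean

# Hypothetical price column; None stands for an empty field.
prices = [30.0, None, 32.0, 80.0, None, 85.0]

# Prices that are actually present.
known = [p for p in prices if p is not None]
average = mean(known)

# Option 1a: replace empty fields with 0.
filled_with_zero = [0.0 if p is None else p for p in prices]

# Option 1b: replace empty fields with the average of the known prices.
filled_with_average = [average if p is None else p for p in prices]

# Option 2: keep only the examples that have a price
# (an analogue of filtering for examples with no missing attributes).
filtered = known

print(filled_with_zero)
print(filled_with_average)
print(filtered)
```

Which variant is appropriate depends on what the empty field means for your analysis: 0 and the average both put a number where there was none, while filtering changes which examples are clustered.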
Hi @David_A
Thank you very much for your quick response. Actually, I have around 30 different attributes of 35 companies and I want to cluster these companies based on their features.
1- Replace Missing Values: I don't want to replace any value in these fields since they are not missing. They do not have any value because the companies do not have this type of plan, and I think replacing them with a value like 0 or the average can affect the clustering process.
2- Filter Examples: I don't want to filter out any example, because my examples are my companies and my main goal is clustering them, so I need them all.
Do you have any other idea?
Thank you in advance.
Masoud
What do you mean exactly by the statement that you can't proceed with clustering because of the empty fields? Are you saying the clustering algorithm is preventing you because those fields are currently designated as missing within RapidMiner?
If you need to remove all your missing values in order to run the clustering algorithm you want, then you can populate them appropriately with a two-step process. First use Generate Attributes with an expression along the lines of PricePlan = if(HavePlan == "Yes", PricePlan, "N/A"). This keeps whatever value is in the price-of-plan attribute if the company answered "Yes" to having the plan, and otherwise sets the price to "N/A" (or whatever you want). Then you can run a subsequent Replace Missing Values and decide how to represent the missing prices where they answered "Yes" to having the plan (for example, with the average price).
If the fields are not technically missing but simply populated by a space or similar, then you should be fine.
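The two-step logic described above can be sketched in Python as follows. This is only an illustration under assumptions: the rows are made up, the attribute names HavePlan and PricePlan are taken from the expression in this reply, and the loops stand in for Generate Attributes and Replace Missing Values rather than reproducing them.

```python
from statistics import mean

# Hypothetical examples. None is an empty price field.
rows = [
    {"HavePlan": "Yes", "PricePlan": 30.0},
    {"HavePlan": "No", "PricePlan": None},   # empty because there is no plan
    {"HavePlan": "Yes", "PricePlan": None},  # plan exists, price truly missing
    {"HavePlan": "Yes", "PricePlan": 50.0},
]

# Step 1: keep the price where HavePlan is "Yes", otherwise mark it "N/A".
for row in rows:
    if row["HavePlan"] != "Yes":
        row["PricePlan"] = "N/A"

# Step 2: fill the remaining (genuinely missing) prices with the average
# of the prices that are known.
known = [r["PricePlan"] for r in rows if isinstance(r["PricePlan"], float)]
average = mean(known)
for row in rows:
    if row["PricePlan"] is None:
        row["PricePlan"] = average

print([r["PricePlan"] for r in rows])
```

The point of doing it in two steps is that "no plan" and "price unknown" end up with different representations instead of both being treated as missing.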