count function for nominal values
Hi,
I have a list of 25 nominal attributes that I would like to aggregate into a single attribute counting how many of those 25 have a valid (i.e. non-missing) value, but I'm at a loss as to how to do it in an easy way. I've looked at the aggregate, generate aggregate and generate attributes functions. The aggregate functions seem useful only for integers, and generate attributes does not have a count function (at least, not one that I've found). I've included an example below for clarity.
att1 att2 att3 att4 att5
valuex valuey missing valuez missing
missing valuex missing valuey missing
-> So the new attribute should have value 3 for example 1, and 2 for example 2.
Does anyone have experience with this?
Answers
Hi Lise,
If you have Python installed on your computer, you can use the "Execute Python" operator (available for download and installation via the Marketplace)
to perform this task; it takes only one line of code.
I built a process using your fictitious example set.
The calculated "count_valid_values" attribute is in the last column.
Here is the process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="8.0.001" expanded="true" height="68" name="Read CSV" width="90" x="112" y="85">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Count_Attribute.csv"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="313" y="85">
<parameter key="script" value="import pandas as pd&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    #data['count_missing'] = data.shape[1] - data.count(axis=1)&#10;    data['count_valid_values'] = data.count(axis=1)&#10;    # connect 1 output port to see the results&#10;    return data"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Execute Python" to_port="input 1"/>
<connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Your fictitious example set is in the attached file.
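For reference, the core of the script can also be tried outside RapidMiner. The following is a minimal sketch, assuming that missing values arrive in the pandas DataFrame as NaN/None; the helper name `add_valid_count` and its optional `columns` argument are hypothetical and only there for illustration (they are not part of the process above).

```python
import pandas as pd


def add_valid_count(data, columns=None):
    """Add a 'count_valid_values' column: non-missing cells per row.

    `columns` is a hypothetical convenience argument: pass a list of
    attribute names to count only those; by default all columns are used.
    """
    subset = data if columns is None else data[columns]
    # DataFrame.count(axis=1) counts the non-NA cells in each row
    data["count_valid_values"] = subset.count(axis=1)
    return data


# The two examples from the question, with None standing in for "missing"
example = pd.DataFrame({
    "att1": ["valuex", None],
    "att2": ["valuey", "valuex"],
    "att3": [None, None],
    "att4": ["valuez", "valuey"],
    "att5": [None, None],
})

result = add_valid_count(example)
print(result["count_valid_values"].tolist())  # [3, 2]
```

If the data set contains other attributes besides the 25 nominal ones, passing their names via `columns` keeps those other attributes out of the count.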
I hope this will be helpful
Regards,
Lionel
Thank you, I tried it before but made the mistake of ticking off the checkbox "ignore missings" because I assumed it would not count if an attribute had missing values (which would defeat my purpose).
Thanks for the help!
Lise