I have an attribute Job, which is a label with 15 different values.
Out of 1000 samples, 7 values account for 950 samples and the remaining 8 values account for 50 samples.
I want to use only those 950 samples (i.e. the 7 values) and ignore the rest.
How do I select the label values that contribute the most samples? The split between chosen and not-chosen values may change (8-7, 10-5, 12-3, etc.) depending on the data.
I tried the following approach:
1) Count the number of occurrences of each value in the whole table (stuck at this point)
2) Rank the values (have no idea)
3) Filter the values into chosen and not chosen (have no idea)
If a better approach can be suggested, I would be very grateful.
I have the following table:
Name | Job
John | Painting
Kelly | Washing
Diamond | Carpentry
Clarice | Carpentry
Kennedy | Washing
Kevin | Painting
Hart | Painting
Budsey | Painting
David | Washing
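For illustration only, the count / rank / filter steps listed earlier can be sketched outside RapidMiner in plain Python. The function name `select_dominant_values` and the `coverage` threshold are hypothetical names introduced here, and the 0.75 cut-off is just an assumption chosen to suit this small table (the 950-of-1000 case would correspond to 0.95). It is a sketch of the idea, not a RapidMiner process.

```python
from collections import Counter


def select_dominant_values(values, coverage=0.95):
    """Return the most frequent values whose cumulative share reaches `coverage`.

    `coverage` is a hypothetical threshold: the fraction of all samples
    that the kept values should account for.
    """
    counts = Counter(values)        # step 1: count occurrences of each value
    ranked = counts.most_common()   # step 2: rank values by count, descending
    total = len(values)
    kept, covered = [], 0
    for value, count in ranked:
        if covered >= coverage * total:
            break                   # enough samples are already covered
        kept.append(value)
        covered += count
    return kept


# Job column from the small table in this post
jobs = ["Painting", "Washing", "Carpentry", "Carpentry", "Washing",
        "Painting", "Painting", "Painting", "Washing"]

kept = select_dominant_values(jobs, coverage=0.75)
filtered = [job for job in jobs if job in kept]   # step 3: keep only chosen values

print(kept)           # the values that were kept
print(len(filtered))  # how many samples remain after filtering
```

Because the cut-off is a share of the samples rather than a fixed number of values, the chosen / not-chosen split adapts to the data instead of being hard-coded as 7-8 or 10-5.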
I tried to count the number of occurrences of each value in the whole table. The result should look like this:
Name | Job | Total Job
John | Painting | 4
Kelly | Washing | 3
Diamond | Carpentry | 2
Clarice | Carpentry | 2
Kennedy | Washing | 3
Kevin | Painting | 4
Hart | Painting | 4
Budsey | Painting | 4
David | Washing | 3
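To make the desired Total Job column concrete, here is a minimal Python sketch (not RapidMiner) that counts each Job value over the whole table and then attaches that count to every row. The variable names `rows`, `job_counts`, and `with_totals` are made up for this illustration.

```python
from collections import Counter

# (Name, Job) pairs from the table in this post
rows = [
    ("John", "Painting"), ("Kelly", "Washing"), ("Diamond", "Carpentry"),
    ("Clarice", "Carpentry"), ("Kennedy", "Washing"), ("Kevin", "Painting"),
    ("Hart", "Painting"), ("Budsey", "Painting"), ("David", "Washing"),
]

# Count how often each Job value occurs across all rows
job_counts = Counter(job for _, job in rows)

# Attach the per-value count to every row as a third column
with_totals = [(name, job, job_counts[job]) for name, job in rows]

for name, job, total in with_totals:
    print(f"{name} | {job} | {total}")
```

The key point is that the count is computed once per distinct value (a group-by on Job) and then looked up for each row.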
I tried Generate Aggregation, but it produces the wrong values. This is my process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.6.000" expanded="true" height="68" name="Retrieve job" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../data/job"/>
      </operator>
      <operator activated="true" class="generate_aggregation" compatibility="9.6.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="246" y="34">
        <parameter key="attribute_name" value="TotalJob"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Job"/>
        <parameter key="attributes" value="Job"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="true"/>
        <parameter key="aggregation_function" value="count"/>
        <parameter key="concatenation_separator" value="|"/>
        <parameter key="keep_all" value="true"/>
        <parameter key="ignore_missings" value="true"/>
        <parameter key="ignore_missing_attributes" value="false"/>
      </operator>
      <connect from_op="Retrieve job" from_port="output" to_op="Generate Aggregation" to_port="example set input"/>
      <connect from_op="Generate Aggregation" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
The output I am getting is:
RowNo | Name | Job | TotalJob
1 | John | Painting | 1.0
2 | Kelly | Washing | 1.0
3 | Diamond | Carpentry | 1.0
4 | Clarice | Carpentry | 1.0
5 | Kennedy | Washing | 1.0
6 | Kevin | Painting | 1.0
7 | Hart | Painting | 1.0
8 | Budsey | Painting | 1.0
9 | David | Washing | 1.0
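A TotalJob of 1.0 on every row is what one would expect if the count is taken across the selected attributes within each row, rather than down the Job column. That reading of the operator is an assumption on my part, not something confirmed here. The Python sketch below (hypothetical names `row_wise` and `column_wise`) only illustrates the difference between the two kinds of count on this table.

```python
from collections import Counter

# Job table from this post, one dict per row
rows = [
    {"Name": "John", "Job": "Painting"}, {"Name": "Kelly", "Job": "Washing"},
    {"Name": "Diamond", "Job": "Carpentry"}, {"Name": "Clarice", "Job": "Carpentry"},
    {"Name": "Kennedy", "Job": "Washing"}, {"Name": "Kevin", "Job": "Painting"},
    {"Name": "Hart", "Job": "Painting"}, {"Name": "Budsey", "Job": "Painting"},
    {"Name": "David", "Job": "Washing"},
]

selected = ["Job"]  # only one attribute is selected, as in the process above

# Row-wise count: how many of the selected attributes are present in each row.
# With a single selected attribute this is 1 for every row.
row_wise = [sum(1 for attr in selected if row.get(attr) is not None) for row in rows]

# Column-wise count: how often each row's Job value occurs over all rows.
job_counts = Counter(row["Job"] for row in rows)
column_wise = [job_counts[row["Job"]] for row in rows]

print(row_wise)     # matches the all-ones TotalJob column
print(column_wise)  # matches the desired Total Job column
```

If that assumption holds, the desired column needs a grouped count on Job that is then joined back to the rows, not a per-row aggregation.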