"How to handle missing values while calculating correlation"
venkatesh20
New Altair Community Member
Hi Gurus,
I am working on the MovieLens data set. Consider the data set below:
userid, movieid, rating
1,100,5
1,101,2
1,102,4
2,100,5
2,102,1
I want to compute the correlation between user IDs 1 and 2 based only on the items that both users have rated, ignoring any rating that only one of them has given. For example, in the data above I want the correlation computed only from the ratings for movie IDs 100 and 102, which users 1 and 2 have in common. Can anyone guide me on how to do this in RapidMiner?
I tried the process below, but it produces missing values and does not give proper results:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input>
<location/>
</input>
<output>
<location/>
<location/>
</output>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="449" width="681">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
<parameter key="repository_entry" value="jester/jester_sub"/>
</operator>
<operator activated="true" class="pivot" expanded="true" height="76" name="Pivot" width="90" x="179" y="120">
<parameter key="group_attribute" value="userid"/>
<parameter key="index_attribute" value="jokeid"/>
</operator>
<operator activated="true" class="data_to_similarity" expanded="true" height="76" name="Data to Similarity" width="90" x="447" y="120">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CorrelationSimilarity"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Pivot" to_port="example set input"/>
<connect from_op="Pivot" from_port="example set output" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="126"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
Hi,
I guess the easiest solution would be to replace the missing values. If you simply removed all attributes with missing values, you would lose information, because not rating a movie also tells you something about a user. If you replace the missing values with -1, this might capture the real relationship much better.
Greetings,
Sebastian
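To make the difference between the two approaches in this thread concrete, here is a minimal Python sketch, outside RapidMiner, run on the five sample ratings from the question. It is only an illustration and does not show how the Data to Similarity operator works internally. The helper names (`pearson`, `by_user`, `corr_common`, `corr_filled`) are made up for this example. `corr_common` uses only the movies both users rated, as the question asks. `corr_filled` replaces missing ratings with -1, as the answer suggests.

```python
from math import sqrt

# Sample ratings from the question: (userid, movieid, rating)
ratings = [
    (1, 100, 5),
    (1, 101, 2),
    (1, 102, 4),
    (2, 100, 5),
    (2, 102, 1),
]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None  # fewer than two paired values
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs)
    sy = sum((y - my) ** 2 for y in ys)
    if sx == 0 or sy == 0:
        return None  # one user gave identical ratings everywhere
    return cov / sqrt(sx * sy)

def by_user(rows, user):
    """Map movieid -> rating for one user."""
    return {movie: rating for u, movie, rating in rows if u == user}

def corr_common(rows, a, b):
    """Correlation over the movies that BOTH users rated."""
    ra, rb = by_user(rows, a), by_user(rows, b)
    common = sorted(ra.keys() & rb.keys())
    return pearson([ra[m] for m in common], [rb[m] for m in common])

def corr_filled(rows, a, b, fill=-1):
    """Correlation over all movies either user rated; gaps become `fill`."""
    ra, rb = by_user(rows, a), by_user(rows, b)
    movies = sorted(ra.keys() | rb.keys())
    return pearson([ra.get(m, fill) for m in movies],
                   [rb.get(m, fill) for m in movies])

print(corr_common(ratings, 1, 2))  # uses movies 100 and 102 only
print(corr_filled(ratings, 1, 2))  # movie 101 counts as -1 for user 2
```

On this sample, the common-items variant is left with only two paired ratings, and a Pearson correlation over two points is always +1 or -1 (or undefined). In practice it probably makes sense to require a minimum number of co-rated items before trusting the value.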