"Correlation Matrix When to use Squared Correlation"

mob
New Altair Community Member
While researching a project involving polynominal datasets, I forgot to check whether RapidMiner had an operator to help, so I'm a bit confused by the Correlation Matrix operator and when to use the "squared correlation".
Is the squared correlation the same as a chi-squared calculation, and is the correlation matrix therefore similar to "weight by chi-square", but without the need to have a class label defined?
The tutorial example for the correlation matrix appears to show that it's suitable for use with the default params on non-numeric data, but other tools like R seem to prefer numeric-only datasets, so I'm unsure how to handle non-numeric datasets in RM when I need to see the correlation.
Any pointers to help clear the fog?
Answers
-
Hi,
as far as I know, squared correlation is equivalent to R² in Excel.
Does this help?
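For reference, a minimal sketch in plain Python (outside RapidMiner, with made-up numbers) of what "squared correlation" means if it is read as the square of Pearson's r, as the comparison with Excel suggests. For two numeric columns it coincides with the R² of a straight-line least-squares fit. The function names are illustrative only. A chi-squared statistic is a different quantity, computed from a contingency table of counts.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation of two numeric columns (neither may be constant)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def r_squared_of_linear_fit(xs, ys):
    """R² of the least-squares line y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

squared_correlation = pearson_r(xs, ys) ** 2
r_squared = r_squared_of_linear_fit(xs, ys)
print(squared_correlation, r_squared)  # the two values agree up to rounding
```

Both functions need numeric input, which is why this reading of squared correlation does not apply directly to nominal attributes.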
Cheers,
Martin
-
Hi Martin,
Thanks for helping. If you are talking about the rsq() function in Excel, which "can be interpreted as the proportion of the variance in y attributable to the variance in x" according to the Excel help docs, that function isn't suitable for non-numeric data.
Is RM able to process non-numeric data to see whether attributes are related, or do I need to convert them? If so, how do I do that without losing the essence of the relationships between categorical attributes?
-
Hi,
I think what you want is not possible with a single operator; you need to use a loop here. Attached is a process that calculates such a matrix (as a list) using the Gini Index. You can use any other Weight by operator if you want to. Comments are inside the process.
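The idea behind the loop can be sketched outside RapidMiner in plain Python: each attribute takes the label role once, and every other attribute gets a weight against it. The sketch assumes a textbook Gini gain (label impurity minus the weighted impurity after splitting on the attribute). The exact formula and normalization used by Weight by Gini Index may differ, and the toy data and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def gini_impurity(values):
    """1 - sum of squared class proportions."""
    n = len(values)
    return 1.0 - sum((count / n) ** 2 for count in Counter(values).values())


def gini_gain(attribute, label):
    """Drop in label impurity after grouping rows by the attribute's values."""
    groups = defaultdict(list)
    for a, l in zip(attribute, label):
        groups[a].append(l)
    n = len(label)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(label) - weighted


def pairwise_weights(columns):
    """Let each column be the label once; weigh all other columns against it."""
    rows = []
    for label_name, label in columns.items():
        for attr_name, attr in columns.items():
            if attr_name != label_name:
                rows.append((label_name, attr_name, gini_gain(attr, label)))
    return rows


# Hypothetical nominal data: "colour" determines "size", "shape" is unrelated.
data = {
    "colour": ["red", "red", "blue", "blue", "green", "green"],
    "size": ["S", "S", "L", "L", "M", "M"],
    "shape": ["round", "square", "round", "square", "round", "square"],
}

table = pairwise_weights(data)
for label_name, attr_name, weight in table:
    print(label_name, attr_name, round(weight, 3))
```

Weights of this kind are generally not symmetric: the weight of A when B is the label can differ from the weight of B when A is the label, so the result is a list of ordered pairs rather than a symmetric matrix.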
~Martin
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data" compatibility="6.4.000" expanded="true" height="60" name="Generate Data" width="90" x="45" y="165">
<parameter key="target_function" value="non linear"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="6.4.000" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="165">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="label"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="discretize_by_bins" compatibility="6.4.000" expanded="true" height="94" name="Discretize (2)" width="90" x="313" y="165">
<parameter key="number_of_bins" value="5"/>
<parameter key="range_name_type" value="short"/>
</operator>
<operator activated="true" class="loop_attributes" compatibility="6.4.000" expanded="true" height="94" name="Loop Attributes" width="90" x="447" y="165">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="11" y="52"/>
<operator activated="true" class="set_role" compatibility="6.4.000" expanded="true" height="76" name="Set Role" width="90" x="179" y="210">
<parameter key="attribute_name" value="%{loop_attribute}"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="weight_by_gini_index" compatibility="6.4.000" expanded="true" height="76" name="Weight by Gini Index" width="90" x="313" y="210"/>
<operator activated="true" class="weights_to_data" compatibility="6.4.000" expanded="true" height="60" name="Weights to Data" width="90" x="447" y="210"/>
<operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="210">
<list key="function_descriptions">
<parameter key="Iteration" value="&quot;%{loop_attribute}&quot;"/>
</list>
</operator>
<operator activated="true" class="order_attributes" compatibility="6.4.000" expanded="true" height="76" name="Reorder Attributes" width="90" x="715" y="210">
<parameter key="attribute_ordering" value="Iteration|Attribute|Weight"/>
</operator>
<connect from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Weight by Gini Index" to_port="example set"/>
<connect from_op="Weight by Gini Index" from_port="weights" to_op="Weights to Data" to_port="attribute weights"/>
<connect from_op="Weights to Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Reorder Attributes" to_port="example set input"/>
<connect from_op="Reorder Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_example set" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<description align="center" color="yellow" colored="false" height="193" resized="true" width="283" x="153" y="129">Weight by Gini Index always calculates the index with respect to the label. The same approach works with Information Gain etc.</description>
<description align="center" color="yellow" colored="false" height="189" resized="true" width="412" x="441" y="133">Transform it a bit to make it easier to read</description>
</process>
<description align="center" color="transparent" colored="false" width="126">Loop so that each attribute is the label once</description>
</operator>
<operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="76" name="Append" width="90" x="581" y="210"/>
<connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Discretize (2)" to_port="example set input"/>
<connect from_op="Discretize (2)" from_port="example set output" to_op="Loop Attributes" to_port="example set"/>
<connect from_op="Loop Attributes" from_port="example set" to_port="result 1"/>
<connect from_op="Loop Attributes" from_port="result 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<description align="center" color="yellow" colored="false" height="249" resized="true" width="394" x="23" y="101">Get some polynominal data</description>
</process>
</operator>
</process>
-
Is there a reason why you discretized the dataset before looping?
-
Just to get nominal values, because you asked for them.
So: not really.
-
Thanks for that, I appreciate the help. Are there other ways to accomplish the same calculation? I have a fairly large dataset with a large number of columns.
-
None that I know of.
How do you want to use it? Of course, you can delete a column once it has had its iteration, so that it is not tested again in the following iterations. That might make everything faster.
Edit: If you want to use it for feature selection, have a look at this extension: http://sourceforge.net/projects/rm-featselext/
The MRMR operator there might be useful. Sadly, it is not on the RM Marketplace.
-
I'm really looking to get a general sense of the dataset before any data mining starts and, to be honest, have driven myself crazy trying to do things in R with dummy variables for categorical values to glean some relationships between polynominal categorical data. What I wouldn't give for a numeric dataset and a Pearson correlation.
-
Well, Gini Index and Information Gain (which is entropy-based) are quite good for polynominal values.
The other option would be to use Nominal to Numerical with dummy coding. But I think Pearson correlation is "wrong" for a binominal (numerical) attribute.
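A small worked example in plain Python (toy data, illustrative function names) of what dummy coding does in the simplest case. For two binominal attributes coded as 0/1, the squared Pearson correlation equals the chi-squared statistic of their 2x2 contingency table divided by the number of rows (the squared phi coefficient). The sketch assumes no continuity correction, and the identity does not carry over directly to attributes with more than two values, where one attribute becomes several dummy columns.

```python
from collections import Counter
from statistics import mean


def dummy_code(values, positive):
    """0/1 coding of a binominal attribute."""
    return [1.0 if v == positive else 0.0 for v in values]


def pearson_r(xs, ys):
    """Pearson correlation of two numeric columns (neither may be constant)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def chi_squared(a, b):
    """Pearson chi-squared statistic of the contingency table of a and b."""
    n = len(a)
    observed = Counter(zip(a, b))
    row_totals, col_totals = Counter(a), Counter(b)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


# Hypothetical binominal attributes, invented for illustration.
smoker = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
coughs = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]

x = dummy_code(smoker, "yes")
y = dummy_code(coughs, "yes")

r_squared = pearson_r(x, y) ** 2
chi2_over_n = chi_squared(smoker, coughs) / len(smoker)
print(r_squared, chi2_over_n)  # the two values agree up to rounding
```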
-
Thanks Martin,
I went crazy trying the dummy variables route. I'll check out the operators you suggest.