[Solved] Average mutual information / correlation matrix on massive data set

User: "qwertz2"
New Altair Community Member
Updated by Jocelyn


Dear community,

I have a massive data set with a few thousand regular attributes and a single label. The primary goal is a two-column list showing 1) each attribute's name and 2) its average mutual information with the label.

With so many attributes, computing the full average mutual information matrix is slow and memory-intensive, so I thought of working on a subset instead: calculate the value for label and att1, then for label and att2, and so on, looping over all attributes.
However, I did not manage to combine the results of the individual iterations into a single table. Recall and Remember don't seem to work here because the initial Recall is empty.
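
The per-attribute idea above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions, not the Mutual Information Matrix operator: it assumes the attributes and the label are already discretized, and the names `mutual_information`, `mi_table`, `columns` and `label` are hypothetical. Each attribute is compared with the label one pair at a time, and the results are collected into one two-column table without ever building the full matrix.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equally long discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of the attribute
    py = Counter(ys)            # marginal counts of the label
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mi_table(columns, label):
    """Return one (attribute name, MI with label) row per attribute."""
    return [(name, mutual_information(values, label))
            for name, values in columns.items()]


# Hypothetical toy data, already discretized.
columns = {
    "att1": [0, 0, 1, 1],  # identical to the label
    "att2": [0, 1, 0, 1],  # independent of the label
}
label = [0, 0, 1, 1]

table = mi_table(columns, label)
for name, mi in table:
    print(name, round(mi, 3))
```

Because only one attribute and the label are in play at any time, memory use stays proportional to a single column pair rather than to the square of the number of attributes.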

The secondary goal would be to select the five attributes with the highest average mutual information from the initial massive data set.

PS: I have the Converters extension installed in order to convert the matrix to an example set.

PPS: The matrix operators don't seem to be able to handle special attributes. That's why I used Set Role to change the label to regular.

Looking forward to any advice...

Cheers
Sachs

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.5.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="7.5.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
        <parameter key="number_of_attributes" value="5000"/>
      </operator>
      <operator activated="true" class="concurrency:loop_attributes" compatibility="7.5.000" expanded="true" height="103" name="Loop Attributes" width="90" x="179" y="34">
        <parameter key="regular_expression" value="%{loop_attribute}|label"/>
        <process expanded="true">
          <operator activated="true" class="work_on_subset" compatibility="7.5.000" expanded="true" height="103" name="Work on Subset" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="regular_expression"/>
            <parameter key="regular_expression" value="%{loop_attribute}|label"/>
            <parameter key="include_special_attributes" value="true"/>
            <process expanded="true">
              <operator activated="true" class="set_role" compatibility="7.5.000" expanded="true" height="82" name="Set Role" width="90" x="45" y="34">
                <parameter key="attribute_name" value="label"/>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="mututal_information_matrix" compatibility="7.5.000" expanded="true" height="82" name="Mutual Information Matrix" width="90" x="179" y="34"/>
              <operator activated="true" class="converters:matrix_2_example_set" compatibility="0.2.000" expanded="true" height="82" name="Matrix to ExampleSet" width="90" x="313" y="85"/>
              <operator activated="true" class="recall" compatibility="7.5.000" expanded="true" height="68" name="Recall" width="90" x="313" y="187">
                <parameter key="name" value="temp"/>
              </operator>
              <operator activated="true" class="append" compatibility="7.5.000" expanded="true" height="103" name="Append" width="90" x="447" y="136"/>
              <operator activated="true" class="remember" compatibility="7.5.000" expanded="true" height="68" name="Remember" width="90" x="581" y="136">
                <parameter key="name" value="temp"/>
              </operator>
              <connect from_port="exampleSet" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_op="Mutual Information Matrix" to_port="example set"/>
              <connect from_op="Mutual Information Matrix" from_port="example set" to_port="example set"/>
              <connect from_op="Mutual Information Matrix" from_port="matrix" to_op="Matrix to ExampleSet" to_port="matrix"/>
              <connect from_op="Matrix to ExampleSet" from_port="example set" to_op="Append" to_port="example set 1"/>
              <connect from_op="Recall" from_port="result" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_op="Remember" to_port="store"/>
              <connect from_op="Remember" from_port="stored" to_port="through 1"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
              <portSpacing port="sink_through 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_port="output 1"/>
          <connect from_op="Work on Subset" from_port="through 1" to_port="output 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Loop Attributes" to_port="input 1"/>
      <connect from_op="Loop Attributes" from_port="output 1" to_port="result 1"/>
      <connect from_op="Loop Attributes" from_port="output 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer

    Dear Sachs,

    Mutual information bins the data internally anyway, so I would recommend using Weight by Information Gain on a discretized label instead.

    ~Martin
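
    The suggestion in the accepted answer can be illustrated with a small Python sketch. It is not RapidMiner's Weight by Information Gain operator: the equal-width binning, the number of bins, and all function and variable names below are assumptions chosen for the example. The numeric label is discretized first, each attribute is weighted by its information gain with respect to that label, and the highest-weighted attributes are kept, which also covers the secondary goal of selecting the top five.

```python
import math
from collections import Counter, defaultdict


def equal_width_bins(values, n_bins):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: a single bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy in bits of a sequence of discrete symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def information_gain(attribute, label):
    """H(label) - H(label | attribute) for two discrete sequences."""
    groups = defaultdict(list)
    for a, y in zip(attribute, label):
        groups[a].append(y)
    n = len(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(label) - conditional


def top_k_by_information_gain(columns, numeric_label, k=5, n_bins=2):
    """Discretize label and attributes, weight each attribute, keep the k best."""
    label = equal_width_bins(numeric_label, n_bins)
    weights = {name: information_gain(equal_width_bins(values, n_bins), label)
               for name, values in columns.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical toy data with a numeric label.
columns = {
    "att1": [1.0, 1.5, 8.0, 9.0],  # follows the label
    "att2": [1.0, 9.0, 1.5, 8.0],  # unrelated to the label
    "att3": [5.0, 5.0, 5.0, 5.0],  # constant
}
numeric_label = [0.1, 0.2, 0.9, 1.0]

ranked = top_k_by_information_gain(columns, numeric_label, k=2)
for name, weight in ranked:
    print(name, round(weight, 3))
```

    For discrete variables, information gain with respect to the label equals the mutual information between the attribute and the label, so this ranking answers the original question without computing the full pairwise matrix.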