Looping over clusters and storing them in the repository

User: "flo"
New Altair Community Member
Updated by Jocelyn

Hi everybody,

 

My dataset consists of 4000 examples, 4 special attributes (ID, cluster, text and outlier), and 570 regular attributes from text processing. So far, the only thing I have done with the data is cluster it. Now I have 37 clusters and I want to store one example set per cluster in my repository.

That's where my problem is: I think it should be possible with macros, the "Loop Clusters" operator and the "Store" operator, but I can't figure out how to set the parameters correctly.

I have attached a snippet of the data.

 

And the XML of my process so far:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve Daten KAM clustered (opt.)" width="90" x="112" y="34">
<parameter key="repository_entry" value="//Datenbearbeitung MA/Filter Outliers/Daten KAM clustered (opt.)"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="ID|label|text"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
<parameter key="attribute_name" value="label"/>
<parameter key="target_role" value="cluster"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="loop_clusters" compatibility="8.2.000" expanded="true" height="82" name="Loop Clusters" width="90" x="648" y="34">
<process expanded="true">
<operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
<list key="filters_list">
<parameter key="filters_entry_key" value="label.equals.%{myMacro_0}"/>
</list>
</operator>
<operator activated="true" class="store" compatibility="8.2.000" expanded="true" height="68" name="Store" width="90" x="648" y="34">
<parameter key="repository_entry" value="999TEST"/>
</operator>
<connect from_port="cluster subset" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_port="out 1"/>
<portSpacing port="source_cluster subset" spacing="0"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="68" name="Set Macros" width="90" x="313" y="136">
<list key="macros">
<parameter key="myMacro_0" value="&quot;cluster_0&quot;"/>
</list>
</operator>
<connect from_op="Retrieve Daten KAM clustered (opt.)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Loop Clusters" to_port="example set"/>
<connect from_op="Loop Clusters" from_port="out 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

My goal is to apply the "Extract Topics from Document (LDA)" operator on every cluster with number of topics = 1 so that I can see the top words for each cluster.

 

Thank you all in advance

flo

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer

    Hi,

     

    Group Into Collection (from the Operator Toolbox extension) and Loop Collection do it.

     

    Let me know if you need help with LDA. It's somewhat my baby.

     

    BR,

    Martin

     

    Edit: I guess you do not want to use LDA, but simply Process Documents or something similar.
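
For readers who want to see the idea behind the Group Into Collection and Loop Collection suggestion outside of RapidMiner, here is a minimal Python sketch. It is only an analogy, not the operators' implementation: the toy rows, the function names `group_by_cluster` and `store_groups`, and the CSV output are all assumptions made for illustration. The point it shows is that each group has to be stored under a name derived from its cluster value. In the process, that would presumably mean using a macro in the Store operator's repository entry, since a fixed entry such as `999TEST` is overwritten on every iteration.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_cluster(rows, key="cluster"):
    """Split a list of row dicts into one list per cluster value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def store_groups(groups, target_dir):
    """Write each group to its own CSV file, named after the cluster."""
    target_dir = Path(target_dir)
    written = {}
    for cluster_id, rows in sorted(groups.items()):
        # The file name depends on the cluster, so nothing gets overwritten.
        path = target_dir / f"{cluster_id}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written[cluster_id] = path
    return written


# Hypothetical toy data standing in for the clustered example set.
examples = [
    {"ID": 1, "cluster": "cluster_0", "text": "good food"},
    {"ID": 2, "cluster": "cluster_1", "text": "bad service"},
    {"ID": 3, "cluster": "cluster_0", "text": "great food"},
]

groups = group_by_cluster(examples)
out_dir = tempfile.mkdtemp()
stored = store_groups(groups, out_dir)
for cluster_id, path in stored.items():
    print(cluster_id, len(groups[cluster_id]), path.name)
```

Grouping first and looping over the groups afterwards avoids filtering the full example set once per cluster.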

    User: "MartinLiebig"
    Altair Employee
    Accepted Answer

    Hi @flo,

     

    Have a look at the attached process. It should do what you want.

     

    BR,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="8.2.001" expanded="true" height="68" name="Retrieve OpenRanks Reviews Beijing" width="90" x="45" y="34">
    <parameter key="repository_entry" value="data/OpenRanks Reviews Beijing"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Review"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="5.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="85"/>
    <connect from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="concurrency:k_means" compatibility="8.2.001" expanded="true" height="82" name="Clustering" width="90" x="447" y="34"/>
    <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="1.3.000-SNAPSHOT" expanded="true" height="82" name="Group Into Collection" width="90" x="715" y="34">
    <parameter key="group_by_attribute" value="cluster"/>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="8.2.001" expanded="true" height="82" name="Loop Collection" width="90" x="849" y="34">
    <process expanded="true">
    <operator activated="true" class="extract_macro" compatibility="8.2.001" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
    <parameter key="macro" value="clusterId"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="cluster"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="cluster"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.2.001" expanded="true" height="82" name="Aggregate (2)" width="90" x="179" y="34">
    <parameter key="use_default_aggregation" value="true"/>
    <parameter key="default_aggregation_function" value="sum"/>
    <list key="aggregation_attributes"/>
    </operator>
    <operator activated="true" class="transpose" compatibility="8.2.001" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
    <operator activated="true" class="sort" compatibility="8.2.001" expanded="true" height="82" name="Sort" width="90" x="447" y="34">
    <parameter key="attribute_name" value="att_1"/>
    <parameter key="sorting_direction" value="decreasing"/>
    </operator>
    <operator activated="true" class="filter_example_range" compatibility="8.2.001" expanded="true" height="82" name="Filter Example Range" width="90" x="581" y="34">
    <parameter key="first_example" value="1"/>
    <parameter key="last_example" value="5"/>
    <description align="center" color="transparent" colored="false" width="126">Take Top5</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.2.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="id"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="replace_what" value="sum\((.+)\)"/>
    <parameter key="replace_by" value="$1"/>
    </operator>
    <operator activated="true" class="rename" compatibility="8.2.001" expanded="true" height="82" name="Rename" width="90" x="849" y="34">
    <parameter key="old_name" value="att_1"/>
    <parameter key="new_name" value="sum"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.001" expanded="true" height="82" name="Generate Attributes" width="90" x="983" y="34">
    <list key="function_descriptions">
    <parameter key="cluster" value="%{clusterId}"/>
    </list>
    </operator>
    <connect from_port="single" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
    <connect from_op="Aggregate (2)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve OpenRanks Reviews Beijing" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="clustered set" to_op="Group Into Collection" to_port="exa"/>
    <connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="50" resized="true" width="481" x="271" y="235">Task: Calculate the top 5 most frequent words per cluster</description>
    </process>
    </operator>
    </process>
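
The inner loop of the process above sums the term occurrences within one cluster, transposes the result, sorts it in decreasing order and keeps the first five rows. A rough Python equivalent of that logic might look like the sketch below. The toy rows and the function name `top_words_per_cluster` are assumptions for illustration, and the alphabetical tie-break is my addition, not something the process specifies.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(rows, cluster_key="cluster", top_n=5):
    """Sum term occurrences per cluster and keep the top_n terms of each."""
    totals = defaultdict(Counter)
    for row in rows:
        cluster_id = row[cluster_key]
        for term, count in row.items():
            if term != cluster_key:
                totals[cluster_id][term] += count
    result = {}
    for cluster_id, counts in totals.items():
        # Sort by descending sum, then alphabetically for a stable order.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result[cluster_id] = ranked[:top_n]
    return result


# Hypothetical term-occurrence vectors, one dict per document.
rows = [
    {"cluster": "cluster_0", "food": 2, "service": 0, "hotel": 1},
    {"cluster": "cluster_0", "food": 1, "service": 1, "hotel": 0},
    {"cluster": "cluster_1", "food": 0, "service": 3, "hotel": 2},
]

top = top_words_per_cluster(rows, top_n=2)
for cluster_id, words in sorted(top.items()):
    print(cluster_id, words)
```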

     

     

    Edit: Also have a look at this blog post: https://medium.com/@mSchmitz_/understanding-clustering-cf0117148ef4 

    I think this is closer to what you really want.