Discretize by Density

michaelhecht New Altair Community Member
edited November 5 in Community Q&A
The Bayesian network software Genie offers a discretisation method in which you specify the number of bins and the bins are then placed around the densest regions of an attribute. For example, if the attribute contains two or three separable Gaussian distributions and you ask for three bins, the bins are found hierarchically, i.e. density-based, with one bin placed around each Gaussian.

It would be nice to have this also in RapidMiner.

Entropy-based discretisation seems comparable, but it does not let you preselect the number of bins.

Answers

  • TobiasMalbrecht
    TobiasMalbrecht New Altair Community Member
    Dear Michael,

    Running a hierarchical clustering on a data set that contains only the attribute to be discretized should yield the desired result. Afterwards, simply flatten the cluster model, specifying the number of discrete values you would like to obtain. The process below shows how it works; the macros `attribute` and `number_of_classes` select the attribute to discretize and the number of bins:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.005">
      <context>
        <input/>
        <output/>
        <macros>
          <macro>
            <key>attribute</key>
            <value>a4</value>
          </macro>
          <macro>
            <key>number_of_classes</key>
            <value>3</value>
          </macro>
        </macros>
      </context>
      <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="5.3.005" expanded="true" height="60" name="Retrieve Iris" width="90" x="45" y="30">
            <parameter key="repository_entry" value="//Samples/data/Iris"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="5.3.005" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="work_on_subset" compatibility="5.3.005" expanded="true" height="76" name="Work on Subset" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="%{attribute}"/>
            <process expanded="true">
              <operator activated="true" class="agglomerative_clustering" compatibility="5.3.005" expanded="true" height="76" name="Clustering" width="90" x="45" y="30"/>
              <operator activated="true" class="flatten_clustering" compatibility="5.3.005" expanded="true" height="76" name="Flatten Clustering" width="90" x="179" y="30">
                <parameter key="number_of_clusters" value="%{number_of_classes}"/>
              </operator>
              <operator activated="true" class="set_role" compatibility="5.3.005" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
                <parameter key="name" value="cluster"/>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="rename" compatibility="5.3.005" expanded="true" height="76" name="Rename" width="90" x="447" y="30">
                <parameter key="old_name" value="cluster"/>
                <parameter key="new_name" value="%{attribute}_discretized"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="replace" compatibility="5.3.005" expanded="true" height="76" name="Replace" width="90" x="581" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="%{attribute}_discretized"/>
                <parameter key="replace_what" value="cluster"/>
                <parameter key="replace_by" value="value"/>
              </operator>
              <connect from_port="exampleSet" to_op="Clustering" to_port="example set"/>
              <connect from_op="Clustering" from_port="cluster model" to_op="Flatten Clustering" to_port="hierarchical"/>
              <connect from_op="Clustering" from_port="example set" to_op="Flatten Clustering" to_port="example set"/>
              <connect from_op="Flatten Clustering" from_port="example set" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_op="Rename" to_port="example set input"/>
              <connect from_op="Rename" from_port="example set output" to_op="Replace" to_port="example set input"/>
              <connect from_op="Replace" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Iris" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best,
    Tobias
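
For readers who want to see what the process above computes, here is a minimal Python sketch of the same idea. It is not taken from RapidMiner and makes one assumption: the clustering uses single linkage. Under that assumption, clustering a single numeric attribute and flattening to k clusters amounts to cutting the sorted values at the k - 1 widest gaps (ties between equally wide gaps aside). Other linkage modes can produce different cuts. The function name `discretize_by_gaps` is hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def discretize_by_gaps(values: Sequence[float], n_bins: int) -> List[int]:
    """Return a bin index (0 .. n_bins - 1) for every value.

    Assumption: single-linkage clustering of one numeric attribute,
    which is equivalent to cutting at the n_bins - 1 widest gaps
    between neighbouring sorted values.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    distinct = sorted(set(values))
    if n_bins > len(distinct):
        raise ValueError("n_bins exceeds the number of distinct values")

    # Gap i lies between distinct[i] and distinct[i + 1].
    gaps = [(distinct[i + 1] - distinct[i], i) for i in range(len(distinct) - 1)]

    # Keep the n_bins - 1 widest gaps; each one ends a bin.
    widest = sorted(gaps, reverse=True)[: n_bins - 1]

    # The largest value of every bin except the last one.
    boundaries = sorted(distinct[i] for _, i in widest)

    # A value's bin index is the number of boundaries strictly below it.
    return [sum(1 for b in boundaries if v > b) for v in values]


# Three well-separated groups, discretized into three bins.
sample = [1.0, 1.2, 0.9, 5.0, 5.1, 9.8, 10.0, 10.3]
print(discretize_by_gaps(sample, 3))  # [0, 0, 0, 1, 1, 2, 2, 2]
```

Unlike the cluster names produced by the process, the bin indices here are ordered by value, which is usually convenient for a discretized attribute.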