"Very basic clustering"

kgbolger New Altair Community Member
edited November 5 in Community Q&A
Hi,

I'm extremely new to both data mining and RapidMiner itself, and I'm just getting comfortable with simple operations such as loading and aggregating data.
I've searched for topics on this, but everything I've found is still a little too complicated for what I'm trying to do.

I'm interested in doing some basic clustering or classification on a single attribute of a dataset.
I've loaded some data from SQL that gives me a count of transactions per day.

I have 4 attributes in my dataset:
Year  Month  Day  Count
2010  10     5    345643
2010  10     4    2000
2010  10     7    2356
2010  10     5    18
2010  09     2    10010
2010  10     18   12
2010  01     5    34252

This is only a sample. I have a year's worth of data, so 365 items.

I'm trying to cluster the days into maybe 5 bins based on the size of the count, but I can't seem to target a single attribute using k-means or the other algorithms.
Is what I'm trying to do too simplistic for RapidMiner's operators? I need to use RapidMiner for a project I'm doing.

Thanks,
kgbolger

Answers

  • Rene
    Rene New Altair Community Member
    I'm not sure I understood you correctly, but to exclude everything
    except the "Count" attribute you can, for example:
    a.) use "Set Role" on the other attributes to make them special
    attributes, which excludes them from the cluster analysis, or
    b.) use "Select Attributes" to keep just the single attribute
    that interests you before clustering.

    greets,
    rené
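To illustrate what "clustering on a single attribute" means in both options above, here is a minimal sketch outside RapidMiner, in plain Python. It is not RapidMiner's k-means implementation. The function name `kmeans_1d` and the deterministic starting centroids are assumptions made for this example, and it uses k=3 only because the sample from the question has just seven rows.

```python
# Sample rows from the question: (year, month, day, count).
rows = [
    (2010, 10, 5, 345643),
    (2010, 10, 4, 2000),
    (2010, 10, 7, 2356),
    (2010, 10, 5, 18),
    (2010, 9, 2, 10010),
    (2010, 10, 18, 12),
    (2010, 1, 5, 34252),
]


def kmeans_1d(values, k, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on a list of numbers.

    Hypothetical helper for illustration. Initial centroids are taken at
    evenly spaced positions in the sorted values, so the result is
    deterministic. Returns (labels, centroids).
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    centroids = [float(ordered[i * last // max(k - 1, 1)]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every value.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Only the count column is passed in; year, month and day never reach
# the algorithm, which is the effect of Set Role / Select Attributes.
counts = [r[3] for r in rows]
labels, centroids = kmeans_1d(counts, k=3)
for row, label in zip(rows, labels):
    print(row, "-> cluster", label)
```

On this sample the single very large count ends up in a cluster of its own, which is worth keeping in mind when choosing the number of bins for strongly skewed counts.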
  • B_
    B_ New Altair Community Member
    To group months/days by count, you will need an id field so you know which entry belongs to which cluster.  If you set up a unique record id in SQL, it is easiest to use that (select id, count from ).  Alternatively, you can create the id from the month and day (select 'month' + 'day' if your fields are text; otherwise you will need to convert them to text first).  You can also write the cluster results back into the database and join them to the original data on that id.

    Simple example:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.11" expanded="true" name="Process">
        <process expanded="true" height="404" width="748">
          <operator activated="true" class="generate_sales_data" compatibility="5.0.11" expanded="true" height="60" name="Generate Sales Data" width="90" x="41" y="45"/>
          <operator activated="true" class="select_attributes" compatibility="5.0.11" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="single_price|transaction_id"/>
          </operator>
          <operator activated="true" class="k_means" compatibility="5.0.11" expanded="true" height="76" name="Clustering" width="90" x="380" y="75">
            <parameter key="add_as_label" value="true"/>
            <parameter key="k" value="5"/>
          </operator>
          <connect from_op="Generate Sales Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
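
The id idea from the answer above can also be sketched in plain Python, independent of RapidMiner. The helper names `date_id` and `assign_ids` are hypothetical, and the "cluster" value is only a placeholder (the number of digits in the count), not the output of k-means. Note that the sample in the question contains two rows for 5 October 2010, so a date-based id is not unique there and the sketch falls back to row numbers.

```python
from collections import Counter

# Sample rows from the question: (year, month, day, count).
rows = [
    (2010, 10, 5, 345643),
    (2010, 10, 4, 2000),
    (2010, 10, 7, 2356),
    (2010, 10, 5, 18),
    (2010, 9, 2, 10010),
    (2010, 10, 18, 12),
    (2010, 1, 5, 34252),
]


def date_id(year, month, day):
    """Build a text id such as '2010-10-05' from the date parts."""
    return f"{year:04d}-{month:02d}-{day:02d}"


def assign_ids(rows):
    """Use the date as id when it is unique, otherwise use row numbers."""
    ids = [date_id(y, m, d) for y, m, d, _ in rows]
    if max(Counter(ids).values()) > 1:
        # At least one date occurs twice, so dates cannot identify rows.
        ids = [f"row-{n}" for n in range(1, len(rows) + 1)]
    return ids


ids = assign_ids(rows)
original = dict(zip(ids, rows))

# Placeholder for the clustering output: (id, cluster) pairs in a
# different order than the input. The "cluster" is just the number of
# digits in the count, NOT a real k-means result.
cluster_result = [
    (rid, len(str(row[3]))) for rid, row in reversed(list(original.items()))
]

# Join the cluster labels back onto the original rows via the id.
joined = [(rid, *original[rid], cluster) for rid, cluster in cluster_result]
for record in sorted(joined):
    print(record)
```

Because the join goes through the id, the order in which the clustering step returns its rows does not matter.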