"PCA as with SPSS"

Hi there,

I am completely new to RapidMiner and quite new to statistics in general. Up to now I have only worked a little with SPSS.
Now in RapidMiner I want to repeat the things I've already done with SPSS, e.g. a Principal Component Analysis.
But I have no idea which operators and parameters are necessary to create the kind of output I am used to when working with SPSS.

There I had a number of variables (Exx and Oxx) and set a fixed number (2) of components. The SPSS output is a table showing the factor loadings for all of my original variables.
After that I would usually evaluate the loadings, group the variables according to their factor loadings, and drop variables whose loadings are too low.

It seems that RapidMiner's PCA operator is doing something similar, but I have no clue which variables the PCs are generated from or how the PCs are computed.

I hope my explanation isn't too confusing. Maybe RapidMiner simply doesn't offer PCA this way.
Is there anybody out there who can help me?

Best regards,
ron

land

Hi Ron,

RapidMiner focuses on automatic data processing, and hence there's no optimized user interface for applying just a single PCA and then looking at the results to decide manually which attributes/variables to keep.

But of course it's still possible. Let's go through this step by step. In the following process I added a Generate Data operator to simply generate some data that we can apply the PCA to. Then I added a Principal Component Analysis operator. Its outputs are the data projected onto the resulting components and the model itself. The model contains the principal components with the factors as shown in SPSS. To see the results, you need to execute the process.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="206" width="480">
      <operator activated="true" class="generate_data" compatibility="5.1.001" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="target_function" value="sum"/>
      </operator>
      <operator activated="true" class="principal_component_analysis" compatibility="5.1.001" expanded="true" height="94" name="PCA" width="90" x="180" y="30"/>
      <connect from_op="Generate Data" from_port="output" to_op="PCA" to_port="example set input"/>
      <connect from_op="PCA" from_port="example set output" to_port="result 1"/>
      <connect from_op="PCA" from_port="preprocessing model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

You can now take a look at the factors and use a Select Attributes operator to manually select which attributes you want to keep. But as I said, with RapidMiner we prefer doing it automatically. So we can use the "Weight by Component Model" operator to transform one of the components into a weighting vector. Then we can use the "Select by Weights" operator to select only those attributes of the original data set that fulfill a given condition. For example, we can keep only the top k attributes.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="446" width="882">
      <operator activated="true" class="generate_data" compatibility="5.1.001" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="target_function" value="sum"/>
      </operator>
      <operator activated="true" class="principal_component_analysis" compatibility="5.1.001" expanded="true" height="94" name="PCA" width="90" x="180" y="30"/>
      <operator activated="true" class="weight_by_component_model" compatibility="5.1.001" expanded="true" height="94" name="Weight by Component Model" width="90" x="447" y="75">
        <parameter key="normalize_weights" value="false"/>
      </operator>
      <operator activated="true" class="select_by_weights" compatibility="5.1.001" expanded="true" height="94" name="Select by Weights" width="90" x="648" y="75"/>
      <connect from_op="Generate Data" from_port="output" to_op="PCA" to_port="example set input"/>
      <connect from_op="PCA" from_port="example set output" to_port="result 1"/>
      <connect from_op="PCA" from_port="original" to_op="Weight by Component Model" to_port="example set"/>
      <connect from_op="PCA" from_port="preprocessing model" to_op="Weight by Component Model" to_port="model"/>
      <connect from_op="Weight by Component Model" from_port="weights" to_op="Select by Weights" to_port="weights"/>
      <connect from_op="Weight by Component Model" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
      <connect from_op="Select by Weights" from_port="example set output" to_port="result 2"/>
      <connect from_op="Select by Weights" from_port="weights" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
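
In case it helps to see the idea outside of RapidMiner, here is a minimal sketch in Python with NumPy of what such a weighting and selection amounts to. It is not the operators' actual implementation: it assumes the weights are the absolute entries of one eigenvector of the covariance matrix and that "top k" means the k largest weights. The function name `select_top_k_by_component` and the attribute names are made up for illustration.

```python
import numpy as np

def select_top_k_by_component(X, names, component=0, k=2):
    """Weight attributes by one principal component and keep the top k.

    Assumption: weight = absolute value of the attribute's entry in the
    chosen eigenvector of the covariance matrix.
    """
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)                 # attributes are columns
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]         # largest variance first
    vector = eigenvectors[:, order[component]]
    weights = np.abs(vector)                      # sign of a component is arbitrary
    # Indices of the k largest weights, restored to original column order.
    top = np.sort(np.argsort(weights)[::-1][:k])
    kept = [names[i] for i in top]
    return dict(zip(names, weights)), kept, X[:, top]

# Made-up data: "a" and "c" share a strong common signal, "b" and "d" are noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    10 * base + rng.normal(size=200),
    rng.normal(size=200),
    8 * base + rng.normal(size=200),
    rng.normal(size=200),
])
weights, kept, reduced = select_top_k_by_component(X, ["a", "b", "c", "d"], k=2)
print(weights)
print(kept)
```

The selected columns are taken from the original data, not from the transformed components, which mirrors connecting the "original" port of the PCA operator in the process above.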

I hope that will help you.

Greetings,
Sebastian

ron

Hi Sebastian,

thank you very much for your hints. I really appreciate your help. Indeed, working with RapidMiner seems quite different from the SPSS way.
But I still don't understand which of the original attributes the new factors consist of or how the new factors are computed.

In your first XML example the dimensionality is reduced by variance with a threshold of 0.95. In the results I can look at the PCA model. In the "Eigenvalues" view I can see five factors (PC 1 to PC 5) and their proportional and cumulative variance. The "Eigenvectors" view shows the five original attributes and their PC 1 to PC 5 factor loadings. That's plausible so far.

But looking at the PCA example set results, suddenly there are just 4 attributes (named pc_1 to pc_4). I wonder how these four attributes are generated.
I suppose I still don't understand which arithmetic steps RapidMiner is performing.

Maybe you have some more hints for me.

land

Hi,
Well, once every principal component beyond the 95% of cumulative variance you wanted to keep has been dropped, only the first 4 remain.
The model keeps all of them anyway.

You mean which matrix operations are performed in the background? Do you really want to know that? They are just some standard calculations, and I doubt SPSS will show them to you.
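
For anyone curious, those standard calculations can be sketched in a few lines of Python with NumPy. This is a minimal sketch, not RapidMiner's actual code: it assumes the PCA is computed from the covariance matrix of the mean-centered data and that components are kept until the cumulative variance first reaches the threshold. The function name `pca_with_variance_threshold` is made up for illustration.

```python
import numpy as np

def pca_with_variance_threshold(X, threshold=0.95):
    """Return all eigenvalues and eigenvectors plus the reduced scores."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)                  # 1. subtract attribute means
    cov = np.cov(centered, rowvar=False)           # 2. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # 3. eigendecomposition
    order = np.argsort(eigenvalues)[::-1]          # 4. sort by variance, descending
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # 5. keep components up to the first one that reaches the threshold
    n_keep = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigenvalues))
    # 6. each new attribute pc_i is a weighted sum of ALL original attributes,
    #    with the weights taken from the i-th eigenvector
    scores = centered @ eigenvectors[:, :n_keep]
    return eigenvalues, eigenvectors, scores

# Made-up data with five attributes, as in the generated example set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
eigenvalues, eigenvectors, scores = pca_with_variance_threshold(X, 0.95)
print(np.cumsum(eigenvalues) / eigenvalues.sum())  # cumulative variance per PC
print(scores.shape)                                # rows x number of kept PCs
```

So the model (eigenvalues and eigenvectors) always describes all components, while the example set only contains the columns for the components that were kept.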

Greetings,
Sebastian

slg

IMHO doing statistics "the good old way" and doing data mining are a bit different. I do both, depending on context. R, which comes with RapidMiner, offers the type of PCA modeling that SPSS does (and with better flexibility), so if you have to "manually" evaluate your model and specify/modify it in a way that meets your own "intuition", I suggest going behind RM and using the R console interface...
Just a thought...
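
For the manual, SPSS-style evaluation discussed in this thread, the loadings table can also be reproduced by hand. The sketch below uses Python with NumPy rather than R, and it rests on two assumptions: that SPSS is run with its default of analyzing the correlation matrix, and that its component matrix reports eigenvectors scaled by the square root of their eigenvalues. The function name `spss_style_loadings` and the data are made up for illustration.

```python
import numpy as np

def spss_style_loadings(X, n_components=2):
    """Loadings = eigenvectors of the correlation matrix * sqrt(eigenvalues)."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)            # attributes are columns
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Scaling by sqrt(eigenvalue) turns unit-length eigenvectors into
    # correlations between each variable and each component.
    return eigenvectors[:, order] * np.sqrt(eigenvalues[order])

# Made-up data with five variables and a fixed number (2) of components.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
X[:, 1] += X[:, 0]                                 # make two variables correlate
loadings = spss_style_loadings(X, n_components=2)
print(np.round(loadings, 3))                       # one row per original variable
```

Unit-length eigenvectors from a covariance-based PCA will not match these numbers directly, which may explain differences between the two tools' tables. The sign of each column is arbitrary, so a column may appear flipped compared with another program's output.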