Strategy for analysis of multivariate numerical data (novice)
ben_h
How can I estimate the value of a target ("class") variable from the values of about 8 to 10 other related variables? Each of the 8 variables has some missing data (from as little as 1% up to 15%), and the class variable itself is known for only about 10 of 8000 records.
The data are numeric, well-log data from an antique geophysical survey down a series of boreholes (many boreholes). A peek at the data from one well looks like this (TC is the class variable):
[tt]DEPTH CALI DEN GR LAT LN NEUT SN SP TC
94.7927 79.8064 109.3991 40.0754 125.7779 58.2112 628.632 36.54 33.4619 1.60[/tt]
I have a niche piece of software for mineral exploration data analysis (using the SOM technique as its core 'clustering' method), and have tried to learn the fundamentals of the underlying methodology, though I am no statistician. The implementation is something of a black box to me and I am reliant on a single point of contact regarding its use, so I would like to have some other way of looking at the data and the problem. I am a complete novice to RapidMiner and am looking for some help to get started with it (and by proxy some of the algorithms it uses).
More detail (can skip this next bit):
This is part of a larger research project I am undertaking. The essential method of the software I have is imputation of the class variable following grouping/clustering of the data. The well logs are of course responding to physical features of the rocks in the borehole, so I also wish to use this feature of the data to explore other means to estimate the TC variable. For example, unsupervised clustering should identify rock types based on related physical responses recorded in the well logs. If I match these with qualitative descriptions, I can estimate unknown variables from global or regional observations. Though the more I say about it the more I might be influencing your thoughts.
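To make the cluster-then-impute idea concrete, here is a minimal Python sketch. It is not the SOM software's method and not a RapidMiner process: it swaps in plain k-means (with a simple farthest-point start) for the clustering step, assumes the log attributes have already had their own gaps filled, and marks unknown TC values as NaN. The function names `kmeans` and `impute_tc_by_cluster` are made up for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Basic k-means on the rows of X; returns one cluster index per row."""
    # Deterministic start: first row, then repeatedly the row farthest
    # from the centroids chosen so far (an assumption of this sketch).
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroids.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dist.argmin(axis=1)

def impute_tc_by_cluster(logs, tc, k):
    """Fill missing TC (NaN) with the mean of the known TC in the same cluster.

    logs: 2-D array of log attributes with no missing values.
    tc:   1-D array of the target, NaN where unknown.
    """
    # Standardise so that no single log dominates the distance.
    z = (logs - logs.mean(axis=0)) / logs.std(axis=0)
    labels = kmeans(z, k)
    missing = np.isnan(tc)
    estimate = tc.copy()
    for j in range(k):
        in_cluster = labels == j
        known = in_cluster & ~missing
        if known.any():  # clusters with no known TC stay NaN
            estimate[in_cluster & missing] = tc[known].mean()
    return estimate, labels
```

With only about 10 known TC values, many clusters may contain no known value at all; in this sketch those rows simply stay NaN rather than receiving a guess.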
Answers
Ok. You want to predict TC which is a numerical attribute. Here is what I would do.
1. Handle missing values: replace them with the min, max, or average of the attribute (or with 0).
2. Then apply linear regression to see how it performs.
Other options: discretize your dataset, then
1. Try Naive Bayes.
2. Try Decision Trees.
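For readers who want to see the first suggestion (fill gaps, then linear regression) outside RapidMiner, a rough NumPy sketch follows. It is an illustration under assumptions, not the RapidMiner implementation: only the predictor columns are mean-filled, the target is assumed known for every training row, and the helper names are hypothetical.

```python
import numpy as np

def fill_with_column_mean(X):
    """Return a copy of X with each NaN replaced by its column's mean."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)       # mean of the non-missing values
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

def fit_linear(X, y):
    """Ordinary least squares with an intercept.

    Returns coefficients as [intercept, b1, b2, ...].
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to new predictor rows."""
    return coef[0] + X @ coef[1:]
```

Typical use would be `coef = fit_linear(fill_with_column_mean(X_train), y_train)` followed by `predict_linear(coef, fill_with_column_mean(X_new))`.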
Good luck.
Cheers,
Venki
Thanks for the response Venki,
I have more questions... very basic...
1. I think I apply the model by connecting the model output of the learner (e.g. 'Linear Regression') to the model input of the 'Apply Model' operator, and the 'exampleSet' output of the Linear Regression operator to the unlabelled data input of Apply Model. Is this correct? I have added the XML below for clarity.
2. I am confused as to how to handle missing values in my target attribute.
- The modelling step (linear regression) requires that I replace missing values, but this results in near-garbage estimates because the target is so sparse.
- Following the modelling step, which data set do I use as the unlabelled data for the Apply Model operator? The same set that was input to the modelling operator, or the original data with no missing values replaced? I don't know how this works.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <parameter key="logfile" value="/home/harb/Documents/DATA/Conductivity/SOM_project/rapidminer/BW6_process.log"/>
    <parameter key="resultfile" value="/home/harb/Documents/DATA/Conductivity/SOM_project/rapidminer/BW6_process.res"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve BengwordenSouth6" width="90" x="45" y="30">
        <parameter key="repository_entry" value="../data/BengwordenSouth6"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.008" expanded="true" height="76" name="Select Attributes" width="90" x="112" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="DEPTH"/>
        <parameter key="attributes" value="CALI|CONDUCTIVITY|DEN|GR|LAT|LN|NEUT|SN|SP"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values" width="90" x="179" y="30">
        <parameter key="attribute" value="CONDUCTIVITY"/>
        <list key="columns">
          <parameter key="CALI" value="average"/>
          <parameter key="DEN" value="average"/>
          <parameter key="GR" value="average"/>
          <parameter key="LAT" value="average"/>
          <parameter key="LN" value="average"/>
          <parameter key="NEUT" value="average"/>
          <parameter key="SN" value="average"/>
          <parameter key="SP" value="average"/>
        </list>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.008" expanded="true" height="76" name="Set Role" width="90" x="246" y="165">
        <parameter key="attribute_name" value="CONDUCTIVITY"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="linear_regression" compatibility="5.3.008" expanded="true" height="94" name="Linear Regression" width="90" x="380" y="30"/>
      <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="514" y="30">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Retrieve BengwordenSouth6" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Linear Regression" to_port="training set"/>
      <connect from_op="Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Linear Regression" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Linear Regression" from_port="weights" to_port="result 3"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
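The two questions in point 2 can be pictured with a small synthetic experiment in Python. This is only a sketch under stated assumptions, not the process above: one made-up predictor `x`, a noise-free target, 8000 rows with 10 known targets, and all variable names invented for illustration. It contrasts fitting on a target whose gaps were filled with the average against fitting on the labelled rows only and then applying the model to the remaining rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8000
x = rng.normal(size=n)                 # a single synthetic log attribute
tc_true = 1.0 + 2.0 * x                # hypothetical relationship, no noise

# Keep the target for only 10 rows; the rest are missing (NaN).
tc = np.full(n, np.nan)
known_rows = rng.choice(n, size=10, replace=False)
tc[known_rows] = tc_true[known_rows]
labelled = ~np.isnan(tc)

# Option A: replace missing targets with the average, then fit on all rows.
# Almost every row now carries the same target value, so the fit flattens.
tc_filled = np.where(labelled, tc, np.nanmean(tc))
slope_filled, intercept_filled = np.polyfit(x, tc_filled, 1)

# Option B: fit on the labelled rows only, then apply to the unlabelled rows.
slope_lab, intercept_lab = np.polyfit(x[labelled], tc[labelled], 1)
tc_estimate = intercept_lab + slope_lab * x[~labelled]

print("slope with averaged target:", slope_filled)
print("slope from labelled rows:  ", slope_lab)
```

In this toy setting the averaged target pulls the slope towards zero, while the labelled-only fit recovers the relationship. With real, noisy logs and only about 10 labelled rows, any such fit would of course be far less certain.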