NullPointerException
Darme
Hi,
I am a newbie to RapidMiner. I am trying to use Expectation Maximization to cluster some data. I have around 500,000 rows of data in a .csv file. I am using the process "Read CSV" -> Normalize -> Replace Missing Values -> Clustering.
However, I always get a NullPointerException at clustering time.
Am I doing something wrong here?
Thanks in advance
Darme
Answers
Do you get an error dialog that allows you to submit a bug report? If so, please use the corresponding button.
If there is no such dialog, please post your process setup and give us a detailed description of your data (number and types of attributes, and any particularities).
Best regards,
Marius
Hi Marius,
Thank you for your prompt reply. The following is the error message I get:
The setup does not seem to contain any obvious errors, but you should check the log messages or activate the debug mode in the settings dialog in order to get more information about this problem.
The log contains the following:
subprocess 'Main Process'
+- Read CSV[1] (Read CSV)
+- Normalize[1] (Normalize)
+- Replace Missing Values[1] (Replace Missing Values)
==> +- Clustering[1] (Expectation Maximization Clustering)
Apr 23, 2013 4:49:13 PM SEVERE: java.lang.NullPointerException
The data has 11 attributes, which are of types text, number and date. In the Normalize operator I have set the value type to numeric.
In the clustering operator I have set the initial distribution to randomly assigned examples.
In Replace Missing Values I have set the attribute filter type to all and the default to average.
Do you need any more information? Please let me know.
Thanks again
Darme
Hi,
It seems that you also have missing values in your nominal and/or date attributes. You should remove or replace all missing values before applying Expectation Maximization Clustering.
Best regards,
Marius
Hi again,
I added two Replace Missing Values steps to the process. One has the attribute filter type set to "value_type" with the value type text, the default set to value, and the replenishment value set to "extra".
The other has the value type "date" and a replenishment value of 23/4/2013.
I still get the same error. Am I still on the wrong path? Please help.
Thank you very much
Darme
Can you please post your process setup as described in the post linked in my signature?
Additionally, try to set a breakpoint before the clustering operator and inspect the metadata for missing values.
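For readers following along in the process XML, here is a minimal sketch of what a breakpoint before the clustering operator might look like. It assumes that RapidMiner stores breakpoints as a "breakpoints" attribute on the operator element; that attribute does not appear in any process posted in this thread, so treat it as an assumption and set the breakpoint through the GUI if in doubt. The operator class and the "inital_distribution" key (spelled as in the posted XML) are copied from the poster's process.

```xml
<!-- Sketch only: the breakpoints attribute is an assumption, not taken from this thread. -->
<operator activated="true" breakpoints="before" class="expectation_maximization_clustering" compatibility="5.1.009" expanded="true" name="Clustering">
  <!-- Parameter copied from the posted process; the key spelling is the one used there. -->
  <parameter key="inital_distribution" value="randomly assigned examples"/>
</operator>
```

When the process pauses at the breakpoint, the example set arriving at the clustering operator can be inspected before the clustering itself runs.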
Best regards,
Marius
Hi Marius,
Once again, thank you for your advice.
I have attached the code of the process I am using, and I believe all the required information is there.
Since I have a very large data set, if a breakpoint is set for clustering then I think I would need to step through the rows of data one by one.
Is there a way to stop when a value is missing, similar to setting conditions on breakpoints?
Thanks and regards
Darrshan
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.009">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.009" expanded="true" name="Process">
<process expanded="true" height="494" width="709">
<operator activated="true" class="read_csv" compatibility="5.1.009" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
<parameter key="csv_file" value="C:\Users\yahoo\Desktop\CSEtemp.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="StockCode.true.text.attribute"/>
<parameter key="1" value="SectorKey.true.text.attribute"/>
<parameter key="2" value="TimeKey.true.date.attribute"/>
<parameter key="3" value="OpenPrice.true.real.attribute"/>
<parameter key="4" value="ClosePrice.true.real.attribute"/>
<parameter key="5" value="NetChange.true.real.attribute"/>
<parameter key="6" value="ChangePercentage.true.real.attribute"/>
<parameter key="7" value="Highest.true.real.attribute"/>
<parameter key="8" value="Lowest.true.real.attribute"/>
<parameter key="9" value="Volume.true.integer.attribute"/>
<parameter key="10" value="TotalValue.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="normalize" compatibility="5.1.009" expanded="true" height="94" name="Normalize" width="90" x="45" y="255">
<parameter key="attribute_filter_type" value="value_type"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values (3)" width="90" x="179" y="345">
<list key="columns"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="345">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="text"/>
<parameter key="default" value="value"/>
<list key="columns">
<parameter key="SectorKey" value="value"/>
<parameter key="StockCode" value="value"/>
<parameter key="TimeKey" value="value"/>
</list>
<parameter key="replenishment_value" value="extra"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="447" y="345">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="date"/>
<parameter key="default" value="value"/>
<list key="columns"/>
<parameter key="replenishment_value" value="23/4/2013"/>
</operator>
<operator activated="true" class="expectation_maximization_clustering" compatibility="5.1.009" expanded="true" height="76" name="Clustering" width="90" x="514" y="75">
<parameter key="k" value="3"/>
<parameter key="add_as_label" value="true"/>
<parameter key="use_local_random_seed" value="true"/>
<parameter key="inital_distribution" value="randomly assigned examples"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Replace Missing Values (3)" to_port="example set input"/>
<connect from_op="Replace Missing Values (3)" from_port="original" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
<connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
No, you don't need to check each row one by one: just switch to the metadata view in the results perspective, and for each attribute you'll see the number of missing values.
Anyway, my suspicion is that in the second Replace Missing Values operator you should select the value_type nominal, polynominal or binominal instead of text (text is a special data type used only in the Text Processing extension).
Experiment with that setting, *and* check the result with a breakpoint.
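As a rough sketch of that change, the second Replace Missing Values operator from the posted process could look like the fragment below with the value type switched from text to nominal. All parameter keys and the replenishment value "extra" are taken from the poster's XML; "nominal" is one of the three value types suggested above. Whether nominal, polynominal or binominal fits best depends on how the attributes are declared in Read CSV, so this is an assumption to experiment with, not a confirmed fix.

```xml
<!-- Sketch: same operator as in the posted process, with value_type changed from "text". -->
<operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" name="Replace Missing Values">
  <parameter key="attribute_filter_type" value="value_type"/>
  <!-- Assumption: "nominal" is chosen here; polynominal or binominal may fit the data better. -->
  <parameter key="value_type" value="nominal"/>
  <parameter key="default" value="value"/>
  <list key="columns"/>
  <parameter key="replenishment_value" value="extra"/>
</operator>
```

For this filter to select anything, the attributes declared as text in the Read CSV meta data (StockCode and SectorKey in the posted process) would presumably need to be declared with a matching nominal type as well.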
Best regards,
Marius
Hi,
As you advised, I changed the settings of the Replace Missing Values operator and also changed the Read CSV operator's data types accordingly.
I am still getting the same result.
I also created breakpoints before the clustering, and in the metadata view the "Missing value" column shows only "?". I also set breakpoints at each step and looked at the metadata, and the result was the same.
Furthermore, I created the given schema on an MS SQL Server evaluation edition and ran a query to retrieve null values for the given data set. The result was that there are no null values.
Do you think something else has gone wrong? Is any more information needed?
Thanks again
Darme
I have tried to reproduce your error with my own data (with missing values included), but your process runs without an error. Your process XML says you are still using a quite old version (5.1). Could you update RapidMiner to 5.3.8 and check again?
Hi again,
I updated to 5.3.008 and still get the same error. Could it be some setting or configuration issue?
Could you send me your XML file so that I can check it here?
Many thanks again
Darme
Hi again,
I tried out RM version 5.3.8 with modifications to the process, but the result is still the same.
I have attached the XML code herewith.
It seems something is fundamentally wrong, either in the way I am doing this or in the data.
Could you please share your XML so that I can try it out with my data?
Thanks a lot
Darme
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="120">
<parameter key="csv_file" value="C:\Users\yahoo\Desktop\CSEtemp.csv"/>
<parameter key="column_separators" value=","/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<parameter key="encoding" value="windows-1252"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="StockCode.true.polynominal.attribute"/>
<parameter key="1" value="SectorKey.true.binominal.attribute"/>
<parameter key="2" value="TimeKey.true.date.attribute"/>
<parameter key="3" value="OpenPrice.true.real.attribute"/>
<parameter key="4" value="ClosePrice.true.real.attribute"/>
<parameter key="5" value="NetChange.true.real.attribute"/>
<parameter key="6" value="ChangePercentage.true.real.attribute"/>
<parameter key="7" value="Highest.true.real.attribute"/>
<parameter key="8" value="Lowest.true.real.attribute"/>
<parameter key="9" value="Volume.true.integer.attribute"/>
<parameter key="10" value="TotalValue.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="normalize" compatibility="5.3.008" expanded="true" height="94" name="Normalize" width="90" x="45" y="255"/>
<operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values" width="90" x="112" y="390">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="date"/>
<parameter key="default" value="zero"/>
<list key="columns"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="246" y="390">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="real"/>
<list key="columns"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (3)" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="binominal"/>
<parameter key="default" value="value"/>
<list key="columns">
<parameter key="SectorKey" value="value"/>
</list>
<parameter key="replenishment_value" value="BFI"/>
</operator>
<operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (4)" width="90" x="514" y="390">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="polynominal"/>
<parameter key="default" value="value"/>
<list key="columns">
<parameter key="StockCode" value="value"/>
</list>
<parameter key="replenishment_value" value="AAAA"/>
</operator>
<operator activated="true" class="expectation_maximization_clustering" compatibility="5.3.008" expanded="true" height="76" name="Clustering" width="90" x="514" y="210">
<parameter key="inital_distribution" value="randomly assigned examples"/>
</operator>
<connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="original" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="original" to_op="Replace Missing Values (2)" to_port="example set input"/>
<connect from_op="Replace Missing Values (2)" from_port="original" to_op="Replace Missing Values (3)" to_port="example set input"/>
<connect from_op="Replace Missing Values (3)" from_port="original" to_op="Replace Missing Values (4)" to_port="example set input"/>
<connect from_op="Replace Missing Values (4)" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
OK, I think I see where the problem is. It is a very subtle error that I did not spot right away in your processes. You are connecting the second output port of your Replace Missing Values operators to the next operator. The three letters "ori" indicate that this is the original output, which is passed through without any changes, so your data still contains missing values. Please use the first port, "exa".
For the NullPointerException we have already created an internal ticket.
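In terms of the posted 5.3.008 process XML, the rewiring can be sketched as the connect lines below, where every from_port="original" is replaced by from_port="example set output" (the port name already used on the last Replace Missing Values connection in that process). Note that the posted process also takes the "original" port of Normalize; the sketch switches that one too, on the assumption that the normalized data is what should reach the clustering.

```xml
<!-- Sketch: connections from the posted process, using the processed output port throughout. -->
<connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
<!-- Assumption: the normalized data is wanted, so "original" is replaced here as well. -->
<connect from_op="Normalize" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
<connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Replace Missing Values (3)" to_port="example set input"/>
<connect from_op="Replace Missing Values (3)" from_port="example set output" to_op="Replace Missing Values (4)" to_port="example set input"/>
<connect from_op="Replace Missing Values (4)" from_port="example set output" to_op="Clustering" to_port="example set"/>
```

With this wiring each operator receives the output of the previous replacement step instead of the untouched input data.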
Many thanks for your advice.
I used the above process with "exa" as the output and got rid of the NullPointerException.
However, I have some issues with the result.
1. In the Replace Missing Values operator for date, I provided zero as the value, and all of the date values have been replaced by "Jan 1, 1970".
2. In the Replace Missing Values operator for real, I set the default value to average, and in most of the columns the actual values have been replaced by the average figure.
3. In the Replace Missing Values operator for binominal, I set the default value to "BFI", and all of the actual values have been replaced with this.
Is it possible for me to do the clustering with the actual values? Is there any reason why the tool replaces actual values with the replacement values?
In another experiment, I kept all of the above the same but altered the Replace Missing Values operator for date by setting a default value of 1/1/2009. Then I got the NullPointerException again.
Could you explain this behaviour?
Once again, thank you for your understanding and continued help in this regard. I hope for answers to my questions.
Regards
Darme
Hi Marius,
I managed to get results by trying out various options in the tool. Mainly, I used attribute_type for all attributes rather than their data types and set one as the prediction. I guess that if we keep attributes in some data types there could be a NullPointerException, possibly because of data type mismatches. Please correct me if I am wrong here.
Once again, thank you very much for all your help in this regard.
P.S. Shall I mark this issue as solved?
Regards
Darrshan