"Exception: java.lang.ArrayIndexOutOfBoundsException"

New Altair Community Member
Updated by Jocelyn
Hello there,
I am using RM for a simple text clustering task. I load my sentences from Excel and want to cluster them using the k-Means clustering operator. I am running into an odd situation: when I choose EuclideanDistance as the distance measure, the process works and produces a result. However, when I choose CorrelationSimilarity as the measure, it fails with an error. RM itself reports no problem with the current settings, and when I check the log the error is: SEVERE: java.lang.ArrayIndexOutOfBoundsException.
Does anybody have an idea what causes the error?
Thank you, Nils, for the reply. Sure, this is the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.007">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="5.3.007" expanded="true" height="60" name="Read Excel" width="90" x="112" y="75">
<parameter key="excel_file" value="/Users/mfarhadloo/Documents/engapps/Documents/SentimentAnalysis/Codes/Data/P5/Nouns/P5-BON.xlsx"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="75"/>
<operator activated="true" class="text:data_to_documents" compatibility="5.3.000" expanded="true" height="60" name="Data to Documents" width="90" x="380" y="75">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="380" y="210">
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prunde_below_percent" value="1.0"/>
<parameter key="prune_above_percent" value="100.0"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="165"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="112" y="300">
<parameter key="file" value="/Users/mfarhadloo/Documents/engapps/Documents/SentimentAnalysis/Codes/english-stop copy.txt"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="112" y="435">
<parameter key="min_chars" value="2"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="112" y="570"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="5.3.007" expanded="true" height="76" name="Multiply" width="90" x="514" y="210"/>
<operator activated="true" class="k_means" compatibility="5.3.007" expanded="true" height="76" name="Clustering" width="90" x="715" y="120">
<parameter key="k" value="20"/>
<parameter key="max_runs" value="100"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CorrelationSimilarity"/>
<parameter key="kernel_gamma" value="0.5"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="5.3.007" expanded="true" height="94" name="Distance" width="90" x="916" y="120"/>
<connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
<connect from_op="Multiply" from_port="output 1" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Distance" to_port="cluster model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Distance" to_port="example set"/>
<connect from_op="Distance" from_port="performance" to_port="result 2"/>
<connect from_op="Distance" from_port="example set" to_port="result 3"/>
<connect from_op="Distance" from_port="cluster model" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
My data consist of around 700 sentences. I noticed that after preprocessing and representing each sentence as a word vector, some of my sentences are represented by a zero vector (they don't contain any of the words in my word list). Is this the reason for the error I am encountering?
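To check how many sentences end up as zero vectors, the word vectors could be inspected outside RapidMiner. The following is only a minimal sketch in Python: it assumes the vectors have been exported as rows of numbers, and both the `vectors` sample data and the helper name `zero_vector_rows` are made up for illustration.

```python
def zero_vector_rows(vectors):
    """Return the indices of rows whose entries are all zero."""
    return [i for i, row in enumerate(vectors) if not any(row)]


# Hypothetical word vectors: one row per sentence, one column per word.
vectors = [
    [0.0, 0.7, 0.7],
    [0.0, 0.0, 0.0],  # sentence containing no word from the word list
    [1.0, 0.0, 0.0],
]

print(zero_vector_rows(vectors))  # prints [1]
```

Rows reported this way could then be filtered out or handled separately before clustering.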
I couldn't reproduce an ArrayIndexOutOfBoundsException, but you will indeed face another problem with such an example and CorrelationSimilarity: the correlation involving a zero vector (or any other constant vector) is not defined, because its standard deviation is 0.
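To illustrate why the constant vector is a problem, here is a small Python sketch of the Pearson correlation next to the Euclidean distance. It is not RapidMiner's implementation; the function names and the choice to return `None` for the undefined case are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns None when either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Standard deviation is 0, so the quotient would be 0/0.
        return None
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance; defined for any pair of equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


zero = [0.0, 0.0, 0.0]
doc = [3.0, 4.0, 0.0]
print(pearson(zero, doc))    # undefined case, reported as None here
print(euclidean(zero, doc))  # the distance is still well defined
```

This matches the behaviour described in the thread: the Euclidean measure has no trouble with a zero vector, while the correlation has no value to return for it.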
We are currently evaluating whether we should allow CorrelationSimilarity for k-Means at all, because of this undefined input. As far as I know, it is not even clear whether k-Means converges with this measure. At least for my small, simple data set with the option "Determine good start values" enabled, the process does not stop running.
Nevertheless, I will come back to you after we have clarified this. In the meantime, could you post your exception so that I can at least see where it is thrown? I cannot help much if I cannot reproduce the error.
Besides that: are you sure you want to use CorrelationSimilarity? Typically CosineSimilarity is used in text mining, but the two are often mixed up because of their similar names.
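For reference, the two measures are closely related: the correlation is the cosine similarity of the mean-centered vectors. The Python sketch below shows how they can differ on the same input. It is an illustration only, not RapidMiner code; the function names and the `None` return value for undefined cases are assumptions.

```python
import math


def cosine(x, y):
    """Cosine similarity; returns None if either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return None
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Correlation computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
print(cosine(a, b))       # clearly positive: the vectors share all terms
print(correlation(a, b))  # close to -1: the weights run in opposite order

constant = [1.0, 1.0, 1.0]
print(cosine(constant, a))       # defined for a constant non-zero vector
print(correlation(constant, a))  # None: centering turns it into a zero vector
```

Note that the cosine is also undefined for an all-zero vector, so sentences with no words from the word list need handling with either measure.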

This seems to be a bug. Could you please post your process setup here so we can file a bug report?
Best,
Nils