"[SOLVED]java.lang.ArrayIndexOutOfBoundsException: -1 using Find Threshold (Meta)"

wmarella
wmarella New Altair Community Member
edited November 5 in Community Q&A
Hi, I'm doing a text mining project for simple classification of about 1,500 very brief documents. All my text cleaning and transforms work fine. I've tried a number of learners to build a model that predicts the label "relevance" in my dataset. The best I can do in this classification task is an AUC of 0.853 with Naive Bayes, which isn't bad, but I'd like to improve it. I'm trying to use the Find Threshold (Meta) operator to create a third class of cases that the model won't try to classify because the probability is too low. I would rather have very high accuracy/AUC on the classified cases and have a few left over that have to be reviewed manually than get a lot of false positives and negatives.

However, when I put the Naive Bayes learner inside the Find Threshold (Meta) operator and set the cost matrix, I get this Java error at the Find Threshold (Meta) operator when running the process. Could it be because I have Find Threshold (Meta) inside an X-Validation operator? Not sure why that would matter. (Update: I took it out of the X-Validation wrapper, re-ran it, and still got the same error message.) I'll paste my code in below. Thanks in advance for any help.

Bill

UPDATE #2 - I figured out what I was doing wrong and am posting it in case others make the same mistake. When specifying the costs for the different categories of my predicted variable in the cost matrix, I was adding a "?" category for missing values and specifying the cost of missing values (i.e., examples that could not be predicted by the model) within the cost matrix itself, AND I was also checking the "allow unknown predictions" option in the operator dialog box. You only need to do the latter; unknown predictions or missing values should not be addressed directly in the cost matrix.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
   <process expanded="true" height="404" width="614">
     <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
       <parameter key="excel_file" value="/Volumes/NO NAME/emr_labeled.xls"/>
       <parameter key="imported_cell_range" value="A1:L1568"/>
       <parameter key="first_row_as_names" value="false"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="reportid.true.nominal.id"/>
         <parameter key="1" value="EventDate.false.date_time.attribute"/>
         <parameter key="2" value="EventYr.false.integer.attribute"/>
         <parameter key="3" value="EventClassification.false.binominal.attribute"/>
         <parameter key="4" value="HarmScore.false.polynominal.attribute"/>
         <parameter key="5" value="careareadescription.false.polynominal.attribute"/>
         <parameter key="6" value="EventType.true.polynominal.attribute"/>
         <parameter key="7" value="Sub1.true.polynominal.attribute"/>
         <parameter key="8" value="Sub2.true.polynominal.attribute"/>
         <parameter key="9" value="other.false.polynominal.attribute"/>
         <parameter key="10" value="EventDetail.true.text.attribute"/>
         <parameter key="11" value="Relevance.true.binominal.attribute"/>
       </list>
     </operator>
     <operator activated="true" class="set_role" compatibility="5.2.006" expanded="true" height="76" name="Set Role" width="90" x="174" y="30">
       <parameter key="name" value="Relevance"/>
       <parameter key="target_role" value="label"/>
       <list key="set_additional_roles"/>
     </operator>
     <operator activated="true" class="sample" compatibility="5.2.006" expanded="true" height="76" name="Sample" width="90" x="313" y="30">
       <parameter key="sample_size" value="750"/>
       <list key="sample_size_per_class"/>
       <list key="sample_ratio_per_class"/>
       <list key="sample_probability_per_class"/>
       <parameter key="use_local_random_seed" value="true"/>
       <parameter key="local_random_seed" value="120"/>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.2.003" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
       <parameter key="vector_creation" value="Binary Term Occurrences"/>
       <parameter key="keep_text" value="true"/>
       <parameter key="prune_method" value="absolute"/>
       <parameter key="prune_below_absolute" value="2"/>
       <parameter key="prune_above_absolute" value="9999"/>
       <list key="specify_weights"/>
       <process expanded="true" height="391" width="634">
         <operator activated="true" class="text:transform_cases" compatibility="5.2.003" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.2.003" expanded="true" height="60" name="Tokenize" width="90" x="45" y="120"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.003" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="210"/>
         <operator activated="true" class="text:stem_snowball" compatibility="5.2.003" expanded="true" height="60" name="Stem (Snowball)" width="90" x="45" y="300"/>
         <operator activated="true" class="text:filter_by_length" compatibility="5.2.003" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="246" y="30">
           <parameter key="min_chars" value="2"/>
           <parameter key="max_chars" value="9999"/>
         </operator>
         <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.2.003" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="447" y="75">
           <parameter key="file" value="/Volumes/NO NAME/emr_stopwords_short.txt"/>
         </operator>
         <connect from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
         <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
         <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
         <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="generate_concatenation" compatibility="5.2.006" expanded="true" height="76" name="Generate Concatenation" width="90" x="112" y="120">
       <parameter key="first_attribute" value="EventType"/>
       <parameter key="second_attribute" value="Sub1"/>
     </operator>
     <operator activated="true" class="generate_concatenation" compatibility="5.2.006" expanded="true" height="76" name="Generate Concatenation (2)" width="90" x="246" y="120">
       <parameter key="first_attribute" value="EventType_Sub1"/>
       <parameter key="second_attribute" value="Sub2"/>
     </operator>
     <operator activated="true" class="select_attributes" compatibility="5.2.006" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="120">
       <parameter key="attribute_filter_type" value="subset"/>
       <parameter key="attributes" value="EventType|EventType_Sub1|Sub1|Sub2|"/>
       <parameter key="invert_selection" value="true"/>
     </operator>
     <operator activated="true" class="weight_by_chi_squared_statistic" compatibility="5.2.006" expanded="true" height="76" name="Weight by Chi Squared Statistic" width="90" x="45" y="255">
       <parameter key="sort_direction" value="descending"/>
     </operator>
     <operator activated="true" class="select_by_weights" compatibility="5.2.006" expanded="true" height="94" name="Select by Weights" width="90" x="179" y="255">
       <parameter key="weight_relation" value="top k"/>
       <parameter key="k" value="50"/>
     </operator>
     <operator activated="true" class="nominal_to_binominal" compatibility="5.2.006" expanded="true" height="94" name="Nominal to Binominal" width="90" x="313" y="255"/>
     <operator activated="true" class="x_validation" compatibility="5.2.006" expanded="true" height="112" name="Validation" width="90" x="447" y="255">
       <process expanded="true" height="340" width="292">
         <operator activated="true" class="find_threshold_meta" compatibility="5.2.006" expanded="true" height="76" name="Find Threshold (Meta)" width="90" x="50" y="58">
           <list key="class_weights">
             <parameter key="0" value="5.0"/>
             <parameter key="1" value="1.0"/>
             <parameter key="?" value="1.0"/>
           </list>
           <parameter key="allow_unkown_predictions" value="true"/>
           <process expanded="true" height="381" width="634">
             <operator activated="true" class="k_nn" compatibility="5.2.006" expanded="true" height="76" name="k-NN" width="90" x="247" y="35"/>
             <connect from_port="training set" to_op="k-NN" to_port="training set"/>
             <connect from_op="k-NN" from_port="model" to_port="model"/>
             <portSpacing port="source_training set" spacing="0"/>
             <portSpacing port="sink_model" spacing="0"/>
           </process>
         </operator>
         <connect from_port="training" to_op="Find Threshold (Meta)" to_port="training set"/>
         <connect from_op="Find Threshold (Meta)" from_port="model" to_port="model"/>
         <portSpacing port="source_training" spacing="0"/>
         <portSpacing port="sink_model" spacing="0"/>
         <portSpacing port="sink_through 1" spacing="0"/>
       </process>
       <process expanded="true" height="340" width="292">
         <operator activated="true" class="apply_model" compatibility="5.2.006" expanded="true" height="76" name="Apply Model" width="90" x="45" y="87">
           <list key="application_parameters"/>
           <parameter key="create_view" value="true"/>
         </operator>
         <operator activated="true" class="performance_binominal_classification" compatibility="5.2.006" expanded="true" height="76" name="Performance" width="90" x="179" y="165">
           <parameter key="main_criterion" value="AUC"/>
           <parameter key="AUC" value="true"/>
           <parameter key="lift" value="true"/>
           <parameter key="sensitivity" value="true"/>
           <parameter key="specificity" value="true"/>
         </operator>
         <connect from_port="model" to_op="Apply Model" to_port="model"/>
         <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
         <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
         <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
         <portSpacing port="source_model" spacing="0"/>
         <portSpacing port="source_test set" spacing="0"/>
         <portSpacing port="source_through 1" spacing="0"/>
         <portSpacing port="sink_averagable 1" spacing="0"/>
         <portSpacing port="sink_averagable 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
     <connect from_op="Set Role" from_port="example set output" to_op="Sample" to_port="example set input"/>
     <connect from_op="Sample" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Concatenation" to_port="example set input"/>
     <connect from_op="Generate Concatenation" from_port="example set output" to_op="Generate Concatenation (2)" to_port="example set input"/>
     <connect from_op="Generate Concatenation (2)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
     <connect from_op="Select Attributes" from_port="example set output" to_op="Weight by Chi Squared Statistic" to_port="example set"/>
     <connect from_op="Weight by Chi Squared Statistic" from_port="weights" to_op="Select by Weights" to_port="weights"/>
     <connect from_op="Weight by Chi Squared Statistic" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
     <connect from_op="Select by Weights" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
     <connect from_op="Select by Weights" from_port="weights" to_port="result 3"/>
     <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Validation" to_port="training"/>
     <connect from_op="Validation" from_port="model" to_port="result 1"/>
     <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
     <portSpacing port="sink_result 4" spacing="0"/>
   </process>
 </operator>
</process>
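
For anyone hitting the same exception, here is a rough sketch of how the Find Threshold (Meta) operator from the process above would look after the change described in UPDATE #2. It is the same fragment as posted, with only the "?" row dropped from class_weights. The parameter keys and the inner k-NN learner are copied unchanged from the original process. This fragment has not been re-run here, so treat it as an illustration of the fix rather than a verified process.

```xml
<operator activated="true" class="find_threshold_meta" compatibility="5.2.006" expanded="true" height="76" name="Find Threshold (Meta)" width="90" x="50" y="58">
  <list key="class_weights">
    <!-- One cost entry per actual label value only; no "?" entry for unknown predictions -->
    <parameter key="0" value="5.0"/>
    <parameter key="1" value="1.0"/>
  </list>
  <!-- Unknown predictions are enabled here rather than in the cost list (key spelled as exported by RapidMiner) -->
  <parameter key="allow_unkown_predictions" value="true"/>
  <process expanded="true" height="381" width="634">
    <operator activated="true" class="k_nn" compatibility="5.2.006" expanded="true" height="76" name="k-NN" width="90" x="247" y="35"/>
    <connect from_port="training set" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
  </process>
</operator>
```

The rest of the process (X-Validation wrapper, Apply Model, Performance) stays as posted.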

Answers

  • MariusHelf
    MariusHelf New Altair Community Member
    Thanks for the bug report. We have added an item to our internal TODO list to create a more meaningful error message for Find Threshold.