"[SOLVED] java.lang.NullPointerException in simple text mining script"

User: "wmarella"
New Altair Community Member
Updated by Jocelyn
Hello, I'm new to text mining and RapidMiner. I'm following a tutorial in "Practical Text Mining" and can't make a very simple script work. The process fails and returns a java.lang.NullPointerException error. I'm running Mac OS X 10.6.8, Java 13.7.2, RapidMiner 5.2.006.

I'm using the Read Excel operator to load a simple three-column spreadsheet. The columns are ID, Year, and Abstract. Abstract contains the text I'm trying to mine. I've flagged ID as the id field, and Abstract is flagged as text in the import wizard. There are 901 examples in the example set, and the Read Excel operator appears to be working because I see my data when hovering over the output node. It also looks correct going into the Process Documents from Data (PDFD) operator at the exa node.

On the PDFD operator, create word vector is checked (TF-IDF), as is keep text. PDFD contains a subprocess with Transform Cases and Tokenize. I've removed all other operators from the process in order to isolate PDFD as the problem. When I hover over the output node of PDFD, it says Examples=0 but still shows my three attribute names.
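
As a rough illustration of what the subprocess described above is meant to compute (lowercase the text, split it on non-letters, then weight terms by TF-IDF), here is a minimal Python sketch. It is not RapidMiner's implementation: the function names and the sample abstracts are hypothetical, only ASCII letters are treated as letters, and RapidMiner's exact TF-IDF normalization may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, then split on runs of non-letter characters.

    Assumption: only ASCII a-z count as letters in this sketch.
    """
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical stand-ins for the ABSTRACT column of the spreadsheet.
abstracts = [
    "Text mining of clinical abstracts.",
    "Clustering abstracts with k-means.",
]
vectors = tf_idf(abstracts)
```

With this weighting, a term that occurs in every document gets weight zero, so a word vector of all zeros is not the same symptom as an example set with zero examples.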

Here is the XML code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.006">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
   <parameter key="logverbosity" value="init"/>
   <parameter key="random_seed" value="2001"/>
   <parameter key="send_mail" value="never"/>
   <parameter key="notification_email" value=""/>
   <parameter key="process_duration_for_mail" value="30"/>
   <parameter key="encoding" value="SYSTEM"/>
   <process expanded="true" height="251" width="413">
     <operator activated="true" class="read_excel" compatibility="5.2.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
       <parameter key="excel_file" value="/Users/Bill/Desktop/Literature_Datsset_1994-2005.xls"/>
       <parameter key="sheet_number" value="1"/>
       <parameter key="imported_cell_range" value="A1:E902"/>
       <parameter key="encoding" value="SYSTEM"/>
       <parameter key="first_row_as_names" value="true"/>
       <list key="annotations">
         <parameter key="0" value="Name"/>
       </list>
       <parameter key="date_format" value=""/>
       <parameter key="time_zone" value="SYSTEM"/>
       <parameter key="locale" value="English (United States)"/>
       <list key="data_set_meta_data_information">
         <parameter key="0" value="ID.true.nominal.attribute"/>
         <parameter key="1" value="YEAR.true.nominal.attribute"/>
         <parameter key="2" value="JOURNAL.true.nominal.attribute"/>
         <parameter key="3" value="ABSTRACT.true.text.attribute"/>
       </list>
       <parameter key="read_not_matching_values_as_missings" value="true"/>
       <parameter key="datamanagement" value="double_array"/>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.2.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="179" y="75">
       <parameter key="create_word_vector" value="true"/>
       <parameter key="vector_creation" value="TF-IDF"/>
       <parameter key="add_meta_information" value="true"/>
       <parameter key="keep_text" value="false"/>
       <parameter key="prune_method" value="absolute"/>
       <parameter key="prunde_below_percent" value="3.0"/>
       <parameter key="prune_above_percent" value="30.0"/>
       <parameter key="prune_below_absolute" value="3"/>
       <parameter key="prune_above_absolute" value="55"/>
       <parameter key="prune_below_rank" value="0.05"/>
       <parameter key="prune_above_rank" value="0.05"/>
       <parameter key="datamanagement" value="double_sparse_array"/>
       <parameter key="select_attributes_and_weights" value="false"/>
       <list key="specify_weights"/>
       <process expanded="true" height="340" width="634">
         <operator activated="true" class="text:transform_cases" compatibility="5.2.002" expanded="true" height="60" name="Transform Cases" width="90" x="59" y="109">
           <parameter key="transform_to" value="lower case"/>
         </operator>
         <operator activated="true" class="text:tokenize" compatibility="5.2.002" expanded="true" height="60" name="Tokenize" width="90" x="169" y="110">
           <parameter key="mode" value="non letters"/>
           <parameter key="characters" value=".:"/>
           <parameter key="language" value="English"/>
           <parameter key="max_token_length" value="3"/>
         </operator>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="297" y="111"/>
         <operator activated="true" class="text:filter_by_length" compatibility="5.2.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="456" y="104">
           <parameter key="min_chars" value="2"/>
           <parameter key="max_chars" value="55"/>
         </operator>
         <connect from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
         <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="k_means" compatibility="5.2.006" expanded="true" height="76" name="Clustering" width="90" x="45" y="165">
       <parameter key="add_cluster_attribute" value="true"/>
       <parameter key="add_as_label" value="false"/>
       <parameter key="remove_unlabeled" value="false"/>
       <parameter key="k" value="2"/>
       <parameter key="max_runs" value="10"/>
       <parameter key="determine_good_start_values" value="false"/>
       <parameter key="measure_types" value="BregmanDivergences"/>
       <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
       <parameter key="nominal_measure" value="NominalDistance"/>
       <parameter key="numerical_measure" value="EuclideanDistance"/>
       <parameter key="divergence" value="SquaredEuclideanDistance"/>
       <parameter key="kernel_type" value="radial"/>
       <parameter key="kernel_gamma" value="1.0"/>
       <parameter key="kernel_sigma1" value="1.0"/>
       <parameter key="kernel_sigma2" value="0.0"/>
       <parameter key="kernel_sigma3" value="2.0"/>
       <parameter key="kernel_degree" value="3.0"/>
       <parameter key="kernel_shift" value="1.0"/>
       <parameter key="kernel_a" value="1.0"/>
       <parameter key="kernel_b" value="0.0"/>
       <parameter key="max_optimization_steps" value="100"/>
       <parameter key="use_local_random_seed" value="false"/>
       <parameter key="local_random_seed" value="1992"/>
     </operator>
     <connect from_port="input 1" to_op="Read Excel" to_port="file"/>
     <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
     <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="source_input 2" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>
Here is the stack trace:

Stack trace:
------------

Exception: java.lang.NullPointerException
Message: null
Stack trace:
 com.rapidminer.operator.nio.model.ExcelResultSetConfiguration.makeDataResultSet(ExcelResultSetConfiguration.java:275)
 com.rapidminer.operator.nio.model.AbstractDataResultSetReader.createExampleSet(AbstractDataResultSetReader.java:127)
 com.rapidminer.operator.io.AbstractExampleSource.read(AbstractExampleSource.java:52)
 com.rapidminer.operator.io.AbstractExampleSource.read(AbstractExampleSource.java:36)
 com.rapidminer.operator.io.AbstractReader.doWork(AbstractReader.java:123)
 com.rapidminer.operator.Operator.execute(Operator.java:834)
 com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
 com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
 com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:379)
 com.rapidminer.operator.Operator.execute(Operator.java:834)
 com.rapidminer.Process.run(Process.java:925)
 com.rapidminer.Process.run(Process.java:848)
 com.rapidminer.Process.run(Process.java:807)
 com.rapidminer.Process.run(Process.java:802)
 com.rapidminer.Process.run(Process.java:792)
 com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)
Thanks in advance for any help you can offer!

Bill
