"Creating SVDs in X-Validation operator very slow"

text_miner
text_miner New Altair Community Member
edited November 5 in Community Q&A
I am trying to set up a text mining process in RapidMiner that uses SVDs.  I compared the time it takes to create SVDs from the entire dataset with the time it takes from only a training set (within the training subprocess of an X-Validation operator).  (Both processes are detailed below.)  Using the entire dataset, the whole process finishes within a minute or so.  With the X-Validation operator, the run time increases dramatically; after 45 minutes the SVDs had still not been created.  Any ideas on why creating SVDs takes so much longer inside the X-Validation operator?

For both processes I am using the comp.graphics and comp.windows.x newsgroups mini-datasets available from http://archive.ics.uci.edu/ml/databases/20newsgroups/20newsgroups.html (mini_newsgroups.tar.gz).

Entire Dataset:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="521" width="614">
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
        <list key="text_directories">
          <parameter key="comp.graphics" value="/misc_datasets/mini_newsgroups/comp.graphics"/>
          <parameter key="comp.windows.x" value="/misc_datasets/mini_newsgroups/comp.windows.x"/>
        </list>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="200"/>
        <process expanded="true" height="650" width="1092">
          <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="73" y="30"/>
          <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter by Length" width="90" x="380" y="30">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="50"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter by Length" to_port="document"/>
          <connect from_op="Filter by Length" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF" width="90" x="313" y="165"/>
      <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="447" y="165">
        <parameter key="return_preprocessing_model" value="true"/>
        <parameter key="dimensions" value="100"/>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Generate TFIDF" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 3"/>
      <connect from_op="Generate TFIDF" from_port="example set output" to_op="SVD" to_port="example set input"/>
      <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
      <connect from_op="SVD" from_port="preprocessing model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

X-Validation:

Note: I tried adding a Materialize Data operator before the SVD operator, but it does not seem to speed up the creation of the SVDs.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="521" width="614">
      <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
        <list key="text_directories">
          <parameter key="comp.graphics" value="/misc_datasets/mini_newsgroups/comp.graphics"/>
          <parameter key="comp.windows.x" value="/misc_datasets/mini_newsgroups/comp.windows.x"/>
        </list>
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="200"/>
        <process expanded="true" height="650" width="1092">
          <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="73" y="30"/>
          <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_by_length" expanded="true" height="60" name="Filter by Length" width="90" x="380" y="30">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="50"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter by Length" to_port="document"/>
          <connect from_op="Filter by Length" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="246" y="300">
        <process expanded="true" height="650" width="614">
          <operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF" width="90" x="45" y="30"/>
          <operator activated="true" class="materialize_data" expanded="true" height="76" name="Materialize Data" width="90" x="179" y="30">
            <parameter key="datamanagement" value="double_sparse_array"/>
          </operator>
          <operator activated="true" class="singular_value_decomposition" expanded="true" height="94" name="SVD" width="90" x="313" y="30">
            <parameter key="return_preprocessing_model" value="true"/>
            <parameter key="dimensions" value="100"/>
          </operator>
          <operator activated="true" class="logistic_regression" expanded="true" height="94" name="Logistic Regression" width="90" x="447" y="30"/>
          <connect from_port="training" to_op="Generate TFIDF" to_port="example set input"/>
          <connect from_op="Generate TFIDF" from_port="example set output" to_op="Materialize Data" to_port="example set input"/>
          <connect from_op="Materialize Data" from_port="example set output" to_op="SVD" to_port="example set input"/>
          <connect from_op="SVD" from_port="example set output" to_op="Logistic Regression" to_port="training set"/>
          <connect from_op="SVD" from_port="preprocessing model" to_port="through 1"/>
          <connect from_op="Logistic Regression" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <portSpacing port="sink_through 2" spacing="0"/>
        </process>
        <process expanded="true" height="650" width="547">
          <operator activated="true" class="generate_tfidf" expanded="true" height="76" name="Generate TFIDF (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model (2)" width="90" x="313" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_binominal_classification" expanded="true" height="76" name="Performance" width="90" x="380" y="165">
            <parameter key="main_criterion" value="f_measure"/>
            <parameter key="AUC (optimistic)" value="true"/>
            <parameter key="precision" value="true"/>
            <parameter key="recall" value="true"/>
            <parameter key="lift" value="true"/>
            <parameter key="fallout" value="true"/>
            <parameter key="f_measure" value="true"/>
            <parameter key="false_positive" value="true"/>
            <parameter key="false_negative" value="true"/>
            <parameter key="true_positive" value="true"/>
            <parameter key="true_negative" value="true"/>
            <parameter key="sensitivity" value="true"/>
            <parameter key="specificity" value="true"/>
            <parameter key="youden" value="true"/>
            <parameter key="positive_predictive_value" value="true"/>
            <parameter key="negative_predictive_value" value="true"/>
            <parameter key="psep" value="true"/>
          </operator>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Generate TFIDF (2)" to_port="example set input"/>
          <connect from_port="through 1" to_op="Apply Model" to_port="model"/>
          <connect from_op="Generate TFIDF (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="source_through 2" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Any help would be greatly appreciated.  Thanks!

Answers

  • land
    land New Altair Community Member
    Hi,
    I would guess the problem arises because there are fewer examples. This might produce a worse-conditioned matrix, so that the SVD algorithm either hangs or needs longer to compute the results. Did you try changing the random seed? A different distribution of the examples across the folds might solve the problem.

    Greetings,
      Sebastian
  • text_miner
    text_miner New Altair Community Member
    Sebastian,

    Thanks for the reply.  After trying different seed values I was still getting the same problem.  So I investigated a little further and found the solution. 

    The issue was due to missing values being introduced into the dataset after calculating TFIDF values for the term-by-document matrix.  Since only a subset of the data was used in training each fold, there were certain attributes (i.e., terms) that had zero occurrences for all examples.  For those attributes, the TFIDF operator put missing values ("?") for all examples of that term. 
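
    To see where the "?" values can come from, here is a minimal Java sketch of the arithmetic for a single cell. It assumes the unpatched operator computes IDF as log(N / df) with no guard for df = 0, and that RapidMiner shows a NaN cell as a missing value. The class and method names are made up for illustration and are not part of RapidMiner.

```java
// Sketch only: mimics the TF-IDF arithmetic for one cell, not RapidMiner's actual code.
class TfIdfNaNDemo {

    // tf:      term frequency of the cell
    // numDocs: number of examples in the training fold
    // docFreq: number of examples in the fold that contain the term
    static double unguardedTfIdf(double tf, int numDocs, int docFreq) {
        // With docFreq == 0 the division yields +Infinity, and log(+Infinity) is +Infinity.
        double idf = Math.log((double) numDocs / (double) docFreq);
        // 0.0 * Infinity is NaN under IEEE 754 arithmetic.
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term that occurs in no example of the fold (the counts are hypothetical).
        System.out.println(unguardedTfIdf(0.0, 180, 0));
        // A term that occurs in 30 of 180 examples gives an ordinary finite value.
        System.out.println(unguardedTfIdf(0.25, 180, 30));
    }
}
```

    For a term that occurs in no example of the fold, the first call returns NaN, so the whole column reaches the SVD operator as missing values.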

    The solution was to use the Replace Missing Values operator after the TFIDF operator to replace all missing values with zero.  After replacing the missing values, the SVD operator worked without a problem.

    Thanks again for the reply!
  • land
    land New Altair Community Member
    Hi,
    OK, then it seems to be a good idea to have the SVD operator issue a warning that it cannot cope with missing values. I will note that down.

    Greetings,
      Sebastian
  • text_miner
    text_miner New Altair Community Member
    Sebastian,

    I agree, a warning would be nice.

    In addition, another thing to consider is changing the TFIDFFilter class to set zeros for columns without any counts.  Although the missing values can currently be changed to zeros with the Replace Missing Values operator, this (1) requires the use of another operator and (2) changes the order of attributes in the matrix.  While the first point is not a big deal, I imagine the second point may cause problems.  For example, consider creating SVDs with a training set and then wanting to map (i.e., fold in) examples from the testing set into the pre-existing latent semantic space.  This example assumes that TFIDF was applied to the training and testing sets separately (in reality, the IDF values from the training set would probably be applied to the testing set) and that the two sets have different attributes with zero counts.  To fold in these new "pseudo documents", the order of the attributes must be the same in both sets.
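
    The attribute-order concern can be sketched in plain Java. This assumes the common LSI fold-in convention for a documents-by-terms matrix A = U S V^T, where a new document d is mapped to d V S^-1. The matrix V, the singular values, and the term counts below are toy values chosen for illustration, and foldIn is a hypothetical helper, not a RapidMiner method.

```java
import java.util.Arrays;

// Sketch only: folds a term vector into a k-dimensional latent space.
class FoldInDemo {

    // coords[j] = (sum over terms t of doc[t] * v[t][j]) / sigma[j]
    static double[] foldIn(double[] doc, double[][] v, double[] sigma) {
        if (doc.length != v.length) {
            throw new IllegalArgumentException("document and V must cover the same terms");
        }
        double[] coords = new double[sigma.length];
        for (int j = 0; j < sigma.length; j++) {
            double dot = 0.0;
            for (int t = 0; t < doc.length; t++) {
                dot += doc[t] * v[t][j];
            }
            coords[j] = dot / sigma[j];
        }
        return coords;
    }

    public static void main(String[] args) {
        // Toy right singular vectors for 3 terms and k = 2 (rows follow the training attribute order).
        double[][] v = {{1.0, 0.0}, {0.0, 1.0}, {0.0, 0.0}};
        double[] sigma = {2.0, 1.0};

        // Same test document, once in the training attribute order and once with the columns reordered.
        double[] sameOrder = {4.0, 0.0, 3.0};
        double[] reordered = {3.0, 4.0, 0.0};

        System.out.println(Arrays.toString(foldIn(sameOrder, v, sigma)));
        System.out.println(Arrays.toString(foldIn(reordered, v, sigma)));
    }
}
```

    The two calls return different coordinates for the same document, which is why a changed attribute order would silently place test documents at the wrong position in the latent space.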

    Listed below is the TFIDFFilter class with two simple changes to set zeros for columns without any counts.  The first change is on line 106 and just makes sure at least one document has a count for the current term before trying to calculate IDF.  The second change adds an OR to lines 118-119; the value is set to zero if IDF is zero for the current term.

    /*
    *  RapidMiner
    *
    *  Copyright (C) 2001-2009 by Rapid-I and the contributors
    *
    *  Complete list of developers available at our web site:
    *
    *      http://rapid-i.com
    *
    *  This program is free software: you can redistribute it and/or modify
    *  it under the terms of the GNU Affero General Public License as published by
    *  the Free Software Foundation, either version 3 of the License, or
    *  (at your option) any later version.
    *
    *  This program is distributed in the hope that it will be useful,
    *  but WITHOUT ANY WARRANTY; without even the implied warranty of
    *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    *  GNU Affero General Public License for more details.
    *
    *  You should have received a copy of the GNU Affero General Public License
    *  along with this program.  If not, see http://www.gnu.org/licenses/.
    */
    package com.rapidminer.operator.preprocessing.filter;

    import java.util.LinkedList;
    import java.util.List;

    import com.rapidminer.example.Attribute;
    import com.rapidminer.example.Example;
    import com.rapidminer.example.ExampleSet;
    import com.rapidminer.operator.OperatorDescription;
    import com.rapidminer.operator.OperatorException;
    import com.rapidminer.operator.UserError;
    import com.rapidminer.operator.ports.metadata.AttributeMetaData;
    import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
    import com.rapidminer.operator.ports.metadata.MetaData;
    import com.rapidminer.operator.ports.metadata.SetRelation;
    import com.rapidminer.operator.preprocessing.AbstractDataProcessing;
    import com.rapidminer.parameter.ParameterType;
    import com.rapidminer.parameter.ParameterTypeBoolean;
    import com.rapidminer.parameter.UndefinedParameterError;


    /**
    * This operator generates TF-IDF values from the input data. The input example
    * set must contain either simple counts, which will be normalized during
    * calculation of the term frequency TF, or it already contains the calculated
    * term frequency values (in this case no normalization will be done).
    *
    * @author Ingo Mierswa
    */
    public class TFIDFFilter extends AbstractDataProcessing {

    /** The parameter name for &quot;Indicates if term frequency values should be generated (must be done if input data is given as simple occurence counts).&quot; */
    public static final String PARAMETER_CALCULATE_TERM_FREQUENCIES = "calculate_term_frequencies";

    public TFIDFFilter(OperatorDescription description) {
    super(description);
    }

    @Override
    protected MetaData modifyMetaData(ExampleSetMetaData metaData) throws UndefinedParameterError {
    for (AttributeMetaData amd: metaData.getAllAttributes()) {
    if (!amd.isSpecial() && amd.isNumerical()) {
    amd.getMean().setUnkown();
    amd.setValueSetRelation(SetRelation.UNKNOWN);
    }
    }
    return metaData;
    }

    @Override
    public ExampleSet apply(ExampleSet exampleSet) throws OperatorException {
    if (exampleSet.size() < 1)
    throw new UserError(this, 110, new Object[] { "1" });
    if (exampleSet.getAttributes().size() == 0)
    throw new UserError(this, 106, new Object[0]);

    // init
    double[] termFrequencySum = new double[exampleSet.size()];
    List<Attribute> attributes = new LinkedList<Attribute>();
    for (Attribute attribute: exampleSet.getAttributes()) {
    if (attribute.isNumerical())
    attributes.add(attribute);
    }
    int[] documentFrequencies = new int[attributes.size()];

    // calculate frequencies
    int exampleCounter = 0;
    for (Example example: exampleSet) {
    int i = 0;
    for (Attribute attribute : attributes) {
    double value = example.getValue(attribute);
    termFrequencySum[exampleCounter] += value;
    if (value > 0)
    documentFrequencies[i]++;
    i++;
    }
    exampleCounter++;
    checkForStop();
    }

    // calculate IDF values
    double[] inverseDocumentFrequencies = new double[documentFrequencies.length];
    for (int i = 0; i < attributes.size(); i++) {
    // only compute IDF for terms that occur in at least one example; otherwise it stays 0
    if (documentFrequencies[i] > 0) {
    inverseDocumentFrequencies[i] = Math.log((double) exampleSet.size() / (double) documentFrequencies[i]);
    }
    }

    // set values
    boolean calculateTermFrequencies = getParameterAsBoolean(PARAMETER_CALCULATE_TERM_FREQUENCIES);
    exampleCounter = 0;
    for (Example example: exampleSet) {
    int i = 0;
    for (Attribute attribute : attributes) {
    double value = example.getValue(attribute);
    if (termFrequencySum[exampleCounter] == 0.0d ||
    inverseDocumentFrequencies[i] == 0.0d) {
    example.setValue(attribute, 0.0d);
    } else {
    double tf = value;
    if (calculateTermFrequencies)
    tf /= termFrequencySum[exampleCounter];
    double idf = inverseDocumentFrequencies[i];
    example.setValue(attribute, (tf * idf));
    }
    i++;
    }
    exampleCounter++;
    checkForStop();
    }
    return exampleSet;
    }

    @Override
    public List<ParameterType> getParameterTypes() {
    List<ParameterType> types = super.getParameterTypes();
    ParameterType type = new ParameterTypeBoolean(PARAMETER_CALCULATE_TERM_FREQUENCIES, "Indicates if term frequency values should be generated (must be done if input data is given as simple occurence counts).", true);
    type.setExpert(false);
    types.add(type);
    return types;
    }

    }
    Thanks!
  • land
    land New Altair Community Member
    Hi,
    I will add this and it will be included in the upcoming final version.

    Anyway, we usually use the TFIDF filter of the Process Documents operator, where, as far as I know, this error does not arise.

    Greetings,
      Sebastian