ExampleSets, Views, and the Materialize Data Operator
Brian_Wells
New Altair Community Member
I am trying to wrap my brain around the difference between a normal ExampleSet and an "ExampleSet view", and why it is such an expensive process to materialize a view with the "Materialize Data" or similar operator. There are a couple of operators I have run across in the past that generate a table of data that looks like a normal ExampleSet but throws an error when connected to most other operators, which expect a regular ExampleSet. An example is the "Data to Similarity" operator, which executes incredibly fast for what it is doing but requires either the related "Similarity to Data" operator or a "Materialize Data" operator to transform the output into a data structure that can be manipulated downstream. This would not be an issue except that materializing the data takes hundreds of times longer than, in this example, the "Data to Similarity" operator itself (by my rough estimation).
In this example, running 10,000 examples through the "Data to Similarity" operator is fairly painless unless you want to use the resulting output for anything other than visual inspection; at least I cannot find any operators that can use the output directly. Adding a "Materialize Data" or "Similarity to Data" operator takes hours to execute on the same dataset, even though the time complexity should be no worse than the previous operator's O(n^2). That said, I have found techniques to extract the data from the "Data to Similarity" operator using loops, macros, and "Create Data", but they appear to be even slower than the built-in methods above.
For some context, I have quite a bit of experience with Java as well as recent coursework in advanced data structures, computability, algorithms, advanced OOP techniques, MapReduce, HDFS, Hive, Spark, etc., but for the life of me I cannot figure out the following:
* What a "view" consists of in this context, how it is created or the underlying Java construct
* Why a view is not able to be manipulated by [most] other operators
* What takes so long to transform a view into a standard ExampleSet
* How I might be able to manipulate the data prior to materializing it so it is less expensive to do so
If I were forced, at the risk of great bodily harm, to guess what is happening behind the scenes, I would think along these lines: a view is a specialized type of heap, perhaps making use of a Bloom filter or another type of hash table as the underlying data structure, while ExampleSets are much more complex storage objects relying on contiguous blocks of memory. This theory would account for the time complexity as well as the relatively low CPU usage during the conversion process (I am running it on a 44-CPU industrial workstation).
Thanks in advance!
P.S. - To demonstrate what I am talking about I included a modified version of the "Document Similarity and Clustering" process from the Training Resources folder within RapidMiner. It is set to run 1,000 examples as configured below:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process" origin="GENERATED_TRAINING">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.3.001" expanded="true" height="82" name="Loop" width="90" x="179" y="85">
<parameter key="number_of_iterations" value="5"/>
<parameter key="iteration_macro" value="iteration"/>
<parameter key="reuse_results" value="true"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="job post data" origin="GENERATED_TRAINING" width="90" x="45" y="85">
<parameter key="repository_entry" value="../data/JobPosts"/>
</operator>
<operator activated="true" class="append" compatibility="9.3.001" expanded="true" height="103" name="Append" width="90" x="313" y="34">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_port="input 1" to_op="Append" to_port="example set 1"/>
<connect from_op="job post data" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="sample" compatibility="9.3.001" expanded="true" height="82" name="Sample" origin="GENERATED_TRAINING" width="90" x="313" y="85">
<parameter key="sample" value="absolute"/>
<parameter key="balance_data" value="false"/>
<parameter key="sample_size" value="1000"/>
<parameter key="sample_ratio" value="0.1"/>
<parameter key="sample_probability" value="0.1"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<description align="center" color="orange" colored="true" width="126">for demo purpose we are sampling this down to make the process complete faster</description>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text" origin="GENERATED_TRAINING" width="90" x="447" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="JobText"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" origin="GENERATED_TRAINING" width="90" x="581" y="85">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (2)" origin="GENERATED_TRAINING" width="90" x="45" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length" value="3"/>
<parameter key="override_content_type_information" value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (2)" origin="GENERATED_TRAINING" width="90" x="179" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases (2)" origin="GENERATED_TRAINING" width="90" x="313" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (English)" origin="GENERATED_TRAINING" width="90" x="447" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Length)" origin="GENERATED_TRAINING" width="90" x="581" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="9999"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="9.3.001" expanded="true" height="82" name="Data to Similarity" origin="GENERATED_TRAINING" width="90" x="715" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<operator activated="true" class="similarity_to_data" compatibility="9.3.001" expanded="true" height="82" name="Similarity to Data" width="90" x="849" y="85">
<parameter key="table_type" value="long_table"/>
</operator>
<operator activated="false" class="read_excel" compatibility="9.3.001" expanded="true" height="68" name="Read Excel" origin="GENERATED_TRAINING" width="90" x="45" y="85">
<parameter key="excel_file" value="D:\RapidMiner\RapidMiner University - Operations\Content Development area\TWM\VancouverDataTextMiningData\VancouverDataTextMiningData.xls"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information"/>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<description align="center" color="orange" colored="true" width="126">instead of providing the excel - we provide pre-loaded data to use instead<br></description>
</operator>
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="job post data (2)" origin="GENERATED_TRAINING" width="90" x="45" y="340">
<parameter key="repository_entry" value="../data/JobPosts"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text (2)" origin="GENERATED_TRAINING" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="JobText"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" origin="GENERATED_TRAINING" width="90" x="313" y="340">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (3)" origin="GENERATED_TRAINING" width="90" x="45" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length" value="3"/>
<parameter key="override_content_type_information" value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (3)" origin="GENERATED_TRAINING" width="90" x="179" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases (3)" origin="GENERATED_TRAINING" width="90" x="313" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (2)" origin="GENERATED_TRAINING" width="90" x="447" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (2)" origin="GENERATED_TRAINING" width="90" x="581" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="9999"/>
</operator>
<connect from_port="document" to_op="Extract Content (3)" to_port="document"/>
<connect from_op="Extract Content (3)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TRAINING" width="90" x="447" y="340">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="40"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="false"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" origin="GENERATED_TRAINING" width="90" x="581" y="442"/>
<operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" origin="GENERATED_TRAINING" width="90" x="715" y="340"/>
<connect from_op="Loop" from_port="output 1" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
<connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
<connect from_op="Similarity to Data" from_port="exampleSet" to_port="result 1"/>
<connect from_op="job post data (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Cluster Model Visualizer" to_port="clustered data"/>
<connect from_op="Multiply" from_port="output 2" to_port="result 4"/>
<connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 2"/>
<connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="63"/>
<portSpacing port="sink_result 2" spacing="147"/>
<portSpacing port="sink_result 3" spacing="21"/>
<portSpacing port="sink_result 4" spacing="21"/>
<portSpacing port="sink_result 5" spacing="42"/>
<description align="right" color="green" colored="true" height="249" resized="true" width="818" x="15" y="41">Part I - Document Similarity</description>
<description align="right" color="gray" colored="true" height="272" resized="true" width="818" x="15" y="294">Part II - Document Clustering</description>
<description align="center" color="yellow" colored="true" height="94" resized="true" width="313" x="494" y="187"><br> You may find the demo video <a href="https://academy.rapidminer.com/learn/video/document-similarity-and-clustering">here</a> on the RapidMiner Academy</description>
</process>
</operator>
</process>
Answers
Hi, I cannot see any view in the example (or maybe I missed something).
But "Data to Similarity" seems to be very fast because it does almost nothing: it simply wraps the data in an external object together with the chosen distance measure. The distances are calculated on the fly, so when you ask for the similarity between doc1 and doc2 it computes that single value at that moment.
In the case of "Similarity to Data" the resulting distances are precomputed, so it takes every instance and calculates its distance to every other instance. The "similarity" input is used only to extract the distance measure, and the input example set is used to calculate the distances directly, which has O(n^2) complexity. That is why it takes so long.
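To illustrate the difference in plain Java, here is a minimal sketch; the class and method names are invented for illustration and are not RapidMiner's actual internals. The lazy object answers a single query cheaply, while producing the long table forces all n^2 pairs to be computed and stored:

import java.util.function.BiFunction;

class SimilaritySketch {
    private final double[][] data;                                // reference to the example set, not a copy
    private final BiFunction<double[], double[], Double> measure; // e.g. cosine similarity

    SimilaritySketch(double[][] data, BiFunction<double[], double[], Double> measure) {
        this.data = data;
        this.measure = measure;
    }

    // "Data to Similarity" style: one pair, computed only when somebody asks for it.
    double get(int i, int j) {
        return measure.apply(data[i], data[j]);
    }

    // "Similarity to Data" style: every pair is computed up front, so n^2 measure
    // evaluations plus n^2 stored rows -- this is the part that takes hours at scale.
    double[][] materializeLongTable() {
        int n = data.length;
        double[][] table = new double[n * n][3];                  // columns: first id, second id, value
        int row = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                table[row][0] = i;
                table[row][1] = j;
                table[row][2] = get(i, j);
                row++;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        double[][] examples = {{1, 0}, {0, 1}, {1, 1}};
        SimilaritySketch sim = new SimilaritySketch(examples, (a, b) -> {
            double dot = 0, na = 0, nb = 0;                       // cosine similarity
            for (int k = 0; k < a.length; k++) {
                dot += a[k] * b[k];
                na += a[k] * a[k];
                nb += b[k] * b[k];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        });
        System.out.println(sim.get(0, 2));                        // cheap: a single on-demand computation
        System.out.println(sim.materializeLongTable().length);    // 9 rows here, 10^8 rows for 10,000 examples
    }
}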
More generally, an ExampleSet is a view on the data. That means that if you perform sampling, the underlying dataset does not change; it is still the same data, but you see only a subset of it. If you apply Materialize Data you make a copy of the data that is visible through the view, which means you need extra RAM for it.
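A rough sketch of the same idea, again purely illustrative rather than RapidMiner's real classes: the "view" is just a set of row indices into the one table that actually stores the values, and materializing means copying the visible rows into fresh arrays:

class ViewSketch {
    public static void main(String[] args) {
        // The underlying data is stored exactly once.
        double[][] table = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};

        // A sampled "view" only remembers which rows it exposes.
        int[] sampledRows = {0, 2};

        // Reading through the view goes straight to the original table; nothing is copied.
        System.out.println(table[sampledRows[1]][0]);             // prints 5.0

        // "Materialize Data": allocate new memory and copy the visible rows into it.
        double[][] materialized = new double[sampledRows.length][];
        for (int r = 0; r < sampledRows.length; r++) {
            materialized[r] = table[sampledRows[r]].clone();      // extra RAM, independent of the view
        }
        System.out.println(materialized.length);                  // prints 2
    }
}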
There is also another type of view. For example, see Normalize Attributes -> there you can check the "create view" box. Normally the operator needs to duplicate each attribute in RAM so it can store the new attribute values, and an ExampleSet is then created that is a view only on the new, normalized attributes. When you check "create view", the new attributes are not stored in RAM; instead they are calculated on the fly, which lengthens every later computation that reads them. It is always a trade-off between caching data in RAM and execution time.
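And a small, hypothetical sketch of that trade-off: the normalized attribute can either be cached as a new column (extra RAM, reads are a plain lookup) or recomputed from the raw column every time it is read (no extra RAM, every read pays the arithmetic):

import java.util.function.IntToDoubleFunction;

class NormalizeViewSketch {
    public static void main(String[] args) {
        double[] raw = {2.0, 4.0, 6.0, 8.0};
        double min = 2.0, max = 8.0;

        // Without "create view": compute once and cache the normalized column (costs RAM).
        double[] cached = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            cached[i] = (raw[i] - min) / (max - min);
        }
        System.out.println(cached[2]);                            // a plain array read

        // With "create view": nothing extra is stored; the value is recomputed on every read.
        IntToDoubleFunction viewAttribute = row -> (raw[row] - min) / (max - min);
        System.out.println(viewAttribute.applyAsDouble(2));       // pays the arithmetic on each access
    }
}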
Best
Marcin
This is a great post, thank you @marcin_blachnik!