ExampleSets, Views, and the Materialize Data Operator
Brian_Wells
New Altair Community Member
I am trying to wrap my brain around the difference between a normal ExampleSet and an "ExampleSet view", and why it is such an expensive process to materialize a view with the "Materialize Data" or similar operator. There are a couple of operators I have run across in the past that generate a table of data that looks like a normal ExampleSet but throws an error when connected to most other operators, which expect a regular ExampleSet. An example is the "Data to Similarity" operator, which executes incredibly fast for what it is doing but requires either the related "Similarity to Data" operator or a "Materialize Data" operator to transform the output into a data structure that can be manipulated downstream. This would not be an issue except that materializing the data takes hundreds of times longer than, in this example, the "Data to Similarity" operator itself (by my rough estimation).
In this example, running 10,000 examples through the "Data to Similarity" operator is fairly painless unless you want to use the resulting output for anything other than visual inspection; at least I cannot find any operators that can use the output directly. Adding a "Materialize Data" or "Similarity to Data" operator takes hours to execute on the same dataset, even though the time complexity should be no worse than the previous operator's O(n^2). That said, I have found techniques to extract the data from the "Data to Similarity" operator using loops, macros, and "Create Data", but they appear to be even slower than the built-in methods above.
For some context, I have quite a bit of experience with Java as well as recent coursework in advanced data structures, computability, algorithms, advanced OOP techniques, MapReduce, HDFS, Hive, Spark, etc., but for the life of me I cannot figure out the following:
* What a "view" consists of in this context, how it is created or the underlying Java construct
* Why a view is not able to be manipulated by [most] other operators
* What takes so long to transform a view into a standard ExampleSet
* How I might be able to manipulate the data prior to materializing it so it is less expensive to do so
If I were forced, at the risk of great bodily harm, to guess what is happening behind the scenes, I would think along these lines: a view is a specialized type of heap, perhaps making use of a Bloom filter or another type of hash table as the underlying data structure, while ExampleSets are much more complex storage objects relying on contiguous blocks of memory. This theory would account for the time complexity as well as the relatively low CPU usage during the conversion process (I am running it on a 44-CPU industrial workstation).
Thanks in advance!
P.S. - To demonstrate what I am talking about I included a modified version of the "Document Similarity and Clustering" process from the Training Resources folder within RapidMiner. It is set to run 1,000 examples as configured below:
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process" origin="GENERATED_TRAINING">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="9.3.001" expanded="true" height="82" name="Loop" width="90" x="179" y="85">
<parameter key="number_of_iterations" value="5"/>
<parameter key="iteration_macro" value="iteration"/>
<parameter key="reuse_results" value="true"/>
<parameter key="enable_parallel_execution" value="false"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="job post data" origin="GENERATED_TRAINING" width="90" x="45" y="85">
<parameter key="repository_entry" value="../data/JobPosts"/>
</operator>
<operator activated="true" class="append" compatibility="9.3.001" expanded="true" height="103" name="Append" width="90" x="313" y="34">
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="merge_type" value="all"/>
</operator>
<connect from_port="input 1" to_op="Append" to_port="example set 1"/>
<connect from_op="job post data" from_port="output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="sample" compatibility="9.3.001" expanded="true" height="82" name="Sample" origin="GENERATED_TRAINING" width="90" x="313" y="85">
<parameter key="sample" value="absolute"/>
<parameter key="balance_data" value="false"/>
<parameter key="sample_size" value="1000"/>
<parameter key="sample_ratio" value="0.1"/>
<parameter key="sample_probability" value="0.1"/>
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<description align="center" color="orange" colored="true" width="126">for demo purpose we are sampling this down to make the process complete faster</description>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text" origin="GENERATED_TRAINING" width="90" x="447" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="JobText"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" origin="GENERATED_TRAINING" width="90" x="581" y="85">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (2)" origin="GENERATED_TRAINING" width="90" x="45" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length" value="3"/>
<parameter key="override_content_type_information" value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (2)" origin="GENERATED_TRAINING" width="90" x="179" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases (2)" origin="GENERATED_TRAINING" width="90" x="313" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (English)" origin="GENERATED_TRAINING" width="90" x="447" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Length)" origin="GENERATED_TRAINING" width="90" x="581" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="9999"/>
</operator>
<connect from_port="document" to_op="Extract Content (2)" to_port="document"/>
<connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="9.3.001" expanded="true" height="82" name="Data to Similarity" origin="GENERATED_TRAINING" width="90" x="715" y="85">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<operator activated="true" class="similarity_to_data" compatibility="9.3.001" expanded="true" height="82" name="Similarity to Data" width="90" x="849" y="85">
<parameter key="table_type" value="long_table"/>
</operator>
<operator activated="false" class="read_excel" compatibility="9.3.001" expanded="true" height="68" name="Read Excel" origin="GENERATED_TRAINING" width="90" x="45" y="85">
<parameter key="excel_file" value="D:\RapidMiner\RapidMiner University - Operations\Content Development area\TWM\VancouverDataTextMiningData\VancouverDataTextMiningData.xls"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information"/>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
<description align="center" color="orange" colored="true" width="126">instead of providing the excel - we provide pre-loaded data to use instead<br></description>
</operator>
<operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="job post data (2)" origin="GENERATED_TRAINING" width="90" x="45" y="340">
<parameter key="repository_entry" value="../data/JobPosts"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text (2)" origin="GENERATED_TRAINING" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="JobText"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" origin="GENERATED_TRAINING" width="90" x="313" y="340">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content (3)" origin="GENERATED_TRAINING" width="90" x="45" y="34">
<parameter key="extract_content" value="true"/>
<parameter key="minimum_text_block_length" value="3"/>
<parameter key="override_content_type_information" value="true"/>
<parameter key="neglegt_span_tags" value="true"/>
<parameter key="neglect_p_tags" value="true"/>
<parameter key="neglect_b_tags" value="true"/>
<parameter key="neglect_i_tags" value="true"/>
<parameter key="neglect_br_tags" value="true"/>
<parameter key="ignore_non_html_tags" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (3)" origin="GENERATED_TRAINING" width="90" x="179" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases (3)" origin="GENERATED_TRAINING" width="90" x="313" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (2)" origin="GENERATED_TRAINING" width="90" x="447" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (2)" origin="GENERATED_TRAINING" width="90" x="581" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="9999"/>
</operator>
<connect from_port="document" to_op="Extract Content (3)" to_port="document"/>
<connect from_op="Extract Content (3)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TRAINING" width="90" x="447" y="340">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="40"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="false"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" origin="GENERATED_TRAINING" width="90" x="581" y="442"/>
<operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" origin="GENERATED_TRAINING" width="90" x="715" y="340"/>
<connect from_op="Loop" from_port="output 1" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_op="Similarity to Data" to_port="similarity"/>
<connect from_op="Data to Similarity" from_port="example set" to_op="Similarity to Data" to_port="exampleSet"/>
<connect from_op="Similarity to Data" from_port="exampleSet" to_port="result 1"/>
<connect from_op="job post data (2)" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Cluster Model Visualizer" to_port="clustered data"/>
<connect from_op="Multiply" from_port="output 2" to_port="result 4"/>
<connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 2"/>
<connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="63"/>
<portSpacing port="sink_result 2" spacing="147"/>
<portSpacing port="sink_result 3" spacing="21"/>
<portSpacing port="sink_result 4" spacing="21"/>
<portSpacing port="sink_result 5" spacing="42"/>
<description align="right" color="green" colored="true" height="249" resized="true" width="818" x="15" y="41">Part I - Document Similarity</description>
<description align="right" color="gray" colored="true" height="272" resized="true" width="818" x="15" y="294">Part II - Document Clustering</description>
<description align="center" color="yellow" colored="true" height="94" resized="true" width="313" x="494" y="187"><br> You may find the demo video <a href="https://academy.rapidminer.com/learn/video/document-similarity-and-clustering">here</a> on the RapidMiner Academy</description>
</process>
</operator>
</process>
Answers
Hi, I cannot see any view in the example (or maybe I missed something).
But "Data to Similarity" seems to be very fast because it does almost nothing: it simply wraps the data in an external object together with the chosen distance measure. The distances are calculated on the fly, so when you ask for the similarity between doc1 and doc2 it computes that single value at that moment.
In the case of "Similarity to Data" the resulting distances are precomputed, so it takes every instance and calculates its distance to every other instance. The "similarity" input is used only to extract the distance measure, and the input example set is used to calculate the distances directly, which has O(n^2) complexity. That is why it takes so long.
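To illustrate the difference in plain Java, here is a minimal sketch; the class and method names are invented for illustration and are not RapidMiner's actual internals. The lazy object answers a single query cheaply, while producing the long table forces all n^2 pairs to be computed and stored:

import java.util.function.BiFunction;

class SimilaritySketch {
    private final double[][] data;                                // reference to the example set, not a copy
    private final BiFunction<double[], double[], Double> measure; // e.g. cosine similarity

    SimilaritySketch(double[][] data, BiFunction<double[], double[], Double> measure) {
        this.data = data;
        this.measure = measure;
    }

    // "Data to Similarity" style: one pair, computed only when somebody asks for it.
    double get(int i, int j) {
        return measure.apply(data[i], data[j]);
    }

    // "Similarity to Data" style: every pair is computed up front, so n^2 measure
    // evaluations plus n^2 stored rows -- this is the part that takes hours at scale.
    double[][] materializeLongTable() {
        int n = data.length;
        double[][] table = new double[n * n][3];                  // columns: first id, second id, value
        int row = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                table[row][0] = i;
                table[row][1] = j;
                table[row][2] = get(i, j);
                row++;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        double[][] examples = {{1, 0}, {0, 1}, {1, 1}};
        SimilaritySketch sim = new SimilaritySketch(examples, (a, b) -> {
            double dot = 0, na = 0, nb = 0;                       // cosine similarity
            for (int k = 0; k < a.length; k++) {
                dot += a[k] * b[k];
                na += a[k] * a[k];
                nb += b[k] * b[k];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        });
        System.out.println(sim.get(0, 2));                        // cheap: a single on-demand computation
        System.out.println(sim.materializeLongTable().length);    // 9 rows here, 10^8 rows for 10,000 examples
    }
}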
More generally, an ExampleSet is a view on the data. That means that if you perform sampling, the underlying dataset does not change; it is still the same data, but you see only a subset of it. If you apply Materialize Data you make a copy of the data that is visible through the view, which means you need extra RAM for it.
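A rough sketch of the same idea, again purely illustrative rather than RapidMiner's real classes: the "view" is just a set of row indices into the one table that actually stores the values, and materializing means copying the visible rows into fresh arrays:

class ViewSketch {
    public static void main(String[] args) {
        // The underlying data is stored exactly once.
        double[][] table = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};

        // A sampled "view" only remembers which rows it exposes.
        int[] sampledRows = {0, 2};

        // Reading through the view goes straight to the original table; nothing is copied.
        System.out.println(table[sampledRows[1]][0]);             // prints 5.0

        // "Materialize Data": allocate new memory and copy the visible rows into it.
        double[][] materialized = new double[sampledRows.length][];
        for (int r = 0; r < sampledRows.length; r++) {
            materialized[r] = table[sampledRows[r]].clone();      // extra RAM, independent of the view
        }
        System.out.println(materialized.length);                  // prints 2
    }
}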
There is also another type of view. For example, see Normalize Attributes -> there you can check the "create view" box. Normally the operator needs to duplicate each attribute in RAM so it can store the new attribute values, and an ExampleSet is then created that is a view only on the new, normalized attributes. When you check "create view", the new attributes are not stored in RAM; instead they are calculated on the fly, which lengthens every later computation that reads them. It is always a trade-off between caching data in RAM and execution time.
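And a small, hypothetical sketch of that trade-off: the normalized attribute can either be cached as a new column (extra RAM, reads are a plain lookup) or recomputed from the raw column every time it is read (no extra RAM, every read pays the arithmetic):

import java.util.function.IntToDoubleFunction;

class NormalizeViewSketch {
    public static void main(String[] args) {
        double[] raw = {2.0, 4.0, 6.0, 8.0};
        double min = 2.0, max = 8.0;

        // Without "create view": compute once and cache the normalized column (costs RAM).
        double[] cached = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            cached[i] = (raw[i] - min) / (max - min);
        }
        System.out.println(cached[2]);                            // a plain array read

        // With "create view": nothing extra is stored; the value is recomputed on every read.
        IntToDoubleFunction viewAttribute = row -> (raw[row] - min) / (max - min);
        System.out.println(viewAttribute.applyAsDouble(2));       // pays the arithmetic on each access
    }
}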
Best
Marcin
This is a great post, thank you @marcin_blachnik!