Text Mining - Computation time and memory usage
Dear all,
I am working on a text mining use case with a data set of around 80,000 examples and 33 attributes (after applying TF-IDF, SVD, and various filter and wrapper methods), trying to predict around 100 classes. We tested different algorithms such as Naive Bayes, k-NN, Fast Large Margin and Gradient Boosted Trees, and are quite satisfied with the results.
A big challenge while testing and evaluating the different algorithms was computation time and memory usage. Even on a fairly powerful machine (128 GB memory and 8 cores, RapidMiner 7.6 professional license), the computation time for building and evaluating the models grew to as much as one week (for example, testing Gradient Boosted Trees with an Evolutionary Parameter Optimization), memory usage rose to 100%, and the machine crashed even while testing Naive Bayes. Only after adding the Free Memory and Materialize Data operators to the inner and outer validation did the processes run stably, but they still take very long. Of course, with this many examples, attributes and classes, running the processes is not going to be simple or fast.
However, as far as I know, RapidMiner has implemented a new core to optimize computation time and memory usage, but it does not seem to help in our use case. Or are such long computation times and memory usage of up to 128 GB normal? Based on that, my questions are: Are there any mistakes in our process? If not, how could we optimize the process? And how should the Free Memory and Materialize Data operators be used to optimize computation time and memory usage?
For better understanding, I have attached the validation process for the Gradient Boosted Trees algorithm, including the attribute filtering by chi-squared statistic and the parameter optimization.
Thanks in advance for your help.
Michel
<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="split_validation" compatibility="7.6.001" expanded="true" height="124" name="Äußere Validierung Gradient Boosted Tree" width="90" x="581" y="136">
<process expanded="true">
<operator activated="true" class="weight_by_chi_squared_statistic" compatibility="7.6.001" expanded="true" height="82" name="Weight by Chi Squared Statistic (5)" width="90" x="45" y="34"/>
<operator activated="true" class="select_by_weights" compatibility="7.6.001" expanded="true" height="103" name="Select by Weights (9)" width="90" x="45" y="238">
<parameter key="weight" value="20000.0"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.6.001" expanded="true" height="103" name="Multiply (9)" width="90" x="179" y="34"/>
<operator activated="true" class="optimize_parameters_evolutionary" compatibility="7.6.001" expanded="true" height="103" name="Optimize Parameters (Evolutionary)" width="90" x="313" y="187">
<list key="parameters">
<parameter key="Gradient Boosted Trees.number_of_trees" value="[10;1000.0]"/>
<parameter key="Gradient Boosted Trees.maximal_depth" value="[10;300]"/>
<parameter key="Gradient Boosted Trees.learning_rate" value="[0.1;1.0]"/>
</list>
<parameter key="selection_type" value="roulette wheel"/>
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="true" class="split_validation" compatibility="7.6.001" expanded="true" height="124" name="Validierung Optimize Gradient Modelvalidierung" width="90" x="447" y="34">
<parameter key="use_local_random_seed" value="true"/>
<process expanded="true">
<operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.6.001" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="380" y="34">
<parameter key="number_of_trees" value="636"/>
<parameter key="maximal_depth" value="160"/>
<parameter key="learning_rate" value="0.3528977870143133"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="free_memory" compatibility="7.6.001" expanded="true" height="82" name="Free Memory (2)" width="90" x="45" y="85"/>
<operator activated="false" class="materialize_data" compatibility="7.6.001" expanded="true" height="82" name="Materialize Data (2)" width="90" x="179" y="85"/>
<connect from_port="training" to_op="Gradient Boosted Trees" to_port="training set"/>
<connect from_op="Gradient Boosted Trees" from_port="model" to_port="model"/>
<connect from_op="Free Memory (2)" from_port="through 1" to_op="Materialize Data (2)" to_port="example set input"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true">
<operator activated="false" class="free_memory" compatibility="7.6.001" expanded="true" height="82" name="Free Memory (19)" width="90" x="45" y="85"/>
<operator activated="false" class="materialize_data" compatibility="7.6.001" expanded="true" height="82" name="Materialize Data (19)" width="90" x="179" y="85"/>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model Gradient Boosted Trees" width="90" x="313" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance Gradient Boosted Trees" width="90" x="514" y="34">
<parameter key="classification_error" value="true"/>
<parameter key="soft_margin_loss" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model Gradient Boosted Trees" to_port="model"/>
<connect from_port="test set" to_op="Apply Model Gradient Boosted Trees" to_port="unlabelled data"/>
<connect from_op="Free Memory (19)" from_port="through 1" to_op="Materialize Data (19)" to_port="example set input"/>
<connect from_op="Apply Model Gradient Boosted Trees" from_port="labelled data" to_op="Performance Gradient Boosted Trees" to_port="labelled data"/>
<connect from_op="Performance Gradient Boosted Trees" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_port="input 1" to_op="Validierung Optimize Gradient Modelvalidierung" to_port="training"/>
<connect from_op="Validierung Optimize Gradient Modelvalidierung" from_port="averagable 1" to_port="performance"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_performance" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
<operator activated="true" class="set_parameters" compatibility="7.6.001" expanded="true" height="82" name="Set Parameters (4)" width="90" x="514" y="187">
<list key="name_map">
<parameter key="Gradient Boosted Trees" value="Learner Gradient Boosted Trees"/>
</list>
</operator>
<operator activated="true" class="h2o:gradient_boosted_trees" compatibility="7.6.001" expanded="true" height="103" name="Learner Gradient Boosted Trees" width="90" x="581" y="34">
<parameter key="number_of_trees" value="1"/>
<parameter key="maximal_depth" value="50"/>
<parameter key="learning_rate" value="0.672469087120154"/>
<list key="expert_parameters"/>
</operator>
<operator activated="false" class="free_memory" compatibility="7.6.001" expanded="true" height="82" name="Free Memory (3)" width="90" x="313" y="34"/>
<operator activated="false" class="materialize_data" compatibility="7.6.001" expanded="true" height="82" name="Materialize Data (3)" width="90" x="447" y="34"/>
<connect from_port="training" to_op="Weight by Chi Squared Statistic (5)" to_port="example set"/>
<connect from_op="Weight by Chi Squared Statistic (5)" from_port="weights" to_op="Select by Weights (9)" to_port="weights"/>
<connect from_op="Weight by Chi Squared Statistic (5)" from_port="example set" to_op="Select by Weights (9)" to_port="example set input"/>
<connect from_op="Select by Weights (9)" from_port="example set output" to_op="Multiply (9)" to_port="input"/>
<connect from_op="Select by Weights (9)" from_port="weights" to_port="through 1"/>
<connect from_op="Multiply (9)" from_port="output 1" to_op="Optimize Parameters (Evolutionary)" to_port="input 1"/>
<connect from_op="Multiply (9)" from_port="output 2" to_op="Learner Gradient Boosted Trees" to_port="training set"/>
<connect from_op="Optimize Parameters (Evolutionary)" from_port="parameter" to_op="Set Parameters (4)" to_port="parameter set"/>
<connect from_op="Learner Gradient Boosted Trees" from_port="model" to_port="model"/>
<connect from_op="Free Memory (3)" from_port="through 1" to_op="Materialize Data (3)" to_port="example set input"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<portSpacing port="sink_through 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="select_by_weights" compatibility="7.6.001" expanded="true" height="103" name="Select by Weights (10)" width="90" x="112" y="136">
<parameter key="weight" value="20000.0"/>
</operator>
<operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model Gradient Boosted Trees Außen" width="90" x="313" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (6)" width="90" x="447" y="34">
<parameter key="classification_error" value="true"/>
<list key="class_weights"/>
</operator>
<connect from_port="model" to_op="Apply Model Gradient Boosted Trees Außen" to_port="model"/>
<connect from_port="test set" to_op="Select by Weights (10)" to_port="example set input"/>
<connect from_port="through 1" to_op="Select by Weights (10)" to_port="weights"/>
<connect from_op="Select by Weights (10)" from_port="example set output" to_op="Apply Model Gradient Boosted Trees Außen" to_port="unlabelled data"/>
<connect from_op="Apply Model Gradient Boosted Trees Außen" from_port="labelled data" to_op="Performance (6)" to_port="labelled data"/>
<connect from_op="Performance (6)" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="source_through 2" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
</process>
</operator>
<connect from_op="Äußere Validierung Gradient Boosted Tree" from_port="averagable 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
-
Although this isn't specific to your memory usage question, I do have a few additional comments that might be helpful for text mining problems.
First, you could try some sampling and work with a much smaller dataset. Often it is helpful to build simpler models on smaller samples first, to get a sense of what is going on, which algorithms are doing better, etc. This will also allow you to try different alternatives much faster than you are able to do on your full dataset. Then after you have done that and learned some things about your specific problem, you can scale up to your full dataset.
Second, you mention that you are trying to predict 100 different classes! That's a tall order, even for the most powerful algorithms. You might want to consider creating a simplified version of your label with fewer classes and seeing how your models perform on that first. You don't provide the specific background on your project, but in most real-world data science cases, there aren't necessarily 100 different outcomes that you would pursue, so consolidating into fewer categories might not lead to meaningful differences in resulting strategy.
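A minimal sketch of the sampling idea above, written in plain Python rather than as a RapidMiner process. It draws a stratified sample so that every class stays represented, which matters with around 100 classes. The function name `stratified_sample`, the toy rows and the 10% fraction are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=42):
    """Return roughly `fraction` of the rows, keeping at least one row per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        # Sample each class separately so rare classes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical toy data: 1,000 rows spread evenly over 100 classes.
rows = [{"id": i, "label": f"class_{i % 100}"} for i in range(1000)]
subset = stratified_sample(rows, lambda r: r["label"], fraction=0.1)
print(len(subset), "rows,", len({r["label"] for r in subset}), "classes")
```

Once an approach looks promising on the sample, the same process can be rerun on the full data set.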
-
Excuse my lack of spacing, but my JavaScript is playing up at the moment. You have your Optimize Parameters for your tree INSIDE your cross validation. This means each fold of the cross validation potentially picks new parameters for your GBT, so when you average the results you might be averaging over different models each time. This is perhaps not best practice. I'd recommend putting everything inside the Optimize operator and just doing it once.
You can use the Result port of the Optimize operator to get the best performing model, so you don't need to worry about running the XValidation again either.
-
And lastly, regarding Materialize Data / Free Memory: with GBT and the new RM data core, Materialize Data is unlikely to make a difference, but a Free Memory operator dropped on the Test side of the Cross Validation just after Performance won't hurt and might help to clear out RAM from the previous GBT round.
You might also want to play around with the number of threads you allocate to the GBT, but this defaults to the maximum, so it shouldn't cause too many performance issues. I believe the next version of RM has parallel Optimize operators, so you'd also get a boost there.
-
Dear all,
First, thanks for your quick response.
@Telcontar120 : thank you for your advice. At the beginning we tested the different algorithms on smaller samples to get a sense of what is going on, which algorithms are doing better, etc. Now, at the end of the project, we would like to test the final processes on the entire data set to get a feeling for the "final" performance as an input for the practical deployment and so on. Regarding our project and the number of classes: I know it is a huge number, and in normal data science cases it wouldn't be necessary to consider all classes, but in this case it unfortunately is.
@JEdward : thank you for your advice. From my perspective, and based on the blog post by @Ingo (https://rapidminer.com/learn-right-way-validate-models-part-4-accidental-contamination/), I think it is the right way to place the parameter optimization inside the outer validation, so that the model is validated every time on an "unseen" sample (as defined by the cross or split validation) with the parameter settings optimized for that split. This way you make sure the validated performance is close to the performance you would get on real unseen data. In the end, the model is built on 100% of the data set, running only through the training process of the cross validation with an additional parameter optimization. If I am wrong, please correct me.
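The layout described above, an outer validation whose training side contains the whole parameter search, can be sketched as follows. This is a toy in plain Python, not the RapidMiner process: the 1-D k-NN learner, the candidate values for `k`, the 70/30 split ratio and the synthetic data are all assumptions for illustration.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D toy model)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, ratio, rng):
    """Shuffle a copy of the data and cut it into two parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

def tune_k(train, candidates, rng):
    """Inner split validation: choose k using the outer training data only."""
    inner_train, inner_test = split(train, 0.7, rng)
    return max(candidates, key=lambda k: accuracy(inner_train, inner_test, k))

rng = random.Random(0)
data = ([(rng.gauss(0, 1), "a") for _ in range(100)]
        + [(rng.gauss(3, 1), "b") for _ in range(100)])

outer_train, outer_test = split(data, 0.7, rng)        # outer validation
best_k = tune_k(outer_train, [1, 3, 5, 9], rng)        # search never sees outer_test
estimate = accuracy(outer_train, outer_test, best_k)   # scored on untouched data
print("chosen k:", best_k, "outer accuracy:", round(estimate, 3))
```

The point of the design is that `outer_test` plays no part in choosing `k`, so the outer accuracy is not contaminated by the search.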
As you suggest, I will place the Free Memory operator after the performance computation in the testing phase of the outer and inner cross validation. Thanks for that. However, I think this won't solve my issues with the high computation time and memory usage, so in conclusion I just have to deal with it and wait a “little bit” longer for the results.
Thanks again.
Regards
Michel
-
Hi Michel,
In this case no.
When you run an XValidation, it builds the model with the chosen parameters on the provided training set, scores it on the test set, and repeats this process 10 times (or however many folds you have configured). It then builds a final model using ALL the provided data and reports the average of the previous test results.
(I like to have an additional holdout sample for testing with the Compare Models operator or a statistical test)
This means that when you are running your GBT with Optimize Parameters inside the training, your Cross Validation becomes potentially unusable. Each fold might produce a different model (different parameter settings), which your process will then ignore, building a final model using the optimized parameters against the whole dataset.
So not only are you causing the Average performances of your model in X Validation to become unusable, but also you are adding Parameter Optimization steps that don't need to be done.
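The mechanics described above (train and score per fold, average the fold scores, then build one final model on all the data) can be sketched in plain Python. The nearest-centroid learner and the toy data are assumptions chosen to keep the example short; they stand in for the GBT and the real example set.

```python
from statistics import mean

def fit_centroids(train):
    """Toy learner: one centroid (mean x value) per class."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}

def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def cross_validate(data, k):
    """Return (mean test accuracy over k folds, model trained on ALL data)."""
    scores = []
    for i in range(k):
        test = data[i::k]  # every k-th example forms fold i
        train = [row for j, row in enumerate(data) if j % k != i]
        model = fit_centroids(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    # The fold models are discarded; the delivered model uses every example.
    return mean(scores), fit_centroids(data)

data = ([(float(i), "low") for i in range(10)]
        + [(float(i), "high") for i in range(20, 30)])
avg_accuracy, final_model = cross_validate(data, k=5)
print(avg_accuracy, final_model)
```

Note that the averaged score describes the fold models, while the returned model is a separate one fitted afterwards on the full data.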
Edit: I just noticed you are already using a Split Validation as the surrounding operator. However, I've been testing and noticed that because Reproducible was not ticked on the GBT, the same parameters deliver different performance results with each optimization. This means the optimization would potentially not be able to converge. I've reworked the process as below so you can have a play around.
[Code]
Changed to >100 for my Deals example
Note: ensure reproducible is ticked. Otherwise you'll get a new random seed each time you run the Optimize and so the same parameter settings will deliver different results.
This will be the weights remembered from the final FULL model.
This shouldn't affect the GBT having it in or not for scoring, but we can have it.
This is now a Cross Validation
It's good to have a log of how the parameters are optimizing.
Try to limit your search space for Optimize Parameters as well: you are searching on 3 parameters across a wide range. You could split that up into smaller ranges, perhaps by using a grid search to narrow down the space, and then focus on a smaller space for the Evolutionary Optimization.
Also with this approach if you have multiple hardware resources available then you can execute on several machines and check the logs to see which is performing better.
Added a Split Validation to make it more like I would do this, but you don't have to have it.
[/Code]
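The advice in the notes above about narrowing the search space with a grid first can be illustrated with a small coarse-to-fine search. This is a sketch in plain Python under stated assumptions: `coarse_to_fine` is a hypothetical helper, and `toy_score` is a made-up stand-in for a validated accuracy, not a real GBT run.

```python
def coarse_to_fine(score, low, high, steps=5, rounds=4):
    """Grid-search [low, high], then repeatedly zoom in around the best point."""
    best = low
    for _ in range(rounds):
        width = (high - low) / (steps - 1)
        grid = [low + i * width for i in range(steps)]
        best = max(grid, key=score)
        # Shrink the range to one grid cell on either side of the best value.
        low, high = max(low, best - width), min(high, best + width)
    return best

# Hypothetical stand-in for a validated accuracy that peaks at 0.3.
def toy_score(learning_rate):
    return -(learning_rate - 0.3) ** 2

best_rate = coarse_to_fine(toy_score, low=0.1, high=1.0)
print(round(best_rate, 4))
```

With 5 grid points and 4 rounds this needs 20 evaluations, and the narrowed range it ends on could then be handed to a finer optimization.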
Okay, my javascript is still broken.
Have attached the process as an RMP.
Also, check the learning rate value, as on GBT this can vary a lot: in my test data, 0.12539837256193326 gives an accuracy of 0.9942857142857143, but 0.12483998028466764 gives an accuracy of 0.6828571428571428.
Trying to get the right parameter setting when it's jumping around with fractions of a decimal is going to be very difficult, especially as you are searching in a larger space than my example.
I'm going to say this is actually the root of the problem.
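One way to picture limiting the decimals of the search is to snap every proposed value onto a fixed-precision grid before evaluating it, and to cache the result so that repeated values cost nothing. The sketch below is plain Python and makes several assumptions: the two-decimal precision, the random proposals and the `evaluate` function are all hypothetical stand-ins, and nothing here reflects how RapidMiner's optimizer works internally.

```python
import random
from functools import lru_cache

DECIMALS = 2  # assumed precision; choose whatever resolution is meaningful

def quantize(value, decimals=DECIMALS):
    """Snap a proposed parameter value onto a grid with fixed decimal precision."""
    return round(value, decimals)

@lru_cache(maxsize=None)
def evaluate(learning_rate):
    """Hypothetical stand-in for an expensive train-and-validate run."""
    return 1.0 - abs(learning_rate - 0.3)

rng = random.Random(1)
proposals = [rng.uniform(0.1, 1.0) for _ in range(500)]
scores = {quantize(p): evaluate(quantize(p)) for p in proposals}
print(len(proposals), "proposals ->", evaluate.cache_info().misses, "evaluations")
```

At two decimals the range 0.1 to 1.0 contains at most 91 distinct values, so the number of expensive evaluations is bounded no matter how many candidates the search proposes.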
Guys, as a feature request: could the number of decimals that Optimize Evolutionary searches in be limited?
-
Dear @JEdward,
thank you very much for your detailed answer. Sorry for my late reply, but I was on holiday and unable to answer.
I have checked your process, and I get your points and understand your argumentation. But if I understand it correctly, it differs from the approach described in the RapidMiner blog (https://rapidminer.com/learn-right-way-validate-models-part-4-accidental-contamination/). Just to make sure: what is the right way to validate a model including a parameter optimization? Your approach or the one presented in the blog? Or does it depend on the data?
In your approach you included the filter by weights for the attribute selection in the test phase of the inner cross validation. That means the attributes are filtered again every time, based on the input training set. Is it possible to place the filtering by weights outside the parameter optimization so that it only has to be done once? Or is there any reason that does not allow it?
Thanks and regards
Michel