Association Rule Creator - PDF files - Memory limitations
Hi,
I am a newbie here. Let me give you an overview of what I am doing.
I am processing PDF files (generally trend reports) and want to create association rules for them. Following a tutorial in which someone converted PDFs into TXT files before processing them, I converted those PDFs to text files online and tried to run the process. Here is what happened:
First, I didn't get any association rules as a result.
Then I tried to play around with the text-processing parameters, changing the "prune method" from absolute to percentual, but I got a memory error, even though I have around 450 GB of free space, and there are only 7 converted PDF files.
Please guide me how to get association rules.
Kind Regards,
Rashid
Best Answer
Hi again @rashidaziz411,
I must admit that I am a little lost: testing the first version of the process (the one that works with .TXT files) with your 7 .txt files generates a dataset with more than 7,500 attributes, and the process has no problem generating the association rules (the calculation is instantaneous!).
I can't explain this behavior.
In conclusion, as a workaround, you can convert all your PDF files into TXT files.
Regards,
Lionel
Answers
Hi @rashidaziz411,
Can you share your process (cf. READ BEFORE POSTING / §2. Share your XML process) and your files so that we can better understand?
Have you tried playing with the parameters of the Create Association Rules operator?
Regards,
Lionel
Hi @lionelderkrikor,
Thank you for your response. Yes, I tried that too, but it didn't succeed.
Attached are the files:
1. The process (workflow)
2. The files I am working on
Kind Regards,
Rashid
Hi @rashidaziz411,
I just checked your process and your files.
A priori, your Process Documents from Files operator doesn't produce any output (it doesn't extract the words from your .txt files).
I will continue to investigate...
Regards,
Lionel
Hi again @rashidaziz411,
I found some association rules (with 2 of your files: Accenture and Deloitte) by setting, in the parameters of the FP-Growth operator, min requirement = frequency and min frequency = 1.
As you will see, the support of these association rules is very close to 0...
I hope it helps,
Regards,
Lionel
NB : I redesigned your process with the Loop Files operator because I had difficulties with the Process Documents from Files operator.
The process :
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="false" breakpoints="after" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="34">
<list key="text_directories">
<parameter key="consultancy" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Append_files"/>
</list>
<parameter key="file_pattern" value=".*"/>
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_absolute" value="20"/>
<parameter key="prune_above_absolute" value="60"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="179" y="34"/>
<operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="313" y="34"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="concurrency:loop_files" compatibility="9.0.000" expanded="true" height="82" name="Loop Files" width="90" x="179" y="34">
<parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Associations_rules"/>
<parameter key="filter_type" value="regex"/>
<parameter key="filter_by_regex" value="Accenture.*|Delotte.*"/>
<parameter key="enable_macros" value="true"/>
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="9.0.000" expanded="true" height="68" name="Read CSV" width="90" x="313" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Associations_rules\Accenture-TechVision-2018-Tech-Trends-Report.txt"/>
<parameter key="column_separators" value=","/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<connect from_port="file object" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="subprocess" compatibility="9.0.000" expanded="true" height="82" name="Union Append" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
<process expanded="true">
<operator activated="true" class="loop_collection" compatibility="9.0.000" expanded="true" height="82" name="Output (4)" origin="GENERATED_COMMUNITY" width="90" x="45" y="34">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="false" breakpoints="after" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (5)" origin="GENERATED_COMMUNITY" width="90" x="112" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<operator activated="true" class="branch" compatibility="9.0.000" expanded="true" height="82" name="Branch (2)" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
<parameter key="condition_type" value="expression"/>
<parameter key="expression" value="%{iteration}==1"/>
<process expanded="true">
<connect from_port="condition" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="recall" compatibility="9.0.000" expanded="true" height="68" name="Recall (5)" origin="GENERATED_COMMUNITY" width="90" x="45" y="187">
<parameter key="name" value="LoopData"/>
</operator>
<operator activated="true" class="union" compatibility="9.0.000" expanded="true" height="82" name="Union (2)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34"/>
<connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
<connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
<connect from_op="Union (2)" from_port="union" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.0.000" expanded="true" height="68" name="Remember (5)" origin="GENERATED_COMMUNITY" width="90" x="581" y="34">
<parameter key="name" value="LoopData"/>
</operator>
<connect from_port="single" to_op="Branch (2)" to_port="condition"/>
<connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
<connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (6)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
<connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
<connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="append" compatibility="9.0.000" expanded="true" height="68" name="Append" width="90" x="313" y="187"/>
<operator activated="true" class="nominal_to_text" compatibility="9.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="34"/>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="313" y="85"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85"/>
<operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="581" y="85"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="715" y="85"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="9.0.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="715" y="34">
<parameter key="max" value="1.0"/>
</operator>
<operator activated="true" class="concurrency:fp_growth" compatibility="9.0.000" expanded="true" height="82" name="FP-Growth" width="90" x="849" y="34">
<parameter key="min_requirement" value="frequency"/>
<parameter key="min_support" value="0.01"/>
<parameter key="min_frequency" value="1"/>
<parameter key="find_min_number_of_itemsets" value="false"/>
<enumeration key="must_contain_list"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="9.0.000" expanded="true" height="82" name="Create Association Rules" width="90" x="983" y="34">
<parameter key="min_confidence" value="0.01"/>
</operator>
<connect from_op="Loop Files" from_port="output 1" to_op="Union Append" to_port="in 1"/>
<connect from_op="Union Append" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
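For readers who want to see what the FP-Growth and Create Association Rules operators compute, here is a minimal pure-Python sketch. It uses brute-force enumeration rather than the actual FP-tree algorithm, and the toy document/term sets at the bottom are invented for illustration; only the definitions of frequency, support, and confidence mirror the operators above.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_frequency=1):
    """Enumerate itemsets occurring in at least min_frequency transactions
    (brute force; FP-Growth computes the same result more efficiently)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            freq = sum(1 for t in transactions if set(combo) <= t)
            if freq >= min_frequency:
                result[combo] = freq
                found = True
        if not found:
            break  # no frequent itemset of this size, so none larger either
    return result

def association_rules(itemsets, n_transactions, min_confidence=0.01):
    """Derive rules lhs -> rhs from frequent itemsets,
    reporting (lhs, rhs, support, confidence) like Create Association Rules."""
    rules = []
    for itemset, freq in itemsets.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                rhs = tuple(i for i in itemset if i not in lhs)
                confidence = freq / itemsets[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, rhs, freq / n_transactions, confidence))
    return rules

# Toy "documents as term sets" (invented for illustration)
docs = [{"cloud", "ai", "data"}, {"cloud", "ai"}, {"data", "security"}]
sets_ = frequent_itemsets(docs, min_frequency=2)
rules = association_rules(sets_, len(docs), min_confidence=0.5)
```

With min frequency = 1, as in the process above, every term that appears anywhere becomes a frequent item, which is why rules can appear even with very few documents.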
Hi @lionelderkrikor,
I have changed the FP-Growth parameter to 1 (for frequency), but it's still not working.
I didn't understand what you described as the "redesigned process". How do I do that? Does it have to be done through the command line ("CMD") or some kind of coding? If so, I am unable to open "cmd"; I tried, but I don't know how.
Or can this "redesigning" be done through the GUI?
P.S. Why is "Process Documents from Files" not producing any output? Is there a problem with the files or with this operator? Can't it be done directly from a directory of PDF files (without converting them to .txt)?
I would love to hear from you.
Kind Regards,
Rashid
Hi @rashidaziz411,
As I said, the Process Documents from Files operator doesn't produce any output, and I don't know why.
So I decided to explore another solution to extract the words: using Loop Files + Process Documents from Data.
To import the process I shared (the XML code) in my previous post, follow these steps:
1. Activate the XML panel.
2. Copy and paste the XML code into the XML panel.
3. Click on the green check mark.
4. That's it: the process appears in the process window.
Regards,
Lionel
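Outside RapidMiner, the Loop Files + Process Documents from Data idea (read each text file, tokenize, lowercase, filter short tokens, and build a binary term-occurrence table) can be sketched in plain Python. The sample corpus and the 4-character length filter below are illustrative assumptions, and no Snowball stemming is applied.

```python
import re
import tempfile
from pathlib import Path

def term_occurrences(directory, pattern="*.txt", min_length=4):
    """Build a {filename: {term, ...}} binary term-occurrence table:
    split on non-letters, lowercase, drop tokens shorter than min_length."""
    table = {}
    for path in sorted(Path(directory).glob(pattern)):
        tokens = re.split(r"[^A-Za-z]+", path.read_text(encoding="utf-8"))
        table[path.name] = {t.lower() for t in tokens if len(t) >= min_length}
    return table

# Illustrative sample corpus written to a temporary directory
tmp = tempfile.mkdtemp()
Path(tmp, "a.txt").write_text("Cloud and AI drive digital trends.")
Path(tmp, "b.txt").write_text("Digital security trends in the cloud.")
table = term_occurrences(tmp)
```

Each row of such a table corresponds to one document, which is exactly the example set shape the FP-Growth operator expects after Numerical to Binominal.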
Thank you for your detailed answer; I am glad it's showing some association rules. I changed the directory location of "Loop Files" to the folder of text files on my PC, and it is generating rules.
But I have some questions about this process:
1. In the XML process you shared above, at LINE 37, you apply the filter to two files (Accenture and Deloitte) only. Why not all 7 files? I shared only 7 files, but I actually have 31 files to work on. If I have to run this on all of them, do I have to list them the same way you did?
2. At LINE 41 of the XML process, you set the parameter value for only one file (Accenture). Why only that one, given that LINE 37 mentions 2 files? And how did you do it: by manually typing in the XML process, or through some operator in the GUI?
3. At LINE 11 of the XML process, which parameter value are you referring to? I have no idea which directory path you mean there: the path of the text files' folder, or something else?
4. Both of these parameter values (LINE 11 and 41) still refer to your directory path (C:\lionel\......) and stayed the same when I executed the process. Do I have to change them manually in the XML, or is there an option somewhere in the GUI, like the directory option I used in "Loop Files", where I changed the path just by clicking?
Kind Regards,
Rashid
Hi @rashidaziz411,
First, as a general rule, XML files are just for sharing processes between users. Don't make any modifications to these files; import the processes into RapidMiner and always work in the RapidMiner GUI.
Here is the "general" process:
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="9.0.000" expanded="true" height="82" name="Loop Files" width="90" x="179" y="34">
<parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Associations_rules"/>
<parameter key="filter_type" value="regex"/>
<parameter key="filter_by_regex" value=".*txt"/>
<parameter key="enable_macros" value="true"/>
<process expanded="true">
<operator activated="true" class="read_csv" compatibility="9.0.000" expanded="true" height="68" name="Read CSV" width="90" x="313" y="34">
<parameter key="csv_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Associations_rules\Accenture-TechVision-2018-Tech-Trends-Report.txt"/>
<parameter key="column_separators" value=","/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<connect from_port="file object" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="subprocess" compatibility="9.0.000" expanded="true" height="82" name="Union Append" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
<process expanded="true">
<operator activated="true" class="loop_collection" compatibility="9.0.000" expanded="true" height="82" name="Output (4)" origin="GENERATED_COMMUNITY" width="90" x="45" y="34">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="false" breakpoints="after" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (5)" origin="GENERATED_COMMUNITY" width="90" x="112" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<operator activated="true" class="branch" compatibility="9.0.000" expanded="true" height="82" name="Branch (2)" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
<parameter key="condition_type" value="expression"/>
<parameter key="expression" value="%{iteration}==1"/>
<process expanded="true">
<connect from_port="condition" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="recall" compatibility="9.0.000" expanded="true" height="68" name="Recall (5)" origin="GENERATED_COMMUNITY" width="90" x="45" y="187">
<parameter key="name" value="LoopData"/>
</operator>
<operator activated="true" class="union" compatibility="9.0.000" expanded="true" height="82" name="Union (2)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34"/>
<connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
<connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
<connect from_op="Union (2)" from_port="union" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.0.000" expanded="true" height="68" name="Remember (5)" origin="GENERATED_COMMUNITY" width="90" x="581" y="34">
<parameter key="name" value="LoopData"/>
</operator>
<connect from_port="single" to_op="Branch (2)" to_port="condition"/>
<connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
<connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (6)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
<connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
<connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="447" y="34"/>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="313" y="85"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85"/>
<operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="581" y="85"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="715" y="85"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="9.0.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="715" y="34">
<parameter key="max" value="1.0"/>
</operator>
<operator activated="true" class="concurrency:fp_growth" compatibility="9.0.000" expanded="true" height="82" name="FP-Growth" width="90" x="849" y="34">
<parameter key="min_requirement" value="frequency"/>
<parameter key="min_support" value="0.01"/>
<parameter key="min_frequency" value="1"/>
<parameter key="find_min_number_of_itemsets" value="false"/>
<enumeration key="must_contain_list"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="9.0.000" expanded="true" height="82" name="Create Association Rules" width="90" x="983" y="34">
<parameter key="min_confidence" value="0.01"/>
</operator>
<connect from_op="Loop Files" from_port="output 1" to_op="Union Append" to_port="in 1"/>
<connect from_op="Union Append" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
To sum up, after importing this process into RapidMiner, you just have to:
- set, in the parameters of the Loop Files operator, the path where you have stored all your .txt files
- set one of the .txt files in the Read CSV operator (inside Loop Files) by clicking the Import Configuration Wizard button
- ... that's it.
I hope this helps and that this process answers your need.
Regards,
Lionel
NB: Indeed, I was applying the filter to just two files (Accenture and Deloitte) only for testing; the filter now matches .txt files, so the process will loop over all the .txt files inside the set path (the path where you stored all your files).
NB2: The path you mentioned belonged to the Process Documents from Files operator, which was disabled. That operator has now been deleted and no longer appears.
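The difference between the test filter (`Accenture.*|Delotte.*`, as written in the earlier process) and the general filter (`.*txt`) is simply a regular-expression match on file names. A quick Python sketch, with invented file names for illustration:

```python
import re

# Invented file names standing in for the files in the Loop Files directory
names = ["Accenture-TechVision-2018.txt", "Delotte-Trends.txt",
         "KPMG-Outlook.txt", "Report.pdf"]

test_filter = re.compile(r"Accenture.*|Delotte.*")  # only two vendors, for testing
general_filter = re.compile(r".*txt")               # every file ending in txt

test_matches = [n for n in names if test_filter.fullmatch(n)]
general_matches = [n for n in names if general_filter.fullmatch(n)]
```

This is why switching the Loop Files regex from the two-vendor pattern to `.*txt` makes the process pick up every text file in the directory without listing them individually.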
Hi again @rashidaziz411,
Here is an experimental, modified version of the previous process that works directly with .PDF files.
As with the previous process, you have to set, in the Loop Files operator's parameters, the path where your .PDF files are stored.
The process :
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.0.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="9.0.000" expanded="true" height="82" name="Loop Files" width="90" x="45" y="34">
<parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Associations_rules"/>
<parameter key="filter_type" value="regex"/>
<parameter key="filter_by_regex" value=".*pdf"/>
<parameter key="enable_macros" value="true"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="8.1.000" expanded="true" height="68" name="Read Document" width="90" x="380" y="34">
<parameter key="content_type" value="pdf"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="34">
<parameter key="text_attribute" value="text"/>
</operator>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="subprocess" compatibility="9.0.000" expanded="true" height="82" name="Union Append" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="loop_collection" compatibility="9.0.000" expanded="true" height="82" name="Output (4)" origin="GENERATED_COMMUNITY" width="90" x="45" y="34">
<parameter key="set_iteration_macro" value="true"/>
<process expanded="true">
<operator activated="false" breakpoints="after" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (5)" origin="GENERATED_COMMUNITY" width="90" x="112" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<operator activated="true" class="branch" compatibility="9.0.000" expanded="true" height="82" name="Branch (2)" origin="GENERATED_COMMUNITY" width="90" x="313" y="34">
<parameter key="condition_type" value="expression"/>
<parameter key="expression" value="%{iteration}==1"/>
<process expanded="true">
<connect from_port="condition" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
<process expanded="true">
<operator activated="true" class="recall" compatibility="9.0.000" expanded="true" height="68" name="Recall (5)" origin="GENERATED_COMMUNITY" width="90" x="45" y="187">
<parameter key="name" value="LoopData"/>
</operator>
<operator activated="true" class="union" compatibility="9.0.000" expanded="true" height="82" name="Union (2)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34"/>
<connect from_port="condition" to_op="Union (2)" to_port="example set 1"/>
<connect from_op="Recall (5)" from_port="result" to_op="Union (2)" to_port="example set 2"/>
<connect from_op="Union (2)" from_port="union" to_port="input 1"/>
<portSpacing port="source_condition" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_input 1" spacing="0"/>
<portSpacing port="sink_input 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="remember" compatibility="9.0.000" expanded="true" height="68" name="Remember (5)" origin="GENERATED_COMMUNITY" width="90" x="581" y="34">
<parameter key="name" value="LoopData"/>
</operator>
<connect from_port="single" to_op="Branch (2)" to_port="condition"/>
<connect from_op="Branch (2)" from_port="input 1" to_op="Remember (5)" to_port="store"/>
<connect from_op="Remember (5)" from_port="stored" to_port="output 1"/>
<portSpacing port="source_single" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="select" compatibility="9.0.000" expanded="true" height="68" name="Select (6)" origin="GENERATED_COMMUNITY" width="90" x="179" y="34">
<parameter key="index" value="%{iteration}"/>
</operator>
<connect from_port="in 1" to_op="Output (4)" to_port="collection"/>
<connect from_op="Output (4)" from_port="output 1" to_op="Select (6)" to_port="collection"/>
<connect from_op="Select (6)" from_port="selected" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="9.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34"/>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="313" y="85"/>
<operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="447" y="85"/>
<operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (2)" width="90" x="581" y="85"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="715" y="85"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="9.0.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="581" y="34">
<parameter key="max" value="1.0"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.0.000" expanded="true" height="82" name="Select Attributes" width="90" x="715" y="34">
<parameter key="attribute_filter_type" value="regular_expression"/>
<parameter key="regular_expression" value="metadata.*"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="concurrency:fp_growth" compatibility="9.0.000" expanded="true" height="82" name="FP-Growth" width="90" x="849" y="34">
<parameter key="min_requirement" value="frequency"/>
<parameter key="min_support" value="0.01"/>
<parameter key="min_frequency" value="1"/>
<parameter key="find_min_number_of_itemsets" value="false"/>
<enumeration key="must_contain_list"/>
</operator>
<operator activated="true" class="create_association_rules" compatibility="9.0.000" expanded="true" height="82" name="Create Association Rules" width="90" x="983" y="34">
<parameter key="min_confidence" value="0.01"/>
</operator>
<connect from_op="Loop Files" from_port="output 1" to_op="Union Append" to_port="in 1"/>
<connect from_op="Union Append" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
<connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
<connect from_op="Create Association Rules" from_port="rules" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Lionel
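As an aside for readers following along: the text-processing subprocess in the XML above (tokenize, filter tokens by length, stem, transform cases, then count term occurrences and binarize) can be sketched in plain Python. This is only a rough analogue, not RapidMiner's implementation: the regex tokenizer, the 3-to-25 length bounds, and the presence/absence binarization are illustrative assumptions, and Snowball stemming is skipped.

```python
import re

def docs_to_binary_term_matrix(docs, min_len=3, max_len=25):
    """Rough Python analogue of the Process Documents subprocess:
    tokenize on non-letters, filter tokens by length, lowercase,
    count term occurrences, then binarize to present/absent.
    (Stemming is omitted here; the RapidMiner process uses Snowball.)"""
    counts = []
    vocab = set()
    for doc in docs:
        tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", doc)
                  if min_len <= len(t) <= max_len]
        row = {}
        for t in tokens:
            row[t] = row.get(t, 0) + 1
            vocab.add(t)
        counts.append(row)
    vocab = sorted(vocab)
    # binarize: an attribute is "true" when the term occurs at least once
    return vocab, [[row.get(t, 0) > 0 for t in vocab] for row in counts]

vocab, matrix = docs_to_binary_term_matrix(
    ["Consulting trends grow fast", "Consulting reports grow"])
```

Each row of `matrix` then plays the role of one "transaction" for FP-Growth, with one binary attribute per term.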
0 -
Thank you for your help. The first solution you pointed out (working on text files) works fine. However, the second XML process (the experimental one on PDFs) reports an error in the "Loop Files" operator, saying it can't read the files.
Kind Regards,
Rashid
0 -
Hi @rashidaziz411,
I tested this process with some of my own .PDF files and it works fine: I'm not able to reproduce your bug.
Can you share some of your "problematic" .PDF files?
I suspect this issue is related to the encoding method of the Read Document operator.
In parallel, can you try other encoding methods?
Regards,
Lionel
0 -
I was getting the previous error when I pointed the path at all the files combined (32 files in total), i.e., at the folder containing all of them.
After your latest response, I changed the path to the first folder only (consultancy). But now it gives me a memory error at just 3% of generating association rules, even for the consultancy folder alone (which contains the 7 files I shared with you as text files). I have about 25 GB of free space on the C drive and 450 GB on the D drive.
I am attaching 6 of the 7 PDF files here.
Edited: one file exceeded the size limit (Deloitte).
Kind Regards,
Rashid
0 -
Hi @rashidaziz411,
I executed the process with your 6 .PDF files and I face the same problem as you...
But it is not a storage problem; it is an insufficient-RAM problem (personally I have 16 GB of RAM on my PC).
After investigation, it's not surprising: with these 6 files you get a very large dataset of 6 rows and 4443 attributes!
(With a small .pdf file generating only 10 attributes, the process works.)
So, in conclusion, I'm afraid you need more RAM on your computer to execute the process with all your files.
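To see why 4443 binary attributes can exhaust memory, note that the number of possible itemsets grows combinatorially with the number of attributes, and every itemset that clears the support threshold must be materialized. The brute-force sketch below (illustration only; FP-Growth is far smarter than this, but it still stores every frequent itemset) shows that on a small dense dataset all nonempty item combinations become frequent:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent-itemset mining, for illustration only.
    Real FP-Growth avoids enumerating all candidates, but it still
    outputs every frequent itemset, which explodes on wide, dense data."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    frequent = []
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            support = sum(set(cand) <= t for t in transactions) / n
            if support >= min_support:
                frequent.append((cand, support))
                found = True
        if not found:  # no frequent k-itemset => no larger one exists
            break
    return frequent

# dense toy data: every item appears in every transaction,
# so ALL 2^5 - 1 = 31 nonempty itemsets are frequent
dense = [set("abcde")] * 4
print(len(frequent_itemsets(dense, 0.5)))  # 31
```

With few rows, many terms appear in every document, so the data is dense in exactly this way; raising `min_support` (the process above uses a very low 0.01) or pruning rare/common terms in text processing cuts the itemset count dramatically.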
Regards,
Lionel
0 -
Hi again @rashidaziz411,
I must admit that I am a little lost: testing the first version of the process (the one that works with .TXT files)
with your 7 .txt files generates a dataset with more than 7500 attributes... and the process has no problem generating
the association rules (the calculation of the association rules is instantaneous!).
I can't explain this behavior.
In conclusion, as a palliative solution, you can convert all your PDF files into TXT files.
Regards,
Lionel
1 -
Thanks for your help. I too have 16 GB of RAM, but I will convert them to text files.
Once again, thank you; I really appreciate your efforts in this regard.
Best Regards,
Rashid
1 -
Hello,
I experienced a similar memory problem. I used the same process as @rashidaziz411 the first time; it didn't even work with 4 .txt files. I then used the process suggested by @lionelderkrikor, and everything seems to work great.
However, I am not sure what "Loop Files" does in this case (especially the Read CSV part, since these are plain text files, not comma-separated), nor what "Union Append" does. Sorry for the ignorant question; I'm rather new to this.
Thank you
Cheers,
Indi
1 -
Hi Indi,
Not at all an ignorant question. I expect an answer as well.
Maerkli
0
-
Hi @Maerkli, Hi @IndiJaDTU,
The Loop Files operator reads the text of the .txt (or .pdf) files stored in the specified directory.
You're right that the "Union Append" subprocess is not necessary in this case: you can simply use the "Append" operator.
This operator appends the different files into one example set.
To better understand, set a "Breakpoint After" on the Loop Files operator and on the Append (or Union Append) operator.
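If it helps to see the idea outside RapidMiner, here is a minimal Python sketch of what Loop Files + Append amount to: iterate over the files in a directory and stack each file's text as one row of a single table. The directory layout and column names are illustrative assumptions, not anything from the process itself.

```python
import os
import tempfile

def loop_and_append(directory):
    """Rough analogue of Loop Files + Append: read every .txt file in a
    directory and append each file's text as one row of a single table."""
    rows = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                rows.append({"file": name, "text": f.read()})
    return rows

# demo with a throwaway directory containing two "documents"
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "first report"), ("b.txt", "second report")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    table = loop_and_append(d)
```

Each loop iteration produces one document; the append step is what turns them into the single example set that the downstream text-processing operators consume.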
I hope it helps,
Regards,
Lionel
1 -
Thank you, Lionel.
Maerkli
1
-
Thank you for your explanation. However, I seem to have an issue when I start adding more .txt files (64 files): RapidMiner just went to "not responding". I have 16 GB of RAM.
Cheers,
1