Text Mining: Document Clustering of JSON Files
Hi everyone,
As I am really new to RapidMiner, I am having great difficulty clustering documents.
The idea is to import about 1350 documents (.txt) that are written in JSON format, to convert them into a table (each row representing one document), and to run a document clustering including a performance measurement. By the way, the content of the documents is web content from different websites (in English and German).
Unfortunately, I have not managed to import these files in a way that RapidMiner recognizes them as JSON.
Is there anyone who could help me with that? I would really appreciate any help!
If needed, I can send some of the documents.
Thanks a lot!!
Best Answer
-
I haven't run this through with your data yet, but in your first Process Documents operator you have ticked the parameter 'extract text only'. This removes all of the JSON formatting from the documents as they are imported.
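For example, if a (hypothetical) document looked like {"company": "ACME", "about": "Wir sind ..."}, extracting text only would strip the braces, keys and quotes on import, leaving nothing for JSON To Data to parse. The key setting, pulled out of the process below, is on the Read Document operator:
<operator activated="true" class="text:read_document" compatibility="7.4.001" expanded="true" name="Read Document">
<!-- false keeps the raw JSON intact for the later JSON To Data step -->
<parameter key="extract_text_only" value="false"/>
</operator>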
Try the layout below. It tackles the problem in a slightly different order: first it reads all the JSON documents in unchanged and converts them into a data table, then it turns that data back into documents and processes them with the text mining operators.
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop_files" compatibility="7.4.000" expanded="true" height="82" name="Loop Files" width="90" x="45" y="34">
<parameter key="directory" value="C:\Users\Maik\Documents\Daten Masterarbeit\Alle_UN_mehr_2k_Mitarbeiter\50"/>
<process expanded="true">
<operator activated="true" class="text:read_document" compatibility="7.4.001" expanded="true" height="68" name="Read Document" width="90" x="246" y="34">
<parameter key="extract_text_only" value="false"/>
</operator>
<connect from_port="file object" to_op="Read Document" to_port="file"/>
<connect from_op="Read Document" from_port="output" to_port="output 1"/>
<portSpacing port="source_file object" spacing="0"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
</process>
<description align="center" color="transparent" colored="false" width="126">Loops through your text directory and reads each file in as a document, unchanged. In RM 7.4 this is parallelized to make use of multiple processor cores.</description>
</operator>
<operator activated="true" class="text:json_to_data" compatibility="7.4.001" expanded="true" height="82" name="JSON To Data" width="90" x="179" y="34"/>
<operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
<parameter key="value_type" value="text"/>
<description align="center" color="transparent" colored="false" width="126">Additional operators for processing might be needed here, depending on the JSON doc format, but I have assumed not.</description>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.4.001" expanded="true" height="68" name="Data to Documents" width="90" x="514" y="34">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="7.4.001" expanded="true" height="103" name="Process Documents" width="90" x="648" y="34">
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="add_meta_information" value="false"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="20.0"/>
<parameter key="prune_above_percent" value="100.0"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85"/>
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize (3)" width="90" x="246" y="85">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="380" y="85">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="85"/>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="648" y="85"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="179" y="238">
<parameter key="min_chars" value="2"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="238"/>
<operator activated="true" class="text:stem_snowball" compatibility="7.4.001" expanded="true" height="68" name="Stem (Snowball)" width="90" x="514" y="238"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.4.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="715" y="238"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="dbscan" compatibility="7.4.000" expanded="true" height="82" name="Clustering" width="90" x="916" y="34"/>
<connect from_op="Loop Files" from_port="output 1" to_op="JSON To Data" to_port="documents 1"/>
<connect from_op="JSON To Data" from_port="example set" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 3"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="21"/>
<portSpacing port="sink_result 3" spacing="147"/>
<portSpacing port="sink_result 4" spacing="273"/>
</process>
</operator>
</process>
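Since you also asked about a performance measurement: one option (an untested sketch, not part of the process above) is to feed the clustering outputs into a Cluster Distance Performance operator, which, as far as I know, needs a centroid-based cluster model, so DBSCAN would have to be swapped for k-Means. The fragment below would replace the Clustering operator; k = 5 is just a placeholder value.
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" name="Clustering (k-Means)">
<!-- placeholder cluster count; tune to your data -->
<parameter key="k" value="5"/>
</operator>
<operator activated="true" class="cluster_distance_performance" compatibility="7.4.000" expanded="true" name="Performance"/>
<connect from_op="Clustering (k-Means)" from_port="cluster model" to_op="Performance" to_port="cluster model"/>
<connect from_op="Clustering (k-Means)" from_port="clustered set" to_op="Performance" to_port="example set"/>
<connect from_op="Performance" from_port="performance" to_port="result 4"/>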
Answers
-
Did you use the JSON To Data operator? Why don't you post the XML of your process and the JSON file? Maybe someone can troubleshoot it.
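For reference, the JSON To Data operator ships with the Text Processing extension; in the process XML it appears as a one-line fragment:
<operator activated="true" class="text:json_to_data" compatibility="7.4.001" expanded="true" name="JSON To Data"/>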
-
Hi Thomas,
Thank you for your reply. I have actually tried to use the JSON To Data operator, but it still won't work. As I am pretty new to RapidMiner, and especially to text mining, I am sure I am missing some basic operators.
So here is my process; I have attached one of the JSON files as .docx, since .txt is not supported.
I'm looking forward to any suggestions!
Thank you!!
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:process_document_from_file" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Files" width="90" x="45" y="34">
<list key="text_directories">
<parameter key="p2test2" value="C:\Users\Maik\Documents\Daten Masterarbeit\Alle_UN_mehr_2k_Mitarbeiter\50"/>
</list>
<parameter key="vector_creation" value="Term Frequency"/>
<parameter key="prune_method" value="percentual"/>
<parameter key="prune_below_percent" value="20.0"/>
<parameter key="prune_above_percent" value="100.0"/>
<parameter key="datamanagement" value="float_sparse_array"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="112" y="85"/>
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize (3)" width="90" x="246" y="85">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="380" y="85">
<parameter key="mode" value="linguistic sentences"/>
<parameter key="language" value="German"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="85"/>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (German)" width="90" x="648" y="85"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="179" y="238">
<parameter key="min_chars" value="2"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="238"/>
<operator activated="true" class="text:stem_snowball" compatibility="7.4.001" expanded="true" height="68" name="Stem (Snowball)" width="90" x="514" y="238"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.4.001" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="715" y="238"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Tokenize (3)" to_port="document"/>
<connect from_op="Tokenize (3)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
<connect from_op="Filter Stopwords (German)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="text:data_to_documents" compatibility="7.4.001" expanded="true" height="68" name="Data to Documents (2)" width="90" x="447" y="595">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="136">
<parameter key="value_type" value="text"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="7.4.001" expanded="true" height="68" name="Data to Documents" width="90" x="380" y="136">
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:json_to_data" compatibility="7.4.001" expanded="true" height="82" name="JSON To Data" width="90" x="514" y="136"/>
<operator activated="true" class="dbscan" compatibility="7.4.000" expanded="true" height="82" name="Clustering" width="90" x="782" y="136"/>
<operator activated="false" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="313" y="493">
<parameter key="attribute_filter_type" value="value_type"/>
<parameter key="value_type" value="text"/>
</operator>
<operator activated="false" class="sample" compatibility="7.4.000" expanded="true" height="82" name="Sample" width="90" x="514" y="493">
<list key="sample_size_per_class"/>
<list key="sample_ratio_per_class"/>
<list key="sample_probability_per_class"/>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="JSON To Data" to_port="documents 1"/>
<connect from_op="JSON To Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 2"/>
<connect from_op="Clustering" from_port="clustered set" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
-
Ya beat me to it, JEdwards. I can confirm that this works with the sample JSON file.
-
I need to do this exact thing, but I do not understand the format you posted that in. I don't see anything called 'context' in the operator titles. I too am having difficulty opening JSON files in RapidMiner Studio. Help!