Performing Principal Component Analysis of a set of tweets.
Hello! First and foremost, I apologize if this topic has already been covered elsewhere. I have spent a considerable amount of time looking for a method.
I have found two social science studies that used RapidMiner to run PCA on text data. Each displayed a table showing which words had the highest eigenvalue for a particular factor. I am interested in learning how to do this, but so far I have been frustrated by the lack of a documented process or steps. I also wonder whether it is so elementary that nobody has bothered to write the process up.
To be more specific, I am interested in analyzing an Excel file containing 2000 tweets (for starters). Thank you in advance for your sincere assistance!
Best Answer
-
Well, without reading the whole thing, it's hard to figure out exactly what they did.
I suspect it was something like this:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
<parameter key="connection" value="Twitter Connection"/>
<parameter key="query" value="iphone"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
<list key="function_descriptions">
<parameter key="label" value="&quot;iPhone&quot;"/>
</list>
</operator>
<operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter (2)" width="90" x="112" y="136">
<parameter key="connection" value="Twitter Connection"/>
<parameter key="query" value="samsung"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="380" y="136">
<list key="function_descriptions">
<parameter key="label" value="&quot;samsung&quot;"/>
</list>
</operator>
<operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="103" name="Append" width="90" x="581" y="34"/>
<operator activated="true" class="set_role" compatibility="7.2.003" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
<parameter key="attribute_name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.2.003" expanded="true" height="82" name="Nominal to Text" width="90" x="849" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.2.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="983" y="34">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="principal_component_analysis" compatibility="7.2.003" expanded="true" height="103" name="PCA" width="90" x="1117" y="34"/>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Search Twitter (2)" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="PCA" to_port="example set input"/>
<connect from_op="PCA" from_port="example set output" to_port="result 2"/>
<connect from_op="PCA" from_port="preprocessing model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
That said, I'm a bit cautious about the 100% accuracy of their model.
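For anyone who wants to see what the PCA step does to a word matrix, here is a minimal sketch in plain Python rather than RapidMiner. It is not the method from the studies: the four toy tweets, the function names `word_matrix` and `first_component`, and the use of raw word counts with power iteration are all assumptions made for illustration. One point of terminology it makes concrete: the eigenvalue belongs to the component as a whole, while the per-word numbers that such tables usually report are the loadings (the entries of the eigenvector).

```python
import math
import re


def word_matrix(documents):
    """Build a document-by-word count matrix and its sorted vocabulary."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix


def first_component(matrix, iterations=200):
    """Return (eigenvalue, loadings) of the first principal component.

    Uses power iteration on the sample covariance matrix of the word columns.
    """
    n_docs, n_words = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_docs for j in range(n_words)]
    centered = [[row[j] - means[j] for j in range(n_words)] for row in matrix]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n_docs - 1) for j in range(n_words)]
        for i in range(n_words)
    ]
    # Deterministic, non-uniform start vector (an arbitrary choice).
    vector = [float(j + 1) for j in range(n_words)]
    for _ in range(iterations):
        product = [
            sum(cov[i][j] * vector[j] for j in range(n_words)) for i in range(n_words)
        ]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient: variance captured along the unit-length component.
    eigenvalue = sum(
        vector[i] * cov[i][j] * vector[j]
        for i in range(n_words)
        for j in range(n_words)
    )
    return eigenvalue, vector


# Hypothetical toy tweets standing in for the real data.
tweets = [
    "iphone camera great",
    "iphone camera sharp",
    "samsung screen great",
    "samsung screen bright",
]
vocabulary, matrix = word_matrix(tweets)
eigenvalue, loadings = first_component(matrix)

# Rank words by the absolute size of their loading on the first component.
ranked = sorted(zip(vocabulary, loadings), key=lambda pair: abs(pair[1]), reverse=True)
for word, loading in ranked:
    print(f"{word:10s} {loading:+.3f}")
```

The sign of a component is arbitrary, so only the relative signs matter: words with opposite signs pull the component in opposite directions. This pure-Python version is only meant to show the idea on a handful of tweets; for 2000 tweets a proper tool (RapidMiner's PCA operator or a numerical library) is the realistic choice.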
Answers
-
Can you provide a link to where this was done? My initial thought is that the text was transformed into word vectors using TF-IDF or something similar.
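To make the word-vector idea concrete, here is a small sketch of TF-IDF in plain Python. It is an assumption-laden illustration, not what the studies or RapidMiner actually compute: the toy tweets and the helper names `tokenize` and `tfidf_vectors` are made up, and the weighting is the plain textbook form (term frequency times the log of inverse document frequency), so the numbers will not necessarily match RapidMiner's own TF-IDF output.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, rows) with one TF-IDF row per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(documents)
    # Document frequency: how many documents contain each word at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, rows


# Hypothetical toy tweets standing in for the real data.
tweets = [
    "new iphone camera is great",
    "samsung screen is great",
    "iphone battery is bad",
]
vocabulary, rows = tfidf_vectors(tweets)

for tweet, row in zip(tweets, rows):
    weights = {w: round(x, 3) for w, x in zip(vocabulary, row) if x > 0}
    print(tweet, "->", weights)
```

Each tweet becomes one numeric row with one column per word, which is the kind of table a PCA operator expects as input. A word that occurs in every tweet (here "is") gets weight zero, because it does nothing to distinguish one tweet from another.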
1 -
Hello! Here is one article that claims to do it. I apologize that I cannot provide the whole article, but to quote the relevant portion:
"We separated China from Philippine news reports, then extracted principal components from our two separate sets-of-words. This procedure is intuitively similar to what principal components analysis does to quantified variables." (Montiel et al., 2014)
0 -
Thank you! I will attempt to make sense of this.