Performing FP-Growth on a .txt file

kdafoe New Altair Community Member
edited November 5 in Community Q&A
I have a simple text file that is a copy and paste of the text from a URL into the Create Document operator. I can easily get word frequencies and sentiment, but FP-Growth gives me an overwhelming result (every item and itemset comes back as frequent). I suspect this is because the input is a single text document, but I'm not sure how to restructure the file to get better results. Here is my process, and thanks for any suggestions:

<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="9.3.001" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
        <parameter key="text" value="Oct 11, 2021,07:15am EDT&#10;Why Simple Machine Learning Models Are Key To Driving Business Decisions&#10;YEC&#10;YECCOUNCIL POST| Membership (fee-based)&#10;Entrepreneurs&#10;&#10;By Tapojit Debnath Tapu, Co-founder &amp; CTO, Obviously AI.&#10;&#10;getty&#10;&#10;This article was co-written with my colleague and fellow YEC member, Nirman Dave, CEO at Obviously AI.&#10;&#10;Back in March of this year, MIT Sloan Management Review made a sobering discovery: The majority of data science projects in businesses are deemed failures. A staggering proportion of companies are failing to obtain meaningful ROI from their data science projects. A failure rate of 85% was reported by a Gartner Inc. analyst back in 2017, 87% was reported by VentureBeat in 2019 and 85.4% was reported by Forbes in 2020. Despite the breakthroughs in data science and machine learning (ML), despite the development of several data management softwares and despite hundreds of articles and videos online, why is it that production-ready ML models are just not hitting the mark?&#10;&#10;People often attribute this to a lack of appropriate data science talent and disorganized data, however, my business partner and co-founder, Nirman Dave, and I were discussing this recently, and we believe there is something more intricate at play here. There are three key factors that hinder ML models from being production-ready:&#10;&#10;1. Volume: The rate at which raw data is created&#10;&#10;2. Scrubbing: The ability to make an ML-ready dataset from raw input&#10;&#10;3. Explainability: The ability to explain how decisions are derived from complex ML models to everyday non-technical business users&#10;&#10;Let’s start by looking at volume, one of the first key bottlenecks in making production-ready ML models. We know that the rate of data being collected is growing exponentially. Given this increasing volume of data, it becomes incredibly essential to deliver insights in real-time. 
However, by the time insights are derived, there is already new raw data that is collected, which may make existing insights obsolete. &#10;MORE FOR YOU&#10;‘We Can Control Our Own Destiny’: John Zimmer Shares Lyft’s Vision For The Company’s Future And $1 Trillion Market Opportunity&#10;The LSE Alumni Turning Their University Into A Startup Powerhouse&#10;Coinrule Bags Big-Name Investors For Its Automated Crypto Trading Platform&#10;&#10;Additionally, this is topped with data scrubbing, the process of organizing, cleaning and manipulating data to make it ML-ready. Given that data is distributed across multiple storage solutions in different formats (i.e., spreadsheets, databases, CRMs), this step can be herculean in nature to execute. A change as small as a new column in a spreadsheet might require changes in the entire pipeline to account for it.&#10;&#10;Moreover, once the models are built, explainability becomes a challenge. Nobody likes to take orders from a computer unless they are well explained. This is why it becomes critical that analysts can explain how models make decisions to their business users without being sucked into the technical details. &#10;&#10;Solving even one of these problems can take an army and many businesses don’t have a data science team or cannot scale one. However, it doesn’t need to be this way. Imagine if all these problems were solved by simply changing the way ML models are chosen. This is what I call the Tiny Model Theory. &#10;&#10;Tiny Model Theory is the idea that you don’t need to use heavy-duty ML models to carry out simple repetitive everyday business predictions. In fact, by using more lightweight models (e.g., random forests, logistic regression, etc.) you can cut down on the time you’d need for the aforementioned bottlenecks, decreasing your time to value.&#10;&#10;Often, it’s easy for engineers to pick complicated deep neural networks to solve problems. 
However, in my experience as a CTO at one of the leading AI startups in the Bay Area, most problems don’t need complicated deep neural networks. They can do very well with tiny models instead — unlocking speed, reducing complexity and increasing explainability. &#10;&#10;Let’s start with speed. Since a significant portion of the project timeline gets consumed by data preprocessing, data scientists have less time to experiment with different types of models. As a result, they’ll gravitate toward large models with complex architecture, hoping they’ll be the silver bullet to their problems. However, in most business use cases — like predicting churn, forecasting revenue, predicting loan defaults, etc. — they only end up increasing time to value, giving a diminishing return on time invested versus performance.&#10;&#10;I find that it's akin to using a sledgehammer to crack a nut. However, this is exactly where tiny models can shine. Tiny models, like logistic regression, can train concurrently by making use of distributed ML that parallel trains models across different cloud servers. Tiny models require significantly less computational power to train and less storage space. This is due to the lack of complexity in their architecture. This lack of complexity makes them ideal candidates for distributed ML. Some of the top companies prefer simple models for their distributed ML pipeline involving edge devices, like IOTs and smartphones. Federated machine learning, which is based on edge-distributed ML, is quickly becoming popular today.&#10;&#10;An average data scientist can easily identify how a simple model like a decision tree is making a prediction. A trained decision tree can be plotted to represent how individual features contribute to making a prediction. This makes simple models more explainable. They can also use an ensemble of trained simple models, which takes an average of their predictions. 
This ensemble is more likely to be accurate than a single, complex model. Instead of having all your eggs in one basket, using an ensemble of simple models distributes the risk of having an ML model with bad performance. &#10;&#10;Simple models are much easier to implement today since they’re more accessible. Models like logistic regression and random forests have existed for much longer than neural nets, so they’re better understood today. Popular low-code ML libraries, like SciKit Learn, also helped lower the barrier of entry into ML, allowing one to instantiate ML models using one line of code.&#10;&#10;Given how crucial AI is becoming in business strategy, the number of companies experimenting with AI will only go up. However, if businesses want to gain a tangible competitive edge over others, I believe that simple ML models are the only way to go. This doesn't mean complex models like neural nets will go away — they’ll still be used for niche projects like face recognition and cancer detection — but all businesses require decision-making, and simple models are a better choice than complex ones.&#10;YEC&#10;YEC&#10;&#10;    Print&#10;    Reprints &amp; Permissions&#10;&#10;"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="nominal"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_absolute" value="5"/>
        <parameter key="prune_above_absolute" value="99"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="34">
            <parameter key="transform_to" value="lower case"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English) (2)" width="90" x="380" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length) (2)" width="90" x="514" y="34">
            <parameter key="min_chars" value="4"/>
            <parameter key="max_chars" value="25"/>
          </operator>
          <operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary" width="90" x="447" y="187">
            <parameter key="resource_type" value="directory"/>
            <parameter key="directory" value="/Users/kendafoe/Downloads/WordNet-3.0/dict"/>
          </operator>
          <operator activated="true" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (WordNet)" width="90" x="648" y="187">
            <parameter key="allow_ambiguity" value="false"/>
            <parameter key="keep_unmatched_stems" value="false"/>
            <parameter key="keep_unmatched_tokens" value="false"/>
            <parameter key="work_on_type_noun" value="true"/>
            <parameter key="work_on_type_verb" value="true"/>
            <parameter key="work_on_type_adjective" value="true"/>
            <parameter key="work_on_type_adverb" value="true"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms) (2)" width="90" x="782" y="85">
            <parameter key="max_length" value="2"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (English) (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (English) (2)" from_port="document" to_op="Filter Tokens (by Length) (2)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length) (2)" from_port="document" to_op="Stem (WordNet)" to_port="document"/>
          <connect from_op="Open WordNet Dictionary" from_port="dictionary" to_op="Stem (WordNet)" to_port="dictionary"/>
          <connect from_op="Stem (WordNet)" from_port="document" to_op="Generate n-Grams (Terms) (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms) (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="operator_toolbox:extract_sentiment" compatibility="2.12.000" expanded="true" height="103" name="SentiWordnet" width="90" x="514" y="34">
        <parameter key="model" value="sentiwordnet"/>
        <parameter key="text_attribute" value="text"/>
        <parameter key="show_advanced_output" value="false"/>
        <parameter key="use_default_tokenization_regex" value="true"/>
        <list key="additional_words"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="9.10.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="179" y="340">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="0.0"/>
      </operator>
      <operator activated="true" class="concurrency:fp_growth" compatibility="9.8.001" expanded="true" height="82" name="FP-Growth" width="90" x="313" y="391">
        <parameter key="input_format" value="items in dummy coded columns"/>
        <parameter key="item_separators" value="|"/>
        <parameter key="use_quotes" value="false"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="trim_item_names" value="true"/>
        <parameter key="positive_value" value="true"/>
        <parameter key="min_requirement" value="support"/>
        <parameter key="min_support" value="0.5"/>
        <parameter key="min_frequency" value="10"/>
        <parameter key="min_items_per_itemset" value="1"/>
        <parameter key="max_items_per_itemset" value="0"/>
        <parameter key="max_number_of_itemsets" value="1000000"/>
        <parameter key="find_min_number_of_itemsets" value="true"/>
        <parameter key="min_number_of_itemsets" value="100"/>
        <parameter key="max_number_of_retries" value="15"/>
        <parameter key="requirement_decrease_factor" value="0.9"/>
        <enumeration key="must_contain_list"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="SentiWordnet" to_port="exa"/>
      <connect from_op="SentiWordnet" from_port="exa" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="example set" to_port="result 1"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>
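
A note on the likely cause: with the whole article in one document, Process Documents produces a single example, so every term (and every combination of terms) appears in 100% of the rows and passes any min_support threshold. The sketch below illustrates that effect with plain brute-force support counting in Python. It is not RapidMiner's FP-Growth operator and does not build an FP-tree; the toy word sets and the helper names `itemset_support` and `frequent_itemsets` are made up for illustration.

```python
from itertools import combinations


def itemset_support(transactions, max_size=2):
    """Return {itemset: support} for every itemset of up to max_size items."""
    n = len(transactions)
    counts = {}
    for items in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] = counts.get(combo, 0) + 1
    # Support = fraction of transactions (rows) containing the itemset.
    return {itemset: count / n for itemset, count in counts.items()}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep only the itemsets whose support reaches min_support."""
    supports = itemset_support(transactions, max_size)
    return {k: s for k, s in supports.items() if s >= min_support}


# Hypothetical toy data: the whole text as ONE document (one transaction).
single_doc = [{"data", "model", "science", "business"}]

# The same words after splitting the text into sentences (four transactions).
split_docs = [
    {"data", "science"},
    {"data", "model"},
    {"model", "business"},
    {"data", "model"},
]

# One row: all 4 single items and all 6 pairs have support 1.0.
print(len(frequent_itemsets(single_doc, 0.5)))   # 10

# Several rows: only itemsets that really co-occur often survive.
print(sorted(frequent_itemsets(split_docs, 0.5)))
# [('data',), ('data', 'model'), ('model',)]
```

In the process itself, the analogous change would presumably be to split the text into several documents (for example one per sentence or paragraph) before Process Documents, so that FP-Growth sees many rows instead of one.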

Answers

  • kdafoe
    kdafoe New Altair Community Member
    edited December 2021
    Ignore this answer. My mistake.