I have a simple text document that is a copy-and-paste of an article from a URL into the Create Document operator. I can easily get word frequencies and sentiment, but FP-Growth gives me an overwhelming result (every possible itemset). I have a feeling it's because everything is in a single text document, but I'm not sure how to restructure the data to get better results. Here is my process; thanks in advance for any suggestions:
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="9.3.001" expanded="true" height="68" name="Create Document" width="90" x="179" y="34">
<parameter key="text" value="Oct 11, 2021,07:15am EDT Why Simple Machine
Learning Models Are Key To Driving Business
Decisions By Tapojit Debnath Tapu, Co-founder &amp; CTO, Obviously AI. This article was co-written
with my colleague and fellow YEC member, Nirman Dave, CEO at Obviously
AI. Back in March of this year, MIT Sloan Management
Review made a sobering discovery: The majority of data science projects
in businesses are deemed failures. A staggering proportion of companies
are failing to obtain meaningful ROI from their data science projects. A
failure rate of 85% was reported by a Gartner Inc. analyst back in
2017, 87% was reported by VentureBeat in 2019 and 85.4% was reported by
Forbes in 2020. Despite the breakthroughs in data science and machine
learning (ML), despite the development of several data management
softwares and despite hundreds of articles and videos online, why is it
that production-ready ML models are just not hitting the
mark? People often attribute this to a lack of
appropriate data science talent and disorganized data, however, my
business partner and co-founder, Nirman Dave, and I were discussing this
recently, and we believe there is something more intricate at play
here. There are three key factors that hinder ML models from being
production-ready: 1. Volume: The rate at which raw data
is created 2. Scrubbing: The ability to make an
ML-ready dataset from raw input 3. Explainability: The
ability to explain how decisions are derived from complex ML models to
everyday non-technical business users Let’s start by
looking at volume, one of the first key bottlenecks in making
production-ready ML models. We know that the rate of data being
collected is growing exponentially. Given this increasing volume of
data, it becomes incredibly essential to deliver insights in real-time.
However, by the time insights are derived, there is already new raw data
that is collected, which may make existing insights obsolete.
Additionally, this is
topped with data scrubbing, the process of organizing, cleaning and
manipulating data to make it ML-ready. Given that data is distributed
across multiple storage solutions in different formats (i.e.,
spreadsheets, databases, CRMs), this step can be herculean in nature to
execute. A change as small as a new column in a spreadsheet might
require changes in the entire pipeline to account for
it. Moreover, once the models are built, explainability
becomes a challenge. Nobody likes to take orders from a computer unless
they are well explained. This is why it becomes critical that analysts
can explain how models make decisions to their business users without
being sucked into the technical details. Solving even
one of these problems can take an army and many businesses don’t have a
data science team or cannot scale one. However, it doesn’t need to be
this way. Imagine if all these problems were solved by simply changing
the way ML models are chosen. This is what I call the Tiny Model Theory.
Tiny Model Theory is the idea that you don’t need to
use heavy-duty ML models to carry out simple repetitive everyday
business predictions. In fact, by using more lightweight models (e.g.,
random forests, logistic regression, etc.) you can cut down on the time
you’d need for the aforementioned bottlenecks, decreasing your time to
value. Often, it’s easy for engineers to pick
complicated deep neural networks to solve problems. However, in my
experience as a CTO at one of the leading AI startups in the Bay Area,
most problems don’t need complicated deep neural networks. They can do
very well with tiny models instead — unlocking speed, reducing
complexity and increasing explainability. Let’s start
with speed. Since a significant portion of the project timeline gets
consumed by data preprocessing, data scientists have less time to
experiment with different types of models. As a result, they’ll
gravitate toward large models with complex architecture, hoping they’ll
be the silver bullet to their problems. However, in most business use
cases — like predicting churn, forecasting revenue, predicting loan
defaults, etc. — they only end up increasing time to value, giving a
diminishing return on time invested versus
performance. I find that it's akin to using a
sledgehammer to crack a nut. However, this is exactly where tiny models
can shine. Tiny models, like logistic regression, can train concurrently
by making use of distributed ML that parallel trains models across
different cloud servers. Tiny models require significantly less
computational power to train and less storage space. This is due to the
lack of complexity in their architecture. This lack of complexity makes
them ideal candidates for distributed ML. Some of the top companies
prefer simple models for their distributed ML pipeline involving edge
devices, like IOTs and smartphones. Federated machine learning, which is
based on edge-distributed ML, is quickly becoming popular
today. An average data scientist can easily identify
how a simple model like a decision tree is making a prediction. A
trained decision tree can be plotted to represent how individual
features contribute to making a prediction. This makes simple models
more explainable. They can also use an ensemble of trained simple
models, which takes an average of their predictions. This ensemble is
more likely to be accurate than a single, complex model. Instead of
having all your eggs in one basket, using an ensemble of simple models
distributes the risk of having an ML model with bad performance.
Simple models are much easier to implement today since
they’re more accessible. Models like logistic regression and random
forests have existed for much longer than neural nets, so they’re better
understood today. Popular low-code ML libraries, like SciKit Learn,
also helped lower the barrier of entry into ML, allowing one to
instantiate ML models using one line of code. Given how
crucial AI is becoming in business strategy, the number of companies
experimenting with AI will only go up. However, if businesses want to
gain a tangible competitive edge over others, I believe that simple ML
models are the only way to go. This doesn't mean complex models like
neural nets will go away — they’ll still be used for niche projects like
face recognition and cancer detection — but all businesses require
decision-making, and simple models are a better choice than complex
ones."/>
<parameter key="add label" value="false"/>
<parameter key="label_type" value="nominal"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="9.3.001" expanded="true" height="103" name="Process Documents" width="90" x="380" y="34">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Binary Term Occurrences"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="5"/>
<parameter key="prune_above_absolute" value="99"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize (2)" width="90" x="246" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="9.3.001" expanded="true" height="68" name="Filter Stopwords (English) (2)" width="90" x="380" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Length) (2)" width="90" x="514" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary" width="90" x="447" y="187">
<parameter key="resource_type" value="directory"/>
<parameter key="directory" value="/Users/kendafoe/Downloads/WordNet-3.0/dict"/>
</operator>
<operator activated="true" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (WordNet)" width="90" x="648" y="187">
<parameter key="allow_ambiguity" value="false"/>
<parameter key="keep_unmatched_stems" value="false"/>
<parameter key="keep_unmatched_tokens" value="false"/>
<parameter key="work_on_type_noun" value="true"/>
<parameter key="work_on_type_verb" value="true"/>
<parameter key="work_on_type_adjective" value="true"/>
<parameter key="work_on_type_adverb" value="true"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="9.3.001" expanded="true" height="68" name="Generate n-Grams (Terms) (2)" width="90" x="782" y="85">
<parameter key="max_length" value="2"/>
</operator>
<connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (English) (2)" to_port="document"/>
<connect from_op="Filter Stopwords (English) (2)" from_port="document" to_op="Filter Tokens (by Length) (2)" to_port="document"/>
<connect from_op="Filter Tokens (by Length) (2)" from_port="document" to_op="Stem (WordNet)" to_port="document"/>
<connect from_op="Open WordNet Dictionary" from_port="dictionary" to_op="Stem (WordNet)" to_port="dictionary"/>
<connect from_op="Stem (WordNet)" from_port="document" to_op="Generate n-Grams (Terms) (2)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms) (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="operator_toolbox:extract_sentiment" compatibility="2.12.000" expanded="true" height="103" name="SentiWordnet" width="90" x="514" y="34">
<parameter key="model" value="sentiwordnet"/>
<parameter key="text_attribute" value="text"/>
<parameter key="show_advanced_output" value="false"/>
<parameter key="use_default_tokenization_regex" value="true"/>
<list key="additional_words"/>
</operator>
<operator activated="true" class="numerical_to_binominal" compatibility="9.10.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="179" y="340">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="numeric"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="real"/>
<parameter key="block_type" value="value_series"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_series_end"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="min" value="0.0"/>
<parameter key="max" value="0.0"/>
</operator>
<operator activated="true" class="concurrency:fp_growth" compatibility="9.8.001" expanded="true" height="82" name="FP-Growth" width="90" x="313" y="391">
<parameter key="input_format" value="items in dummy coded columns"/>
<parameter key="item_separators" value="|"/>
<parameter key="use_quotes" value="false"/>
<parameter key="quotes_character" value="&quot;"/>
<parameter key="escape_character" value="\"/>
<parameter key="trim_item_names" value="true"/>
<parameter key="positive_value" value="true"/>
<parameter key="min_requirement" value="support"/>
<parameter key="min_support" value="0.5"/>
<parameter key="min_frequency" value="10"/>
<parameter key="min_items_per_itemset" value="1"/>
<parameter key="max_items_per_itemset" value="0"/>
<parameter key="max_number_of_itemsets" value="1000000"/>
<parameter key="find_min_number_of_itemsets" value="true"/>
<parameter key="min_number_of_itemsets" value="100"/>
<parameter key="max_number_of_retries" value="15"/>
<parameter key="requirement_decrease_factor" value="0.9"/>
<enumeration key="must_contain_list"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="SentiWordnet" to_port="exa"/>
<connect from_op="SentiWordnet" from_port="exa" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
<connect from_op="FP-Growth" from_port="example set" to_port="result 1"/>
<connect from_op="FP-Growth" from_port="frequent sets" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>
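To illustrate what I suspect is happening, here is a minimal pure-Python sketch (not RapidMiner, and just a naive itemset counter standing in for FP-Growth; the word sets are made-up examples): FP-Growth treats each row of the example set as one transaction, so with a single document there is only one transaction, every word and word combination has 100% support, and everything clears the min_support threshold.

```python
# Naive frequent-itemset counting over "transactions" (sets of words),
# illustrating why one document makes every itemset frequent.
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5, max_len=2):
    """Return {itemset: support} for itemsets meeting min_support.

    A brute-force stand-in for FP-Growth's output, for illustration only.
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            support = sum(1 for t in transactions if set(combo) <= t) / n
            if support >= min_support:
                frequent[combo] = support
    return frequent

# One document = one transaction: every word and word pair has support 1.0.
one_doc = [{"model", "data", "simple", "business"}]

# The same words split into one transaction per sentence: supports now
# differ, and min_support actually filters something out.
per_sentence = [
    {"model", "data"},
    {"model", "simple"},
    {"data", "business"},
]

print(len(frequent_itemsets(one_doc)))       # 10 itemsets (4 words + 6 pairs), all at support 1.0
print(len(frequent_itemsets(per_sentence)))  # 2 itemsets ({"model"} and {"data"})
```

This suggests the fix is upstream of FP-Growth: split the article into many examples (e.g. one sentence or paragraph per row) before Process Documents, so that support varies across rows.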