Will Generative AI
Replace the Need for Data Analysts?
No. But it will redefine the
data analyst role.
By Galen Okazaki in TDS
May 24, 2023
Since ChatGPT’s release in
November of 2022, speculation has grown over whether or not the role of a data
analyst could eventually be replaced by generative AI (ChatGPT, Bard,
and Bing Chat are among the large language models included in this
classification). Much of this speculation is fueled by the ability
of these large language models (LLMs) to write code.
As someone who has been in
the data analysis field for the majority of my professional career,
understanding the impact of generative AI in our field is something that has
definitely piqued my interest. Giving in to curiosity, I have since spent a
fair amount of time assessing the current capabilities of generative AI within
the context of data analysis.
In this article, I summarize
and share my findings with you as I believe generative AI will have a
significant role in data analysis work going forward. Furthermore,
I believe that it is imperative for the data analyst community to understand
the profound impact it will have on not only their field but the business
landscape as a whole.
Where
We Stand Today
At this point, we know that
generative AI can write SQL, Python, and R code. We can also assume the
efficiency of the code it produces will only get better over time with
continuous fine-tuning. But that’s just the start.
At the end of March (2023),
OpenAI released a ChatGPT plugin called Code Interpreter. If you are one of the few
who currently have access to the Alpha version, you can upload data files into
it and invoke Python to perform regression analysis and descriptive analysis,
look for patterns in your data and even create visualizations. All without
having to write or even know a line of Python code! Esteemed Wharton
School of Business professor Ethan Mollick has a nice write-up on
this.
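To make that concrete, here is a minimal sketch of the kind of Python such a tool might run behind the scenes for a descriptive summary and a simple regression. It is not Code Interpreter's actual output: the data, the column names (ad_spend, sales) and the helper functions are all made up for illustration, and it sticks to the standard library.

```python
import csv
import io
import statistics

# Hypothetical data standing in for an uploaded CSV file.
CSV_TEXT = """ad_spend,sales
10,25
20,45
30,65
40,85
"""

def load_columns(text):
    """Parse CSV text into a dict of column name -> list of floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [float(r[name]) for r in rows] for name in rows[0]}

def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

columns = load_columns(CSV_TEXT)
summary = describe(columns["sales"])
slope, intercept = fit_line(columns["ad_spend"], columns["sales"])
print(summary)
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

The point is less the code itself than the fact that you would never see or write it: you would simply ask for a summary and a trend line in plain English.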
So there you have it. The
ability to load, analyze, and present data without writing a stitch of code.
Game over, yes? Not so fast.
As incredibly impressive as
these capabilities are, there are some significant limitations to Code
Interpreter that are indicative of some of the challenges that generative AI
would have in taking over the data analysis industry.
First, it requires the
upload of ONE table. One two-dimensional CSV file (currently limited to 100
MB). The size limitation aside, imagine being tasked with building one table
with all of your company’s data…
I could probably stop there,
but let's go on.
With your one table in hand,
you now have to get approval to push that table, with ALL of your company’s
data, outside of your company’s firewall and into an LLM that your company has
no control over…
We can probably stop there.
The current alternative (more
on this later) to the above would be that your company builds its own
LLM. While theoretically possible, the complexities of training and
fine-tuning the model, the expertise required and the enormous costs of doing
so would only make that cost-effective for an extremely short list of
companies.
But for the sake of
understanding, let's take a step back and imagine your company is on that list.
But first, let’s start with
some perspective. If we look back to the introduction of business intelligence
tools in the early 2000s, the great value of those tools lay in enabling
non-technical, line-of-business people to leverage their domain knowledge:
they could select, analyze and present data without writing a stitch of
code. Sound familiar?
Providing user-friendly means to analyze data is nothing new. It will
always have incredible value. Indeed, it is a multi-billion dollar industry
that continues to grow. However, these tools have no use without domain
knowledge. This applies to any data analysis, regardless of the tool(s)
being used. Even if it's generative AI. Without domain knowledge, we do not
know what questions to ask of our data. And even if the questions were provided
to us, how do we interpret our findings?
And in my view, the greatest
value of data analysis work lies in its ability to answer ad hoc questions.
Unforeseen, mission-critical questions. Complex, multi-layered, nonlinear types
of questions. Answering these questions requires domain knowledge.
For example, why did sales
on our best-selling product just drop off a cliff? Our primary supplier just
went out of business, what do we do? Why did our customer churn rate double
last month? These are not straightforward types of questions that can follow an
established decision tree.
What these few examples have in common is that they require immediate
answers to situational questions that have never been asked before. And that is
really the key. If you understand how generative AI is constructed, its inability
to answer questions of this nature is the Achilles’ heel that keeps it from ever
fully replacing data analysts.
To briefly summarize,
generative AI utilizes existing data sets to ‘train’ an LLM to generate a
probability-driven answer based on whatever training data it has been fed. And
while you can continuously fine-tune your model with ever more precise data
sets, how would you train your model on multi-layered, situational questions
that have never been asked before?
It would be analogous to you
starting a new job as a data analyst in an industry that you are not yet
familiar with. And on day one, you are asked to urgently answer one of the
questions above. Where would you even start? What data would you pull? How
would you even know all of the potential variables you would need to
consider? And, even if you could somehow derive an answer, how would you know
if it is correct?
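The "probability-driven answer" idea described above can be sketched with a toy example. This is a deliberately tiny stand-in and not how an LLM is actually implemented (real models are neural networks trained on vastly more data); the training text and function names are made up for illustration. It only shows the basic principle: answers are drawn from probabilities learned from what the model has already seen.

```python
from collections import Counter, defaultdict

# Made-up training text; a real model is trained on vastly more data.
TRAINING_TEXT = "sales rose in spring sales rose in summer sales fell in winter"

def train_bigrams(text):
    """Count, for each word, which words followed it in the training text."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_probabilities(counts, word):
    """Turn raw follow-up counts into probabilities; empty if word is unseen."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: n / total for w, n in followers.items()}

model = train_bigrams(TRAINING_TEXT)
seen = next_word_probabilities(model, "sales")
unseen = next_word_probabilities(model, "churn")
print(seen)
print(unseen)
```

For a word that appears in the training text, the toy model returns a distribution over what came next. For "churn", which it has never seen, it has nothing to draw on.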
It is for these reasons that
I don’t foresee the role of data analyst ever being fully replaced by
generative AI. However… generative AI, in its current state, already has many
uses in the data analysis field and those uses will only continue to expand
with ever-increasing functionality.
Current Potential Uses for
Generative AI in Data Analysis
As of today, the highest and best use of generative AI in the data
analysis field is its ability to both write code and, in turn, explain the code
it writes (which it does quite well). I’ve personally used it to help me write
and understand Python code.
For those of you who are looking to enter the data analysis field, I
cannot encourage you enough to take advantage of generative AI to help you
learn to code. It would have greatly sped up my learning when I was
first cutting my teeth in this field.
In another truly exciting
development for data analysts, generative AI has fueled the development of
dedicated coding tools. GitHub has released its Copilot product, which can suggest
coding solutions and improvements in real time as you write your code!
Earlier in this article, I
referred to the potential hurdles companies would face in building their own
LLMs. There is possibly one new alternative to that: Databricks has recently
released an open-source LLM called ‘Dolly’. In theory,
this could solve the issues of cost (being open source) and having to push your
data outside of your company’s firewall. It’s a smaller-scale LLM, more suited
for focused datasets.
I mention Dolly primarily
as an example of how quickly developments in the field of generative AI are
moving and as a heads-up to how they may affect the data analysis field going
forward.
As we have already seen, the
evolution of AI will only continue to progress at light speed.
Conclusion
There is no doubt in my mind
that generative AI will reshape the workflows in data analysis. Generally
speaking, repetitive types of tasks or even analyses will in time be performed
by generative AI. I could also see coding becoming more of a commodity, versus
being a highly developed skill.
Based on the above, I
believe that the
prototypical data analyst in the future will possess business line-level domain
knowledge combined with an ability to incorporate generative AI tools to help
them be more efficient and productive with their time.
Lastly, on a personal note,
I would encourage anyone reading this to embrace generative AI. Learn about it
and use it both in your personal and business lives. With new APIs and plugins
constantly being created, its reach and capabilities will only grow.
For better or worse.
Feel free to reach out to me
at galen.okazaki@vectordecisionsupport.com
And if you would like
to stay tuned for my future articles, please give me a follow on Medium.
Mahalo

Artwork Created by Author Using Midjourney