TF-IDF calculation

cncha · 2025-05-30T10:37:57+00:00

Altair Community Member

May 30, 2025

I'm trying to understand how the "Generate TF-IDF" operator calculates TF-IDF values. Please let me know the formulas RapidMiner uses for its TF and IDF calculations. Unfortunately, the old RapidMiner community pages no longer seem to be available.

Zhanna Abidor_20687

Altair Employee

Jun 11, 2025

Hello,

Can you clarify which formula you have in mind? Are you seeing a specific result that does not match your expectations?

Term Frequency (TF):

Definition: TF measures how often a term appears in a specific document.
Calculation:
- Count the number of times a term appears in a document.
- Divide this count by the total number of terms in the document.
Formula: tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Inverse Document Frequency (IDF):

Definition: IDF measures how rare a term is across the entire corpus (collection of documents).
Calculation:
- Divide the total number of documents in the corpus (N) by the number of documents containing the term.
- Take the logarithm of that ratio.
Formula: idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.

TF-IDF Calculation:

Formula: tf-idf(t, d) = tf(t, d) * idf(t)
Interpretation: A higher TF-IDF score indicates that a term is frequent in the current document and rare across the corpus, making it a good indicator of the document's topic or content.
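
The three formulas above can be sketched in a few lines of Python. This is only an illustration of the textbook definitions quoted in this reply, not RapidMiner's implementation: the natural logarithm, the function names, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    # tf(t, d): occurrences of the term divided by the total number of terms in d.
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    # idf(t) = log(N / df(t)). Natural log is an assumption; the base is not
    # stated in the thread. Raises ZeroDivisionError if no document has the term.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tf_idf(term, doc_tokens, corpus):
    # tf-idf(t, d) = tf(t, d) * idf(t)
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus, already tokenized.
corpus = [
    ["data", "mining", "book"],
    ["text", "mining", "data", "mining"],
    ["cooking", "book"],
]

print(tf("mining", corpus[1]))               # 2 of 4 tokens
print(idf("mining", corpus))                 # log(3 / 2)
print(tf_idf("cooking", corpus[2], corpus))  # (1 / 2) * log(3 / 1)
```

Tools often differ from this baseline in the logarithm base, in whether idf is smoothed, and in whether the resulting vectors are normalized, so matching numbers exactly usually requires knowing those choices.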

All the conversations from the RapidMiner community were transferred to the Altair community.

You can find some general information about the operator in the documentation:

https://docs.rapidminer.com/2025.1/studio/operators/blending/attributes/generation/generate_tfidf.html

Can you please provide more details on how general terms are reflected in the RapidMiner calculations you are seeing, and which version of the product and operator you are running?

You can submit customer support questions to support.altair.com as well.

cncha

Altair Community Member

Jun 12, 2025

Thank you for your kind reply. I'm testing TF-IDF calculation using two examples: Doc1-"This is a book on data mining." and Doc2-"This book describes data mining and text mining using Rapidminer."

The TF-IDF outputs from the "Process Documents from Data" and "Generate TFIDF" operators are different, as shown below. The sub-process of Process Documents consists of Tokenize, Stopwords, Filter by length and Stem (Porter). I want to know why the two operators differ…
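
As a baseline for comparison, the textbook formulas quoted earlier in the thread can be applied to these two sentences directly. The sketch below assumes plain lowercase word tokenization with no stop-word removal, length filter, or stemming, and a natural logarithm; it does not reproduce what either operator does internally.

```python
import math
import re
from collections import Counter

docs = {
    "Doc1": "This is a book on data mining.",
    "Doc2": "This book describes data mining and text mining using Rapidminer.",
}

# Assumption: lowercase and keep runs of letters only; no stop words removed,
# no length filter, no stemming.
tokens = {name: re.findall(r"[a-z]+", text.lower()) for name, text in docs.items()}

n_docs = len(tokens)
# Document frequency: in how many documents each term occurs at least once.
df = Counter(term for toks in tokens.values() for term in set(toks))


def tf_idf(term, name):
    # (count / document length) * log(N / df), natural log assumed.
    toks = tokens[name]
    return (toks.count(term) / len(toks)) * math.log(n_docs / df[term])


for name in tokens:
    scores = {t: round(tf_idf(t, name), 4) for t in sorted(set(tokens[name]))}
    print(name, scores)
```

With only two documents, every term that occurs in both ("this", "book", "data", "mining") gets idf = log(2 / 2) = 0 under this formula, so only terms unique to one document receive a non-zero score. Any tool that smooths idf or normalizes the document vectors will report different numbers.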

Zhanna Abidor_20687

Altair Employee

Jun 12, 2025

It would help if you provided the process you are using, as it might need some additional operators/steps to achieve different results.

We can investigate it with our data science community and get back to you with explanations or suggestions.

cncha

Altair Community Member

Accepted Answer

Jun 13, 2025

Thank you for your kindness. The process is shown below.

Nicholas_21406

Altair Employee

Jun 18, 2025

Hi @cncha, thank you for the image of your process. Could you please export the process itself to a .rmp file? You may then need to rename the .rmp file to a .txt file in order to attach it here.

Could you also please attach any input files, such as the xlsx file that the process is presumably reading?

Thank you.

cncha

Altair Community Member

Jun 19, 2025

TextProcessing(TFIDF).xlsx

TextProcessing(TFIDF).txt

Please find the process and Excel files attached.