TF-IDF calculation

User: "cncha"
Altair Community Member

I'm trying to understand how the "Generate TF-IDF" operator calculates TF-IDF values. Please let me know the formulas RapidMiner uses for its TF and IDF calculations. Unfortunately, the old RapidMiner community page no longer seems to be available.

    Hello,

    Can you clarify what formula you have in mind? Are you running into a specific result that does not match your expectations?

    Term Frequency (TF):

    • Definition: TF measures how often a term appears in a specific document. 
    • Calculation:
      • Count the number of times a term appears in a document. 
      • Divide this count by the total number of terms in the document. 
    • Formula: tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d) 

    Inverse Document Frequency (IDF):

    • Definition: IDF measures how rare a term is across the entire corpus (collection of documents). 
    • Calculation:
      • Divide the total number of documents in the corpus (N) by the number of documents containing the term.
      • Take the logarithm of that ratio.
    • Formula: idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.

    TF-IDF Calculation:

    • Formula: tf-idf(t, d) = tf(t, d) * idf(t) 
    • Interpretation: A higher TF-IDF score indicates that a term is frequent in the current document and rare across the corpus, making it a good indicator of the document's topic or content. 
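
    The textbook formulas above can be sketched in a few lines of Python. This is only an illustration of those general definitions, not a description of what the RapidMiner operators do internally; the function names, the natural logarithm, and the choice to return 0 for a term absent from the corpus are all assumptions made for the sketch.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Relative term frequency: occurrences of term / total tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus):
    """log(N / df(t)). Natural log is an assumption; other bases only rescale the values."""
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    if df == 0:
        # Assumption: a term that appears in no document gets weight 0.
        return 0.0
    return math.log(len(corpus) / df)


def tf_idf(term, doc_tokens, corpus):
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)
```

    Here each document is a list of tokens and the corpus is a list of such lists. A term that occurs in every document gets idf = log(1) = 0, so its TF-IDF is 0 regardless of how often it appears.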

    All the conversations from RapidMiner community were transferred to the Altair community.

    You can find some general information about the operator in the documentation:

    https://docs.rapidminer.com/2025.1/studio/operators/blending/attributes/generation/generate_tfidf.html

    Can you please clarify, or provide more details on, how the general terms are reflected in the RapidMiner calculations, and which version of the product and operator you are running?

    You can submit customer support questions to support.altair.com as well.

    User: "cncha"
    Altair Community Member
    OP
    TFIDF.png

    Thank you for your kind reply. I'm testing TF-IDF calculation using two examples: Doc1-"This is a book on data mining." and Doc2-"This book describes data mining and text mining using Rapidminer."

    The TF-IDF outputs from the "Process Documents from Data" and "Generate TFIDF" operators are different, as shown below. The sub-process of Process Documents consists of Tokenize, Stopwords, Filter by length and Stem (Porter). I would like to know the difference between the two operators…
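
    As a reference point for comparing the two operator outputs, the two example sentences can be scored by hand with the textbook formula tf(t, d) * log(N / df(t)). This is a sketch under stated assumptions: tokens are lowercased and split on non-letters, no stop-word removal, length filtering or stemming is applied, and the natural logarithm is used. It does not claim to reproduce either RapidMiner operator.

```python
import math
import re
from collections import Counter

docs = {
    "Doc1": "This is a book on data mining.",
    "Doc2": "This book describes data mining and text mining using Rapidminer.",
}


def tokenize(text):
    # Assumption: lowercase, letters only; no stop words, length filter, or stemming.
    return re.findall(r"[a-z]+", text.lower())


tokens = {name: tokenize(text) for name, text in docs.items()}
n_docs = len(tokens)

# Document frequency: in how many documents each term occurs at least once.
df = Counter(term for toks in tokens.values() for term in set(toks))

scores = {}
for name, toks in tokens.items():
    counts = Counter(toks)
    scores[name] = {
        term: (count / len(toks)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

for name, row in scores.items():
    print(name, {term: round(value, 4) for term, value in sorted(row.items())})
```

    With only two documents, every term that occurs in both ("this", "book", "data", "mining") has idf = log(2/2) = 0 under this formula, so only the terms unique to one document receive a non-zero score.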

    It would help if you provided the process you are using, as it might need some additional operators/steps to achieve different results.

    We can investigate it with our data science community and get back to you with explanations or suggestions.

    User: "cncha"
    Altair Community Member
    OP
    Accepted Answer

    Thank you for your kindness. The process is shown below.

    Rapidminer_Process_TFIDF.png

    Hi @cncha, thank you for the image of your process. Could you please export the process itself to an .rmp file? You may then need to rename the .rmp file to a .txt file in order to attach it here.

    Could you also please attach any input files, such as the xlsx file that the process is presumably reading?

    Thank you.

    User: "cncha"
    Altair Community Member
    OP

    Please find the process and Excel files attached.