Thank you for your kind reply. I'm testing TF-IDF calculation using two examples: Doc1-"This is a book on data mining." and Doc2-"This book describes data mining and text mining using Rapidminer."
The TF-IDF outputs from the "Process Documents from Data" and "Generate TFIDF" operators are different, as shown below. The sub-process of Process Documents consists of Tokenize, Stopwords, Filter by length and Stem(Porter). I want to know the difference between the two operators…
Hi @cncha, thank you for the image of your process. Could you please export the process itself to a .rmp file? You may then need to rename the .rmp file to a .txt file in order to attach it here.
Could you also please attach any input files, such as the xlsx file that the process is presumably reading?
Thank you.
Hello,
Can you clarify what formula you have in mind? Are you running into a specific result that does not satisfy the expectations?
Term Frequency (TF):
tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Inverse Document Frequency (IDF):
idf(t) = log(N / df(t))
where N is the total number of documents and df(t) is the number of documents containing term t.
TF-IDF Calculation:
tf-idf(t, d) = tf(t, d) * idf(t)
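To make the formulas above concrete, here is a minimal Python sketch that applies them to the two example documents from the question. It is not a reproduction of either RapidMiner operator: the tokenizer (lowercase, letters only), the absence of stopword filtering and stemming, and the use of the natural logarithm are all assumptions made for illustration, and the helper names `tokenize` and `tf_idf` are hypothetical.

```python
import math
import re
from collections import Counter

# The two example documents quoted in the question.
docs = {
    "Doc1": "This is a book on data mining.",
    "Doc2": "This book describes data mining and text mining using Rapidminer.",
}


def tokenize(text):
    # Assumption: lowercase the text and keep runs of letters only.
    # No stopword removal, length filter, or stemming is applied here.
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return {doc_name: {term: tf(t, d) * idf(t)}} using the plain formulas."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)

    # df(t): number of documents that contain term t at least once.
    df = Counter()
    for terms in tokens.values():
        df.update(set(terms))

    result = {}
    for name, terms in tokens.items():
        counts = Counter(terms)
        total = len(terms)
        result[name] = {
            # tf = count / total terms; idf = ln(N / df), natural log assumed.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return result


scores = tf_idf(docs)
for name, weights in scores.items():
    print(name)
    for term, weight in sorted(weights.items()):
        print(f"  {term}: {weight:.4f}")
```

One thing this sketch makes visible: with only two documents, any term that occurs in both (for example "book", "data", "mining") has idf = log(2 / 2) = 0, so its TF-IDF is 0 under these formulas. If an operator reports nonzero values for such terms, it is presumably using a different variant (for instance smoothing, a different term-frequency definition, or vector normalization), which would need to be checked against the operator documentation.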
All the conversations from the RapidMiner community were transferred to the Altair community.
You can find some general information about the operator in the documentation:
https://docs.rapidminer.com/2025.1/studio/operators/blending/attributes/generation/generate_tfidf.html
Can you please clarify or provide more details on how general terms are reflected in RapidMiner calculations, and which version of the product and operator you are running?
You can submit customer support questions to support.altair.com as well.