Time based term frequency analysis
dawidprozesky
New Altair Community Member
Hi, I explored RapidMiner a while ago, and have now returned with a specific analysis that I would like to carry out. I have a data set in Excel with the following columns:
Date (dd/mm/yyyy format) | Body of Text (text) | Publisher (name)
So each record in the data set relates to a specific body of text published at a specific date, and the name of the publisher.
My end goal is to identify words/terms in the texts which started occurring after a given date (e.g. after 1 January 2010), and to see the frequencies of these identified words/terms over time (can be per year) after the given date.
My current config is: Read Excel - Nominal to Text - Process Documents from Data (tokenizing, filtering and transforming) - Wordlist to Data
I am very new to RapidMiner, so any assistance would be really appreciated!!
Answers
You will probably want to preprocess your date/time data before the text analysis, to make the later comparisons easier. Try Date to Numerical to extract the month or year from each date. Then, once you have generated your word counts, you can aggregate them by the appropriate time window.
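If it helps to see the logic outside RapidMiner, here is a minimal Python sketch of the same idea: pull the year out of each date, then count terms per year. This is only an illustration, not a RapidMiner process. The sample rows, the `tokenize` helper and the `term_counts_by_year` function are made up for the example, and the tokenizer is much simpler than what Process Documents from Data does.

```python
from collections import Counter, defaultdict
from datetime import datetime
import re

# Hypothetical sample rows mirroring the columns in the question:
# (date as dd/mm/yyyy, body of text, publisher)
rows = [
    ("15/03/2009", "Coal prices rise again", "Daily A"),
    ("02/07/2010", "Solar prices fall as solar grows", "Daily B"),
    ("20/11/2011", "Solar and wind prices fall", "Daily A"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_counts_by_year(rows):
    """Return {year: Counter of term occurrences} for the given rows."""
    counts = defaultdict(Counter)
    for date_str, body, _publisher in rows:
        # Parse the dd/mm/yyyy string and keep only the year as the time window.
        year = datetime.strptime(date_str, "%d/%m/%Y").year
        counts[year].update(tokenize(body))
    return counts

yearly = term_counts_by_year(rows)
for year in sorted(yearly):
    print(year, dict(yearly[year]))
```

Swapping `.year` for a (year, month) pair would give monthly windows instead.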
As far as looking for occurrences after a specific date, a simple Filter Examples should suffice to handle that.
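To make the "started occurring after a given date" part concrete, here is a rough Python sketch, again only as an illustration and not a RapidMiner process. It splits the records at the cutoff, collects the vocabulary seen on or before it, and keeps yearly counts only for terms that never appear in that earlier vocabulary. The sample rows, `CUTOFF`, `tokenize` and `new_terms_by_year` are all hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict
from datetime import date, datetime
import re

# Hypothetical sample rows: (date as dd/mm/yyyy, body of text, publisher)
rows = [
    ("15/03/2009", "Coal prices rise again", "Daily A"),
    ("02/07/2010", "Solar prices fall as solar grows", "Daily B"),
    ("20/11/2011", "Solar and wind prices fall", "Daily A"),
]

# Assumed cutoff, taken from the example date in the question.
CUTOFF = date(2010, 1, 1)

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def new_terms_by_year(rows, cutoff):
    """Return {year: Counter} for terms that only occur strictly after cutoff."""
    seen_before = set()
    after = defaultdict(Counter)
    for date_str, body, _publisher in rows:
        day = datetime.strptime(date_str, "%d/%m/%Y").date()
        tokens = tokenize(body)
        if day > cutoff:
            after[day.year].update(tokens)
        else:
            seen_before.update(tokens)
    # Drop every term that already existed on or before the cutoff.
    return {
        year: Counter({t: n for t, n in counts.items() if t not in seen_before})
        for year, counts in after.items()
    }

new_terms = new_terms_by_year(rows, CUTOFF)
for year in sorted(new_terms):
    print(year, dict(new_terms[year]))
```

In RapidMiner terms, the two branches of the `if` correspond to two Filter Examples conditions on the date attribute, one for each side of the cutoff.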
This should get you started.