Suggestions for this project
ElenaVet
New Altair Community Member
Good morning everyone,
I have a dataset composed as follows:
- ts: the date on which the news was published;
- body: the text of the news;
- stock: ticker of the stock to which the news refers (e.g. TWTR: Twitter);
- positive: integer >= 0. A count of the financially "positive" words found in the news;
- negative: integer >= 0. A count of the financially "negative" words found in the news.
In particular I have to carry out:
1) Exploratory data analysis
2) Data analysis techniques, namely:
◼ Association rules
◼ Clustering: perform multiple analysis sessions with one or more algorithms (e.g., KMeans, DBSCAN) and evaluate the various quality indexes (e.g., SSE).
Do you have any suggestions on where to start and how I should proceed?
Thanks so much!!!
Best Answer
-
Hi @ElenaVet,
1/ You can begin by watching some videos on the RapidMiner Academy:
- about clustering:
https://academy.rapidminer.com/catalog?query=clustering
- about association rules:
https://academy.rapidminer.com/catalog?query=association%20rules
2/ Moreover, RapidMiner includes process templates for association rules (AR) and clustering.
3/ More generally, you can find many resources by using the search box at the top right of RapidMiner Studio.
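To get a feel for the SSE index mentioned in the question before building the process, here is a minimal Python sketch run outside RapidMiner. It assumes each news item is reduced to its (positive, negative) counts. The sample points and the helper names `sq_dist`, `kmeans` and `sse` are made up for illustration, and the code is a plain Lloyd-style k-means, not RapidMiner's implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd-style k-means; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct rows as starting centroids
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical (positive, negative) word counts for six news items.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]

# Several "analysis sessions": one run per value of k, each scored by SSE.
for k in (1, 2, 3):
    print(k, round(sse(points, kmeans(points, k)), 3))
```

A lower SSE means tighter clusters, but SSE always tends to drop as k grows, so it is usually read as a curve over k rather than as a single number.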
Hope this helps,
Regards,
Lionel
Answers
-
Thank you, @lionelderkrikor!
Your answer is inspirational! Do you think, however, that pre-processing of the textual data is necessary? If so, how would you suggest starting?
-
Hi @ElenaVet,
If I understand correctly, your "body" attribute is a text attribute, so yes, you have to pre-process it by tokenizing etc. inside a Process Documents subprocess to create a "word vector".
To learn this pre-processing step, you can watch videos on the RapidMiner Academy by searching for "text mining", or you can look for resources directly inside RapidMiner Studio with the top-right search box, as you did for "clustering" and "association rules".
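If it helps to see what a "word vector" is outside RapidMiner, here is a minimal Python sketch of the idea: tokenize each text, drop a few stopwords, and count term occurrences per document. The two sample sentences, the tiny stopword list and the function names `tokenize` and `word_vectors` are assumptions made for illustration; this is not what the RapidMiner operators run internally.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a complete one).
STOPWORDS = {"the", "a", "of", "to", "and", "in", "on", "its"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def word_vectors(docs):
    """Return (vocabulary, one term-count vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

# Made-up stand-ins for the "body" attribute.
docs = [
    "Twitter shares rise on strong earnings",
    "Shares of Twitter fall after weak earnings",
]
vocab, vectors = word_vectors(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes one row with one column per vocabulary term, which is the shape that clustering or association rule mining can then work on.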
Regards,
Lionel
-
@lionelderkrikor
thanks a lot! I also noticed that some items aren't in English (for example German, Italian, Spanish and others). How can I select only the English news?
-
Hi @ElenaVet,
You can use the "Text Vectorization" operator:
- select your text attribute (in your case "body", if I understand correctly)
- enable "add language" in the parameters of this operator
- the operator will generate an attribute called "language" with different values according to the language of your news: english, italian, spanish etc.
- then use a Filter Examples operator to keep only the examples with language = english
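The steps above (add a "language" attribute, then filter on it) can be sketched in Python as follows. This is only a rough illustration: the stopword-overlap heuristic in `guess_language`, the short stopword lists and the sample rows are all assumptions made for the example, and the heuristic is a crude stand-in rather than the detection that the RapidMiner operator performs.

```python
import re

# Very small stopword lists per language (illustrative assumption only).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "for", "on", "with", "after"},
    "italian": {"il", "la", "di", "che", "e", "per", "un", "in", "con", "dopo"},
    "spanish": {"el", "la", "de", "que", "y", "los", "en", "por", "con", "las"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "mit", "von", "den", "nach"},
}

def guess_language(text):
    """Guess the language by counting stopword hits; 'unknown' if there are none."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# Made-up rows standing in for the dataset.
rows = [
    {"stock": "TWTR", "body": "Shares of Twitter rose after the earnings report."},
    {"stock": "TWTR", "body": "Le azioni di Twitter salgono dopo il rapporto sugli utili."},
    {"stock": "TWTR", "body": "Die Aktien von Twitter steigen nach dem Bericht."},
]

# Step 1: add a "language" attribute. Step 2: keep only the English rows.
for row in rows:
    row["language"] = guess_language(row["body"])
english_rows = [row for row in rows if row["language"] == "english"]
print(len(english_rows), "of", len(rows), "rows kept")
```

On real news text, a dedicated language detection tool will be far more reliable than a handful of stopwords.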
Regards,
Lionel
-
@lionelderkrikor
unfortunately, Filter Examples doesn't recognize the language label. Is a Multiply operator necessary? Or should I write a new CSV with the new label and then work on that?
Thanks
-
@ElenaVet,
If the name of the language attribute does not appear, you have to enter it manually.
Regards,
Lionel