Suggestions for this project

ElenaVet
ElenaVet New Altair Community Member
edited November 5 in Community Q&A
Good morning everyone,
I have a dataset composed as follows:
- ts: the date on which the news was published;
- body: the text of the news;
- stock: ticker of the stock to which the news refers (e.g. TWTR: Twitter);
- positive: integer >= 0. A count of the "positive" words, from a financial point of view, found in the news;
- negative: integer >= 0. A count of the "negative" words, from a financial point of view, found in the news.
In particular I have to carry out:
1) Exploratory data analysis
2) Data analysis techniques, namely:
◼ Association rules
◼ Clustering: perform multiple analysis sessions with one or more algorithms (e.g., KMeans, DBSCAN) and evaluate the various quality indexes (e.g., SSE).
Do you have any suggestions on where to start and how I should proceed?
Thanks so much!!!
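
For readers who want to see what "several KMeans runs evaluated with SSE" looks like outside RapidMiner, here is a minimal Python sketch. It is only an illustration under assumptions: the `kmeans` and `sse` helpers are hand-written for this example (not a RapidMiner or library API), and the (positive, negative) word counts are made-up sample values, not data from the project.

```python
import random

def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical (positive, negative) word counts, one pair per news item.
counts = [(9, 1), (8, 0), (10, 2), (1, 8), (0, 9), (2, 10), (4, 5), (5, 4)]

# One "analysis session" per value of k; compare the SSE values afterwards.
sse_by_k = {}
for k in range(1, 5):
    centroids, labels = kmeans(counts, k)
    sse_by_k[k] = sse(counts, centroids, labels)
    print(k, round(sse_by_k[k], 2))
```

SSE alone tends to shrink as k grows, so it is usually read as a curve (looking for an "elbow") rather than minimized outright. DBSCAN has no centroids, so SSE is not a natural index for it and a different quality measure would be needed there.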

Answers

  • ElenaVet
    ElenaVet New Altair Community Member
    Thank you, @lionelderkrikor
    Your answer is inspirational! Do you think, however, that preprocessing the text data is necessary? How would you suggest going about it?
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @ElenaVet,

    If I understand correctly, your "body" attribute is a text attribute, so yes, you have to preprocess it
    (tokenizing, etc.) inside a Process Document subprocess to create a "word vector".
    To learn this preprocessing step, you can watch the videos on the RapidMiner Academy by searching for "text mining", or
    you can search for resources directly inside RapidMiner Studio with the top-right search box, as you did for "clustering" and "association rules".

    Regards,

    Lionel 
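
For readers working outside RapidMiner, the idea of tokenizing text into a "word vector" can be sketched in Python. This is a minimal illustration, not what the Process Document operator does internally: the tiny stopword list, the `tokenize` and `word_vectors` helpers, and the two sample texts are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "for", "as", "is"}

def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def word_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

# Hypothetical values of the "body" attribute.
bodies = [
    "Twitter shares rise on strong earnings",
    "Shares fall as earnings miss forecasts",
]

vocabulary, vectors = word_vectors(bodies)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row has one column per vocabulary term, which is the kind of numeric table that clustering or association-rule steps can then work on. Stemming, n-grams, or TF-IDF weighting are common refinements that this sketch leaves out.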
     
  • ElenaVet
    ElenaVet New Altair Community Member
    @lionelderkrikor
    Thanks a lot! I also noticed that some items aren't in English (for example German, Italian, Spanish and others). How can I select only the English news?
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    Hi @ElenaVet,

    You can use the "Text Vectorization" operator:
     - select your text attribute (in your case "body", if I understand correctly)
     - select "add language" in the parameters of this operator
     - the operator will generate an attribute called "language" whose values depend on the language of each news item: english, italian, spanish, etc.
     - then use a Filter Examples operator to keep only the examples with language = english

    Regards,

    Lionel
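
The same two-step pattern (add a "language" attribute, then filter on it) can be sketched in Python for readers outside RapidMiner. Note the assumptions: the stopword-overlap heuristic in `guess_language` is a crude stand-in chosen for this example, not the method the operator uses, and the word lists and sample rows are made up.

```python
import re

# Small illustrative stopword samples per language (an assumption; real
# language detectors use far richer models than this).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "for", "on", "with"},
    "italian": {"il", "la", "di", "che", "e", "per", "con", "un", "le"},
    "spanish": {"el", "la", "de", "que", "y", "los", "por", "con", "una"},
    "german": {"der", "die", "das", "und", "ist", "mit", "für", "von", "ein"},
}

def guess_language(text):
    """Pick the language whose stopword sample overlaps most with the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# Hypothetical rows with the "stock" and "body" attributes from the dataset.
news = [
    {"stock": "TWTR", "body": "Shares of the company rose on the news."},
    {"stock": "TWTR", "body": "Le azioni di Twitter sono salite per il secondo giorno."},
    {"stock": "TWTR", "body": "Die Aktie ist mit dem Markt gestiegen und der Kurs von heute ist hoch."},
]

# Step 1: generate the "language" attribute.
for row in news:
    row["language"] = guess_language(row["body"])

# Step 2: keep only the examples with language = english.
english_only = [row for row in news if row["language"] == "english"]
print([row["language"] for row in news])
print(len(english_only))
```

Very short or mixed-language texts can easily fool a heuristic like this, so it is worth spot-checking a sample of the rows that get filtered out.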
  • ElenaVet
    ElenaVet New Altair Community Member
    @lionelderkrikor
    Unfortunately, Filter Examples can't recognize the language attribute. Is a Multiply necessary? Or should I write a new CSV with the new attribute and then work on that?
    Thanks
  • lionelderkrikor
    lionelderkrikor New Altair Community Member
    @ElenaVet,

    If the name of the language attribute does not appear, you have to enter it manually.

    Regards,

    Lionel