User: "ElenaVet"
New Altair Community Member
Updated by Jocelyn
Good morning everyone,
I have a dataset composed as follows:
- ts: the date on which the news was published;
- body: the text of the news;
- stock: ticker of the stock to which the news refers (e.g. TWTR: Twitter);
- positive: integer >= 0. A count of the "positive" words, from a financial point of view, found in the news;
- negative: integer >= 0. A count of the "negative" words, from a financial point of view, found in the news.
In particular, I have to carry out:
1) Exploratory data analysis
2) Data analysis techniques, which means:
◼ Association rules
◼ Clustering = Perform multiple analysis sessions with one or more algorithms (e.g., KMeans, DBSCAN) + Evaluate the various quality indexes (e.g., SSE).
Do you have any suggestions on where to start and how I should proceed?
Thanks so much!!!
    User: "lionelderkrikor"
    New Altair Community Member
    Accepted Answer
    Hi @ElenaVet,

    1/ You can begin by watching some videos on the RapidMiner Academy:
     - about clustering:
    https://academy.rapidminer.com/catalog?query=clustering

     - about association rules:
    https://academy.rapidminer.com/catalog?query=association%20rules

    2/ Moreover, RapidMiner has process templates for association rules (AR) and clustering:



    3/ More generally, you can find a lot of resources by using the search box in the top-right corner of RapidMiner Studio:




    Hope this helps,

    Regards,

    Lionel
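
For readers who want to see what "run KMeans several times and compare SSE" means outside RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner implementation; the function names (`sq_dist`, `kmeans`) and the sample (positive, negative) counts are made up for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    # SSE: sum of squared distances of points to their cluster centroid.
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

# Hypothetical (positive, negative) word counts, one pair per news item.
counts = [(0, 5), (1, 4), (0, 6), (7, 1), (8, 0), (6, 1)]

if __name__ == "__main__":
    for k in (1, 2, 3):
        _, _, sse = kmeans(counts, k)
        print(f"k={k}  SSE={sse:.2f}")
```

Plotting SSE against k and looking for the point where the curve flattens (the "elbow") is one common way to pick the number of clusters.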



    User: "ElenaVet"
    New Altair Community Member
    OP
    Thank you, @lionelderkrikor
    Your answer is inspirational! Do you think, however, that preprocessing the textual data is necessary? What do you think is the right way to start?
    User: "lionelderkrikor"
    New Altair Community Member
    Hi @ElenaVet,

     If I understand correctly, your "body" attribute is a text attribute, so yes, you have to preprocess it
    (tokenizing, etc.) inside a Process Document subprocess to create a "word vector".
    For this preprocessing step, you can watch videos on the RapidMiner Academy by searching for "text mining", or
    you can search for resources directly inside RapidMiner Studio with the top-right search box, as you did for "clustering" and "association rules".

    Regards,

    Lionel 
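
As a rough illustration of what "tokenizing to create a word vector" produces, here is a small plain-Python sketch. It is not what RapidMiner runs internally; the stopword list, the function names (`tokenize`, `word_vectors`) and the two sample texts are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def word_vectors(documents):
    """Return (vocabulary, vectors); each vector holds term counts per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical "body" values.
bodies = ["Twitter shares rise", "Shares of Twitter fall and fall"]

if __name__ == "__main__":
    vocab, vecs = word_vectors(bodies)
    print(vocab)
    for vec in vecs:
        print(vec)
```

Each news item becomes one row of counts over a shared vocabulary, which is the kind of numeric table that clustering and association-rule algorithms can work on.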
     
    User: "ElenaVet"
    New Altair Community Member
    OP
    @lionelderkrikor
    Thanks a lot! I also noticed that some items aren't in English (for example German, Italian, Spanish and others). How can I select only the English news?
    User: "lionelderkrikor"
    New Altair Community Member
    Hi @ElenaVet,

    You can use the "Text Vectorization" operator:
     - select your text attribute (in your case "body", if I understand correctly)
     - enable "add language" in the parameters of this operator
     - the operator will generate an attribute called "language" with different values according to the language of each news item: english, italian, spanish, etc.
     - then use a Filter Examples operator to keep only the examples with language = english

    Regards,

    Lionel
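
The filtering step in the list above can be sketched in plain Python as follows. This assumes the "language" attribute has already been generated for each row; the function name `filter_by_language` and the sample rows are hypothetical, and no language detection is performed here.

```python
def filter_by_language(rows, language="english", attribute="language"):
    """Keep only the rows whose language attribute matches `language`."""
    # Fail loudly if the attribute name is wrong or missing on some row.
    for index, row in enumerate(rows):
        if attribute not in row:
            raise KeyError(f"row {index} has no attribute {attribute!r}")
    wanted = language.strip().lower()
    return [row for row in rows
            if str(row[attribute]).strip().lower() == wanted]

# Hypothetical rows after the language attribute has been added.
news = [
    {"stock": "TWTR", "body": "Twitter shares rise", "language": "english"},
    {"stock": "TWTR", "body": "Le azioni Twitter salgono", "language": "italian"},
    {"stock": "TWTR", "body": "Twitter shares fall", "language": "English"},
]

if __name__ == "__main__":
    for row in filter_by_language(news):
        print(row["body"])
```

The explicit check on the attribute name is there because a filter can only work if the name it is given matches the generated column exactly.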
    User: "ElenaVet"
    New Altair Community Member
    OP
    @lionelderkrikor
    Unfortunately, Filter Examples doesn't recognize the language label. Is a Multiply operator necessary? Or do I have to write a new CSV with the new label and then work on that?
    Thanks
    User: "lionelderkrikor"
    New Altair Community Member
    @ElenaVet,

    If the name of the language attribute does not appear, you have to enter it manually : 



    Regards,

    Lionel