"Categorize words by belonging to the dictionaries"

Eriknem
Eriknem New Altair Community Member
edited November 5 in Community Q&A
Is it possible to categorize words in RapidMiner according to which dictionary they belong to?

For example: I have a document and two dictionaries. The first dictionary is a list of hospitals and the second is a list of diagnoses. I want to determine which words in the document are hospitals and which are diagnoses. Is this possible with RapidMiner?

Thank you very much

Answers

  • StaryVena
    StaryVena New Altair Community Member
    Hi,
    you can merge the dictionaries into one file with two columns. The first column holds the words from the dictionaries and the second holds the category (hospital/diagnosis). Then use the "Replace (Dictionary)" operator.

    Cheers
    Vaclav
  • Eriknem
    Eriknem New Altair Community Member
    Hello,

    You probably misunderstood what I need. I have a document, which is a medical record. I also have two lists of words: the first file is a list of diagnoses and the second is a list of hospitals.
    I want to determine which words in my medical record are hospitals and which are diagnoses.

    Thank you very much for your answer.
  • StaryVena
    StaryVena New Altair Community Member
    Hi,
    OK, now I understand. But I don't know what the output should be. Should it be the document with these words replaced by their class, the number of occurrences of each category, or the list of words that occurred in the input document? I ask because I think there is no additional variable for storing information about each word in a document.

    Cheers,
    Vaclav
  • JEdward
    JEdward New Altair Community Member
    Hi, I'm probably also misunderstanding, but could you simply use the Multiply operator and process the document(s) twice?
    The first time using the dictionary of hospitals and the second time using the dictionary of diagnoses?

    Best,
    JEdward.
  • Eriknem
    Eriknem New Altair Community Member
    Thank you for your answer. It would probably be better if I describe my whole classification problem.

    My input is for example:
    "The previous Duma was widely viewed as little more than a rubber stamp for the Kremlin, says the BBC's Steve Rosenberg in Moscow - adding that this may explain why the campaign has failed to excite the Russian public.The election is being seen as a referendum on Mr Putin's personal popularity, three months before the Russian prime minister runs again for president. He served two terms in the post between 2000 and 2008.

    Vaccinations against influenza are usually made available to people in developed countries. Farmed poultry is often vaccinated to avoid decimation of the flocks.[15] The most common human vaccine is the trivalent influenza vaccine (TIV) that contains purified and inactivated antigens against three viral strains. Typically, this vaccine includes material from two influenza A virus subtypes and one influenza B virus strain."

    So we have one file with two paragraphs. The first paragraph is non-medical and the second contains medical information. I want to classify these paragraphs, that is, determine which one is medical and which is non-medical. I think this could be done with the dictionary of medical words that I mentioned.

    My output should be for example:

    "Non-medical information: The previous Duma was widely viewed as little more than a rubber stamp for the Kremlin, says the BBC's Steve Rosenberg in Moscow - adding that this may explain why the campaign has failed to excite the Russian public.The election is being seen as a referendum on Mr Putin's personal popularity, three months before the Russian prime minister runs again for president. He served two terms in the post between 2000 and 2008.

    Medical information: Vaccinations against influenza are usually made available to people in developed countries. Farmed poultry is often vaccinated to avoid decimation of the flocks.[15] The most common human vaccine is the trivalent influenza vaccine (TIV) that contains purified and inactivated antigens against three viral strains. Typically, this vaccine includes material from two influenza A virus subtypes and one influenza B virus strain."

    Thank you very much.
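
For readers who want to see the logic spelled out, here is a minimal sketch in plain Python, outside RapidMiner, of the two ideas discussed in this thread: the merged two-column dictionary (word, category) used to tag words, and a dictionary-hit count used to label paragraphs. Everything in it is an assumption made for illustration: the function names, the sample dictionary entries, the simple letters-only tokenizer, the blank-line paragraph split, and the `min_hits` threshold are all hypothetical and would need tuning on real medical records.

```python
import re


def build_lookup(dictionaries):
    """Merge {category: words} into a single {word: category} lookup.

    This mirrors the "one file with two columns" suggestion: each word
    is paired with the name of the dictionary it came from.
    """
    lookup = {}
    for category, words in dictionaries.items():
        for word in words:
            lookup[word.lower()] = category
    return lookup


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def tag_words(text, lookup):
    """Return (word, category) pairs for every token found in the lookup."""
    return [(tok, lookup[tok]) for tok in tokenize(text) if tok in lookup]


def classify_paragraphs(text, medical_words, min_hits=2):
    """Label each blank-line-separated paragraph by counting dictionary hits.

    min_hits is a hypothetical threshold: a paragraph with at least that
    many medical tokens is labelled as medical information.
    """
    medical = {w.lower() for w in medical_words}
    results = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        hits = sum(1 for tok in tokenize(paragraph) if tok in medical)
        label = "Medical information" if hits >= min_hits else "Non-medical information"
        results.append((label, paragraph))
    return results


if __name__ == "__main__":
    # Made-up sample entries; real lists would be read from the two files.
    lookup = build_lookup({
        "hospital": ["Motol", "Bulovka"],
        "diagnosis": ["influenza", "pneumonia"],
    })
    print(tag_words("Patient admitted to Motol with influenza.", lookup))

    sample = (
        "The election is being seen as a referendum.\n\n"
        "The influenza vaccine contains inactivated antigens."
    )
    for label, paragraph in classify_paragraphs(sample, {"influenza", "vaccine", "antigens"}):
        print(f"{label}: {paragraph}")
```

Note that this sketch only matches single-word entries exactly. Multi-word hospital names, inflected forms, and abbreviations would need extra handling, such as phrase matching or stemming.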