"text mining"

ghina84
ghina84 New Altair Community Member
edited November 5 in Community Q&A

Hello everybody,

Which operator should I use to load a series of text files (.txt or .xml)?

thank you,

laura

Answers

  • emolano
    emolano New Altair Community Member
    The text plugin comes with some examples. They use TextInput, which points to a directory of files.
    You can also use ExampleSource and then StringTextInput... I learned from the examples :)
  • ghina84
    ghina84 New Altair Community Member

    I tried TextInput too, following the example, but the output of the node is not what it should be:

    instead of a table with documents as rows and terms as columns, I get a table with COLUMNS=DOCUMENTS
  • emolano
    emolano New Altair Community Member
    I do not quite understand what you are trying to accomplish. Could you explain your process a bit?
  • ghina84
    ghina84 New Altair Community Member

    Sure :)

    My goal is to analyse a series of articles in .txt format.

    To do this I have to load the .txt files using for example TextInput.

    Looking at this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:

    -ROWS: articles
    -COLUMNS: terms

    (This is stated right after the second image on the page I linked.)
    This matrix, usually called a Document Term Matrix, tells you which words (columns) each document (row) contains, so it is a sparse matrix of binary values, and it is used in the next steps of the analysis.

    BUT...instead of this, I get a table like this:

    -ROWS: sequential id of the article
    -COLUMNS: article (i.e. the full text of each article is the label of an attribute!!!)

    ...and I don't know:

    1) if this is correct...but I don't think so

    2) how to solve the problem

    I hope I have explained myself better. Thank you for the reply and for the help!!

    ciao,

    laura

  • emolano
    emolano New Altair Community Member
    Ciao,
    It should be:
    -ROWS: article id 
    -COLUMNS: terms
    You see -ROWS: id number because you set id_attribute_type to number.
    If you change id_attribute_type to short or long instead, you will get the filename or the filename+path of the article, respectively. The idea here is that you do not get the whole article, just a reference id for it.
    You should get
    -COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput; those operators act as filters to produce a better output)
    e
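
For anyone who wants to check what the table discussed above should look like, here is a minimal sketch in plain Python. It is not the TextInput operator and does not use the text plugin: the function names, the `id_style` parameter, and the crude tokenizer are all hypothetical, chosen only to mirror the number/short/long id options and the binary Document Term Matrix (rows = articles, columns = terms) described in this thread.

```python
import re
from pathlib import Path


def doc_id(path, index, id_style="number"):
    """Row id for one file. The three styles imitate the number/short/long
    options mentioned in the thread; this mapping is an assumption."""
    if id_style == "number":
        return index        # running number, starting at 1
    if id_style == "short":
        return path.name    # file name only
    if id_style == "long":
        return str(path)    # path plus file name
    raise ValueError(f"unknown id_style: {id_style!r}")


def tokenize(text):
    """Lowercase the text and keep runs of a-z as terms (a very crude filter)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(directory, id_style="number"):
    """Return (ids, terms, rows) with one binary row per .txt file in directory."""
    paths = sorted(Path(directory).glob("*.txt"))
    token_sets = [set(tokenize(p.read_text(encoding="utf-8"))) for p in paths]
    # Columns: every distinct term seen in any document, in alphabetical order.
    terms = sorted(set().union(*token_sets))
    ids = [doc_id(p, i, id_style) for i, p in enumerate(paths, start=1)]
    # Cell is 1 if the document contains the term, 0 otherwise.
    rows = [[1 if t in tokens else 0 for t in terms] for tokens in token_sets]
    return ids, terms, rows
```

Calling `document_term_matrix("articles", id_style="short")` on a hypothetical `articles` folder would give file names as row ids, while the full text of each article never appears in the table, only the terms it contains. Better filtering (stop words, stemming) would replace `tokenize`, which plays the role of the operators nested under TextInput.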