"text mining"
ghina84
New Altair Community Member
Hello everybody,
Which operator should I use to load a series of text files (.txt or .xml)?
thank you,
laura
Answers
-
The text plug-in comes with some examples. They use TextInput, which points to a directory of files.
You can also use ExampleSource followed by StringTextInput. I learned this from the examples.
-
I tried TextInput too, following the example, but the output of the operator is not what it should be:
instead of a table with documents as rows and terms as columns, I get a table whose columns are the documents.
-
I do not quite understand what you are trying to accomplish. Could you explain your process a bit?
-
Sure.
My goal is to analyse a series of articles in .txt format.
To do this I have to load the .txt files using, for example, TextInput.
According to this example http://nemoz.org/joomla/content/view/65/53/lang,de/ the output of this operator SHOULD be a table like this:
-ROWS: articles
-COLUMNS: terms
(This is stated right after the second image on the page I linked.)
This matrix, usually called a Document Term Matrix, tells you which words (columns) each document (row) contains, so it is a sparse matrix of binary values, and it is used in the next steps of the analysis.
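To make the expected table concrete, here is a minimal sketch in plain Python (not RapidMiner, and not how TextInput works internally) that builds a binary document-term matrix. The function names `tokenize`, `load_txt_directory` and `build_binary_dtm`, the simple regex tokenizer, and the two sample documents are all assumptions made up for illustration.

```python
import re
from pathlib import Path


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as "terms".
    return re.findall(r"[a-z0-9]+", text.lower())


def load_txt_directory(directory):
    # Hypothetical helper: map each .txt file name to its raw text.
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(Path(directory).glob("*.txt"))
    }


def build_binary_dtm(docs):
    # docs maps a document id to its raw text.
    doc_terms = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
    # Columns: every distinct term seen in any document, in sorted order.
    vocabulary = sorted(set().union(*doc_terms.values()))
    # Rows: one per document, 1 if the term occurs in it, else 0.
    matrix = {
        doc_id: [1 if term in terms else 0 for term in vocabulary]
        for doc_id, terms in doc_terms.items()
    }
    return vocabulary, matrix


# Made-up sample documents standing in for the article files.
docs = {"a.txt": "Text mining finds terms", "b.txt": "Mining data"}
vocabulary, matrix = build_binary_dtm(docs)
# vocabulary is ['data', 'finds', 'mining', 'terms', 'text']
# matrix["a.txt"] is [0, 1, 1, 1, 1]; matrix["b.txt"] is [1, 0, 1, 0, 0]
```

Each row is a document and each column a term, which is the layout the linked example describes. For real files, `build_binary_dtm(load_txt_directory("some/dir"))` would be the equivalent call, with the directory name being whatever holds the articles.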
BUT... instead of this, I get a table like this:
-ROWS: progressive id of the article
-COLUMNS: article (i.e. the whole text of each article becomes the label of an attribute!)
...and I don't know:
1) if this is correct (I don't think so)
2) how to solve the problem
I hope I have explained myself better. Thank you for the reply and for the help!
ciao,
laura
-
Ciao,
It should be:
-ROWS: article id
-COLUMNS: terms
You see -ROWS: id number because you set id_attribute_type to number.
If you change id_attribute_type to short or long instead, you will get the filename or the filename plus path of the article. The idea here is that you do not get the whole article, just a reference id to it.
You should get
-COLUMNS: terms (this output may look like the article's words, but it depends on the operators you add under TextInput. Those operators act as filters to get a better output)
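As a rough illustration of the three id_attribute_type settings mentioned above, here is a small Python sketch. It only mimics the described behaviour; `document_id` is a hypothetical helper, not part of RapidMiner, and the sample path and the choice to start numbering at 1 are assumptions.

```python
from pathlib import PurePosixPath


def document_id(path, index, id_type):
    # Hypothetical helper mirroring the three settings named in the thread.
    if id_type == "number":
        return index  # progressive id of the article
    if id_type == "short":
        return PurePosixPath(path).name  # filename only
    if id_type == "long":
        return str(PurePosixPath(path))  # filename plus path
    raise ValueError(f"unknown id type: {id_type!r}")


# Made-up path; numbering from 1 is an assumption.
sample = "corpus/article1.txt"
ids = {kind: document_id(sample, 1, kind) for kind in ("number", "short", "long")}
# ids is {'number': 1, 'short': 'article1.txt', 'long': 'corpus/article1.txt'}
```

In all three cases the row label is only a reference to the article, never its full text.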
e