GUI Field Question
jaskiemr
Hi, I'm doing some simple text analysis and to get started, I'm reading in a number of HTML pages,
Answers
-
Oops, hit enter too soon.
Anyhow, I'm trying to do some text analysis: I'm reading in HTML pages, lowercasing everything, tokenizing everything, and then filtering out English stop words.
My question is: in the exampleset textinput view of the statistics, what does the statistics column represent? Is it the percentage of all words that a given word accounts for, or the percentage of documents that the word appears in?
Also, what is the range column?
I didn't see the answer in the GUI tutorial.
Any help is appreciated. Thanks,
mj
-
Hi,
It's quite simple: these columns are independent of the actual source of the data. They simply show some general statistics, such as the mean and standard deviation, for all numerical attributes. If you have loaded your text in TF-IDF representation, they show the mean and standard deviation of the TF-IDF values. The same goes for the range, whose name is quite self-explanatory, I think...
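To make this concrete, here is a minimal sketch in plain Python, not the tool's own code. The word names and weights are made up, `column_stats` is a hypothetical helper, and using the sample standard deviation is an assumption about which estimator the statistics view reports.

```python
from statistics import mean, stdev

# Made-up per-document weights for two word attributes.
# Each list is one column of the example set: one value per document.
columns = {
    "hello": [0.0, 0.003, 0.003],
    "world": [0.001, 0.0, 0.002],
}

def column_stats(values):
    """Return the mean, sample standard deviation, and (min, max) range of one column."""
    return {
        "mean": mean(values),
        "std": stdev(values),  # assumption: sample (n - 1) estimator
        "range": (min(values), max(values)),
    }

stats = {name: column_stats(values) for name, values in columns.items()}
for name, s in stats.items():
    print(name, s)
```

The range is just the smallest and largest value the attribute takes across all documents, whatever those values happen to represent.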
Greetings,
Sebastian
-
Sebastian, thank you for your reply.
I understand range mathematically; however, what does it mean in the text mining domain? If the word "Hello" has a range from 0 to 0.003 and a mean of 0.002 (I'm making this up), the discrete nature of the word doesn't fit the definition of range I have in my head.
Forgive my empty head.
Again, thanks.
mj
-
I got home so I can put in a concrete example.
I see that "html" has a value type of "real", average of 0.088 +/- 0.073, range of 0.003 to 0.0530.
mj
-
Okay, I think I figured part of it out. I've got 2 documents, one with "hello world" and another with "hello". Vector_creation is "term occurrences". The mean for "hello" comes out to 1, since it's in both documents, with a std dev of 0. The mean for "world" is 0.5 because it's in half of the documents. Can't figure out std dev yet.
-
Hi,
Why not? It's just the standard deviation of the values of this attribute, regardless of whether they are numbers of occurrences, a TF-IDF representation, or simply temperatures. Where's the problem in calculating a standard deviation from two values?
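For the two-document example, the calculation can be sketched in plain Python as follows. This is an illustration under assumptions, not the tool's implementation: tokenization is a simple whitespace split, and both the sample and the population standard deviation are shown because it is not stated here which one the statistics view uses.

```python
from collections import Counter
from statistics import mean, pstdev, stdev

documents = ["hello world", "hello"]

# Term occurrences: count each token per document.
counts = [Counter(doc.split()) for doc in documents]
vocabulary = sorted(set().union(*counts))

# One column per word, one value per document (0 if the word is absent).
columns = {word: [c[word] for c in counts] for word in vocabulary}

summary = {
    word: {
        "mean": mean(values),
        "sample_std": stdev(values),       # divides by n - 1
        "population_std": pstdev(values),  # divides by n
    }
    for word, values in columns.items()
}

for word, s in summary.items():
    print(word, columns[word], s)
```

With values 1 and 1, "hello" has a mean of 1 and no spread. With values 1 and 0, "world" has a mean of 0.5, a population standard deviation of 0.5, and a sample standard deviation of about 0.707.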
Greetings,
Sebastian