"vectorize text features but do regression on a numerical column?"
Legacy User
New Altair Community Member
I've looked through the forums, documentation, and examples, but can't figure out how to learn a numerical function on text features. That is, the data has text columns and a numeric column. I want to do feature extraction/vectorization on the text columns in order to predict the numerical column.
There is an example in the Text mining samples that shows how to create an example set that contains both text and other columns. Great.
The problem is how to set a regression learner to use the numeric column as the target?
I'm pulling data from a database, so I use StringTextInput. If I set the numerical column as the 'label' (using operator attributes) so that it's the target for the learner, the text processor complains:
[Fatal] Process failed: The label attribute (#3: views (integer/single_value)) must be nominal for wvtool.
If I don't set the 'label' attribute as the target column, then the learner complains:
[Fatal] UserError occured in 1st application of JMySVMLearner (JMySVMLearner)
[Fatal] Process failed: Input example set does not have a label attribute
So,
1) how do I get the text processor to let the numerical column pass without modification as the target?
2) how do I specify which column is used as the target for regression (not classification) learning?
Thanks,
Gary
Answers
Hi,
I don't have the text plugin at hand right now, so these are only a few guesses.
Maybe it is possible to just use the numerical attribute as an arbitrary special attribute instead of the label - try to change the role of the attribute with the operator ChangeAttributeRole to something like "my_special_attribute" and check if it is kept during text processing. If yes, simply change it afterwards to a label (again with the ChangeAttributeRole operator).
If this does not work, you can attach an ID to your data set (with operator "IDTagging"), multiply the data set (with "IOMultiplier"), perform the text processing on the first data set (the ID should be kept during this) and remove the string attribute from the second data set. Afterward, you can join both data sets ("ExampleSetJoin") and change the numerical column to the label ("ChangeAttributeRole"). That's all.
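The ID-and-join idea can be sketched outside RapidMiner in plain Python. This is only an illustration of the data flow, not RapidMiner's API: the rows, the column names `title` and `views`, and the helper `vectorize` are hypothetical, and simple whitespace term counts stand in for whatever the text plugin actually computes.

```python
from collections import Counter

# Hypothetical rows standing in for the database result:
# one text column ("title") and one numeric target ("views").
rows = [
    {"title": "cheap flights to rome", "views": 120},
    {"title": "rome travel guide", "views": 340},
    {"title": "cheap hotel deals", "views": 95},
]

# Step 1: attach an ID to every example (the job IDTagging does in the thread).
tagged = [{"id": i, **row} for i, row in enumerate(rows)]

# Step 2: make two copies; one keeps the text, the other the numeric column.
text_part = [{"id": r["id"], "title": r["title"]} for r in tagged]
numeric_part = [{"id": r["id"], "views": r["views"]} for r in tagged]

# Step 3: vectorize the text copy into term counts; the ID is carried along.
vocabulary = sorted({tok for r in text_part for tok in r["title"].split()})

def vectorize(text):
    """Return one count per vocabulary term (assumed whitespace tokenization)."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]

vectors = {r["id"]: vectorize(r["title"]) for r in text_part}

# Step 4: join both copies on the ID and treat the numeric column as the label.
targets = {r["id"]: r["views"] for r in numeric_part}
dataset = [(vectors[i], targets[i]) for i in sorted(vectors)]

for features, label in dataset:
    print(features, label)
```

Because the numeric column never passes through the text step, its type and values are untouched when it is joined back in and used as the regression target.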
Cheers,
Ingo
Yes, it seems to work. Here's what I did:
1. If necessary, erase the value for 'label_attribute' in the DatabaseExampleSource operator. This means my numeric column is read in as a regular integer or real-valued column. The text operators ignore it because they work on 'nominal' columns.
2. Following the StringTextInput operator and its children, add a ChangeAttributeRole operator. Set 'name' to my desired target column name. Set 'target_role' to 'label'.
It's strange that this workaround is necessary, but it seems to create the desired dataset.
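For readers who think in code, the two steps can be modelled roughly as follows. This is a plain-Python sketch, not RapidMiner's API: the dictionary layout, the column names `title` and `views`, and the functions `vectorize_text` and `change_role` are all hypothetical stand-ins for the text operator and ChangeAttributeRole.

```python
from collections import Counter

# Hypothetical example set: column name -> (value type, role, values).
# "regular" columns are inputs; any other role marks a special attribute.
columns = {
    "title": ("string", "regular", ["rome travel guide", "cheap hotel deals"]),
    "views": ("integer", "regular", [340, 95]),
}

def vectorize_text(columns):
    """Replace each regular string column with term-count columns; keep the rest."""
    result = {}
    for name, (value_type, role, values) in columns.items():
        if role == "regular" and value_type == "string":
            counts = [Counter(v.split()) for v in values]
            for term in sorted(set().union(*counts)):
                result[term] = ("integer", "regular", [c[term] for c in counts])
        else:
            # Numeric or special columns pass through unchanged.
            result[name] = (value_type, role, values)
    return result

def change_role(columns, name, target_role):
    """Return a copy in which one column has a new role."""
    updated = dict(columns)
    value_type, _, values = updated[name]
    updated[name] = (value_type, target_role, values)
    return updated

vectorized = vectorize_text(columns)  # "views" is skipped because it is not a string
labelled = change_role(vectorized, "views", "label")

features = [n for n, (_, role, _) in labelled.items() if role == "regular"]
label = [n for n, (_, role, _) in labelled.items() if role == "label"]
print(features)
print(label)
```

The point of the sketch is the ordering: the target is only marked as the label after the text step has run, so the text step never sees a numeric label.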
Thanks,
Gary
Hi,
What is strange about this? The main thing is that the text input operators work a) only on regular attributes (they should ignore special ones such as id or weight, with the exception of the label) and b) only on nominal/string attributes. So changing the role before the operator and changing it back afterwards seems pretty straightforward?
Cheers,
Ingo