Correct ARFF Format?
Legacy User
New Altair Community Member
Hi all,
I'm running a Naive Bayes classifier on a set of keyword/keyphrases and then using the produced model to predict the label attribute for an unclassified set of keywords/keyphrases. However, I'm running into some strange problems where the result of my applied model shows a ? if I have a space between keywords. I'm thinking that I may be formatting my ARFFs incorrectly?
Here is my training set:
@RELATION c_training
@ATTRIBUTE keywords STRING
@ATTRIBUTE change {up,down,neutral}
@DATA
'delay acquisition',down
'facing the same conundrum',down
'restructuring',down
'delay acquisition',up
'divestiture',down
'profit dissipated',down
'delay acquisition',up
'profits up', up
'profits down', down
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
And here is my test set:
@RELATION c_test
@ATTRIBUTE keywords STRING
@DATA
'profit dissipated'
Any help would be appreciated.
Thank you.
Answers
Hi,
as far as I remember, Arff uses double quotes (") instead of single quotes ('). Could that be the reason?
Cheers,
Ingo
Nope, I tried converting the single quotes to double quotes in both the training and test data. The problem occurs both in the GUI and when using the jar as a library.
The end result of the above training and test data (with double quotes) is
? down 0.24710424710424708 0.752895752895753 0.0
instead of the expected
profit dissipated down 0.24710424710424708 0.752895752895753 0.0
But the problem is strange. If we change "profit dissipated" in the test data to "profit  dissipated" (with 2 spaces), it works fine.
Also, this warning comes up. I'm not sure what it means, but perhaps it is related.
G Apr 6, 2009 11:37:31 AM: [Warning] Distribution: The number of nominal values is not the same for training and application for attribute 'keywords', training: 5, application: 1
You are right. There is no difference between single and double quotes; both are supported. But you missed one important thing: your meta data description is not right. What you actually have is not a "string" attribute but a nominal (categorical) one, for which you have to define all occurring values. If you do this correctly for both the training and the test data, at least the Naive Bayes error should be gone. Since I do not get any other issue during data loading, I assume that the output problem will be fixed by that as well.
So a correct Arff for training would look like
@RELATION c_training
@ATTRIBUTE keywords {'delay acquisition','facing the same conundrum','restructuring','divestiture','profit dissipated','profit dissipated','profits down'}
@ATTRIBUTE change {up,down,neutral}
@DATA
'delay acquisition',down
'facing the same conundrum',down
'restructuring',down
'delay acquisition',up
'divestiture',down
'profit dissipated',down
'delay acquisition',up
'profit dissipated', up
'profits down', down
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
'delay acquisition',up
and for testing accordingly
@RELATION c_test
@ATTRIBUTE keywords {'delay acquisition','facing the same conundrum','restructuring','divestiture','profit dissipated','profit dissipated','profits down'}
@DATA
'profit dissipated'
Please check the meta data view to verify that everything is defined correctly. Instead of using Arff, you could also use the Attribute Editor of RM if you do not want to type in the different values yourself. Alternatively, you could load the data from Arff using a string attribute and write it out with the ExampleSetWriter (both the meta data file .aml and the data file .dat). Then you could use the same basic .aml file for your test data.
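If you would rather generate the value list than type it, a small script outside of RM can write both files with one shared nominal declaration. The following is only a rough sketch in Python, not anything RM provides: the helper names `quote`, `nominal_values` and `arff` are made up for illustration, and the rows are a shortened sample of the data from this thread.

```python
import re


def quote(value):
    """Quote a value with single quotes unless it is a plain token."""
    if re.fullmatch(r"[A-Za-z0-9_.-]+", value):
        return value
    # Escape backslashes and single quotes inside the quoted value.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def nominal_values(rows, column):
    """Collect the distinct values of one column in first-seen order."""
    seen = []
    for row in rows:
        if row[column] not in seen:
            seen.append(row[column])
    return seen


def arff(relation, attributes, rows):
    """Build ARFF text; attributes is a list of (name, allowed_values)."""
    lines = ["@RELATION " + relation]
    for name, values in attributes:
        declared = ",".join(quote(v) for v in values)
        lines.append("@ATTRIBUTE %s {%s}" % (name, declared))
    lines.append("@DATA")
    for row in rows:
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Shortened sample rows (assumption: a subset of the thread's data).
train = [
    ("delay acquisition", "down"),
    ("profit dissipated", "down"),
    ("profits up", "up"),
]
test = [("profit dissipated",)]

# One value list, built from both sets, is reused in both headers.
keywords = nominal_values(train + test, 0)

train_arff = arff(
    "c_training",
    [("keywords", keywords), ("change", ["up", "down", "neutral"])],
    train,
)
test_arff = arff("c_test", [("keywords", keywords)], test)

print(train_arff)
print(test_arff)
```

Because `keywords` is computed once and passed to both calls, the `@ATTRIBUTE keywords {...}` line is identical in the training and the test file, which is the point of the fix described above.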
Cheers,
Ingo
Thanks, Ingo. That was the problem. All fixed now!