Problem with Naive Bayesian
Hello,
I'm from Germany and study Financial Management. Right now I have to give a presentation about Naive Bayes in RapidMiner. My problem is that I don't understand how the results "prediction(no) / prediction(yes)" are computed.
Here is my XML Process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="246" y="34">
<parameter key="laplace_correction" value="true"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="187">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="340">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="380" y="85">
<parameter key="use_example_weights" value="true"/>
</operator>
</process>
I've used the Golf data set. Take the first row, for example: Outlook: sunny, Temperature: 85, Humidity: 85 and Wind: false.
For Temperature and Humidity I've used the probability density function, which gives me no = 0,0003074677... and yes = 0,000059000924.
What should I do with those results to get the predictions No = 0,711 and Yes = 0,289?
Thank you in advance!
Answers
-
-
Thank you for this video. I have already watched it, but it only explains the basics, which is not my problem. I'm asking about the next steps: how do I combine the continuous numeric values with those from Outlook and Wind to get the predictions (yes and no)? In other words: what is the equation that produces those predictions, for example for the first row?
0 -
Hi,
For numeric variables we use a Gaussian assumption: the probability is given by the usual Gaussian pdf with the mean and variance calculated per class. For nominal variables we get the probabilities from simple counting.
Best
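To make the two cases concrete, here is a minimal Python sketch of both estimates. The function names and the four-row toy column are hypothetical and chosen only for illustration; they are not taken from RapidMiner or from the Golf data set.

```python
import math
from collections import Counter


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def nominal_likelihood(values, labels, value, label):
    """Estimate P(attribute == value | class == label) by simple counting."""
    in_class = [v for v, y in zip(values, labels) if y == label]
    return Counter(in_class)[value] / len(in_class)


# Hypothetical toy column, NOT the Golf data set.
outlook = ["sunny", "rain", "sunny", "overcast"]
play = ["no", "yes", "yes", "yes"]

# One of the three "yes" rows is sunny.
print(nominal_likelihood(outlook, play, "sunny", "yes"))

# Density of the standard normal distribution at its mean.
print(gaussian_pdf(0.0, 0.0, 1.0))
```

For a real attribute, `mean` and `std` would be the per-class values shown in the model's Distribution Table.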
1 -
Thank you very much.
Let's stick to the first row with the prediction for no (71,1%).
I've used the probability density function to get the following results for Temperature (85) and Humidity (85):
      Temperature      Humidity
yes   0,00097307096    0,00319056274
no    0,0464961233     0,0412564316
Next, I computed the following results for Outlook = sunny and Wind = false:
      Sunny   False
yes   3/9     4/9
no    2/5     2/5
To get the predictions no (71,1%) and yes (28,9%), I thought it would work like this:
Multiply all the results for yes by the prior 9/14 and all the results for no by the prior 5/14. Then add those two products to get the basis (evidence). Finally, divide the product for yes by the evidence and the product for no by the evidence to get the predictions.
What am I doing wrong?
Thank you in advance!
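The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `naive_bayes_posteriors` is hypothetical, and the inputs are copied as-is from this post, so the output inherits any mistake in them.

```python
def naive_bayes_posteriors(priors, likelihoods):
    """Multiply prior and per-attribute likelihoods per class, then normalize."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for factor in likelihoods[label]:
            score *= factor
        scores[label] = score
    # The evidence is the sum of the unnormalized class scores.
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


# Inputs copied from the post above (not verified here).
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    # Order: Outlook = sunny, Wind = false, Temperature pdf, Humidity pdf
    "yes": [3 / 9, 4 / 9, 0.00097307096, 0.00319056274],
    "no": [2 / 5, 2 / 5, 0.0464961233, 0.0412564316],
}

posteriors = naive_bayes_posteriors(priors, likelihoods)
print(posteriors)
```

If the printed confidences do not match RapidMiner, the combination step itself is unlikely to be the cause; the individual factors (counts and density values) are the first thing to recheck.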
0 -
Hi,
This topic interests me a lot.
Indeed, in my opinion, it is essential to understand the theory behind the algorithms.
I hope you can give me a few minutes of attention:
1. Here are the confidences for the Golf test set (after training on the Golf data set) given by RapidMiner
without Laplace correction:
2. I tried to reproduce these results manually, but I get these illogical results for the first row of the Golf data set:
You can find the whole Excel calculation file by following this link:
https://drive.google.com/open?id=18T153eElmtsjOzihGwLENVh8cwHdaHMT
3. I also used Python, and the results differ from RapidMiner's:
You can find the process here:
<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.1.000" expanded="true" height="68" name="Training Golf" width="90" x="45" y="85">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Naive_Bayes_Probabilities\NB_Proba_1.xlsx"/>
<parameter key="imported_cell_range" value="A1:E15"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.1.000" expanded="true" height="103" name="Multiply" width="90" x="179" y="85"/>
<operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<parameter key="attribute_name" value="Play"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="naive_bayes" compatibility="8.1.000" expanded="true" height="82" name="Naive Bayes" width="90" x="447" y="85">
<parameter key="laplace_correction" value="false"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.1.000" expanded="true" height="68" name="Test Golf" width="90" x="45" y="238">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Naive_Bayes_Probabilities\NB_Proba_1.xlsx"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:D15"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Outlook.true.polynominal.attribute"/>
<parameter key="1" value="Temperature.true.real.attribute"/>
<parameter key="2" value="Humidity.true.real.attribute"/>
<parameter key="3" value="Wind.true.polynominal.attribute"/>
</list>
</operator>
<operator activated="true" class="multiply" compatibility="8.1.000" expanded="true" height="103" name="Multiply (2)" width="90" x="179" y="238"/>
<operator activated="true" class="apply_model" compatibility="8.1.000" expanded="true" height="82" name="Apply Model" width="90" x="581" y="136">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="LabelEncoder" width="90" x="313" y="136">
<parameter key="script" value="import pandas&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    le = LabelEncoder()&#10;    data.iloc[:,0] = le.fit_transform(data.iloc[:,0])&#10;    data.iloc[:,3] = le.fit_transform(data.iloc[:,3])&#10;    data.iloc[:,4] = le.fit_transform(data.iloc[:,4])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Naives Bayes Python" width="90" x="447" y="187">
<parameter key="script" value="from sklearn.naive_bayes import GaussianNB&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    X = data.iloc[:,0:4]&#10;    y = data.iloc[:,4]&#10;    clf = GaussianNB()&#10;    clf.fit(X,y)&#10;    # connect 2 output ports to see the results&#10;    return clf"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="LabelEncoder (2)" width="90" x="313" y="289">
<parameter key="script" value="import pandas&#10;from sklearn.preprocessing import LabelEncoder&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    le = LabelEncoder()&#10;    data.iloc[:,0] = le.fit_transform(data.iloc[:,0])&#10;    data.iloc[:,3] = le.fit_transform(data.iloc[:,3])&#10;    # connect 2 output ports to see the results&#10;    return data"/>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Apply Model Python" width="90" x="581" y="238">
<parameter key="script" value="# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(model, data):&#10;    base = data[['Outlook', 'Temperature', 'Humidity', 'Wind']]&#10;    data['prediction (Play)'] = model.predict(base)&#10;    data['confidence(no)'] = model.predict_proba(base)[:,0]&#10;    data['confidence(yes)'] = model.predict_proba(base)[:,1]&#10;&#10;    # set role of prediction attribute to prediction&#10;    data.rm_metadata['prediction (Play)'] = (None,'prediction(Play)')&#10;    data.rm_metadata['confidence(no)'] = (None,'confidence(no)')&#10;    data.rm_metadata['confidence(yes)'] = (None,'confidence(yes)')&#10;    return data&#10;"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="8.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
<list key="function_descriptions">
<parameter key="prediction (Play)" value="if([prediction (Play)]==0,&quot;no&quot;,&quot;yes&quot;)"/>
</list>
</operator>
<connect from_op="Training Golf" from_port="output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Set Role" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="LabelEncoder" to_port="input 1"/>
<connect from_op="Set Role" from_port="example set output" to_op="Naive Bayes" to_port="training set"/>
<connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_op="Test Golf" from_port="output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="LabelEncoder (2)" to_port="input 1"/>
<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
<connect from_op="Apply Model" from_port="model" to_port="result 2"/>
<connect from_op="LabelEncoder" from_port="output 1" to_op="Naives Bayes Python" to_port="input 1"/>
<connect from_op="Naives Bayes Python" from_port="output 1" to_op="Apply Model Python" to_port="input 1"/>
<connect from_op="LabelEncoder (2)" from_port="output 1" to_op="Apply Model Python" to_port="input 2"/>
<connect from_op="Apply Model Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
You can find the training and test Golf data sets by following this link (Excel file):
https://drive.google.com/open?id=18Dht5-aTuJVehZvbU3LZLAvzQTBixCLB
4. I think I have understood how the confidences are calculated, and my calculations are almost there.
Can you help me find my error, if there is one?
Why do the Python results differ from RapidMiner's?
Is there any postprocessing of the probabilities in RapidMiner?
Thank you for your help.
Best regards,
Lionel
1 -
dang, @IngoRM - @lionelderkrikor also uses Excel to check calculations!! I thought I was the only luddite lingering around. Now if I could only get my hands on my old HP 42S RPN calculator.... :smileylol:
(sorry @lionelderkrikor - I was just showing Ingo some calcs on Excel today and could not resist. Believe me, I am sometimes very proud of my luddite skills...)
Scott
1 -
Hi,
thanks for sharing your point of view Lionel! I really appreciate it.
In my opinion, you did not use the "correct" equation for the probability density function. I think the following video has it right at 02:20 min:
https://www.youtube.com/watch?v=k2diLn5Nqbs&t=125s&list=PL7r4RQYRQRfgw3-ccVUzdlYh5HK-tQHFs&index=3
Using that equation and proceeding as I described in my last post, I get prediction no = 78,756%, which still isn't 71,1%.
Could someone please help me find the solution?
Thank you.
0 -
Hi,
I think @lionelderkrikor forgot the priors. So you need to multiply by 5/14 (no) and 9/14 (yes), respectively.
Best,
Martin
0 -
Hi,
Thank you for your feedback @domi_wiese. I admit that you are much closer to the expected results.
A few points:
1. As far as I can tell, the equation I use and the equation in your video are equivalent:
2. In my intermediate results, I get exactly the same values as RapidMiner shows in the Distribution Table (mean / standard deviation of Temperature and Humidity, counts of the nominal attributes) without Laplace correction:
That's why I don't understand why I obtain these illogical results.
3. @mschmitz, as far as I can tell, I have not forgotten the priors in the calculations; admittedly, they were not explicit and detailed in the Excel calculation file.
Although the results do not change, here is the link to the second version of my Excel file:
https://drive.google.com/open?id=12mELZ_SW8fv-VfeRkY-mUjqEUb42ODx6
4. @domi_wiese, maybe you can share your calculation file and/or your intermediate results, P(Xi | Y = yes/no) and P(Y = yes/no), so that we can find the solution to this mysterious Naive Bayes problem.
5. Do not give up: I'm sure we will find the solution, and if we cannot do it with Excel, @sgenzer will lend us his HP 42S RPN calculator... or I will dig out my old TI 86 calculator from college:
I hope this has moved the discussion forward a little.
Best regards,
Lionel
1 -
Ok boys, I'm dropping my beast on the table too...
And on the 7th day, God created the HP 48GX
2 -
Hi,
I'm really sorry. I made a mistake while using the probability density function.
But I've corrected it. I have now computed the intermediate result for no as shown in the first picture below, and I do the same for yes. After that, I get the predictions, which give 71,7% for no. This is still around 0,5% too high, but I think it could be correct. What do you think?
1 -
Hi again @domi_wiese,
I think you have reached your goal exactly:
Don't forget that you performed the calculation without Laplace correction.
Without Laplace correction, the results of RapidMiner are:
To come back to the calculations:
5/14 => OK
2/5 => OK
3/5 => OK
0,04125 (Humidity, can you confirm?) => OK
0,02121 (Temperature, can you confirm?) => not OK (I get 0,1204, so I have made an error somewhere; can you give the details of your calculation for this case?
I don't know where the error in my calculation is.)
Moral: is the solution in the Windows calculator?
Best regards,
Lionel
1 -
Hi @lionelderkrikor,
thank you for drawing my attention to the Laplace correction. I'll look into that tomorrow.
Of course I will send you my calculation.
1 -
Hi again @domi_wiese,
Thanks to you, I found my error: a problem with brackets and exponents in Excel.
Out of curiosity: which calculator software do you use?
And good luck with your presentation.
Best regards,
Lionel
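For readers who hit the same discrepancy, here is a hedged sketch of one way a bracket/exponent slip can produce it. In Excel, unary minus is applied before `^`, so an exponent typed as `-(x-mean)^2` is evaluated as `(-(x-mean))^2` and comes out positive. Whether this was the exact slip in the spreadsheet above is an assumption. The mean and standard deviation below are also assumed values for Temperature in class "no" and are not quoted anywhere in this thread; substitute the values from your own Distribution Table.

```python
import math


def gaussian_pdf(x, mean, std):
    """Gaussian density with the exponent bracketed correctly."""
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2.0 * math.pi))


def gaussian_pdf_sign_slip(x, mean, std):
    """Same formula, but the minus is applied BEFORE squaring,
    so the exponent becomes positive (mimics Excel's -a^2 == (-a)^2)."""
    exponent = ((-(x - mean)) ** 2) / (2.0 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2.0 * math.pi))


# Assumed parameters for Temperature given Play = no (not from the thread).
MEAN_NO, STD_NO = 74.6, 7.893

correct = gaussian_pdf(85, MEAN_NO, STD_NO)
slipped = gaussian_pdf_sign_slip(85, MEAN_NO, STD_NO)
print(round(correct, 5), round(slipped, 4))
```

Under these assumed parameters the two values land close to the 0,02121 and 0,1204 discussed above. A quick sanity check is that a Gaussian density can never exceed its peak value 1 / (std * sqrt(2 * pi)), so any larger result points to a sign or bracket problem.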
1 -
Hi @lionelderkrikor,
I'm glad we found our mistakes and solved them and thank you for wishing me luck.
To be honest: First I used my own calculator, but then I used a calculator on the internet. I can show you the link of course.
Have a nice day!
1 -
Hi @lionelderkrikor,
just one thing: could you please send me a picture of your design view with the process? And where is the Laplace correction option? I know what it is, but I can't find where to set it.
Thank you in advance!
1 -
1