Bayes with multiple hypotheses?

sploo (New member, joined Sep 10, 2014, 1 message)
I'm trying to work out a Bayes algorithm that's a little bit like an email spam filter, but I've hit the limit of my knowledge in this field.


I've read "building a simple spam filter with naive bayes classifier and c#" on the netmatze blog, and I'm comfortable with the cookie example. I'm also happy that I understand the spam/not spam logic that's presented. But what if you have more than 2 hypotheses?


Imagine I have a number of emails from 3 different people (Bob, Sue, John) and a candidate email, and I want to determine which of them was the most likely author of that email. Using the nomenclature of the website linked above, we have:


H1 = Bob
H2 = Sue
H3 = John


P(h1) = P(h2) = P(h3) = 1/3


I have 5 emails from Bob, 3 from Sue, and 8 from John. In my candidate email, I take the first word and find it exists in 2 of the emails from Bob, 1 of the emails from Sue, and none of the emails from John. So:


P(E/h1) = 2 / 5 (two mails out of five)
P(E/h2) = 1 / 3 (one mail out of three)
P(E/h3) = 0 / 8 (no mails out of eight)


Assuming the above link is correct in dropping P(E), this...


P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h2/E) = P(h2) * P(E/h2) / P(E)
P(h3/E) = P(h3) * P(E/h3) / P(E)


...becomes:


P(h1/E) = 1/3 * 2/5 = 0.1333
P(h2/E) = 1/3 * 1/3 = 0.1111
P(h3/E) = 1/3 * 0/8 = 0.0000
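The arithmetic above can be reproduced with a minimal sketch like the one below. The variable names are made up for illustration, and the normalisation step at the end is an addition that is not in the linked article: it relies on P(E) being the sum of P(h) * P(E/h) over all hypotheses, so dividing by that total turns the scores into probabilities that add up to 1.

```python
from fractions import Fraction

# Counts from the post: emails per author, and how many contain the first word.
emails_per_author = {"Bob": 5, "Sue": 3, "John": 8}
emails_with_word = {"Bob": 2, "Sue": 1, "John": 0}

# Equal priors, as in the post.
prior = Fraction(1, len(emails_per_author))

# Unnormalised scores P(h) * P(E/h); P(E) is dropped because it is the
# same for every hypothesis and does not change the ranking.
scores = {
    author: prior * Fraction(emails_with_word[author], emails_per_author[author])
    for author in emails_per_author
}

# Assumption (not in the post): restore P(E) as the sum of the scores.
total = sum(scores.values())
posteriors = {author: score / total for author, score in scores.items()}

for author in scores:
    print(f"{author}: score={float(scores[author]):.4f} "
          f"posterior={float(posteriors[author]):.4f}")
```

The ranking is the same with or without the division by P(E); only the scale changes.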


From that, I'd determine that the first word in the candidate email was more likely to belong to an email from Bob (H1). However, how would I correctly sum the results for the remaining words in the candidate email?


In the example on the above link, we only have two hypotheses (spam or not spam) and a Q value is being calculated for the two respective outcomes:


Q = P(h1/E) / P(h2/E)


The Q values for each word (<1 for H2, >1 for H1) are summed, and then divided by the number of words to produce a final result that's greater or less than one.
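A rough sketch of that two-hypothesis Q calculation, as described here rather than as the linked article's actual code, might look like this. The function name `mean_q` is hypothetical, and treating Bob as h1 and Sue as h2 is an assumption made only to reuse the numbers from the first word.

```python
from fractions import Fraction


def mean_q(score_pairs):
    """Average Q = P(h1/E) / P(h2/E) over words: >1 favours h1, <1 favours h2."""
    # Note: a word never seen under h2 gives P(h2/E) = 0, and the division
    # below then raises ZeroDivisionError.
    q_values = [p_h1 / p_h2 for p_h1, p_h2 in score_pairs]
    return sum(q_values) / len(q_values)


# Only the first word from the post, with Bob as h1 and Sue as h2 (assumed pairing).
pairs = [(Fraction(1, 3) * Fraction(2, 5), Fraction(1, 3) * Fraction(1, 3))]
print(float(mean_q(pairs)))  # 1.2
```

Because Q is a ratio of exactly two scores, it is not obvious how this form carries over to three or more hypotheses.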


I assume that this calculation of a Q value is important, as it's "weighting" one value against another before they're summed. Summing all the P(h1/E) values and dividing them by the sum of the P(h2/E) values would produce a different result. So, how do I do the right thing for >2 hypotheses when I'm trying to "sum" multiple results?


For example, imagine I test the next word in the candidate email, and find it exists in 1 mail from Bob, 0 mails from Sue and 3 mails from John. This would give:


P(h1/E) = 1/3 * 1/5 = 0.0667
P(h2/E) = 1/3 * 0/3 = 0.0000
P(h3/E) = 1/3 * 3/8 = 0.1250


I assume it isn't correct to simply sum the P(hx/E) values as follows:


0.1333 + 0.0667
0.1111 + 0.0000
0.0000 + 0.1250
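To make the comparison concrete, here is a small sketch that computes the summed tally above next to the combination used in the standard naive Bayes model. That model is not something stated in this post: it assumes words are independent given the author, so the per-word likelihoods are multiplied and the prior is applied once. All names in the code are illustrative.

```python
from fractions import Fraction
from math import prod

emails_per_author = {"Bob": 5, "Sue": 3, "John": 8}

# Number of each author's emails containing word 1 and word 2 (from the post).
word_counts = [
    {"Bob": 2, "Sue": 1, "John": 0},
    {"Bob": 1, "Sue": 0, "John": 3},
]
prior = Fraction(1, 3)


def likelihood(counts, author):
    """P(E/h): fraction of the author's emails that contain the word."""
    return Fraction(counts[author], emails_per_author[author])


# The tally from the post: add up P(h) * P(E/h) for each word.
summed = {
    a: sum(prior * likelihood(c, a) for c in word_counts)
    for a in emails_per_author
}

# Naive Bayes (assumed here): prior times the product of per-word likelihoods.
multiplied = {
    a: prior * prod(likelihood(c, a) for c in word_counts)
    for a in emails_per_author
}

for a in emails_per_author:
    print(f"{a}: summed={float(summed[a]):.4f} multiplied={float(multiplied[a]):.4f}")
```

With these counts the product is zero for Sue and John, because each has one word that never appears in their emails. Implementations commonly smooth the counts to avoid that, which is beyond what this sketch shows.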


Any help would be greatly appreciated!
 