Bayes with multiple hypotheses?

sploo · Sep 10, 2014

I'm trying to work out a Bayes algorithm that's a little bit like an email spam filter, but I've hit the limit of my knowledge in this field.

I've read "Building a simple spam filter with naive bayes classifier and c#" (netmatze), and I'm comfortable with the cookie example. I'm also happy that I understand the spam/not-spam logic presented there. But what if you have more than two hypotheses?

Imagine I have a number of emails from three different people (Bob, Sue, John) and a candidate email, and I want to determine which of them was the most likely author of that email. Based on the nomenclature of the website linked above, we have:

H1 = the email is from Bob
H2 = the email is from Sue
H3 = the email is from John

P(h1) = P(h2) = P(h3) = 1/3

I have 5 emails from Bob, 3 from Sue, and 8 from John. In my candidate email, I take the first word and find it exists in 2 of the emails from Bob, 1 of the emails from Sue, and none of the emails from John. So:

P(E/h1) = 2 / 5 (two mails out of five)
P(E/h2) = 1 / 3 (one mail out of three)
P(E/h3) = 0 / 8 (no mails out of eight)

Assuming the above link is correct in dropping P(E) (it is the same for every hypothesis, so it shouldn't change which one ranks highest), this...

P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h2/E) = P(h2) * P(E/h2) / P(E)
P(h3/E) = P(h3) * P(E/h3) / P(E)

...becomes:

P(h1/E) = 1/3 * 2/5 = 0.1333
P(h2/E) = 1/3 * 1/3 = 0.1111
P(h3/E) = 1/3 * 0/8 = 0.0000
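
A minimal Python sketch of the arithmetic above, using exact fractions. The names (priors, counts, unnormalized_posteriors) are made up for illustration and don't come from the linked article. It also shows that dividing each score by their sum puts P(E) back in, so the three values add up to 1:

```python
from fractions import Fraction

# Equal priors for the three candidate authors.
priors = {"Bob": Fraction(1, 3), "Sue": Fraction(1, 3), "John": Fraction(1, 3)}

# (emails containing the first word, total emails) for each author.
counts = {"Bob": (2, 5), "Sue": (1, 3), "John": (0, 8)}

def unnormalized_posteriors(priors, counts):
    """Return P(h) * P(E/h) for each hypothesis, with P(E) dropped."""
    return {h: priors[h] * Fraction(hits, total)
            for h, (hits, total) in counts.items()}

scores = unnormalized_posteriors(priors, counts)

# P(E) is the sum of the unnormalized scores over all hypotheses.
evidence = sum(scores.values())
posteriors = {h: s / evidence for h, s in scores.items()}

for h in scores:
    print(h, float(scores[h]), float(posteriors[h]))
```

The ranking is the same with or without the division, which is why dropping P(E) is harmless when only the most likely author is needed.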

From that, I'd determine that the first word in the candidate email was more likely to belong to an email from Bob (H1). However, how would I correctly sum the results for the remaining words in the candidate email?

In the example on the above link, we only have two hypotheses (spam or not spam) and a Q value is being calculated for the two respective outcomes:

Q = P(h1/E) / P(h2/E)

The Q values for each word (<1 for H2, >1 for H1) are summed, and then divided by the number of words to produce a final result that's greater or less than one.

I assume that this calculation of a Q value is important, as it's "weighting" one value against another before they're summed. Summing all the P(h1/E) values and dividing them by the sum of the P(h2/E) values would produce a different result. So, how do I do the right thing for >2 hypotheses when I'm trying to "sum" multiple results?
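
To make the difference between those two calculations concrete, here is a small sketch. It follows the averaging of Q values as described in this post, not the article's actual code, and the per-word numbers are hypothetical values chosen only to show that the two results differ:

```python
def average_q(pairs):
    """Mean of the per-word ratios P(h1/E) / P(h2/E)."""
    ratios = [p1 / p2 for p1, p2 in pairs]
    return sum(ratios) / len(ratios)

def ratio_of_sums(pairs):
    """Sum of all P(h1/E) values divided by the sum of all P(h2/E) values."""
    total_h1 = sum(p1 for p1, _ in pairs)
    total_h2 = sum(p2 for _, p2 in pairs)
    return total_h1 / total_h2

# Hypothetical (P(h1/E), P(h2/E)) pairs for two words.
pairs = [(0.4, 0.2), (0.1, 0.3)]

print(average_q(pairs))      # mean of 2.0 and 0.333...
print(ratio_of_sums(pairs))  # 0.5 / 0.5
```

Note that average_q raises ZeroDivisionError as soon as any P(h2/E) is zero, which is exactly what happens with the zero counts in my three-author example.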

For example, imagine I test the next word in the candidate email, and find it exists in 1 mail from Bob, 0 mails from Sue and 3 mails from John. This would give:

P(h1/E) = 1/3 * 1/5 = 0.0666
P(h2/E) = 1/3 * 0/3 = 0.0000
P(h3/E) = 1/3 * 3/8 = 0.1250

I assume it isn't correct to simply sum the P(hx/E) values as follows:

0.1333 + 0.0666
0.1111 + 0.0000
0.0000 + 0.1250
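
For comparison, the textbook naive Bayes treatment doesn't sum per-word posteriors at all. It assumes the words are independent given the author, multiplies the per-word likelihoods under each hypothesis (in practice, adds their logarithms), and normalizes once at the end. The sketch below is one way that might look for the two words above. The add-one smoothing (alpha) is an assumption on my part, added so that a single zero count doesn't eliminate an author outright, and the function and variable names are made up for illustration:

```python
import math

def author_posteriors(priors, totals, word_hits, alpha=1.0):
    """Combine several words by summing log-likelihoods per author.

    word_hits holds one dict per word, mapping each author to the
    number of that author's emails containing the word.
    """
    log_scores = {}
    for author, prior in priors.items():
        log_score = math.log(prior)
        for hits in word_hits:
            # Smoothed P(word present / author); the 2 * alpha covers the
            # two outcomes "word present" and "word absent".
            likelihood = (hits[author] + alpha) / (totals[author] + 2 * alpha)
            log_score += math.log(likelihood)
        log_scores[author] = log_score

    # Subtract the largest log score before exponentiating to avoid underflow.
    peak = max(log_scores.values())
    weights = {a: math.exp(s - peak) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

priors = {"Bob": 1 / 3, "Sue": 1 / 3, "John": 1 / 3}
totals = {"Bob": 5, "Sue": 3, "John": 8}
word_hits = [
    {"Bob": 2, "Sue": 1, "John": 0},  # first word
    {"Bob": 1, "Sue": 0, "John": 3},  # second word
]

result = author_posteriors(priors, totals, word_hits)
print(result)
```

With these counts and alpha = 1, Bob comes out ahead of Sue, and Sue ahead of John. This version only scores words that appear in the candidate email; a fuller model would also account for words that are absent.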

Any help would be greatly appreciated!
