corroborating facts and probabilities

middlegame

I'm working at a company that needs to match doctors on a health care claim to a list of doctors in a network. In some cases all we have is the name and an address. The general problem is larger than this but here's a question that would help me along:

I have a list of names and each name is associated with exactly one address. No addresses are shared. If I have a claim come in with both a name and an address on it, I separately find the probability (a separate problem) that the name and address typed on the claim match a name and an address from my list (what they type might not match exactly because of transcription errors, etc).

So if I have the probability that the name is a match at .5
and the probability that the address is a match at .5,

it would make intuitive sense that the two pieces of evidence together would indicate that the name/address pair has a higher probability than .5 of being the correct match.

How can I put this intuition into practice?

Thanks.
 
One way to look at this, mathematically, might be as a classic problem in Bayesian inference. Its applications in real-world scenarios are widespread. For instance, suppose you are sitting on a jury that hears DNA evidence. But you know there's a chance of a false positive. It's very small, but still nonzero. Let's say there's a one-in-a-million chance that the DNA will match the defendant if the defendant is innocent. That's a false positive (p = .000001).

Now, what is the probability that the defendant committed the crime, before you consider the DNA result? Suppose, because you have also seen other evidence, you decide that there's a 20% chance the defendant is guilty anyway. We will denote this as event A (that the defendant is guilty).

A positive DNA test we will call event B.

What you as a juror need to formally and methodically determine is:

Q1: If the DNA test is negative, what is the probability that the defendant is guilty?

Q2: If the DNA test is positive, what is the probability that the defendant is guilty?

You already know the probability of event A (.20). But now, if that DNA test comes back positive, you have to revise this probability given the new evidence. It's the same with your name and address. If you can determine the probability of one event happening before you know another event, then you can revise that probability based on the outcome of the second event.

Look up references on Bayesian inference to see if you can find a more direct example.

Your intuition is correct. In fact, the revised probability of guilt given both the a priori evidence in the case (p = .20) and the positive DNA match would be near unity (p ≈ 1), which is bigger than either probability taken on its own. The Bayesian stuff is a way to formalize that with solid mathematics. Your probabilities of 0.5, however, would be in question, and that is where your research has to come into play. The revised probability is a matter of computation, but you have to start with good data.
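To make the jury example concrete, here is a minimal Python sketch of the update. The 20% prior and the one-in-a-million false positive rate come from the post above. The test's sensitivity (the chance of a positive result when the defendant is guilty) is not given anywhere in this thread, so the 0.99 below is purely an assumption for illustration, as are the function and variable names.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | E) from P(H), P(E | H) and P(E | not H)."""
    numerator = prior * p_evidence_given_h
    denominator = numerator + (1.0 - prior) * p_evidence_given_not_h
    return numerator / denominator


prior_guilt = 0.20          # P(A): from the example above
false_positive = 0.000001   # P(B | not A): from the example above
sensitivity = 0.99          # P(B | A): ASSUMED, not stated in the thread

# Q2: probability of guilt given a positive DNA test
p_guilty_given_positive = posterior(prior_guilt, sensitivity, false_positive)

# Q1: probability of guilt given a negative DNA test
p_guilty_given_negative = posterior(
    prior_guilt, 1.0 - sensitivity, 1.0 - false_positive
)

print(p_guilty_given_positive)
print(p_guilty_given_negative)
```

With these inputs the positive test pushes the probability of guilt very close to 1, and a negative test pulls it well below the 20% prior. Different sensitivity values would change the exact figures but not the direction.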

Q1: What is the real probability that you will identify the doctor correctly given a matching name?

Q2: What is the real probability that you will identify the doctor correctly given a matching address?

Those are your values, and I'm not sure I see where you're getting 0.5. Experience? Can you document it with a known sample?

Anyway, once you know those probabilities, applying the formula for combining the two pieces of evidence is a straightforward computation. Put your work into determining the probabilities of the individual events, though, because everything depends on it.
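One common way to do that combination, sketched here in Python, is to work in odds and multiply by a likelihood ratio for each piece of evidence. This is only valid if the name evidence and the address evidence are conditionally independent given whether the record is the right doctor, which is an assumption you would have to check against your data (typing errors in the two fields could well be correlated). The prior and both likelihood ratios below are made-up numbers for illustration, not values from this thread.

```python
def combine(prior, likelihood_ratios):
    """Posterior probability after applying independent likelihood ratios.

    Each ratio is P(evidence | true match) / P(evidence | not a match).
    ASSUMES the pieces of evidence are conditionally independent.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


prior_match = 0.01   # hypothetical: chance a candidate record is the right doctor
lr_name = 30.0       # hypothetical likelihood ratio for the observed name agreement
lr_address = 50.0    # hypothetical likelihood ratio for the observed address agreement

p_name_only = combine(prior_match, [lr_name])
p_address_only = combine(prior_match, [lr_address])
p_both = combine(prior_match, [lr_name, lr_address])

print(p_name_only, p_address_only, p_both)
```

With these illustrative inputs, the probability given both pieces of evidence comes out higher than the probability given either one alone, which matches the intuition in the original question. A likelihood ratio of exactly 1 leaves the probability unchanged, so evidence only helps to the extent that it is more likely under a true match than under a non-match.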
 
Thanks for the reply. The .5 figures were for the example only. I intend to try to figure the probability that the name that comes in is the name that the person intended to type by finding some data on error distributions made while transcribing text. (I'll find matches in the database within the acceptable error range utilizing q-grams and distance span algorithms.) Obviously, more than one doctor can have the same name, and more than one could also have the same address, so the problem gets more complicated.
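For anyone following along, here is a rough Python sketch of what a q-gram comparison can look like. It scores two strings by the overlap of their padded q-grams using a Dice coefficient. The padding character, the choice of q = 2, the Dice scoring and the function names are all assumptions made for this illustration; the post above does not say which variant is intended. The score is a similarity measure, not a probability, so it would still have to be calibrated against real error data.

```python
from collections import Counter


def qgrams(text, q=2):
    """Return the list of overlapping q-grams of text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets: 1.0 means identical profiles."""
    grams_a = Counter(qgrams(a, q))
    grams_b = Counter(qgrams(b, q))
    overlap = sum((grams_a & grams_b).values())   # shared q-grams, with counts
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2.0 * overlap / total if total else 1.0


print(qgram_similarity("Smith", "Smith"))
print(qgram_similarity("Smith", "Smyth"))
print(qgram_similarity("Smith", "Jones"))
```

A one-letter transcription slip such as "Smyth" for "Smith" still shares most of its q-grams with the original, while an unrelated name shares few or none, which is what makes the measure useful for shortlisting candidates.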

This definitely gives me a research direction. My math skills are quite rusty, but I did get a degree in electrical engineering, so at least I have some background that will get exercised. My guess is it will take me a couple of months to get back to a point where I could fully formulate the problem. Fortunately, this is a long-term project.
 