
4.6: Bayes’ Theorem


    Another trick that helps compute probabilities in practice is Bayes’ Theorem. We’ve defined Pr(\(A|K\)) as \(\frac{\text{Pr}(A \cap K)}{\text{Pr}(K)}\), and by swapping the letters we get Pr(\(K|A\)) = \(\frac{\text{Pr}(K \cap A)}{\text{Pr}(A)}\). Since \(A \cap K\) and \(K \cap A\) are the same event, the second equation tells us that Pr(\(A \cap K\)) = Pr(\(K|A\)) Pr(\(A\)), and substituting that into the first yields:

    \[\begin{aligned} \text{Pr}(A|K) = \dfrac{\text{Pr}(K|A) \ \text{Pr}(A)}{\text{Pr}(K)}\end{aligned}\]

    Now this is a very, very powerful equation that has a multitude of uses throughout computer science and statistics. What makes it powerful is that it allows us to express Pr(\(A|K\)), a quantity often very difficult to estimate, in terms of Pr(\(K|A\)), which is often much easier.
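    To make this concrete, here is a minimal Python sketch of the formula as a function. The parameter names are our own, chosen to mirror the notation above:

```python
def bayes(p_k_given_a, p_a, p_k):
    """Return Pr(A|K) = Pr(K|A) * Pr(A) / Pr(K)."""
    return p_k_given_a * p_a / p_k
```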

    A simple and commonly cited example is that of interpreting medical exam results for the presence of a disease. If your doctor recommends that you undergo a blood test to see if you have some rare condition, you might test positive or negative. But suppose you do indeed test positive. What’s the probability that you actually have the disease? That, of course, is the key question.

    In symbols, we’re looking for Pr(\(D|T\)), where \(D\) is the event that you actually have the disease in question, and \(T\) is the event that you test positive for it. But this is hard to approximate with available data. For one thing, most people who undergo this test don’t test positive, so we don’t have a ton of examples of event \(T\) occurring whereby we could count the times \(D\) also occurred. But worse, it’s hard to tell whether a patient has the disease, at least before advanced symptoms develop — that, after all, is the purpose of our test!

    Bayes’ Theorem, however, lets us rewrite this as:

    \[\begin{aligned} \text{Pr}(D|T) = \dfrac{\text{Pr}(T|D) \ \text{Pr}(D)}{\text{Pr}(T)}.\end{aligned}\]

    Now we have Pr(\(D|T\)), the hard quantity to compute, in terms of three things we can get data for. To estimate Pr(\(T|D\)), the probability of a person who has the disease testing positive, we can administer the test to unfortunate patients with advanced symptoms and count how many of them test positive. To estimate Pr(\(D\)), the prior probability of having the disease, we can divide the number of known cases by the population as a whole to find how prevalent it is. And getting Pr(\(T\)), the probability of testing positive, is easy since we know the results of the tests we’ve administered.
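    In code, each of these estimates is just a ratio of counts. Here is a small sketch of that idea; every count below is invented purely to show the shape of the computation, not taken from any real data:

```python
# All counts are made up for illustration -- not real medical data.
tested_sick, positive_among_sick = 1000, 990
p_T_given_D = positive_among_sick / tested_sick   # estimate of Pr(T|D)

known_cases, population = 330_000, 330_000_000
p_D = known_cases / population                    # estimate of Pr(D)

tests_given, tests_positive = 50_000, 560
p_T = tests_positive / tests_given                # estimate of Pr(T)
```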

    In numbers, suppose our test is 99% accurate — i.e., if someone actually has the disease, there’s a .99 probability they’ll test positive for it, and if they don’t have it, there’s a .99 probability they’ll test negative. Let’s also assume that this is a very rare disease: only one in a thousand people contracts it.

    When we interpret those numbers in light of the formula we’re seeking to populate, we realize that Pr(\(T|D\)) = .99, and Pr(\(D\)) = \(\frac{1}{1000}\). The only other quantity we need is Pr(\(T\)); once we have it, we’re all set. But how do we figure out Pr(\(T\)), the probability of testing positive?

    Answer: use the Law of Total Probability. There are two different “ways” to test positive: (1) to actually have the disease, and (correctly) test positive for it, or (2) to not have the disease, but incorrectly test positive for it anyway because the test was wrong. Let’s compute this:

    \[\begin{aligned} \text{Pr}(T) &= \text{Pr}(T|D) \ \text{Pr}(D) + \text{Pr}(T|\overline{D}) \ \text{Pr}({\overline{D}}) \notag \\ &= .99 \cdot \frac{1}{1000} + .01 \cdot \frac{999}{1000} \notag \\ &= .00099 + .00999 = .01098 \label{totalprobeq}\end{aligned}\]

    See how that works? If I do have the disease (and there’s a 1 in 1,000 chance of that), there’s a .99 probability of me testing positive. On the other hand, if I don’t have the disease (a 999 in 1,000 chance of that), there’s a .01 probability of me testing positive anyway. The sum of the probabilities of those two mutually exclusive cases is .01098.
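    As a quick sanity check, here is that same Total Probability computation in Python (the variable names are ours):

```python
p_T_given_D     = 0.99       # Pr(T|D): a sick person tests positive
p_T_given_not_D = 0.01       # Pr(T | not D): a healthy person tests positive anyway
p_D             = 1 / 1000   # Pr(D): prevalence of the disease
p_not_D         = 1 - p_D    # Pr(not D)

sick_and_positive    = p_T_given_D * p_D            # 0.00099
healthy_but_positive = p_T_given_not_D * p_not_D    # 0.00999
print(round(sick_and_positive + healthy_but_positive, 5))   # 0.01098
```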

    Now we can use our Bayes’ Theorem formula to deduce:

    \[\begin{aligned} \text{Pr}(D|T) &= \dfrac{\text{Pr}(T|D) \ \text{Pr}(D)}{\text{Pr}(T)} \\ &= \dfrac{.99 \cdot \frac{1}{1000}}{.01098} \approx .0902\end{aligned}\]
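    Continuing in Python, plugging the numbers into Bayes’ Theorem confirms the result (again, the variable names are ours; the .01098 comes from the Total Probability step above):

```python
p_T_given_D = 0.99
p_D = 1 / 1000
p_T = 0.01098    # Pr(T), computed via the Law of Total Probability

p_D_given_T = p_T_given_D * p_D / p_T
print(round(p_D_given_T, 4))   # 0.0902
```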

    Wow. We tested positive on a 99% accurate medical exam, yet we only have about a 9% chance of actually having the disease! Great news for the patient, but a head-scratcher for the math student. How can we understand this? Well, the key is to look back at that Total Probability calculation above. Remember that there were two ways to test positive: one where you had the disease, and one where you didn’t. Look at the contribution to the whole that each of those two probabilities produced. The first was .00099, and the second was .00999, over ten times higher. Why? Simply because the disease is so rare. Think about it: the test fails once every hundred times, but a random person only has the disease once every thousand times. If you test positive, it’s far more likely that the test screwed up than that you actually have the disease, which is rarer than blue moons.

    Anyway, all the stuff about diseases and tests is a side note. The main point is that Bayes’ Theorem allows us to recast a search for Pr(\(X|Y\)) into a search for Pr(\(Y|X\)), which is often far easier to find numbers for.

    One of many computer science applications of Bayes’ Theorem is in text mining. In this field, we computationally analyze the words in documents in order to automatically classify them or form summaries or conclusions about their contents. One goal might be to identify the true author of a document, given samples of the writing of various suspected authors. Consider the Federalist Papers, the group of highly influential \(18^{th}\) century essays that argued for ratifying the Constitution. These essays were jointly authored by Alexander Hamilton, James Madison, and John Jay, but it was uncertain for many years which of these authors wrote which specific essays.

    Suppose we’re interested in determining which of these three Founding Fathers actually wrote essay #84 in the collection. To do this, the logical approach is to find Pr(Hamilton\(|\)essay84), Pr(Madison\(|\)essay84), and Pr(Jay\(|\)essay84), and then choose the author with the highest probability. But how can we possibly find out Pr(Hamilton\(|\)essay84)? “Given that essay #84 has these words in this order, what’s the probability that Hamilton wrote it?” Impossible to know.

    But with Bayes’ Theorem, we can restructure this in terms of Pr(essay84\(|\)Hamilton) instead. That’s a horse of a different color. We have lots of known samples of Hamilton’s writing (and Madison’s, and Jay’s), so we can ask, “given that Hamilton wrote an essay, what’s the probability that he would have chosen the words that appear in essay #84?” Perhaps essay #84 has a turn of phrase that is very characteristic of Hamilton, and contains certain vocabulary words that Madison never used elsewhere, and has fewer sentences per paragraph than is typical of Jay’s writing. If we can identify the relevant features of the essay and compare them to the writing styles of the candidates, we can use Bayes’ Theorem to estimate the relative probabilities that each of them would have produced that kind of essay. I’m glossing over a lot of details here, but this trick of exchanging one conditional probability for the other is the backbone of this whole technique.
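    To give a flavor of what this looks like in code, here is a toy sketch in the style of a naive Bayes classifier. Everything about it is invented for illustration: the “writing samples” are a few made-up words, not real Federalist Papers text, and a real system would use vastly more data and more sophisticated features:

```python
import math

# Invented toy writing samples -- NOT actual Federalist Papers text.
samples = {
    "Hamilton": "upon the national government upon which the energy".split(),
    "Madison":  "whilst the powers reserved to the states whilst".split(),
    "Jay":      "the treaties and the nations with whom we treat".split(),
}
essay84 = "upon the government upon the national energy".split()

def log_likelihood(essay_words, author_words):
    """Estimate log Pr(essay|author), treating each word as an independent
    draw from the author's vocabulary. Add-one smoothing keeps a single
    unseen word from zeroing out the whole product."""
    vocab_size = len(set(author_words) | set(essay_words))
    total = len(author_words)
    return sum(math.log((author_words.count(w) + 1) / (total + vocab_size))
               for w in essay_words)

# With equal priors Pr(author), Bayes' Theorem says the author who maximizes
# Pr(essay|author) also maximizes Pr(author|essay) -- so we can rank by it.
best = max(samples, key=lambda a: log_likelihood(essay84, samples[a]))
print(best)   # "Hamilton" on this toy data
```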


