4.4: Conditional probability


I mentioned that Bayesians are especially concerned with the idea of revising estimates about probability based on new information that may come to light. This notion can be crystallized in the idea of conditional probability. When we talk about the conditional probability of an event \(A\), we mean “what’s the probability that \(A\) occurs, given that I know some other event \(K\) has also occurred?” Think of \(K\) as “background knowledge”: it’s additional information which, when known, may influence how likely we think \(A\) is to have occurred. Provided \(\text{Pr}(K) > 0\), it can be computed as follows:

\[\text{Pr}(A|K) = \dfrac{\text{Pr}(A \cap K)}{\text{Pr}(K)}\]

We pronounce Pr(\(A|K\)) as “the probability of \(A\) given \(K\).” It is the conditional probability of \(A\), or “the probability of \(A\) conditioned on \(K\).” We’ll sometimes call plain old Pr(\(A\)) the a priori probability, or the prior probability if we don’t want to sound Latin. The prior is simply the original unadjusted probability, if we aren’t privy to the background information \(K\).

Let’s go back to American Idol. We know that the probability of an underage winner is only .4, because \(U\) = { Kelly, Fantasia }, and we estimate that each of them has a .2 probability of winning. So it seems more likely than not that our winner will be over 21. But wait: suppose we had some additional information. Just before the outcome is announced, news is leaked through a Rupert Murdoch news source that the winner is a woman! If we believe this reporter, does that change our expectation about how old the winner is likely to be?

Indeed it does. Knowing that the winner is female eliminates Dave from consideration. Looking back at Figure 4.2.1, we can see that once we know Dave is out of the running, the remaining pool consists of just \(F\), which includes Kelly, Fantasia, and Carrie. The question is, how do we update our probability from .4 to reflect the fact that only these three ladies are left?

In this case \(F\) is the background knowledge: we know that the event \(F\) has occurred. And we want to know how likely \(U\) is to also have occurred. This is found easily:

\[\begin{aligned} \text{Pr}(U|F) &= \dfrac{\text{Pr}(U \cap F)}{\text{Pr}(F)} \\ &= \dfrac{\text{Pr}(\{\text{Kelly,Fantasia}\})}{\text{Pr}(\{\text{Kelly,Fantasia,Carrie}\})} \\ &= \dfrac{.4}{.5} = .8.\end{aligned}\]

Our estimated chance of an underage winner doubled once we found out she was female (even though we don’t yet know which female).
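The same calculation can be sketched in a few lines of Python. This is only an illustration, not part of the original development: the names `prior`, `pr`, and `pr_given` are made up for the example, and the individual probabilities are the ones implied by the text (Kelly and Fantasia at .2 each, Pr(\(F\)) = .5 leaving .1 for Carrie, and the remaining .5 for David). Exact fractions are used so that the output isn't muddied by floating-point rounding.

```python
from fractions import Fraction

# Prior probabilities implied by the text (an assumption of this sketch):
# Kelly and Fantasia at .2 each, Carrie at .1, David at .5.
prior = {
    "Kelly": Fraction(2, 10),
    "Fantasia": Fraction(2, 10),
    "Carrie": Fraction(1, 10),
    "David": Fraction(5, 10),
}

def pr(event):
    """Probability of an event (a set of outcomes) under the prior."""
    return sum((prior[o] for o in event), Fraction(0))

def pr_given(a, k):
    """Pr(A|K) = Pr(A ∩ K) / Pr(K); undefined when Pr(K) is 0."""
    if pr(k) == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return pr(a & k) / pr(k)

U = {"Kelly", "Fantasia"}            # underage
F = {"Kelly", "Fantasia", "Carrie"}  # female

print(pr(U))           # 2/5
print(pr_given(U, F))  # 4/5
```

The set intersection `a & k` plays the role of \(A \cap K\), and the division by `pr(k)` is the division by Pr(\(K\)) in the formula.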

If you stare at the equation and diagram, you’ll see the rationale for this formula. Kelly and Fantasia originally had only .4 of the entire probability between them. But once David was axed, the question became: “what percentage of the remaining probability do Kelly and Fantasia have?” The answer was no longer .4 out of 1, but .4 out of .5, since only .5 of the whole was left post-David. This is why we divided by Pr(\(F\)): that’s what we know remains given our background fact.

Now in this case, the conditional probability was higher than the original probability. Could it ever be lower? Easily. Consider the probability of a rock-star winner, Pr(\(R\)). A priori, it’s .7. But again, let’s say we had information leaked to us that the winner, whoever she may be, is female. We can now update our estimate:

\[\begin{aligned} \text{Pr}(R|F) &= \dfrac{\text{Pr}(R \cap F)}{\text{Pr}(F)} \\ &= \dfrac{\text{Pr}(\{\text{Kelly}\})}{\text{Pr}(\{\text{Kelly,Fantasia,Carrie}\})} \\ &= \dfrac{.2}{.5} = .4.\end{aligned}\]

You see, once we find out that David is no longer a possibility, our only remaining hope for a rock star is Kelly. And she has only 40% of the probability that’s left over. Note that this is a higher chance for her personally — she’s got to be excited by the press leak — but it’s lower for rock stars, of which she is only one (and evidently, not the predicted strongest).

Background knowledge can even peg our probability estimate to an extreme: all the way to 0, or to 1. What’s Pr(\(U|C\)), the probability of an underage winner, given that he/she is a country singer? The intersection of \(U\) and \(C\) is empty, so Pr(\(U \cap C\)) = 0, and this makes Pr(\(U|C\)) = 0. In words: a country winner eliminates any possibility of an underage winner. And what’s Pr(\(F|U\)), the probability that a woman wins, given that we know the winner to be underage? Well, \(F \cap U\) and \(U\) are the same (check me), so \(\frac{\text{Pr}(F \cap U)}{\text{Pr}(U)} = \frac{.4}{.4} = 1\). Therefore, an underage winner guarantees a female winner.
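Written out in the same style as the earlier calculations, the Pr(\(U|C\)) case looks like this. One assumption to flag: this section never lists the members of \(C\), so I’m taking \(C\) = { Carrie }, whose .1 follows from Pr(\(F\)) = .5. The conclusion doesn’t depend on that choice; any \(C\) with nonzero probability that shares no members with \(U\) gives the same zero.

\[\begin{aligned} \text{Pr}(U|C) &= \dfrac{\text{Pr}(U \cap C)}{\text{Pr}(C)} \\ &= \dfrac{\text{Pr}(\varnothing)}{\text{Pr}(\{\text{Carrie}\})} \\ &= \dfrac{0}{.1} = 0.\end{aligned}\]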

The way I think about conditional probability is this: look at the diagram, consider the events known to have occurred, and then mentally block out everything except that. Once we know the background fact(s), we’re essentially dealing with a restricted world. Take the example of the known female winner. Once we know that event \(F\) in fact occurred, we can visually filter out David, and look at the \(F\) blob as though that were our entire world. In this restricted female-only view, the underage elements comprise a greater percentage of the total than they did before. And half of the rock-star elements have now been obscured, leaving only Kelly as the one-of-the-remaining-three.
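This “block out everything else” picture can be sketched directly in code: throw away the outcomes that are inconsistent with what we know, then rescale what’s left so it sums to 1 again. The helper names `condition` and `mass` are invented for this illustration, the individual probabilities are the ones implied by the text (Kelly .2, Fantasia .2, Carrie .1, David .5), and \(R\) is taken to be { Kelly, David }, which is what Pr(\(R\)) = .7 and \(R \cap F\) = { Kelly } suggest.

```python
from fractions import Fraction

# Prior probabilities implied by the text (an assumption of this sketch).
prior = {
    "Kelly": Fraction(2, 10),
    "Fantasia": Fraction(2, 10),
    "Carrie": Fraction(1, 10),
    "David": Fraction(5, 10),
}

def mass(dist, event):
    """Total probability that `dist` assigns to the outcomes in `event`."""
    return sum((p for o, p in dist.items() if o in event), Fraction(0))

def condition(dist, known):
    """Restrict `dist` to the outcomes in `known`, rescaled to sum to 1."""
    total = mass(dist, known)
    if total == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return {o: p / total for o, p in dist.items() if o in known}

F = {"Kelly", "Fantasia", "Carrie"}  # female
U = {"Kelly", "Fantasia"}            # underage
R = {"Kelly", "David"}               # rock stars (assumed membership)

world = condition(prior, F)          # the restricted, female-only world
print(sorted(world))                 # ['Carrie', 'Fantasia', 'Kelly']
print(mass(world, U))                # 4/5
print(mass(world, R))                # 2/5
```

Inside the restricted `world`, ordinary (unconditional) probabilities are exactly the conditional probabilities Pr(\(U|F\)) and Pr(\(R|F\)) from before; David simply isn’t there to be counted.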

Many psychologists, by the way, claim that we’re constantly doing this sort of thing in our minds: gathering facts, then revising our beliefs about the world in light of those facts. We start by believing that Pr(\(X\)) is approximately some value. Then we learn \(K_1\) has occurred, and we update this to Pr(\(X|K_1\)). Then we learn that \(K_2\) has also occurred, and so now we have Pr(\(X|K_1 \cap K_2\)). (Can you see why it’s the intersection?) The more we learn, the more we revise our estimate up or down, presumably getting more accurate as we go. Another way of looking at it is that every time we learn something new is true, we also learn that its opposite is not true, and therefore we can eliminate some parts of the theoretically-possible universe that we have now ruled out. The denominator gets smaller and smaller as we eliminate possibilities.
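Here is a small sketch of that step-by-step updating. It uses a hypothetical example that is not from this chapter: one roll of a fair six-sided die, where \(X\) is “rolled a 6,” \(K_1\) is “the roll was even,” and \(K_2\) is “the roll was greater than 3.” Because every outcome is equally likely, conditional probability reduces to counting, which keeps the code short.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}

def pr_given(a, k):
    """Pr(A|K) when all outcomes are equally likely: |A ∩ K| / |K|."""
    return Fraction(len(a & k), len(k))

X = {6}           # rolled a 6
K1 = {2, 4, 6}    # first fact learned: the roll was even
K2 = {4, 5, 6}    # second fact learned: the roll was greater than 3

world = set(outcomes)
print(len(world), pr_given(X, world))  # 6 1/6

world &= K1                            # learn K1: rule out the odd rolls
print(len(world), pr_given(X, world))  # 3 1/3

world &= K2                            # learn K2: rule out 2 as well
print(len(world), pr_given(X, world))  # 2 1/2
```

After both facts are in, `world` is exactly `K1 & K2`, which is why the final estimate is Pr(\(X|K_1 \cap K_2\)). The first number on each line is the size of the surviving world, and you can watch the denominator shrink from 6 to 3 to 2.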

Keep in mind, by the way, that unlike union and intersection, conditional probability is not commutative. In other words, Pr(\(X|Y\)) \(\neq\) Pr(\(Y|X\)) in general. To take just one example, look again at the \(F\) and \(U\) sets from American Idol. Pr(\(F|U\)), as we already computed, is equal to 1 since if \(U\) has occurred, we automatically know that \(F\) has also occurred (there aren’t any underage contestants except females). But the reverse is certainly not true: just because we have a female winner doesn’t mean we have an underage winner, since the winner might be Carrie. Working it out, Pr(\(U|F\)) = \(\frac{\text{Pr}(U \cap F)}{\text{Pr}(F)} = \frac{.4}{.5} = .8\). Higher than Pr(\(U\)), but not 1.
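One way to see where the asymmetry comes from is to multiply each conditional probability by its own denominator. This is just the definition rearranged, assuming both Pr(\(X\)) and Pr(\(Y\)) are nonzero:

\[\text{Pr}(X|Y) \cdot \text{Pr}(Y) = \text{Pr}(X \cap Y) = \text{Pr}(Y|X) \cdot \text{Pr}(X)\]

The two conditional probabilities share the numerator Pr(\(X \cap Y\)) and differ only in what they divide by, so they agree only when Pr(\(X\)) = Pr(\(Y\)) or when the intersection has probability 0. With our sets, both products come out to the same .4:

\[\text{Pr}(U|F) \cdot \text{Pr}(F) = .8 \times .5 = .4 = 1 \times .4 = \text{Pr}(F|U) \cdot \text{Pr}(U)\]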