Skip to main content
Mathematics LibreTexts

11.3: Multiple Comparisons

  • Page ID
    130658
  • \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

    \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

    \( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

    ( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

    \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

    \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

    \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

    \( \newcommand{\Span}{\mathrm{span}}\)

    \( \newcommand{\id}{\mathrm{id}}\)

    \( \newcommand{\Span}{\mathrm{span}}\)

    \( \newcommand{\kernel}{\mathrm{null}\,}\)

    \( \newcommand{\range}{\mathrm{range}\,}\)

    \( \newcommand{\RealPart}{\mathrm{Re}}\)

    \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

    \( \newcommand{\Argument}{\mathrm{Arg}}\)

    \( \newcommand{\norm}[1]{\| #1 \|}\)

    \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

    \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

    \( \newcommand{\vectorA}[1]{\vec{#1}}      % arrow\)

    \( \newcommand{\vectorAt}[1]{\vec{\text{#1}}}      % arrow\)

    \( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

    \( \newcommand{\vectorC}[1]{\textbf{#1}} \)

    \( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

    \( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

    \( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

    \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

    \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

    Learning Objectives
    • When you perform a large number of statistical tests, some will have \(P\) values less than \(0.05\) purely by chance, even if all your null hypotheses are really true. The Bonferroni correction is one simple way to take this into account; adjusting the false discovery rate using the Benjamini-Hochberg procedure is a more powerful method.

    The problem with multiple comparisons

    Any time you reject a null hypothesis because a \(P\) value is less than your critical value, it's possible that you're wrong; the null hypothesis might really be true, and your significant result might be due to chance. A \(P\) value of \(0.05\) means that there's a \(5\%\) chance of getting your observed result, if the null hypothesis were true. It does not mean that there's a \(5\%\) chance that the null hypothesis is true.

    For example, if you do \(100\) statistical tests, and for all of them the null hypothesis is actually true, you'd expect about \(5\) of the tests to be significant at the \(P<0.05\) level, just due to chance. In that case, you'd have about \(5\) statistically significant results, all of which were false positives. The cost, in time, effort and perhaps money, could be quite high if you based important conclusions on these false positives, and it would at least be embarrassing for you once other people did further research and found that you'd been mistaken.

    This problem, that when you do multiple statistical tests, some fraction will be false positives, has received increasing attention in the last few years. This is important for such techniques as the use of microarrays, which make it possible to measure RNA quantities for tens of thousands of genes at once; brain scanning, in which blood flow can be estimated in \(100,000\) or more three-dimensional bits of brain; and evolutionary genomics, where the sequences of every gene in the genome of two or more species can be compared. There is no universally accepted approach for dealing with the problem of multiple comparisons; it is an area of active research, both in the mathematical details and broader epistomological questions.

    Controlling the familywise error rate - Bonferroni Correction

    The classic approach to the multiple comparison problem is to control the familywise error rate. Instead of setting the critical \(P\) level for significance, or alpha, to \(0.05\), you use a lower critical value. If the null hypothesis is true for all of the tests, the probability of getting one result that is significant at this new, lower critical value is \(0.05\). In other words, if all the null hypotheses are true, the probability that the family of tests includes one or more false positives due to chance is \(0.05\).

    The most common way to control the familywise error rate is with the Bonferroni correction. You find the critical value (alpha) for an individual test by dividing the familywise error rate (usually \(0.05\)) by the number of tests. Thus if you are doing \(100\) statistical tests, the critical value for an individual test would be \(0.05/100=0.0005\), and you would only consider individual tests with \(P<0.0005\) to be significant. As an example, García-Arenzana et al. (2014) tested associations of \(25\) dietary variables with mammographic density, an important risk factor for breast cancer, in Spanish women. They found the following results:

    Dietary variable P value
    Total calories <0.001
    Olive oil 0.008
    Whole milk 0.039
    White meat 0.041
    Proteins 0.042
    Nuts 0.06
    Cereals and pasta 0.074
    White fish 0.205
    Butter 0.212
    Vegetables 0.216
    Skimmed milk 0.222
    Red meat 0.251
    Fruit 0.269
    Eggs 0.275
    Blue fish 0.34
    Legumes 0.341
    Carbohydrates 0.384
    Potatoes 0.569
    Bread 0.594
    Fats 0.696
    Sweets 0.762
    Dairy products 0.94
    Semi-skimmed milk 0.942
    Total meat 0.975
    Processed meat 0.986

    As you can see, five of the variables show a significant (\(P<0.05\)) \(P\) value. However, because García-Arenzana et al. (2014) tested \(25\) dietary variables, you'd expect one or two variables to show a significant result purely by chance, even if diet had no real effect on mammographic density. Applying the Bonferroni correction, you'd divide \(P=0.05\) by the number of tests (\(25\)) to get the Bonferroni critical value, so a test would have to have \(P<0.002\) to be significant. Under that criterion, only the test for total calories is significant.

    The Bonferroni correction is appropriate when a single false positive in a set of tests would be a problem. It is mainly useful when there are a fairly small number of multiple comparisons and you're looking for one or two that might be significant. However, if you have a large number of multiple comparisons and you're looking for many that might be significant, the Bonferroni correction may lead to a very high rate of false negatives. For example, let's say you're comparing the expression level of \(20,000\) genes between liver cancer tissue and normal liver tissue. Based on previous studies, you are hoping to find dozens or hundreds of genes with different expression levels. If you use the Bonferroni correction, a \(P\) value would have to be less than \(0.05/20000=0.0000025\) to be significant. Only genes with huge differences in expression will have a \(P\) value that low, and could miss out on a lot of important differences just because you wanted to be sure that your results did not include a single false negative.

    An important issue with the Bonferroni correction is deciding what a "family" of statistical tests is. García-Arenzana et al. (2014) tested \(25\) dietary variables, so are these tests one "family," making the critical \(P\) value \(0.05/25\)? But they also measured \(13\) non-dietary variables such as age, education, and socioeconomic status; should they be included in the family of tests, making the critical \(P\) value \(0.05/38\)? And what if in 2015, García-Arenzana et al. write another paper in which they compare \(30\) dietary variables between breast cancer and non-breast cancer patients; should they include those in their family of tests, and go back and reanalyze the data in their 2014 paper using a critical \(P\) value of \(0.05/55\)? There is no firm rule on this; you'll have to use your judgment, based on just how bad a false positive would be. Obviously, you should make this decision before you look at the results, otherwise it would be too easy to subconsciously rationalize a family size that gives you the results you want.

    Assumption

    The Bonferroni correction assumes that the individual tests are independent of each other, as when you are comparing sample A vs. sample B, C vs. D, E vs. F, etc. If you are comparing sample A vs. sample B, A vs. C, A vs. D, etc., the comparisons are not independent; if A is higher than B, there's a good chance that A will be higher than C as well.

    References

    1. García-Arenzana, N., E.M. Navarrete-Muñoz, V. Lope, P. Moreo, S. Laso-Pablos, N. Ascunce, F. Casanova-Gómez, C. Sánchez-Contador, C. Santamariña, N. Aragonés, B.P. Gómez, J. Vioque, and M. Pollán. 2014. Calorie intake, olive oil consumption and mammographic density among Spanish women. International journal of cancer 134: 1916-1925.
    2. Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57: 289-300.
    3. Reiner, A., D. Yekutieli and Y. Benjamini. 2003. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19: 368-375.
    4. Simes, R.J. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751-754.

    This page titled 11.3: Multiple Comparisons is shared under a not declared license and was authored, remixed, and/or curated by John H. McDonald via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.