
5.7: An Application to Correlation and Variance


    Suppose the heights \(h_{1}, h_{2}, \dots, h_{n}\) of \(n\) men are measured. Such a data set is called a sample of the heights of all the men in the population under study, and various questions are often asked about such a sample: What is the average height in the sample? How much variation is there in the sample heights, and how can it be measured? What can be inferred from the sample about the heights of all men in the population? How do these heights compare to heights of men in neighbouring countries? Does the prevalence of smoking affect the height of a man?

    The analysis of samples, and of inferences that can be drawn from them, is a subject called mathematical statistics, and an extensive body of information has been developed to answer many such questions. In this section we will describe a few ways that linear algebra can be used.

    It is convenient to represent a sample \(\{x_{1}, x_{2}, \dots, x_{n}\}\) as a sample vector1 \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\) in \(\mathbb{R}^n\). This being done, the dot product in \(\mathbb{R}^n\) provides a convenient tool to study the sample and describe some of the statistical concepts related to it. The most widely known statistic for describing a data set is the sample mean \(\overline{x}\) defined by2

    \[\overline{x} = \frac{1}{n}(x_1 + x_2 + \dots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i \nonumber \]

    The mean \(\overline{x}\) is “typical” of the sample values \(x_{i}\), but may not itself be one of them. The number \(x_{i} - \overline{x}\) is called the deviation of \(x_{i}\) from the mean \(\overline{x}\). The deviation is positive if \(x_{i} > \overline{x}\) and it is negative if \(x_{i} < \overline{x}\). Moreover, the sum of these deviations is zero:

    \[\label{eq:sum_of_deviations} \sum_{i=1}^{n} (x_i - \overline{x}) = \left(\sum_{i=1}^{n} x_i\right) - n\overline{x} = n\overline{x} - n\overline{x} = 0 \]

    This is described by saying that the sample mean \(\overline{x}\) is central to the sample values \(x_{i}\).

    If the mean \(\overline{x}\) is subtracted from each data value \(x_{i}\), the resulting data \(x_{i} - \overline{x}\) are said to be centred. The corresponding data vector is

    \[\mathbf{x}_c = \left[ \begin{array}{cccc} x_1 - \overline{x} & x_2 - \overline{x} & \cdots & x_n - \overline{x} \end{array} \right] \nonumber \]

    and ([eq:sum_of_deviations]) shows that the mean \(\overline{x}_c = 0\). For example, we have plotted the sample \(\mathbf{x} = \left[ \begin{array}{ccccc} -1 & 0 & 1 & 4 & 6 \end{array} \right]\) in the first diagram. The mean is \(\overline{x} = 2\), and the centred sample \(\mathbf{x}_{c} = \left[ \begin{array}{ccccc} -3 & -2 & -1 & 2 & 4 \end{array} \right]\) is also plotted. Thus, the effect of centring is to shift the data by an amount \(\overline{x}\) (to the left if \(\overline{x}\) is positive) so that the mean moves to \(0\).
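
    As a quick numerical illustration, a few lines of Python reproduce this example: the mean of the plotted sample, the cancellation of the deviations promised by ([eq:sum_of_deviations]), and the centred sample. The code is only a sketch of the computation described above.

```python
# Centre the sample x = [-1, 0, 1, 4, 6] from the example above.
x = [-1, 0, 1, 4, 6]
n = len(x)

mean_x = sum(x) / n                      # sample mean (1/n) * sum of the x_i
deviations = [xi - mean_x for xi in x]   # deviations x_i - mean; this is the centred sample x_c

print(mean_x)           # 2.0
print(sum(deviations))  # 0.0 -- the deviations always cancel
print(deviations)       # [-3.0, -2.0, -1.0, 2.0, 4.0]
```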

    Another question that arises about samples is how much variability there is in the sample

    \[\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right] \nonumber \]

    that is, how widely the data are “spread out” around the sample mean \(\overline{x}\). A natural measure of variability would be the sum of the deviations of the \(x_{i}\) about the mean, but this sum is zero by ([eq:sum_of_deviations]); these deviations cancel out. To avoid this cancellation, statisticians use the squares \((x_{i} - \overline{x})^{2}\) of the deviations as a measure of variability. More precisely, they compute a statistic called the sample variance \(s_x^2\) defined3 as follows:

    \[s_x^2 = \frac{1}{n - 1}[(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \dots + (x_n - \overline{x})^2] = \frac{1}{n - 1}\sum_{i = 1}^{n} (x_i - \overline{x})^2 \nonumber \]

    The sample variance will be large if there are many \(x_{i}\) at a large distance from the mean \(\overline{x}\), and it will be small if all the \(x_{i}\) are tightly clustered about the mean. The variance is clearly nonnegative (hence the notation \(s_x^2\)), and the square root \(s_{x}\) of the variance is called the sample standard deviation.
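
    Continuing the sketch above, the sample variance and standard deviation of the same sample can be computed directly from this definition; note the division by \(n - 1\).

```python
import math
import statistics

x = [-1, 0, 1, 4, 6]
n = len(x)
mean_x = sum(x) / n

# Sample variance: divide the sum of squared deviations by n - 1
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
std_x = math.sqrt(var_x)                 # sample standard deviation s_x

print(var_x, std_x)                      # 8.5  2.915...
print(statistics.variance(x))            # 8.5 -- the library uses the same n - 1 convention
```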

    The sample mean and variance can be conveniently described using the dot product. Let

    \[\mathbf{1} = \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \end{array} \right] \nonumber \]

    denote the row with every entry equal to \(1\). If \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\), then \(\mathbf{x}\bullet \mathbf{1} = x_{1} + x_{2} + \dots + x_{n}\), so the sample mean is given by the formula

    \[\overline{x} = \frac{1}{n}(\mathbf{x}\bullet \mathbf{1}) \nonumber \]

    Moreover, remembering that \(\overline{x}\) is a scalar, we have \(\overline{x}\mathbf{1} = \left[ \begin{array}{cccc} \overline{x} & \overline{x} & \cdots & \overline{x} \end{array} \right]\), so the centred sample vector \(\mathbf{x}_{c}\) is given by

    \[\mathbf{x}_c = \mathbf{x} - \overline{x}\mathbf{1} = \left[ \begin{array}{cccc} x_1 - \overline{x} & x_2 - \overline{x} & \cdots & x_n - \overline{x} \end{array} \right] \nonumber \]

    Thus we obtain a formula for the sample variance:

    \[s_x^2 = \frac{1}{n - 1} \| \mathbf{x}_c \|^2 = \frac{1}{n - 1} \| \mathbf{x} - \overline{x}\mathbf{1} \|^2 \nonumber \]
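
    These dot-product formulas translate directly into code. In the sketch below, `dot` is a helper name introduced only for this illustration; it plays the role of \(\bullet\).

```python
import math

def dot(u, v):
    """Dot product of two sample vectors of the same length."""
    return sum(ui * vi for ui, vi in zip(u, v))

x = [-1, 0, 1, 4, 6]
n = len(x)
ones = [1] * n                       # the vector 1 = [1 1 ... 1]

mean_x = dot(x, ones) / n            # x-bar = (1/n)(x . 1)
x_c = [xi - mean_x for xi in x]      # x_c = x - x-bar * 1
var_x = dot(x_c, x_c) / (n - 1)      # s_x^2 = ||x_c||^2 / (n - 1)

print(mean_x, var_x, math.sqrt(var_x))   # 2.0  8.5  2.915...
```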

    Linear algebra is also useful for comparing two different samples. To illustrate how, consider two examples.

    The following table represents the number of sick days at work per year and the yearly number of visits to a physician for 10 individuals.

    \[\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{\mbox{Individual}} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \textbf{\mbox{Doctor visits}} & 2 & 6 & 8 & 1 & 5 &10 & 3 & 9 & 7 & 4 \\ \textbf{\mbox{Sick days}} & 2 & 4 & 8 & 3 & 5 & 9 & 4 & 7 & 7 & 2 \\ \hline \end{array} \nonumber \]

    The data are plotted in the scatter diagram where it is evident that, roughly speaking, the more visits to the doctor the more sick days. This is an example of a positive correlation between sick days and doctor visits.

    Now consider the following table representing the daily doses of vitamin C and the number of sick days.

    \[\begin{array}{|c|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{\mbox{Individual}} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \textbf{\mbox{Vitamin C}} & 1 & 5 & 7 & 0 & 4 & 9 & 2 & 8 & 6 & 3 \\ \textbf{\mbox{Sick days}} & 5 & 2 & 2 & 6 & 2 & 1 & 4 & 3 & 2 & 5 \\ \hline \end{array} \nonumber \]

    The scatter diagram is plotted as shown and it appears that the more vitamin C taken, the fewer sick days. In this case there is a negative correlation between daily vitamin C and sick days.

    In both these situations we have paired samples; that is, observations of two variables are made for each of ten individuals: doctor visits and sick days in the first case, and daily vitamin C and sick days in the second. The scatter diagrams point to a relationship between these variables, and there is a way to use the sample to compute a number, called the correlation coefficient, that measures the degree to which the variables are associated.

    To motivate the definition of the correlation coefficient, suppose two paired samples \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\), and \(\mathbf{y} = \left[ \begin{array}{cccc} y_{1} & y_{2} & \cdots & y_{n} \end{array} \right]\) are given and consider the centred samples

    \[\mathbf{x}_c = \left[ \begin{array}{cccc} x_1 - \overline{x} & x_2 - \overline{x} & \cdots & x_n - \overline{x} \end{array} \right] \mbox{ and } \mathbf{y}_c = \left[ \begin{array}{cccc} y_1 - \overline{y} & y_2 - \overline{y} & \cdots & y_n - \overline{y} \end{array} \right] \nonumber \]

    If \(x_{i}\) is large among the sample values \(x_{1}, x_{2}, \dots, x_{n}\), then the deviation \(x_{i} - \overline{x}\) will be positive; and \(x_{i} - \overline{x}\) will be negative if \(x_{i}\) is small among them. The situation is similar for \(\mathbf{y}\), and the following table displays the sign of the quantity \((x_{i} - \overline{x})(y_{i} - \overline{y})\) in all four cases:

    \[\mbox{Sign of }(x_i - \overline{x})(y_i - \overline{y}): \nonumber \]

    \[\begin{array}{|c|c|c|} \hline & x_i \mbox{ large} & x_i \mbox{ small} \\ \hline y_i \mbox{ large} & \mbox{positive} & \mbox{negative} \\ y_i \mbox{ small} & \mbox{negative} & \mbox{positive} \\ \hline \end{array} \nonumber \]

    Intuitively, if \(\mathbf{x}\) and \(\mathbf{y}\) are positively correlated, then two things happen:

    1. Large values of the \(x_{i}\) tend to be associated with large values of the \(y_{i}\), and
    2. Small values of the \(x_{i}\) tend to be associated with small values of the \(y_{i}\).

    It follows from the table that, if \(\mathbf{x}\) and \(\mathbf{y}\) are positively correlated, then the dot product

    \[\mathbf{x}_c\bullet \mathbf{y}_c = \sum_{i = 1}^{n}(x_i - \overline{x})(y_i - \overline{y}) \nonumber \]

    is positive. Similarly \(\mathbf{x}_{c}\bullet \mathbf{y}_{c}\) is negative if \(\mathbf{x}\) and \(\mathbf{y}\) are negatively correlated. With this in mind, the sample correlation coefficient4 \(r\) is defined by

    \[r = r(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}_c\bullet \mathbf{y}_c}{\| \mathbf{x}_c \|\ \| \mathbf{y}_c \|} \nonumber \]

    Bearing the situation in \(\mathbb{R}^3\) in mind, \(r\) is the cosine of the “angle” between the vectors \(\mathbf{x}_{c}\) and \(\mathbf{y}_{c}\), and so we would expect it to lie between \(-1\) and \(1\). Moreover, we would expect \(r\) to be near \(1\) (or \(-1\)) if these vectors were pointing in the same (opposite) direction, that is the “angle” is near zero (or \(\pi\)).

    This is confirmed by the theorem below, and it is also borne out in the examples above. If we compute the correlation between sick days and visits to the physician (in the first scatter diagram above), the result is \(r = 0.90\), as expected. On the other hand, the correlation between daily vitamin C doses and sick days (second scatter diagram) is \(r = -0.85\).
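
    Both values can be checked with a short computation. The sketch below implements the definition of \(r\) directly; the function name `correlation` is chosen only for this illustration.

```python
import math

def correlation(x, y):
    """Sample correlation r = (x_c . y_c) / (||x_c|| ||y_c||)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [xi - mx for xi in x]
    yc = [yi - my for yi in y]
    num = sum(a * b for a, b in zip(xc, yc))
    return num / math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))

doctor_visits = [2, 6, 8, 1, 5, 10, 3, 9, 7, 4]
sick_days     = [2, 4, 8, 3, 5, 9, 4, 7, 7, 2]
vitamin_c     = [1, 5, 7, 0, 4, 9, 2, 8, 6, 3]
sick_days_2   = [5, 2, 2, 6, 2, 1, 4, 3, 2, 5]

print(correlation(doctor_visits, sick_days))  # about  0.90
print(correlation(vitamin_c, sick_days_2))    # about -0.85
```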

    However, a word of caution is in order here. We cannot conclude from the second example that taking more vitamin C will reduce the number of sick days at work. The (negative) correlation may arise because of some third factor that is related to both variables. For example, it may be that less healthy people are inclined to take more vitamin C. Correlation does not imply causation. Similarly, the correlation between sick days and visits to the doctor does not mean that having many sick days causes more visits to the doctor. A correlation between two variables may point to the existence of other underlying factors, but it does not necessarily mean that there is a causality relationship between the variables.

    Our discussion of the dot product in \(\mathbb{R}^n\) provides the basic properties of the correlation coefficient:

    Theorem. Let \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\) and \(\mathbf{y} = \left[ \begin{array}{cccc} y_{1} & y_{2} & \cdots & y_{n} \end{array} \right]\) be (nonzero) paired samples, and let \(r = r(\mathbf{x}, \mathbf{y})\) denote the correlation coefficient. Then:

    1. \(-1 \leq r \leq 1\).
    2. \(r = 1\) if and only if there exist \(a\) and \(b > 0\) such that \(y_i = a + bx_i\) for each \(i\).
    3. \(r = -1\) if and only if there exist \(a\) and \(b < 0\) such that \(y_i = a + bx_i\) for each \(i\).

    Proof. The Cauchy inequality (established earlier for the dot product in \(\mathbb{R}^n\)) proves (1), and also shows that \(r = \pm 1\) if and only if one of \(\mathbf{x}_{c}\) and \(\mathbf{y}_{c}\) is a scalar multiple of the other. This in turn holds if and only if \(\mathbf{y}_{c} = b\mathbf{x}_{c}\) for some \(b \neq 0\), and it is easy to verify that \(r = 1\) when \(b > 0\) and \(r = -1\) when \(b < 0\).

    Finally, \(\mathbf{y}_{c} = b\mathbf{x}_{c}\) means \(y_i - \overline{y} = b(x_i - \overline{x})\) for each \(i\); that is, \(y_{i} = a + bx_{i}\) where \(a = \overline{y} - b\overline{x}\). Conversely, if \(y_{i} = a + bx_{i}\), then \(\overline{y} = a + b\overline{x}\) (verify), so \(y_i - \overline{y} = (a + bx_i) - (a + b\overline{x}) = b(x_i - \overline{x})\) for each \(i\). In other words, \(\mathbf{y}_{c} = b\mathbf{x}_{c}\). This completes the proof.

    Properties (2) and (3) in the theorem show that \(r(\mathbf{x}, \mathbf{y}) = 1\) means that there is a linear relation with positive slope between the paired data (so large \(x\) values are paired with large \(y\) values). Similarly, \(r(\mathbf{x}, \mathbf{y}) = -1\) means that there is a linear relation with negative slope between the paired data (so large \(x\) values are paired with small \(y\) values). This is borne out in the two scatter diagrams above.
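
    Properties (2) and (3) are also easy to observe numerically: pairing a sample with an exact linear function of itself gives \(r = 1\) or \(r = -1\) according to the sign of the slope. The sketch below repeats the ad hoc correlation helper from the earlier sketch for this check.

```python
import math

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [xi - mx for xi in x]
    yc = [yi - my for yi in y]
    num = sum(a * b for a, b in zip(xc, yc))
    return num / math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))

x = [2, 6, 8, 1, 5, 10, 3, 9, 7, 4]

y_up   = [3 + 2 * xi for xi in x]   # y_i = a + b*x_i with b = 2 > 0
y_down = [3 - 2 * xi for xi in x]   # y_i = a + b*x_i with b = -2 < 0

print(correlation(x, y_up))    # 1.0 (up to rounding error)
print(correlation(x, y_down))  # -1.0 (up to rounding error)
```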

    We conclude by using the dot product to derive some useful formulas for computing variances and correlation coefficients. Given samples \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\) and \(\mathbf{y} = \left[ \begin{array}{cccc} y_{1} & y_{2} & \cdots & y_{n} \end{array} \right]\), the key observation is the following formula:

    \[\label{eq:xy_variance_formula} \mathbf{x}_c\bullet \mathbf{y}_c = \mathbf{x}\bullet \mathbf{y} - n\overline{x} \; \overline{y} \]

    Indeed, remembering that \(\overline{x}\) and \(\overline{y}\) are scalars:

    \[\begin{aligned} \mathbf{x}_c\bullet \mathbf{y}_c &= (\mathbf{x} - \overline{x}\mathbf{1})\bullet (\mathbf{y} - \overline{y}\mathbf{1}) \\ &= \mathbf{x}\bullet \mathbf{y} - \mathbf{x}\bullet(\overline{y}\mathbf{1}) - (\overline{x}\mathbf{1})\bullet \mathbf{y} + (\overline{x}\mathbf{1})\bullet(\overline{y}\mathbf{1}) \\ &= \mathbf{x}\bullet \mathbf{y} - \overline{y}(\mathbf{x}\bullet \mathbf{1}) - \overline{x}(\mathbf{1}\bullet \mathbf{y}) + \overline{x} \; \overline{y}(\mathbf{1}\bullet \mathbf{1}) \\ &= \mathbf{x}\bullet \mathbf{y} - \overline{y}(n\overline{x}) - \overline{x}(n\overline{y}) + \overline{x} \; \overline{y}(n) \\ &= \mathbf{x}\bullet \mathbf{y} - n\overline{x} \; \overline{y}\end{aligned} \nonumber \]

    Taking \(\mathbf{y} = \mathbf{x}\) in ([eq:xy_variance_formula]) gives a formula for the variance \(s_x^2 = \frac{1}{n - 1} \| \mathbf{x}_c \|^2\) of \(\mathbf{x}\).

    Variance Formula. If \(\mathbf{x}\) is a sample vector, then \(s_x^2 = \frac{1}{n - 1} \left( \| \mathbf{x} \|^2 - n\overline{x}^2 \right)\).
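
    The shortcut is easy to check against the definition on a small sample; both computations below return the same variance.

```python
x = [-1, 0, 1, 4, 6]
n = len(x)
mean_x = sum(x) / n

# Definition: (1/(n-1)) * sum of squared deviations
var_def = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)

# Variance formula: (1/(n-1)) * (||x||^2 - n * mean^2)
var_formula = (sum(xi * xi for xi in x) - n * mean_x ** 2) / (n - 1)

print(var_def, var_formula)   # 8.5  8.5
```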

    We also get a convenient formula for the correlation coefficient \(r = r(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}_c\bullet \mathbf{y}_c}{\| \mathbf{x}_c \|\ \| \mathbf{y}_c \|}\). Indeed, ([eq:xy_variance_formula]) together with the facts that \(s_x^2 = \frac{1}{n - 1} \| \mathbf{x}_c \|^2\) and \(s_y^2 = \frac{1}{n - 1} \| \mathbf{y}_c \|^2\) gives:

    Correlation Formula. If \(\mathbf{x}\) and \(\mathbf{y}\) are sample vectors, then

    \[r = r(\mathbf{x}, \mathbf{y}) = \dfrac{\mathbf{x}\bullet \mathbf{y} - n \overline{x} \; \overline{y}}{(n - 1)s_xs_y} \nonumber \]
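
    As a check, the sketch below computes \(r\) for the doctor-visits data in two ways: from the definition using centred vectors, and from the correlation formula using only \(\mathbf{x}\bullet \mathbf{y}\), the means, and the standard deviations.

```python
import math

x = [2, 6, 8, 1, 5, 10, 3, 9, 7, 4]   # doctor visits
y = [2, 4, 8, 3, 5, 9, 4, 7, 7, 2]    # sick days
n = len(x)

mx, my = sum(x) / n, sum(y) / n
xc = [xi - mx for xi in x]
yc = [yi - my for yi in y]

# Definition: r = (x_c . y_c) / (||x_c|| ||y_c||)
r_def = sum(a * b for a, b in zip(xc, yc)) / math.sqrt(
    sum(a * a for a in xc) * sum(b * b for b in yc))

# Correlation formula: r = (x . y - n*mx*my) / ((n - 1) * s_x * s_y)
sx = math.sqrt(sum(a * a for a in xc) / (n - 1))
sy = math.sqrt(sum(b * b for b in yc) / (n - 1))
r_formula = (sum(a * b for a, b in zip(x, y)) - n * mx * my) / ((n - 1) * sx * sy)

print(r_def, r_formula)   # both about 0.90
```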

    Finally, we give a method that simplifies the computations of variances and correlations.

    Data Scaling. Let \(\mathbf{x} = \left[ \begin{array}{cccc} x_{1} & x_{2} & \cdots & x_{n} \end{array} \right]\) and \(\mathbf{y} = \left[ \begin{array}{cccc} y_{1} & y_{2} & \cdots & y_{n} \end{array} \right]\) be sample vectors. Given constants \(a\), \(b\), \(c\), and \(d\), consider new samples \(\mathbf{z} = \left[ \begin{array}{cccc} z_{1} & z_{2} & \cdots & z_{n} \end{array} \right]\) and \(\mathbf{w} = \left[ \begin{array}{cccc} w_{1} & w_{2} & \cdots & w_{n} \end{array} \right]\), where \(z_{i} = a + bx_{i}\) and \(w_{i} = c + dy_{i}\) for each \(i\). Then:

    a. \(\overline{z} = a + b\overline{x}\)
    b. \(s_z^2 = b^2s_x^2\), so \(s_z = |b|s_x\)
    c. If \(b\) and \(d\) have the same sign, then \(r(\mathbf{x}, \mathbf{y}) = r(\mathbf{z}, \mathbf{w})\).

    The verification is left as an exercise. For example, if \(\mathbf{x} = \left[ \begin{array}{cccccc} 101 & 98 & 103 & 99 & 100 & 97 \end{array} \right]\), subtracting \(100\) yields \(\mathbf{z} = \left[ \begin{array}{cccccc} 1 & -2 & 3 & -1 & 0 & -3 \end{array} \right]\). A routine calculation shows that \(\overline{z} = -\frac{1}{3}\) and \(s_z^2 = \frac{14}{3}\), so \(\overline{x} = 100 + \overline{z} = 100 - \frac{1}{3} \approx 99.67\) and \(s_x^2 = s_z^2 = \frac{14}{3} \approx 4.67\).
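
    The arithmetic in this example is easy to reproduce; the sketch below recovers \(\overline{z}\) and \(s_z^2\), and hence \(\overline{x}\) and \(s_x^2\).

```python
x = [101, 98, 103, 99, 100, 97]
z = [xi - 100 for xi in x]          # scaled sample z_i = a + b*x_i with a = -100, b = 1
n = len(x)

mean_z = sum(z) / n
var_z = sum((zi - mean_z) ** 2 for zi in z) / (n - 1)

mean_x = 100 + mean_z               # property (a): x-bar = 100 + z-bar
var_x = var_z                       # property (b) with b = 1: s_x^2 = s_z^2

print(mean_z, var_z)                # -0.333...  4.666...
print(mean_x, var_x)                # 99.666...  4.666...
```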

    Exercises

    The following table gives IQ scores for 10 fathers and their eldest sons. Calculate the means, the variances, and the correlation coefficient \(r\). (The data scaling formula is useful.)

    \[\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \textbf{\mbox{Father's IQ}} & 140 & 131 & 120 & 115 & 110 & 106 & 100 & 95 & 91 & 86 \\ \textbf{\mbox{Son's IQ}} & 130 & 138 & 110 & 99 & 109 & 120 & 105 & 99 & 100 & 94 \\ \hline \end{array} \nonumber \]

    The following table gives the number of years of education and the annual income (in thousands) of 10 individuals. Find the means, the variances, and the correlation coefficient. (Again the data scaling formula is useful.)

    \[\begin{array}{|l|c|c|c|c|c|c|c|c|c|c|} \hline \textbf{\mbox{Individual}} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \textbf{\mbox{Years of education}} & 12 & 16 & 13 & 18 & 19 & 12 & 18 & 19 & 12 & 14 \\ \textbf{\mbox{Yearly income}} & 31 & 48 & 35 & 28 & 55 & 40 & 39 & 60 & 32 & 35 \\ \textbf{\mbox{(1000's)}} &&&&&&&&&& \\ \hline \end{array} \nonumber \]

    Let \(X\) denote the number of years of education, and let \(Y\) denote the yearly income (in 1000’s). Then \(\overline{x} = 15.3\), \(s_x^2 = 9.12\) and \(s_{x} = 3.02\), while \(\overline{y} = 40.3\), \(s_y^2 = 114.23\) and \(s_{y} = 10.69\). The correlation is \(r(X, Y) = 0.599\).
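
    These values can be reproduced with a few lines of Python, using only the definitions of this section (a sketch, not part of the exercise itself):

```python
import math

education = [12, 16, 13, 18, 19, 12, 18, 19, 12, 14]   # years of education, X
income    = [31, 48, 35, 28, 55, 40, 39, 60, 32, 35]   # yearly income (1000's), Y
n = len(education)

mx, my = sum(education) / n, sum(income) / n
sx2 = sum((v - mx) ** 2 for v in education) / (n - 1)
sy2 = sum((v - my) ** 2 for v in income) / (n - 1)
num = sum((a - mx) * (b - my) for a, b in zip(education, income))
r = num / ((n - 1) * math.sqrt(sx2) * math.sqrt(sy2))

print(mx, sx2, math.sqrt(sx2))   # 15.3, about 9.12, about 3.02
print(my, sy2, math.sqrt(sy2))   # 40.3, about 114.23, about 10.69
print(r)                         # about 0.599
```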

    If \(\mathbf{x}\) is a sample vector, and \(\mathbf{x}_{c}\) is the centred sample, show that \(\overline{x}_{c} = 0\) and the standard deviation of \(\mathbf{x}_{c}\) is \(s_{x}\).

    Prove the data scaling formulas (a), (b), and (c) above.

    b. Given the sample vector \(\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right]\), let \(\mathbf{z} = \left[ \begin{array}{c} z_1 \\ z_2 \\ \vdots \\ z_n \end{array} \right]\) where \(z_{i} = a + bx_{i}\) for each \(i\). By (a) we have \(\overline{z} = a + b\overline{x}\), so

      \[\begin{aligned} s_z^2 &= \frac{1}{n - 1}\sum_{i}(z_i - \overline{z})^2 \\ &= \frac{1}{n - 1}\sum_{i}[(a + bx_i) - (a + b\overline{x})]^2 \\ &= \frac{1}{n - 1}\sum_{i}b^2(x_i - \overline{x})^2 \\ &= b^2s_x^2.\end{aligned} \nonumber \]
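
    The identities being proved can also be confirmed numerically: the scaled sample has mean \(a + b\overline{x}\) and variance \(b^2s_x^2\), and the correlation is unchanged when \(b\) and \(d\) have the same sign. The sketch below checks this for one arbitrary choice of constants; the helper names are introduced only for the check.

```python
import math

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x = [2, 6, 8, 1, 5, 10, 3, 9, 7, 4]
y = [2, 4, 8, 3, 5, 9, 4, 7, 7, 2]
a, b, c, d = 7.0, -3.0, -1.0, -0.5      # b and d have the same sign

z = [a + b * xi for xi in x]
w = [c + d * yi for yi in y]

print(mean(z), a + b * mean(x))              # (a): z-bar = a + b * x-bar
print(variance(z), b ** 2 * variance(x))     # (b): s_z^2 = b^2 * s_x^2
print(correlation(x, y), correlation(z, w))  # (c): r(x, y) = r(z, w)
```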


    1. We write vectors in \(\mathbb{R}^n\) as row matrices, for convenience.↩
    2. The mean is often called the “average” of the sample values \(x_{i}\), but statisticians use the term “mean”.↩
    3. Since there are \(n\) sample values, it seems more natural to divide by \(n\) here, rather than by \(n - 1\). The reason for using \(n - 1\) is that then the sample variance \(s_x^2\) provides a better estimate of the variance of the entire population from which the sample was drawn.↩
    4. The idea of using a single number to measure the degree of relationship between different variables was pioneered by Francis Galton (1822–1911). He was studying the degree to which characteristics of an offspring relate to those of its parents. The idea was refined by Karl Pearson (1857–1936) and \(r\) is often referred to as the Pearson correlation coefficient.↩

    This page titled 5.7: An Application to Correlation and Variance is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by W. Keith Nicholson (Lyryx Learning Inc.) via source content that was edited to the style and standards of the LibreTexts platform.