
5.6: Best Approximation and Least Squares


    Often an exact solution to a problem in applied mathematics is difficult to obtain. However, it is usually just as useful to find arbitrarily close approximations to a solution. In particular, finding “linear approximations” is a potent technique in applied mathematics. One basic case is the situation where a system of linear equations has no solution, and it is desirable to find a “best approximation” to a solution to the system. In this section best approximations are defined and a method for finding them is described. The result is then applied to “least squares” approximation of data.

    Suppose \(A\) is an \(m \times n\) matrix and \(\mathbf{b}\) is a column in \(\mathbb{R}^m\), and consider the system

    \[A\mathbf{x} = \mathbf{b} \nonumber \]

    of \(m\) linear equations in \(n\) variables. This need not have a solution. However, given any column \(\mathbf{z} \in \mathbb{R}^n\), the distance \(\|\mathbf{b} - A\mathbf{z}\|\) is a measure of how far \(A\mathbf{z}\) is from \(\mathbf{b}\). Hence it is natural to ask whether there is a column \(\mathbf{z}\) in \(\mathbb{R}^n\) that is as close as possible to a solution in the sense that

    \[\| \mathbf{b} - A\mathbf{z} \| \nonumber \]

    is the minimum value of \(\|\mathbf{b} - A\mathbf{x}\|\) as \(\mathbf{x}\) ranges over all columns in \(\mathbb{R}^n\).

    The answer is “yes”, and to describe it define

    \[U = \{A\mathbf{x} \mid \mathbf{x} \mbox{ lies in } \mathbb{R}^n \} \nonumber \]

    This is a subspace of \(\mathbb{R}^m\) (verify) and we want a vector \(A\mathbf{z}\) in \(U\) as close as possible to \(\mathbf{b}\). That there is such a vector is clear geometrically if \(m = 3\) by the diagram. In general such a vector \(A\mathbf{z}\) exists by a general result called the projection theorem that will be proved in Chapter [chap:8] (Theorem [thm:023885]). Moreover, the projection theorem gives a simple way to compute \(\mathbf{z}\) because it also shows that the vector \(\mathbf{b} - A\mathbf{z}\) is orthogonal to every vector \(A\mathbf{x}\) in \(U\). Thus, for all \(\mathbf{x}\) in \(\mathbb{R}^n\),

    \[\begin{aligned} 0 = (A\mathbf{x})\bullet (\mathbf{b} - A\mathbf{z}) = (A\mathbf{x})^T(\mathbf{b} - A\mathbf{z}) &= \mathbf{x}^TA^T(\mathbf{b} - A\mathbf{z}) \\ &= \mathbf{x}\bullet [A^T(\mathbf{b} - A\mathbf{z})]\end{aligned} \nonumber \]

    In other words, the vector \(A^{T}(\mathbf{b} - A\mathbf{z})\) in \(\mathbb{R}^n\) is orthogonal to every vector in \(\mathbb{R}^n\) and so must be zero (being orthogonal to itself). Hence \(\mathbf{z}\) satisfies

    \[(A^TA)\mathbf{z} = A^T\mathbf{b} \nonumber \]

    Normal Equations

    This is a system of linear equations called the normal equations for \(\mathbf{z}\).

    Note that this system can have more than one solution (see Exercise [ex:5_6_5]). However, the \(n \times n\) matrix \(A^{T}A\) is invertible if (and only if) the columns of \(A\) are linearly independent (Theorem [thm:015672]); so, in this case, \(\mathbf{z}\) is uniquely determined and is given explicitly by \(\mathbf{z} = (A^{T}A)^{-1}A^{T}\mathbf{b}\). However, the most efficient way to find \(\mathbf{z}\) is to apply gaussian elimination to the normal equations.
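
    The computation is easy to carry out by machine. Here is a minimal NumPy sketch (an addition to the text, assuming the columns of \(A\) are independent so that \(A^TA\) is invertible); np.linalg.solve plays the role of gaussian elimination on the normal equations and avoids forming the inverse explicitly.

```python
import numpy as np

def best_approximation(A, b):
    """Return z minimizing ||b - A z|| by solving the normal equations."""
    AtA = A.T @ A              # the n x n coefficient matrix A^T A
    Atb = A.T @ b              # the right-hand side A^T b
    return np.linalg.solve(AtA, Atb)   # gaussian elimination, no explicit inverse

# Sanity check: for a consistent square system this is the exact solution.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(best_approximation(A, b))        # same as np.linalg.solve(A, b)
```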

    This discussion is summarized in the following theorem.

    Theorem [thm:016733] (Best Approximation Theorem)

    Let \(A\) be an \(m \times n\) matrix, let \(\mathbf{b}\) be any column in \(\mathbb{R}^m\), and consider the system

    \[A\mathbf{x} = \mathbf{b} \nonumber \]

    of \(m\) equations in \(n\) variables.

    1. Any solution \(\mathbf{z}\) to the normal equations

      \[(A^TA)\mathbf{z} = A^T\mathbf{b} \nonumber \]

      is a best approximation to a solution of \(A\mathbf{x} = \mathbf{b}\) in the sense that \(\| \mathbf{b} - A\mathbf{z} \|\) is the minimum value of \(\| \mathbf{b} - A\mathbf{x} \|\) as \(\mathbf{x}\) ranges over all columns in \(\mathbb{R}^n\).

    2. If the columns of \(A\) are linearly independent, then \(A^{T}A\) is invertible and \(\mathbf{z}\) is given uniquely by \(\mathbf{z} = (A^{T}A)^{-1}A^{T}\mathbf{b}\).

    We note in passing that if \(A\) is \(n \times n\) and invertible, then

    \[\mathbf{z} = (A^TA)^{-1} A^T\mathbf{b} = A^{-1}\mathbf{b} \nonumber \]

    is the solution to the system of equations, and \(\|\mathbf{b} - A\mathbf{z}\| = 0\). Hence if \(A\) has independent columns, then \((A^{T}A)^{-1}A^{T}\) is playing the role of the inverse of the nonsquare matrix \(A\). The matrix \(A^{T}(AA^{T})^{-1}\) plays a similar role when the rows of \(A\) are linearly independent. These are both special cases of the generalized inverse of a matrix \(A\) (see Exercise [ex:5_6_14]). However, we shall not pursue this topic here.
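
    As a numerical aside (not from the text), the following NumPy lines illustrate the remark: for a matrix with independent columns, \((A^TA)^{-1}A^T\) is a left inverse of \(A\) and coincides with the Moore–Penrose pseudoinverse computed by np.linalg.pinv.

```python
import numpy as np

# A 3 x 2 matrix with linearly independent columns.
A = np.array([[3.0, -1.0],
              [1.0,  2.0],
              [2.0,  1.0]])

left_inverse = np.linalg.inv(A.T @ A) @ A.T           # (A^T A)^{-1} A^T
print(np.round(left_inverse @ A, 10))                 # the 2 x 2 identity (up to rounding)
print(np.allclose(left_inverse, np.linalg.pinv(A)))   # True: agrees with the pseudoinverse
```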

    Example

    The system of linear equations

    \[ \begin{array}{rlrcr} 3x & - & y & = & 4 \\ x & + & 2y & = & 0 \\ 2x & + & y & = & 1 \end{array} \nonumber \]

    has no solution. Find the vector \(\mathbf{z} = \left[ \begin{array}{r} x_0 \\ y_0 \end{array} \right]\) that best approximates a solution.

    In this case,

    \[A = \left[ \begin{array}{rr} 3 & -1 \\ 1 & 2 \\ 2 & 1 \end{array} \right] \mbox{, so } A^TA = \left[ \begin{array}{rrr} 3 & 1 & 2 \\ -1 & 2 & 1 \end{array} \right] \left[ \begin{array}{rr} 3 & -1 \\ 1 & 2 \\ 2 & 1 \end{array} \right] = \left[ \begin{array}{rr} 14 & 1 \\ 1 & 6 \end{array} \right] \nonumber \]

    is invertible. The normal equations \((A^{T}A)\mathbf{z} = A^{T}\mathbf{b}\) are

    \[\left[ \begin{array}{rr} 14 & 1 \\ 1 & 6 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 14 \\ -3 \end{array} \right] \mbox{, so } \mathbf{z} = \frac{1}{83} \left[ \begin{array}{r} 87 \\ -56 \end{array} \right] \nonumber \]

    Thus \(x_0 = \frac{87}{83}\) and \(y_0 = \frac{-56}{83}\). With these values of \(x\) and \(y\), the left sides of the equations are, approximately,

    \[ \def\arraystretch{1.5} \begin{array}{rlrcrcr} 3x_0 & - & y_0 & = & \frac{317}{83} & = & 3.82 \\ x_0 & + & 2y_0 & = & \frac{-25}{83} & = & -0.30 \\ 2x_0 & + & y_0 & = & \frac{118}{83} & = & 1.42 \end{array} \nonumber \]

    This is as close as possible to a solution.
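
    Readers who wish to check this example by machine can do so with a few lines of NumPy (an addition, not part of the text); the script reproduces \(\mathbf{z} = \frac{1}{83}\left[ \begin{array}{r} 87 \\ -56 \end{array} \right]\) and the approximate left-hand sides above.

```python
import numpy as np

A = np.array([[3.0, -1.0],
              [1.0,  2.0],
              [2.0,  1.0]])
b = np.array([4.0, 0.0, 1.0])

z = np.linalg.solve(A.T @ A, A.T @ b)
print(np.round(83 * z, 6))   # [ 87. -56.], i.e. z = (87/83, -56/83)
print(np.round(A @ z, 2))    # [ 3.82 -0.3   1.42]

# Spot check that z does at least as well as nearby points.
rng = np.random.default_rng(0)
nearby = z + rng.normal(scale=0.5, size=(100, 2))
residuals = np.linalg.norm(b[:, None] - A @ nearby.T, axis=0)
print(np.linalg.norm(b - A @ z) <= residuals.min())   # True
```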

    Example

    The average number \(g\) of goals per game scored by a hockey player seems to be related linearly to two factors: the number \(x_{1}\) of years of experience and the number \(x_{2}\) of goals in the preceding 10 games. The following data were collected on four players. Find the linear function \(g = r_{0} + r_{1}x_{1} + r_{2}x_{2}\) that best fits these data.

    \[\begin{array}{|c|c|c|} \hline g & x_1 & x_2 \\ \hline 0.8 & 5 & 3 \\ 0.8 & 3 & 4 \\ 0.6 & 1 & 5 \\ 0.4 & 2 & 1 \\ \hline \end{array} \nonumber \]

    If the relationship is given by \(g = r_{0} + r_{1}x_{1} + r_{2}x_{2}\), then the data can be described as follows:

    \[\left[ \begin{array}{rrr} 1 & 5 & 3 \\ 1 & 3 & 4 \\ 1 & 1 & 5 \\ 1 & 2 & 1 \end{array} \right] \left[ \begin{array}{r} r_0 \\ r_1 \\ r_2 \end{array} \right] = \left[ \begin{array}{r} 0.8 \\ 0.8 \\ 0.6 \\ 0.4 \end{array} \right] \nonumber \]

    Using the notation in Theorem [thm:016733], we get

    \[\begin{aligned} \mathbf{z} &= (A^TA)^{-1}A^T\mathbf{b} \\ &= \frac{1}{42} \left[ \begin{array}{rrr} 119 & -17 & -19 \\ -17 & 5 & 1 \\ -19 & 1 & 5 \end{array} \right] \left[ \begin{array}{rrrr} 1 & 1 & 1 & 1 \\ 5 & 3 & 1 & 2 \\ 3 & 4 & 5 & 1 \end{array} \right] \left[ \begin{array}{r} 0.8 \\ 0.8 \\ 0.6 \\ 0.4 \end{array} \right] = \left[ \begin{array}{r} 0.14 \\ 0.09 \\ 0.08 \end{array} \right]\end{aligned} \nonumber \]

    Hence the best-fitting function is \(g = 0.14 + 0.09x_{1} + 0.08x_{2}\). The amount of computation would have been reduced if the normal equations had been constructed and then solved by gaussian elimination.
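
    The same computation in NumPy (again an addition to the text) takes only a few lines and, as suggested above, solves the normal equations directly rather than forming \((A^TA)^{-1}\).

```python
import numpy as np

# One row (1, x1, x2) per player, in the order of the table.
A = np.array([[1, 5, 3],
              [1, 3, 4],
              [1, 1, 5],
              [1, 2, 1]], dtype=float)
g = np.array([0.8, 0.8, 0.6, 0.4])

z = np.linalg.solve(A.T @ A, A.T @ g)   # normal equations by gaussian elimination
print(np.round(z, 2))                   # [0.14 0.09 0.08]
```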

    Least Squares Approximation

    In many scientific investigations, data are collected that relate two variables. For example, if \(x\) is the number of dollars spent on advertising by a manufacturer and \(y\) is the value of sales in the region in question, the manufacturer could generate data by spending \(x_{1}, x_{2}, \dots, x_{n}\) dollars at different times and measuring the corresponding sales values \(y_{1}, y_{2}, \dots, y_{n}\).

    Suppose it is known that a linear relationship exists between the variables \(x\) and \(y\)—in other words, that \(y = a + bx\) for some constants \(a\) and \(b\). If the data are plotted, the points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) may appear to lie on a straight line and estimating \(a\) and \(b\) requires finding the “best-fitting” line through these data points. For example, if five data points occur as shown in the diagram, line 1 is clearly a better fit than line 2. In general, the problem is to find the values of the constants \(a\) and \(b\) such that the line \(y = a + bx\) best approximates the data in question. Note that an exact fit would be obtained if \(a\) and \(b\) were such that \(y_{i} = a + bx_{i}\) were true for each data point \((x_{i}, y_{i})\). But this is too much to expect. Experimental errors in measurement are bound to occur, so the choice of \(a\) and \(b\) should be made in such a way that the errors between the observed values \(y_{i}\) and the corresponding fitted values \(a + bx_{i}\) are in some sense minimized. Least squares approximation is a way to do this.

    The first thing we must do is explain exactly what we mean by the best fit of a line \(y = a + bx\) to an observed set of data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\). For convenience, write the line in function notation, with \(r_{0}\) and \(r_{1}\) in place of \(a\) and \(b\):

    \[f(x) = r_0 +r_1x \nonumber \]

    so that the fitted points (on the line) have coordinates \((x_{1}, f(x_{1})), \dots, (x_{n}, f(x_{n}))\).

    The second diagram is a sketch of what the line \(y = f(x)\) might look like. For each \(i\) the observed data point \((x_{i}, y_{i})\) and the fitted point \((x_{i}, f(x_{i}))\) need not be the same, and the distance \(d_{i}\) between them measures how far the line misses the observed point. For this reason \(d_{i}\) is often called the error at \(x_{i}\), and a natural measure of how close the line \(y = f(x)\) is to the observed data points is the sum \(d_{1} + d_{2} + \dots + d_{n}\) of all these errors. However, it turns out to be better to use the sum of squares

    \[S = d_1^2 + d_2^2 + \dotsb + d_n^2 \nonumber \]

    as the measure of error, and the line \(y = f(x)\) is to be chosen so as to make this sum as small as possible. This line is said to be the least squares approximating line for the data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\).

    The square of the error \(d_i\) is given by \(d_i^2 = [y_i - f(x_i)]^2\) for each \(i\), so the quantity \(S\) to be minimized is the sum:

    \[S = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 \nonumber \]

    Note that all the numbers \(x_{i}\) and \(y_{i}\) are given here; what is required is that the function \(f\) be chosen in such a way as to minimize \(S\). Because \(f(x) = r_{0} + r_{1}x\), this amounts to choosing \(r_{0}\) and \(r_{1}\) to minimize \(S\). This problem can be solved using Theorem [thm:016733]. The following notation is convenient.

    \[\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] \ \mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad \mbox{ and } \quad f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] = \left[ \begin{array}{c} r_0 + r_1x_1 \\ r_0 + r_1x_2 \\ \vdots \\ r_0 + r_1x_n \\ \end{array} \right] \nonumber \]

    Then the problem takes the following form: Choose \(r_{0}\) and \(r_{1}\) such that

    \[S = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 = \| \mathbf{y} - f(\mathbf{x}) \| ^2 \nonumber \]

    is as small as possible. Now write

    \[M = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right] \quad \mbox{ and } \quad \mathbf{r} = \left[ \begin{array}{r} r_0 \\ r_1 \end{array} \right] \nonumber \]

    Then \(M\mathbf{r} = f(\mathbf{x})\), so we are looking for a column \(\mathbf{r} = \left[ \begin{array}{r} r_0 \\ r_1 \end{array} \right]\) such that \(\|\mathbf{y} - M\mathbf{r}\|^{2}\) is as small as possible. In other words, we are looking for a best approximation \(\mathbf{z}\) to the system \(M\mathbf{r} = \mathbf{y}\). Hence Theorem [thm:016733] applies directly, and we have

    Theorem [thm:016878]

    Suppose that \(n\) data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) are given, where at least two of \(x_{1}, x_{2}, \dots, x_{n}\) are distinct. Put

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \ M = \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{array} \right] \nonumber \]

    Then the least squares approximating line for these data points has equation

    \[y = z_0 +z_1x \nonumber \]

    where \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right]\) is found by gaussian elimination from the normal equations

    \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

    The condition that at least two of \(x_{1}, x_{2}, \dots, x_{n}\) are distinct ensures that \(M^{T}M\) is an invertible matrix, so \(\mathbf{z}\) is unique:

    \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} \nonumber \]
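
    A minimal NumPy sketch of this recipe follows (an addition to the text; the helper name least_squares_line is ours): build \(M\) from the \(x_i\), then solve the normal equations for \(z_0\) and \(z_1\).

```python
import numpy as np

def least_squares_line(x, y):
    """Return (z0, z1) so that y = z0 + z1*x is the least squares line."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    M = np.column_stack([np.ones_like(x), x])    # columns: 1 and x_i
    return np.linalg.solve(M.T @ M, M.T @ y)     # normal equations

# Collinear data are fitted exactly: y = 1 + 2x.
print(least_squares_line([0, 1, 2], [1, 3, 5]))  # [1. 2.]
```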

    Example

    Let data points \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{5}, y_{5})\) be given as in the accompanying table. Find the least squares approximating line for these data.

    \[\begin{array}{|c|c|} \hline x & y \\ \hline 1 & 1 \\ 3 & 2 \\ 4 & 3 \\ 6 & 4 \\ 7 & 5 \\ \hline \end{array} \nonumber \]

    In this case we have

    \[\begin{aligned} M^TM &= \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_5 \end{array} \right] \left[ \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_5 \end{array} \right] \\ &= \left[ \begin{array}{cc} 5 & x_1 + \dotsb + x_5 \\ x_1 + \dotsb + x_5 & x_1^2 + \dotsb + x_5^2 \\ \end{array} \right] = \left[ \begin{array}{rr} 5 & 21 \\ 21 & 111 \end{array} \right] \\ \mbox{and } M^T\mathbf{y} &= \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_5 \end{array} \right] \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_5 \end{array} \right] \\ &= \left[ \begin{array}{c} y_1 + y_2 + \dotsb + y_5 \\ x_1y_1 + x_2y_2 + \dotsb + x_5y_5 \end{array} \right] = \left[ \begin{array}{r} 15 \\ 78 \end{array} \right]\end{aligned} \nonumber \]

    so the normal equations \((M^{T}M)\mathbf{z} = M^{T}\mathbf{y}\) for \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right]\) become

    \[\left[ \begin{array}{rr} 5 & 21 \\ 21 & 111 \end{array} \right] \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right] = \left[ \begin{array}{r} 15 \\ 78 \end{array} \right] \nonumber \]

    The solution (using gaussian elimination) is \(\mathbf{z} = \left[ \begin{array}{r} z_0 \\ z_1 \end{array} \right] =\left[ \begin{array}{r} 0.24 \\ 0.66 \end{array} \right]\) to two decimal places, so the least squares approximating line for these data is \(y = 0.24 + 0.66x\). Note that \(M^{T}M\) is indeed invertible here (the determinant is \(114\)), and the exact solution is

    \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} = \frac{1}{114} \left[ \begin{array}{rr} 111 & -21 \\ -21 & 5 \end{array} \right] \left[ \begin{array}{r} 15 \\ 78 \end{array} \right] = \frac{1}{114} \left[ \begin{array}{r} 27 \\ 75 \end{array} \right] = \frac{1}{38} \left[ \begin{array}{r} 9 \\ 25 \end{array} \right] \nonumber \]
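
    These numbers are easy to reproduce in NumPy (an addition to the text); np.linalg.lstsq, which uses a different algorithm internally, returns the same solution and serves as an independent check.

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 6.0, 7.0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = np.column_stack([np.ones_like(x), x])

z = np.linalg.solve(M.T @ M, M.T @ y)
print(np.round(38 * z, 6))                     # [ 9. 25.], i.e. z = (9/38, 25/38)
print(np.linalg.lstsq(M, y, rcond=None)[0])    # same solution, about [0.237 0.658]
```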

    Least Squares Approximating Polynomials

    Suppose now that, rather than a straight line, we want to find a polynomial

    \[y = f(x) = r_0 + r_1x + r_2x^2 + \dots + r_mx^m \nonumber \]

    of degree \(m\) that best approximates the data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\). As before, write

    \[\mathbf{x} = \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] \ \mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad \mbox{ and } \quad f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] \nonumber \]

    For each \(x_{i}\) we have two values of the variable \(y\), the observed value \(y_{i}\), and the computed value \(f(x_{i})\). The problem is to choose \(f(x)\)—that is, choose \(r_{0}, r_{1}, \dots, r_{m}\) —such that the \(f(x_{i})\) are as close as possible to the \(y_{i}\). Again we define “as close as possible” by the least squares condition: We choose the \(r_{i}\) such that

    \[\| \mathbf{y} - f(\mathbf{x}) \| ^2 = [y_1 - f(x_1)]^2 + [y_2 - f(x_2)]^2 + \dotsb + [y_n - f(x_n)]^2 \nonumber \]

    is as small as possible.

    Least Squares Approximating Polynomial

    A polynomial \(f(x)\) satisfying this condition is called a least squares approximating polynomial of degree \(m\) for the given data pairs.

    If we write

    \[M = \def\arraystretch{1.3} \left[ \begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \\ \end{array} \right] \quad \mbox{ and } \quad \mathbf{r} = \left[ \begin{array}{c} r_0 \\ r_1 \\ \vdots \\ r_m \end{array} \right] \nonumber \]

    we see that \(f(\mathbf{x}) = M\mathbf{r}\). Hence we want to find \(\mathbf{r}\) such that \(\|\mathbf{y} - M\mathbf{r}\|^{2}\) is as small as possible; that is, we want a best approximation \(\mathbf{z}\) to the system \(M\mathbf{r} = \mathbf{y}\). Theorem [thm:016733] gives the first part of Theorem [thm:016951].

    Theorem [thm:016951]

    Let \(n\) data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) be given, and write

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \ M = \def\arraystretch{1.3} \left[ \begin{array}{ccccc} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \\ \end{array} \right] \ \mathbf{z} = \left[ \begin{array}{c} z_0 \\ z_1 \\ \vdots \\ z_m \end{array} \right] \nonumber \]

    1. If \(\mathbf{z}\) is any solution to the normal equations

      \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

      then the polynomial

      \[z_0 + z_1x + z_2x^2 + \dots + z_mx^m \nonumber \]

      is a least squares approximating polynomial of degree \(m\) for the given data pairs.

    2. If at least \(m + 1\) of the numbers \(x_{1}, x_{2}, \dots, x_{n}\) are distinct (so \(n \geq m + 1\)), the matrix \(M^{T}M\) is invertible and \(\mathbf{z}\) is uniquely determined by

      \[\mathbf{z} = (M^TM)^{-1}M^T\mathbf{y} \nonumber \]

    It remains to prove (2), and for that we show that the columns of \(M\) are linearly independent (Theorem [thm:015672]). Suppose a linear combination of the columns vanishes:

    \[r_0 \left[ \begin{array}{c} 1 \\ 1 \\ \vdots \\ 1 \end{array} \right] + r_1 \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right] + \dots + r_m \left[ \begin{array}{c} x_1^m \\ x_2^m \\ \vdots \\ x_n^m \end{array} \right] = \left[ \begin{array}{c} 0 \\ 0 \\ \vdots \\ 0 \\ \end{array} \right] \nonumber \]

    If we write \(q(x) = r_{0} + r_{1}x + \dots + r_{m}x^{m}\), comparing entries of these columns shows that

    \[q(x_{1}) = q(x_{2}) = \dots = q(x_{n}) = 0 \nonumber \]

    Hence \(q(x)\) is a polynomial of degree \(m\) with at least \(m + 1\) distinct roots, so \(q(x)\) must be the zero polynomial (see Appendix [chap:appdpolynomials] or Theorem [thm:020203]). Thus \(r_{0} = r_{1} = \dots = r_{m} = 0\) as required.

    Example

    Find the least squares approximating quadratic \(y = z_{0} + z_{1}x + z_{2}x^{2}\) for the following data points.

    \[(-3, 3), (-1, 1), (0, 1), (1, 2), (3, 4) \nonumber \]

    This is an instance of Theorem [thm:016951] with \(m = 2\). Here

    \[\mathbf{y} = \left[ \begin{array}{r} 3\\ 1\\ 1\\ 2\\ 4 \end{array} \right] \ M = \left[ \begin{array}{rrr} 1 & -3 & 9 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 3 & 9 \end{array} \right] \nonumber \]

    Hence,

    \[M^TM = \left[ \begin{array}{rrrrr} 1 & 1 & 1 & 1 & 1 \\ -3 & -1 & 0 & 1 & 3 \\ 9 & 1 & 0 & 1 & 9 \end{array} \right] \left[ \begin{array}{rrr} 1 & -3 & 9 \\ 1 & -1 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 3 & 9 \end{array} \right] = \left[ \begin{array}{rrr} 5 & 0 & 20 \\ 0 & 20 & 0 \\ 20 & 0 & 164 \end{array} \right] \nonumber \]

    \[M^T\mathbf{y} = \left[ \begin{array}{rrrrr} 1 & 1 & 1 & 1 & 1 \\ -3 & -1 & 0 & 1 & 3 \\ 9 & 1 & 0 & 1 & 9 \end{array} \right] \left[ \begin{array}{r} 3 \\ 1 \\ 1 \\ 2 \\ 4 \end{array} \right] = \left[ \begin{array}{r} 11 \\ 4 \\ 66 \end{array} \right] \nonumber \]

    The normal equations for \(\mathbf{z}\) are

    \[\left[ \begin{array}{rrr} 5 & 0 & 20 \\ 0 & 20 & 0 \\ 20 & 0 & 164 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 11 \\ 4 \\ 66 \end{array} \right] \quad \mbox{ whence } \mathbf{z} = \left[ \begin{array}{r} 1.15 \\ 0.20 \\ 0.26 \end{array} \right] \nonumber \]

    This means that the least squares approximating quadratic for these data is \(y = 1.15 + 0.20x + 0.26x^{2}\).
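
    The quadratic can be recovered in NumPy with the same pattern (an addition to the text); np.vander with increasing=True builds exactly the matrix \(M\) whose columns are \(1\), \(x\), and \(x^2\).

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.array([3.0, 1.0, 1.0, 2.0, 4.0])

M = np.vander(x, N=3, increasing=True)   # columns 1, x, x^2
z = np.linalg.solve(M.T @ M, M.T @ y)
print(np.round(z, 2))                    # [1.15 0.2  0.26]
```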

    Other Functions

    There is an extension of Theorem [thm:016951] that should be mentioned. Given data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\), that theorem shows how to find a polynomial

    \[f(x) = r_0 + r_1x + \dots + r_mx^m \nonumber \]

    such that \(\|\mathbf{y} - f(\mathbf{x})\|^{2}\) is as small as possible, where \(\mathbf{x}\) and \(f(\mathbf{x})\) are as before. Choosing the appropriate polynomial \(f(x)\) amounts to choosing the coefficients \(r_{0}, r_{1}, \dots, r_{m}\), and Theorem [thm:016951] gives a formula for the optimal choices. Here \(f(x)\) is a linear combination of the functions \(1, x, x^{2}, \dots, x^{m}\) where the \(r_{i}\) are the coefficients, and this suggests applying the method to other functions. If \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\) are given functions, write

    \[f(x) = r_0f_0(x) + r_1f_1(x) + \dots + r_mf_m(x) \nonumber \]

    where the \(r_{i}\) are real numbers. Then the more general question is whether \(r_{0}, r_{1}, \dots, r_{m}\) can be found such that \(\|\mathbf{y} - f(\mathbf{x})\|^{2}\) is as small as possible where

    \[f(\mathbf{x}) = \left[ \begin{array}{c} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{array} \right] \nonumber \]

    Such a function \(f(\mathbf{x})\) is called a least squares best approximation for these data pairs of the form \(r_{0}f_{0}(x) + r_{1}f_{1}(x) + \dots + r_{m}f_{m}(x)\), \(r_{i}\) in \(\mathbb{R}\). The proof of Theorem [thm:016951] goes through to prove

    Theorem [thm:017041]

    Let \(n\) data pairs \((x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\) be given, and suppose that \(m + 1\) functions \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\) are specified. Write

    \[\mathbf{y} = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_n \end{array} \right] \quad M = \left[ \begin{array}{cccc} f_0(x_1) & f_1(x_1) & \cdots & f_m(x_1) \\ f_0(x_2) & f_1(x_2) & \cdots & f_m(x_2) \\ \vdots & \vdots & & \vdots \\ f_0(x_n) & f_1(x_n) & \cdots & f_m(x_n) \\ \end{array} \right] \quad \mathbf{z} = \left[ \begin{array}{c} z_0 \\ z_1 \\ \vdots \\ z_m \end{array} \right] \nonumber \]

    1. If \(\mathbf{z}\) is any solution to the normal equations

      \[(M^TM)\mathbf{z} = M^T\mathbf{y} \nonumber \]

      then the function

      \[z_0f_0(x) + z_1f_1(x) + \dots + z_mf_m(x) \nonumber \]

      is the best approximation for these data among all functions of the form \(r_{0}f_{0}(x) + r_{1}f_{1}(x) + \dots + r_{m}f_{m}(x)\) where the \(r_{i}\) are in \(\mathbb{R}\).

    2. If \(M^{T}M\) is invertible (that is, if \(\operatorname{rank}(M) = m + 1\)), then \(\mathbf{z}\) is uniquely determined; in fact, \(\mathbf{z} = (M^{T}M)^{-1}(M^{T}\mathbf{y})\).

    Clearly Theorem [thm:017041] contains Theorem [thm:016951] as a special case, but there is no simple test in general for whether \(M^{T}M\) is invertible. Conditions for this to hold depend on the choice of the functions \(f_{0}(x), f_{1}(x), \dots, f_{m}(x)\).

    Example

    Given the data pairs \((-1, 0)\), \((0, 1)\), and \((1, 4)\), find the least squares approximating function of the form \(r_{0}x + r_{1}2^{x}\).

    The functions are \(f_{0}(x) = x\) and \(f_{1}(x) = 2^{x}\), so the matrix \(M\) is

    \[M = \left[ \begin{array}{cc} f_0(x_1) & f_1(x_1) \\ f_0(x_2) & f_1(x_2) \\ f_0(x_3) & f_1(x_3) \end{array} \right] = \left[ \begin{array}{rl} -1 & 2^{-1} \\ 0 & 2^0 \\ 1 & 2^1 \end{array} \right] =\frac{1}{2} \left[ \begin{array}{rr} -2 & 1 \\ 0 & 2 \\ 2 & 4 \end{array} \right] \nonumber \]

    In this case \(M^TM = \frac{1}{4} \left[ \begin{array}{rr} 8 & 6 \\ 6 & 21 \end{array} \right]\) is invertible, so the normal equations

    \[\frac{1}{4} \left[ \begin{array}{rr} 8 & 6 \\ 6 & 21 \end{array} \right] \mathbf{z} = \left[ \begin{array}{r} 4 \\ 9 \end{array} \right] \nonumber \]

    have a unique solution \(\mathbf{z} = \frac{1}{11} \left[ \begin{array}{r} 10 \\ 16 \end{array} \right].\) Hence the best-fitting function of the form \(r_0x + r_12^x\) is \(\overline{f}(x) = \frac{10}{11}x + \frac{16}{11}2^x\). Note that \(\overline{f}(\mathbf{x}) = \left[ \begin{array}{c} \overline{f}(-1) \\ \overline{f}(0) \\ \overline{f}(1) \\ \end{array} \right] = \def\arraystretch{1.5} \left[ \begin{array}{r} \frac{-2}{11} \\ \frac{16}{11} \\ \frac{42}{11} \end{array} \right]\), compared with \(\mathbf{y} = \left[ \begin{array}{r} 0 \\ 1 \\ 4 \end{array} \right]\).
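
    The general recipe of Theorem [thm:017041] is just as short in NumPy (an addition to the text): the columns of \(M\) are the basis functions \(f_0(x) = x\) and \(f_1(x) = 2^x\) evaluated at the data points.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 4.0])

# Row i of M is (f0(x_i), f1(x_i)) with f0(x) = x and f1(x) = 2^x.
M = np.column_stack([x, 2.0 ** x])

z = np.linalg.solve(M.T @ M, M.T @ y)
print(np.round(11 * z, 6))         # [10. 16.], i.e. z = (10/11, 16/11)
print(np.round(11 * (M @ z), 6))   # fitted values times 11: [-2. 16. 42.]
```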


    This page titled 5.6: Best Approximation and Least Squares is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by W. Keith Nicholson (Lyryx Learning Inc.) via source content that was edited to the style and standards of the LibreTexts platform.