# 7.5: Using Singular Value Decompositions

- Page ID
- 82512

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)\(\newcommand{\twovec}[2]{\begin{pmatrix} #1 \\ #2 \end{pmatrix} } \)

\(\newcommand{\threevec}[3]{\begin{pmatrix} #1 \\ #2 \\ #3 \end{pmatrix} } \)

We've now seen what singular value decompositions are, how to construct them, and how they provide important information about a matrix such as orthonormal bases for the four fundamental subspaces. This puts us in a good position to begin using singular value decompositions to solve a wide variety of problems.

Given the fact that singular value decompositions so immediately convey fundamental data about a matrix, it seems natural that some of our previous work can be reinterpreted in terms of singular value decompositions. Therefore, we'll take some time in this section to revisit some familiar issues, such as least squares problems and principal component analysis, while also looking at some new applications.

### Preview Activity 7.5.1.

Suppose that \(A = U\Sigma V^T\) where

vectors \(\mathbf u_j\) form the columns of \(U\text{,}\) and vectors \(\mathbf v_j\) form the columns of \(V\text{.}\)

- What are the dimensions of the matrices \(A\text{,}\) \(U\text{,}\) and \(V\text{?}\)
- What is the rank of \(A\text{?}\)
- Describe how to find an orthonormal basis for \(Col(A)\text{.}\)
- Describe how to find an orthonormal basis for \(Nul(A)\text{.}\)
- If the columns of \(Q\) form an orthonormal basis for \(Col(A)\text{,}\) what is \(Q^TQ\text{?}\)
- How would you form a matrix that projects vectors orthogonally onto \(Col(A)\text{?}\)

## Least squares problems

Least squares problems, which we explored in Section 6.5, arise when we are confronted with an inconsistent linear system \(A\mathbf x=\mathbf b\text{.}\) Since there is no solution to the system, we instead find the vector \(\mathbf x\) minimizing the distance between \(\mathbf b\) and \(A\mathbf x\text{.}\) That is, we find the vector \(\widehat{\mathbf{x}}\text{,}\) the least squares approximate solution, by solving \(A\widehat{\mathbf{x}}=\widehat{\mathbf{b}}\) where \(\widehat{\mathbf{b}}\) is the orthogonal projection of \(\mathbf b\) onto the column space of \(A\text{.}\)

If we have a singular value decomposition \(A=U\Sigma V^T\text{,}\) then the number of nonzero singular values \(r\) tells us the rank of \(A\text{,}\) and the first \(r\) columns of \(U\) form an orthonormal basis for \(Col(A)\text{.}\) This basis may be used to project vectors onto \(Col(A)\) and hence to solve least squares problems.

Before exploring this connection further, we will introduce Sage as a tool for automating the construction of singular value decompositions. One new feature is that we need to declare our matrix to consist of floating point entries. We do this by including `RDF`

inside the matrix definition, as illustrated in the following cell.

### Activity 7.5.2.

Consider the equation \(A\mathbf x=\mathbf b\) where

- Find a singular value decomposition for \(A\) using the Sage cell below. What are singular values of \(A\text{?}\)
- What is \(r\text{,}\) the rank of \(A\text{?}\) How can we identify an orthonormal basis for \(Col(A)\text{?}\)
- Form the reduced singular value decomposition \(U_r\Sigma_rV_r^T\) by constructing the matrix \(U_r\text{,}\) consisting of the first \(r\) columns of \(U\text{,}\) the matrix \(V_r\text{,}\) consisting of the first \(r\) columns of \(V\text{,}\) and \(\Sigma_r\text{,}\) an \(r\times r\) diagonal matrix. Verify that \(A=U_r\Sigma_r V_r^T\text{.}\)
You may find it convenient to remember that, if

`B`

is a matrix defined in Sage, then`B.matrix_from_columns( list )`

and`B.matrix_from_rows( list )`

can be used to extract columns or rows from`B`

. For instance,`B.matrix_from_rows([0,1,2])`

provides a matrix formed from the first three rows of`B`

. - How does the reduced singular value decomposition provide a matrix whose columns are an orthonormal basis for \(Col(A)\text{?}\)
- Explain why a least squares approximate solution \(\widehat{\mathbf{x}}\) satisfies
\begin{equation*} A\widehat{\mathbf{x}} = U_rU_r^T\mathbf b. \end{equation*}
- What is the product \(V_r^TV_r\) and why does it have this form?
- Explain why
\begin{equation*} \widehat{\mathbf{x}} = V_r\Sigma_r^{-1}U_r^T\mathbf b \end{equation*}
is a least squares approximate solution by simplifying \(A\widehat{\mathbf{x}} = (U_r\Sigma_r V_r^T)(V_r\Sigma_r^{-1}U_r^T \mathbf b)\text{.}\)

Now use this expression to find \(\widehat{\mathbf{x}}\text{.}\)

This activity demonstrates the power of a singular value decomposition to find a least squares approximate solution for an equation \(A\mathbf x = \mathbf b\text{.}\) Because it immediately provides an orthonormal basis for \(Col(A)\text{,}\) something that we've had to construct by the Gram-Schmidt process in the past, we can easily project \(\mathbf b\) onto \(Col(A)\text{,}\) which results in a simple expression for \(\widehat{\mathbf{x}}\text{.}\)

If \(A=U_r\Sigma_r V_r^T\) is a reduced singular value decomposition of \(A\text{,}\) then a least squares approximate solution to \(A\mathbf x=\mathbf b\) is given by

If the columns of \(A\) are linearly independent, then the equation \(A\widehat{\mathbf{x}} = \widehat{\mathbf{b}}\) has only one solution so there is a unique least squares approximate solution \(\widehat{\mathbf{x}}\text{.}\) Otherwise, the expression in Proposition 7.5.1 produces the solution to \(A\widehat{\mathbf{x}}=\widehat{\mathbf{b}}\) having the shortest length.

The matrix \(A^+ = V_r\Sigma_r^{-1}U_r^T\) is known as the *Moore-Penrose psuedoinverse* of \(A\text{.}\) When \(A\) is invertible, \(A^{-1} = A^+\text{.}\)

## Rank \(k\) approximations

If we have a singular value decomposition for a matrix \(A\text{,}\) we can form a sequence of matrices \(A_k\) that approximate \(A\) with increasing accuracy. This may feel familiar to calculus students who have seen the way in which a function \(f(x)\) can be approximated by a linear function, a quadratic function, and so forth with increasing accuracy.

We'll begin with a singular value decomposition of a rank \(r\) matrix \(A\) so that \(A=U\Sigma V^T\text{.}\) To create the approximating matrix \(A_k\text{,}\) we keep the first \(k\) singular values and set the others to zero. For instance, if \(\Sigma = \begin{bmatrix} 22 & 0 & 0 & 0 & 0 \\ 0 & 14 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \text{,}\) we can form matrices

and define \(A_1 = U\Sigma^{(1)}V^T\) and \(A_2 = U\Sigma^{(2)}V^T\text{.}\) Because \(A_k\) has \(k\) nonzero singular values, we know that \(Rank(A_k) = k\text{.}\) In fact, there is a sense in which \(A_k\) is the closest matrix to \(A\) among all rank \(k\) matrices.

### Activity 7.5.3.

Let's consider a matrix \(A=U\Sigma V^T\) where

Evaluating the following cell will create the matrices `U`

, `V`

, and `Sigma`

. Notice how the `diagonal_matrix`

command provides a convenient way to form the diagonal matrix \(\Sigma\text{.}\)

- Form the matrix \(A=U\Sigma V^T\text{.}\) What is \(Rank(A)\text{?}\)
- Now form the approximating matrix \(A_1=U\Sigma^{(1)} V^T\text{.}\) What is \(Rank(A_1)\text{?}\)
- Find the error in the approximation \(A\approx A_1\) by finding \(A-A_1\text{.}\)
- Now find \(A_2 = U\Sigma^{(2)} V^T\) and the error \(A-A_2\text{.}\) What is \(Rank(A_2)\text{?}\)
- Find \(A_3 = U\Sigma^{(3)} V^T\) and the error \(A-A_3\text{.}\) What is \(Rank(A_3)\text{?}\)
- What would happen if we were to compute \(A_4\text{?}\)
- What do you notice about the error \(A-A_k\) as \(k\) increases?

In this activity, the approximating matrix \(A_k\) has rank \(k\) because its singular value decomposition has \(k\) nonzero singular values. We then saw how the difference between \(A\) and the approximations \(A_k\) decreases as \(k\) increases, which means that the sequence \(A_k\) forms better approximations as \(k\) increases.

Another way to represent \(A_k\) is with a reduced singular value decomposition so that \(A_k = U_k\Sigma_kV_k^T\) where

Notice that the rank \(1\) matrix \(A_1\) then has the form \(A_1 = \mathbf u_1\begin{bmatrix}\sigma_1\end{bmatrix} \mathbf v_1^T = \sigma_1\mathbf u_1\mathbf v_1^T\) and that we can similarly write:

Given two vectors \(\mathbf u\) and \(\mathbf v\text{,}\) the matrix \(\mathbf u~\mathbf v^T\) is called the *outer product* of \(\mathbf u\) and \(\mathbf v\text{.}\) (The dot product \(\mathbf u\cdot\mathbf v=\mathbf u^T\mathbf v\) is sometimes called the *inner product*.) An outer product will always be a rank \(1\) matrix so we see above how \(A_k\) is obtained by adding together \(k\) rank \(1\) matrices, each of which gets us one step closer to the original matrix \(A\text{.}\)

## Principal component analysis

In Section 7.3, we explored principal component analysis as a technique to reduce the dimension of a dataset. In particular, we constructed the covariance matrix \(C\) from a demeaned data matrix and saw that the eigenvalues and eigenvectors of \(C\) tell us about the variance of the dataset in different directions. We referred to the eigenvectors of \(C\) as *principal components* and found that projecting the data onto a subspace defined by the first few principal components frequently gave us a way to visualize the dataset. As we added more principal components, we retained more information about the original dataset. This feels similar to the rank \(k\) approximations we have just seen so let's explore the connection.

Suppose that we have a dataset with \(N\) points, that \(A\) represents the demeaned data matrix, that \(A = U\Sigma V^T\) is a singular value decomposition, and that the singular values are \(A\) are denoted as \(\sigma_i\text{.}\) It follows that the covariance matrix

Notice that \(\frac1N \Sigma\Sigma^T\) is a diagonal matrix whose diagonal entries are \(\frac1N\sigma_i^2\text{.}\) Therefore, it follows that

is an orthgonal diagonalization of \(C\) showing that

- the principal components of the dataset, which are the eigenvectors of \(C\text{,}\) are given by the columns of \(U\text{.}\) In other words, the left singular vectors of \(A\) are the principal components of the dataset.
- the variance in the direction of a principal component is the associated eigenvalue of \(C\) and therefore
\begin{equation*} V_{\mathbf u_i} = \frac1N\sigma_i^2. \end{equation*}

### Activity 7.5.4.

Let's revisit the iris data set that we studied in Section 7.3. Remember that there are four measurements given for each of 150 irises and that each iris belongs to one of three species.

Evaluating the following cell will load the dataset and define the demeaned data matrix \(A\) whose shape is \(4\times150\text{.}\)

- Find the singular values of \(A\) using the command
`A.singular_values()`

and use them to determine the variance \(V_{\mathbf u_j}\) in the direction of each of the four principal components. What is the fraction of variance retained by the first two principal components? - We will now write the matrix \(\Gamma = \Sigma V^T\) so that \(A = U\Gamma\text{.}\) Suppose that a demeaned data point, say, the 100th column of \(A\text{,}\) is written as a linear combination of principal components:
\begin{equation*} \mathbf x = c_1\mathbf u_1+c_2\mathbf u_2+c_3\mathbf u_3+c_4\mathbf u_4. \end{equation*}
Explain why \(\fourvec{c_1}{c_2}{c_3}{c_4}\text{,}\) the vector of coordinates of \(\mathbf x\) in the basis of principal components, appears as 100th column of \(\Gamma\text{.}\)

- Suppose that we now project this demeaned data point \(\mathbf x\) orthogonally onto the subspace spanned by the first two principal components \(\mathbf u_1\) and \(\mathbf u_2\text{.}\) What are the coordinates of the projected point in this basis and how can we find them in the matrix \(\Gamma\text{?}\)
- Alternatively, consider the approximation \(A_2=U_2\Sigma_2V_2^T\) of the demeaned data matrix \(A\text{.}\) Explain why the 100th column of \(A_2\) represents the projection of \(\mathbf x\) onto the two-dimensional subspace spanned by the first two principal components, \(\mathbf u_1\) and \(\mathbf u_2\text{.}\) Then explain why the coefficients in that projection, \(c_1\mathbf u_1 + c_2\mathbf u_2\text{,}\) form the two-dimensional vector \(\twovec{c_1}{c_2}\) that is the 100th column of \(\Gamma_2=\Sigma_2 V_2^T\text{.}\)
- Now we've seen that the columns of \(\Gamma_2 = \Sigma_2 V_2^T\) form the coordinates of the demeaned data points projected on to the two-dimensional subspace spanned by \(\mathbf u_1\) and \(\mathbf u_2\text{.}\) In the cell below, find a singular value decomposition of \(A\) and use it to form the matrix
`Gamma2`

. When you evaluate this cell, you will see a plot of the projected demeaned data plots, similar to the one we created in Section 7.3.

In our first encounter with principal component analysis, we began with a demeaned data matrix \(A\text{,}\) formed the covariance matrix \(C\text{,}\) and used the eigenvalues and eigenvectors of \(C\) to project the demeaned data onto a smaller dimensional subspace. In this section, we have seen that a singular value decomposition of \(A\) provides a more direct route: the left singular vectors of \(A\) form the principal components and the approximating matrix \(A_k\) represents the data points projected onto the subspace spanned by the first \(k\) principal components. The coordinates of a projected demeaned data point are given by the columns of \(\Gamma_k = \Sigma_kV_k^T\text{.}\)

## Image compressing and denoising

In addition to principal component analysis, the approximations \(A_k\) of a matrix \(A\) obtained from a singular value decomposition can be used in image processing. Remember that we studied the JPEG compression algorithm, whose foundation is the change of basis defined by the Discrete Cosine Transform, in Section 3.3. We will now see how a singular value decomposition provides another tool for both compressing images and removing noise in them.

### Activity 7.5.5.

Evaluating the following cell loads some data that we'll use in this activity. To begin, it defines and displays a \(25\times15\) matrix \(A\text{.}\)

- If we interpret 0 as black and 1 as white, this matrix represents an image as shown below.
We will explore how the singular value decomposition helps us to compress this image.
- By inspecting the image represented by \(A\text{,}\) identify a basis for \(Col(A)\) and determine \(Rank(A)\text{.}\)
- The following cell plots the singular values of \(A\text{.}\) Explain how this plot verifies that the rank is what you found in the previous part.
- There is a command
`approximate(A, k)`

that creates the approximation \(A_k\text{.}\) Use the cell below to define \(k\) and look at the images represented by the first few approximations. What is the smallest value of \(k\) for which \(A=A_k\text{?}\) - Now we can see how the singular value decomposition allows us to compress images. Since this is a \(25\times15\) matrix, we need \(25\cdot15=375\) numbers to represent the image. However, we can also reconstruct the image using a small number of singular values and vectors:
\begin{equation*} A = A_k = \sigma_1\mathbf u_1\mathbf v_1^T + \sigma_2\mathbf u_2\mathbf v_2^T + \ldots + \sigma_k\mathbf u_k\mathbf v_k^T. \end{equation*}
What are the dimensions of the singular vectors \(\mathbf u_i\) and \(\mathbf v_i\text{?}\) Between the singular vectors and singular values, how many numbers do we need to reconstruct \(A_k\) for the smallest \(k\) for which \(A=A_k\text{?}\) This is the compressed size of the image.

- The
*compression ratio*is the ratio of the uncompressed size to the compressed size. What compression ratio does this represent?

- Next we'll explore an example based on a photograph.
- Consider the following image consisting of an array of \(316\times310\) pixels stored in the matrix \(A\text{.}\)
Plot the singular values of \(A\text{.}\)

- Use the cell below to study the approximations \(A_k\) for \(k=1, 10, 20, 50, 100\text{.}\)
Notice how the approximating image \(A_k\) more closely approximates the original image \(A\) as \(k\) increases.
What is the compression ratio when \(k=50\text{?}\) What is the compression ratio when \(k=100\text{?}\) Notice how a higher compression ratio leads to a lower quality reconstruction of the image.

- Consider the following image consisting of an array of \(316\times310\) pixels stored in the matrix \(A\text{.}\)
- A second, related application of the singular value decomposition to image processing is called
*denoising*. For example, consider the image represented by the matrix \(A\) below. This image is similar to the image of the letter "O" we first studied in this activity, but there are splotchy regions in the background that result, perhaps, from scanning the image. We think of the splotchy regions as noise, and our goal is to improve the quality of the image by reducing the noise.- Plot the singular values below. How are the singular values of this matrix similar to those represented by the clean image that we considered earlier and how are they different?
- There is a natural point where the singular values dramatically decrease so it makes sense to think of the noise as being formed by the small singular values. To denoise the image, we will therefore replace \(A\) by its approximation \(A_k\text{,}\) where \(k\) is the point at which the singular values drop off. This has the effect of setting the small singular values to zero and hence eliminating the noise. Choose an appropriate value of \(k\) below and notice that the new image appears to be somewhat cleaned up as a result of removing the noise.

Several examples illustrating how the singular value decomposition compresses images are available at this page from Tim Baumann.

## Analyzing Supreme Court cases

As we've seen, a singular value decomposition concentrates the most important features of a matrix into the first singular values and singular vectors. We will now use this observation to extract meaning from a large dataset giving the voting records of Supreme Court justices. A similar analysis appears in the paper A pattern analysis of the second Rehnquist U.S. Supreme Court by Lawrence Sirovich.

The makeup of the Supreme Court was unusually stable during a period from 1994-2005 when it was led by Chief Justice William Rehnquist. This is sometimes called the *second Rehnquist court*. The justices during this period were:

- William Rehnquist
- Antonin Scalia
- Clarence Thomas
- Anthony Kennedy
- Sandra Day O'Connor
- John Paul Stevens
- David Souter
- Ruth Bader Ginsburg
- Stephen Breyer

During this time, there were 911 cases in which all nine judges voted. We would like to understand patterns in their voting.

### Activity 7.5.6.

Evaluating the following cell loads and displays a dataset describing the votes of each justice in these 911 cases. More specifically, an entry of +1 means that the justice represented by the row voted with the majority in the case represented by the column. An entry of -1 means that justice was in the minority. This information is also stored in the \(9\times911\) matrix \(A\text{.}\)

The justices are listed, very roughly, in order from more conservative to more progressive.

In this activity, it will be helpful to visualize the entries in various matrices and vectors. The next cell displays the first 50 columns of the matrix \(A\) with white representing an entry of +1, red representing -1, and black representing 0.

- Plot the singular values of \(A\) below. Describe the significance of this plot, including the relative contributions from the singular values \(\sigma_k\) as \(k\) increases.
- Form the singular value decomposition \(A=U\Sigma V^T\) and the matrix of coefficients \(\Gamma\) so that \(A=U\Gamma\text{.}\)
- We will now study a particular case, the second case appearing as the column of \(A\) indexed by
`1`

. There is a command`display_column(A, k)`

that provides a visual display of the \(k^{th}\) column of a matrix \(A\text{.}\) Describe the justices' votes in the second case. - Also, display the first left singular vector \(\mathbf u_1\text{,}\) the column of \(U\) indexed by \(0\text{,}\) and the column of \(\Gamma\) holding the coefficients that express the second case as a linear combination of left singular vectors. What does this tell us about how the second case is constructed as a linear combination of left singular vectors? What is the significance of the first left singular vector \(\mathbf u_1\text{?}\)
- Let's now study the \(48^{th}\) case, which is represented by the column of \(A\) indexed by
`47`

. Describe the voting pattern in this case. - Display the second left singular vector \(\mathbf u_2\) and the vector of coefficients that express the \(48^{th}\) case as a linear combination of left singular vectors. Describe how this case is constructed as a linear combination of singular vectors. What is the significance of the second left singular vector \(\mathbf u_2\text{?}\)
- The data in Table 7.5.2 describes the number of cases decided by each possible vote count.
Table 7.5.2. Number of cases by vote count Vote count # of cases 9-0 405 8-1 89 7-2 111 6-3 118 5-4 188 - Cases decided by a 5-4 vote are often the most impactful as they represent a sharp divide among the justices and, often, society at large. For that reason, we will now focus on the 5-4 decisions. Evaluating the next cell forms the \(9\times188\) matrix \(B\) consisting of 5-4 decisions.
Form the singular value decomposition of \(B=U\Sigma V^T\) along with the matrix \(\Gamma\) of coefficients so that \(B=U\Gamma\) and display the first left singular vector \(\mathbf u_1\text{.}\) Study how the \(7^{th}\) case, indexed by
`6`

, is constructed as a linear combination of left singular vectors. What does this singular vector tell us about the make up of the court and whether it leans towards the conservatives or progressives? - Display the second left singular vector \(\mathbf u_2\) and study how the \(6^{th}\) case, indexed by
`5`

, is constructed as a linear combination of left singular vectors. What does \(\mathbf u_2\) tell us about the relative importance of the justices' voting records? - By a
*swing vote*, we mean a justice who is less inclined to vote with a particular bloc of justices but instead swings from one bloc to another with the potential to sway close decisions. What do the singular vectors \(\mathbf u_1\) and \(\mathbf u_2\) tell us about the presence of voting blocs on the court and the presence of a swing vote? Which justice represents the swing vote?

## Summary

This section has demonstrated some uses of the singular value decomposition. Because the singular values appear in decreasing order, the decomposition has the effect of concentrating the most important features of the matrix into the first singular values and singular vectors.

- Because the first left singular vectors form an orthonormal basis for \(Col(A)\text{,}\) a singular value decomposition provides a convenient way to project vectors onto \(Col(A)\) and therefore to solve least squares problems.
- A singular value decomposition of a rank \(r\) matrix \(A\) leads to a series of approximations \(A_k\) of \(A\) where
\begin{align*} A \approx A_1 & = \sigma_1\mathbf u_1\mathbf v_1^T\\ A \approx A_2 & = \sigma_1\mathbf u_1\mathbf v_1^T + \sigma_2\mathbf u_2\mathbf v_2^T\\ A \approx A_3 & = \sigma_1\mathbf u_1\mathbf v_1^T + \sigma_2\mathbf u_2\mathbf v_2^T + \sigma_3\mathbf u_3\mathbf v_3^T\\ \vdots & \\ A = A_r & = \sigma_1\mathbf u_1\mathbf v_1^T + \sigma_2\mathbf u_2\mathbf v_2^T + \sigma_3\mathbf u_3\mathbf v_3^T + \ldots + \sigma_r\mathbf u_r\mathbf v_r^T \end{align*}
In each case, \(A_k\) is the rank \(k\) matrix that is closest to \(A\text{.}\)

- If \(A\) is a demeaned data matrix, the left singular vectors give the principal components of \(A\) and the variance in the direction of a principal component can be easily expressed in terms of the corresponding singular value.
- The singular value decomposition has many applications. In this section, we looked at how the decomposition is used in image processing through the techniques of compression and denoising.
- Because the first few left singular vectors contain the most important features of a matrix, we can use a singular value decomposition to extract meaning from a large dataset as we did when analyzing the voting patterns of the second Rehnquist court.

## Exercises 7.5.7Exercises

Suppose that

- Find the singular values of \(A\text{.}\) What is \(Rank(A)\text{?}\)
- Find the sequence of matrices \(A_1\text{,}\) \(A_2\text{,}\) \(A_3\text{,}\) and \(A_4\) where \(A_k\) is the rank \(k\) approximation of \(A\text{.}\)

Suppose we would like to find the best quadratic function

fitting the points

- Set up a linear system \(A\mathbf x = \mathbf b\) describing the coefficients \(\mathbf x = \threevec{\beta_0}{\beta_1}{\beta_2}\text{.}\)
- Find the singular value decomposition of \(A\text{.}\)
- Use the singular value decomposition to find the least squares approximate solution \(\widehat{\mathbf{x}}\text{.}\)

Remember that the outer product of two vector \(\mathbf u\) and \(\mathbf v\) is the matrix \(\mathbf u~\mathbf v^T\text{.}\)

- Suppose that \(\mathbf u = \twovec{2}{-3}\) and \(\mathbf v=\threevec201\text{.}\) Evaluate the outer product \(\mathbf u~\mathbf v^T\text{.}\) To get a clearer sense of how this works, perform this operation without using technology.
How is each of the columns of \(\mathbf u~\mathbf v^T\) related to \(\mathbf u\text{?}\)

- Suppose \(\mathbf u\) and \(\mathbf v\) are general vectors. What is \(Rank(\mathbf u~\mathbf v^T)\) and what is a basis for its column space \(Col(\mathbf u~\mathbf v^T)\text{?}\)
- Suppose that \(\mathbf u\) is a unit vector. What is the effect of multiplying a vector by the matrix \(\mathbf u~\mathbf u^T\text{?}\)

Evaluating the following cell loads in a dataset recording some features of 1057 houses. Notice how the lot size varies over a relatively small range compared to the other features. For this reason, in addition to demeaning the data, we'll scale each feature by dividing by its standard deviation so that the range of values is similar for each feature. The matrix \(A\) holds the result.

- Find the singular values of \(A\) and use them to determine the variance in the direction of the principal components.
- For what fraction of the variance do the first two principal components account?
- Find a singular value decomposition of \(A\) and construct the matrix \(2\times1057\) matrix \(B\) whose entries are the coordinates of the demeaned data points projected on to the two-dimensional subspace spanned by the first two principal components. You can plot the projected data points using
`list_plot(B.columns())`

. - Study the entries in the first two principal components \(\mathbf u_1\) and \(\mathbf u_2\text{.}\) Would a more expensive house lie on the left, right, top, or bottom of the plot you constructed?
- In what ways does a house that lies on the far left of the plot you constructed differ from an average house? In what ways does a house that lies near the top of the plot you constructed differ from an average house?

Let's revisit the voting records of justices on the second Rehnquist court. Evaluating the following cell will load the voting records of the justices in the 188 cases decided by a 5-4 vote and store them in the matrix \(A\text{.}\)

- The cell above also defined the 188-dimensional vector \(\mathbf v\) whose entries are all 1. What does the product \(A\mathbf v\) represent? Use the following cell to evaluate this product.
- How does the product \(A\mathbf v\) tell us which justice voted in the majority most frequently? What does this say about the presence of a swing vote on the court?
- How does this product tell us whether we should characterize this court as leaning conservative or progressive?
- How does this product tell us about the presence of a second swing vote on the court?
- Study the left singular vector \(\mathbf u_3\) and describe how it reinforces the fact that there was a second swing vote. Who was this second swing vote?

The following cell loads a dataset that describes the percentages with which justices on the second Rehnquist court agreed with one another. For instance, the entry in the first row and second column is 72.78, which means that Justices Rehnquist and Scalia agreed with each other in 72.78% of the cases.

- Examine the matrix \(A\text{.}\) What special structure does this matrix have and why should we expect it to have this structure?
- Plot the singular values of \(A\) below. For what value of \(k\) would the approximation \(A_k\) be a reasonable approximation of \(A\text{?}\)
- Find a singular value decomposition \(A=U\Sigma V^T\) and examine the matrices \(U\) and \(V\) using, for instance,
`n(U, 3)`

. What do you notice about the relationship between \(U\) and \(V\) and why should we expect this relationship to hold? - The command
`approximate(A, k)`

will form the approximating matrix \(A_k\text{.}\) Study the matrix \(A_1\) using the`display_matrix`

command. Which justice or justices seem to be most agreeable, that is, most likely to agree with other justices? Which justice is least agreeable? - Examine the difference \(A_2-A_1\) and describe how this tells us about the presence of voting blocs and swing votes on the court.

Suppose that \(A=U_r\Sigma_rV_r^T\) is a reduced singular value decomposition of the \(m\times n\) matrix \(A\text{.}\) The matrix \(A^+ = V_r\Sigma_r^{-1}U_r^T\) is called the *Moore-Penrose inverse* of \(A\text{.}\)

- Explain why \(A^+\) is an \(n\times m\) matrix.
- If \(A\) is an invertible, square matrix, explain why \(A^+=A^{-1}\text{.}\)
- Explain why \(AA^+\mathbf b=\widehat{\mathbf{b}}\text{,}\) the orthogonal projection of \(\mathbf b\) onto \(Col(A)\text{.}\)
- Explain why \(A^+A\mathbf x=\widehat{\mathbf{x}}\text{,}\) the orthogonal projection of \(\mathbf x\) onto \(Col(A^T)\text{.}\)

In Subsection 5.1, we saw how some linear algebraic computations are sensitive to round off error made by a computer. A singular value decomposition can help us understand when this situation can occur.

For instance, consider the matrices

The entries in these matrices are quite close to one another, but \(A\) is invertible while \(B\) is not. It seems like \(A\) is *almost* singular. In fact, we can measure how close a matrix is to being singular by forming the *condition number*, \(\sigma_1/\sigma_n\text{,}\) the ratio of the largest to smallest singular value. If \(A\) were singular, the condition number would be undefined because the singular value \(\sigma_n=0\text{.}\) Therefore, we will think of matrices with large condition numbers as being close to singular.

- Define the matrix \(A\) and find a singular value decomposition. What is the condition number of \(A\text{?}\)
- Define the left singular vectors \(\mathbf u_1\) and \(\mathbf u_2\text{.}\) Compare the results \(A^{-1}\mathbf b\) when
- \(\mathbf b=\mathbf u_1+\mathbf u_2\text{.}\)
- \(\mathbf b=2\mathbf u_1+\mathbf u_2\text{.}\)

Notice how a small change in the vector \(\mathbf b\) leads to a small change in \(A^{-1}\mathbf b\text{.}\)

- Now compare the results \(A^{-1}\mathbf b\) when
- \(\mathbf b=\mathbf u_1+\mathbf u_2\text{.}\)
- \(\mathbf b=\mathbf u_1+2\mathbf u_2\text{.}\)

Notice now how a small change in \(\mathbf b\) leads to a large change in \(A^{-1}\mathbf b\text{.}\)

- Previously, we saw that, if we write \(\mathbf x\) in terms of left singular vectors \(\mathbf x=c_1\mathbf v_1+c_2\mathbf v_2\text{,}\) then we have
\begin{equation*} \mathbf b=A\mathbf x = c_1\sigma_1\mathbf u_1 + c_2\sigma_2\mathbf u_2. \end{equation*}
If we write \(\mathbf b=d_1\mathbf u_1+d_2\mathbf u_2\text{,}\) explain why \(A^{-1}\mathbf b\) is sensitive to small changes in \(d_2\text{.}\)

Generally speaking, a square matrix \(A\) with a large condition number will demonstrate this type of behavior so that the computation of \(A^{-1}\) is likely to be affected by round off error. We call such a matrix *ill-conditioned*.