Least squares and pseudo-inverses

To appreciate the connections between solutions of the system \(Ax=b\) and least squares, we begin with two illustrative examples.

Overdetermined systems:

\(A\) is \(m \times n\) with \(m > n\). In this case there are more equations than unknowns, \(A^{\top}A\) is \(n\times n\) and \(AA^{\top}\) is \(m\times m\). The connection with the pseudo-inverse is that
\[
x = (A^{\top}A)^{-1}A^{\top}b = A^+b
\] is the particular \(x\) that minimizes \(\|Ax-b\|\).
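To see why this minimizes \(\|Ax-b\|\), expand \(\|Ax-b\|^2 = x^\top A^\top A x - 2x^\top A^\top b + b^\top b\) and set its gradient with respect to \(x\) to zero, which yields the normal equations
\[
A^{\top}A\,x = A^{\top}b,
\] and hence \(x = (A^{\top}A)^{-1}A^{\top}b\) whenever \(A^{\top}A\) is invertible (which holds exactly when the columns of \(A\) are linearly independent).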
Example: Solve the system \(
(2\ 3\ 4\ 6)^\top(x_1) = (4\ 6\ 8\ 10)^\top.\)
Using \(A^{\top}A = (2\ 3\ 4\ 6)(2\ 3\ 4\ 6)^\top=65\),
\[
x = (A^{\top}A)^{-1}A^{\top}b = \frac{1}{65}\begin{pmatrix}2 &3 &4 &6\end{pmatrix}\begin{pmatrix}4\\6\\8\\10\end{pmatrix} = \frac{118}{65}.
\] Turning to the least squares problem, \(\|Ax - b\|^2 = (2x-4)^2+(3x-6)^2+(4x-8)^2+(6x-10)^2\), which is minimized where the derivative vanishes, \(2\cdot 2(2x-4)+2\cdot 3(3x-6)+2\cdot 4(4x-8)+2\cdot 6(6x-10) = 0\), so that
\[
x = \frac{8+18+32+60}{4+9+16+36} = \frac{118}{65}.
\]
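These numbers are easy to check numerically. Here is a minimal sketch using NumPy (the arrays are just the data from the example above; `np.linalg.pinv` computes the pseudo-inverse and `np.linalg.lstsq` solves the least squares problem directly):

```python
import numpy as np

# Overdetermined system: A is 4x1, b has 4 entries.
A = np.array([[2.0], [3.0], [4.0], [6.0]])
b = np.array([4.0, 6.0, 8.0, 10.0])

# Normal-equations solution (A^T A)^{-1} A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Pseudo-inverse solution A^+ b, and NumPy's built-in least squares solver.
x_pinv = np.linalg.pinv(A) @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# All three agree with 118/65 = 1.8153...
print(x_normal, x_pinv, x_lstsq)
```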

Underdetermined systems:

\(A\) is \(m \times n\) with \(m < n\). In this case there are more unknowns than equations, and the connection with the pseudo-inverse is that
\[
x = A^{\top}(AA^{\top})^{-1}b = A^+b
\] is the particular \(x\) that minimizes \(\|x\|\) amongst all of the possible solutions.

Example: Solve the system \((1\ -1\ 0)(x_1\ x_2\ x_3)^\top = (2).\)
In this case we find \(AA^{\top} = (1\ -1\ 0)(1\ -1\ 0)^\top = 2\) and
\[x = A^{\top}(AA^{\top})^{-1}b = \begin{pmatrix}1\\-1\\0\end{pmatrix}\cdot\frac{1}{2}\cdot 2 = \begin{pmatrix}1\\-1\\0\end{pmatrix}.
\]

To see the connection with the least squares problem, the system is row reduced (it is already row reduced) and we can choose \(x_1\) as the pivot with \(x_2, x_3\) as free parameters. Letting \(x_2 = s\) and \(x_3 = t\) we have \(x_1 = 2+s\) so that the general solution is
\[x = \begin{pmatrix}2+s\\ s\\ t\end{pmatrix} = \begin{pmatrix}2\\0\\0\end{pmatrix} + s\begin{pmatrix}1\\1\\0\end{pmatrix} + t\begin{pmatrix}0\\0\\1\end{pmatrix}
\] for any \(s,t\in\mathbb{R}\). Considering \(\|x\|\), we have \(\|x\|^2 = (2+s)^2 + s^2 + t^2\), which is minimized where the partial derivatives vanish, \(2(2+s)+2s = 0\) and \(2t = 0\), i.e. \(s = -1,\ t = 0\), giving \(x = (1\ -1\ 0)^\top\) as before.
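The same minimum-norm calculation can be checked numerically; a short sketch with NumPy, where `np.linalg.pinv` supplies \(A^+\):

```python
import numpy as np

# Underdetermined system: one equation, three unknowns.
A = np.array([[1.0, -1.0, 0.0]])
b = np.array([2.0])

# Minimum-norm solution x = A^+ b = A^T (A A^T)^{-1} b.
x_min = np.linalg.pinv(A) @ b
print(x_min)  # [ 1. -1.  0.]

# Any other solution (here s = 1, t = 3 in the general solution) has a larger norm.
x_other = np.array([3.0, 1.0, 3.0])
print(np.allclose(A @ x_other, b))                      # True: also solves Ax = b
print(np.linalg.norm(x_min) < np.linalg.norm(x_other))  # True: A^+ b has smaller norm
```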

Symmetric matrices:

Decomposing the matrix \(A\) can help us understand the resulting solutions. For a symmetric matrix the eigenvectors form an orthogonal set, which allows one to write \(A = UDU^\top\), where \(U\) is the orthogonal matrix \((U^{-1}=U^\top)\) whose columns are the (normalized) eigenvectors and \(D\) is a diagonal matrix with the corresponding eigenvalues on its diagonal. The eigenvalues themselves can be any real numbers, but if we require \(A\) to be non-negative definite (positive semi-definite) then the eigenvalues must all be greater than or equal to zero.

Example: Write \(A_1 = \begin{pmatrix}2 &1\\ 1 &2\end{pmatrix}\) in the form \(A_1 = UDU^\top\).
A quick calculation gives \(\lambda_1 = 3\) with corresponding eigenvector \(\mathbf{\xi}^{(1)} = \frac{1}{\sqrt{2}}(1\ 1)^\top\) and a second pair \(\lambda_2 = 1\), \(\mathbf{\xi}^{(2)} = \frac{1}{\sqrt{2}}(1\ -1)^\top\). This gives the decomposition \[A_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}\begin{pmatrix}3 &0\\ 0 &1\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}^\top.\] Using this decomposition, the inverse of the matrix is easily computed by replacing the diagonal elements of \(D\) with their reciprocals, so that for example \[A_1^{-1} = \frac{1}{3}\begin{pmatrix}2& -1\\ -1& 2\end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}\begin{pmatrix}{\small\frac{1}{3}} &0\\ 0 &1\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ 1 &-1\end{pmatrix}^\top.\]
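This decomposition is also easy to reproduce numerically; a sketch using NumPy's symmetric eigensolver `np.linalg.eigh` (which returns orthonormal eigenvectors as the columns of \(U\)):

```python
import numpy as np

A1 = np.array([[2.0, 1.0], [1.0, 2.0]])

# Eigenvalues (ascending) and orthonormal eigenvectors as columns of U.
evals, U = np.linalg.eigh(A1)
D = np.diag(evals)

print(np.allclose(A1, U @ D @ U.T))                                    # True: A1 = U D U^T
print(np.allclose(np.linalg.inv(A1), U @ np.diag(1.0 / evals) @ U.T))  # True: inverse via reciprocals
```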

What happens if one of the eigenvalues is zero? This does not affect the decomposition of \(A\). For example, decomposing \(A_2 = \begin{pmatrix}1&-1\\-1&1\end{pmatrix}\), with eigenvalues \(\lambda_1 = 2\), \(\lambda_2 = 0\) and corresponding eigenvectors \(\xi^{(1)} = \frac{1}{\sqrt{2}}(1\ -1)^\top\), \(\xi^{(2)} = \frac{1}{\sqrt{2}}(1\ 1)^\top\), gives \[ A_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}\begin{pmatrix}2 &0\\ 0 &0\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}^\top.\] Notice that this matrix is not invertible, since one of its eigenvalues is zero. But what if we took the reciprocal of only the nonzero diagonal elements to form \[A_3 = \frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}\begin{pmatrix}{\small\frac{1}{2}} &0\\ 0 &0\end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix}1 &1\\ -1 &1\end{pmatrix}^\top = \frac{1}{4}\begin{pmatrix}1& -1\\ -1 &1\end{pmatrix}?\] Sadly, \(A_3\) is not the inverse of \(A_2\); that would be very surprising, since \(A_2\) is not invertible at all. So what then is \(A_3\)? Well, \(A_2\) and \(A_3\) are pseudo-inverses of each other. This construction will be generalized later to non-square matrices by constructing the SVD of a matrix. As a refresher please watch the following video:

[Embedded video: refresher on the singular value decomposition]
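We can check this numerically as well: reciprocate only the nonzero eigenvalue and compare the result with NumPy's built-in pseudo-inverse. A sketch (the \(10^{-12}\) tolerance for deciding which eigenvalues count as zero is an arbitrary choice):

```python
import numpy as np

A2 = np.array([[1.0, -1.0], [-1.0, 1.0]])

evals, U = np.linalg.eigh(A2)

# Reciprocate only the eigenvalues that are numerically nonzero.
inv_evals = np.array([1.0 / ev if abs(ev) > 1e-12 else 0.0 for ev in evals])
A3 = U @ np.diag(inv_evals) @ U.T

print(A3)                                   # [[ 0.25 -0.25], [-0.25  0.25]]
print(np.allclose(A3, np.linalg.pinv(A2)))  # True: this is the pseudo-inverse of A2
print(np.allclose(A3 @ A2, np.eye(2)))      # False: A3 is not a true inverse
```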

Here is how the pseudo-inverse is connected to the solution of a least-squares problem. If the linear system \(Ax = b\) has any solutions, then they have the form \[x = A^+ b + \left(I - A^+ A\right)\xi\] for some arbitrary vector \(\xi\). Multiplying by \(A\) on the left, and using the property \(A A^+ A = A\) of the pseudo-inverse, gives \[Ax = A A^+ b + \left(A - A A^+ A\right)\xi = A A^+ b = b.\] So for any solution to exist we need the admissibility condition \(AA^+ b = b\). For linear systems \(A x = b\) with non-unique solutions, as in the underdetermined case, the pseudo-inverse may be used to construct the solution of minimum Euclidean norm \(\|x\|\) among all solutions: if \(A x = b\) is admissible (\(AA^+ b = b\)), the vector \(y = A^+b\) is a solution, and it satisfies \(\|y\| \le \|x\|\) for every solution \(x\).
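Both facts, the admissibility condition and the minimum-norm property, can be verified numerically for \(A_2\) from above; a sketch, where the right-hand sides and the vector \(\xi\) are arbitrary choices for illustration:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # A2 from above
A_plus = np.linalg.pinv(A)

b_good = np.array([2.0, -2.0])   # satisfies b2 = -b1
b_bad  = np.array([2.0,  1.0])   # does not

# Admissibility: A A^+ b = b exactly when a solution exists.
print(np.allclose(A @ A_plus @ b_good, b_good))   # True
print(np.allclose(A @ A_plus @ b_bad, b_bad))     # False

# For an admissible b, x = A^+ b + (I - A^+ A) xi solves Ax = b for any xi,
# and A^+ b has the smallest norm among all such solutions.
xi = np.array([5.0, -3.0])
x_general = A_plus @ b_good + (np.eye(2) - A_plus @ A) @ xi
print(np.allclose(A @ x_general, b_good))                            # True
print(np.linalg.norm(A_plus @ b_good) <= np.linalg.norm(x_general))  # True
```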

One final example should tie this all together.
Example: Consider finding the solution to \(A_2x = b\) that minimizes \(\|x\|\).
Row reducing \(A_2x = b\), with \(b = (b_1\ b_2)^\top\), reveals that \(b_2=-b_1\) is required to ensure a solution, and
\[
\begin{pmatrix}x_1\\x_2\end{pmatrix} = \begin{pmatrix}b_1\\0\end{pmatrix} + s\begin{pmatrix}1\\1\end{pmatrix}
\] for any \(s\in\mathbb{R}\). Continuing, \(\|x\|^2 = (b_1+s)^2+s^2\) which is minimized when \(2(b_1+s)+2s=0\) or when \(s = -\frac{b_1}{2}\) so that the solution is
\[
\begin{pmatrix}x_1\\x_2\end{pmatrix} = \begin{pmatrix}b_1\\0\end{pmatrix} - \frac{b_1}{2}\begin{pmatrix}1\\1\end{pmatrix}=\frac{b_1}{2}\begin{pmatrix}1\\-1\end{pmatrix}.
\] Using the pseudo-inverse of \(A_2\), namely \(A_2^+ = A_3\), the admissibility condition \(A_2A_2^+ b = b\) simplifies to \(b_2 = -b_1\), and the solution is
\[
\begin{pmatrix}x_1\\x_2\end{pmatrix} = A_2^+b = \frac{1}{4}\begin{pmatrix}1&-1\\-1&1\end{pmatrix}\begin{pmatrix}b_1\\b_2\end{pmatrix}=\frac{1}{4}\begin{pmatrix}b_1-b_2\\-b_1+b_2\end{pmatrix}=\frac{b_1}{2}\begin{pmatrix}1\\-1\end{pmatrix}.
\]

Summary

At the beginning of this post, the Moore-Penrose pseudo-inverse generalized the idea of an inverse to non-square matrices; a second notion of pseudo-inverse then arose for symmetric matrices that have at least one zero eigenvalue. This second notion can be generalized (using the SVD) to non-square matrices and to matrices that are not symmetric, where the eigenvectors are not guaranteed to form an orthonormal set. In all cases, the pseudo-inverse is implicitly tied to the notion of finding solutions with minimal norm.
