Orthogonality

The orthogonal complement V^\perp of a vector space V\subset \mathbb{R}^n is the set of vectors in \mathbb{R}^n which are orthogonal to every vector in V. For example, the orthogonal complement of a two-dimensional subspace V of \mathbb{R}^3 is the line through the origin perpendicular to the plane V.

Exercise
Which subspace is the orthogonal complement of the span of the columns of a matrix A?

Solution. The correct answer is the null space of A', because for a vector \mathbf{x} to be orthogonal to all of the columns of A, the equation A' \mathbf{x} = \boldsymbol{0} must hold, since each entry of A'\mathbf{x} is the dot product of a column of A with \mathbf{x}. In other words, \mathbf{x} must be in the null space of A'.

Furthermore, if \mathbf{x} is in the null space of A', then it is orthogonal to any linear combination of the columns of A, since

\begin{align*}(c_1 \mathbf{a}_1 + c_2 \mathbf{a}_2 + \cdots + c_n \mathbf{a}_n) \cdot \mathbf{x} &= c_{1}\mathbf{a}_{1}\cdot\mathbf{x}+c_{2} \mathbf{a}_{2}\cdot\mathbf{x} +\cdots+c_{n}\mathbf{a}_{n}\cdot\mathbf{x} \\ &= 0 + 0 + \cdots + 0 \\ &= 0.\end{align*}

Therefore, the orthogonal complement of the span of the columns of a matrix A is equal to the null space of A'.
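
As a quick numerical check (a minimal NumPy sketch; the small matrix below is our own illustrative example), a vector satisfying A' \mathbf{x} = \boldsymbol{0} is indeed orthogonal to each column of A and to any linear combination of them:

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [0.0, 0.0]])     # columns span the xy-plane in R^3
x = np.array([0.0, 0.0, 3.0])  # satisfies A' x = 0

print(A.T @ x)                          # [0. 0.]: x is in the null space of A'
print(A[:, 0] @ x, A[:, 1] @ x)         # 0.0 0.0: x is orthogonal to each column
print((2 * A[:, 0] - 5 * A[:, 1]) @ x)  # 0.0: and to any linear combination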

Orthogonal decomposition

For any vectors \mathbf{u} and \mathbf{v} in \mathbb{R}^n, it is always possible to write \mathbf{u} as the sum of a multiple of \mathbf{v} and a vector which is perpendicular to \mathbf{v}:

Exercise (Orthogonal decomposition)
Suppose that \mathbf{u} and \mathbf{v} are nonzero vectors in \mathbb{R}^n. Solve the equation

\begin{align*}k\mathbf{v} \cdot (\mathbf{u} - k\mathbf{v}) = 0\end{align*}

for k to find the multiple of \mathbf{v} which is perpendicular to its difference with \mathbf{u}.

Solution. The equation gives k\mathbf{u}\cdot \mathbf{v} - k^2 |\mathbf{v}|^2 = 0, which implies that

\begin{align*}k = \frac{\mathbf{u}\cdot \mathbf{v}}{|\mathbf{v}|^2}.\end{align*}

If \mathbf{u} is equal to k\mathbf{v} + \mathbf{w} where \mathbf{w} is perpendicular to \mathbf{v}, then we call k\mathbf{v} the projection of \mathbf{u} onto \mathbf{v}. As the geometric intuition suggests, the projection of \mathbf{u} onto \mathbf{v} is the closest vector to \mathbf{u} among all vectors parallel to \mathbf{v}.
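To make the projection formula concrete, here is a minimal NumPy sketch (the particular vectors are our own example) that computes k = (\mathbf{u}\cdot\mathbf{v})/|\mathbf{v}|^2 and checks that \mathbf{u} - k\mathbf{v} is perpendicular to \mathbf{v}:

import numpy as np

u = np.array([3.0, 1.0])
v = np.array([2.0, 0.0])

k = (u @ v) / (v @ v)   # k = (u . v) / |v|^2
proj = k * v            # projection of u onto v
w = u - proj            # remainder

print(proj)     # [3. 0.]
print(w @ v)    # 0.0, so w is perpendicular to v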

Orthogonal bases can be much more geometrically and algebraically useful than bases which are not orthogonal. Fortunately, it is always possible to orthogonalize a basis without changing its span:

Theorem (Gram-Schmidt)
Every vector space V\subset \mathbb{R}^n has an orthogonal basis.

Proof. Suppose that \mathcal{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_k\} is a basis of V. We will build our orthogonal basis by orthogonalizing \mathcal{V} one vector at a time.

Define \mathbf{b}_1 = \mathbf{v}_1, and then define \mathbf{b}_2 to be the part of \mathbf{v}_2 which is orthogonal to \mathbf{b}_1:

\begin{align*}\mathbf{b}_2 = \mathbf{v}_2 - \frac{\mathbf{b}_1 \cdot \mathbf{v}_2}{|\mathbf{b}_1|^2} \mathbf{b}_1.\end{align*}

Similarly, we project \mathbf{v}_3 onto \mathbf{b}_1 and onto \mathbf{b}_2 and subtract off both of these projections:

\begin{align*}\mathbf{b}_3 = \mathbf{v}_3 - \frac{\mathbf{b}_2 \cdot \mathbf{v}_3}{|\mathbf{b}_2|^2}\mathbf{b}_2 - \frac{\mathbf{b}_1 \cdot \mathbf{v}_3}{|\mathbf{b}_1|^2}\mathbf{b}_1.\end{align*}

Then \{\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3\} has the same span as \{\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3\} and is pairwise orthogonal. We may continue this procedure (project each new \mathbf{v}_i onto each of the preceding \mathbf{b}'s and subtract off all of these projections) until we reach the end of the list, thereby obtaining an orthogonal basis of V.
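
The procedure in this proof translates directly into code. The following is a minimal NumPy sketch (the function name gram_schmidt and the example vectors are ours, not from the text) which orthogonalizes a list of basis vectors one at a time:

import numpy as np

def gram_schmidt(vectors):
    # returns an orthogonal list with the same span as `vectors`
    basis = []
    for v in vectors:
        b = v.astype(float)
        for b_prev in basis:
            # subtract the projection of v onto each previously computed vector
            b = b - (b_prev @ v) / (b_prev @ b_prev) * b_prev
        basis.append(b)
    return basis

b1, b2, b3 = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                           np.array([1.0, 0.0, 1.0]),
                           np.array([0.0, 1.0, 1.0])])
print(b1 @ b2, b1 @ b3, b2 @ b3)   # all (numerically) zero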

Theorem
Suppose V\subset \mathbb{R}^n is a vector space. Then every vector \mathbf{u} \in \mathbb{R}^n can be written as the sum of a vector in V and a vector in V^\perp.

Proof. Consider an orthogonal basis \{\mathbf{v}_1, \ldots, \mathbf{v}_k\} of V, and define

\begin{align*}\mathbf{v} = \frac{\mathbf{v}_1 \cdot \mathbf{u}}{|\mathbf{v}_1|^2}\mathbf{v}_1 + \cdots + \frac{\mathbf{v}_k \cdot \mathbf{u}}{|\mathbf{v}_k|^2}\mathbf{v}_k\end{align*}

Then \mathbf{v} is in V, and \mathbf{u} - \mathbf{v} is orthogonal to all of the \mathbf{v}_i's and therefore to every vector in V.
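
Here is a brief NumPy illustration of this decomposition (the orthogonal basis and the vector \mathbf{u} below are our own example): summing the projections of \mathbf{u} onto the basis vectors gives the part of \mathbf{u} lying in V, and the remainder is orthogonal to V:

import numpy as np

v1 = np.array([1.0, 1.0, 0.0])   # an orthogonal basis
v2 = np.array([1.0, -1.0, 0.0])  # for the xy-plane V in R^3
u = np.array([2.0, 3.0, 5.0])

v = (v1 @ u) / (v1 @ v1) * v1 + (v2 @ u) / (v2 @ v2) * v2

print(v)                            # [2. 3. 0.], the component of u in V
print((u - v) @ v1, (u - v) @ v2)   # both 0.0: u - v is in V-perp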

Exercise
Suppose that V is a d-dimensional vector space in \mathbb{R}^n. Show that there is a basis of \mathbb{R}^n whose first d vectors form a basis for V and whose last n-d vectors form a basis for V^\perp.

Solution. Suppose that \{\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{d}\} is a basis for V and \{\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{e}\} is a basis for W = V^\perp. We claim that

\begin{align*}\{\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{d}, \mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{e} \}\end{align*}

is a basis for \mathbb{R}^n.

First, the list is linearly independent because no vector is in the span of the preceding vectors: (1) the \mathbf{v}_i's are linearly independent, so none of them is in the span of the preceding vectors, and (2) if, for some i, the vector \mathbf{w}_i were in the span of the preceding vectors in the list, then it could be written as \mathbf{v} + \mathbf{w} for some vector \mathbf{v} in V and some vector \mathbf{w} in the span of \mathbf{w}_1, \ldots, \mathbf{w}_{i-1}. Dotting both sides of the equation

\begin{align*}\mathbf{v} + \mathbf{w} = \mathbf{w}_i\end{align*}

with \mathbf{v}, we find that |\mathbf{v}|^2 = 0 (since \mathbf{w} and \mathbf{w}_i both lie in V^\perp and are therefore orthogonal to \mathbf{v}), which implies that \mathbf{v} = \boldsymbol{0}. Thus \mathbf{w} = \mathbf{w}_i, which is not compatible with the fact that the \mathbf{w}_i's form a basis.

Therefore, we conclude that \{\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{d}, \mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{e}\} is linearly independent.

Finally, the list \{\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{d}, \mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{e} \} spans \mathbb{R}^n since every vector in \mathbb{R}^n can be written as a sum of a vector in V and a vector in V^\perp.
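
We can also verify this result numerically (a short sketch using NumPy and SciPy; the example subspace is our own choice). Since V^\perp is the null space of A', where the columns of A form a basis for V, we can use scipy.linalg.null_space to get a basis for V^\perp and check that the combined list has rank n:

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])   # columns form a basis for a 2-dimensional V in R^3

W = null_space(A.T)          # columns form a basis for V-perp

combined = np.hstack([A, W])
print(np.linalg.matrix_rank(combined))   # 3: the combined list is a basis of R^3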

Orthogonal matrices

Suppose we can write a given transformation T as a composition involving (i) a single transformation \Lambda which scales space along the coordinate axes, and (ii) some other transformations which preserve distances and angles—like rotations and reflections in \mathbb{R}^2 or \mathbb{R}^3. Such a decomposition of T would be useful because it isolates the space-distorting behavior of T in the simple transformation \Lambda. In preparation for developing such a decomposition, let's study transformations which are distance-preserving and angle-preserving.

A transformation \mathbf{x}\mapsto U\mathbf{x} from \mathbb{R}^n to \mathbb{R}^n is distance-preserving if the norm of \mathbf{x} is the same as the norm of U\mathbf{x} for all \mathbf{x} \in \mathbb{R}^n. Using dot products, we can write the distance-preserving condition as

\begin{align*}\mathbf{x} \cdot \mathbf{x} = (U\mathbf{x}) \cdot (U\mathbf{x})\end{align*}

If the transformation preserves angles as well as distances, then (U\mathbf{x}) \cdot (U\mathbf{y}) must also be equal to \mathbf{x} \cdot \mathbf{y} for all \mathbf{x} and \mathbf{y} in \mathbb{R}^n. Rewriting this equation using transposes, we see that we want

\begin{align*}\mathbf{x}' \mathbf{y} = \mathbf{x}' U' U\mathbf{y}\end{align*}

for all \mathbf{x} and \mathbf{y} in \mathbb{R}^n. This identity holds for all \mathbf{x} and \mathbf{y} if and only if U' U is equal to the identity matrix. This leads us to the following definition.

Definition (Orthogonal matrix)
A square matrix U is orthogonal if U' U is equal to the identity matrix.

Equivalently, we can say that a square matrix is orthogonal if and only if its columns are orthonormal, which means that they are orthogonal and have unit norm. If a non-square matrix U satisfies U' U = I, then we refer to U as a matrix with orthonormal columns.
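
As a concrete check (a minimal NumPy sketch; the rotation angle is arbitrary), a 2 \times 2 rotation matrix is orthogonal and preserves both norms and dot products:

import numpy as np

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(U.T @ U, np.eye(2)))    # True: U is orthogonal

x = np.array([3.0, -1.0])
y = np.array([2.0, 5.0])
print(np.allclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # norms preserved
print(np.allclose((U @ x) @ (U @ y), x @ y))                   # dot products preserved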

Exercise
(i) Explain why, for an m \times n matrix U with orthonormal columns, we must have m \geq n. (ii) Explain why the rank of U is n.

Solution.

  • We first recall that the number of linearly independent columns in U cannot be greater than m because the range of U is a subspace of \mathbb{R}^m. Now, by definition, the n columns of U are orthogonal and non-zero. These columns must be linearly independent, so n \leq m.

  • The rank of U is equal to the number of linearly independent columns in U, which is n in this case.

If U is an m\times n matrix with orthonormal columns and if n < m, then U U' is an m\times m matrix of rank n and therefore cannot be the identity matrix. In fact, U U' is a projection matrix:

Exercise
Show that if U is an m \times n matrix with orthonormal columns, then UU' is the matrix of the transformation which projects each vector in \mathbb{R}^m onto the n-dimensional subspace of \mathbb{R}^m spanned by the columns of U.

Solution. The transformation which projects a vector \mathbf{w} onto the span of the columns \mathbf{u}_1, \ldots, \mathbf{u}_n of U is given by

\begin{align*}T(\mathbf{w}) = \frac{\mathbf{u}_{1}\cdot\mathbf{w}}{|\mathbf{u}_{1}|^2}\mathbf{u}_{1} +\frac{\mathbf{u}_{2}\cdot\mathbf{w}}{|\mathbf{u}_{2}|^2}\mathbf{u}_{2}+ \cdots+\frac{\mathbf{u}_{n}\cdot\mathbf{w}}{|\mathbf{u}_{n}|^2}\mathbf{u}_{n}.\end{align*}

All of the denominators in this formula are equal to 1 because the columns of U are unit vectors. So

\begin{align*}T(\mathbf{w}) = \mathbf{u}_1(\mathbf{u}_1\cdot \mathbf{w})+ \cdots + \mathbf{u}_n(\mathbf{u}_n\cdot \mathbf{w}).\end{align*}

The vector whose components are the expressions in parentheses, namely [\mathbf{u}_1\cdot \mathbf{w}, \ldots, \mathbf{u}_n\cdot \mathbf{w}], is equal to U' \mathbf{w}, by the definition of the matrix-vector product. Applying that definition a second time (interpreting \mathbf{u}_1(\mathbf{u}_1\cdot \mathbf{w})+ \cdots + \mathbf{u}_n(\mathbf{u}_n\cdot \mathbf{w}) as a linear combination of the \mathbf{u}_i's with weights given by the parenthetical dot products), we find that T(\mathbf{w}) = UU' \mathbf{w}.
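
The following NumPy sketch illustrates this conclusion (we use np.linalg.qr on a random tall matrix simply to manufacture orthonormal columns): applying U U' to a vector agrees with the sum-of-projections formula above, even though U U' is not the identity:

import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 2)))   # 5 x 2, orthonormal columns

print(np.allclose(U.T @ U, np.eye(2)))   # True
print(np.allclose(U @ U.T, np.eye(5)))   # False: UU' is a projection, not the identity

w = rng.standard_normal(5)
proj = sum((U[:, j] @ w) * U[:, j] for j in range(2))   # sum of projections onto columns
print(np.allclose(U @ U.T @ w, proj))                   # True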

Exercise
Let \mathbf{v} be a vector in \mathbb{R}^n, and consider the linear transformation T: \mathbb{R}^{n} \to \mathbb{R} defined as T(\mathbf{x}) = \mathbf{v} \cdot \mathbf{x}. What is the rank of T? Geometrically describe the null space of T.

Solution. The rank of T is 1 if \mathbf{v} \neq \boldsymbol{0}; otherwise the rank is 0. Geometrically, the null space of T is the set of vectors in \mathbb{R}^n that are orthogonal to \mathbf{v}: when \mathbf{v} \neq \boldsymbol{0}, this is the (n-1)-dimensional hyperplane through the origin perpendicular to \mathbf{v}.

An application: linear regression

One of the most common methods in statistics is linear regression. Given n columns of numerical data, we seek a linear combination of the first n-1 columns which gets as close as possible to the last column. This can be helpful if the last column contains values you want to predict and the other columns contain data which is accessible at the time of prediction. For example, the last column might contain points scored by a given player in a Game 7 of a playoff series, while the previous n-1 = 6 columns contain the number of points scored by that player in the first 6 games of that series.

Suppose that the first n-1 columns are arranged into a matrix A and the last column is called \mathbf{b}. Since A\mathbf{x} represents an arbitrary linear combination of the columns of A, we are looking for the value of \mathbf{x} which minimizes the squared norm |A\mathbf{x} - \mathbf{b}|^2. Geometrically, it makes sense that if \mathbf{x} is chosen to minimize this quantity, then A\mathbf{x} - \mathbf{b} will be orthogonal to every column of A. In other words, we will have

\begin{align*}A'(A\mathbf{x} - \mathbf{b}) = 0,\end{align*}

which implies that \mathbf{x} = (A'A)^{-1}(A'\mathbf{b}), assuming A'A is invertible. This intuition is accurate, and the equation we derived is called the normal equation.
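
As a quick sanity check (a NumPy sketch on synthetic full-rank data of our own choosing), solving the normal equation agrees with NumPy's built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # solve the normal equation
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # built-in least-squares solver

print(np.allclose(x_normal, x_lstsq))              # True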

Exercise
Use the code below to build a random 100 × 6 matrix whose first five columns are linearly dependent and whose sixth column is not in the span of the first five. Use the normal equation to try to solve for the weights of the linear combination of the first five columns which gets closest to the sixth column. What goes wrong?

Note: np.linalg.solve(A,b) solves the equation A\mathbf{x} = \mathbf{b} for \mathbf{x}.

import numpy as np
A = np.random.randint(0,2,(100,6))
A[:,4] = A[:,3] + A[:,2]

Solution. We try

b = A[:,5]
A = A[:,:-1]
np.linalg.solve(A.T @ A, A.T @ b)

and we get an error telling us that A'A is not invertible. This makes sense, because A'A has the same rank as A, and we know A is rank deficient. Since there are many different ways of combining the columns of A to get the vector in its column space which is closest to \mathbf{b}, we could not have gotten a unique answer from this method.

Exercise
Try the previous exercise again, but this time with the linear dependence relation holding only approximately. What goes wrong this time?

import numpy as np
A = 1.0*np.random.randint(0,2,(100,5))
b = np.random.randint(0,2,(100,))
A[:,4] = A[:,3] + A[:,2] + 1e-3*np.random.standard_normal(100)

Solution. We take a look at the solution:

import matplotlib.pyplot as plt
plt.bar(range(5), np.linalg.solve(A.T @ A, A.T @ b), label="solution components")
plt.legend()
plt.show()

We see that it gives large and oppositely-signed coefficients for the last three vectors. We can tell that the optimization process is leveraging the tiny difference between the last vector and the sum of the two before it to "reach" in the direction of \mathbf{b}. Although we did not get a singularity error this time, the result is no less undesirable, because predictions which depend on tiny differences between measured values are clearly not going to be useful. We will see what we can do about this problem when we develop the singular value decomposition.
