Chapter 39 Least Squares Solution of Linear Systems
Per Christian Hansen, Technical University of Denmark
Hans Bruun Nielsen, Technical University of Denmark

Contents
39.1 Basic Concepts
39.2 Least Squares Data Fitting
39.3 Geometric and Algebraic Aspects
39.4 Orthogonal Factorizations
39.5 Least Squares Algorithms
39.6 Sensitivity
39.7 Up- and Downdating of QR Factorization
39.8 Damped Least Squares
39.9 Rank Revealing Decompositions
References

39.1 Basic Concepts

(See Chapter 5 for additional information.)

Definitions:

Given a vector $b \in \mathbb{R}^m$ and a matrix $A \in \mathbb{R}^{m\times n}$ with $m > n$, the least squares problem is to find a vector $x_0 \in \mathbb{R}^n$ that minimizes the Euclidean length of the difference between $Ax$ and $b$:

Problem LS: Find $x_0$ satisfying $\|b - Ax_0\|_2 = \min_x \|b - Ax\|_2$.

Such an $x_0$ is called a least squares solution. For any vector $x$ the vector $r = r(x) = b - Ax$ is the residual vector. The residual of a least squares solution is denoted by $r_0$.

The least squares problem is consistent if $b \in \mathrm{range}(A)$.

A basic solution, $x_{0B}$, is a least squares solution with at least $n - \mathrm{rank}(A)$ zero components. The minimum-norm least squares solution, $x_{0M}$, is the least squares solution of minimum Euclidean norm.

In a weighted least squares problem, we are also given weights $w_i \ge 0$ for $i = 1, \ldots, m$, and the objective is to minimize $\|W(b - Ax)\|_2$, where $W = \mathrm{diag}(w_1, \ldots, w_m)$. This is an important special case of the generalized least squares problem:

Problem GLS: Let $Ax + Bv = b$, where $B \in \mathbb{R}^{m\times p}$ with $p \le m$. Find $x_G$, $v_G$ such that $\|v\|_2$ is minimized.

Note that $Bv$ plays the role of the residual vector.

In the total least squares problem, we allow for errors both in the vector $b$ and in the matrix $A$:

Problem TLS: Let $(A + E)x + r = b$, where $E \in \mathbb{R}^{m\times n}$ with $n \le m$. Find $x_T$ such that $\|(E,\ r)\|_F$ is minimized.

$\|\cdot\|_F$ denotes the Frobenius norm.

In this chapter, $A_{i,:}$ and $A_{:,j}$ denote the vectors given by the elements in the $i$th row and $j$th column of the matrix $A$, respectively. Similarly, $A_{k:l,:}$ (or $A_{:,k:l}$) is the submatrix consisting of rows (or columns) $k$ through $l$ of $A$. In the examples in this chapter, the computation is done with about 16 digits accuracy, but the displayed results are rounded to fewer digits.

Facts:

(See, e.g., Chapters 1 and 2 in [Bjo96].)

1. If $m > n = \mathrm{rank}(A)$, then the least squares solution $x_0$ is analytically equivalent to the solution of the normal equations $A^T A\, x = A^T b$.
2. If the least squares problem is consistent, then $r_0 = 0$.
3. The least squares solution is unique if $m \ge n$ and $A$ has full rank.
4. If the system is underdetermined ($m < n$) or if $A$ is rank deficient, then the solution to Problem LS is not unique. Also, a basic solution is not unique.
5. The minimum-norm least squares solution $x_{0M}$ is always unique.
6. If $m > n = \mathrm{rank}(A)$, then the least squares solution can be written as $x_0 = A^\dagger b$, where the matrix $A^\dagger$ is the Moore–Penrose generalized inverse or pseudoinverse of $A$. (See Section 5.7.) In general, $A^\dagger$ produces the minimum-norm solution: $x_{0M} = A^\dagger b$.
7. If $B$ is nonsingular, then $x_G$ minimizes $\|B^{-1}(b - Ax)\|_2$.
8. If the covariance matrix for $b$ has the Cholesky factorization $\mathrm{Cov}(b) = C^T C$, then $x_G$ is the best linear unbiased estimate (BLUE) in the general linear model with $B = C^T$.
9. If $\mathrm{Cov}(b)$ has full rank, then $x_G$ is the solution to the least squares problem $\min \|C^{-T}(b - Ax)\|_2$.
10. In particular, if $\mathrm{Cov}(b) = \sigma^2 I$, then $x_G = x_0$ and $\mathrm{Cov}(x_0) = \sigma^2 (A^T A)^{-1}$.
11. An English translation of the original work on least squares problems by C. F. Gauss is available in [Gau95].

Examples:

1. Consider Problem LS with $m = 3$ and $n = 2$:
\[
\min_x \left\| \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 0.75 \\ 1.13 \\ 1.39 \end{bmatrix} \right\|_2 .
\]
The associated normal equations and the least squares solution are
\[
\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3.27 \\ 7.18 \end{bmatrix}, \qquad x_0 = \begin{bmatrix} 0.45 \\ 0.32 \end{bmatrix}.
\]
The residual vector corresponding to $x_0$ is
\[
r_0 = r(x_0) = b - Ax_0 = \begin{bmatrix} -0.02 \\ 0.04 \\ -0.02 \end{bmatrix},
\]
and $A^T r_0 = 0$.

2. If we use the weights $w_1 = 10$ and $w_2 = w_3 = 1$, the problem is changed to
\[
\min_x \left\| \begin{bmatrix} 10 & 10 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 7.5 \\ 1.13 \\ 1.39 \end{bmatrix} \right\|_2 ,
\]
whose least squares solution and corresponding residual are
\[
x_0 = \begin{bmatrix} 0.41838 \\ 0.33186 \end{bmatrix}, \qquad r_0 = \begin{bmatrix} -0.00024 \\ 0.04790 \\ -0.02395 \end{bmatrix}.
\]
Note that the first component of $r_0$ is reduced when $w_1$ is increased from 1 to 10.
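The definitions and Example 1 above map directly onto standard numerical routines. The following is a minimal NumPy sketch of Problem LS for the data of Example 1; the use of numpy.linalg.lstsq and numpy.linalg.pinv here is an illustrative choice, not something prescribed by this chapter.

```python
# Minimal sketch of Problem LS on the 3x2 data of Example 1 (assumes NumPy).
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # m = 3, n = 2
b = np.array([0.75, 1.13, 1.39])

x0, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||b - A x||_2
r0 = b - A @ x0                              # residual vector r(x0)

print(x0)          # approx (0.45, 0.32)
print(r0)          # approx (-0.02, 0.04, -0.02)
print(A.T @ r0)    # approx 0, i.e., the normal equations A^T A x = A^T b hold

# A has full rank, so the pseudoinverse of Fact 6 gives the same solution.
print(np.linalg.pinv(A) @ b)
```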
39.2 Least Squares Data Fitting

Definitions:

Given $m$ data points $(t_i, y_i)$, $i = 1, \ldots, m$, and $n$ linearly independent functions $f_j$, $j = 1, \ldots, n$ (with $m > n$), find the linear combination
\[
F(x, t) = \sum_{j=1}^{n} x_j f_j(t)
\]
that minimizes the sum of squared residuals $y_i - F(x, t_i)$ at the data points:
\[
\min_x \sum_{i=1}^{m} \bigl( y_i - F(x, t_i) \bigr)^2 .
\]
The coefficients $x_j$ are the components of the least squares solution to $\min \|b - Ax\|_2$, where the columns $A_{:,j}$ of $A$ are samples of $f_j$ at the $t_i$ and the elements of $b$ are the values $y_i$:
\[
A_{ij} = f_j(t_i), \qquad b_i = y_i, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, n.
\]
The solution $F(x_0, t)$ is said to fit the data in the least squares sense.

Facts:

(See, e.g., Chapter 4 in [Bjo96].)

1. The fit can be made more robust to outliers by solving a weighted least squares problem $\min \|W(b - Ax)\|_2$ with $W = \mathrm{diag}(w_1, \ldots, w_m)$, $w_i = \psi(r_i) = \psi(b_i - A_{i,:}x)$, where $\psi$ is a convex function. This problem is usually solved by an iteratively reweighted least squares algorithm.
2. In orthogonal distance fitting, instead of minimizing the residuals one minimizes the orthogonal distances between the fitting function $F$ and the data points. Important examples are fitting of a circle, an arc, or an ellipse to data points.

Examples:

1. Given $f_1(t) = 1$ and $f_2(t) = t$, find the least squares fit to the data points $(1, 0.75)$, $(2, 1.13)$, and $(3, 1.39)$. We get the $A$, $b$, and $x_0$ from Example 1 in Section 39.1, and $F(x_0, t) = 0.45 + 0.32t$. If the third data point is changed to $(3, 13.9)$, then the least squares solution changes to $x_0 = (-7.890,\ 6.575)^T$, and the least squares fit becomes $F(x_0, t) = -7.890 + 6.575t$; this illustrates the sensitivity to outliers.
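A short sketch of Example 1, assuming NumPy: the basis functions $f_1(t) = 1$ and $f_2(t) = t$ become the two columns of $A$, and replacing the third data point by the outlier reproduces the second fit.

```python
# Least squares data fitting for Example 1 of Section 39.2 (assumes NumPy).
import numpy as np

t = np.array([1.0, 2.0, 3.0])
y = np.array([0.75, 1.13, 1.39])

A = np.column_stack([np.ones_like(t), t])     # columns are f_1(t_i) and f_2(t_i)
x0, *_ = np.linalg.lstsq(A, y, rcond=None)
print(x0)                                     # approx (0.45, 0.32): F = 0.45 + 0.32 t

# Changing the third data point to (3, 13.9) shows the sensitivity to outliers.
y_out = np.array([0.75, 1.13, 13.9])
x_out, *_ = np.linalg.lstsq(A, y_out, rcond=None)
print(x_out)                                  # approx (-7.890, 6.575)
```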
39.3 Geometric and Algebraic Aspects

Definitions:

The columns of $A \in \mathbb{R}^{m\times n}$ span the range of $A$, while the nullspace or kernel of $A$ is the set of solutions to the homogeneous system $Ax = 0$:
\[
\mathrm{range}(A) = \{ z = Ax \mid x \in \mathbb{R}^n \}, \qquad \ker(A) = \{ x \in \mathbb{R}^n \mid Ax = 0 \}.
\]
The four fundamental subspaces associated with Problem LS are $\mathrm{range}(A)$, $\ker(A^T)$, $\ker(A)$, and $\mathrm{range}(A^T)$. (See Section 2.4 for more information.)

Facts:

The first three facts can be found in [Str88, Sec. 2.4]; the remaining facts are discussed in [Bjo96, Chap. 1]. $p$ denotes the rank of $A$: $p = \mathrm{rank}(A)$.

1. If $p = n < m$, then the vector $0$ of all zeros is the only element in $\ker(A)$.
2. The spaces $\mathrm{range}(A)$ and $\ker(A^T)$ are subspaces of $\mathbb{R}^m$ with dimensions $p$ and $m-p$, respectively. The two spaces are orthogonal complements, i.e., $y^T z = 0$ for any pair $(y \in \mathrm{range}(A),\ z \in \ker(A^T))$, and $\mathrm{range}(A) \oplus \ker(A^T) = \mathbb{R}^m$.
3. The spaces $\ker(A)$ and $\mathrm{range}(A^T)$ are subspaces of $\mathbb{R}^n$ with dimensions $n-p$ and $p$, respectively. The two spaces are orthogonal complements.
4. The least squares residual vector $r_0 = b - Ax_0$ is an element of $\ker(A^T)$. Combining this with the definition of $r$, we get the so-called augmented system associated with Problem LS:
\[
\begin{bmatrix} I & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} r \\ x \end{bmatrix} = \begin{bmatrix} b \\ 0 \end{bmatrix}.
\]
If $p = n$, then the augmented system is nonsingular and the solution components are $r_0$ and $x_0$.
5. The vector $Ax_0$ is the orthogonal projection of $b$ onto $\mathrm{range}(A)$.
6. The vector $r_0$ is the orthogonal projection of $b$ onto $\ker(A^T)$.
7. If $p < n$, then the columns in $A$ can be reordered such that $A\Pi = [\hat{A}\ \ \tilde{A}]$, where $\Pi$ is a permutation matrix and the submatrix $\hat{A}$ has $p$ columns with $\mathrm{range}(\hat{A}) = \mathrm{range}(A)$. The permutation is not unique.
8. The orthogonal projectors onto $\mathrm{range}(A)$ and $\ker(A)$ are given by $AA^\dagger$ and $I - A^\dagger A$, respectively.

Examples:

1. Figure 39.1 illustrates Facts 5 and 6 in the case $m = 3$, $n = p = 2$.
[Figure 39.1: the vector $b$, its orthogonal projection $Ax_0$ onto the plane $\mathrm{range}(A)$, and the residual $r_0$ orthogonal to that plane.]
2. For the problem in Example 1 in Section 39.1, both $\mathrm{range}(A)$ and $\ker(A^T)$ are subspaces of $\mathbb{R}^3$ given by, respectively,
\[
\mathrm{range}(A) = \alpha \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + \beta \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad \ker(A^T) = \gamma \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix},
\]
with $\alpha, \beta, \gamma \in \mathbb{R}$.
3. Fact 4 can be used to derive the normal equations:
\[
r = b - Ax \in \ker(A^T) \quad \Rightarrow \quad A^T (b - Ax) = 0 .
\]
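A sketch of Facts 5, 6, and 8 for the 3×2 example, assuming NumPy; the projectors are formed explicitly only for illustration (in practice one works with factorizations rather than with $AA^\dagger$).

```python
# Orthogonal projectors onto range(A) and ker(A^T) for the 3x2 example (assumes NumPy).
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([0.75, 1.13, 1.39])

A_pinv = np.linalg.pinv(A)
P_range = A @ A_pinv              # orthogonal projector onto range(A), cf. Fact 8
P_kerAT = np.eye(3) - P_range     # projector onto ker(A^T), the complement of range(A)

x0, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(P_range @ b, A @ x0))       # Fact 5: A x0 is the projection of b
print(np.allclose(P_kerAT @ b, b - A @ x0))   # Fact 6: r0 is the projection onto ker(A^T)
print(P_kerAT @ np.array([1.0, -2.0, 1.0]))   # (1,-2,1) spans ker(A^T) (Example 2), so it is unchanged
```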
39.4 Orthogonal Factorizations

(See Section 5.5 and Section 38.4 for additional information on orthogonal factorizations.)

Definitions:

The real matrix $Q$ is orthogonal if it is square and satisfies $Q^T Q = I$. A QR factorization of a matrix $A \in \mathbb{R}^{m\times n}$ with $m \ge n$ has the form
\[
A = QR = Q \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix} = \hat{Q}\hat{R},
\]
where $Q \in \mathbb{R}^{m\times m}$ is orthogonal, $R \in \mathbb{R}^{m\times n}$, $\hat{R} \in \mathbb{R}^{n\times n}$ is upper triangular, and $\hat{Q} = Q_{:,1:n}$. The form $A = \hat{Q}\hat{R}$ is the so-called "reduced" (or "skinny") QR factorization.

The singular value decomposition (SVD) of $A \in \mathbb{R}^{m\times n}$ has the form
\[
A = U \Sigma V^T,
\]
where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r) \in \mathbb{R}^{m\times n}$ has diagonal elements
\[
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0, \qquad r = \min\{m, n\}.
\]
Letting $u_j$ and $v_j$ denote the $j$th column of $U$ and $V$, respectively, we can write $A = \sum_{j=1}^{r} \sigma_j u_j v_j^T$. For $k < r$ the matrix $\sum_{j=1}^{k} \sigma_j u_j v_j^T$ is called the truncated SVD approximation to $A$. (See Sections 5.6, 17, and 45 for more information about the singular value decomposition.)

Facts:

Except for Facts 1 and 8, see [Bjo96, Chap. 1]. Also see Section 39.5.

1. [Str88, Chap. 3]. A QR factorization preserves rank: $\mathrm{rank}(R) = \mathrm{rank}(\hat{R}) = \mathrm{rank}(A)$.
2. A QR factorization is not unique, but two factorizations $Q_1 R_1$ and $Q_2 R_2$ always satisfy $\hat{R}_1 = D\hat{R}_2$, where $D$ is diagonal with $D_{ii} = \pm 1$.
3. The triangular factor $\hat{R}$ and the upper triangular Cholesky factor $C$ of the normal equations matrix $A^T A$ always satisfy $\hat{R} = DC$, where $D$ is diagonal with $D_{ii} = \pm 1$.
4. If $A$ has full rank and has the QR factorization $A = QR$, then $x_0$ can be found by back substitution in the upper triangular system $\hat{R}x = Q_{:,1:n}^T b$.
5. If $Q$ has not been saved, then we can use forward and back substitution to solve the seminormal equations: $\hat{R}^T \hat{R}\, x = A^T b$. For reasons of numerical stability, this must be followed by one step of iterative refinement; the complete process is called the corrected seminormal equations method.
6. Let $\mathrm{rank}(A) = p \le n \le m$ and $A = QR$. The columns of $Q_{:,1:p}$ and $Q_{:,p+1:m}$ are orthonormal bases of $\mathrm{range}(A)$ and $\ker(A^T)$, respectively.
7. Let $A = U\Sigma V^T$. Then $p = \mathrm{rank}(A)$ is equal to the number of strictly positive singular values: $\sigma_1 \ge \cdots \ge \sigma_p > 0$, $\sigma_{p+1} = \cdots = \sigma_{\min\{m,n\}} = 0$. The columns of $U_{:,1:p}$ and $U_{:,p+1:m}$ are orthonormal bases of $\mathrm{range}(A)$ and $\ker(A^T)$, respectively, and the columns of $V_{:,1:p}$ and $V_{:,p+1:n}$ are orthonormal bases of $\mathrm{range}(A^T)$ and $\ker(A)$, respectively.
8. [GV96, Chap. 12]. The TLS solution can be computed as follows: compute the SVD of the coefficient matrix $A$ augmented with the right-hand side $b$, i.e., $[A,\ b] = \tilde{U}\tilde{\Sigma}\tilde{V}^T$. If the smallest singular value $\tilde{\sigma}_{n+1}$ is simple, and if $\beta = \tilde{V}_{n+1,n+1} \ne 0$, then
\[
x_T = -\beta^{-1}\, \tilde{V}_{1:n,n+1}, \qquad
E_T = -\tilde{\sigma}_{n+1}\, \tilde{U}_{:,n+1}\, \tilde{V}_{1:n,n+1}^T, \qquad
r_T = -\tilde{\sigma}_{n+1}\, \beta\, \tilde{U}_{:,n+1}.
\]
(A code sketch of this computation follows the example below.)

Examples:

1. For the problem from Example 1 in Section 39.1, we find $[A,\ b] = \tilde{U}\tilde{\Sigma}\tilde{V}^T$ with
\[
\tilde{\Sigma} = \mathrm{diag}(4.515,\ 0.6198,\ 0.0429), \qquad
\tilde{U}_{:,3} = \begin{bmatrix} 0.4248 \\ -0.8107 \\ 0.4029 \end{bmatrix}, \qquad
\tilde{V}_{:,3} = \begin{bmatrix} 0.3950 \\ 0.2796 \\ -0.8751 \end{bmatrix}.
\]
Thus,
\[
x_T = \begin{bmatrix} 0.4513 \\ 0.3195 \end{bmatrix}, \qquad
E_T = \begin{bmatrix} -0.0072 & -0.0051 \\ 0.0137 & 0.0097 \\ -0.0068 & -0.0048 \end{bmatrix}, \qquad
r_T = \begin{bmatrix} -0.015939 \\ 0.030421 \\ -0.015118 \end{bmatrix}.
\]
In Example 1 in Section 39.1 we found $x_0 = (0.4500,\ 0.3200)^T$. The difference between $x_0$ and $x_T$ is small because the problem is almost consistent and $A$ is well conditioned; see Section 39.6. The elements of $r_T$ are about 80% of the elements of $r_0$ given in Example 1 in Section 39.1.
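A sketch of the TLS computation of Fact 8 for the data of Example 1, assuming NumPy; the tilded factors are taken from one SVD of the augmented matrix $[A,\ b]$.

```python
# Total least squares via the SVD of [A, b], following Fact 8 (assumes NumPy).
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([0.75, 1.13, 1.39])
n = A.shape[1]

U, s, Vt = np.linalg.svd(np.column_stack([A, b]))
v = Vt[-1, :]                  # right singular vector of the smallest singular value
beta = v[n]
x_T = -v[:n] / beta            # approx (0.4513, 0.3195), cf. Example 1

E_T = -s[-1] * np.outer(U[:, n], v[:n])
r_T = -s[-1] * beta * U[:, n]
print(x_T)
print(r_T)                     # approx (-0.0159, 0.0304, -0.0151)
```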
39.5 Least Squares Algorithms

Definitions:

By least squares algorithms we mean algorithms for computing the least squares solution efficiently and stably on a computer. The algorithms should take into account the size of the matrix and, if applicable, also its structure.

Facts:

For real systems the following facts can be found in, e.g., [Bjo96], [Bjo04], and [LH95].

1. The algorithm which is least sensitive to the influence of rounding errors is based on the QR factorization of $A$:
(a) Compute the reduced QR factorization $A = \hat{Q}\hat{R}$.
(b) Compute the vector $\beta = \hat{Q}^T b$ (this can be done during the QR factorization algorithm without forming $\hat{Q}$ explicitly).
(c) Compute $x = \hat{R}^{-1}\beta$ via back substitution.
The use of this algorithm was first suggested in [Gol65].
2. If $A$ is well conditioned, the normal equations can be used instead:
(a) Compute the normal equation system $M = A^T A$ and $d = A^T b$.
(b) Compute the Cholesky factorization $M = C^T C$ and $y = C^{-T} d$ (the vector $y$ can be computed during the Cholesky algorithm).
(c) Compute $x = C^{-1} y$ via back substitution.
(Both of these algorithms are sketched in code at the end of this section.)
3. If $A$ or $A^T A$ is a Toeplitz matrix, use an algorithm that utilizes this structure to obtain the computational complexity $O(mn)$.
4. If $A$ is large and sparse, use a sparse QR factorization algorithm that avoids storing the matrix $Q$. If solving a system with the same $A$ but a different right-hand side, use the corrected seminormal equations.
5. Alternatively, if $A$ is large and sparse, it may be preferable to use the augmented system approach because it may lead to less fill-in. Then a symmetric indefinite solver, such as the $LDL^T$ factorization, must be used, cf. [GV96, Sec. 4.4].
6. If $A$ is large and the matrix-vector multiplications with $A$ and $A^T$ can be computed easily, then use the conjugate gradient algorithm on the normal equations. Several implementations are available; CGLS is the classical formulation; LSQR is more accurate for ill-conditioned matrices.
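A sketch of Facts 1 and 2 applied to the 3×2 example from Section 39.1, assuming SciPy for the triangular and Cholesky solves; the particular routines used are an illustrative choice.

```python
# QR-based and normal-equations-based least squares solves (assumes NumPy and SciPy).
import numpy as np
from scipy.linalg import qr, solve_triangular, cholesky

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([0.75, 1.13, 1.39])

# Fact 1: reduced QR factorization, then back substitution.
Q_hat, R_hat = qr(A, mode='economic')
x_qr = solve_triangular(R_hat, Q_hat.T @ b)     # solves R_hat x = Q_hat^T b

# Fact 2: normal equations M x = d with a Cholesky factorization M = C^T C.
M, d = A.T @ A, A.T @ b
C = cholesky(M)                                 # upper triangular C
y = solve_triangular(C, d, trans='T')           # forward substitution: C^T y = d
x_ne = solve_triangular(C, y)                   # back substitution:    C x = y

print(x_qr, x_ne)    # both approx (0.45, 0.32)
```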
39.6 Sensitivity

Definitions:

For $A \in \mathbb{R}^{m\times n}$ with $p = \mathrm{rank}(A) \le \min(m, n)$ the condition number for the least squares problem is given by $\kappa(A) = \sigma_1/\sigma_p$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p > 0$ are the nonzero singular values of $A$.

Let $x_{0M}$ and $\tilde{x}_{0M}$ denote the minimum-norm least squares solutions to the problems $\min \|b - Ax\|_2$ and $\min \|\tilde{b} - \tilde{A}x\|_2$, the latter being a perturbation of the former. Define the quantities
\[
\delta A = \|\tilde{A} - A\|_2, \qquad \delta b = \|\tilde{b} - b\|_2, \qquad \eta = \frac{\delta A}{\sigma_p} = \kappa(A)\,\frac{\delta A}{\|A\|_2}.
\]

Facts:

(See, e.g., [Bjo96, Sec. 1.4] or [LH95, Chap. 9].)

1. If $\eta < 1$, then the rank is not changed by the perturbation: $\mathrm{rank}(\tilde{A}) = \mathrm{rank}(A)$.
2. If $\eta < 1$, then the relative perturbation of $x_{0M}$ is bounded as
\[
\frac{\|\tilde{x}_{0M} - x_{0M}\|_2}{\|x_{0M}\|_2} \;\le\; \frac{\kappa(A)}{1-\eta}\left( \frac{\delta A}{\|A\|_2} + \frac{\delta b + \eta\,\|r_0\|_2}{\|A\|_2\,\|x_{0M}\|_2} \right) + \eta, \qquad r_0 = b - Ax_{0M}.
\]
If $\mathrm{rank}(A) = n$, then the last term $\eta$ is omitted. If the problem is consistent, i.e., $r_0 = 0$, then the relative error can be expected to grow linearly with $\kappa(A)$. For $r_0 \ne 0$ the contribution $\eta\|r_0\|_2$ and the definition of $\eta$ show that the relative error may grow as $\kappa(A)^2$.
3. The condition number of the normal equations matrix is $\kappa(A^T A) = \kappa(A)^2$. Due to the finite computer precision, information may be lost when the normal equations are formed; see, e.g., [Bjo96, Sec. 2.2] and Example 2 below.
4. Component-wise perturbation theory applies when component-wise perturbation bounds are available for the errors in $A$ and $b$; if
\[
|\tilde{A} - A| \le \omega\,|A| \qquad \text{and} \qquad |\tilde{b} - b| \le \omega\,|b|,
\]
where the absolute values and inequalities are interpreted componentwise, then, to first order in $\omega$,
\[
|\tilde{x} - x| \le \omega\left( |A^\dagger|\,(|b| + |A|\,|x|) + |(A^T A)^{-1}|\,|A^T|\,|r| \right)
\]
and
\[
\|\tilde{x} - x\|_\infty \le \omega\, \bigl\|\, |A^\dagger|\,(|A|\,|x| + |b|) + |(A^T A)^{-1}|\,|A^T|\,|r| \,\bigr\|_\infty .
\]
See [Bjo96, Secs. 1.4.5–6] for further references about component-wise perturbation analysis and a posteriori error estimation.

Examples:

1. We consider the problem from Example 1 in Section 39.1 and two perturbed versions of it,
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} 0.75 \\ 1.13 \\ 1.39 \end{bmatrix}, \qquad
\tilde{A} = \begin{bmatrix} 0.8 & 1.1 \\ 0.95 & 2 \\ 1.1 & 2.95 \end{bmatrix}, \quad \tilde{b} = \begin{bmatrix} 0.79 \\ 1.23 \\ 1.30 \end{bmatrix}.
\]
The matrix has full rank, so the least squares solution is unique (and equal to the minimum-norm least squares solution). The vectors
\[
x_0 = \begin{bmatrix} 0.4500 \\ 0.3200 \end{bmatrix}, \qquad x_1 = \begin{bmatrix} 0.5967 \\ 0.2550 \end{bmatrix}, \qquad x_2 = \begin{bmatrix} 0.8935 \\ 0.1280 \end{bmatrix}
\]
are the minimizers of $\|b - Ax\|_2$, $\|\tilde{b} - Ax\|_2$, and $\|\tilde{b} - \tilde{A}x\|_2$, respectively. The matrix has condition number $\kappa(A) = 6.793$, and $\eta = \kappa(A)\,\|\tilde{A} - A\|_2/\|A\|_2 = 0.4230 < 1$. The relative errors and their upper bounds are
\[
\frac{\|x_1 - x_0\|_2}{\|x_0\|_2} = 0.291 \;\le\; \kappa(A)\,\frac{\|\tilde{b} - b\|_2}{\|A\|_2\,\|x_0\|_2} = 0.423
\]
and
\[
\frac{\|x_2 - x_0\|_2}{\|x_0\|_2} = 0.875 \;\le\; \frac{1}{1-\eta}\left( \eta + \kappa(A)\,\frac{\|\tilde{b} - b\|_2 + \eta\,\|b - Ax_0\|_2}{\|A\|_2\,\|x_0\|_2} \right) = 1.575 .
\]
2. Consider the following matrix $A$ and the corresponding normal equations matrix:
\[
A = \begin{bmatrix} 1 & 1 \\ \delta & 0 \\ 0 & \delta \end{bmatrix}, \qquad A^T A = \begin{bmatrix} 1+\delta^2 & 1 \\ 1 & 1+\delta^2 \end{bmatrix}.
\]
If $|\delta| \le \sqrt{\epsilon}$ (where $\epsilon$ is the machine precision), then the quantity $1+\delta^2$ is represented by 1 on the computer and, therefore, the computed $A^T A$ is singular. If we use Householder transformations to compute the QR factorization, we get
\[
\hat{R} = \begin{bmatrix} -1 & -1 \\ 0 & \delta\sqrt{2} \end{bmatrix},
\]
so information about $\delta$ is preserved.

39.7 Up- and Downdating of QR Factorization

Definitions:

Given $A \in \mathbb{R}^{m\times n}$ with $m > n$ and its QR factorization $A = QR$, as well as a row vector $a^T$ with $a \in \mathbb{R}^n$, updating of the factorization means computing the QR factorization of the augmented matrix
\[
\tilde{A} = \begin{bmatrix} A \\ a^T \end{bmatrix}
\]
from the QR factors of $A$. Similarly, downdating means computing the QR factors of $A$ from those of $\tilde{A}$. Up- and downdating algorithms require only $O(mn)$ flops (compared to the $O(mn^2)$ flops of recomputing the QR factors).

Facts:

The following facts can be found in Chapter 3 of [Bjo96].

1. The matrix $R$ can be updated to $\tilde{R}$ without knowledge of $Q$ by means of a sequence of $n$ Givens transformations $G_1, \ldots, G_n$. (A code sketch of this update follows Example 1 below.)
2. The updating of $Q$ to $\tilde{Q}$ then takes the form
\[
\tilde{Q} = \begin{bmatrix} Q & 0 \\ 0 & 1 \end{bmatrix} G_1^T \cdots G_n^T .
\]
3. Downdating of $\tilde{R}$ to $R$ requires the first row $\tilde{Q}_{1,:}$ of $\tilde{Q}$. Let $G_1, \ldots, G_{m-n}$ be Givens rotations such that $\bar{R} = G_{m-n} \cdots G_1 \bigl( \tilde{Q}_{1,:}^T \ \ \tilde{R} \bigr)$ is upper triangular. Then $R = \bar{R}_{2:n+1,\,2:n+1}$.
4. If $\tilde{Q}$ is not available, then its first row $\tilde{Q}_{1,:}$ can be computed by the LINPACK/Saunders algorithm or via hyperbolic rotations. If $A$ is available, the corrected seminormal equations provide a more accurate algorithm.
5. Up- and downdating algorithms are also available for the cases where a column is appended to or deleted from $A$.
6. Up- and downdating of the Cholesky factor under a rank-one modification $\tilde{M} = M \pm a a^T$ is analytically equivalent to updating $R$ from the QR factorization. In the downdating case the matrix $\tilde{M}$ must be positive (semi)definite.

Examples:

1. Let
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = QR =
\begin{bmatrix} 0.5774 & -0.7071 & -0.4082 \\ 0.5774 & 0 & 0.8165 \\ 0.5774 & 0.7071 & -0.4082 \end{bmatrix}
\begin{bmatrix} 1.732 & 3.464 \\ 0 & 1.414 \\ 0 & 0 \end{bmatrix}.
\]
If $a^T = (1\ \ 4)$, then
\[
\tilde{A} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} = \tilde{Q}\tilde{R}
\qquad \text{with} \qquad
\tilde{R} = \begin{bmatrix} 2 & 5 \\ 0 & 2.236 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.
\]
The updated factor $\tilde{R}$ is computed by augmenting $R$ with $a^T$ and applying two left Givens rotations $G_1$ and $G_2$ to the row pairs (1,4) and (2,4), respectively:
\[
\begin{bmatrix} R \\ a^T \end{bmatrix} =
\begin{bmatrix} 1.732 & 3.464 \\ 0 & 1.414 \\ 0 & 0 \\ 1 & 4 \end{bmatrix}
\;\xrightarrow{\,G_1\,}\;
\begin{bmatrix} 2 & 5 \\ 0 & 1.414 \\ 0 & 0 \\ 0 & 1.732 \end{bmatrix}
\;\xrightarrow{\,G_2\,}\;
\begin{bmatrix} 2 & 5 \\ 0 & 2.236 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.
\]
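The row update of Fact 1 can be written out in a few lines. The sketch below, assuming NumPy, rotates the appended row into the triangular factor column by column and reproduces the numbers of Example 1; the helper name qr_row_update is ours, not a library routine.

```python
# Updating the triangular QR factor after appending a row a^T (assumes NumPy).
import numpy as np

def qr_row_update(R, a):
    """Return the n x n updated factor when the row a^T is appended to A."""
    n = R.shape[1]
    M = np.vstack([R[:n, :].astype(float), a])   # n x n part of R stacked on a^T
    for k in range(n):                           # one Givens rotation per column
        rho = np.hypot(M[k, k], M[n, k])
        if rho == 0.0:
            continue
        c, s = M[k, k] / rho, M[n, k] / rho
        G = np.array([[c, s], [-s, c]])
        M[[k, n], k:] = G @ M[[k, n], k:]        # zeros out the entry of the appended row in column k
    return M[:n, :]

R = np.array([[1.732, 3.464], [0.0, 1.414], [0.0, 0.0]])
print(qr_row_update(R, np.array([1.0, 4.0])))    # approx [[2, 5], [0, 2.236]], cf. Example 1
```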
39.8 Damped Least Squares

Definitions:

The damped least squares solution is the solution to the problem
\[
\bigl( A^T A + \alpha I \bigr)\, x = A^T b,
\]
where $\alpha > 0$ and $A$ and $b$ are real. Damped least squares is also known as ridge regression and Tikhonov (or Phillips) regularization.

Facts:

(See, e.g., [Han98] for further details.)

1. The two formulations
\[
\min_x \bigl\{ \|b - Ax\|_2^2 + \alpha \|x\|_2^2 \bigr\}
\qquad \text{and} \qquad
\min_x \left\| \begin{bmatrix} b \\ 0 \end{bmatrix} - \begin{bmatrix} A \\ \sqrt{\alpha}\, I \end{bmatrix} x \right\|_2
\]
are analytically equivalent. (The second formulation is sketched in code at the end of this section.)
2. The damping (controlled by the parameter $\alpha$) reduces the variance of the solution, at the cost of introducing bias.
3. If $\mathrm{Cov}(b) = I$, then the covariance matrix for the damped least squares solution $x_\alpha$ is
\[
\mathrm{Cov}(x_\alpha) = \bigl( A^T A + \alpha I \bigr)^{-1} A^T A\, \bigl( A^T A + \alpha I \bigr)^{-1} = V\, \Sigma^2 (\Sigma^2 + \alpha I)^{-2}\, V^T,
\]
where $\Sigma$ and $V$ are from the SVD of $A$. Hence,
\[
\|\mathrm{Cov}(x_\alpha)\|_2 = \max_i \frac{\sigma_i^2}{(\sigma_i^2 + \alpha)^2} \le (4\alpha)^{-1},
\]
while $\|\mathrm{Cov}(x_0)\|_2 = \|(A^T A)^{-1}\|_2 = \sigma_n^{-2}$, which can be much larger.
4. The expected value of $x_\alpha$ is
\[
E(x_\alpha) = \bigl( A^T A + \alpha I \bigr)^{-1} A^T A\, x_0,
\]
which introduces a bias because $E(x_\alpha) \ne E(x_0)$ when $\alpha > 0$.
5. The damped least squares problem can take the more general form
\[
\min_x \bigl\{ \|b - Ax\|_2^2 + \alpha \|Bx\|_2^2 \bigr\} \quad \Leftrightarrow \quad \bigl( A^T A + \alpha B^T B \bigr)\, x = A^T b,
\]
where $\|B\,\cdot\,\|_2$ defines a (semi)norm. The solution to this problem is unique when the nullspaces of $A$ and $B$ intersect trivially.
6. [Han98, Chap. 7]. Some algorithms for computing $\alpha$ are the discrepancy principle, generalized cross validation, and the L-curve criterion.

Examples:

1. Let $\delta = 10^{-5}$ and consider the matrix and vectors
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 1+\delta \\ 1 & 1+2\delta \end{bmatrix}, \qquad
x_0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad b = A x_0, \qquad
\tilde{b} = b + \begin{bmatrix} 0 \\ 0 \\ \delta \end{bmatrix}.
\]
Obviously $x_0$ is the least squares solution to $Ax + r = b$ with $r_0 = 0$. The minimizer of $\|\tilde{b} - Ax\|_2$ is $\tilde{x}_0 = (0.5,\ 1.5)^T$, showing that for this problem the least squares solution is very sensitive to perturbations (the condition number is $\kappa(A) = 2.4\cdot 10^{5}$). Using $\alpha = 10^{-8}$, we obtain the damped least squares solutions
\[
x_\alpha = \begin{bmatrix} 0.999995 \\ 1.000005 \end{bmatrix} \qquad \text{and} \qquad \tilde{x}_\alpha = \begin{bmatrix} 0.995 \\ 1.005 \end{bmatrix}.
\]
Comparing the damped and the undamped least squares solutions $\tilde{x}_\alpha$ and $\tilde{x}_0$ of the perturbed least squares problem, we see that $\tilde{x}_\alpha$ is a better approximation to the unperturbed solution $x_0$ than $\tilde{x}_0$.
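A sketch of Fact 1 applied to Example 1, assuming NumPy: the damped solution is computed by solving the stacked least squares problem with $\sqrt{\alpha}\,I$ appended to $A$; the helper name damped_ls is ours.

```python
# Damped least squares (Tikhonov regularization) via the stacked formulation (assumes NumPy).
import numpy as np

delta, alpha = 1e-5, 1e-8
A = np.array([[1.0, 1.0], [1.0, 1.0 + delta], [1.0, 1.0 + 2 * delta]])
b = A @ np.array([1.0, 1.0])
b_pert = b + np.array([0.0, 0.0, delta])

def damped_ls(A, b, alpha):
    """Solve min ||b - A x||_2^2 + alpha ||x||_2^2 as an ordinary LS problem."""
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(alpha) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

print(np.linalg.lstsq(A, b_pert, rcond=None)[0])  # approx (0.5, 1.5): very sensitive
print(damped_ls(A, b_pert, alpha))                # approx (0.995, 1.005)
print(damped_ls(A, b, alpha))                     # approx (0.999995, 1.000005)
```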
39.9 Rank Revealing Decompositions

Definitions:

A rank revealing decomposition is a two-sided orthogonal decomposition of the form
\[
A = U\, \bar{R}\, V^T = U \begin{bmatrix} R \\ 0 \end{bmatrix} V^T,
\]
where $U$ and $V$ are orthogonal, and $R$ is upper triangular and reveals the (numerical) rank of $A$ in the size of its diagonal elements. The numerical rank $k_\tau$ of $A$, with respect to the threshold $\tau$, is defined as
\[
k_\tau = \min\, \mathrm{rank}(A + E) \qquad \text{subject to} \qquad \|E\|_2 \le \tau.
\]

Facts:

(See, e.g., [Bjo96, Sec. 1.7.3–6], [Han98, Sec. 2.2], and [Ste98, Chap. 5].)

1. [GV96, p. 73] The numerical rank $k_\tau$ is equal to the number of singular values greater than $\tau$, i.e., $\sigma_{k_\tau} > \tau \ge \sigma_{k_\tau + 1}$.
2. The singular value decomposition is rank revealing, with the middle matrix $\bar{R} = \Sigma$. The SVD is difficult to update.
3. If $A$ is exactly rank deficient with $\mathrm{rank}(A) = p$, then there always exists a pivoted QR factorization $A\Pi = QR$ with $R$ of the form
\[
R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}, \qquad R_{11} \in \mathbb{R}^{p\times p}, \qquad \mathrm{rank}(R) = \mathrm{rank}(R_{11}) = p,
\]
and a complete orthogonal decomposition of the form $A = U\bar{R}V^T$, where
\[
\bar{R} = \begin{bmatrix} R_{11} & 0 \\ 0 & 0 \end{bmatrix}, \qquad R_{11} \in \mathbb{R}^{p\times p}, \qquad \mathrm{rank}(\bar{R}) = \mathrm{rank}(R_{11}) = p.
\]
The pseudoinverse of $A$ is $A^\dagger = V_{:,1:p}\, R_{11}^{-1}\, U_{:,1:p}^T$.
4. A basic solution can be computed from the pivoted QR factorization as $x_{0B} = R_{11}^{-1}\, Q_{:,1:p}^T\, b$. The minimum-norm least squares solution is given in terms of the complete orthogonal decomposition as $x_{0M} = V_{:,1:p}\, R_{11}^{-1}\, U_{:,1:p}^T\, b$.
5. The rank revealing QR (RRQR) decomposition is a pivoted QR factorization $A\Pi = QR$ such that
\[
\frac{\sigma_i}{c_i} \le \sigma_i(R_{1:i,1:i}) \le \sigma_i \le \|R_{i:n,\,i:n}\|_2 \le c_i\, \sigma_i, \qquad i = 1, \ldots, n,
\]
where $\sigma_i$ is the $i$th singular value of $A$,
\[
c_i = \sqrt{\, i\,(n-i) + \min(i,\, n-i) \,},
\]
and $\sigma_i(R_{1:i,1:i})$ denotes the smallest singular value of $R_{1:i,1:i}$. The RRQR factorization can be used to estimate the numerical rank $k_\tau$. The RRQR factorization is not unique.
6. The URV decomposition is a two-sided orthogonal decomposition $A = U\bar{R}V^T$ such that, for $k = 1, \ldots, n$,
\[
\sigma_i\, \breve{c}_k \le \sigma_i(R_{1:k,1:k}) \le \sigma_i, \qquad i = 1, \ldots, k,
\]
and
\[
\sigma_i \le \sigma_{i-k}(R_{k+1:n,\,k+1:n}) \le \sigma_i / \breve{c}_k, \qquad i = k+1, \ldots, n,
\]
where
\[
\breve{c}_k = \left( 1 - \frac{\|R_{k+1:n,\,k+1:n}\|_2^2}{\sigma_k(R_{1:k,1:k})^2 - \|R_{k+1:n,\,k+1:n}\|_2^2} \right)^{1/2}.
\]
There is also a ULV decomposition with a lower triangular middle matrix; both can be used to estimate the numerical rank of $A$.
7. The RRQR, URV, and ULV decompositions can be updated, at slightly more cost than the QR factorization.

Examples:

1. The rank of $A$ is revealed by the zero element in the (3,3) position of $R$:
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \\ 4 & 5 & 6 \end{bmatrix} = QR
\qquad \text{with} \qquad
R = \begin{bmatrix} 5.477 & 7.303 & 9.129 \\ 0 & 0.816 & 1.633 \\ 0 & 0 & 0 \end{bmatrix}.
\]
Here the QR factorization is rank revealing ($U = Q$ and $V = I$).
2. Pivoting must be used to ensure that a QR factorization is rank revealing. The "standard column pivoting" often works well in connection with Householder transformations; here the pivot column in each stage is chosen to maximize the norm of the leading column of the submatrix $A^{(k)}_{k:m,\,k:n}$ to be reduced. Example:
\[
A = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 1 & 2 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
A^{(1)} = H_1 A = \begin{bmatrix} -4.4721 & -2.2361 & -2.6833 \\ 0 & 0 & 0.8944 \\ 0 & 0 & 1 \end{bmatrix}.
\]
After interchanging columns 2 and 3 with the permutation matrix $\Pi$ below,
\[
A^{(1)}\Pi = \begin{bmatrix} -4.4721 & -2.6833 & -2.2361 \\ 0 & 0.8944 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \qquad
R = A^{(2)} = H_2 A^{(1)}\Pi = \begin{bmatrix} -4.4721 & -2.6833 & -2.2361 \\ 0 & -1.3416 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\]
\[
\Pi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}.
\]
3. The standard column pivoting strategy is not guaranteed to reveal the numerical rank; hence, the development of the RRQR and URV decompositions.

References

[Bjo96] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, 1996.
[Bjo04] Å. Björck. The calculation of linear least squares problems. Acta Numerica (2004):1–53, Cambridge University Press, Cambridge, 2004.
[Gau95] C.F. Gauss. Theory of the Combination of Observations Least Subject to Errors. (Translated by G.W. Stewart.) SIAM, Philadelphia, 1995.
[Gol65] G.H. Golub. Numerical methods for solving least squares problems. Numer. Math., 7:206–216, 1965.
[GV96] G.H. Golub and C.F. Van Loan. Matrix Computations, 3rd ed. Johns Hopkins University Press, Baltimore, MD, 1996.
[Han98] P.C. Hansen. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, Philadelphia, 1998.
[LH95] C.L. Lawson and R.J. Hanson. Solving Least Squares Problems. Classics in Applied Mathematics, SIAM, Philadelphia, 1995.
[Ste98] G.W. Stewart. Matrix Algorithms Volume I: Basic Decompositions. SIAM, Philadelphia, 1998.
[Str88] G. Strang. Linear Algebra and Its Applications, 3rd ed. Saunders College Publishing, Fort Worth, TX, 1988.