The other possibility is that λ is an independent parameter and not a function of the parameters a. In this case, the extended log-likelihood function is
\[
\ln L = N\ln\lambda - \lambda + \sum_{i=1}^{N} \ln P(x_i|\mathbf{a}), \tag{31.89}
\]
where we have omitted terms not depending on λ or a. Differentiating with respect to λ and setting the result equal to zero, we find that the ML estimate of λ is simply λ̂ = N. By differentiating (31.89) with respect to the parameters aᵢ and setting the results equal to zero, we obtain the usual ML estimates âᵢ of their values. In this case, however, the errors in our estimates will be larger, in general, than those in the standard likelihood approach, since they must include the effect of statistical uncertainty in the parameter λ.

31.6 The method of least squares

The method of least squares is, in fact, just a special case of the method of maximum likelihood. Nevertheless, it is so widely used as a method of parameter estimation that it has acquired a special name of its own. At the outset, let us suppose that a data sample consists of a set of pairs (xᵢ, yᵢ), i = 1, 2, . . . , N. For example, these data might correspond to the temperature yᵢ measured at various points xᵢ along some metal rod. For the moment, we will suppose that the xᵢ are known exactly, whereas there exists a measurement error (or noise) nᵢ on each of the values yᵢ. Moreover, let us assume that the true value of y at any position x is given by some function y = f(x; a) that depends on the M unknown parameters a. Then
\[
y_i = f(x_i;\mathbf{a}) + n_i .
\]
Our aim is to estimate the values of the parameters a from the data sample.

Bearing in mind the central limit theorem, let us suppose that the nᵢ are drawn from a Gaussian distribution with no systematic bias and hence zero mean. In the most general case the measurement errors nᵢ might not be independent but be described by an N-dimensional multivariate Gaussian with non-trivial covariance matrix N, whose elements Nᵢⱼ = Cov[nᵢ, nⱼ] we assume to be known. Under these assumptions it follows from (30.148) that the likelihood function is
\[
L(\mathbf{x},\mathbf{y};\mathbf{a}) = \frac{1}{(2\pi)^{N/2}|\mathsf{N}|^{1/2}} \exp\!\left[-\tfrac{1}{2}\chi^2(\mathbf{a})\right],
\]
where the quantity denoted by χ² is given by the quadratic form
\[
\chi^2(\mathbf{a}) = \sum_{i,j=1}^{N} [y_i - f(x_i;\mathbf{a})]\,(\mathsf{N}^{-1})_{ij}\,[y_j - f(x_j;\mathbf{a})] = (\mathbf{y}-\mathbf{f})^{\mathrm T}\mathsf{N}^{-1}(\mathbf{y}-\mathbf{f}). \tag{31.90}
\]
In the last equality, we have rewritten the expression in matrix notation by defining the column vector f with elements fᵢ = f(xᵢ; a). We note that in the (common) special case in which the measurement errors nᵢ are independent, their covariance matrix takes the diagonal form N = diag(σ₁², σ₂², . . . , σ_N²), where σᵢ is the standard deviation of the measurement error nᵢ. In this case, the expression (31.90) for χ² reduces to
\[
\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[\frac{y_i - f(x_i;\mathbf{a})}{\sigma_i}\right]^2 .
\]
The least-squares (LS) estimators â_LS of the parameter values are defined as those that minimise the value of χ²(a); they are usually determined by solving the M equations
\[
\left.\frac{\partial\chi^2}{\partial a_i}\right|_{\mathbf{a}=\hat{\mathbf{a}}_{\mathrm{LS}}} = 0 \qquad \text{for } i = 1, 2, \ldots, M. \tag{31.91}
\]
Clearly, if the measurement errors nᵢ are indeed Gaussian distributed, as assumed above, then the LS and ML estimators of the parameters a coincide. Because of its relative simplicity, the method of least squares is often applied to cases in which the nᵢ are not Gaussian distributed. The resulting estimators â_LS are not the ML estimators, and the best that can be said in justification is that the method is an obviously sensible procedure for parameter estimation that has stood the test of time.
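When an analytic minimisation of (31.90) is not possible, the defining equations (31.91) can be tackled numerically. The sketch below is not part of the original text; it illustrates the idea in Python with NumPy and SciPy, and the model `f_model`, the synthetic data and the value σ = 0.1 are invented purely for the demonstration.

```python
import numpy as np
from scipy.optimize import minimize

def chi_squared(a, x, y, f, cov_N):
    """Quadratic form (31.90): (y - f)^T N^{-1} (y - f) for correlated errors."""
    r = y - f(x, a)                       # residuals y_i - f(x_i; a)
    return r @ np.linalg.solve(cov_N, r)  # solve N z = r rather than invert N

def chi_squared_diag(a, x, y, f, sigma):
    """Special case of independent errors: sum_i [(y_i - f(x_i; a)) / sigma_i]^2."""
    return np.sum(((y - f(x, a)) / sigma) ** 2)

# Illustrative use with a made-up model f(x; a) = a_1 exp(-a_2 x) and a common
# error sigma = 0.1 on each y_i (so N = sigma^2 I).
def f_model(x, a):
    return a[0] * np.exp(-a[1] * x)

rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 5.0, 20)
y_data = f_model(x_data, [2.0, 0.5]) + 0.1 * rng.standard_normal(x_data.size)
cov_N = 0.1**2 * np.eye(x_data.size)

# Solving (31.91) numerically: minimise chi^2 over a from a starting guess.
res = minimize(chi_squared, x0=[1.0, 1.0], args=(x_data, y_data, f_model, cov_N))
print("LS estimates:", res.x)             # expect values near the true (2.0, 0.5)
```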
Finally, we note that the method of least squares is easily extended to the case in which each measurement yᵢ depends on several variables, which we denote by xᵢ. For example, yᵢ might represent the temperature measured at the (three-dimensional) position xᵢ in a room. In this case, the data is modelled by a function y = f(xᵢ; a), and the remainder of the above discussion carries through unchanged.

31.6.1 Linear least squares

We have so far made no restriction on the form of the function f(x; a). It so happens, however, that, for a model in which f(x; a) is a linear function of the parameters a₁, a₂, . . . , a_M, one can always obtain analytic expressions for the LS estimators â_LS and their variances. The general form of this kind of model is
\[
f(x;\mathbf{a}) = \sum_{i=1}^{M} a_i h_i(x), \tag{31.92}
\]
where {h₁(x), h₂(x), . . . , h_M(x)} is some set of linearly independent fixed functions of x, often called the basis functions. Note that the functions hᵢ(x) themselves may be highly non-linear functions of x. The 'linear' nature of the model (31.92) refers only to its dependence on the parameters aᵢ. Furthermore, in this case, it may be shown that the LS estimators âᵢ have zero bias and are minimum-variance, irrespective of the probability density function from which the measurement errors nᵢ are drawn.

In order to obtain analytic expressions for the LS estimators â_LS, it is convenient to write (31.92) in the form
\[
f(x_i;\mathbf{a}) = \sum_{j=1}^{M} R_{ij} a_j , \tag{31.93}
\]
where Rᵢⱼ = hⱼ(xᵢ) is an element of the response matrix R of the experiment. The expression for χ² given in (31.90) can then be written, in matrix notation, as
\[
\chi^2(\mathbf{a}) = (\mathbf{y} - \mathsf{R}\mathbf{a})^{\mathrm T}\mathsf{N}^{-1}(\mathbf{y} - \mathsf{R}\mathbf{a}). \tag{31.94}
\]
The LS estimates of the parameters a are now found, as shown in (31.91), by differentiating (31.94) with respect to the aᵢ and setting the resulting expressions equal to zero. Denoting by ∇χ² the vector with elements ∂χ²/∂aᵢ, we find
\[
\nabla\chi^2 = -2\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}(\mathbf{y} - \mathsf{R}\mathbf{a}). \tag{31.95}
\]
This can be verified by writing out the expression (31.94) in component form and differentiating directly.

Verify result (31.95) by formulating the calculation in component form.

To make the derivation less cumbersome, let us adopt the summation convention discussed in section 26.1, in which it is understood that any subscript that appears exactly twice in any term of an expression is to be summed over all the values that a subscript in that position can take. Thus, writing (31.94) in component form, we have
\[
\chi^2(\mathbf{a}) = (y_i - R_{ik}a_k)(N^{-1})_{ij}(y_j - R_{jl}a_l).
\]
Differentiating with respect to a_p gives
\[
\begin{aligned}
\frac{\partial\chi^2}{\partial a_p} &= -R_{ik}\delta_{kp}(N^{-1})_{ij}(y_j - R_{jl}a_l) + (y_i - R_{ik}a_k)(N^{-1})_{ij}(-R_{jl}\delta_{lp})\\
&= -R_{ip}(N^{-1})_{ij}(y_j - R_{jl}a_l) - (y_i - R_{ik}a_k)(N^{-1})_{ij}R_{jp},
\end{aligned} \tag{31.96}
\]
where δᵢⱼ is the Kronecker delta symbol discussed in section 26.1. By swapping the indices i and j in the second term on the RHS of (31.96) and using the fact that the matrix N⁻¹ is symmetric, we obtain
\[
\frac{\partial\chi^2}{\partial a_p} = -2R_{ip}(N^{-1})_{ij}(y_j - R_{jk}a_k) = -2(R^{\mathrm T})_{pi}(N^{-1})_{ij}(y_j - R_{jk}a_k). \tag{31.97}
\]
If we denote the vector with components ∂χ²/∂a_p, p = 1, 2, . . . , M, by ∇χ² and write the RHS of (31.97) in matrix notation, we recover the result (31.95).

Setting the expression (31.95) equal to zero at a = â, we find
\[
-2\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathbf{y} + 2\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R}\hat{\mathbf{a}} = \mathbf{0}.
\]
Provided the matrix RᵀN⁻¹R is not singular, we may solve this equation for â to obtain
\[
\hat{\mathbf{a}} = (\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathbf{y} \equiv \mathsf{S}\mathbf{y}, \tag{31.98}
\]
thus defining the M × N matrix S. It follows that the LS estimates âᵢ, i = 1, 2, . . . , M, are linear functions of the original measurements yⱼ, j = 1, 2, . . . , N.
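For the linear model, (31.98) can be evaluated directly. The short Python sketch below is again illustrative rather than part of the text: it assumes the user supplies the basis functions hⱼ, the data arrays and the covariance matrix N, builds the response matrix Rᵢⱼ = hⱼ(xᵢ) and solves the normal equations rather than forming any explicit inverse.

```python
import numpy as np

def linear_least_squares(x, y, cov_N, basis):
    """LS estimates (31.98) for the linear model f(x; a) = sum_j a_j h_j(x).

    basis is a sequence of callables [h_1, ..., h_M]; returns the vector a_hat.
    """
    R = np.column_stack([h(x) for h in basis])  # response matrix, R_ij = h_j(x_i)
    Ninv_R = np.linalg.solve(cov_N, R)          # N^{-1} R without inverting N
    A = R.T @ Ninv_R                            # R^T N^{-1} R
    b = Ninv_R.T @ y                            # R^T N^{-1} y  (using N^T = N)
    return np.linalg.solve(A, b)                # a_hat = (R^T N^{-1} R)^{-1} R^T N^{-1} y

# Example call for a straight line f(x; a) = a_1 + a_2 x with independent errors:
#   a_hat = linear_least_squares(x, y, sigma**2 * np.eye(len(x)),
#                                [np.ones_like, lambda x: x])
```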
Moreover, using the error propagation formula (30.141) derived in subsection 30.12.3, we find that the covariance matrix of the estimators âᵢ is given by
\[
\mathsf{V} \equiv \mathrm{Cov}[\hat{a}_i, \hat{a}_j] = \mathsf{S}\mathsf{N}\mathsf{S}^{\mathrm T} = (\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}. \tag{31.99}
\]
The two equations (31.98) and (31.99) contain the complete method of least squares. In particular, we note that, if one calculates the LS estimates using (31.98), then one has already obtained their covariance matrix (31.99).

Prove result (31.99).

Using the definition of S given in (31.98), the covariance matrix (31.99) becomes
\[
\mathsf{V} = \mathsf{S}\mathsf{N}\mathsf{S}^{\mathrm T} = \left[(\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\right]\mathsf{N}\left[(\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\right]^{\mathrm T}.
\]
Using the result (AB · · · C)ᵀ = Cᵀ · · · BᵀAᵀ for the transpose of a product of matrices and noting that, for any non-singular matrix, (A⁻¹)ᵀ = (Aᵀ)⁻¹, we find
\[
\begin{aligned}
\mathsf{V} &= (\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{N}(\mathsf{N}^{\mathrm T})^{-1}\mathsf{R}\left[(\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{\mathrm T}\right]^{-1}\\
&= (\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R}\,(\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1}\\
&= (\mathsf{R}^{\mathrm T}\mathsf{N}^{-1}\mathsf{R})^{-1},
\end{aligned}
\]
where we have also used the fact that N is symmetric and so Nᵀ = N.

It is worth noting that one may also write the elements of the (inverse) covariance matrix as
\[
(\mathsf{V}^{-1})_{ij} = \frac{1}{2}\left.\frac{\partial^2\chi^2}{\partial a_i\,\partial a_j}\right|_{\mathbf{a}=\hat{\mathbf{a}}},
\]
which is the same as the Fisher matrix (31.36) in cases where the measurement errors are Gaussian distributed (and so the log-likelihood is ln L = −χ²/2). This proves, at least for this case, our earlier statement that the LS estimators are minimum-variance. In fact, since f(x; a) is linear in the parameters a, one can write χ² exactly as
\[
\chi^2(\mathbf{a}) = \chi^2(\hat{\mathbf{a}}) + \frac{1}{2}\sum_{i,j=1}^{M}\left.\frac{\partial^2\chi^2}{\partial a_i\,\partial a_j}\right|_{\mathbf{a}=\hat{\mathbf{a}}}(a_i - \hat{a}_i)(a_j - \hat{a}_j),
\]
which is quadratic in the parameters aᵢ. Hence the form of the likelihood function L ∝ exp(−χ²/2) is Gaussian. From the discussions of subsections 31.3.6 and 31.5.6, it follows that the 'surfaces' χ²(a) = c, where c is a constant, bound ellipsoidal confidence regions for the parameters aᵢ. The relationship between the value of the constant c and the confidence level is given by (31.39).

[Figure 31.9: A set of data points with error bars indicating the uncertainty σ = 0.5 on the y-values. The straight line is y = m̂x + ĉ, where m̂ and ĉ are the least-squares estimates of the slope and intercept.]

An experiment produces the following data sample pairs (xᵢ, yᵢ):

xᵢ:  1.85  2.72  2.81  3.06  3.42  3.76  4.31  4.47  4.64  4.99
yᵢ:  2.26  3.10  3.80  4.11  4.74  4.31  5.24  4.03  5.69  6.57

where the xᵢ-values are known exactly but each yᵢ-value is measured only to an accuracy of σ = 0.5. Assuming the underlying model for the data to be a straight line y = mx + c, find the LS estimates of the slope m and intercept c and quote the standard error on each estimate.

The data are plotted in figure 31.9, together with error bars indicating the uncertainty in the yᵢ-values. Our model of the data is a straight line, and so we have f(x; c, m) = c + mx. In the language of (31.92), our basis functions are h₁(x) = 1 and h₂(x) = x and our model parameters are a₁ = c and a₂ = m. From (31.93) the elements of the response matrix are Rᵢⱼ = hⱼ(xᵢ), so that
\[
\mathsf{R} = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_N \end{pmatrix}, \tag{31.100}
\]
where xᵢ are the data values and N = 10 in our case. Further, since the standard deviation on each measurement error is σ, we have N = σ²I, where I is the N × N identity matrix. Because of this simple form for N, the expression (31.98) for the LS estimates reduces to
\[
\hat{\mathbf{a}} = \sigma^2(\mathsf{R}^{\mathrm T}\mathsf{R})^{-1}\,\frac{1}{\sigma^2}\,\mathsf{R}^{\mathrm T}\mathbf{y} = (\mathsf{R}^{\mathrm T}\mathsf{R})^{-1}\mathsf{R}^{\mathrm T}\mathbf{y}. \tag{31.101}
\]
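Before working through the algebra, it may help to see (31.100), (31.101) and (31.99) evaluated numerically for this data set. The Python sketch below is illustrative only and is not part of the original solution; it uses the ten data pairs quoted above with σ = 0.5, and its output can be compared with the estimates derived next.

```python
import numpy as np

x = np.array([1.85, 2.72, 2.81, 3.06, 3.42, 3.76, 4.31, 4.47, 4.64, 4.99])
y = np.array([2.26, 3.10, 3.80, 4.11, 4.74, 4.31, 5.24, 4.03, 5.69, 6.57])
sigma = 0.5                                 # common error on each y_i, so N = sigma^2 I

R = np.column_stack([np.ones_like(x), x])   # response matrix (31.100): h_1 = 1, h_2 = x
a_hat = np.linalg.solve(R.T @ R, R.T @ y)   # (31.101): a_hat = (R^T R)^{-1} R^T y
V = sigma**2 * np.linalg.inv(R.T @ R)       # covariance (31.99) with N = sigma^2 I

c_hat, m_hat = a_hat
sigma_c, sigma_m = np.sqrt(np.diag(V))
print(f"c = {c_hat:.2f} +/- {sigma_c:.2f}")
print(f"m = {m_hat:.2f} +/- {sigma_m:.2f}")
```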
Note that we cannot expand the inverse in the last equality of (31.101), since R itself is not square and hence does not possess an inverse. Inserting the form for R in (31.100) into the expression (31.101), we find
\[
\begin{pmatrix} \hat{c}\\ \hat{m} \end{pmatrix}
= \begin{pmatrix} \sum_i 1 & \sum_i x_i\\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}^{-1}
\begin{pmatrix} \sum_i y_i\\ \sum_i x_i y_i \end{pmatrix}
= \frac{1}{N(\overline{x^2} - \bar{x}^2)}
\begin{pmatrix} \overline{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{pmatrix}
\begin{pmatrix} N\bar{y}\\ N\overline{xy} \end{pmatrix}.
\]
We thus obtain the LS estimates
\[
\hat{m} = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2}
\qquad\text{and}\qquad
\hat{c} = \frac{\overline{x^2}\,\bar{y} - \bar{x}\,\overline{xy}}{\overline{x^2} - \bar{x}^2} = \bar{y} - \hat{m}\bar{x}, \tag{31.102}
\]
where the last expression for ĉ shows that the best-fit line passes through the 'centre of mass' (x̄, ȳ) of the data sample. To find the standard errors on our results, we must calculate the covariance matrix of the estimators. This is given by (31.99), which in our case reduces to
\[
\mathsf{V} = \sigma^2(\mathsf{R}^{\mathrm T}\mathsf{R})^{-1}
= \frac{\sigma^2}{N(\overline{x^2} - \bar{x}^2)}
\begin{pmatrix} \overline{x^2} & -\bar{x}\\ -\bar{x} & 1 \end{pmatrix}. \tag{31.103}
\]
The standard error on each estimator is simply the positive square root of the corresponding diagonal element, i.e. σ_ĉ = √V₁₁ and σ_m̂ = √V₂₂, and the covariance of the estimators m̂ and ĉ is given by Cov[ĉ, m̂] = V₁₂ = V₂₁. Inserting the data sample averages and moments into (31.102) and (31.103), we find
\[
c = \hat{c} \pm \sigma_{\hat{c}} = 0.40 \pm 0.62
\qquad\text{and}\qquad
m = \hat{m} \pm \sigma_{\hat{m}} = 1.11 \pm 0.17.
\]
The 'best-fit' straight line y = m̂x + ĉ is plotted in figure 31.9. For comparison, the true values used to create the data were m = 1 and c = 1.

The extension of the method to fitting data to a higher-order polynomial, such as f(x; a) = a₁ + a₂x + a₃x², is obvious. However, as the order of the polynomial increases the matrix inversions become rather complicated. Indeed, even when the matrices are inverted numerically, the inversion is prone to numerical instabilities. A better approach is to replace the basis functions h_m(x) = x^{m−1}, m = 1, 2, . . . , M, with a set of polynomials that are 'orthogonal over the data', i.e. such that
\[
\sum_{i=1}^{N} h_l(x_i)\,h_m(x_i) = 0 \qquad \text{for } l \neq m.
\]
Such a set of polynomial basis functions can always be found by using the Gram–Schmidt orthogonalisation procedure presented in section 17.1. The details of this approach are beyond the scope of our discussion but we note that, in this case, the matrix RᵀR is diagonal and may be inverted easily.

31.6.2 Non-linear least squares

If the function f(x; a) is not linear in the parameters a then, in general, it is not possible to obtain an explicit expression for the LS estimates â. Instead, one must use an iterative (numerical) procedure, which we now outline. In practice,