...

The method of least squares

by taratuta

on
Category: Documents
113

views

Report

Comments

Transcript

The method of least squares
31.6 THE METHOD OF LEAST SQUARES
The other possibility is that λ is an independent parameter and not a function
of the parameters a. In this case, the extended log-likelihood function is
ln L = N ln λ − λ +
N
ln P (xi |a),
(31.89)
i=1
where we have omitted terms not depending on λ or a. Differentiating with
respect to λ and setting the result equal to zero, we find that the ML estimate of
λ is simply
λ̂ = N.
By differentiating (31.89) with respect to the parameters ai and setting the results
equal to zero, we obtain the usual ML estimates âi of their values. In this case,
however, the errors in our estimates will be larger, in general, than those in the
standard likelihood approach, since they must include the effect of statistical
uncertainty in the parameter λ.
31.6 The method of least squares
The method of least squares is, in fact, just a special case of the method of
maximum likelihood. Nevertheless, it is so widely used as a method of parameter
estimation that it has acquired a special name of its own. At the outset, let us
suppose that a data sample consists of a set of pairs (xi , yi ), i = 1, 2, . . . , N. For
example, these data might correspond to the temperature yi measured at various
points xi along some metal rod.
For the moment, we will suppose that the xi are known exactly, whereas there
exists a measurement error (or noise) ni on each of the values yi . Moreover, let
us assume that the true value of y at any position x is given by some function
y = f(x; a) that depends on the M unknown parameters a. Then
yi = f(xi ; a) + ni .
Our aim is to estimate the values of the parameters a from the data sample.
Bearing in mind the central limit theorem, let us suppose that the ni are drawn
from a Gaussian distribution with no systematic bias and hence zero mean. In the
most general case the measurement errors ni might not be independent but be
described by an N-dimensional multivariate Gaussian with non-trivial covariance
matrix N, whose elements Nij = Cov[ni , nj ] we assume to be known. Under these
assumptions it follows from (30.148), that the likelihood function is
L(x, y; a) =
1
exp − 12 χ2 (a) ,
(2π)N/2 |N|1/2
1271
STATISTICS
where the quantity denoted by χ2 is given by the quadratic form
χ2 (a) =
N
[yi − f(xi ; a)](N−1 )ij [yj − f(xj ; a)] = (y − f)T N−1 (y − f).
(31.90)
i,j=1
In the last equality, we have rewritten the expression in matrix notation by
defining the column vector f with elements fi = f(xi ; a). We note that in the
(common) special case in which the measurement errors ni are independent, their
2
), where σi is
covariance matrix takes the diagonal form N = diag(σ12 , σ22 , . . . , σN
the standard deviation of the measurement error ni . In this case, the expression
(31.90) for χ2 reduces to
2
N yi − f(xi ; a)
.
χ2 (a) =
σi
i=1
The least-squares (LS) estimators âLS of the parameter values are defined as
those that minimise the value of χ2 (a); they are usually determined by solving
the M equations
∂χ2 =0
for i = 1, 2, . . . , M.
(31.91)
∂ai a=âLS
Clearly, if the measurement errors ni are indeed Gaussian distributed, as assumed
above, then the LS and ML estimators of the parameters a coincide. Because
of its relative simplicity, the method of least squares is often applied to cases in
which the ni are not Gaussian distributed. The resulting estimators âLS are not the
ML estimators, and the best that can be said in justification is that the method is
an obviously sensible procedure for parameter estimation that has stood the test
of time.
Finally, we note that the method of least squares is easily extended to the case
in which each measurement yi depends on several variables, which we denote
by xi . For example, yi might represent the temperature measured at the (threedimensional) position xi in a room. In this case, the data is modelled by a
function y = f(xi ; a), and the remainder of the above discussion carries through
unchanged.
31.6.1 Linear least squares
We have so far made no restriction on the form of the function f(x; a). It so
happens, however, that, for a model in which f(x; a) is a linear function of the
parameters a1 , a2 , . . . , aM , one can always obtain analytic expressions for the LS
estimators âLS and their variances. The general form of this kind of model is
f(x; a) =
M
i=1
1272
ai hi (x),
(31.92)
31.6 THE METHOD OF LEAST SQUARES
where {h1 (x), h2 (x), . . . , hM (x)} is some set of linearly independent fixed functions
of x, often called the basis functions. Note that the functions hi (x) themselves may
be highly non-linear functions of x. The ‘linear’ nature of the model (31.92) refers
only to its dependence on the parameters ai . Furthermore, in this case, it may
be shown that the LS estimators âi have zero bias and are minimum-variance,
irrespective of the probability density function from which the measurement errors
ni are drawn.
In order to obtain analytic expressions for the LS estimators âLS , it is convenient
to write (31.92) in the form
f(x; a) =
M
Rij aj ,
(31.93)
j=1
where Rij = hj (xi ) is an element of the response matrix R of the experiment. The
expression for χ2 given in (31.90) can then be written, in matrix notation, as
χ2 (a) = (y − Ra)T N−1 (y − Ra).
(31.94)
The LS estimates of the parameters a are now found, as shown in (31.91), by
differentiating (31.94) with respect to the ai and setting the resulting expressions
equal to zero. Denoting by ∇χ2 the vector with elements ∂χ2 /∂ai , we find
∇χ2 = −2RT N−1 (y − Ra).
(31.95)
This can be verified by writing out the expression (31.94) in component form and
differentiating directly.
Verify result (31.95) by formulating the calculation in component form.
To make the derivation less cumbersome, let us adopt the summation convention discussed
in section 26.1, in which it is understood that any subscript that appears exactly twice in
any term of an expression is to be summed over all the values that a subscript in that
position can take. Thus, writing (31.94) in component form, we have
χ2 (a) = (yi − Rik ak )(N −1 )ij (yj − Rjl al ).
Differentiating with respect to ap gives
∂χ2
= −Rik δkp (N −1 )ij (yj − Rjl al ) + (yi − Rik ak )(N −1 )ij (−Rjl δlp )
∂ap
= −Rip (N −1 )ij (yj − Rjl al ) − (yi − Rik ak )(N −1 )ij Rjp ,
(31.96)
where δij is the Kronecker delta symbol discussed in section 26.1. By swapping the indices
i and j in the second term on the RHS of (31.96) and using the fact that the matrix N−1
is symmetric, we obtain
∂χ2
= −2Rip (N −1 )ij (yj − Rjk ak )
∂ap
= −2(R T )pi (N −1 )ij (yj − Rjk ak ).
(31.97)
If we denote the vector with components ∂χ2 /∂ap , p = 1, 2, . . . , M, by ∇χ2 and write the
RHS of (31.97) in matrix notation, we recover the result (31.95). 1273
STATISTICS
Setting the expression (31.95) equal to zero at a = â, we find
−2RT N−1 y + 2RT N−1 Râ = 0.
Provided the matrix RT N−1 R is not singular, we may solve this equation for â to
obtain
â = (RT N−1 R)−1 RT N−1 y ≡ Sy,
(31.98)
thus defining the M×N matrix S. It follows that the LS estimates âi , i = 1, 2, . . . , M,
are linear functions of the original measurements yj , j = 1, 2, . . . , N. Moreover,
using the error propagation formula (30.141) derived in subsection 30.12.3, we
find that the covariance matrix of the estimators âi is given by
V ≡ Cov[âi , âj ] = SNST = (RT N−1 R)−1 .
(31.99)
The two equations (31.98) and (31.99) contain the complete method of least
squares. In particular, we note that, if one calculates the LS estimates using
(31.98) then one has already obtained their covariance matrix (31.99).
Prove result (31.99).
Using the definition of S given in (31.98), the covariance matrix (31.99) becomes
V = SNST
= [(RT N−1 R)−1 RT N−1 ]N[(RT N−1 R)−1 RT N−1 ]T .
Using the result (AB · · · C)T = CT · · · BT AT for the transpose of a product of matrices and
noting that, for any non-singular matrix, (A−1 )T = (AT )−1 we find
V = (RT N−1 R)−1 RT N−1 N(NT )−1 R[(RT N−1 R)T ]−1
= (RT N−1 R)−1 RT N−1 R(RT N−1 R)−1
= (RT N−1 R)−1 ,
where we have also used the fact that N is symmetric and so NT = N. It is worth noting that one may also write the elements of the (inverse)
covariance matrix as
2 2 1
∂ χ
−1
,
(V )ij =
2 ∂ai ∂aj a=â
which is the same as the Fisher matrix (31.36) in cases where the measurement
errors are Gaussian distributed (and so the log-likelihood is ln L = −χ2 /2). This
proves, at least for this case, our earlier statement that the LS estimators are
minimum-variance. In fact, since f(x; a) is linear in the parameters a, one can
write χ2 exactly as
M 1 ∂2 χ2
χ2 (a) = χ2 (â) +
(ai − âi )(aj − âj ),
2
∂ai ∂aj a=â
i,j=1
which is quadratic in the parameters ai . Hence the form of the likelihood function
1274
31.6 THE METHOD OF LEAST SQUARES
y
7
6
5
4
3
2
1
x
0
0
1
2
4
3
5
Figure 31.9 A set of data points with error bars indicating the uncertainty
σ = 0.5 on the y-values. The straight line is y = m̂x + ĉ, where m̂ and ĉ are
the least-squares estimates of the slope and intercept.
L ∝ exp(−χ2 /2) is Gaussian. From the discussions of subsections 31.3.6 and
31.5.6, it follows that the ‘surfaces’ χ2 (a) = c, where c is a constant, bound
ellipsoidal confidence regions for the parameters ai . The relationship between the
value of the constant c and the confidence level is given by (31.39).
An experiment produces the following data sample pairs (xi , yi ):
xi :
yi :
1.85
2.26
2.72
3.10
2.81
3.80
3.06
4.11
3.42
4.74
3.76
4.31
4.31
5.24
4.47
4.03
4.64
5.69
4.99
6.57
where the xi -values are known exactly but each yi -value is measured only to an accuracy
of σ = 0.5. Assuming the underlying model for the data to be a straight line y = mx + c,
find the LS estimates of the slope m and intercept c and quote the standard error on each
estimate.
The data are plotted in figure 31.9, together with error bars indicating the uncertainty in
the yi -values. Our model of the data is a straight line, and so we have
f(x; c, m) = c + mx.
In the language of (31.92), our basis functions are h1 (x) = 1 and h2 (x) = x and our model
parameters are a1 = c and a2 = m. From (31.93) the elements of the response matrix are
Rij = hj (xi ), so that


1 x1
 1 x2 


,
(31.100)
R= .
... 

 ..
1 xN
where xi are the data values and N = 10 in our case. Further, since the standard deviation
on each measurement error is σ, we have N = σ 2 I, where I is the N × N identity matrix.
Because of this simple form for N, the expression (31.98) for the LS estimates reduces to
1 T
R y = (RT R)−1 RT y.
(31.101)
σ2
Note that we cannot expand the inverse in the last line, since R itself is not square and
â = σ 2 (RT R)−1
1275
STATISTICS
hence does not possess an inverse. Inserting the form for R in (31.100) into the expression
(31.101), we find
−1 ĉ
i yi
i 1 i x2i
=
m̂
i x i yi
i xi
i xi
2
1
Nȳ
x −x̄
.
=
Nxy
−x̄
1
N(x2 − x̄2 )
We thus obtain the LS estimates
xy − x̄ ȳ
x2 ȳ − x̄ xy
= ȳ − m̂x̄,
(31.102)
−
x2 − x̄2
where the last expression for ĉ shows that the best-fit line passes through the ‘centre
of mass’ (x̄, ȳ) of the data sample. To find the standard errors on our results, we must
calculate the covariance matrix of the estimators. This is given by (31.99), which in our
case reduces to
σ2
x2 −x̄ .
(31.103)
V = σ 2 (RT R)−1 =
−x̄
1
N(x2 − x̄2 )
m̂ =
x2
x̄2
and
ĉ =
The standard error on each estimator
is simply
√
√ the positive square root of the corresponding
diagonal element, i.e. σĉ = V11 and σm̂ = V22 , and the covariance of the estimators m̂
and ĉ is given by Cov[ĉ, m̂] = V12 = V21 . Inserting the data sample averages and moments
into (31.102) and (31.103), we find
c = ĉ ± σĉ = 0.40 ± 0.62
and
m = m̂ ± σm̂ = 1.11 ± 0.17.
The ‘best-fit’ straight line y = m̂x + ĉ is plotted in figure 31.9. For comparison, the true
values used to create the data were m = 1 and c = 1. The extension of the method to fitting data to a higher-order polynomial, such
as f(x; a) = a1 + a2 x + a3 x2 , is obvious. However, as the order of the polynomial
increases the matrix inversions become rather complicated. Indeed, even when the
matrices are inverted numerically, the inversion is prone to numerical instabilities.
A better approach is to replace the basis functions hm (x) = xm , m = 1, 2, . . . , M,
with a set of polynomials that are ‘orthogonal over the data’, i.e. such that
N
hl (xi )hm (xi ) = 0
for l = m.
i=1
Such a set of polynomial basis functions can always be found by using the Gram–
Schmidt orthogonalisation procedure presented in section 17.1. The details of this
approach are beyond the scope of our discussion but we note that, in this case,
the matrix RT R is diagonal and may be inverted easily.
31.6.2 Non-linear least squares
If the function f(x; a) is not linear in the parameters a then, in general, it is
not possible to obtain an explicit expression for the LS estimates â. Instead, one
must use an iterative (numerical) procedure, which we now outline. In practice,
1276
Fly UP