...

Sample statistics

by taratuta

therefore consider the xi as a set of N random variables. In the most general case,
these random variables will be described by some N-dimensional joint probability
density function P (x1 , x2 , . . . , xN ).§ In other words, an experiment consisting of N
measurements is considered as a single random sample from the joint distribution
(or population) P (x), where x denotes a point in the N-dimensional data space
having coordinates (x1 , x2 , . . . , xN ).
The situation is simplified considerably if the sample values xi are independent.
In this case, the N-dimensional joint distribution P (x) factorises into the product
of N one-dimensional distributions,
\[ P(\mathbf{x}) = P(x_1)\,P(x_2) \cdots P(x_N). \qquad (31.1) \]
In the general case, each of the one-dimensional distributions P (xi ) may be
different. A typical example of this occurs when N independent measurements
are made of some quantity x but the accuracy of the measuring procedure varies
between measurements.
It is often the case, however, that each sample value xi is drawn independently
from the same population. In this case, P (x) is of the form (31.1), but, in addition,
P (xi ) has the same form for each value of i. The measurements x1 , x2 , . . . , xN
are then said to form a random sample of size N from the one-dimensional
population P (x). This is the most common situation met in practice and, unless
stated otherwise, we will assume from now on that this is the case.
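The factorisation (31.1) can be illustrated numerically. The following is only a sketch: the text leaves the one-dimensional distributions unspecified, so a Gaussian form is assumed here purely for illustration, and the names `gaussian_pdf`, `joint_pdf` and the sample numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian density (an assumed form, not taken from the text)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / norm


def joint_pdf(xs, mu, sigmas):
    """Joint density of independent measurements, i.e. the product in (31.1).

    Each measurement may have its own accuracy sigma_i; passing equal sigmas
    gives the case of a random sample drawn from a single population.
    """
    density = 1.0
    for x, sigma in zip(xs, sigmas):
        density *= gaussian_pdf(x, mu, sigma)
    return density


# Hypothetical measurements of a quantity whose true value is taken to be 1.0.
xs = [1.2, 0.8, 1.1]
varying_accuracy = joint_pdf(xs, mu=1.0, sigmas=[0.1, 0.3, 0.2])
common_population = joint_pdf(xs, mu=1.0, sigmas=[0.2, 0.2, 0.2])
```

The first call corresponds to measurements of varying accuracy, the second to a random sample of size 3 from one population.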
31.2 Sample statistics
Suppose we have a set of N measurements x1 , x2 , . . . , xN . Any function of these
measurements (that contains no unknown parameters) is called a sample statistic,
or often simply a statistic. Sample statistics provide a means of characterising the
data. Although the resulting characterisation is inevitably incomplete, it is useful
to be able to describe a set of data in terms of a few pertinent numbers. We now
discuss the most commonly used sample statistics.
§ In this chapter, we will adopt the common convention that P(x) denotes the particular probability
density function that applies to its argument, x. This obviates the need to use a different letter
for the PDF of each new variable. For example, if X and Y are random variables with different
PDFs, then properly one should denote these distributions by f(x) and g(y), say. In our shorthand
notation, these PDFs are denoted by P (x) and P (y), where it is understood that the functional
form of the PDF may be different in each case.
188.7   204.7   193.2   169.0
168.1   189.8   166.3   200.0
Table 31.1 Experimental data giving eight measurements of the round trip
time in milliseconds for a computer ‘packet’ to travel from Cambridge UK to
Cambridge MA.
31.2.1 Averages
The simplest number used to characterise a sample is the mean, which for N
values xi , i = 1, 2, . . . , N, is defined by
\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (31.2) \]
In words, the sample mean is the sum of the sample values divided by the number
of values in the sample.
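As a minimal sketch of definition (31.2), the mean of the round trip times in table 31.1 can be computed as follows. The function name `sample_mean` and the variable names are illustrative choices, not part of the text.

```python
import math


def sample_mean(values):
    """Arithmetic mean: the sum of the sample values divided by their number."""
    if not values:
        raise ValueError("the sample must contain at least one value")
    # math.fsum limits the accumulation of floating-point rounding error.
    return math.fsum(values) / len(values)


# Round trip times in milliseconds from table 31.1.
round_trip_ms = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]
mean_ms = sample_mean(round_trip_ms)
print(f"sample mean = {mean_ms:.3f} ms")
```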
Table 31.1 gives eight values for the round trip time in milliseconds for a computer ‘packet’
to travel from Cambridge UK to Cambridge MA. Find the sample mean.
Using (31.2) the sample mean in milliseconds is given by
\[ \bar{x} = \tfrac{1}{8}(188.7 + 204.7 + 193.2 + 169.0 + 168.1 + 189.8 + 166.3 + 200.0) = \frac{1479.8}{8} = 184.975. \]
Since the sample values in table 31.1 are quoted to an accuracy of one decimal place, it is
usual to quote the mean to the same accuracy, i.e. as x̄ = 185.0.
Strictly speaking the mean given by (31.2) is the arithmetic mean and this is by
far the most common definition used for a mean. Other definitions of the mean
are possible, though less common, and include
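(i) the geometric mean,
\[ \bar{x}_g = \left( \prod_{i=1}^{N} x_i \right)^{1/N}, \qquad (31.3) \]
(ii) the harmonic mean,
\[ \bar{x}_h = \frac{N}{\sum_{i=1}^{N} 1/x_i}, \qquad (31.4) \]
(iii) the root mean square,
\[ \bar{x}_{\rm rms} = \left( \frac{\sum_{i=1}^{N} x_i^2}{N} \right)^{1/2}. \qquad (31.5) \]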
It should be noted that x̄, x̄_h and x̄_rms would remain well defined even if some
sample values were negative, but the value of x̄g could then become complex.
The geometric mean should not be used in such cases.
Calculate x̄g , x̄h and x̄rms for the sample given in table 31.1.
The geometric mean is given by (31.3) to be
\[ \bar{x}_g = (188.7 \times 204.7 \times \cdots \times 200.0)^{1/8} = 184.4. \]
The harmonic mean is given by (31.4) to be
\[ \bar{x}_h = \frac{8}{(1/188.7) + (1/204.7) + \cdots + (1/200.0)} = 183.9. \]
Finally, the root mean square is given by (31.5) to be
\[ \bar{x}_{\rm rms} = \left[ \tfrac{1}{8}(188.7^2 + 204.7^2 + \cdots + 200.0^2) \right]^{1/2} = 185.5. \]
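The three alternative means (31.3)-(31.5) can be checked with a short sketch. The function names are illustrative; the geometric mean is evaluated through logarithms, which is an implementation choice made here (it avoids forming a very large product) and assumes every sample value is positive.

```python
import math


def geometric_mean(values):
    """(31.3): the Nth root of the product, computed via logs; needs values > 0."""
    return math.exp(math.fsum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """(31.4): N divided by the sum of reciprocals; needs non-zero values."""
    return len(values) / math.fsum(1.0 / v for v in values)


def root_mean_square(values):
    """(31.5): the square root of the mean of the squared values."""
    return math.sqrt(math.fsum(v * v for v in values) / len(values))


# Round trip times in milliseconds from table 31.1.
round_trip_ms = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]
x_g = geometric_mean(round_trip_ms)
x_h = harmonic_mean(round_trip_ms)
x_rms = root_mean_square(round_trip_ms)
print(f"geometric {x_g:.1f}, harmonic {x_h:.1f}, rms {x_rms:.1f}")
```

For positive data that are not all equal the results are ordered as harmonic < geometric < arithmetic < rms, which is the pattern seen in the worked values.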
Two other measures of the ‘average’ of a sample are its mode and median. The
mode is simply the most commonly occurring value in the sample. A sample may
possess several modes, however, and thus it can be misleading in such cases to
use the mode as a measure of the average of the sample. The median of a sample
is the halfway point when the sample values xi (i = 1, 2, . . . , N) are arranged in
ascending (or descending) order. Clearly, this depends on whether the size of
the sample, N, is odd or even. If N is odd then the median is simply equal to
x_{(N+1)/2}, whereas if N is even the median of the sample is usually taken to be
(1/2)(x_{N/2} + x_{(N/2)+1}).
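The odd and even cases of the median, and the possibility of several modes, can be sketched as below. The names `sample_median` and `sample_modes` are illustrative. Note that the text indexes the ordered values from 1, whereas Python lists are indexed from 0.

```python
from collections import Counter


def sample_median(values):
    """Halfway point of the ordered sample (odd and even N handled separately)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the 1-based element x_{(N+1)/2} is the 0-based element n // 2.
        return ordered[mid]
    # Even N: average the 1-based elements x_{N/2} and x_{(N/2)+1}.
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def sample_modes(values):
    """All values that occur most often; a sample may have several modes."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Round trip times in milliseconds from table 31.1.
round_trip_ms = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]
median_ms = sample_median(round_trip_ms)
modes_ms = sample_modes(round_trip_ms)
```

For the data of table 31.1 every value occurs once, so `sample_modes` returns all eight values.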
Find the mode and median of the sample given in table 31.1.
From the table we see that each sample value occurs exactly once, and so any value may
be called the mode of the sample.
To find the sample median, we first arrange the sample values in ascending order and
obtain
166.3, 168.1, 169.0, 188.7, 189.8, 193.2, 200.0, 204.7.
Since the number of sample values N = 8, which is even, the median of the sample is
\[ \tfrac{1}{2}(x_4 + x_5) = \tfrac{1}{2}(188.7 + 189.8) = 189.25. \]
31.2.2 Variance and standard deviation
The variance and standard deviation both give a measure of the spread of values
in a sample about the sample mean x̄. The sample variance is defined by
\[ s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad (31.6) \]
and the sample standard deviation is the positive square root of the sample
variance, i.e.
\[ s = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 }. \qquad (31.7) \]
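A direct transcription of (31.6) and (31.7) is sketched below, with illustrative function names. Note that these definitions divide by N; in Python's standard library this corresponds to `statistics.pvariance` and `statistics.pstdev`, not to `statistics.variance`, which divides by N − 1.

```python
import math


def sample_variance(values):
    """(31.6): mean squared deviation from the sample mean, with a 1/N factor."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((v - mean) ** 2 for v in values) / n


def sample_std(values):
    """(31.7): positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Round trip times in milliseconds from table 31.1.
round_trip_ms = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]
s2 = sample_variance(round_trip_ms)
s = sample_std(round_trip_ms)
print(f"s^2 = {s2:.1f}, s = {s:.1f}")
```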
Find the sample variance and sample standard deviation of the data given in table 31.1.
We have already found that the sample mean is 185.0 to one decimal place. However,
when the mean is to be used in the subsequent calculation of the sample variance it is
better to use the most accurate value available. In this case the exact value is 184.975, and
so using (31.6),
\[ s^2 = \tfrac{1}{8}\left[ (188.7 - 184.975)^2 + \cdots + (200.0 - 184.975)^2 \right] = \frac{1608.36}{8} = 201.0, \]
where once again we have quoted the result to one decimal place. The sample standard
deviation is then given by s = √201.0 = 14.2. As it happens, in this case the difference
between the true mean and the rounded value is very small compared with the variation
of the individual readings about the mean and using the rounded value has a negligible
effect; however, this would not be so if the difference were comparable to the sample
standard deviation.
Using the definition (31.7), it is clear that in order to calculate the standard
deviation of a sample we must first calculate the sample mean. This requirement
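can be avoided, however, by using an alternative form for s^2. From (31.6), we see
that
\[ \begin{aligned}
s^2 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \\
    &= \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \sum_{i=1}^{N} 2 x_i \bar{x} + \frac{1}{N} \sum_{i=1}^{N} \bar{x}^2 \\
    &= \overline{x^2} - 2\bar{x}^2 + \bar{x}^2 = \overline{x^2} - \bar{x}^2.
\end{aligned} \]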
We may therefore write the sample variance s^2 as
\[ s^2 = \overline{x^2} - \bar{x}^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)^2, \qquad (31.8) \]
from which the sample standard deviation is found by taking the positive square
root. Thus, by evaluating the quantities \sum_{i=1}^{N} x_i and \sum_{i=1}^{N} x_i^2 for our sample, we
can calculate the sample mean and sample standard deviation at the same time.
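The practical appeal of (31.8) is that the two sums can be accumulated in a single pass over the data. The following is a sketch of that idea; the function name and the clamping of tiny negative variances are choices made here, not taken from the text.

```python
import math


def mean_and_std_from_sums(values):
    """Single pass: accumulate sum(x_i) and sum(x_i^2), then apply (31.8)."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for v in values:
        n += 1
        sum_x += v
        sum_x2 += v * v
    mean = sum_x / n
    variance = sum_x2 / n - mean * mean
    # Rounding error can make the difference marginally negative when the
    # spread is tiny compared with the mean, so clamp before the square root.
    return mean, math.sqrt(max(variance, 0.0))


# Round trip times in milliseconds from table 31.1.
round_trip_ms = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]
mean_ms, std_ms = mean_and_std_from_sums(round_trip_ms)
```

One caveat worth keeping in mind: (31.8) subtracts two nearly equal numbers when the standard deviation is much smaller than the mean, so in floating-point arithmetic the two-pass form (31.6) is usually the more accurate of the two.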
Calculate \sum_{i=1}^{N} x_i and \sum_{i=1}^{N} x_i^2 for the data given in table 31.1 and hence find the mean
and standard deviation of the sample.
From table 31.1, we obtain
\[ \sum_{i=1}^{N} x_i = 188.7 + 204.7 + \cdots + 200.0 = 1479.8, \]
\[ \sum_{i=1}^{N} x_i^2 = (188.7)^2 + (204.7)^2 + \cdots + (200.0)^2 = 275\,334.36. \]
Since N = 8, we find as before (quoting the final results to one decimal place)
\[ \bar{x} = \frac{1479.8}{8} = 185.0, \qquad s = \sqrt{ \frac{275\,334.36}{8} - \left( \frac{1479.8}{8} \right)^2 } = 14.2. \]
31.2.3 Moments and central moments
By analogy with our discussion of probability distributions in section 30.5, the
sample mean and variance may also be described respectively as the first moment
and second central moment of the sample. In general, for a sample xi , i =
1, 2, . . . , N, we define the rth moment mr and rth central moment nr as
\[ m_r = \frac{1}{N} \sum_{i=1}^{N} x_i^r, \qquad (31.9) \]
\[ n_r = \frac{1}{N} \sum_{i=1}^{N} (x_i - m_1)^r. \qquad (31.10) \]
Thus the sample mean x̄ and variance s^2 may also be written as m_1 and n_2
respectively. As is common practice, we have introduced a notation in which
a sample statistic is denoted by the Roman letter corresponding to whichever
Greek letter is used to describe the corresponding population statistic. Thus, we
use m_r and n_r to denote the rth moment and central moment of a sample, since
in section 30.5 we denoted the rth moment and central moment of a population
by µ_r and ν_r respectively.
This notation is particularly useful, since the rth central moment of a sample,
n_r, may be expressed in terms of the rth- and lower-order sample moments m_r in a
way exactly analogous to that derived in subsection 30.5.5 for the corresponding
population statistics. As discussed in the previous section, the sample variance is
given by s^2 = \overline{x^2} − x̄^2 but this may also be written as n_2 = m_2 − m_1^2, which is to be
compared with the corresponding relation ν_2 = µ_2 − µ_1^2 derived in subsection 30.5.3
for population statistics. This correspondence also holds for higher-order central
moments of the sample. For example,
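\[ \begin{aligned}
n_3 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - m_1)^3 \\
    &= \frac{1}{N} \sum_{i=1}^{N} (x_i^3 - 3m_1 x_i^2 + 3m_1^2 x_i - m_1^3) \\
    &= m_3 - 3m_1 m_2 + 3m_1^2 m_1 - m_1^3 \\
    &= m_3 - 3m_1 m_2 + 2m_1^3,
\end{aligned} \qquad (31.11) \]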
which may be compared with equation (30.53) in the previous chapter.
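The relations n_2 = m_2 − m_1^2 and (31.11) can be verified numerically with a sketch of definitions (31.9) and (31.10). The function names and the small test sample are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
import math


def moment(values, r):
    """(31.9): rth moment m_r of the sample."""
    return math.fsum(v ** r for v in values) / len(values)


def central_moment(values, r):
    """(31.10): rth central moment n_r, taken about the sample mean m_1."""
    m1 = moment(values, 1)
    return math.fsum((v - m1) ** r for v in values) / len(values)


# A small hypothetical sample: m_1 = 3, m_2 = 12.5, m_3 = 63.
sample = [1.0, 2.0, 3.0, 6.0]
m1, m2, m3 = (moment(sample, r) for r in (1, 2, 3))
n2 = central_moment(sample, 2)
n3 = central_moment(sample, 3)

# Right-hand sides of n_2 = m_2 - m_1^2 and of (31.11).
n2_from_moments = m2 - m1 ** 2
n3_from_moments = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3
```

For this sample both routes give n_2 = 3.5 and n_3 = 4.5.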
Mirroring our discussion of the normalised central moments γr of a population
in subsection 30.5.5, we can also describe a sample in terms of the dimensionless
quantities
\[ g_k = \frac{n_k}{n_2^{k/2}} = \frac{n_k}{s^k}; \]
g_3 and g_4 are called the sample skewness and kurtosis. Likewise, it is common to
define the excess kurtosis of a sample by g_4 − 3.
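A minimal sketch of the normalised quantities g_k follows, using the 1/N central moments defined in (31.10). The function names and the test sample are hypothetical. Library routines for skewness and kurtosis may apply small-sample corrections, so their results need not agree exactly with this definition.

```python
import math


def central_moment(values, r):
    """rth central moment n_r about the sample mean, with a 1/N factor."""
    n = len(values)
    mean = math.fsum(values) / n
    return math.fsum((v - mean) ** r for v in values) / n


def normalised_moment(values, k):
    """g_k = n_k / n_2^(k/2), which is dimensionless."""
    return central_moment(values, k) / central_moment(values, 2) ** (k / 2.0)


def skewness(values):
    return normalised_moment(values, 3)


def kurtosis(values):
    return normalised_moment(values, 4)


def excess_kurtosis(values):
    return kurtosis(values) - 3.0


# Hypothetical sample with n_2 = 3.5, n_3 = 4.5 and n_4 = 24.5.
sample = [1.0, 2.0, 3.0, 6.0]
g3 = skewness(sample)
g4 = kurtosis(sample)
```

A positive g_3, as here, indicates a longer tail towards large values; a sample that is symmetric about its mean has g_3 = 0.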
31.2.4 Covariance and correlation
So far we have assumed that each data item of the sample consists of a single
number. Now let us suppose that each item of data consists of a pair of numbers,
so that the sample is given by (xi , yi ), i = 1, 2, . . . , N.
We may calculate the sample means, x̄ and ȳ, and sample variances, s_x^2 and
s_y^2, of the x_i and y_i values individually but these statistics do not provide any
measure of the relationship between the xi and yi . By analogy with our discussion
in subsection 30.12.3 we measure any interdependence between the xi and yi in
terms of the sample covariance, which is given by
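\[ \begin{aligned}
V_{xy} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \\
       &= \overline{(x - \bar{x})(y - \bar{y})} \\
       &= \overline{xy} - \bar{x}\bar{y}.
\end{aligned} \qquad (31.12) \]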
Writing out the last expression in full, we obtain the form most useful for
calculations, which reads
\[ V_{xy} = \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \frac{1}{N^2} \left( \sum_{i=1}^{N} x_i \right) \left( \sum_{i=1}^{N} y_i \right). \]
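The two forms of the sample covariance, the definition in (31.12) and the version written in terms of sums, can be compared with a short sketch. The function names and the paired data are hypothetical.

```python
import math


def covariance_from_definition(xs, ys):
    """First line of (31.12): mean product of deviations from the two means."""
    n = len(xs)
    x_bar = math.fsum(xs) / n
    y_bar = math.fsum(ys) / n
    return math.fsum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


def covariance_from_sums(xs, ys):
    """Computational form: (1/N) sum(x_i y_i) - (1/N^2) sum(x_i) sum(y_i)."""
    n = len(xs)
    sum_xy = math.fsum(x * y for x, y in zip(xs, ys))
    return sum_xy / n - math.fsum(xs) * math.fsum(ys) / n ** 2


# Hypothetical paired data with means 2.5 and 5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
v_def = covariance_from_definition(xs, ys)
v_sums = covariance_from_sums(xs, ys)
```

Both functions return 2.75 for this data set. As with the variance, the form based on sums needs only one pass through the data but is more exposed to rounding error.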
[Figure 31.1: six scatter-plot panels of y against x, labelled r_xy = 0.0, 0.1, 0.5, −0.7, −0.9 and 0.99.]
Figure 31.1 Scatter plots for two-dimensional data samples of size N = 1000,
with various values of the correlation r. No scales are plotted, since the value
of r is unaffected by shifts of origin or changes of scale in x and y.
We may also define the closely related sample correlation by
\[ r_{xy} = \frac{V_{xy}}{s_x s_y}, \]
which can take values between −1 and +1. If the x_i and y_i are independent then
V_xy = 0 = r_xy, and from (31.12) we see that \overline{xy} = x̄ȳ. It should also be noted
that the value of r_xy is not altered by shifts in the origin or by changes in the
scale of the x_i or y_i. In other words, if x' = ax + b and y' = cy + d, where a,
b, c, d are constants (with a and c of the same sign), then r_{x'y'} = r_xy. Figure 31.1 shows scatter plots for several
two-dimensional random samples x_i, y_i of size N = 1000, each with a different
value of r_xy.
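The invariance of the sample correlation under shifts of origin and changes of scale can be checked with the sketch below. The function name, the data and the constants a, b, c, d are all hypothetical. The last line shows that a negative scale factor applied to only one variable reverses the sign of the correlation while leaving its magnitude unchanged.

```python
import math


def correlation(xs, ys):
    """Sample correlation r_xy = V_xy / (s_x s_y), using 1/N throughout."""
    n = len(xs)
    x_bar = math.fsum(xs) / n
    y_bar = math.fsum(ys) / n
    v_xy = math.fsum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(math.fsum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(math.fsum((y - y_bar) ** 2 for y in ys) / n)
    return v_xy / (s_x * s_y)


# Hypothetical paired data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
r = correlation(xs, ys)

# Shift and rescale both variables with positive scale factors.
a, b, c, d = 2.5, -7.0, 0.1, 30.0
r_transformed = correlation([a * x + b for x in xs], [c * y + d for y in ys])

# A negative scale factor on x alone flips the sign of the correlation.
r_flipped = correlation([-a * x + b for x in xs], [c * y + d for y in ys])
```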
Ten UK citizens are selected at random and their heights and weights are found to be as
follows (to the nearest cm or kg respectively):
Person        A    B    C    D    E    F    G    H    I    J
Height (cm)  194  168  177  180  171  190  151  169  175  182
Weight (kg)   75   53   72   80   75   75   57   67   46   68
Calculate the sample correlation between the heights and weights.
In order to find the sample correlation, we begin by calculating the following sums (where
xi are the heights and yi are the weights)
\[ \sum_i x_i = 1757, \qquad \sum_i y_i = 668, \]
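The sums quoted above, and the correlation that the example asks for, can be checked numerically. This is only a sketch using the definitions of this section (1/N factors throughout); the variable and function names are illustrative.

```python
import math

# Heights (cm) and weights (kg) of persons A to J from the table above.
heights = [194, 168, 177, 180, 171, 190, 151, 169, 175, 182]
weights = [75, 53, 72, 80, 75, 75, 57, 67, 46, 68]


def correlation(xs, ys):
    """Sample correlation r_xy = V_xy / (s_x s_y)."""
    n = len(xs)
    x_bar = math.fsum(xs) / n
    y_bar = math.fsum(ys) / n
    v_xy = math.fsum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(math.fsum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(math.fsum((y - y_bar) ** 2 for y in ys) / n)
    return v_xy / (s_x * s_y)


sum_x = sum(heights)
sum_y = sum(weights)
r_xy = correlation(heights, weights)
print(f"sum x = {sum_x}, sum y = {sum_y}, r_xy = {r_xy:.2f}")
```

With the values in the table this gives a sample correlation of roughly 0.54.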