Sample statistics
STATISTICS

therefore consider the $x_i$ as a set of $N$ random variables. In the most general case, these random variables will be described by some $N$-dimensional joint probability density function $P(x_1, x_2, \ldots, x_N)$.§ In other words, an experiment consisting of $N$ measurements is considered as a single random sample from the joint distribution (or population) $P(\mathbf{x})$, where $\mathbf{x}$ denotes a point in the $N$-dimensional data space having coordinates $(x_1, x_2, \ldots, x_N)$.

The situation is simplified considerably if the sample values $x_i$ are independent. In this case, the $N$-dimensional joint distribution $P(\mathbf{x})$ factorises into the product of $N$ one-dimensional distributions,

$$P(\mathbf{x}) = P(x_1)P(x_2)\cdots P(x_N). \tag{31.1}$$

In the general case, each of the one-dimensional distributions $P(x_i)$ may be different. A typical example of this occurs when $N$ independent measurements are made of some quantity $x$ but the accuracy of the measuring procedure varies between measurements.

It is often the case, however, that each sample value $x_i$ is drawn independently from the same population. In this case, $P(\mathbf{x})$ is of the form (31.1), but, in addition, $P(x_i)$ has the same form for each value of $i$. The measurements $x_1, x_2, \ldots, x_N$ are then said to form a random sample of size $N$ from the one-dimensional population $P(x)$. This is the most common situation met in practice and, unless stated otherwise, we will assume from now on that this is the case.

31.2 Sample statistics

Suppose we have a set of $N$ measurements $x_1, x_2, \ldots, x_N$. Any function of these measurements (that contains no unknown parameters) is called a sample statistic, or often simply a statistic. Sample statistics provide a means of characterising the data. Although the resulting characterisation is inevitably incomplete, it is useful to be able to describe a set of data in terms of a few pertinent numbers. We now discuss the most commonly used sample statistics.
§ In this chapter, we will adopt the common convention that $P(x)$ denotes the particular probability density function that applies to its argument, $x$. This obviates the need to use a different letter for the PDF of each new variable. For example, if $X$ and $Y$ are random variables with different PDFs, then properly one should denote these distributions by $f(x)$ and $g(y)$, say. In our shorthand notation, these PDFs are denoted by $P(x)$ and $P(y)$, where it is understood that the functional form of the PDF may be different in each case.

    188.7   204.7   193.2   169.0
    168.1   189.8   166.3   200.0

Table 31.1 Experimental data giving eight measurements of the round trip time in milliseconds for a computer 'packet' to travel from Cambridge UK to Cambridge MA.

31.2.1 Averages

The simplest number used to characterise a sample is the mean, which for $N$ values $x_i$, $i = 1, 2, \ldots, N$, is defined by

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i. \tag{31.2}$$

In words, the sample mean is the sum of the sample values divided by the number of values in the sample.

Table 31.1 gives eight values for the round trip time in milliseconds for a computer 'packet' to travel from Cambridge UK to Cambridge MA. Find the sample mean.

Using (31.2) the sample mean in milliseconds is given by

$$\bar{x} = \tfrac{1}{8}(188.7 + 204.7 + 193.2 + 169.0 + 168.1 + 189.8 + 166.3 + 200.0) = \frac{1479.8}{8} = 184.975.$$

Since the sample values in table 31.1 are quoted to an accuracy of one decimal place, it is usual to quote the mean to the same accuracy, i.e. as $\bar{x} = 185.0$.

Strictly speaking the mean given by (31.2) is the arithmetic mean and this is by far the most common definition used for a mean. Other definitions of the mean are possible, though less common, and include

(i) the geometric mean,

$$\bar{x}_g = \left(\prod_{i=1}^{N} x_i\right)^{1/N}, \tag{31.3}$$

(ii) the harmonic mean,

$$\bar{x}_h = \frac{N}{\sum_{i=1}^{N} 1/x_i}, \tag{31.4}$$

(iii) the root mean square,

$$\bar{x}_{\rm rms} = \left(\frac{\sum_{i=1}^{N} x_i^2}{N}\right)^{1/2}. \tag{31.5}$$

It should be noted that $\bar{x}$, $\bar{x}_h$ and $\bar{x}_{\rm rms}$ would remain well defined even if some sample values were negative, but the value of $\bar{x}_g$ could then become complex. The geometric mean should not be used in such cases.

Calculate $\bar{x}_g$, $\bar{x}_h$ and $\bar{x}_{\rm rms}$ for the sample given in table 31.1.

The geometric mean is given by (31.3) to be

$$\bar{x}_g = (188.7 \times 204.7 \times \cdots \times 200.0)^{1/8} = 184.4.$$

The harmonic mean is given by (31.4) to be

$$\bar{x}_h = \frac{8}{(1/188.7) + (1/204.7) + \cdots + (1/200.0)} = 183.9.$$

Finally, the root mean square is given by (31.5) to be

$$\bar{x}_{\rm rms} = \left[\tfrac{1}{8}(188.7^2 + 204.7^2 + \cdots + 200.0^2)\right]^{1/2} = 185.5.$$

Two other measures of the 'average' of a sample are its mode and median. The mode is simply the most commonly occurring value in the sample. A sample may possess several modes, however, and thus it can be misleading in such cases to use the mode as a measure of the average of the sample. The median of a sample is the halfway point when the sample values $x_i$ ($i = 1, 2, \ldots, N$) are arranged in ascending (or descending) order. Clearly, this depends on whether the size of the sample, $N$, is odd or even. If $N$ is odd then the median is simply equal to $x_{(N+1)/2}$, whereas if $N$ is even the median of the sample is usually taken to be $\tfrac{1}{2}(x_{N/2} + x_{(N/2)+1})$.

Find the mode and median of the sample given in table 31.1.

From the table we see that each sample value occurs exactly once, and so any value may be called the mode of the sample. To find the sample median, we first arrange the sample values in ascending order and obtain

    166.3, 168.1, 169.0, 188.7, 189.8, 193.2, 200.0, 204.7.

Since the number of sample values $N = 8$, which is even, the median of the sample is

$$\tfrac{1}{2}(x_4 + x_5) = \tfrac{1}{2}(188.7 + 189.8) = 189.25.$$

31.2.2 Variance and standard deviation

The variance and standard deviation both give a measure of the spread of values in a sample about the sample mean $\bar{x}$.
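Before these measures of spread are defined, the averages of subsection 31.2.1 can be checked numerically. The following is a minimal Python sketch rather than part of the original text: the function names are illustrative, and computing the geometric mean via logarithms is an implementation assumption that requires strictly positive data. For the data of table 31.1 it should agree, to one decimal place, with the values quoted in subsection 31.2.1.

```python
import math

# Round-trip times in milliseconds, taken from table 31.1.
data = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]


def arithmetic_mean(xs):
    """Arithmetic mean, equation (31.2)."""
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """Geometric mean, equation (31.3), via logs (positive values only)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    """Harmonic mean, equation (31.4)."""
    return len(xs) / sum(1.0 / x for x in xs)


def root_mean_square(xs):
    """Root mean square, equation (31.5)."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def median(xs):
    """Middle value for odd N; mean of the two middle values for even N."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


for name, func in [
    ("arithmetic mean", arithmetic_mean),
    ("geometric mean", geometric_mean),
    ("harmonic mean", harmonic_mean),
    ("root mean square", root_mean_square),
    ("median", median),
]:
    print(f"{name}: {func(data):.3f}")
```

The mode is omitted because, as noted in subsection 31.2.1, every value in this sample occurs exactly once.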
The sample variance is defined by

$$s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2, \tag{31.6}$$

and the sample standard deviation is the positive square root of the sample variance, i.e.

$$s = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2}. \tag{31.7}$$

Find the sample variance and sample standard deviation of the data given in table 31.1.

We have already found that the sample mean is 185.0 to one decimal place. However, when the mean is to be used in the subsequent calculation of the sample variance it is better to use the most accurate value available. In this case the exact value is 184.975, and so using (31.6),

$$s^2 = \tfrac{1}{8}\left[(188.7 - 184.975)^2 + \cdots + (200.0 - 184.975)^2\right] = \frac{1608.36}{8} = 201.0,$$

where once again we have quoted the result to one decimal place. The sample standard deviation is then given by $s = \sqrt{201.0} = 14.2$. As it happens, in this case the difference between the true mean and the rounded value is very small compared with the variation of the individual readings about the mean and using the rounded value has a negligible effect; however, this would not be so if the difference were comparable to the sample standard deviation.

Using the definition (31.7), it is clear that in order to calculate the standard deviation of a sample we must first calculate the sample mean. This requirement can be avoided, however, by using an alternative form for $s^2$. From (31.6), we see that

$$\begin{aligned}
s^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 \\
&= \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\sum_{i=1}^{N} 2x_i\bar{x} + \frac{1}{N}\sum_{i=1}^{N} \bar{x}^2 \\
&= \overline{x^2} - 2\bar{x}^2 + \bar{x}^2 = \overline{x^2} - \bar{x}^2.
\end{aligned}$$

We may therefore write the sample variance $s^2$ as

$$s^2 = \overline{x^2} - \bar{x}^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2, \tag{31.8}$$

from which the sample standard deviation is found by taking the positive square root. Thus, by evaluating the quantities $\sum_{i=1}^{N} x_i$ and $\sum_{i=1}^{N} x_i^2$ for our sample, we can calculate the sample mean and sample standard deviation at the same time.

Calculate $\sum_{i=1}^{N} x_i$ and $\sum_{i=1}^{N} x_i^2$ for the data given in table 31.1 and hence find the mean and standard deviation of the sample.
From table 31.1, we obtain

$$\sum_{i=1}^{N} x_i = 188.7 + 204.7 + \cdots + 200.0 = 1479.8,$$

$$\sum_{i=1}^{N} x_i^2 = (188.7)^2 + (204.7)^2 + \cdots + (200.0)^2 = 275\,334.36.$$

Since $N = 8$, we find as before (quoting the final results to one decimal place)

$$\bar{x} = \frac{1479.8}{8} = 185.0, \qquad s = \sqrt{\frac{275\,334.36}{8} - \left(\frac{1479.8}{8}\right)^2} = 14.2.$$

31.2.3 Moments and central moments

By analogy with our discussion of probability distributions in section 30.5, the sample mean and variance may also be described respectively as the first moment and second central moment of the sample. In general, for a sample $x_i$, $i = 1, 2, \ldots, N$, we define the $r$th moment $m_r$ and $r$th central moment $n_r$ as

$$m_r = \frac{1}{N}\sum_{i=1}^{N} x_i^r, \tag{31.9}$$

$$n_r = \frac{1}{N}\sum_{i=1}^{N} (x_i - m_1)^r. \tag{31.10}$$

Thus the sample mean $\bar{x}$ and variance $s^2$ may also be written as $m_1$ and $n_2$ respectively. As is common practice, we have introduced a notation in which a sample statistic is denoted by the Roman letter corresponding to whichever Greek letter is used to describe the corresponding population statistic. Thus, we use $m_r$ and $n_r$ to denote the $r$th moment and central moment of a sample, since in section 30.5 we denoted the $r$th moment and central moment of a population by $\mu_r$ and $\nu_r$ respectively.

This notation is particularly useful, since the $r$th central moment of a sample, $n_r$, may be expressed in terms of the $r$th- and lower-order sample moments $m_r$ in a way exactly analogous to that derived in subsection 30.5.5 for the corresponding population statistics. As discussed in the previous subsection, the sample variance is given by $s^2 = \overline{x^2} - \bar{x}^2$, but this may also be written as $n_2 = m_2 - m_1^2$, which is to be compared with the corresponding relation $\nu_2 = \mu_2 - \mu_1^2$ derived in subsection 30.5.3 for population statistics. This correspondence also holds for higher-order central moments of the sample.
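The definitions (31.9) and (31.10), and the relation between the second central moment and the first two moments, can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `moment` and `central_moment` are illustrative, and the data are those of table 31.1.

```python
import math

# Round-trip times in milliseconds, taken from table 31.1.
data = [188.7, 204.7, 193.2, 169.0, 168.1, 189.8, 166.3, 200.0]


def moment(xs, r):
    """r-th sample moment m_r, equation (31.9)."""
    return sum(x ** r for x in xs) / len(xs)


def central_moment(xs, r):
    """r-th sample central moment n_r, equation (31.10)."""
    m1 = moment(xs, 1)
    return sum((x - m1) ** r for x in xs) / len(xs)


m1 = moment(data, 1)          # sample mean
m2 = moment(data, 2)          # mean of the squares
n2 = central_moment(data, 2)  # sample variance s^2

print(f"m1 = {m1:.3f}")
print(f"n2 (direct)    = {n2:.3f}")
print(f"m2 - m1^2      = {m2 - m1 ** 2:.3f}")
print(f"s = sqrt(n2)   = {math.sqrt(n2):.3f}")
```

The two routes to the variance should agree up to floating-point rounding. The same kind of numerical check can be applied to the higher-order central moments.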
For example,

$$\begin{aligned}
n_3 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - m_1)^3 \\
&= \frac{1}{N}\sum_{i=1}^{N} (x_i^3 - 3m_1 x_i^2 + 3m_1^2 x_i - m_1^3) \\
&= m_3 - 3m_1 m_2 + 3m_1^2 m_1 - m_1^3 \\
&= m_3 - 3m_1 m_2 + 2m_1^3,
\end{aligned} \tag{31.11}$$

which may be compared with equation (30.53) in the previous chapter.

Mirroring our discussion of the normalised central moments $\gamma_r$ of a population in subsection 30.5.5, we can also describe a sample in terms of the dimensionless quantities

$$g_k = \frac{n_k}{n_2^{k/2}} = \frac{n_k}{s^k};$$

$g_3$ and $g_4$ are called the sample skewness and kurtosis. Likewise, it is common to define the excess kurtosis of a sample by $g_4 - 3$.

31.2.4 Covariance and correlation

So far we have assumed that each data item of the sample consists of a single number. Now let us suppose that each item of data consists of a pair of numbers, so that the sample is given by $(x_i, y_i)$, $i = 1, 2, \ldots, N$.

We may calculate the sample means, $\bar{x}$ and $\bar{y}$, and sample variances, $s_x^2$ and $s_y^2$, of the $x_i$ and $y_i$ values individually but these statistics do not provide any measure of the relationship between the $x_i$ and $y_i$. By analogy with our discussion in subsection 30.12.3 we measure any interdependence between the $x_i$ and $y_i$ in terms of the sample covariance, which is given by

$$V_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = \overline{(x - \bar{x})(y - \bar{y})} = \overline{xy} - \bar{x}\bar{y}. \tag{31.12}$$

Writing out the last expression in full, we obtain the form most useful for calculations, which reads

$$V_{xy} = \frac{1}{N}\sum_{i=1}^{N} x_i y_i - \frac{1}{N^2}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right).$$

[Figure 31.1: six scatter-plot panels with axes $x$ and $y$, labelled $r_{xy} = 0.0$, $0.1$, $0.5$, $-0.7$, $-0.9$ and $0.99$.]

Figure 31.1 Scatter plots for two-dimensional data samples of size $N = 1000$, with various values of the correlation $r$. No scales are plotted, since the value of $r$ is unaffected by shifts of origin or changes of scale in $x$ and $y$.

We may also define the closely related sample correlation by

$$r_{xy} = \frac{V_{xy}}{s_x s_y},$$

which can take values between $-1$ and $+1$. If the $x_i$ and $y_i$ are independent then $V_{xy} = 0 = r_{xy}$, and from (31.12) we see that $\overline{xy} = \bar{x}\bar{y}$.
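The computational form of $V_{xy}$ and the definition of $r_{xy}$ translate directly into code. The following is a minimal Python sketch, not part of the original text: the function names are illustrative, and the five data pairs are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def covariance(xs, ys):
    """Sample covariance V_xy from the sums of x, y and xy (1/N convention)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / n - (sum_x * sum_y) / n ** 2


def correlation(xs, ys):
    """Sample correlation r_xy = V_xy / (s_x s_y)."""
    s_x = math.sqrt(covariance(xs, xs))  # V_xx is the sample variance of x
    s_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (s_x * s_y)


# Hypothetical illustrative data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

print(f"V_xy = {covariance(xs, ys):.4f}")
print(f"r_xy = {correlation(xs, ys):.4f}")
```

Reusing `covariance(xs, xs)` for the variance keeps the sketch short and makes explicit that $s_x^2$ is the covariance of the $x_i$ with themselves. For data with a large mean and a small spread, the subtraction in this one-pass form can lose precision, in which case the defining form (31.12) is the safer choice.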
It should also be noted that the value of $r_{xy}$ is not altered by shifts in the origin or by changes in the scale of the $x_i$ or $y_i$. In other words, if $x' = ax + b$ and $y' = cy + d$, where $a, b, c, d$ are constants with $a$ and $c$ of the same sign, then $r_{x'y'} = r_{xy}$. Figure 31.1 shows scatter plots for several two-dimensional random samples $x_i, y_i$ of size $N = 1000$, each with a different value of $r_{xy}$.

Ten UK citizens are selected at random and their heights and weights are found to be as follows (to the nearest cm or kg respectively):

    Person         A    B    C    D    E    F    G    H    I    J
    Height (cm)  194  168  177  180  171  190  151  169  175  182
    Weight (kg)   75   53   72   80   75   75   57   67   46   68

Calculate the sample correlation between the heights and weights.

In order to find the sample correlation, we begin by calculating the following sums (where $x_i$ are the heights and $y_i$ are the weights)

$$\sum_i x_i = 1757, \qquad \sum_i y_i = 668,$$