MATH38141 Regression Analysis
Semester 1, 2022-23.
Dr. Wentao Li Department of Mathematics, University of Manchester
Contents

1 Introduction
  1.1 A motivating case
  1.2 General problem and terminology
  1.3 General procedure
  1.4 Uses and limitations

2 Simple Linear Regression
  2.1 Model formulation
  2.2 Model fitting by least squares method
  2.3 Analysis of variance (under normal error)
  2.4 Coefficient of determination
  2.5 Inference about regression parameters (under normal error)
  2.6 Prediction

3 General Linear Model (GLM)
  3.1 Model formulation
  3.2 Model fitting by least squares method
  3.3 Comparing Two Nested Models
  3.4 Inference about regression parameters
  3.5 Prediction Intervals

4 Regression Diagnosis
  4.1 Introduction
  4.2 Leverage
  4.3 Normal probability plot
  4.4 Outliers
  4.5 Influential observations
  4.6 Multicollinearity
  4.7 Numerical example

5 Classification Models
  5.1 Introduction
  5.2 One-way classification models (One-way ANOVA)
  5.3 Two-way classification models (Two-way ANOVA)
  5.4 Model fitting
  5.5 Test for treatment effects: one-way ANOVA
  5.6 Two-way ANOVA

6 Extraneous Variation
  6.1 Introduction
  6.2 Randomized block design
  6.3 Analysis of covariance (ANCOVA)
1 Introduction

1.1 A motivating case
1.1.1 Example 1.1.1 The following dataset records the plasma levels of total cholesterol (in mg/ml) of 24 patients with hyperlipoproteinaemia admitted to a hospital:

3.5  1.9  4.0  2.6  4.5  3.0  2.9  3.8  2.1  3.8  4.1  3.0
2.5  4.6  3.2  4.2  2.3  4.0  4.3  3.9  3.3  3.2  2.5  3.3
Fig. 1.1.1(a) gives a scatterplot of the data.

[Figure: Scatterplot (a) plots cholesterol level against patient index; Scatterplot (b) plots cholesterol level against age.]
Figure 1.1.1: Plasma levels of total cholesterol (in mg/ml).

Question: Predict the cholesterol level of the next patient to be admitted to the hospital with hyperlipoproteinaemia.

Intuitive answer: use the average of the 24 observations, 3.354 (the horizontal reference line in Fig. 1.1.1(a)). The observations scatter around this average, but are subject to considerable fluctuations. This answer is justifiable if the observations are i.i.d.; in the absence of further information, it seems to be the best we can do.
1.1.2 Suppose the hospital has also collected data on the ages of the 24 patients:

20  52  30  57  25  28  36  22  43  57  33
63  40  48  28  49  52  58  29  34  24  50
Each observation (corresponding to each patient) consists of values of two variables.
(X, Y) = (age, cholesterol level). Fig. 1.1.1(b) plots the cholesterol levels (Y) against the ages (X) for the dataset. The plot shows a strong linear relationship between X and Y. It therefore seems more reliable to assume a linear function relating age and cholesterol level, and to predict the next patient's cholesterol level based on his/her age.

1.1.3 Fig. 1.1.1(b) fits a sloped straight line to the scatterplot with the least squares method (to be discussed later). This straight line summarises the relationship between cholesterol level and age and can be used for predicting future patients' cholesterol levels. Compared to Fig. 1.1.1(a), the fluctuations of the 24 observations around the sloped straight line are much smaller. A function linear in age (the sloped straight line) better accounts for the observed variation in cholesterol level than a simple constant function (the horizontal line). A crucial question of interest to statisticians: how much "better" is the "sloped straight line" model than the "horizontal line" model?

1.1.4 The above example highlights the importance, for data analysis, of collecting data on other variables (e.g. age) relevant to the main variable of interest (e.g. cholesterol level), in order to obtain a model which can better explain the observed variation in the main variable.
1.2 General problem and terminology
1.2.1 Typical observational or experimental studies involve the drawing of a sample of observations from a population about which inference is to be made.

1.2.2 In general, each observation consists of measurements on a number of variables related to an individual experimental/observational unit sampled from the population.
Variable of primary interest — response or dependent variable.
Remaining variables — explanatory or independent variables, also known as regressors or covariates.

Example 1.2.1
1. In an opinion poll, information is collected on n members sampled from a community. Each observation can be represented in the form (sex, age, educational level, opinion).
2. In a study of the property market, recent transactions are sampled. Each observation may have the form (area, building age, facilities, location, price).
3. A clinical trial is conducted on a sample of n patients, some receiving a new medical treatment and the rest an old treatment. Each observation may have the form
(age, gender, past medical record, smoking behaviour, type of medical treatment applied, response to treatment).
4. To study gravitational force, a physicist varies the length of a pendulum and measures its period on n separate occasions, giving n observations of the form (pendulum length, period).
5. An electrician wants to determine the resistance of an electrical circuit. He passes several pre-specified currents through the circuit and measures the corresponding voltages. Each observation has the form (current, voltage).
Example   Response                 Explanatory variables
1         opinion                  sex, age, educational level
2         price                    area, building age, facilities, location
3         response to treatment    age, gender, past medical record, smoking behaviour, type of medical treatment applied
4         period                   length of pendulum
5         voltage                  current
1.2.3 Our objective in linear modelling is to study the relationship between the explanatory and response variables, based on the sample collected. Typical questions to ask include: Does a certain explanatory variable affect the response significantly? Can a simple statistical model explain the relation between the response and the explanatory variables? Can we predict a future response based on the values of the explanatory variables?

1.2.4 A variable may be quantitative or qualitative (categorical). Quantitative variables can be measured numerically: e.g. age, income, time, temperature, etc. Qualitative variables are not numerical in nature: e.g. sex, education level, type of crime committed, style of cuisine served in a restaurant, etc. The next few chapters will focus only on quantitative variables. Later chapters will also consider qualitative explanatory variables.
1.3 General procedure
1.3.1 Practical linear modelling consists of the following phases:
1. Graphical display of observed data • scatterplot, scatterplot matrix, etc.
2. Model formulation • are we fitting a straight line, a parabola, or something else?
• no useful model can fit the observed data perfectly; how do we formulate a model to account for any residual discrepancies?
3. Model fitting • calculate the "best" estimates of the parameters in the model • "best" in what sense? how do we calculate them — by some kind of optimisation?
4. Model adequacy checking • ascertain the quality of fit • is the model adequate? should we modify our model to obtain a better fit? We may need to iterate the above phases many times before arriving at a satisfactory model.
5. Making inference • try to answer questions of interest (depending on the context of the problem) • quantify confidence in our answers using statistical arguments • make suggestions, conclusions, etc.
1.4 Uses and limitations
1.4.1 Uses of linear models include: • Data description — e.g. Model equations are effective mathematical devices for summarising the observed data.
• parameter estimation — e.g. the unknown resistance of an electrical circuit is estimated by fitting a straight line to a number of (current, voltage) pairs and using the formula "voltage = current × resistance".
• prediction and estimation — e.g. predicting future observations for planning and decision-making purposes.
• control — e.g. adjust the setting of the explanatory variables to yield the desired response level (but this usage is possible only if causality has been confirmed between the explanatory variable and the response).
1.4.2 Linear modelling identifies relationships between variables, but does NOT imply causality. Causality must be established by further theoretical consideration. Linear modelling analysis only assists in confirming, but not proving, causality. 1.4.3 The linear model is only an "approximation" of the true relationship between variables. It is useful because it provides a simple and effective "portrait" of the relationship.
1.4.4 Model equations are valid only over the range of the observed data. Extrapolating our inference beyond this range risks committing serious errors.
2 Simple Linear Regression

2.1 Model formulation
2.1.1 Consider the simple case where there is just one explanatory variable X:
Y: response variable; X: explanatory variable.
A sample of n observations is observed in the form $(x_i, Y_i)$, $i = 1, \ldots, n$:

X: $x_1, x_2, \ldots, x_n$
Y: $Y_1, Y_2, \ldots, Y_n$
2.1.2 A simple linear regression model (SLR) assumes:
• Y is random, X is nonrandom,
• $Y = \beta_0 + \beta_1 X + \varepsilon$, where $\varepsilon$ is a random error with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$,
• $x_1, \ldots, x_n$ are observed values of X,
• $Y_1, \ldots, Y_n$ are independent copies of Y, observed at $X = x_1, \ldots, x_n$ respectively,
• $\beta_0$, $\beta_1$, $\sigma$ are unknown parameters.
Note: equivalently, $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_1, \ldots, \varepsilon_n$ are independent random errors with common mean 0 and common variance $\sigma^2$.
2.1.3 Under the SLR model, the responses $Y_i$ are expected to fluctuate about their mean $\beta_0 + \beta_1 x_i$ due to the random errors $\varepsilon_i$. A plot of $Y_i$ against $x_i$ is expected to exhibit a linear trend, subject to such random errors. See Fig. 1.1.1(b) for example.

2.1.4 Terminology —
$g(X) = \beta_0 + \beta_1 X$: regression function. $\beta_0$, $\beta_1$: regression coefficients.
Here $\beta_0$ = intercept, $\beta_1$ = slope of the regression function = change in $E[Y]$ per unit change in X.

2.1.5 Often we wish to
• estimate, and draw inferences about, the unknown parameters $\beta_0$, $\beta_1$, $\sigma$,
• test for significance of the regression function,
• predict future observations based on the model, etc.
2.2 Model fitting by least squares method

2.2.1 Fitting an SLR model to the data pairs is often called regressing Y on X, i.e.
to find estimates $\hat\beta_0$, $\hat\beta_1$ for $\beta_0$, $\beta_1$ such that the fitted regression line $y = \hat\beta_0 + \hat\beta_1 x$ "best" fits the observations.

2.2.2 Least squares method — find $\hat\beta_0$, $\hat\beta_1$ to minimise
\[
\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 \quad \text{w.r.t. } \beta_0, \beta_1.
\]
$\hat\beta_0$, $\hat\beta_1$ are known as the least squares estimates (lse) of $\beta_0$, $\beta_1$.

2.2.3 Define
\[
\bar x = \sum_i x_i / n, \qquad \bar Y = \sum_i Y_i / n,
\]
\[
S_{xx} = \sum_i (x_i - \bar x)^2 = \sum_i x_i^2 - n\bar x^2, \qquad
S_{yy} = \sum_i (Y_i - \bar Y)^2 = \sum_i Y_i^2 - n\bar Y^2,
\]
\[
S_{xy} = \sum_i (x_i - \bar x)(Y_i - \bar Y) = \sum_i (x_i - \bar x)Y_i = \sum_i x_i (Y_i - \bar Y) = \sum_i x_i Y_i - n\bar x \bar Y .
\]
The least squares estimates $\hat\beta_0$, $\hat\beta_1$ can be found by solving the normal equations:
\[
\sum_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \quad\Rightarrow\quad \hat\beta_0 + \hat\beta_1 \bar x = \bar Y,
\]
\[
\sum_i x_i (Y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \quad\Rightarrow\quad \hat\beta_0 \bar x + \hat\beta_1 \frac{\sum_i x_i^2}{n} = \frac{\sum_i x_i Y_i}{n},
\]
giving
\[
\hat\beta_0 = \bar Y - \hat\beta_1 \bar x \qquad\text{and}\qquad \hat\beta_1 = S_{xy}/S_{xx}.
\]

2.2.4 Based on the lse $\hat\beta_0$, $\hat\beta_1$, we may compute
• fitted or predicted values: $\hat Y_i = \hat\beta_0 + \hat\beta_1 x_i$, for $i = 1, \ldots, n$;
• residuals: $\hat\varepsilon_i = Y_i - \hat Y_i$, for $i = 1, \ldots, n$.

2.2.5 The $i$th residual $\hat\varepsilon_i$ is the departure of the $i$th response $Y_i$ from its fitted value $\hat Y_i$. It mirrors the behaviour of the (unobservable) random error $\varepsilon_i$.

2.2.6 Example 1.1.1 (cont'd) Y — cholesterol level, X — age:
\[
\bar x = 39.417, \quad \bar Y = 3.354, \quad S_{xx} = 4139.834, \quad S_{xy} = 217.858,
\]
\[
\hat\beta_1 = S_{xy}/S_{xx} = 0.0526, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar x = 1.280.
\]
Fitted regression line: $y = 1.280 + 0.0526\,x$, as shown in Fig. 1.1.1(b). Note: graphically, the regression line is chosen to minimise the sum of squared vertical departures of all observations from the line.
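The least squares formulas in 2.2.2–2.2.3 translate directly into a few lines of code. The sketch below (Python with numpy, which is not part of these notes) computes $\hat\beta_0$, $\hat\beta_1$, fitted values and residuals; the function name and the tiny dataset at the bottom are hypothetical, purely for illustration.

```python
import numpy as np

def slr_fit(x, y):
    """Least squares fit of a simple linear regression y = b0 + b1*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)          # corrected sum of squares of x
    Sxy = np.sum((x - xbar) * (y - ybar))  # corrected cross-product
    b1 = Sxy / Sxx                         # slope estimate
    b0 = ybar - b1 * xbar                  # intercept estimate
    fitted = b0 + b1 * x
    resid = y - fitted
    return b0, b1, fitted, resid

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
b0, b1, fitted, resid = slr_fit(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Applied to the full (age, cholesterol) data of Example 1.1.1, the same computation reproduces $\hat\beta_1 \approx 0.0526$ and $\hat\beta_0 \approx 1.280$.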
2.3 Analysis of variance (under normal error)
2.3.1 Define $s^2 = \dfrac{\sum_i \hat\varepsilon_i^2}{n-2}$. It is an unbiased estimator of $\sigma^2$, i.e. $E[s^2] = \sigma^2$ (proof omitted).

2.3.2 To test hypotheses or construct confidence intervals, we often have to assume further that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. $N(0, \sigma^2)$. Note: equivalently, $Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2)$ independently.

2.3.3 Definition.
residual (or error) sum of squares: $SSE = \sum_i \hat\varepsilon_i^2 = S_{yy} - S_{xy}^2/S_{xx}$
total corrected sum of squares: $S_{yy} = \sum_{i=1}^{n} (Y_i - \bar Y)^2$
regression sum of squares: $SSR = S_{yy} - SSE = S_{xy}^2/S_{xx}$

2.3.4 Intuitively, the SLR model will be a good fit if the residuals $\hat\varepsilon_i$ are small. Thus SSE provides a measure of the goodness of fit of the model.

2.3.5 If we had ignored the explanatory variable X in the SLR model and fitted a fixed constant to the responses (i.e. assumed that the responses constitute an i.i.d. random sample), then the residual sum of squares would become $S_{yy}$.

2.3.6 Evidently, $S_{yy} \geq SSE$ in general. Their difference SSR measures how effective the variable X is at explaining the variation in the response Y.

2.3.7 Analysis of variance (ANOVA) provides a formal method for determining whether SSR is large enough to conclude that the fitted regression line is useful in explaining the variation in the response Y. The method is based on the partition $S_{yy} = SSR + SSE$, and tabulates results in the form of an ANOVA table:

Source      s.s.     d.f.    m.s.                F-ratio
Regression  SSR      1       MSR = SSR/1         F = MSR/MSE
Residual    SSE      n - 2   MSE = SSE/(n - 2)
Total       S_yy     n - 1

Features of the ANOVA table:
• bottom item = sum of the items above;
• each sum of squares (s.s.) accounts for a source of variation in Y;
• each s.s. is associated with a degree of freedom (d.f.);
• mean square (m.s.) = s.s. / d.f.;
• the F-ratio for a particular source of variation (other than the residual s.s.) is F = (m.s. due to the source) / (residual mean square, MSE).
2.3.8 Large F-ratio ⇒ SSR contributes much to $S_{yy}$ ⇒ X effective in explaining variation in Y.
Small F-ratio ⇒ SSR contributes little to $S_{yy}$ ⇒ X not effective in explaining variation in Y.

2.3.9 To judge whether the F-ratio F is big enough to indicate a significant effect of X on Y, we perform a formal test of
\[
H_0: \beta_1 = 0 \qquad\text{vs}\qquad H_1: \beta_1 \text{ unrestricted.}
\]
F test (of size $\alpha$): reject $H_0$ iff the p-value $P(F_{1,n-2} > F)$ is smaller than $\alpha$, or equivalently, reject $H_0$ iff $F > F_{1,n-2}^{(\alpha)}$, where $F_{a,b}^{(\alpha)}$ denotes the $\alpha$th upper quantile of the $F_{a,b}$ distribution.
2.3.10 Example 1.1.1 (cont'd) Unbiased estimate of $\sigma^2$:
\[
s^2 = \frac{\sum_{i=1}^{24} (Y_i - 1.280 - 0.0526\,x_i)^2}{24 - 2} = \frac{2.45481}{22} = 0.11158.
\]
ANOVA table:

Source      s.s.       d.f.   m.s.       F-ratio
Regression  11.46477   1      11.46477   102.7473
Residual     2.45481   22      0.11158
Total       13.91958   23

• Calculating by computer, the p-value $= P(F_{1,22} > 102.7473) = 9.4283 \times 10^{-10}$, which is extremely small. We conclude that there is a highly significant linear relationship between "age" and "cholesterol level". For a size $\alpha$ test, we read from standard statistical tables or computer software:

$\alpha$                                  5%      1%
Critical value $F_{1,22}^{(\alpha)}$      4.301   7.945
In either case, the F -ratio of 102.7473 is much bigger than the critical value, and we should reject H0 at the stated significance level.
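The arithmetic of 2.3.10 can be checked from the summary statistics alone. The sketch below (Python, using scipy for the F distribution; not part of these notes) assumes the $S_{xx}$, $S_{xy}$ and $S_{yy}$ values quoted in 2.2.6 and 2.3.10.

```python
from scipy.stats import f

# Summary statistics quoted in 2.2.6 and 2.3.10 for Example 1.1.1.
n, Sxx, Sxy, Syy = 24, 4139.834, 217.858, 13.91958

SSE = Syy - Sxy**2 / Sxx           # residual sum of squares
SSR = Syy - SSE                    # regression sum of squares
MSE = SSE / (n - 2)                # residual mean square (= s^2)
MSR = SSR / 1                      # regression mean square
F_ratio = MSR / MSE
p_value = f.sf(F_ratio, 1, n - 2)  # P(F_{1, n-2} > F)

print(F_ratio, p_value)            # approx 102.7 and 9.4e-10
```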
2.4 Coefficient of determination
2.4.1 Definition. The coefficient of determination is
\[
R^2 = 1 - \frac{SSE}{S_{yy}} = \frac{SSR}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}}.
\]

2.4.2 $R^2$ measures the proportion of the total variation in Y reduced, or "explained", by the regression.

2.4.3 Big $R^2$ ⇒ regression line effective in reducing total variation in Y.

2.4.4 $0 \leq R^2 \leq 1$.

2.4.5 A significant result for an F test of the regression does not necessarily imply a large $R^2$. The test is used to judge whether there is "evidence" of a linear relationship between Y and X, whereas $R^2$ measures how much of the variation in Y can be explained by that relationship.

2.4.6 Example 1.1.1 (cont'd) $R^2 = 11.465/13.920 = 0.824$, suggesting that the variable "age" is reasonably effective in explaining the variation in cholesterol level.

2.4.7 Example 2.4.1 Consider the following 12 observations of (X, Y):
X   1      2      3      4      5      6       7      8      9       10     11     12
Y   0.700  0.193  3.423  6.553  7.680  10.245  1.223  7.338  12.285  6.564  8.711  6.144

Regressing Y on X yields the ANOVA table:

Source      s.s.       d.f.   m.s.       F-ratio
Regression   54.8601   1      54.86010   5.152688
Residual    106.4689   10     10.64689
Total       161.3290   11

• p-value $= P(F_{1,10} > 5.152688) = 0.0466$, which is smaller than 0.05. Alternatively, check that the F-ratio $5.153 > F_{1,10}^{(0.05)} = 4.965$. Hence the linear relationship between X and Y is significant at the 5% level. The coefficient of determination is $R^2 = 0.3401$, which shows that the regression line is far from adequate in accounting for the variation in the Y values. Fig. 2.4.1 gives the scatterplot of the data, together with the fitted regression line. Clearly, the observed Y values display a strong upward trend as X increases, but the observations still suffer from large fluctuations about the fitted line. This illustrates the earlier remark that a statistically significant linear relationship does not necessarily imply a strong linear relationship between X and Y. In fact, a significant linear relationship essentially means that SLR is better than "no SLR", but says nothing about the absolute quality of the SLR fit.
Figure 2.4.1: Example 2.4.1 — scatterplot of Y against X (with fitted regression line).
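Because the 12 observations of Example 2.4.1 are listed in full, the ANOVA quantities and $R^2$ can be reproduced directly. The sketch below (Python with numpy/scipy, not part of these notes) should give approximately the values quoted above.

```python
import numpy as np
from scipy.stats import f

# Data of Example 2.4.1.
x = np.arange(1, 13, dtype=float)
y = np.array([0.700, 0.193, 3.423, 6.553, 7.680, 10.245,
              1.223, 7.338, 12.285, 6.564, 8.711, 6.144])

n = len(y)
Sxx = np.sum((x - x.mean())**2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean())**2)

SSR = Sxy**2 / Sxx
SSE = Syy - SSR
F_ratio = (SSR / 1) / (SSE / (n - 2))
p_value = f.sf(F_ratio, 1, n - 2)
R2 = SSR / Syy

print(round(F_ratio, 3), round(p_value, 4), round(R2, 4))
# approx 5.153, 0.0466, 0.3401
```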
2.5 Inference about regression parameters (under normal error)
2.5.1 The lse's $\hat\beta_0$, $\hat\beta_1$ provide point estimates for $\beta_0$, $\beta_1$, and will vary from sample to sample. To assess their accuracy, we ought to investigate the sampling distribution of $(\hat\beta_0, \hat\beta_1)$.

2.5.2 Under the normal error assumption $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. $N(0, \sigma^2)$, we have
\[
s^2 = MSE = SSE/(n-2) \ \text{(unbiased estimate of } \sigma^2\text{)}, \qquad \frac{(n-2)s^2}{\sigma^2} \sim \chi^2_{n-2},
\]
\[
\hat\beta_0 \sim N\!\left(\beta_0,\ \sigma^2\Big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{xx}}\Big)\right), \qquad
\hat\beta_1 \sim N\!\left(\beta_1,\ \tfrac{\sigma^2}{S_{xx}}\right), \qquad
\mathrm{Cov}\big(\hat\beta_0, \hat\beta_1\big) = -\frac{\bar x\,\sigma^2}{S_{xx}}.
\]
Proof: (postponed to Chapter 4).

2.5.3 In general, any linear combination $\theta = c_0\beta_0 + c_1\beta_1$, for known constants $c_0$, $c_1$, can be estimated unbiasedly by $\hat\theta = c_0\hat\beta_0 + c_1\hat\beta_1$. Estimating the standard deviation of $\hat\theta$ by the standard error
\[
\mathrm{s.e.}(\hat\theta)
= s\,\sqrt{c_0^2\Big(\tfrac{1}{n} + \tfrac{\bar x^2}{S_{xx}}\Big) + \tfrac{c_1^2}{S_{xx}} - 2c_0 c_1 \tfrac{\bar x}{S_{xx}}}
= s\,\sqrt{\tfrac{c_0^2}{n} + \tfrac{(c_0\bar x - c_1)^2}{S_{xx}}},
\]
we have
\[
\frac{\hat\theta - \theta}{\mathrm{s.e.}(\hat\theta)} \sim t_{n-2}.
\]
The above result enables us to calculate confidence intervals for $\theta$, or to test the hypothesis that $\theta$ equals some specified value $\theta_0$.
Proof: (postponed to Chapter 4).
2.5.4 For example, a $100(1-\alpha)\%$ confidence interval for $\theta$ is
\[
\hat\theta \pm t_{n-2}^{(\alpha/2)}\,\mathrm{s.e.}(\hat\theta),
\]
where $t_m^{(\alpha)}$ denotes the $\alpha$th upper quantile of the $t_m$ distribution.
• To test whether $\theta = \theta_0$ at significance level $\alpha$, we can do a t test:
1. calculate the test statistic
\[
T = \frac{\hat\theta - \theta_0}{\mathrm{s.e.}(\hat\theta)},
\]
2. reject the hypothesis iff $|T| > t_{n-2}^{(\alpha/2)}$.
Note: the above rejection criterion is equivalent to checking whether $\theta_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\theta$.

2.5.5 Putting $c_0 = 1$, $c_1 = 0$, we have $\theta = \beta_0$, $\hat\theta = \hat\beta_0$, and
\[
\frac{\hat\beta_0 - \beta_0}{s\sqrt{1/n + \bar x^2/S_{xx}}} \sim t_{n-2}.
\]
Putting $c_0 = 0$, $c_1 = 1$, we have $\theta = \beta_1$, $\hat\theta = \hat\beta_1$, and
\[
\frac{\hat\beta_1 - \beta_1}{s/\sqrt{S_{xx}}} \sim t_{n-2}.
\]

2.5.6 For example,
• a $100(1-\alpha)\%$ confidence interval for the slope $\beta_1$ is
\[
\hat\beta_1 \pm t_{n-2}^{(\alpha/2)}\,\frac{s}{\sqrt{S_{xx}}}.
\]
• To test $H_0: \beta_1 = b$ against $H_1: \beta_1$ unrestricted, at significance level $\alpha$, calculate
\[
T = \frac{\hat\beta_1 - b}{s/\sqrt{S_{xx}}}
\]
and reject $H_0$ iff $|T| > t_{n-2}^{(\alpha/2)}$. In particular, if $b$ is chosen to be 0, the above t test is equivalent to the F test introduced in 2.3.9, since $T^2 = F$ and $\big\{t_{n-2}^{(\alpha/2)}\big\}^2 = F_{1,n-2}^{(\alpha)}$.
2.5.7 Example 1.1.1 (cont'd). Recall that $\hat\beta_1 = 0.0526$ and
\[
s = \sqrt{MSE} = \sqrt{0.11158} = 0.3340.
\]
• Suppose a 95% confidence interval for $\beta_1$ is required. The 2.5% upper quantile of $t_{22}$ is $t_{22}^{(0.025)} = 2.074$. Thus the interval is
\[
0.0526 \pm 2.074 \times \frac{0.3340}{\sqrt{4139.833}} = [0.0418,\ 0.0634].
\]
• Suppose we wish to know whether $\beta_1 = 0$ (i.e. no linear relationship between age and cholesterol level). The above interval misses the value 0, which suggests that $\beta_1$ is significantly different from 0 at the 5% level, and hence that there is a significant linear relationship between age and cholesterol level. Note: we arrived at the same conclusion in 2.3.10.

Alternatively, we can perform a formal size-5% test of the hypothesis that $\beta_1 = 0$. The test statistic is
\[
T = \frac{\sqrt{4139.833}\,(0.0526 - 0)}{0.3340} = 10.133.
\]
The critical value is $t_{22}^{(0.025)} = 2.074$. Here $|T| > t_{22}^{(0.025)}$ and we should reject the hypothesis.
Again, the conclusion is consistent with what we found earlier.
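The interval and t statistic of 2.5.7 follow from a handful of quoted scalars. A minimal sketch (Python with scipy, not part of these notes):

```python
from math import sqrt
from scipy.stats import t

# Quantities quoted in Example 1.1.1 (see 2.2.6 and 2.3.10).
n, Sxx, b1, s = 24, 4139.833, 0.0526, sqrt(0.11158)

tcrit = t.ppf(0.975, n - 2)          # 2.5% upper quantile of t_{22}
se_b1 = s / sqrt(Sxx)                # standard error of the slope estimate
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)

T = b1 / se_b1                       # t statistic for H0: beta_1 = 0
print(ci, T)                         # approx (0.0418, 0.0634) and 10.13
```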
2.6 Prediction

2.6.1 Let $Y_1^*, \ldots, Y_m^*$ be independent responses of m future (not yet observed) items whose X values are equal to $x_1^*, \ldots, x_m^*$ respectively. Suppose we wish to predict $l = \sum_{j=1}^{m} a_j Y_j^*$, for some known constants $a_1, \ldots, a_m$. Special cases of interest:
(i) if $m = 1$, $a_1 = 1$, $x_1^* = x^*$, then $l = Y_1^*$, a single future response at $X = x^*$;
(ii) if $x_j^* = x^*$ and $a_j = 1/m$ for all j, then $l = \sum_{j=1}^{m} Y_j^*/m$, the sample mean of m independent future responses, all observed at the same X value $x^*$;
(iii) if $m = 2$, $a_1 = 1$ and $a_2 = -1$, then $l = Y_1^* - Y_2^*$, the difference between two independent future responses, observed at $X = x_1^*$ and $x_2^*$ respectively.

2.6.2 Based on the SLR model, the future responses $Y_1^*, \ldots, Y_m^*$ are independent, and $Y_j^* \sim N(\beta_0 + \beta_1 x_j^*,\ \sigma^2)$. Hence
\[
l \sim N\!\left(\sum_j a_j(\beta_0 + \beta_1 x_j^*),\ \sigma^2 \sum_j a_j^2\right).
\]
It is natural to predict $l$ by
\[
\hat l = \sum_j a_j(\hat\beta_0 + \hat\beta_1 x_j^*) = \hat\beta_0 \sum_j a_j + \hat\beta_1 \sum_j a_j x_j^*.
\]

2.6.3 A $100(1-\alpha)\%$ prediction interval for $l$ is defined as a (random) interval $[A, B]$ such that $P(A \leq l \leq B) = 1 - \alpha$. To find A, B, note first that $\hat l$ and $l$ are independent (since $\hat l$ is calculated from the observed sample data but $l$ stems from future observations) and
\[
\hat l - l \sim N\!\left(0,\ \sigma^2\left[\frac{1}{n}\Big(\sum_j a_j\Big)^2 + \frac{1}{S_{xx}}\Big(\sum_j a_j(x_j^* - \bar x)\Big)^2 + \sum_j a_j^2\right]\right).
\]
Estimating $\sigma$ by $s$, we have
\[
\mathrm{s.e.}(\hat l - l) = s\,\sqrt{\frac{1}{n}\Big(\sum_j a_j\Big)^2 + \frac{1}{S_{xx}}\Big(\sum_j a_j(x_j^* - \bar x)\Big)^2 + \sum_j a_j^2}.
\]
A $100(1-\alpha)\%$ prediction interval for $l$ is
\[
\hat l \pm t_{n-2}^{(\alpha/2)}\,\mathrm{s.e.}(\hat l - l).
\]

2.6.4 For the special cases described in 2.6.1, the prediction interval reduces to
\[
\text{(i)}\quad \hat\beta_0 + \hat\beta_1 x^* \ \pm\ t_{n-2}^{(\alpha/2)}\, s\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}} + 1},
\]
\[
\text{(ii)}\quad \hat\beta_0 + \hat\beta_1 x^* \ \pm\ t_{n-2}^{(\alpha/2)}\, s\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}} + \frac{1}{m}},
\]
\[
\text{(iii)}\quad \hat\beta_1 (x_1^* - x_2^*) \ \pm\ t_{n-2}^{(\alpha/2)}\, s\,\sqrt{\frac{(x_1^* - x_2^*)^2}{S_{xx}} + 2}.
\]
2.6.5 Example 1.1.1 (cont'd). Suppose we wish to predict the difference in cholesterol levels between a patient aged 60 and a patient aged 30. Our target is then $l = Y_1^* - Y_2^*$, where $Y_1^*$ and $Y_2^*$ are the cholesterol levels of patients aged 60 and 30 respectively. This corresponds to the general form by putting $a_1 = 1$, $a_2 = -1$, $x_1^* = 60$ and $x_2^* = 30$. The predicted value is $\hat l = 0.0526(60 - 30) = 1.578$. The 95% prediction interval is therefore
\[
1.578 \pm 2.074 \times 0.3340\,\sqrt{\frac{30^2}{4139.833} + 2} = [0.546,\ 2.610].
\]

2.6.6 Example 1.1.1 (cont'd). Suppose we wish to predict the cholesterol level of a future patient whose age is 60. Then the predicted value is $\hat\beta_0 + 60\hat\beta_1 = 4.436$. The 95% prediction interval is
\[
4.436 \pm 2.074 \times 0.3340\,\sqrt{\frac{1}{24} + \frac{(60 - 39.417)^2}{4139.833} + 1} = [3.695,\ 5.177].
\]
Note: the same predicted value, 4.436, is also used for estimating the mean cholesterol level, $\beta_0 + 60\beta_1$, of patients whose age is 60. However, the 95% confidence interval for $\beta_0 + 60\beta_1$ is shorter:
\[
4.436 \pm 2.074 \times 0.3340\,\sqrt{\frac{1}{24} + \frac{(60 - 39.417)^2}{4139.833}} = [4.173,\ 4.699].
\]
Fig. 2.6.1 compares the 95% prediction intervals at various ages for the cholesterol level of a future patient with the 95% confidence intervals for the corresponding mean cholesterol levels. Clearly, prediction of the response of a particular future patient is less precise than estimation of the mean response of all patients of the same age. This is because the prediction must also account for the random error in the future response, whereas the mean response involves no such extra variability.
Figure 2.6.1: Example 1.1.1 — 95% confidence intervals for mean responses vs 95% prediction intervals for one future response.
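For a rough check of 2.6.6, the half-widths of the confidence interval for the mean response and of the prediction interval for one future response at age 60 differ only by the extra "+1" term under the square root. A small sketch (Python with scipy, not part of these notes), assuming the summary statistics quoted earlier:

```python
from math import sqrt
from scipy.stats import t

# Quantities from Example 1.1.1 (2.2.6, 2.3.10): n, xbar, Sxx, b0, b1, s.
n, xbar, Sxx = 24, 39.417, 4139.833
b0, b1, s = 1.280, 0.0526, sqrt(0.11158)

x_star = 60.0
yhat = b0 + b1 * x_star
tcrit = t.ppf(0.975, n - 2)

half_ci = tcrit * s * sqrt(1/n + (x_star - xbar)**2 / Sxx)      # mean response
half_pi = tcrit * s * sqrt(1/n + (x_star - xbar)**2 / Sxx + 1)  # one future response

print((yhat - half_ci, yhat + half_ci))  # approx (4.173, 4.699)
print((yhat - half_pi, yhat + half_pi))  # approx (3.695, 5.177)
```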
3 General Linear Model (GLM)

3.1 Model formulation
3.1.1 Matrix notations:
• A $p \times q$ matrix
\[
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1q} \\
a_{21} & a_{22} & \cdots & a_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
a_{p1} & a_{p2} & \cdots & a_{pq}
\end{bmatrix}.
\]
• A column vector is a $p \times 1$ matrix. A row vector is a $1 \times q$ matrix. Denote by $I_p$ the $p \times p$ identity matrix, by $\mathbf{0}_p$ the $p \times 1$ zero vector, and by $\mathbf{0}_{p \times q}$ the $p \times q$ zero matrix.
• The transpose of a $p \times q$ matrix $A$ is the $q \times p$ matrix
\[
A^\top = \begin{bmatrix}
a_{11} & a_{21} & \cdots & a_{p1} \\
a_{12} & a_{22} & \cdots & a_{p2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1q} & a_{2q} & \cdots & a_{pq}
\end{bmatrix}.
\]
Note: the transpose of a column vector is a row vector, and vice versa.
• A $p \times p$ matrix is a square matrix. A square matrix $A$ is symmetric if $A^\top = A$.
• If a $p \times p$ matrix $A$ is nonsingular, then there exists a $p \times p$ inverse matrix $A^{-1}$ such that $AA^{-1} = A^{-1}A = I_p$.

3.1.2 Model formulation: A general linear model has the matrix form $Y = X\beta + \varepsilon$, where
$Y$: $n \times 1$ random vector of responses (observable),
$X$: $n \times d$ design matrix of rank $d \leq n$ (observable),
$\beta$: $d \times 1$ vector of regression coefficients (unobservable),
$\varepsilon$: $n \times 1$ vector of random errors with $E[\varepsilon] = \mathbf{0}_n$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ (unobservable).
Note: $X$ has rank $d$ $\iff$ the columns of $X$ are linearly independent $\iff$ $(X^\top X)^{-1}$ exists.
3.1.3 Multiple linear regression model — an important special case: Consider.
Y: response variable; $X_1, \ldots, X_p$: p explanatory variables.
A sample of n observations is observed, the $i$th observation having the form $(Y_i, x_{i1}, \ldots, x_{ip})$:

      Y        $X_1$      $X_2$      ...   $X_p$
1     $Y_1$    $x_{11}$   $x_{12}$   ...   $x_{1p}$
2     $Y_2$    $x_{21}$   $x_{22}$   ...   $x_{2p}$
...   ...      ...        ...              ...
n     $Y_n$    $x_{n1}$   $x_{n2}$   ...   $x_{np}$

A multiple linear regression model (MLR) assumes:
- Y is random, $X_1, \ldots, X_p$ are nonrandom,
- $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$, where $\varepsilon$ is a random error with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$,
- $x_{1j}, \ldots, x_{nj}$ are known values of $X_j$ $(j = 1, \ldots, p)$,
- $Y_1, \ldots, Y_n$ are independent copies of Y, where $Y_i$ is observed at $(X_1, \ldots, X_p) = (x_{i1}, \ldots, x_{ip})$,
- $\beta_0, \beta_1, \ldots, \beta_p, \sigma$ are unknown parameters, where $\beta_j$ is the coefficient associated with the effect of $X_j$ on Y $(j = 1, \ldots, p)$.
Note: equivalently, $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$, where $\varepsilon_1, \ldots, \varepsilon_n$ are independent random errors with common mean 0 and common variance $\sigma^2$.

MLR has the form of a general linear model with $d = p + 1$,
\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \qquad
X = \begin{bmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{bmatrix}, \qquad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}.
\]
• The case $p = 1$ reduces to SLR.
• MLR is considered a linear model because $E[Y]$ is linear in the regression coefficients $\beta_0, \ldots, \beta_p$, not because it is linear in each explanatory variable. Examples:
Linear models: (i) $Y = \beta_0 + \beta_1 X + \varepsilon$ (SLR); (ii) $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$, explanatory variables: $X_1 = X$, $X_2 = X^2$; (iii) $Y = \beta_0 + \beta_1 \sin(2 Z_1) + \beta_2 \ln Z_2 + \varepsilon$, explanatory variables: $X_1 = \sin(2 Z_1)$, $X_2 = \ln Z_2$.
NOT linear models: (i) $Y = \exp(\beta_0 + \beta_1 X + \varepsilon)$; (ii) $Y = 1/(\beta_0 + \beta_1 X + \beta_2 X^2) + \varepsilon$.

3.1.4 Hat matrix
Assume the columns of X are linearly independent.
Definition. The $n \times n$ matrix $H = X(X^\top X)^{-1}X^\top$ is commonly known as the hat matrix. The matrix H provides important information about the influence of X on the linear model fit.
• Some properties of the hat matrix H:
\[
HX = X, \quad H^2 = H, \quad H^\top = H, \quad (I_n - H)^2 = I_n - H, \quad (I_n - H)^\top = I_n - H.
\]
Note: H is the projection matrix onto the column space of X, i.e. the vector space $V = \{X\beta : \beta \in \mathbb{R}^d\}$, while $I_n - H$ is the projection matrix onto its orthogonal complement $V^\perp$.
3.2 Model fitting by least squares method
3.2.1 Multivariate random vectors
Definition. Let $Y_1, \ldots, Y_n, W_{11}, W_{12}, \ldots, W_{rs}$ be a collection of random variables. Write
\[
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}
\quad\text{and}\quad
W = \begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1s} \\
W_{21} & W_{22} & \cdots & W_{2s} \\
\vdots & \vdots & & \vdots \\
W_{r1} & W_{r2} & \cdots & W_{rs}
\end{bmatrix}
\]
to represent a random $n \times 1$ vector and a random $r \times s$ matrix, respectively.
(i) The means of Y and W are
\[
E[Y] = \begin{bmatrix} E[Y_1] \\ \vdots \\ E[Y_n] \end{bmatrix}
\quad\text{and}\quad
E[W] = \begin{bmatrix}
E[W_{11}] & E[W_{12}] & \cdots & E[W_{1s}] \\
E[W_{21}] & E[W_{22}] & \cdots & E[W_{2s}] \\
\vdots & \vdots & & \vdots \\
E[W_{r1}] & E[W_{r2}] & \cdots & E[W_{rs}]
\end{bmatrix};
\]
(ii) the variance–covariance matrix of Y is $\mathrm{Var}(Y) = \Sigma$, an $n \times n$ matrix with $(i, j)$th entry
\[
\Sigma_{ij} = \begin{cases} \mathrm{Var}(Y_i), & i = j, \\ \mathrm{Cov}(Y_i, Y_j), & i \neq j. \end{cases}
\]
Note: $\Sigma = E\big[(Y - E[Y])(Y - E[Y])^\top\big]$.
• For random $r \times s$ matrices V, W, we have $E[V + W] = E[V] + E[W]$.
• For a random $r \times s$ matrix W, a constant $q \times r$ matrix A and a constant $s \times p$ matrix B, we have $E[AWB] = A\,E[W]\,B$, which is a $q \times p$ matrix.
• For any constant matrix A and constant vector u,
(i) $E[AY + u] = A\,E[Y] + u$,
(ii) $\mathrm{Var}(AY + u) = A\,\mathrm{Var}(Y)\,A^\top$,
(iii) if A is a square constant matrix, then $E[Y^\top A Y] = \mathrm{tr}\big(A\,\mathrm{Var}(Y)\big) + E[Y]^\top A\,E[Y]$,
• If X and Y are uncorrelated random vectors, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

3.2.2 Parameter estimation
• Least squares method — find $\beta = [\beta_1, \ldots, \beta_d]^\top$ to minimise
\[
\|Y - X\beta\|^2 = (Y - X\beta)^\top (Y - X\beta).
\]
• The lse $\hat\beta = [\hat\beta_1, \ldots, \hat\beta_d]^\top$ satisfies
\[
\hat\beta = (X^\top X)^{-1} X^\top Y.
\]
Proof: Recall $H = X(X^\top X)^{-1}X^\top$. Consider
\[
\|Y - X\beta\|^2 = \|Y - HY + HY - X\beta\|^2
= \|Y - HY\|^2 + \|HY - X\beta\|^2 + 2 Y^\top (I_n - H)(HY - X\beta)
= \|Y - HY\|^2 + \|HY - X\beta\|^2 \geq \|Y - HY\|^2
\]
for all $\beta$. Equality is achieved at
\[
X\beta = HY = X(X^\top X)^{-1}X^\top Y
\iff X^\top X\beta = X^\top H Y = X^\top Y
\iff \beta = (X^\top X)^{-1}X^\top Y. \qquad\square
\]

As with the SLR model, we may compute (recall that $H = X(X^\top X)^{-1}X^\top$):
• fitted or predicted values $\hat Y = X\hat\beta = HY$;
• residuals $\hat\varepsilon = Y - \hat Y = (I_n - H)Y$.

The following results apply to any general linear model:
(i) $\hat\beta$ and $s^2 = \dfrac{\hat\varepsilon^\top \hat\varepsilon}{n - d}$ are unbiased estimators of $\beta$ and $\sigma^2$, respectively.
(ii) $\mathrm{Var}(\hat\beta) = \sigma^2 (X^\top X)^{-1}$.
(iii) (The Gauss–Markov Theorem) Let $a = [a_1, \ldots, a_d]^\top$ be a known constant vector. Then $a^\top \hat\beta$ is the best linear unbiased estimate (BLUE) of $a^\top \beta$.
Proof: For (i), note that
\[
\hat\beta = (X^\top X)^{-1}X^\top Y = (X^\top X)^{-1}X^\top (X\beta + \varepsilon) = \beta + (X^\top X)^{-1}X^\top \varepsilon,
\]
so that $E[\hat\beta] = \beta + (X^\top X)^{-1}X^\top E[\varepsilon] = \beta$. On the other hand, since $\mathrm{tr}(H) = \mathrm{tr}\big((X^\top X)^{-1}X^\top X\big) = \mathrm{tr}(I_d) = d$, we have
\[
E[\hat\varepsilon^\top \hat\varepsilon] = E\big[Y^\top (I_n - H)Y\big]
= \mathrm{tr}\big((I_n - H)(\sigma^2 I_n)\big) + (X\beta)^\top (I_n - H)(X\beta) = \sigma^2 (n - d).
\]
For (ii), noting that $\hat\beta - \beta = (X^\top X)^{-1}X^\top \varepsilon$, we have
\[
\mathrm{Var}(\hat\beta) = E\big[(\hat\beta - \beta)(\hat\beta - \beta)^\top\big]
= (X^\top X)^{-1}X^\top (\sigma^2 I_n) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.
\]
For (iii), noting that $\hat\beta = (X^\top X)^{-1}X^\top Y$ and $E[\hat\beta] = \beta$, $a^\top\hat\beta = a^\top(X^\top X)^{-1}X^\top Y$ is clearly a linear unbiased estimate of $a^\top\beta$. For any linear unbiased estimate $c^\top Y$ of $a^\top\beta$, we have $E[c^\top Y] = c^\top X\beta = a^\top\beta$ for all $\beta$, so that $X^\top c = a$. Then
\[
\mathrm{Var}(c^\top Y) - \mathrm{Var}(a^\top\hat\beta)
= \sigma^2 c^\top c - \sigma^2 a^\top (X^\top X)^{-1} a
= \sigma^2 c^\top (I_n - H) c
= \sigma^2 \|(I_n - H)c\|^2 \geq 0. \qquad\square
\]
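In matrix form the least squares computation of 3.2.2 is a one-liner. The sketch below (Python with numpy, not part of these notes) fits a general linear model to hypothetical simulated data; the coefficient values and noise level are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 observations, p = 2 explanatory variables.
n, p = 50, 2
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), Z])          # n x d design matrix, d = p + 1
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least squares estimate: solve the normal equations X'X beta = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
fitted = H @ Y                                # same as X @ beta_hat
resid = Y - fitted
s2 = resid @ resid / (n - X.shape[1])         # unbiased estimate of sigma^2

print(beta_hat, s2)
```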
3.2.3 Example 3.2.1

ozone  radiation  temperature  wind      ozone  radiation  temperature  wind
3.45   190.00     67.00         7.40     3.30   118.00     72.00         8.00
2.29   149.00     74.00        12.60     2.62   313.00     62.00        11.50
2.84   299.00     65.00         8.60     2.67    99.00     59.00        13.80
2.00    19.00     61.00        20.10     2.52   256.00     69.00         9.70
2.22   290.00     66.00         9.20     2.41   274.00     68.00        10.90
2.62    65.00     58.00        13.20     2.41   334.00     64.00        11.50
3.24   307.00     66.00        12.00     1.82    78.00     57.00        18.40
3.11   322.00     68.00        11.50     2.22    44.00     62.00         9.70
1.00     8.00     59.00         9.70     2.22   320.00     73.00        16.60
1.59    25.00     61.00         9.70     3.17    92.00     61.00        12.00
2.84    13.00     67.00        12.00     3.56   252.00     81.00        14.90
4.86   223.00     79.00         5.70     3.33   279.00     76.00         7.40
3.07   127.00     82.00         9.70     4.14   291.00     90.00        13.80
3.39   323.00     87.00        11.50     2.84   148.00     82.00         8.00
2.76   191.00     77.00        14.90     3.33   284.00     72.00        20.70
The above table shows a dataset taken from an environmental study that measured the following four variables for 30 consecutive days: ozone (Y) — surface concentration of ozone in New York, in parts per million; radiation ($X_1$) — solar radiation; temperature ($X_2$) — observed temperature, in degrees Fahrenheit; wind ($X_3$) — wind speed, in miles per hour. A scatterplot matrix of the data is given below, providing visual information about pairwise relationships among the 4 variables.
[Figure: scatterplot matrix of ozone, radiation, temperature and wind.]
To study the variation of ozone concentration (Y), we fit an MLR model to the data using 3 explanatory variables: radiation ($X_1$), temperature ($X_2$) and wind ($X_3$). The $30 \times 4$ design matrix X equals
\[
X = \begin{bmatrix}
1 & 190.00 & 67.00 & 7.40 \\
1 & 149.00 & 74.00 & 12.60 \\
1 & 299.00 & 65.00 & 8.60 \\
\vdots & \vdots & \vdots & \vdots
\end{bmatrix}.
\]
The LSE is calculated as
\[
\hat\beta = (X^\top X)^{-1}X^\top Y =
\begin{bmatrix}
0.295271027 \\ 0.001305816 \\ 0.045605744 \\ -0.027843496
\end{bmatrix}.
\]
According to the estimates, it seems that ozone concentration increases as radiation and temperature increase, but is reduced by wind.
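Assuming the ozone data are transcribed correctly from the table in 3.2.3, the LSE can be reproduced numerically. A sketch in Python with numpy (not part of these notes):

```python
import numpy as np

# Ozone data of Example 3.2.1 (30 days), transcribed from the table above.
ozone = [3.45, 2.29, 2.84, 2.00, 2.22, 2.62, 3.24, 3.11, 1.00, 1.59, 2.84, 4.86, 3.07, 3.39, 2.76,
         3.30, 2.62, 2.67, 2.52, 2.41, 2.41, 1.82, 2.22, 2.22, 3.17, 3.56, 3.33, 4.14, 2.84, 3.33]
radiation = [190, 149, 299, 19, 290, 65, 307, 322, 8, 25, 13, 223, 127, 323, 191,
             118, 313, 99, 256, 274, 334, 78, 44, 320, 92, 252, 279, 291, 148, 284]
temperature = [67, 74, 65, 61, 66, 58, 66, 68, 59, 61, 67, 79, 82, 87, 77,
               72, 62, 59, 69, 68, 64, 57, 62, 73, 61, 81, 76, 90, 82, 72]
wind = [7.4, 12.6, 8.6, 20.1, 9.2, 13.2, 12.0, 11.5, 9.7, 9.7, 12.0, 5.7, 9.7, 11.5, 14.9,
        8.0, 11.5, 13.8, 9.7, 10.9, 11.5, 18.4, 9.7, 16.6, 12.0, 14.9, 7.4, 13.8, 8.0, 20.7]

Y = np.array(ozone)
X = np.column_stack([np.ones(len(Y)), radiation, temperature, wind])

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)   # should be close to (0.2953, 0.0013, 0.0456, -0.0278)
```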
3.2.4 Multivariate normal distributions
• Definition. A random vector $Y = [Y_1, \ldots, Y_n]^\top$ has a multivariate normal distribution if there exist $\mu \in \mathbb{R}^n$ and an $n \times n$ positive semidefinite symmetric matrix $\Sigma$ such that its moment generating function $m_Y(\cdot)$ satisfies
\[
m_Y(t) = E\big[e^{t^\top Y}\big] = \exp\!\Big(t^\top \mu + \tfrac{1}{2}\, t^\top \Sigma\, t\Big) \quad \text{for all } t \in \mathbb{R}^n.
\]
Write $Y \sim N_n(\mu, \Sigma)$. Note: if $n = 1$, $N_1(\mu, \Sigma) \equiv N(\mu_1, \Sigma_{11})$.
Note: a square $p \times p$ matrix A is positive semidefinite if $u^\top A u \geq 0$ for any $p \times 1$ vector u.
• Basic properties of $Y \sim N_n(\mu, \Sigma)$:
(i) $E[Y] = \mu$.
(ii) $\mathrm{Var}(Y) = \Sigma$.
(iii) $AY + u \sim N_k(A\mu + u,\ A\Sigma A^\top)$ for any constant $k \times n$ matrix A and any constant $k \times 1$ vector u. Special case: $a^\top Y \sim N(a^\top \mu,\ a^\top \Sigma\, a)$ for any constant vector $a \in \mathbb{R}^n$.
• If $Y_1, \ldots, Y_n$ are independent and $Y_i \sim N(\mu_i, \sigma_i^2)$, then the random vector
\[
Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix} \sim N_n(\mu, \Sigma),
\quad\text{where}\quad
\mu = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_n^2
\end{bmatrix}.
\]
• Suppose
\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}
\sim N_{p+q}\!\left(
\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
\right),
\]
where $Y_1$ and $Y_2$ are $p \times 1$ and $q \times 1$ random vectors. Then
\[
Y_1 \sim N_p(\mu_1, \Sigma_{11}), \qquad Y_2 \sim N_q(\mu_2, \Sigma_{22});
\]
\[
\Sigma_{12} = E\big[(Y_1 - \mu_1)(Y_2 - \mu_2)^\top\big] = \Sigma_{21}^\top,
\]
and
\[
Y_1,\ Y_2 \ \text{independent} \iff \Sigma_{12} = \mathbf{0}_{p \times q}.
\]
3.2.5 $\chi^2$, F and t distributions
• Definition. A random variable Y is chi-squared with m degrees of freedom (df) if its moment generating function is $m_Y(t) = E[e^{tY}] = (1 - 2t)^{-m/2}$. We write $Y \sim \chi^2_m$. Note: $E[Y] = m$ and $\mathrm{Var}(Y) = 2m$.
• If m is a positive integer, then $Y \sim \chi^2_m$ implies $Y \overset{d}{=} Y_1^2 + Y_2^2 + \cdots + Y_m^2$ for $Y_1, \ldots, Y_m$ i.i.d. $N(0, 1)$.
• If $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
\[
Y = \frac{Y_1/m}{Y_2/n}
\]
has an F distribution with m and n degrees of freedom. We write $Y \sim F_{m,n}$.
• Definition. If $Z \sim N(0, 1)$ and $X \sim \chi^2_k$ are independent, then $Y = \dfrac{Z}{\sqrt{X/k}}$ has a Student's t distribution with k df. We write $Y \sim t_k$.

3.2.6 Cochran's Theorem
Let $Y \sim N_n(\mathbf{0}_n, \sigma^2 I_n)$. Suppose that $A_1, \ldots, A_m$ are symmetric $n \times n$ matrices such that
(i) $\sum_{i=1}^{m} A_i = I_n$, and (ii) $A_i^2 = A_i$.
Then $Y^\top A_1 Y, Y^\top A_2 Y, \ldots, Y^\top A_m Y$ are independent random variables, and
\[
Y^\top A_i Y \sim \sigma^2 \chi^2_{r_i}, \qquad i = 1, \ldots, m,
\]
where $r_i = \mathrm{tr}(A_i)$.

3.2.7 Likelihood
Based on random responses $Y \sim N_n(X\beta, \sigma^2 I_n)$, the likelihood function is given by
\[
\mathrm{lik}(\beta, \sigma^2) = \mathrm{lik}(\beta_1, \ldots, \beta_d, \sigma^2)
= (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{\|Y - X\beta\|^2}{2\sigma^2}\right).
\]

3.2.8 Maximum likelihood estimator (MLE)
Maximising $\mathrm{lik}(\beta, \sigma^2)$ w.r.t. $\beta, \sigma^2$ is equivalent to
minimising $\|Y - X\beta\|^2$ w.r.t. $\beta$, followed by
maximising $-\dfrac{n}{2}\ln\sigma^2 - \dfrac{(n-d)s^2}{2\sigma^2}$ w.r.t. $\sigma^2$, where $s^2 = \|Y - X\hat\beta\|^2/(n-d)$.
Thus, (MLE of $\beta$) = (LSE of $\beta$) = $\hat\beta$, and the MLE of $\sigma^2$ is $\Big(1 - \dfrac{d}{n}\Big)s^2$.
3.3 Comparing Two Nested Models
3.3.1 Assume further that $\varepsilon \sim N_n(\mathbf{0}_n, \sigma^2 I_n)$, so that $Y \sim N_n(X\beta, \sigma^2 I_n)$.
Given $\begin{bmatrix} A \\ B \end{bmatrix}$: $r \times d$ of rank $r \leq d$, $A$: $q \times d$ of rank $q \leq r$, $a \in \mathbb{R}^q$, $b \in \mathbb{R}^{r-q}$, consider
Full model — general linear model under constraints $A\beta = a$,
Reduced model — general linear model under constraints $A\beta = a$ and $B\beta = b$.
Objective: to investigate whether $\beta$ satisfies the further constraints $B\beta = b$, given that it satisfies $A\beta = a$. Formally, we would like to test $H_0$: $A\beta = a$ and $B\beta = b$ (reduced model) against $H_1$: $A\beta = a$ (full model).
Note: a full model with $\beta$ unrestricted corresponds to the case $q = 0$.
3.3.2 An important special case of the problem is to investigate whether a subset of $(p-k)$ explanatory variables, $X_{k+1}, \ldots, X_p$, can be dropped from an MLR model with p explanatory variables $X_1, \ldots, X_p$. Thus, we wish to test $H_0$: $\beta_{k+1} = \cdots = \beta_p = 0$ against $H_1$: $\beta_{k+1}, \ldots, \beta_p$ unrestricted, with
Full model — MLR model based on $X_1, \ldots, X_k, X_{k+1}, \ldots, X_p$,
Reduced model — MLR model based on $X_1, \ldots, X_k$.
Under our general setup, the above corresponds to the setting
\[
q = 0, \qquad r = p - k, \qquad B = [\,\mathbf{0}_{(p-k)\times(k+1)},\ I_{p-k}\,], \qquad b = \mathbf{0}_{p-k},
\]
so that $\beta$ is unrestricted under the full model, and $\beta_{k+1} = \cdots = \beta_p = 0$ under the reduced model.
Consider Example 3.2.1. The question may be: if we have built an MLR model relating ozone concentration (Y) to radiation ($X_1$), temperature ($X_2$) and wind speed ($X_3$), would it be a good idea to simplify the model by dropping the variables radiation ($X_1$) and wind speed ($X_3$)?

3.3.3 Useful quantities:
(reduced): constraints $A\beta = a$ and $B\beta = b$; constrained lse $\tilde\beta$; residual s.s. $\widetilde{SSE} = \|Y - X\tilde\beta\|^2$ (fitting $\tilde\beta$).
(full): constraints $A\beta = a$; constrained lse $\hat\beta$; residual s.s. $\widehat{SSE} = \|Y - X\hat\beta\|^2$ (fitting $\hat\beta$).
Note: writing $\hat\beta_{\mathrm{LS}} = (X^\top X)^{-1}X^\top Y$ for the unrestricted lse, the constrained lse's are
\[
\tilde\beta = \hat\beta_{\mathrm{LS}} + (X^\top X)^{-1}
\begin{bmatrix} A \\ B \end{bmatrix}^{\!\top}
\left(
\begin{bmatrix} A \\ B \end{bmatrix}
(X^\top X)^{-1}
\begin{bmatrix} A \\ B \end{bmatrix}^{\!\top}
\right)^{\!-1}
\left(
\begin{bmatrix} a \\ b \end{bmatrix}
-
\begin{bmatrix} A \\ B \end{bmatrix}\hat\beta_{\mathrm{LS}}
\right),
\]
\[
\hat\beta = \hat\beta_{\mathrm{LS}} + (X^\top X)^{-1} A^\top \big(A(X^\top X)^{-1}A^\top\big)^{-1}\big(a - A\hat\beta_{\mathrm{LS}}\big).
\]
Since the reduced model imposes more constraints than the full model, $\widehat{SSE} \leq \widetilde{SSE}$ always. The size of the difference (extra s.s.) $SSE_{XT} = \widetilde{SSE} - \widehat{SSE}$ measures the "extra" reduction in SSE obtained by removing the constraints $B\beta = b$.

3.3.4 Based on the partition
\[
\sum_{i=1}^{n} Y_i^2 = Y^\top Y = \widehat{SSE} + SSE_{XT} + \big(Y^\top Y - \widetilde{SSE}\big),
\]
ANOVA provides an F test of
\[
H_0: A\beta = a \ \&\ B\beta = b \qquad\text{against}\qquad H_1: A\beta = a.
\]
Define the F test statistic
\[
F = \frac{SSE_{XT}/(r - q)}{\widehat{SSE}/(n - d + q)}.
\]
F test (of size $\alpha$): reject $H_0$ iff the p-value $P(F_{r-q,\,n-d+q} > F)$ is smaller than $\alpha$, or equivalently, iff $F > F_{r-q,\,n-d+q}^{(\alpha)}$.
Note that the F test described in 2.3.9 for SLR is a special case of the above, in which $d = 2$, $q = 0$ and $r = 1$.

3.3.5 Consider an MLR model
\[
Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2).
\]
To assess the effects of the explanatory variables $X_{k+1}, \ldots, X_p$, we wish to test
\[
H_0: \beta_{k+1} = \cdots = \beta_p = 0 \qquad\text{vs}\qquad H_1: \beta \text{ unrestricted.}
\]
This corresponds to setting
full model: $\beta$ unrestricted $(q = 0)$,
reduced model: $[\,\mathbf{0}_{(p-k)\times(k+1)},\ I_{p-k}\,]\beta = [\beta_{k+1}, \ldots, \beta_p]^\top = \mathbf{0}_{p-k}$.
Quoting the general formulae for constrained LSE's, we have $\hat\beta = \hat\beta_{\mathrm{LS}}$ and
\[
\tilde\beta = \hat\beta_{\mathrm{LS}} - (X^\top X)^{-1}
\begin{bmatrix} \mathbf{0}_{(k+1)\times(p-k)} \\ I_{p-k} \end{bmatrix}
\left(
[\,\mathbf{0}_{(p-k)\times(k+1)},\ I_{p-k}\,]
(X^\top X)^{-1}
\begin{bmatrix} \mathbf{0}_{(k+1)\times(p-k)} \\ I_{p-k} \end{bmatrix}
\right)^{\!-1}
[\,\mathbf{0}_{(p-k)\times(k+1)},\ I_{p-k}\,]\,\hat\beta_{\mathrm{LS}}.
\]
Write $X = [\tilde X,\ X_{EXT}]$, where $\tilde X$ has $k + 1$ columns and is the design matrix for the reduced model. Then the general formula for $\tilde\beta$ reduces to
\[
\tilde\beta = \begin{bmatrix} (\tilde X^\top \tilde X)^{-1}\tilde X^\top Y \\ \mathbf{0}_{p-k} \end{bmatrix}.
\]
Note: to confirm the above, write $\tilde\beta = \begin{bmatrix} \tilde\beta_1 \\ \mathbf{0}_{p-k} \end{bmatrix}$ and note that $X\tilde\beta = \tilde X\tilde\beta_1$, so that minimising $\|Y - X\beta\|^2$ subject to $\beta_{k+1} = \cdots = \beta_p = 0$ is the same as minimising $\|Y - \tilde X\beta_1\|^2$ over $\beta_1$, whose solution is $\tilde\beta_1 = (\tilde X^\top \tilde X)^{-1}\tilde X^\top Y$.
Calculate
\[
\widetilde{SSE} = \big\|Y - \tilde X(\tilde X^\top \tilde X)^{-1}\tilde X^\top Y\big\|^2, \qquad
\widehat{SSE} = \big\|Y - X(X^\top X)^{-1}X^\top Y\big\|^2,
\]
\[
SSE_{XT} = \widetilde{SSE} - \widehat{SSE}
\quad\text{and test statistic}\quad
F = \frac{SSE_{XT}/(p - k)}{\widehat{SSE}/(n - p - 1)}.
\]
F test (of size $\alpha$): reject $H_0$ iff the p-value $P(F_{p-k,\,n-p-1} > F)$ is smaller than $\alpha$, or equivalently, iff $F > F_{p-k,\,n-p-1}^{(\alpha)}$.
Conventionally, the above calculations are summarised in an ANOVA table:

Source                s.s.                                          d.f.          m.s.                               F-ratio
Regression fitting    $SSR = S_{yy} - \widetilde{SSE}$              $k$
Extra fitting         $SSE_{XT}$                                    $p - k$       $MSE_{XT} = SSE_{XT}/(p - k)$      $F = MSE_{XT}/MSE$
Residual fitting      $\widehat{SSE}$                               $n - p - 1$   $MSE = \widehat{SSE}/(n - p - 1)$
Total                 $S_{yy} = \sum_{i=1}^{n}(Y_i - \bar Y)^2$     $n - 1$
The special case $k = 0$ corresponds to testing
\[
H_0: \beta_1 = \cdots = \beta_p = 0 \qquad\text{vs}\qquad H_1: \beta \text{ unrestricted.}
\]
Here $\widetilde{SSE} = S_{yy}$ and $SSE_{XT} = S_{yy} - \widehat{SSE}$ = regression s.s., leading to the ANOVA table:

Source                s.s.                                          d.f.          m.s.                               F-ratio
Regression fitting    $SSE_{XT}$                                    $p$           $MSE_{XT} = SSE_{XT}/p$            $F = MSE_{XT}/MSE$
Residual fitting      $\widehat{SSE}$                               $n - p - 1$   $MSE = \widehat{SSE}/(n - p - 1)$
Total                 $S_{yy} = \sum_{i=1}^{n}(Y_i - \bar Y)^2$     $n - 1$
The coefficient of determination for the MLR model is defined as
\[
R^2 = \frac{S_{yy} - \widehat{SSE}}{S_{yy}},
\]
which is analogous to that for the SLR model and has a similar interpretation: large $R^2$ ⇒ fitted MLR is effective in reducing the total variation in Y.

3.3.6 Example 3.2.1 (cont'd)
(i) To assess the significance of the full MLR model (containing 3 explanatory variables: radiation, temperature, wind speed), consider the ANOVA table:

Source      s.s.      d.f.   m.s.     F-ratio
Regression   7.6685   3      2.5562   7.3244
Residual     9.0738   26     0.3490
Total       16.7423   29

- The F-ratio is associated with p-value $= P(F_{3,26} > 7.3244) = 0.001026$, which is very small ⇒ the MLR is highly significant.
- For a size $\alpha$ test, we refer to the $\alpha$th upper quantile of $F_{3,26}$: for example,

$\alpha$                                  5%       1%
Critical value $F_{3,26}^{(\alpha)}$      2.9752   4.6365

In either case, the F-ratio 7.3244 is bigger than the critical value and we should reject $H_0$ at the stated significance level.
- The coefficient of determination is $R^2 = 7.6685/16.7423 = 0.4580$, which shows that the MLR is only moderately effective in explaining the variation in ozone concentration, even though it is highly significant!

(ii) Suppose we have regressed ozone concentration on temperature by SLR. Should we also include the variables radiation and wind speed to improve the SLR model?
                 Explanatory variables               SSE
Reduced model    temperature                         $\widetilde{SSE} = 9.8808$
Full model       temperature, radiation, wind        $\widehat{SSE} = 9.0738$

ANOVA table:

Source                s.s.      d.f.   m.s.     F-ratio
Regression fitting     6.8615   1
Extra fitting          0.8070   2      0.4035   1.1562
Residual fitting       9.0738   26     0.3490
Total                 16.7423   29

The F-ratio is associated with p-value $= P(F_{2,26} > 1.1562) = 0.3303$, which is not small enough to justify inclusion of the variables radiation and wind speed in the model.

Be careful: if we obtain an insignificant result from testing a reduced model against a full model like the above,
- it does NOT necessarily imply that the additional variables $X_{k+1}, \ldots, X_p$ are useless in explaining the variation in Y, BUT
- it implies that the variables $X_{k+1}, \ldots, X_p$ do not contribute much to the model if $X_1, \ldots, X_k$ have already been included.
Thus, if we remove $X_1, \ldots, X_k$ from the model and include only $X_{k+1}, \ldots, X_p$, the latter set of variables might become very useful in explaining the variation in Y despite their insignificance in the previous "reduced vs full" test. For example, regressing ozone only on radiation and wind speed yields the ANOVA table:

Source      s.s.      d.f.   m.s.     F-ratio
Regression   3.7725   2      1.8863   3.9268
Residual    12.9698   27     0.4804
Total       16.7423   29

The p-value $= P(F_{2,27} > 3.9268) = 0.03185$, which is much smaller than the p-value (= 0.3303) obtained above. This shows that although the variables radiation and wind speed are highly insignificant in the presence of the variable temperature, they become quite significant once temperature has been removed from the model.
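The extra-sum-of-squares F test in (ii) uses only the two residual sums of squares. As a quick numerical check (Python with scipy, not part of these notes):

```python
from scipy.stats import f

# Residual sums of squares quoted in 3.3.6 (ii) for the ozone data.
SSE_reduced, df_reduced = 9.8808, 28   # temperature only: n - 2 = 28
SSE_full,    df_full    = 9.0738, 26   # temperature, radiation, wind: n - 4 = 26

SSE_extra = SSE_reduced - SSE_full
F_ratio = (SSE_extra / (df_reduced - df_full)) / (SSE_full / df_full)
p_value = f.sf(F_ratio, df_reduced - df_full, df_full)

print(round(F_ratio, 4), round(p_value, 4))   # approx 1.1562 and 0.3303
```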
3.4 Inference about regression parameters

3.4.1 Useful results under the general linear model (with normal errors):
• $\hat\beta = (X^\top X)^{-1}X^\top Y \sim N_d\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$,
• $s^2 = (n-d)^{-1}\hat\varepsilon^\top\hat\varepsilon \sim (n-d)^{-1}\sigma^2 \chi^2_{n-d}$, unbiased for $\sigma^2$,
• $\hat\beta$ and $s^2$ are independent.
........................................................................................
3.4.2 Let $c = [c_1, \ldots, c_d]^\top$ be a vector of d known constants. Suppose we are interested in the parameter
\[
\theta = \sum_{j=1}^{d} c_j \beta_j = c^\top \beta.
\]
The BLUE of $\theta$ is $\hat\theta = c^\top \hat\beta$. Note that
\[
\hat\theta \sim N\big(\theta,\ \sigma^2 c^\top (X^\top X)^{-1} c\big), \qquad
\mathrm{s.e.}(\hat\theta) = s\,\sqrt{c^\top (X^\top X)^{-1} c}
\]
and
\[
\frac{\hat\theta - \theta}{\mathrm{s.e.}(\hat\theta)} \sim t_{n-d}.
\]
The above results enable us to calculate confidence intervals for $\theta$ or to test the hypothesis that $\theta$ equals a specified value.
Proof: Note that
\[
\hat\theta - \theta = c^\top(\hat\beta - \beta) \sim N\big(0,\ \sigma^2 c^\top (X^\top X)^{-1} c\big),
\]
and is independent of $s^2$. Thus
\[
\frac{\hat\theta - \theta}{\mathrm{s.e.}(\hat\theta)}
= \frac{(\hat\theta - \theta)\big/\big(\sigma\sqrt{c^\top (X^\top X)^{-1} c}\,\big)}{\sqrt{s^2/\sigma^2}}
= \frac{Z}{\sqrt{W/(n-d)}} \sim t_{n-d},
\]
for independent $Z \sim N(0, 1)$ and $W \sim \chi^2_{n-d}$. $\square$
3.4.3 For example, a $100(1-\alpha)\%$ confidence interval for $\theta$ is
\[
\hat\theta \pm t_{n-d}^{(\alpha/2)}\,\mathrm{s.e.}(\hat\theta).
\]
• To test whether $\theta = \theta_0$ at significance level $\alpha$, we can do a t test:
1. calculate the test statistic
\[
T = \frac{\hat\theta - \theta_0}{\mathrm{s.e.}(\hat\theta)},
\]
2. reject the hypothesis iff $|T| > t_{n-d}^{(\alpha/2)}$.
Note: the above rejection criterion is equivalent to checking whether $\theta_0$ falls outside the $100(1-\alpha)\%$ confidence interval for $\theta$.
• Inference about an individual $\beta_j$ can be done by putting $c_j = 1$ and $c_k = 0$ for all $k \neq j$. Testing whether $\beta_j = 0$ using the t test is equivalent to the F test described in 3.3.4, with
reduced model — $[\,\mathbf{0}_{1\times(j-1)},\ 1,\ \mathbf{0}_{1\times(d-j)}\,]\beta = 0$; full model — $\beta$ unrestricted.
Here we are assessing whether $\beta_j$ is significantly different from 0, given that the other $\beta_k$'s are unrestricted. The result should not be confused with that obtained for the significance of $\beta_j$ by testing, for example,
\[
H_0: \beta_j = 0 \ \&\ \beta_{j+1} = 0 \qquad\text{against}\qquad H_1: \beta_{j+1} = 0.
\]
3.4.4 Example 3.2.1 (cont'd) Suppose we want to see whether temperature can be removed from the full model, i.e. to test $H_0: \beta_2 = 0$ against $H_1: \beta_2$ unrestricted. Note that $\hat\beta_2 = 0.04561$ and $s = \sqrt{0.3490} = 0.5908$. Calculate
\[
\mathrm{s.e.}(\hat\beta_2) = s\,\sqrt{[0, 0, 1, 0](X^\top X)^{-1}[0, 0, 1, 0]^\top}
= 0.5908 \times \sqrt{0.0005338541} = 0.01365.
\]
The 95% confidence interval for $\beta_2$ is
\[
[\,0.04561 - 0.01365(2.0555),\ 0.04561 + 0.01365(2.0555)\,] = [\,0.01755,\ 0.07366\,],
\]
where $t_{26}^{(0.025)} = 2.0555$.
Point zero falls outside the above interval and so we reject H0 at the 5% significance level. It seems that including temperature in the model significantly improves our fit.
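A sketch reproducing the confidence interval of 3.4.4 from the quoted scalars (Python with scipy, not part of these notes); the variable names are hypothetical:

```python
from math import sqrt
from scipy.stats import t

# Scalars quoted in 3.4.4: MSE, the quadratic form c'(X'X)^{-1}c, and beta_2_hat.
MSE = 0.3490
c_XtX_inv_c = 0.0005338541      # [0,0,1,0] (X'X)^{-1} [0,0,1,0]'
beta2_hat = 0.04561

se = sqrt(MSE) * sqrt(c_XtX_inv_c)
tcrit = t.ppf(0.975, 26)
print(beta2_hat - tcrit * se, beta2_hat + tcrit * se)   # approx (0.0176, 0.0737)
```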
3.5 Prediction Intervals

3.5.1 Let $Y_1^*, \ldots, Y_m^*$ be independent responses of m future (not yet observed) items such that
\[
Y^* = \begin{bmatrix} Y_1^* \\ \vdots \\ Y_m^* \end{bmatrix}
\sim N_m\big(X^*\beta,\ \sigma^2 I_m\big)
\]
for a given $m \times d$ design matrix $X^*$. Let $a = [a_1, \ldots, a_m]^\top$ be a vector of m known constants. Suppose we are interested in predicting
\[
l = \sum_{j=1}^{m} a_j Y_j^* = a^\top Y^*.
\]
Naturally, we predict $Y^*$ by $X^*\hat\beta$, and $l$ by $\hat l = a^\top X^*\hat\beta$. Following the same arguments as given in 2.6.3, we have
\[
\mathrm{s.e.}(\hat l - l) = s\,\big(a^\top X^*(X^\top X)^{-1}X^{*\top}a + a^\top a\big)^{1/2}.
\]
The $100(1-\alpha)\%$ prediction interval for $l$ is
\[
\hat l \pm t_{n-d}^{(\alpha/2)}\,\mathrm{s.e.}(\hat l - l).
\]
3.5.2 Example 3.2.1 (cont'd). Suppose we wish to predict the difference in ozone concentration between two typical days on which (radiation, temperature, wind) are observed to be (100, 70, 10) and (50, 80, 10) respectively. Here $m = 2$, $a = [1, -1]^\top$ and
\[
X^* = \begin{bmatrix}
1 & 100 & 70 & 10 \\
1 & 50 & 80 & 10
\end{bmatrix}.
\]
Then $a^\top X^* = [0,\ 100 - 50,\ 70 - 80,\ 10 - 10] = [0,\ 50,\ -10,\ 0]$. The predicted value is
\[
\hat l = [0,\ 50,\ -10,\ 0]\,\hat\beta = -0.3908,
\qquad
\mathrm{s.e.}(\hat l - l) = 0.5908\,\sqrt{0.07981 + 2} = 0.8520.
\]
The 95% prediction interval is $-0.3908 \pm 0.8520(2.0555) = [-2.1420,\ 1.3605]$, which lies on both sides of zero. In other words, we cannot predict with confidence which day is more likely to have the higher ozone concentration.
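The prediction-interval formula of 3.5.1 is easy to wrap as a reusable routine. The sketch below (Python with numpy/scipy, not part of these notes) uses a hypothetical toy dataset, and the function name is made up for illustration:

```python
import numpy as np
from scipy.stats import t

def prediction_interval(X, Y, X_star, a, alpha=0.05):
    """100(1-alpha)% prediction interval for l = a' Y*, with Y* ~ N(X* beta, sigma^2 I)."""
    X, Y, X_star, a = map(np.asarray, (X, Y, X_star, a))
    n, d = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    resid = Y - X @ beta_hat
    s = np.sqrt(resid @ resid / (n - d))
    l_hat = a @ X_star @ beta_hat
    se = s * np.sqrt(a @ X_star @ XtX_inv @ X_star.T @ a + a @ a)
    tcrit = t.ppf(1 - alpha / 2, n - d)
    return l_hat - tcrit * se, l_hat + tcrit * se

# Hypothetical usage: predict a single future response (a = [1]) at x* = (1, 2.0)
# for a toy SLR dataset; the numbers below are made up for illustration.
X = np.column_stack([np.ones(6), [1, 2, 3, 4, 5, 6]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
print(prediction_interval(X, Y, np.array([[1.0, 2.0]]), np.array([1.0])))
```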
4 Regression Diagnosis

4.1 Introduction
4.1.1 Linear modelling theory relies on a set of assumptions, which may not be accurate in reality. Departure from such assumptions may be caused by a number of factors, for example:
1. Curvilinearity — responses and explanatory variables are related in a more subtle way than linearity.
2. Heteroscedasticity — responses have non-constant variances.
3. Non-normality — responses have distributions other than normal, often compounded by heavy tails and/or skewed shapes.
4. Influential observations and outliers — the dataset contains some observations which may be unduly influential in the least squares calculations.
5. Multicollinearity — a bad design of explanatory settings may yield high standard errors for the LSE's.

4.1.2 Two devices provide a quick overview of the dataset and may uncover potential problems:
1. Scatterplot matrix of the data. This is a matrix of scatterplots of all pairs of variables. It is useful for identifying major departures from assumptions: e.g. curvilinearity, heteroscedasticity, influential and outlying observations, and multicollinearity.
2. Residual plot against fitted values. This is a plot of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$. A notable pattern in the plot may reveal curvilinearity, heteroscedasticity, non-normality or influential and outlying observations.

Consider henceforth a general linear model $Y = X\beta + \varepsilon$, where $X = [x_1, \ldots, x_n]^\top$ is an $n \times d$ design matrix and $\varepsilon \sim N(\mathbf{0}_n, \sigma^2 I_n)$.
4.2 Leverage

4.2.1 Definition.
Hat matrix, H: the $n \times n$ matrix $X(X^\top X)^{-1}X^\top$.
Leverage of the $i$th data point, $h_i$: the $i$th diagonal element of H, i.e. $h_i = x_i^\top (X^\top X)^{-1} x_i$.

4.2.2 The leverage is a measure of the remoteness of the $i$th observation from the remaining observations in the space of explanatory variables. It serves as a useful diagnostic measure for identifying anomalous values of explanatory variables.
This is best exemplified in the special case of a simple linear regression model, where
\[
h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n} (x_j - \bar x)^2}.
\]
4.2.3 Possible reasons for anomalous values of explanatory variables are: (i) inaccurate recording of the values, (ii) investigation of an anomalous sample item, or (iii) special design of the experiment to incorporate extreme cases.

4.2.4 Recall that the (raw) residual is $\hat\varepsilon_i = Y_i - \hat Y_i$. It follows from $\mathrm{Var}(\hat\varepsilon) = \sigma^2(I_n - H)$ and $\mathrm{Var}(\hat Y) = \sigma^2 H$ that
\[
\mathrm{Var}(\hat\varepsilon_i) = \sigma^2 (1 - h_i)
\qquad\text{and}\qquad
\mathrm{Var}(\hat Y_i) = \sigma^2 h_i.
\]
This implies $0 \leq h_i \leq 1$.
• If $h_i = 0$, then $x_i = \mathbf{0}$ and so $\hat Y_i = 0$.
• If $h_i = 1$, then $\hat Y_i = Y_i$.
A large $h_i$ means that the $i$th data point exercises substantial leverage in determining the fitted value $\hat Y_i$.

4.2.5 Leverage plot. Noting that
\[
\frac{1}{n}\sum_{i=1}^{n} h_i = \frac{1}{n}\mathrm{tr}(H) = \frac{d}{n},
\]
a rule of thumb is to single out those points with $h_i > 2d/n$ as high leverage points. The leverage plot, i.e. an index plot of the $h_i$, is useful in identifying these points.
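Leverages are just the diagonal of the hat matrix. A minimal sketch (Python with numpy, not part of these notes) on a hypothetical design with one remote x value:

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    X = np.asarray(X, float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

# Hypothetical design matrix with one remote x value (x = 30).
X = np.column_stack([np.ones(6), [1, 2, 3, 4, 5, 30]])
h = leverages(X)
n, d = X.shape
print(h, h > 2 * d / n)   # the last point is flagged as a high leverage point
```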
4.3 Normal probability plot

4.3.1 If the error $\varepsilon_i$ is non-normal, the residual $\hat\varepsilon_i$ will be non-normal. Checks for normality are thus often done on the residuals.

4.3.2 Let $\hat\varepsilon_{(1)} \leq \cdots \leq \hat\varepsilon_{(n)}$ be the ordered values of the residuals. If the residuals constitute a random sample from $N(0, \sigma^2)$, then we should find
\[
\hat\varepsilon_{(i)} \approx 100(i/n)\%\ \text{quantile of } N(0, \sigma^2), \quad\text{or}\quad \hat\varepsilon_{(i)} \approx \sigma\,\Phi^{-1}(i/n),
\]
where $\Phi$ denotes the $N(0, 1)$ distribution function. Thus, the points $\big(\Phi^{-1}(i/n),\ \hat\varepsilon_{(i)}\big)$, $i = 1, \ldots, n$, are expected to lie approximately on a straight line if the normality assumption is correct.
4.3.3 Normal probability plot, or normal quantile–quantile plot (QQ plot) — plots the ordered residuals $\hat\varepsilon_{(i)}$ against
\[
\Phi^{-1}\!\left(\frac{i - 3/8}{n + 1/4}\right).
\]
A plot that is nearly linear suggests agreement with normality, whereas a plot that departs substantially from linearity suggests that the error distribution is not normal.
Note: the adjustments $-3/8$ and $+1/4$ are made to avoid the infinite values $\Phi^{-1}(0)$ and $\Phi^{-1}(1)$.

4.3.4 That the residuals constitute a random sample from $N(0, \sigma^2)$ is not strictly true: indeed,
\[
\hat\varepsilon = (I_n - H)Y \sim N\big(\mathbf{0}_n,\ \sigma^2(I_n - H)\big),
\]
hence the residuals are correlated in general. However, we may usually assume that the correlations between the $\hat\varepsilon_i$ are negligible in graphical procedures, except when $(n - d)/d$ is small.
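The plotting positions of 4.3.3 can be generated as follows (Python with numpy/scipy, not part of these notes); the simulated residuals are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def qq_points(residuals):
    """Coordinates for a normal QQ plot using the (i - 3/8)/(n + 1/4) plotting positions."""
    resid = np.sort(np.asarray(residuals, float))
    n = len(resid)
    probs = (np.arange(1, n + 1) - 3/8) / (n + 1/4)
    return norm.ppf(probs), resid   # theoretical quantiles, ordered residuals

# Hypothetical residuals; a roughly straight point cloud supports normality.
rng = np.random.default_rng(1)
x_theory, y_ordered = qq_points(rng.normal(scale=0.3, size=40))
print(np.corrcoef(x_theory, y_ordered)[0, 1])   # close to 1 for normal residuals
```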
4.4 Outliers

4.4.1 Outliers are sample observations that have unusual response values.

4.4.2 It is important to spot outliers and determine whether they are simply there by chance. If there is direct evidence that an outlier stems from a recording error, a miscalculation, malfunctioning of equipment, or a similar type of circumstance, then one should discard it. On the other hand, outliers may convey significant information if, for example, they are caused by the interaction with an omitted explanatory variable. Outliers may also arise from atypical values of explanatory variables for which the model does not give an adequate representation. Here, the model should be re-examined to determine the range of values of explanatory variables for which it is adequate. Observations outside this range should be analysed separately, and a better model should be constructed.

4.4.3 Plots of residuals $\hat\varepsilon_i$ against fitted responses $\hat Y_i$ may be used to identify outliers.

4.4.4 Since $\mathrm{Var}(\hat\varepsilon_i) = \sigma^2(1 - h_i)$, a residual at a high leverage point has a smaller variance than those at points of lower leverage. Consequently, the residuals are, strictly speaking, not directly comparable to one another. Such shortcomings may be partially remedied by considering modified versions of the residual.

4.4.5 Definition. The Studentised residual (or internally Studentised residual) $f_i$ is defined as
\[
f_i = \frac{\hat\varepsilon_i}{s\sqrt{1 - h_i}},
\]
where $s^2 = MSE = \sum_{j=1}^{n} \hat\varepsilon_j^2/(n - d)$.
4.4.6 Delete case i from the data and carry out linear regression with the remaining data. Denote by $\hat\beta_{(i)}$ and $s_{(i)}^2$ the lse's of $\beta$ and $\sigma^2$ respectively, based on the reduced dataset. Then the fitted response for case i is $\hat Y_{(i)} = x_i^\top \hat\beta_{(i)}$.
Definition. The deleted Studentised residual (or externally Studentised residual) for case i is defined as
\[
r_i = \frac{Y_i - \hat Y_{(i)}}{s_{(i)}/\sqrt{1 - h_i}}.
\]
Note that $r_i \sim t_{n-d-1}$. Equivalently, $r_i$ can be expressed as
\[
r_i = f_i\,\sqrt{\frac{n - d - 1}{n - d - f_i^2}}.
\]
Roughly speaking, $|r_i| > 2$ suggests that the $i$th sample observation should be carefully scrutinised as a possible outlier.
Proof: First consider
\[
\hat\beta_{(i)} = \big(X^\top X - x_i x_i^\top\big)^{-1}\big(X^\top Y - x_i Y_i\big)
= \left\{(X^\top X)^{-1} + \frac{(X^\top X)^{-1} x_i x_i^\top (X^\top X)^{-1}}{1 - h_i}\right\}\big(X^\top Y - x_i Y_i\big).
\]
It follows that
\[
Y_i - \hat Y_{(i)} = Y_i - x_i^\top \hat\beta_{(i)} = Y_i - x_i^\top \hat\beta + h_i\big(Y_i - \hat Y_{(i)}\big) = \hat\varepsilon_i + h_i\big(Y_i - \hat Y_{(i)}\big),
\]
so that
\[
Y_i - \hat Y_{(i)} = (1 - h_i)^{-1}\hat\varepsilon_i \sim N\big(0,\ (1 - h_i)^{-2}\sigma^2(1 - h_i)\big) = N\big(0,\ \sigma^2(1 - h_i)^{-1}\big).
\]
Noting that $s_{(i)}^2 \sim \sigma^2 (n - 1 - d)^{-1}\chi^2_{n-1-d}$ and is independent of $Y_i - \hat Y_{(i)}$, we then have
\[
r_i = \frac{(1 - h_i)^{1/2}\big(Y_i - \hat Y_{(i)}\big)/\sigma}{s_{(i)}/\sigma} \sim t_{n-d-1}.
\]
Consider next
\[
(n - d - 1)\,s_{(i)}^2 = \big\|Y - X\hat\beta_{(i)}\big\|^2 - \big(Y_i - \hat Y_{(i)}\big)^2.
\]
Since $\hat\beta_{(i)} = \hat\beta - (1 - h_i)^{-1}(X^\top X)^{-1}x_i\hat\varepsilon_i$ and $\hat\varepsilon^\top X = 0$,
\[
\big\|Y - X\hat\beta_{(i)}\big\|^2
= \big\|\hat\varepsilon + (1 - h_i)^{-1}X(X^\top X)^{-1}x_i\hat\varepsilon_i\big\|^2
= \|\hat\varepsilon\|^2 + (1 - h_i)^{-2}h_i\hat\varepsilon_i^2,
\]
so that
\[
(n - d - 1)\,s_{(i)}^2 = (n - d)s^2 + (1 - h_i)^{-2}h_i\hat\varepsilon_i^2 - (1 - h_i)^{-2}\hat\varepsilon_i^2
= (n - d)s^2 - (1 - h_i)^{-1}\hat\varepsilon_i^2
= \big(n - d - f_i^2\big)s^2.
\]
It then follows that
\[
r_i = (1 - h_i)^{-1/2}\hat\varepsilon_i/s_{(i)} = s f_i/s_{(i)} = f_i\,\sqrt{\frac{n - d - 1}{n - d - f_i^2}}. \qquad\square
\]
4.4.7 Studentised residual plot / deleted Studentised residual plot. For more reliable detection of outliers, residual plots may be based on $f_i$ or $r_i$ instead of $\hat\varepsilon_i$, plotted against the fitted values $\hat Y_i$. Advantage — if the linear model is correct, then all the $f_i$ have the same variance and are comparable to one another. The same also holds for the $r_i$.
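Both kinds of Studentised residuals follow from the hat-matrix diagonal. A sketch (Python with numpy, not part of these notes) on hypothetical data with one deliberately outlying response:

```python
import numpy as np

def studentised_residuals(X, Y):
    """Internally (f_i) and externally (r_i) Studentised residuals of a linear model."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n, d = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = Y - H @ Y
    s2 = resid @ resid / (n - d)
    f = resid / np.sqrt(s2 * (1 - h))                  # internally Studentised
    r = f * np.sqrt((n - d - 1) / (n - d - f**2))      # externally Studentised
    return f, r

# Hypothetical data with one outlying response (the last Y value).
X = np.column_stack([np.ones(8), np.arange(8.0)])
Y = np.array([0.2, 1.1, 2.0, 2.9, 4.1, 5.0, 6.1, 12.0])
f, r = studentised_residuals(X, Y)
print(np.abs(r) > 2)   # flags the outlying observation
```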
4.5 Influential observations

4.5.1 A sample observation is said to be an influential observation if its exclusion from the analysis results in conclusions very different from those reached in its presence.

4.5.2 The following diagram shows an example of a simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\beta_1$ is significantly non-zero.
[Figure: "Effects of Outliers" — scatterplot of response Y against explanatory variable x, marking points A, B, C, D and the fitted lines with and without C, D.]
Point A, with its unusually big response value, is an outlier, whereas points B, C, D are high leverage points. Among them, A and B may not be too influential in fixing the fitted regression function, but C and D do have a strong influence on it.

4.5.3 A common way to determine the influence of individual observations is to monitor the changes in the regression results as each observation is deleted. Big changes suggest that the deleted observation is influential.
4.5.4 Cook's distance is a distance. Target to monitor: lse (i) (after deletion of the i th point) Definition. Cook's distance Di measures the influence of the in the data point by examining the change in when this point is excluded from the analysis. It is defined as ( (i) )>X >X ( (i) ) hi fi 2 Di = =. ds 2 d 1 hi A large Di, compared with, say, the median of Fd, nd, denotes an influential i th observation. ........................................................................................ Proof: Using the results given in the proof of 4.4.6, we have Di = (ds 2)1 ( (i) )>X >X ( (i) ) = (ds 2)1 (1 hi )2 hi 2i = d 1 (1 hi )1 hi fi 2.
4.5.5 DFFITS.
Target to monitor: the fitted value $\hat{Y}_{(i)} = x_i^\top\hat{\beta}_{(i)}$ (after deletion of the $i$th point).
Definition. The "difference in fitted value, standardised" (DFFITS) measures the influence of the $i$th data point by examining the change in the fitted value $\hat{Y}_i$ when this point is excluded from the analysis. It is defined as
$$\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{(i)}}{s_{(i)}\sqrt{h_i}} = r_i\sqrt{\frac{h_i}{1 - h_i}} .$$
Observations with $|\mathrm{DFFITS}_i| > 2\sqrt{d/n}$ may be considered influential.
........................................................................................
Proof: Using the results given in the proof of 4.4.6, we have
$$\mathrm{DFFITS}_i = s_{(i)}^{-1}h_i^{-1/2}\, x_i^\top(\hat{\beta} - \hat{\beta}_{(i)}) = s_{(i)}^{-1}h_i^{-1/2}\,\frac{h_i\hat{\epsilon}_i}{1 - h_i} = r_i\sqrt{\frac{h_i}{1 - h_i}} .$$
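Continuing the sketch above (again not from the notes; fit is the same assumed model), Cook's distances and DFFITS can be checked against the formulas just given using R's built-in functions cooks.distance() and dffits().

h <- hatvalues(fit); f <- rstandard(fit); r <- rstudent(fit)
d <- length(coef(fit)); n <- nobs(fit)
D <- f^2 * h / (d * (1 - h))          # Cook's distance D_i
all.equal(D, cooks.distance(fit))
dft <- r * sqrt(h / (1 - h))          # DFFITS_i
all.equal(dft, dffits(fit))
which(D > qf(0.5, d, n - d))          # compare with the median of F_{d, n-d}
which(abs(dft) > 2 * sqrt(d / n))     # DFFITS threshold 2*sqrt(d/n)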
4.6 Multicollinearity
4.6.1 Write $X = [c_1, \ldots, c_d]$, so that $c_j$ is the $j$th column of $X$. Multicollinearity occurs when there is a close relationship between two or more of the $c_j$'s. In this case there exists a vector $a$ with unit length (i.e. $\|a\| = 1$) such that $\|Xa\|^2$ is small, and consequently the corresponding linear combination $a^\top\hat{\beta}$ has a large variance.
........................................................................................
To see the above, note first that there exists a $d \times d$ matrix $Q$ satisfying $QQ^\top = Q^\top Q = I_d$ and $X^\top X = Q\Lambda Q^\top$, where $\Lambda$ denotes a diagonal matrix with diagonal elements given by the eigenvalues of $X^\top X$. Using the Cauchy-Schwarz inequality, we have
$$1 = a^\top Q Q^\top a = (\Lambda^{1/2}Q^\top a)^\top(\Lambda^{-1/2}Q^\top a) \le \|\Lambda^{1/2}Q^\top a\|\,\|\Lambda^{-1/2}Q^\top a\| ,$$
where
$$\|\Lambda^{-1/2}Q^\top a\|^2 = a^\top Q\Lambda^{-1}Q^\top a = a^\top(X^\top X)^{-1}a \qquad\text{and}\qquad \|\Lambda^{1/2}Q^\top a\|^2 = a^\top Q\Lambda Q^\top a = \|Xa\|^2 .$$
Thus
$$\mathrm{Var}(a^\top\hat{\beta}) = \sigma^2 a^\top(X^\top X)^{-1}a \ge \frac{\sigma^2}{\|Xa\|^2},$$
suggesting that $\mathrm{Var}(a^\top\hat{\beta})$ is large if $\|Xa\|$ is small.
4.6.2 Condition index. Define
$$\tilde{X} = \left[\frac{c_1}{\|c_1\|}, \ldots, \frac{c_d}{\|c_d\|}\right].$$
Taking $\tilde{X}$ to be the design matrix helps eliminate "near singularity" of $X^\top X$ caused by a single small $c_j$, and therefore directs attention to multicollinearity only. Define $\lambda_1, \ldots, \lambda_d$ to be the eigenvalues of $\tilde{X}^\top\tilde{X}$ and $\gamma_1, \ldots, \gamma_d$ their corresponding eigenvectors with $\|\gamma_k\| = 1$ for all $k$.
A small eigenvalue $\lambda_j$ indicates the presence of multicollinearity, and the corresponding eigenvector $\gamma_j$ specifies the relationship between the $c_j$'s that causes trouble. This is because $\|\tilde{X}\gamma_j\|^2 = \gamma_j^\top\tilde{X}^\top\tilde{X}\gamma_j = \lambda_j$: if $\lambda_j$ is small, the columns $c_k$ associated with large entries of $\gamma_j$ are closely related, leading to multicollinearity.
The condition index is defined as
$$\kappa_k = \left(\frac{\max_j \lambda_j}{\lambda_k}\right)^{1/2}.$$
Any $\kappa_k > 30$ indicates possible multicollinearity.

4.6.3 Variance proportions.
Let $\hat{\theta}$ be the lse of $\theta$ when fitting $E[Y] = \tilde{X}\theta$. Then
$$\mathrm{Var}(\hat{\theta}) = \sigma^2(\tilde{X}^\top\tilde{X})^{-1} = \sigma^2\sum_{k=1}^d \lambda_k^{-1}\gamma_k\gamma_k^\top \qquad\text{and}\qquad \mathrm{Var}(\hat{\theta}_j) = \sigma^2\sum_{k=1}^d \lambda_k^{-1}\gamma_{jk}^2, \quad j = 1, \ldots, d,$$
where $\gamma_k = [\gamma_{1k}, \ldots, \gamma_{dk}]^\top$. Define
$$\pi_{kj} = \frac{\lambda_k^{-1}\gamma_{jk}^2}{\sum_{\ell=1}^d \lambda_\ell^{-1}\gamma_{j\ell}^2}\,,$$
which is the proportion of the variance of $\hat{\theta}_j$ accounted for by the eigenvector $\gamma_k$ corresponding to the eigenvalue $\lambda_k$. Note that $\sum_{k=1}^d \pi_{kj} = 1$, and the $\pi_{kj}$ are usually tabulated in a multicollinearity diagnostics table whose column sums all equal 1:

Condition index    Variance proportions for $\hat{\theta}_1$   $\hat{\theta}_2$   ...   $\hat{\theta}_d$
$\kappa_1$         $\pi_{11}$                                  $\pi_{12}$         ...   $\pi_{1d}$
...                ...                                         ...                      ...
$\kappa_d$         $\pi_{d1}$                                  $\pi_{d2}$         ...   $\pi_{dd}$
Total              1                                           1                  ...   1

General strategy:
1. identify rows associated with high condition indices $\kappa_k$;
2. for each of these rows, identify large values of $\pi_{kj}$;
3. multicollinearity is caused mainly by linear combinations of explanatory variables associated with large values of $\pi_{kj}$ across the same row.
If the explanatory variable corresponding to $c_j$ features prominently in several combinations, its corresponding $\pi_{kj}$ values cannot all be large (due to the $\sum_k \pi_{kj} = 1$ restriction); one should make allowance for this.
4.6.4 In cases of multicollinearity, one should examine in detail the offending explanatory variables and see whether they should be omitted or combined, and whether more observations need to be taken to increase the precision of the estimates.
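The condition indices and variance proportions can be computed directly from a design matrix. The following is a minimal sketch, not from the notes; fit is again the assumed fitted model from the earlier sketches, and the output layout only roughly mimics the diagnostics table above.

X  <- model.matrix(fit)
Xt <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # scale each column c_j to unit length
ev <- eigen(crossprod(Xt))                   # eigen-decomposition of Xt' Xt
lambda <- ev$values; G <- ev$vectors         # G[j, k] = gamma_jk
kappa  <- sqrt(max(lambda) / lambda)         # condition indices (any > 30 is suspect)
phi <- G^2 %*% diag(1 / lambda)              # phi[j, k] = gamma_jk^2 / lambda_k
P   <- t(sweep(phi, 1, rowSums(phi), "/"))   # pi_kj; each column of P sums to 1 over k
colnames(P) <- colnames(X)
round(cbind(cond.index = kappa, P), 4)       # one row per eigenvalue, as in the table above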
4.7
Numerical example
4.7.1 Example 3.2.1: (cont'd) Consider the data of Example 3.2.1. The scatterplot matrix given in 3.2.3 suggests linear relationships between ozone and the other 3 explanatory variables, and correlation between radiation and temperature.
[Figure: normal probability plot of the residuals against N(0,1) quantiles (left) and index plot of the leverages with the threshold 2d/n = 8/30 marked (right).]
4.7.2 Leverage. The leverage plot shows that observation 7 (and, marginally, observation 30) has high leverage, based on the threshold 2d/n = 8/30.
4.7.3 Normal probability plot. The normal probability plot is approximately linear and suggests that the error distribution agrees reasonably well with normality.
4.7.4 Outliers. The following plots show the Studentised residuals $f_i$ against the fitted values $\hat{Y}_i$ (left) and the Studentised deleted residuals $r_i$ against the fitted values $\hat{Y}_i$ (right).
[Figure: Studentised residual plot (left) and Studentised deleted residual plot (right), each plotted against the fitted values, with the threshold levels +2 and -2 indicated by broken lines.]
The Studentised residual plot does not reveal any noticeable trends. Observations 17 and 23 seem outlying relative to the other observations. The Studentised deleted residual plot is very similar. The threshold levels of plus and minus 2 are indicated by broken lines, suggesting that observations 17 and 23 may be outliers. Neither residual plot detects any heteroscedasticity.
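Displays of the kind discussed in 4.7.2-4.7.4 can be reproduced along the following lines; this is a minimal sketch (not the notes' own plotting code), continuing with the assumed fitted model fit and the quantities d and n defined in the earlier sketches.

op <- par(mfrow = c(2, 2))
qqnorm(residuals(fit)); qqline(residuals(fit))              # normal probability plot
plot(hatvalues(fit), ylab = "leverage"); abline(h = 2 * d / n, lty = 2)
plot(fitted(fit), rstandard(fit), xlab = "fitted value",
     ylab = "Studentised residual"); abline(h = c(-2, 2), lty = 2)
plot(fitted(fit), rstudent(fit), xlab = "fitted value",
     ylab = "Studentised deleted residual"); abline(h = c(-2, 2), lty = 2)
par(op)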
4.7.5 Influential observations. The following displays index plots of the Cook's distances (left) and of DFFITS (right).
[Figure: index plot of Cook's distances with the threshold set at the median of F(4,26) = 0.86145 (left), and index plot of DFFITS with thresholds at plus and minus 2(4/30)^{1/2} (right).]
Based on the threshold level 0.86145 (the 50% quantile of $F_{4,26}$), the Cook's distances reveal no influential observations.
Based on the threshold levels $\pm 2\sqrt{d/n} = \pm 0.7303$, DFFITS reveals influential observations 17 and 23.
4.7.6 The multicollinearity diagnostics table is given below, where the variance proportions for each variable are listed according to the sorted order of the condition indices.
                   Variance proportions
Condition index    Intercept   radiation   temperature   wind
 1.0000            0.0008      0.0125      0.0009        0.0056
 4.2902            0.0033      0.7336      0.0009        0.0828
 8.1243            0.0332      0.1501      0.0576        0.7991
24.3117            0.9628      0.1038      0.9406        0.1126
All condition indices are below 30, so no serious multicollinearity is detected. The bottom row shows that temperature and the intercept are correlated to a certain degree.
5 Classification Models

5.1 Introduction
5.1.1 A response variable Y may be related to qualitative (or categorical) explanatory variables such as gender, race, machine type, age group, geographic location, etc. A study of the effects of such variables, sometimes called factors, on the response variable entails a special form of ANOVA under a general linear model.
5.1.2 A level of a factor is a particular "value" of that factor. Example:

Factor    Levels                               No. of levels
Gender    male, female                         2
Race      Chinese, Japanese, Russian           3
Age       < 12, 12-21, 22-40, 41-65, > 65      5
Region    urban, rural, industrial             3

5.1.3 A treatment is a particular combination of factor levels. In a single-factor study, a treatment is simply a particular level of the factor. Example:
(i) One factor: gender. No. of levels: 2 (male & female). No. of possible treatments: 2 (male & female).
(ii) Two factors: gender & race. No. of levels for gender: 2 (male & female); no. of levels for race: 3 (Chinese, Japanese, Russian). No. of possible treatments: 2 x 3 = 6 (Chinese male, Chinese female, Japanese male, Japanese female, Russian male, Russian female).
5.1.4 Main objective: to determine whether the treatments have any systematic effects on the response variable, and to examine the nature of these effects. Our underlying model treats all observations subject to the same treatment as a random sample from the same distribution. We are interested in studying how the means of all these distributions vary across different treatments.
5.1.5 Special case: two treatment groups, which is a two-sample problem (useful tools: two-sample t tests and confidence intervals based on t distributions).
5.1.6 Example 5.1.1 The data frame in Table 5.1.1 gives information on the makes of cars taken from the April 1990 issue of Consumer Reports. This data frame contains data on 3 variables (columns) for 40 cars (rows). One variable (Price) is quantitative, while the other two are qualitative:
Price: list price with standard equipment, in dollars;
Country: country in which the car was manufactured;
Type: general type of car.
The two categorical factors Country and Type can be used to study the variation in the list price of the 40 cars. Levels of the factors are given below:

Factor     Levels
Country    Japan, Korea, USA
Type       compact, large, medium, small, sporty, van
The numbers of observations (cars) within each treatment group are given below:

           Type
Country    compact   large   medium   small   sporty   van
Japan      7         0       4        6       3        1
Korea      0         0       0        2       0        0
USA        4         3       4        2       2        2
In this example, only 12 of the 3 x 6 = 18 possible treatment groups contain data. Our classification model assumes that these 12 groups represent 12 random samples from 12 populations with the same variance but possibly different means. Our objective is to study how these 12 population means are affected by the levels of the two factors under study.
5.2
One-way classification models (One-way ANOVA).
5.2.1 No. of factors: 1. No. of levels: k.

Level                    1                            2                            ...   k
Observable responses     $Y_{11},\ldots,Y_{1n_1}$     $Y_{21},\ldots,Y_{2n_2}$     ...   $Y_{k1},\ldots,Y_{kn_k}$

Note: A sample of $n_i$ independent responses, namely $(Y_{i1},\ldots,Y_{in_i})$, is observed at the $i$th level of the factor.
5.2.2 Cell-mean model and treatment-effect model
Consider the following two equivalent forms of the one-way classification model:
(A) Cell-mean model:
$$Y_{ij} = \mu_i + \epsilon_{ij}, \qquad i = 1, \ldots, k, \; j = 1, \ldots, n_i, \qquad E[\epsilon_{ij}] = 0, \; \mathrm{Var}(\epsilon_{ij}) = \sigma^2,$$
where the $\epsilon_{ij}$'s denote independent random errors and the parameter $\mu_i$ represents the $i$th treatment mean.
Table 5.1.1: Example 5.1.1 - data on 40 cars.
[Table body not reproduced here: for each of the 40 makes (Acura Legend, Buick Century, ..., Toyota Tercel), the list price in dollars, the country of manufacture and the type of car are listed.]
(B) Factor-effect model: Reparameterise the treatment means $\mu_i$ as
$$\mu_i = \mu + \tau_i, \qquad i = 1, \ldots, k, \qquad \text{subject to the constraint } \tau_k = 0.$$
Under the above reparameterisation and the constraint that $\tau_k = 0$, $\tau_i$ characterises the effect of the $i$th treatment relative to the $k$th treatment. The purposes of the constraint include:
1. To avoid overparameterisation of the model; in other words, it makes the two models have the same number of parameters. Model (A) has $k$ parameters, and (B) has $k + 1 - 1 = k$ parameters after the constraint.
2. To avoid non-estimability in (B). For example, if $k = 2$ and there is no constraint, then
$$\mu_1 = \mu + \tau_1 = (\mu - c) + (\tau_1 + c), \qquad \mu_2 = \mu + \tau_2 = (\mu - c) + (\tau_2 + c)$$
for any constant $c$, so the parameters $\mu, \tau_1, \tau_2$ are not estimable.
5.2.3 Under the general linear model formulation, a one-way classification model can be
written as
$$Y = \begin{bmatrix} Y_{11} \\ Y_{12} \\ \vdots \\ Y_{kn_k} \end{bmatrix} = \begin{bmatrix} 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & & & \vdots \\ 0_{n_k} & 0_{n_k} & \cdots & 1_{n_k} \end{bmatrix}\begin{bmatrix} \mu_1 \\ \vdots \\ \mu_k \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{kn_k} \end{bmatrix} = \begin{bmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & & & & \vdots \\ 1_{n_{k-1}} & 0_{n_{k-1}} & 0_{n_{k-1}} & \cdots & 1_{n_{k-1}} \\ 1_{n_k} & 0_{n_k} & 0_{n_k} & \cdots & 0_{n_k} \end{bmatrix}\begin{bmatrix} \mu \\ \tau_1 \\ \vdots \\ \tau_{k-1} \end{bmatrix} + \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{kn_k} \end{bmatrix}.$$
The last expression has the form of a "multiple linear regression" model with $k - 1$ "explanatory variables", whose values are stored in the second to $k$th columns of the design matrix. Such "explanatory variables" are known as dummy variables. A dummy variable is a function of the factor level:
$$D_i(A) = \begin{cases} 1, & \text{if A is at the } i\text{th level},\\ 0, & \text{otherwise.}\end{cases}$$
We denote them by $D_1, \ldots, D_{k-1}$, corresponding to the coefficients $\tau_1, \ldots, \tau_{k-1}$. Thus a factor with $k$ levels can be identified with a set of $k - 1$ dummy variables. The values of the dummy variables for a particular response are determined by the level of the factor at which the response is observed, according to the following table.

Factor level    D1   D2   D3   ...   D(k-1)
1               1    0    0    ...   0
2               0    1    0    ...   0
...
k-1             0    0    0    ...   1
k               0    0    0    ...   0
Thus, if $i < k$, then the response $Y_{ij}$ has dummy variables $D_i = 1$ and $D_{i'} = 0$ for $i' \neq i$, whereas the response $Y_{kj}$ has dummy variables $D_i = 0$ for all $i$. Expressed in terms of dummy variables, the one-way classification model has the form
$$Y = \mu + \tau_1 D_1 + \cdots + \tau_{k-1}D_{k-1} + \epsilon.$$
(c.f. the multiple linear regression model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$.)
5.2.4 There are two common choices of constraints to remove the overparameterisation.
(A) Level-$i$-zero constraint (take $i = 1$ as an example). The simplest to implement (and hence found often in software packages) is to set $\tau_1 = 0$. This immediately removes one parameter, and also removes the corresponding column of $X$. The interpretation of $\mu$ in the one-factor model is now that it is the mean of group 1. The interpretation of $\tau_i$, for
$i = 2, \ldots, k$, is that it is the deviation of the mean of group $i$ from that of group 1. In other words, the treatment mean of group 1 is used as the benchmark value. In a more general linear model containing a factor together with other regressor variables, discussed in later sections, the level-one-zero constraint means that the parameter associated with factor level $i$, again for $i = 2, \ldots, k$, is the increment (or decrement if negative) in the mean of Y when we move from level 1 of this factor to level $i$.
(B) Sum-to-zero constraint. This constraint is $\sum_{i=1}^k \tau_i = 0$. The interpretation of $\mu$ is then that it is the mean averaged over the $k$ groups; the $\tau_i$'s are then deviations from this average. In other words, the overall mean is used as the benchmark value. This is attractive in that it does not entail an arbitrary choice of one level to eliminate. On the other hand, it may be equally arbitrary in the sense that it depends on the actual levels of the factor that are present in the data (because it is usually the case that other levels of the factor could also exist).
5.2.5 In the general linear model formulation, the sum-to-zero constraint gives the level parameters the interpretation of deviations from the average, in the same way. Technically, the sum-to-zero constraint is implemented by removing one level parameter, usually the last, and setting it equal to minus the sum of all the others. So wherever $\tau_k$ appears in the model, it stands for $-\sum_{i=1}^{k-1}\tau_i$ instead of representing a separate unknown parameter. In this case the model becomes
$$y_{ij} = \begin{cases} \mu + \tau_i + \epsilon_{ij} & \text{if } i = 1, 2, \ldots, k-1,\\[2pt] \mu - \sum_{i=1}^{k-1}\tau_i + \epsilon_{ij} & \text{if } i = k. \end{cases}$$
This removes the column of $X$ corresponding to $\tau_k$, but also modifies the columns corresponding to all the other $\tau_i$'s in the rows where $i = k$. Below are $X$ and $\theta$ in the three cases where no constraint, the level-one constraint and the sum-to-zero constraint are imposed.
No constraint:
$$X_{n\times(k+1)} = \begin{bmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & & & & \vdots \\ 1_{n_k} & 0_{n_k} & 0_{n_k} & \cdots & 1_{n_k} \end{bmatrix} \qquad\text{and}\qquad \theta = \begin{bmatrix} \mu \\ \tau_1 \\ \vdots \\ \tau_k \end{bmatrix}.$$

Level-one constraint:
$$X_{n\times k} = \begin{bmatrix} 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & & & \vdots \\ 1_{n_k} & 0_{n_k} & \cdots & 1_{n_k} \end{bmatrix} \qquad\text{and}\qquad \theta = \begin{bmatrix} \mu \\ \tau_2 \\ \vdots \\ \tau_k \end{bmatrix}.$$

Sum-to-zero constraint:
$$X_{n\times k} = \begin{bmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & & & & \vdots \\ 1_{n_{k-1}} & 0_{n_{k-1}} & 0_{n_{k-1}} & \cdots & 1_{n_{k-1}} \\ 1_{n_k} & -1_{n_k} & -1_{n_k} & \cdots & -1_{n_k} \end{bmatrix} \qquad\text{and}\qquad \theta = \begin{bmatrix} \mu \\ \tau_1 \\ \vdots \\ \tau_{k-1} \end{bmatrix}.$$
5.2.6 Example 5.2.1 The file Coagulation.txt (Box, Box and Draper, 1978) contains data from a study of blood coagulation times: 24 animals were randomly assigned to 4 different diets - A, B, C and D. The aim of the study is to compare the mean blood coagulation times for the 4 diets.

Diet A: 62 60 63 59
Diet B: 63 67 71 64 65 66
Diet C: 68 66 71 67 68 68
Diet D: 56 62 60 61 63 64 63 59
In the cell-mean model, $\mu_i$ denotes the mean coagulation time for the $i$th diet. In the factor-effect model, when using the level-A-zero constraint, $\mu$ is the mean time for diet A, and $\tau_i$ is the difference between the mean time of the corresponding diet and that of diet A; when using the sum-to-zero constraint, $\mu$ is the average over all mean times of the four diets, and $\tau_i$ is the deviation of the mean time of the corresponding diet from that average. The GLM design matrices can be obtained in R as below.

rm(list = ls(all = TRUE))
c.data <- read.table("Coagulation.txt", header = TRUE)
attach(c.data)
Level-one constraint:

g <- lm(coag ~ diet)
model.matrix(g)
   (Intercept) dietB dietC dietD
1            1     0     0     0
2            1     0     0     0
3            1     0     0     0
4            1     0     0     0
5            1     1     0     0
6            1     1     0     0
7            1     1     0     0
8            1     1     0     0
9            1     1     0     0
10           1     1     0     0
11           1     0     1     0
12           1     0     1     0
13           1     0     1     0
14           1     0     1     0
15           1     0     1     0
16           1     0     1     0
17           1     0     0     1
18           1     0     0     1
19           1     0     0     1
20           1     0     0     1
21           1     0     0     1
22           1     0     0     1
23           1     0     0     1
24           1     0     0     1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$diet
[1] "contr.treatment"

Sum-to-zero constraint:

   (Intercept) diet1 diet2 diet3
1            1     1     0     0
2            1     1     0     0
3            1     1     0     0
4            1     1     0     0
5            1     0     1     0
6            1     0     1     0
7            1     0     1     0
8            1     0     1     0
9            1     0     1     0
10           1     0     1     0
11           1     0     0     1
12           1     0     0     1
13           1     0     0     1
14           1     0     0     1
15           1     0     0     1
16           1     0     0     1
17           1    -1    -1    -1
18           1    -1    -1    -1
19           1    -1    -1    -1
20           1    -1    -1    -1
21           1    -1    -1    -1
22           1    -1    -1    -1
23           1    -1    -1    -1
24           1    -1    -1    -1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$diet
[1] "contr.sum"
5.3
Two-way classification models (Two-way ANOVA).
5.3.1 Factors: A and B. No. of levels of A: I. No. of levels of B: J. Denote by $a_i b_j$ the treatment made up of the $i$th level of A and the $j$th level of B. Observable responses:
                Level of B
Level of A      1                                  2                                  ...   J
1               $Y_{111},\ldots,Y_{11n_{11}}$      $Y_{121},\ldots,Y_{12n_{12}}$      ...   $Y_{1J1},\ldots,Y_{1Jn_{1J}}$
2               $Y_{211},\ldots,Y_{21n_{21}}$      $Y_{221},\ldots,Y_{22n_{22}}$      ...   $Y_{2J1},\ldots,Y_{2Jn_{2J}}$
...
I               $Y_{I11},\ldots,Y_{I1n_{I1}}$      $Y_{I21},\ldots,Y_{I2n_{I2}}$      ...   $Y_{IJ1},\ldots,Y_{IJn_{IJ}}$
Note: The dataset consists of independent samples of sizes $n_{11}, n_{12}, \ldots, n_{IJ}$. To each treatment $a_i b_j$ corresponds a sample of $n_{ij}$ independent responses, namely $(Y_{ij1}, \ldots, Y_{ijn_{ij}})$.
5.3.2 Cell-mean model and factor-effect model
Consider the following two equivalent forms of the two-way classification model:
(A) Cell-mean model:
$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk}, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J, \; k = 1, \ldots, n_{ij}, \qquad E[\epsilon_{ijk}] = 0, \; \mathrm{Var}(\epsilon_{ijk}) = \sigma^2,$$
where the $\epsilon_{ijk}$'s denote independent random errors and $\mu_{ij}$ is the $a_i b_j$ treatment mean.
(B) Factor-effect model with level-IJ-zero constraint: Reparameterise the treatment means $\mu_{ij}$ as
$$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J,$$
subject to the constraints $\alpha_I = 0$, $\beta_J = 0$, $(\alpha\beta)_{Ij} = 0$ for all $j$, and $(\alpha\beta)_{iJ} = 0$ for all $i$. Here $\alpha_i$ and $\beta_j$ are called the main effects of A and B, respectively, and $(\alpha\beta)_{ij}$ is called the interaction effect between A and B. Then we have
$$\mu = \mu_{IJ}, \qquad \alpha_i = \mu_{iJ} - \mu_{IJ}, \qquad \beta_j = \mu_{Ij} - \mu_{IJ}, \qquad (\alpha\beta)_{ij} = \mu_{ij} - \mu_{iJ} - \mu_{Ij} + \mu_{IJ},$$
for $i = 1, \ldots, I$ and $j = 1, \ldots, J$. So $\mu$ represents the treatment mean for $a_I b_J$, $\alpha_i$ represents the difference between the treatment means for $a_i b_J$ and $a_I b_J$, $\beta_j$ represents the difference between the treatment means for $a_I b_j$ and $a_I b_J$, and $(\alpha\beta)_{ij}$ represents how the impact of level $j$ of B at level $i$ of A differs from its impact at level $I$ of A.
Remarks:
- Under either parameterising scheme, the treatment means are modelled using the same number, i.e. IJ, of free parameters: $(\mu_{11}, \ldots, \mu_{IJ})$ under the first scheme, and $\mu, \alpha_1, \ldots, \alpha_{I-1}, \beta_1, \ldots, \beta_{J-1}, (\alpha\beta)_{11}, \ldots, (\alpha\beta)_{I-1,J-1}$ under the second.
- There may exist empty treatment groups ($n_{ij} = 0$) in the dataset, so that fewer than IJ treatment groups actually contain data. In this case the IJ free parameters overparameterise the model and cannot be uniquely identified. One practical solution is to remove the interaction effects, i.e. assume $(\alpha\beta)_{ij} = 0$ for all $i, j$.
If the factors A and B do not interact, i.e. $(\alpha\beta)_{ij} = 0$ for all $i, j$, then the treatment mean is $\mu_{ij} = \mu + \alpha_i + \beta_j$. At the $i$th level of A, the difference between any two treatment means $\mu_{ij}$ and $\mu_{ij'}$ is $\beta_j - \beta_{j'}$, which does not depend on the level fixed for A. In other words, the effects of factor B apply similarly at each level of A, or equivalently, the effects of factor A apply similarly at each level of B, or equivalently, the effects of A and B are additive. The parameters $(\alpha\beta)_{ij}$ measure the extent of the departure from additivity. The following diagrams illustrate treatment means in a two-way classification model without and with interactions, respectively.
[Figure: treatment means plotted against the levels of B, one profile per level i of A; in the left panel ("no interactions") the profiles are parallel, while in the right panel ("A, B interact") they are not.]
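Plots of this kind can be produced with R's interaction.plot() function. The sketch below is illustrative only (not from the notes); dat is an assumed data frame with factors A and B and response y. Roughly parallel profiles suggest additivity, while crossing or diverging profiles suggest interaction.

with(dat, interaction.plot(x.factor = B, trace.factor = A, response = y,
                           xlab = "levels of B", ylab = "treatment mean"))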
5.3.3 Under the general linear model formulation of a two-way classification model, factors A and B can be identified with dummy variables defined in a similar way as under a one-way model. Interactions between A and B can be identified with cross-products of the A and B dummy variables, as shown below.

Effects                          Representation
Factor A of I levels             dummy variables $A_1, \ldots, A_{I-1}$
Factor B of J levels             dummy variables $B_1, \ldots, B_{J-1}$
Interactions between A and B     $A_1B_1, A_1B_2, \ldots, A_1B_{J-1}, A_2B_1, \ldots, A_{I-1}B_{J-1}$

Expressed in terms of dummy variables, the two-way classification model has the form
$$Y = \mu + \alpha_1 A_1 + \cdots + \alpha_{I-1}A_{I-1} + \beta_1 B_1 + \cdots + \beta_{J-1}B_{J-1} + (\alpha\beta)_{11}A_1B_1 + \cdots + (\alpha\beta)_{1,J-1}A_1B_{J-1} + (\alpha\beta)_{21}A_2B_1 + \cdots + (\alpha\beta)_{I-1,J-1}A_{I-1}B_{J-1} + \epsilon.$$
5.3.4 Similar to the one-way classification model, a two-way classification model can also be reparametrised by two choices of constraints.
(A) Level-$i$-zero constraint (take $i = 1$ as an example):
$$(\alpha\beta)_{1j} = 0 \text{ for } j = 1, 2, \ldots, I_B, \qquad\text{and}\qquad (\alpha\beta)_{i1} = 0 \text{ for } i = 1, 2, \ldots, I_A.$$
(B) Sum-to-zero constraints:
$$\sum_{i=1}^{I_A}(\alpha\beta)_{ij} = 0 \text{ for } j = 1, 2, \ldots, I_B, \qquad\text{and}\qquad \sum_{j=1}^{I_B}(\alpha\beta)_{ij} = 0 \text{ for } i = 1, 2, \ldots, I_A.$$
The general linear model form under these constraints is tedious to specify. Fortunately, we will not have to apply them manually to the data, as modern software will do it for us. 5.3.5 Example 5.3.1 The file PVC.txt (Morris and Watson, 1998) contains data from a study investigating whether operators and devices (called resin railcards) affect the particle size of the produced plastic polyvinyl chloride (PVC). The number of operators and devices is 3 and 6, respectively. rm(list=ls(all=TRUE)) pvc.data - read.table("pvc.txt", header = TRUE) attach(pvc.data) pvc size operator resin 1 36.2 1 1 2 36.3 1 1 3 35.3 1 2 In the dataset, we have Response variable: Factors:
size operator (of 3 levels), resin (of 6 levels).
The model is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \qquad i = 1, 2, 3, \; j = 1, 2, 3, 4, 5, 6,$$
subject to $\alpha_1 = \beta_1 = (\alpha\beta)_{1j} = (\alpha\beta)_{i1} = 0$ for all $i, j$. Define dummy variables $C_2, C_3$ for the factor operator and $T_2, \ldots, T_6$ for the factor resin according to the following.
operator    C2   C3
1           0    0
2           1    0
3           0    1

resin    T2   T3   T4   T5   T6
1        0    0    0    0    0
2        1    0    0    0    0
3        0    1    0    0    0
4        0    0    1    0    0
5        0    0    0    1    0
6        0    0    0    0    1
The two-way model can be written as
$$Y = \mu + \alpha_2 C_2 + \alpha_3 C_3 + \beta_2 T_2 + \beta_3 T_3 + \beta_4 T_4 + \beta_5 T_5 + \beta_6 T_6 + (\alpha\beta)_{22}C_2T_2 + \cdots + (\alpha\beta)_{26}C_2T_6 + (\alpha\beta)_{32}C_3T_2 + \cdots + (\alpha\beta)_{36}C_3T_6 + \epsilon.$$
In the above, $\alpha_i$ is an effect associated with the $i$th operator, $i = 1, 2, 3$, while $\beta_j$ is the effect of the $j$th device, $j = 1, 2, \ldots, 6$, and $(\alpha\beta)_{ij}$ is the interaction between the $i$th operator and the $j$th device. More interpretations of the parameters? There are $3 \times 6 = 18$ possible treatment groups, i.e. combinations of operators and devices, and 2 measurements for each group. Below is a sketch of how the GLM design matrix is obtained in R. Note that the data for operator and resin are numerical, and we have to create dummy variables.
In the design matrix each observation gives rise to a row, and each treatment group has a distinct row pattern. There are therefore 36 rows in total and, because there is no empty treatment group, 18 distinct rows. Columns 2 to 18 of the design matrix correspond respectively to the dummy variables $C_2, C_3, T_2, T_3, T_4, T_5, T_6, C_2T_2, C_2T_3, C_2T_4, C_2T_5, C_2T_6, C_3T_2, C_3T_3, C_3T_4, C_3T_5, C_3T_6$, and the design matrix has rank 18.
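The design matrix itself can be inspected along the following lines; this is a sketch of how it may be obtained (the original printout is not reproduced here), and it anticipates the factor conversion used again in 5.4.4.

operator <- factor(operator)
resin    <- factor(resin)
pvc.lm   <- lm(psize ~ operator * resin)   # level-one-zero constraints by default in R
head(model.matrix(pvc.lm))                 # columns: intercept, C2, C3, T2-T6 and their products
qr(model.matrix(pvc.lm))$rank              # rank 18, as stated above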
5.4 Model fitting
5.4.1 Least squares method. Generally, fitting a classification model to data observed in m treatment groups entails minimisation of
$$\sum_{r=1}^m \sum_s \big(s\text{th response in the } r\text{th treatment group} \; - \; r\text{th treatment mean}\big)^2 .$$
If the m treatment means are modelled by m free parameters, then the least squares method simply estimates each treatment mean by the sample mean response observed for that particular treatment.

5.4.2 One-way classification model:

Parameter                          Least squares estimate
treatment mean, $\mu_i$            $\hat{\mu}_i = \bar{Y}_i$
$k$th treatment mean, $\mu$        $\hat{\mu} = \bar{Y}_k$
treatment effect, $\tau_i$         $\hat{\tau}_i = \bar{Y}_i - \bar{Y}_k$

5.4.3 Two-way classification model:

Parameter                                 Least squares estimate
treatment mean, $\mu_{ij}$                $\hat{\mu}_{ij} = \bar{Y}_{ij}$
$\mu$ ($a_I b_J$ treatment mean)          $\hat{\mu} = \bar{Y}_{IJ}$
main effect of A, $\alpha_i$              $\hat{\alpha}_i = \bar{Y}_{iJ} - \bar{Y}_{IJ}$
main effect of B, $\beta_j$               $\hat{\beta}_j = \bar{Y}_{Ij} - \bar{Y}_{IJ}$
A-B interaction, $(\alpha\beta)_{ij}$     $\widehat{(\alpha\beta)}_{ij} = \bar{Y}_{ij} - \bar{Y}_{iJ} - \bar{Y}_{Ij} + \bar{Y}_{IJ}$
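For a one-way layout these estimates are just group sample means. The following minimal sketch (not from the notes) assumes a response vector y and a factor grp whose last level plays the role of level k.

ybar <- tapply(y, grp, mean)     # treatment mean estimates: mu_i-hat = Ybar_i
mu   <- ybar[length(ybar)]       # mu-hat = Ybar_k (last level as baseline)
tau  <- ybar - mu                # treatment effect estimates: tau_i-hat = Ybar_i - Ybar_k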
5.4.4 Example 5.2.1 (cont'd). Fitting a one-way classification model with the last-level constraint to the data, we get the lse's of the 4 treatment means ($\mu_i$'s) and of the $\tau_i$'s, with $\hat{\mu} = \bar{y}_4 = 61$:

Diet   i   $\hat{\mu}_i$   $\hat{\tau}_i$
A      1   61              0
B      2   66              5
C      3   68              7
D      4   61              0
Below are the parameter estimates for models with the two types of constraints.

Level-one constraint:

options(contrasts = rep("contr.treatment", 2))
g <- lm(coag ~ diet)
summary(g)

Call:
lm(formula = coag ~ diet)
Residuals:
   Min     1Q Median     3Q    Max 
 -5.00  -1.25   0.00   1.25   5.00 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.100e+01  1.183e+00  51.554  < 2e-16 ***
dietB       5.000e+00  1.528e+00   3.273 0.003803 ** 
dietC       7.000e+00  1.528e+00   4.583 0.000181 ***
dietD       2.991e-15  1.449e+00   0.000 1.000000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,  Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05
Sum-to-zero constraint:

options(contrasts = c("contr.sum", "contr.poly"))
gs <- lm(coag ~ diet)
summary(gs)

Call:
lm(formula = coag ~ diet)

Residuals:
   Min     1Q Median     3Q    Max 
 -5.00  -1.25   0.00   1.25   5.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.0000     0.4979 128.537  < 2e-16 ***
diet1        -3.0000     0.9736  -3.081 0.005889 ** 
diet2         2.0000     0.8453   2.366 0.028195 *  
diet3         4.0000     0.8453   4.732 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,  Adjusted R-squared: 0.6212
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05
5.4.5 Example 5.3.1 (cont'd) Recall the model used previously,
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \qquad i = 1, 2, 3, \; j = 1, 2, 3, 4, 5, 6,$$
subject to $\alpha_1 = \beta_1 = (\alpha\beta)_{1j} = (\alpha\beta)_{i1} = 0$ for all $i, j$. The sample means $\bar{Y}_{ij}$ are given below.
            Device
Operator    1        2        3        4        5        6
1           36.25    35.15    30.70    29.70    31.85    30.20
2           35.40    35.35    29.65    30.05    31.40    30.65
3           35.30    33.35    29.20    28.65    29.30    29.75

These are also the $\hat{\mu}_{ij}$ if the cell-mean model is used. Estimates of the individual parameters are then obtained by formulas analogous to those in 5.4.3, here with level 1 of each factor as the baseline.

Intercept: $\hat{\mu} = 36.25$.

Main effects (operator): $\hat{\alpha}_2 = -0.85$, $\hat{\alpha}_3 = -0.95$.

Main effects (device): $\hat{\beta}_2 = -1.10$, $\hat{\beta}_3 = -5.55$, $\hat{\beta}_4 = -6.55$, $\hat{\beta}_5 = -4.40$, $\hat{\beta}_6 = -6.05$.

Interaction effects (operator x device):

$\widehat{(\alpha\beta)}_{ij}$   resin2   resin3   resin4   resin5   resin6
operator2                        1.05     -0.20    1.20     0.40     1.30
operator3                        -0.85    -0.55    -0.10    -1.60    0.50

These can be obtained in R as shown below.
operator <- factor(operator)
resin    <- factor(resin)
pvc.lm   <- lm(psize ~ operator * resin)
summary(pvc.lm)

Call:
lm(formula = psize ~ operator * resin)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1000 -0.3125  0.0000  0.3125  2.1000 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       36.2500     0.7297  49.676  < 2e-16 ***
operator2         -0.8500     1.0320  -0.824 0.420917    
operator3         -0.9500     1.0320  -0.921 0.369457    
resin2            -1.1000     1.0320  -1.066 0.300557    
resin3            -5.5500     1.0320  -5.378 4.13e-05 ***
resin4            -6.5500     1.0320  -6.347 5.58e-06 ***
resin5            -4.4000     1.0320  -4.264 0.000467 ***
...
operator2:resin6   1.3000     1.4595   0.891 0.384818    
operator3:resin6   0.5000     1.4595   0.343 0.735872    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.032 on 18 degrees of freedom
Multiple R-squared: 0.9235,  Adjusted R-squared: 0.8513
F-statistic: 12.78 on 17 and 18 DF,  p-value: 8.848e-07
Interpretation of parameter estimates.
5.5
Test for treatment effects: one-way ANOVA.
5.5.1 Assume further that the random errors are $N(0, \sigma^2)$. We described in 3.3 techniques for testing whether regression coefficients satisfy certain linear constraints. The same techniques apply immediately to classification models, since the latter can be treated as general linear models with the use of dummy variables. We discuss in this and the next section special forms of such tests which are commonly encountered in practical applications involving classification models.
5.5.2 Consider a one-way classification model
$$\Omega: \quad Y_{ij} = \mu_i + \epsilon_{ij}, \qquad i = 1, \ldots, k, \; j = 1, \ldots, n_i, \qquad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2).$$
Let $\omega$ be the reduced model with $\mu_1 = \mu_2 = \cdots = \mu_k = \mu$, for some unspecified $\mu$. Equivalently, we have
$$\Omega: \; Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \quad \tau_k = 0, \qquad\text{and}\qquad \omega: \; Y_{ij} = \mu + \epsilon_{ij} \;\;(\text{i.e. } \tau_1 = \cdots = \tau_k = 0).$$
Here $\Omega$ represents the full model, and $\omega$ is the reduced model under which the factor has no effect on Y.
5.5.3 Useful quantities for testing $\omega$ against $\Omega$ (multiple linear regression terminology, notation and definition, followed by the one-way ANOVA terminology):

Total corrected s.s.:            $S_{yy} = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y})^2$            (total corrected s.s.)
Residual s.s. fitting $\Omega$:  $SSE = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2$              (within-group s.s.)
Regression s.s.:                 $S_{yy} - SSE = \sum_{i=1}^k n_i(\bar{Y}_i - \bar{Y})^2$                 (between-group s.s., SST, or treatment s.s.)
The between-group SST is the key source of variation distinguishing from. Its size measures the variation in the responses, which is explained by the k treatment groups. 5.5.4 Determination of degrees of freedom: Part 1. Sy y has i ni 1 = n 1 d.f. 2. The amount of money to be spent. For SSE, note that (n ) (n ) (n ) 1 2 k X X X SSE = (Y1j Y 1 )2 + (Y2j Y 2 )2 + + (Ykj Y k )2, j=1.
j = 1
j = 1
which is a sum of k independent of "total corrected s.s.". The i th "total corrected s.s." refers to the random sample of the i th treatment and has ni 1 d.f. Thus, SSE has (n1 1) + (n 1) = n k d.f. 3. The following is the standard for a k-meter. SST has (n 1) (n k) = k 1 d.f. 5.5.5 Based on the partition Sy y = SSE + SST, compile a one-way ANOVA table: Source s.s. Between groups SST SSE Within groups Total Sy.
d.f. the s.r.o.d. m.s. k 1 MST = SST /(k 1) n k MSE = SSE/(n k) n1?
F - ratio F = MST / MSE
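The table can be assembled directly from the group means. The following minimal sketch (not from the notes) assumes a response vector y and a factor grp with k levels and n observations in total.

k  <- nlevels(grp); n <- length(y)
ybar <- tapply(y, grp, mean)                 # group sample means
ni   <- tabulate(grp)                        # group sizes n_i
SST  <- sum(ni * (ybar - mean(y))^2)         # between-group s.s., k - 1 d.f.
SSE  <- sum((y - ave(y, grp))^2)             # within-group s.s., n - k d.f.
Fstat <- (SST / (k - 1)) / (SSE / (n - k))   # F = MST / MSE
pf(Fstat, k - 1, n - k, lower.tail = FALSE)  # p-value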
5.5.6 F test (of size $\alpha$) for $H_0: \omega$ vs $H_1: \Omega$: reject $H_0$ iff the p-value $P(F_{k-1,\,n-k} > F)$ is smaller than $\alpha$, or equivalently, iff $F > F_{\alpha;\,k-1,\,n-k}$.
If $H_0$ is rejected, we say that the factor has significant effects on the response variable at the $100\alpha\%$ significance level.
5.5.7 For the case $k = 2$, the above one-way ANOVA F test is equivalent to a two-sided t test for equality of the means of two independent normal populations having a common variance.
5.5.8 Example 5.2.1 (cont'd) To examine how the blood coagulation time varies across the different diets, we take the factor to be the "diet" with levels A, B, C and D. So we have
$$k = 4, \quad n_1 = 4, \quad n_2 = n_3 = 6, \quad n_4 = 8, \quad n = 24.$$
The ANOVA table can be constructed from the above formulas and the sample means obtained in 5.4.4.
Source            s.s.   d.f.   m.s.   F-ratio
Between groups    228    3      76     13.57
Within groups     112    20     5.6
Total             340    23
The p-value is $P(F_{3,20} > 13.57) = 0.0000466$, which is extremely small. We conclude that the factor "diet" has highly significant effects on the blood coagulation time.

g <- lm(coag ~ diet)
anova(g)
Analysis of Variance Table

Response: coag
          Df Sum Sq Mean Sq F value    Pr(>F)    
diet       3    228    76.0  13.571 4.658e-05 ***
Residuals 20    112     5.6                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Unfortunately, the F test alone cannot reveal the actual cause of the significant result. In other words, the test can show that there are differences between the four diets, but it cannot tell in what way they are different.
5.6
Two-way ANOVA
5.6.1 Consider a two-way classification model under the normal error assumption:
$$\Omega: \quad Y_{ijk} = \mu_{ij} + \epsilon_{ijk}, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J, \; k = 1, \ldots, n_{ij}, \qquad \epsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2).$$
The IJ treatment means $\mu_{ij}$ can be reparameterised as $\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$, subject to the constraints $\alpha_I = 0$, $\beta_J = 0$, $(\alpha\beta)_{Ij} = 0$ for all $j$, and $(\alpha\beta)_{iJ} = 0$ for all $i$. Thus we have $I - 1$ unrestricted $\alpha_i$'s (factor A main effects), $J - 1$ unrestricted $\beta_j$'s (factor B main effects), and $(I-1)(J-1)$ unrestricted $(\alpha\beta)_{ij}$'s (interactions).
5.6.2 Here we consider the case where the dataset is balanced, i.e. the number of observations is the same in every cell. All the main and interaction effects can then be tested in a single ANOVA table. For unbalanced data, tests of different effects need to be done in different ANOVA tables, using the method of comparing nested models. The course MATH48082 covers ANOVA in much more detail.
5.6.3 The total sum of squares can be decomposed as follows:
$$\sum_{i,j,k}(y_{ijk} - \bar{y}_{\bullet\bullet\bullet})^2 = \sum_{i,j,k}(\bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet\bullet\bullet})^2 + \sum_{i,j,k}(\bar{y}_{\bullet j\bullet} - \bar{y}_{\bullet\bullet\bullet})^2 + \sum_{i,j,k}(\bar{y}_{ij\bullet} - \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet j\bullet} + \bar{y}_{\bullet\bullet\bullet})^2 + \sum_{i,j,k}(y_{ijk} - \bar{y}_{ij\bullet})^2,$$
because all cross-terms add up to 0. Define the following sums of squares (with n observations per cell):
$$SSA := \sum_{i,j,k}(\bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet\bullet\bullet})^2 = Jn\sum_{i=1}^I(\bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet\bullet\bullet})^2, \qquad SSB := \sum_{i,j,k}(\bar{y}_{\bullet j\bullet} - \bar{y}_{\bullet\bullet\bullet})^2 = In\sum_{j=1}^J(\bar{y}_{\bullet j\bullet} - \bar{y}_{\bullet\bullet\bullet})^2,$$
$$SSAB := \sum_{i,j,k}(\bar{y}_{ij\bullet} - \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet j\bullet} + \bar{y}_{\bullet\bullet\bullet})^2, \qquad SSE := \sum_{i,j,k}(y_{ijk} - \bar{y}_{ij\bullet})^2.$$
5.6.4 ANOVA table (two-way). For balanced data, tests of the main effects and the interaction effect can be done in one ANOVA table.
Source               df            SS         MS      F           Critical value
Main effects of A    I - 1         SSA        MSA     MSA/MSE     $F_{\alpha;\,I-1,\,IJ(n-1)}$
Main effects of B    J - 1         SSB        MSB     MSB/MSE     $F_{\alpha;\,J-1,\,IJ(n-1)}$
Interaction          (I-1)(J-1)    SSAB       MSAB    MSAB/MSE    $F_{\alpha;\,(I-1)(J-1),\,IJ(n-1)}$
Error                IJ(n-1)       SSE        MSE
Total                IJn - 1       SSTotal
The test of each effect is done by comparing the full model in 5.6.1 with the reduced model specified in the table below.

Test                 Reduced model                              Null hypothesis
Main effects of A    $\alpha_i = 0$ for all $i$                 $\alpha_1 = \cdots = \alpha_I = 0$
Main effects of B    $\beta_j = 0$ for all $j$                  $\beta_1 = \cdots = \beta_J = 0$
Interaction          $(\alpha\beta)_{ij} = 0$ for all $i, j$    $(\alpha\beta)_{11} = \cdots = (\alpha\beta)_{IJ} = 0$
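For balanced data the sums of squares in 5.6.3 can be computed directly from the cell and marginal means. The following minimal sketch (not from the notes) assumes a balanced data frame dat with factors A, B and response y; the decomposition check only holds exactly in the balanced case.

with(dat, {
  ybar <- mean(y)
  yA   <- ave(y, A); yB <- ave(y, B); yAB <- ave(y, A, B)   # marginal and cell means
  SSA  <- sum((yA - ybar)^2)
  SSB  <- sum((yB - ybar)^2)
  SSAB <- sum((yAB - yA - yB + ybar)^2)
  SSE  <- sum((y - yAB)^2)
  c(SSA = SSA, SSB = SSB, SSAB = SSAB, SSE = SSE,
    check = SSA + SSB + SSAB + SSE - sum((y - ybar)^2))     # check should be ~ 0
})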
5.6.5 Example 5.3.1 (cont'd)
Response variable: psize (n = 36).
Factors: (A) operator, (B) device (resin).
No. of treatment groups containing data: 18.

operator <- factor(operator)
resin    <- factor(resin)
anova(pvc.lm)
Analysis of Variance Table

Response: psize
               Df  Sum Sq Mean Sq F value    Pr(>F)    
operator        2  13.224   6.612  6.2084  0.008901 ** 
resin           5 212.766  42.553 39.9560 3.954e-09 ***
operator:resin 10   5.433   0.543  0.5101  0.861213    
Residuals      18  19.170   1.065                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p-value = $P(F_{2,18} > 6.2084) = 0.008901$, so the main effects of operator are highly significant.
p-value = $P(F_{5,18} > 39.9560) = 3.954 \times 10^{-9}$, so the main effects of device are highly significant.
p-value = $P(F_{10,18} > 0.5101) = 0.861213$, so the interactions are not significant.
6 Extraneous Variation

6.1 Introduction
6.1.1 Studies of classification models sometimes encounter an extraneous source of variation, which may affect the responses or mask the true treatment effects, but is not of our primary interest. Examples:
(i) Responses to different treatments are observed on experimental units located in different regions. 'Region' is an extraneous variable.
(ii) Responses to different treatments are observed across a certain period of time. 'Time' is an extraneous variable.
(iii) Quality tests are done on products manufactured in different production lines. 'Production line' is an extraneous variable.
(iv) A number of teaching methods (treatments) are compared within a class of mathematics students, who may have very different levels of intelligence. 'IQ score' is an extraneous variable.
6.1.2 Subjects having similar values of the extraneous variable usually exhibit more similar or homogeneous behaviour than subjects having different values of that variable.
6.1.3 If it is not possible to remove extraneous variables, it is at least desirable to control them so as to
- reduce the experimental error variance, i.e. decrease the residual s.s.;
- increase precision in experiments, i.e. increase the power to detect treatment effects.
6.1.4 A common way to control extraneous variation is to incorporate any extraneous variables into the classification model in the form of additive effects.

Type of extraneous variable    Terminology                          Model
Categorical                    blocking factor or blocks            randomized block design
Continuous                     concomitant variable or covariate    analysis of covariance (ANCOVA)
6.1.5 An important assumption underlying both the randomized block design and ANCOVA is that the extraneous variables do not interact with the treatment effects. Thus,
- for the randomized block design, comparisons between treatment effects are unaffected by the block within which such comparisons are made;
- for ANCOVA, comparisons between treatment effects are unaffected by the values of the concomitant variables at which such comparisons are made.
6.2 Randomized block design
6.2.1 In a randomized block design, categorical extraneous factors are used to stratify the sample into homogeneous groups called blocks. The treatments of primary interest are randomly assigned to the units within each block.
6.2.2 Suppose the available experimental units are grouped into b blocks, and within each block the treatments are allocated to units at random. Suppose that within the $j$th block the $i$th treatment is applied to $n_{ij}$ units, for $i = 1, \ldots, t$ and $j = 1, \ldots, b$. Thus the b blocks consist of $n_1, \ldots, n_b$ units, respectively, and the sample size is $n = \sum_{j=1}^b\sum_{i=1}^t n_{ij}$. We model the responses as
$$\Omega: \quad Y_{ijk} = \mu + \tau_i + \beta_j + \epsilon_{ijk}, \qquad i = 1, \ldots, t, \; j = 1, \ldots, b, \; k = 1, \ldots, n_{ij}, \qquad \epsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2),$$
subject to the constraints $\tau_t = \beta_b = 0$, where the treatment effects and the blocking effects are represented respectively by $\tau_1, \ldots, \tau_t$ and $\beta_1, \ldots, \beta_b$. Alternatively, we may formulate the model as a multiple linear regression on dummy variables $A_1, \ldots, A_{t-1}$ and $C_1, \ldots, C_{b-1}$:
$$Y = \mu + \tau_1 A_1 + \cdots + \tau_{t-1}A_{t-1} + \beta_1 C_1 + \cdots + \beta_{b-1}C_{b-1} + \epsilon,$$
which enables us to set up an $n \times (t + b - 1)$ design matrix X.
6.2.3 Technically, the randomized block design is a two-way classification model with no interaction terms. The two factors are the main (treatment) factor and the blocking factor.
6.2.4 Model fitting: the least squares estimators of $\mu, \tau_i, \beta_j$ are
$$[\hat{\mu}, \hat{\tau}_1, \ldots, \hat{\tau}_{t-1}, \hat{\beta}_1, \ldots, \hat{\beta}_{b-1}]^\top = (X^\top X)^{-1}X^\top Y.$$
6.2.5 ANOVA. Main question: are there significant treatment effects after controlling for the blocking effects? We illustrate below a test for the effects of the main factor, taking $\Omega$ to be the full model. Let $\omega$ be the reduced model with $\tau_i = 0$ for all $i$, i.e. a one-way classification model based on the blocking factor alone:
$$\omega: \quad Y_{ijk} = \mu + \beta_j + \epsilon_{ijk}, \qquad \text{subject to the constraint } \beta_b = 0.$$
Then a test for treatment effects amounts to testing
$$H_0: \tau_i = 0 \;\text{for all } i \;\;(\text{reduced model } \omega) \qquad\text{vs}\qquad H_1: \text{no restrictions} \;\;(\text{full model } \Omega).$$
Fitting the models yields
Model       Residual s.s.
$\Omega$    $SSE_\Omega = \sum_{i,j,k}(Y_{ijk} - \hat{\mu} - \hat{\tau}_i - \hat{\beta}_j)^2$
$\omega$    $SSE_\omega = \sum_{i,j,k}(Y_{ijk} - \bar{Y}_j)^2$, where $\bar{Y}_j$ is the mean response in the $j$th block.
Setting the extra s.s. to be $SST = SSE_\omega - SSE_\Omega$, the partition
$$S_{yy} = SSE_\Omega + SST + (S_{yy} - SSE_\omega)$$
leads to the ANOVA table:

Source                s.s.                      d.f.            m.s.                                   F-ratio
Between blocks        $S_{yy} - SSE_\omega$     b - 1
Between treatments    SST                       t - 1           MST = SST/(t - 1)                      $F_T$ = MST/MSE
Error                 $SSE_\Omega$              n - t - b + 1   MSE = $SSE_\Omega$/(n - t - b + 1)
Total                 $S_{yy}$                  n - 1
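Equivalently, the treatment test can be carried out as a comparison of nested linear models. The following minimal sketch (not from the notes) assumes a response y, a treatment factor trt and a blocking factor blk.

full    <- lm(y ~ blk + trt)    # Omega: blocks + treatments
reduced <- lm(y ~ blk)          # omega: blocks only
anova(reduced, full)            # extra s.s. = SST on t - 1 d.f.; F-ratio = MST/MSE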
F test (of size $\alpha$) for treatment effects: reject $H_0$ iff the p-value $P(F_{t-1,\,n-t-b+1} > F_T)$ is smaller than $\alpha$. If $H_0$ is rejected, we say that there is a significant treatment effect at the $100\alpha\%$ level.
6.2.6 Example 6.2.1 An investigation was undertaken to study how education levels affect the monthly income of computer technicians. Four big computer companies participated in the survey. From each company, six technicians were selected, two at each education level, to report their monthly income. The data (in dollars) are given below, where the education levels are 1 (pre-high school), 2 (high school) and 3 (college):
             Education level
Company      1                  2                  3                  Total
1            8,000   7,500      10,000   9,100     17,000   16,500    68,100
2            6,500   9,000      8,800    9,000     10,000   11,000    54,300
3            10,100  11,050     15,000   14,300    24,000   25,500    99,950
4            7,200   7,000      6,500    7,500     9,300    8,900     46,400
Total        66,350             80,200             122,200            268,750
The factor of primary interest is 'education level'. However, there may be systematic discrepancies in pay between different companies, so the four companies are used as 'blocks'. Assuming a randomized block design, we model the monthly income as
$$Y_{ijk} = \mu + \tau_i + \beta_j + \epsilon_{ijk}, \qquad i = 1, 2, 3, \; j = 1, 2, 3, 4, \; k = 1, 2, \qquad \epsilon_{ijk} \overset{iid}{\sim} N(0, \sigma^2),$$
subject to the constraints that $\tau_3 = \beta_4 = 0$, where the 'education' effects and the 'company' (blocking) effects are represented respectively by $\tau_1, \tau_2, \tau_3$ and $\beta_1, \beta_2, \beta_3, \beta_4$. The following diagram displays the observed income data (marked '1', '2', '3', '4' according to company), and compares the estimated mean incomes obtained by fitting the randomized block design model with those obtained by fitting a one-way classification model with no blocking by companies.
[Figure: monthly income plotted against education level, with points labelled by company (1-4), together with the estimated mean incomes under the randomized block design (one profile per company) and under the model with no blocking.]
An F test for 'education' effects can be based on the ANOVA table:

Source                        s.s.          d.f.   m.s.          F-ratio
Between blocks (companies)    278736979.2   3
Between education levels      211460208.3   2      105730104.2   16.95371
Error                         112255208.3   18     6236400.5
Total                         602452395.8   23

The p-value $P(F_{2,18} > 16.95371) = 0.00007250836$ shows that 'education' has highly significant effects on monthly income. If we had not blocked the data by 'companies', then we would have had the one-way ANOVA table:

Source                      s.s.          d.f.   m.s.          F-ratio
Between education levels    211460208.3   2      105730104.2   5.67871241
Error                       390992187.5   21     18618675.6
Total                       602452395.8   23
The p-value for 'education' effects becomes $P(F_{2,21} > 5.67871241) = 0.01068$, which is not significant at the 1% level. Compared with the previous results, blocking has helped to reveal the effects of 'education' much more clearly.
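Both analyses can be reproduced in R by entering the data as transcribed in the table above (the object and level names below are choices made for this sketch, not taken from the notes); the F values quoted in the comments are those reported in the two ANOVA tables.

income <- c( 8000,  7500, 10000,  9100, 17000, 16500,   # company 1 (educ 1,1,2,2,3,3)
             6500,  9000,  8800,  9000, 10000, 11000,   # company 2
            10100, 11050, 15000, 14300, 24000, 25500,   # company 3
             7200,  7000,  6500,  7500,  9300,  8900)   # company 4
educ    <- factor(rep(rep(1:3, each = 2), times = 4))
company <- factor(rep(1:4, each = 6))
anova(lm(income ~ company + educ))   # blocked analysis: F for educ = 16.95 on (2, 18) d.f.
anova(lm(income ~ educ))             # no blocking:      F for educ = 5.68  on (2, 21) d.f.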
6.3 Analysis of covariance (ANCOVA)
6.3.1 ANCOVA combines features of both regression and classification models. In an ANCOVA model, additive effects of one or more additional quantitative extraneous variables are introduced to augment a multiway classification model in order to reduce variation in the responses systematically. Added variables, usually known as concomitant variables or covariates, account for extraneous variations which may not be of our primary interest. Examples of concomitant variables: • Locations of experimental units measured by their altitude; • Time, age, years spent in education, IQ, weight, aptitude etc. • Pre-study attitudes of respondents, pre-study sales etc. 6.3.2 Analysis of covariance (ANCOVA) refers to the analysis of classification models in the presence of one or more concomitant variables. In principle, ANCOVA models are linear models involving both qualitative (categorical) and quantitative explanatory variables. 6.3.3
ONE-WAY ANCOVA. An ANCOVA model with one factor of k levels and p concomitant variables $X_1, \ldots, X_p$ may be expressed as
$$Y_{ij} = \mu + \tau_i + \beta_1 x_{ij,1} + \cdots + \beta_p x_{ij,p} + \epsilon_{ij}, \qquad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, \ldots, k, \; j = 1, \ldots, n_i,$$
subject to $\tau_k = 0$. Here $x_{ij,m}$ is the value of $X_m$ associated with the response $Y_{ij}$. Alternatively, we may formulate the model as a multiple linear regression on dummy variables $D_1, \ldots, D_{k-1}$ and concomitant variables $X_1, \ldots, X_p$:
$$Y = \mu + \tau_1 D_1 + \cdots + \tau_{k-1}D_{k-1} + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon,$$
which enables us to set up an $n \times (k + p)$ design matrix X, where $n = \sum_{i=1}^k n_i$.
1. Model fitting: the least squares estimators of $\mu, \tau_i, \beta_j$ are
$$[\hat{\mu}, \hat{\tau}_1, \ldots, \hat{\tau}_{k-1}, \hat{\beta}_1, \ldots, \hat{\beta}_p]^\top = (X^\top X)^{-1}X^\top Y.$$
2. Test for treatment effects: compare two models:
- Full model $\Omega$: regressing Y on all variables (factor + concomitant variables), giving residual s.s. $SSE_\Omega$;
- Reduced model $\omega$: regressing Y on the concomitant variables $X_1, \ldots, X_p$ only, giving residual s.s. $SSE_\omega$.
Setting $SST = SSE_\omega - SSE_\Omega$ and $SSR = S_{yy} - SSE_\omega$, we have the ANCOVA table:
Source                                                      s.s.           d.f.        m.s.                                F-ratio
Regression (fitting $\omega$, i.e. concomitant variables)   SSR            p
Extra (i.e. between treatments)                             SST            k - 1       MST = SST/(k - 1)                   $F_T$ = MST/MSE
Error                                                       $SSE_\Omega$   n - p - k   MSE = $SSE_\Omega$/(n - p - k)
Total                                                       $S_{yy}$       n - 1
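In practice the ANCOVA F test is most easily obtained as a nested-model comparison. The following minimal sketch (not from the notes) assumes a response y, a factor grp and concomitant variables x1 and x2 (so p = 2).

reduced <- lm(y ~ x1 + x2)         # omega: concomitant variables only
full    <- lm(y ~ grp + x1 + x2)   # Omega: factor + concomitant variables
anova(reduced, full)               # F_T on (k - 1, n - p - k) degrees of freedom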
Refer $F_T$ to the $F_{k-1,\,n-p-k}$ distribution to test for treatment effects.
6.3.4 Example 6.3.1 The dataset PEFR.txt contains the peak expiratory flow rate (pefr), sex, height (ht) and vital capacity (vc) measurements of male (Sex = 2) and female (Sex = 1) patients.

rm(list = ls(all = TRUE))
pefr.data <- read.table("PEFR.txt", header = TRUE)
attach(pefr.data)
pefr.data
    Num Sex    ht  pefr   vc
1     1   1 180.6 522.1 4.74
2     2   1 168.0 440.0 3.63
3     3   1 163.0 428.0 3.40
...
102 102   2 171.2 575.0

plot(ht, pefr, type = "n", xlab = "Height (cm)", ylab = "PEFR (litre/min)")
points(ht[Sex == 1], pefr[Sex == 1], type = "p", pch = 19)
points(ht[Sex == 2], pefr[Sex == 2], type = "p", pch = 1)   # plotting symbol for males assumed; not recoverable from the source
Suppose we are not only interested in how PEFR differs between male and female, but also how it relates to height. From the scatterplot, it is reasonable to assume.
that the errors for the two groups follow the same distribution. So, compared with modelling the male and female groups separately, a single model using both sex and height is more efficient. Let Y be the PEFR of a patient and x the height. One choice is to use the one-way classification model with the concomitant variable x:
$$Y_{ij} = \mu + \tau_i + \beta x_{ij} + \epsilon_{ij}, \qquad i = 1, 2, \; j = 1, \ldots, n_i,$$
subject to $\tau_2 = 0$. Here $x_{ij}$ is the value of height associated with the $j$th female patient if $i = 1$, or the $j$th male patient if $i = 2$. Equivalently, let $D_1$ be the dummy variable defined as
$$D_{1i} = \begin{cases} 1 & \text{if the } i\text{th patient is a female},\\ 0 & \text{if the } i\text{th patient is a male},\end{cases}$$
and formulate the multiple linear regression model
$$Y_i = \mu + \tau_1 D_{1i} + \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $n = n_1 + n_2$. The design matrix and the vector of regression coefficients are
$$X = \begin{bmatrix} 1 & x_1 & D_{11} \\ 1 & x_2 & D_{12} \\ \vdots & \vdots & \vdots \\ 1 & x_n & D_{1n} \end{bmatrix} \qquad\text{and}\qquad \theta = \begin{bmatrix} \mu \\ \beta \\ \tau_1 \end{bmatrix}.$$
The first two rows and the last row of X are
$$\begin{bmatrix} 1 & 180.6 & 1 \\ 1 & 168.0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 171.2 & 0 \end{bmatrix}.$$
The above formulation assumes that the linear relationship (slope) between height and the response does not depend on sex. It is also reasonable to let the slope depend on sex and use the following alternative model:
$$Y_i = \mu + \tau_1 D_{1i} + \beta x_i + \delta x_i D_{1i} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n,$$
which includes the interaction term between the dummy variable and the concomitant variable. The significance of the interaction can be tested by comparing the two nested models. Compare this with modelling the two groups separately as
$$Y_i = \mu + \beta x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma_1^2), \qquad \text{if the } i\text{th patient is a male},$$
and
$$Y_i = \mu_2 + \beta_2 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma_2^2), \qquad \text{if the } i\text{th patient is a female},$$
for $i = 1, \ldots, n$. If the $i$th patient is a male, $D_{1i} = 0$ and the interaction model reduces to the individual model for males; if the $i$th patient is a female, $D_{1i} = 1$ and it reduces to the individual model for females with $\mu_2 = \mu + \tau_1$ and $\beta_2 = \beta + \delta$ (apart from the common error variance). Thus, in this model, the parameter $\tau_1$ represents the separation between the two lines at $x = 0$ and $\delta$ represents the difference in the slopes of the two lines. In R, 'Sex' needs to be defined as a factor.
z <- factor(Sex)   # creates dummy variables
summary(lm(pefr ~ z + ht + z:ht))

Call:
lm(formula = pefr ~ z + ht + z:ht)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   77.603    237.663   0.327   0.7447  
z2          -212.624    313.926  -0.677   0.4998  
ht             2.407      1.433   1.680   0.0961 .
z2:ht          1.559      1.841   0.847   0.3991  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.05 on 98 degrees of freedom
Multiple R-squared: 0.4606,  Adjusted R-squared: 0.4441
F-statistic: 27.9 on 3 and 98 DF,  p-value: 3.978e-13
Below is a wrong analysis that treats 'Sex' as a continuous variable.

summary(lm(pefr ~ Sex + ht + Sex:ht))

Call:
lm(formula = pefr ~ Sex + ht + Sex:ht)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  290.226    517.688   0.561    0.576
Sex         -212.624    313.926  -0.677    0.500
ht             0.848      3.089   0.274    0.784
Sex:ht         1.559      1.841   0.847    0.399

Residual standard error: 55.05 on 98 degrees of freedom
Multiple R-squared: 0.4606,  Adjusted R-squared: 0.4441
F-statistic: 27.9 on 3 and 98 DF,  p-value: 3.978e-13
None of the individual test results is significant; having no significant parameters may indicate possible multicollinearity. It seems reasonable to fit a model with a common slope but different intercepts for males and females.

summary(lm(pefr ~ z + ht))

Call:
lm(formula = pefr ~ z + ht)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -78.9505   149.1834  -0.529 0.597839    
z2           52.9580    15.0812   3.512 0.000673 ***
ht            3.3513     0.8984   3.730 0.000319 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.97 on 99 degrees of freedom
Multiple R-squared: 0.4567,  Adjusted R-squared: 0.4457
F-statistic: 41.61 on 2 and 99 DF,  p-value: 7.671e-14