Related links :
Multiple Regression Program Page
Sample Size for Multiple Regression Program Page
Sample Size for Multiple Regression Explained and Tables Page
Curve Fitting Explained Page
Path Analysis Explained Page
Multiple Regression
Introduction
Multiple regression, as calculated by the program in the Multiple Regression Program Page, is one of the most flexible and powerful statistical tools available to the researcher. It models multiple influences on an outcome while correcting for the overlapping influence of the independent variables. For those who are familiar with the concepts, the algorithm of multiple regression can also be used to calculate a large number of other parametric statistical procedures.
Most professional statistical packages provide large numbers of complex statistical procedures based on multiple regression, under the broad heading of the General Linear Model. StatTools also provides algorithms based on multiple regression.
This page has two sections.
- This section explains the use of multiple regression and its sample size, and provides an example
- A much larger and more complex section explains how to use the multiple regression algorithm to conduct complex Factorial models of Covariance Analysis
Multiple regression consists of two or more independent variables (x1, x2, x3, etc)
and a single dependent variable (y). The formula produced is y = a + b1x1 + b2x2 + b3x3 ... where
- y is the single dependent variable, assumed to be a parametric measurement (continuous and normally distributed)
- x1, x2, x3, etc, are the independent variables. They need not
be parametric, but they need to be ordered (3>2>1). Common independent variable types are :
- Binary of 0/1, (no/yes, false/true, negative/positive, etc)
- Ordinal, such as responses to pain (0=none, 1=some, 2=lots), or the Likert Item (0=SD, 1=D, 2=N, 3=A, 4=SA)
- Poisson distributed counts, such as number of cells in a set volume, number of complaints per month
- Discrete Interval Measurements with unstated distribution, such as height in cms, age in years, Time on waiting list
- Normally distributed measurement
- Log-normally distributed measurements such as ratios.
Data Entry : when using the program in the Multiple Regression Program Page, the data consist of a multi-column table, where
- Each row is data from a subject
- Each column is a measurement of a variable
- The last column is the dependent (y) variable
Terminology
- Partial Correlation Coefficient (PCor) is the correlation between an independent variable (x) and the dependent variable (y),
having corrected for inter-correlations between all the independent variables
- Partial Standardised Regression Coefficient (PSReg) is the regression coefficient between an independent variable (x) and the
dependent variable (y), having corrected for inter-correlations between all the independent variables, rescaled to a mean of 0
and Standard Deviation of 1 for both. This is measurement unit free, and used for comparing the relative scale of influence
from different independent variables
- Partial Regression Coefficient (PReg or b) is the regression coefficient between an independent variable (x) and the
dependent variable (y), having corrected for inter-correlations between all the independent variables. This is the b used in
the regression formula y = a + b1x1+b2x2+b3x3...
- Standard Error of the Partial Regression Coefficient (SE)
- t=b/SE, and α is the Probability of Type I Error (two tail) of t with residual degrees of freedom
- Constant(a) is the a in the formula y = a + b1x1+b2x2+b3x3...
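The relation t = b/SE can be checked against the coefficients table reported in Example 2 further down this page. The sketch below does that in Python. The second relation, PCor = t / sqrt(t*t + residual df), is a standard identity that this page does not state, so it is included as an assumption; the function and variable names are illustrative only.

```python
import math

# (b, SE, reported PCor) copied from the coefficients table in Example 2
coefficients = {
    "age":  (1.701, 9.8641, 0.0418),
    "Ht":   (23.6492, 11.7243, 0.4395),
    "Gest": (223.1943, 17.9205, 0.9493),
    "Sex":  (-209.15, 77.5107, -0.5476),
}
RESIDUAL_DF = 17  # residual degrees of freedom from the analysis of variance table


def t_value(b, se):
    """t for a Partial Regression Coefficient: t = b / SE."""
    return b / se


def partial_correlation(t, df):
    """Assumed identity (not stated on this page): PCor = t / sqrt(t^2 + df)."""
    return t / math.sqrt(t * t + df)


for name, (b, se, reported) in coefficients.items():
    t = t_value(b, se)
    pcor = partial_correlation(t, RESIDUAL_DF)
    print(f"{name:5s} t = {t:8.4f}  PCor = {pcor:7.4f}  (reported {reported})")
```

The printed t and PCor values should agree with the table to within rounding.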
Please note : in the table of analysis of variance, although the model Degrees of Freedom is the sum of the Regression Degrees of Freedom, the model Sum of Squares is usually greater than the sum of the Sums of Squares from all the regression coefficients.
This is because each individual Sum of Squares describes the pure influence on y from one x variable, while the model sums all of them and adds on top those Sums of Squares that overlap between the independent x variables. It is this difference that provides the very powerful analysis of variance in complex models, where multiple measurements often have various degrees of correlation with each other, and their pure influences and overlapping influences need to be separately accounted for.
Multiple Regression as Entered, and with Stepwise deletion
The program in the Multiple Regression Program Page provides two options for conducting multiple regression
- The as Entered model calculates multiple regression once, using all the entered data. This is the preferred model if the intention is to describe the relationship between the variables, or if the calculation is used to obtain parameters for other complex statistical purposes.
- The Stepwise Deletion model carries out repeated multiple regression analysis on the data entered, deleting the weakest independent variable after each cycle. This is the preferred model when developing a predictive algorithm, where the researcher starts with a large number of plausible predictors, and eliminates the weaker ones serially to obtain the most powerful yet most parsimonious (fewest predictors) formula.
The algorithm in the program continues until only 1 independent variable is left, allowing the user to determine the number of independent variables to retain in the final formula. This can be done arbitrarily by judgement, but in most cases the decision is to retain only those independent variables where the Partial Regression Coefficient (b) is statistically significant (α<0.05).
Sample size for Multiple Regression
The sample size program for multiple regression in the Multiple Regression Program Page uses a modified version of that for comparing multiple groups of measurements in the Sample Size for Unpaired Differences Tables Page, but uses the number of independent variables and the Multiple Correlation Coefficient R to represent the number of groups and the residual variance.
The calculations require multiple iterative approximations, so computation time increases exponentially with the number of independent variables, and with decreasing value of R. Users are encouraged to consult the tables in the Sample Size for Multiple Regression Explained and Tables Page for their sample size needs.
Example 1 : Sample Size
We wish to study whether we can predict birthweight from maternal age, height and weight, as well as gestational age and the sex of the baby, giving 5 independent variables or predictors. We want this model to be clinically useful, so we require a moderate effect size of R=0.5.
Using α=0.05, power=0.8, number of independent variables u=5, and anticipated effect size R=0.5, we obtain from the Sample Size for Multiple Regression Explained and Tables Page that the sample size required is 56 pregnancies.
Example 2 : Multiple Regression as Entered
age | Ht | Gest | Sex | BWt |
24 | 170 | 37 | 1 | 3048 |
29 | 161 | 36 | 0 | 2813 |
29 | 167 | 41 | 1 | 3622 |
21 | 165 | 36 | 1 | 2706 |
35 | 168 | 35 | 0 | 2581 |
27 | 161 | 39 | 0 | 3442 |
26 | 163 | 40 | 1 | 3453 |
34 | 167 | 37 | 0 | 3172 |
25 | 165 | 35 | 1 | 2386 |
28 | 170 | 39 | 0 | 3555 |
32 | 167 | 37 | 1 | 3029 |
31 | 169 | 37 | 0 | 3185 |
26 | 161 | 36 | 1 | 2670 |
21 | 165 | 38 | 0 | 3314 |
21 | 166 | 41 | 1 | 3596 |
24 | 164 | 38 | 0 | 3312 |
34 | 169 | 38 | 0 | 3414 |
25 | 161 | 41 | 0 | 3667 |
26 | 167 | 40 | 0 | 3643 |
27 | 162 | 33 | 1 | 1398 |
27 | 160 | 38 | 1 | 3135 |
21 | 167 | 39 | 1 | 3366 |
We use the default example data from the Multiple Regression Program Page for this exercise. The data were computer generated to demonstrate the procedure and are not real.
We wish to explain factors that may influence the birth weight of babies, these being maternal age (years) and height (cms), the gestational age at birth (weeks), and whether the baby is a girl (1) or boy (0). We collected 22 subjects, with the data shown in the table above.
Var | mean | SD |
1.age | 27.0 | 4.3 |
2.Ht | 165.2 | 3.2 |
3.Gest | 37.8 | 2.1 |
4.Sex | 0.5 | 0.5 |
5.BWt | 3114 | 533 |
Please note : The data are in columns separated by spaces or tabs, and the dependent variable (BWt) is in the last column.
Using the program from the Multiple Regression Program Page, and taking the option of calculating the data as entered, we obtained the following results.
The program first produces the means and standard deviations of all the variables, as shown in the table of means above. The last variable (5.BWt) is the dependent variable.
| age | Ht | Gest | Sex | BWt |
age | 1 | 0.26 | -0.25 | -0.38 | -0.10 |
Ht | 0.26 | 1 | 0.08 | -0.13 | 0.24 |
Gest | -0.25 | 0.07 | 1 | -0.11 | 0.92 |
Sex | -0.38 | -0.13 | -0.11 | 1 | -0.32 |
BWt | -0.10 | 0.24 | 0.92 | -0.32 | 1 |
The correlation matrix is produced next, as shown in the table above.
The multiple regression analysis now takes place. Please note abbreviations
for the coefficients table are as follows.
PCor = Partial Correlation Coefficient. This is the correlation between
the variable and the dependent variable after correction for inter-correlation
between the independent variables.
PSReg = Partial Standardised Regression Coefficient. This measures the
influence of each independent variable on the dependent variable, using z or
standardised units. For example, for 1 SD of change in maternal age, 0.01 SD of
change occurs in birthweight. For 1 SD of change in gestation, 0.9 SD of
change occurs in birthweight.
PReg = Partial Regression Coefficient. This measures the change in
the dependent variable for each unit of change in the independent variable.
For example, for an increase of 1 year in age, the baby weighs 1.7g more. For
each week of maturing, the baby weighs 223g more. Girls are 209g lighter.
var | PCor | PSReg | PReg | SE | t | α |
1.age | 0.0418 | 0.0137 | 1.701 | 9.8641 | 0.1724 | 0.8653 |
2.Ht | 0.4395 | 0.1417 | 23.6492 | 11.7243 | 2.0171 | 0.0608 |
3.Gest | 0.9493 | 0.8952 | 223.1943 | 17.9205 | 12.4547 | <0.0001 |
4.Sex | -0.5476 | -0.2009 | -209.15 | 77.5107 | -2.6983 | 0.0158 |
Const = -9165.48, R = 0.961, R2 = 0.9236
SE = standard error of the Partial Regression Coefficient.
t = t test for that Partial Regression Coefficient
α (p) = the probability of Type I Error (α) for that Partial Regression Coefficient.
Const = the constant of the equation. In this case, BWt in g = -9165 + 1.7(age in years) + 23.6(height in cms) + 223.2(gestation in weeks), and -209.2 if the baby is a girl.
R = the Multiple Correlation Coefficient. This is the effect size of the equation.
R2 (R Sq) = the square of R, the proportion of the total variance that is explained
by the regressions.
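To make the use of the regression formula concrete, the following Python sketch applies the constant and Partial Regression Coefficients from the table above to the first row of the example data. The function name and argument names are illustrative and not part of the StatTools program.

```python
# Constant and Partial Regression Coefficients copied from the Example 2 table
CONSTANT = -9165.48
B = {"age": 1.701, "ht": 23.6492, "gest": 223.1943, "sex": -209.15}


def predict_birth_weight(age, ht, gest, sex):
    """Predicted birth weight in grams; sex is 1 for a girl and 0 for a boy."""
    return (CONSTANT
            + B["age"] * age
            + B["ht"] * ht
            + B["gest"] * gest
            + B["sex"] * sex)


# First row of the example data: mother aged 24, 170 cm tall, 37 weeks, girl
predicted = predict_birth_weight(24, 170, 37, 1)
print(f"predicted {predicted:.0f} g, observed 3048 g")
```

The gap between the predicted and the observed value for any one baby is its residual, which the analysis of variance summarises in the Res row.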
This is followed by the analysis of variance
| df | SSq | MSq | F | α |
Var1 | 1 | 797 | 797 | 0.0297 | 0.8652 |
Var2 | 1 | 109006 | 109006 | 4.0688 | 0.0598 |
Var3 | 1 | 4155814 | 4155814 | 155.1197 | <0.0001 |
Var4 | 1 | 195066 | 195066 | 7.281 | 0.0152 |
Model | 4 | 5503642 | 1375910 | 51.3572 | <0.0001 |
Res | 17 | 455447 | 26791 |
Tot | 21 | 5959089 |
The abbreviations for the analysis of variance table are as follows
Var = the source of variation
df = degrees of freedom
SSq = Sum of Squares
MSq = Mean Sum of Squares or variance
F = Fisher's F, the ratio of the MSq of that row to the MSq of Res
α (p) = Probability of Type I error
Model = Contribution from all independent variables collectively
Res = results related to the residual or random error
Tot = total df and SSq.
Var1-Var4 = individual contributions from each variable after corrections for correlation
It should be noted that, although the sum of the degrees of freedom from all the independent variables equals that of the model as a whole (in this example both = 4), this is not so for the Sums of Squares unless the independent variables are all uncorrelated with each other. Otherwise the sum of all the individual Sums of Squares is usually less than that of the model as a whole (in this example 4460683 and 5503642). This is because, for each variable, the Sum of Squares tabulated is that unique to itself, excluding the part it shares by correlation with other independent variables. The missing value, the difference between the model SSq and the sum of those from the individual variables (5503642-4460683=1042959), is that attributable to the overlaps and
correlations between the independent variables.
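The arithmetic behind the analysis of variance table can be retraced in a few lines. The Python sketch below uses only the Sums of Squares and degrees of freedom reported in the table, and recomputes the F ratios, R2, and the overlapping Sum of Squares; variable names are illustrative.

```python
# Values copied from the analysis of variance table in Example 2
unique_ssq = {"age": 797, "Ht": 109006, "Gest": 4155814, "Sex": 195066}
model_ssq, model_df = 5503642, 4
residual_ssq, residual_df = 455447, 17

total_ssq = model_ssq + residual_ssq
residual_msq = residual_ssq / residual_df

# Each variable has 1 degree of freedom, so its MSq equals its SSq
f_values = {name: ssq / residual_msq for name, ssq in unique_ssq.items()}
model_f = (model_ssq / model_df) / residual_msq

r_squared = model_ssq / total_ssq          # proportion of variance explained
sum_unique = sum(unique_ssq.values())      # SSq unique to each variable
overlap = model_ssq - sum_unique           # SSq shared between variables

for name, f in f_values.items():
    print(f"{name:5s} F = {f:.4f}")
print(f"Model F = {model_f:.4f}, R2 = {r_squared:.4f}")
print(f"Sum of unique SSq = {sum_unique}, overlap = {overlap}")
```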
Example 3 : Multiple Regression with Stepwise Deletion
Instead of aiming to understand the relationship between independent and dependent variables, we wish to establish the most efficient formula to predict birthweight. Efficiency is defined as the most accurate prediction with the least number of independent variables. We decided to delete those variables with α(p)>0.05 as inefficient predictors.
var | PCor | PSReg | PReg | SE | t | α |
2.Ht | 0.46 | 0.14 | 24.18 | 11.0 | 2.1985 | 0.042 |
3.Gest | 0.95 | 0.89 | 222.14 | 16.38 | 13.5577 | <0.0001 |
4.Sex | -0.59 | -0.21 | -214.61 | 68.83 | -3.118 | 0.0063 |
constant (a) = -9165.37 |
From the first cycle of calculation in the previous example, we determined that maternal age (PSReg=0.01, t=0.17, α=0.87) can be deleted. In the second cycle, we found the results shown in the table above.
All 3 remaining predictors now have statistically significant Partial Regression Coefficients (α<0.05), so no further deletion is necessary, and the final prediction formula is
Birth weight (g) = -9165 + 24 (maternal height in cms) + 222 (gestation in weeks) for boys, and
215g less for girls
Please note : the program in the Multiple Regression Program Page progressively deletes the least significant variable at each cycle of calculations until only one variable is left in the equation. The user, however, should examine the results at the end of each cycle, and decide when the stepwise deletion should stop. In this example, stepwise deletion is stopped after the first deletion, and only maternal age has been deleted, because the decision
to delete was based on α>0.05.
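The stopping rule used in this example can be written down as a small decision function. The Python sketch below is not the program's own code: it only encodes the rule "delete the predictor with the largest α, provided that α exceeds 0.05", and applies it to the α values reported in the two coefficient tables. The function name is hypothetical, and the entries shown as <0.0001 are entered as 0.0001.

```python
def weakest_predictor(alphas, threshold=0.05):
    """Return the predictor with the largest α if it exceeds the threshold, else None."""
    name = max(alphas, key=alphas.get)
    return name if alphas[name] > threshold else None


# α values copied from the coefficient tables of the first and second cycles
cycle1 = {"age": 0.8653, "Ht": 0.0608, "Gest": 0.0001, "Sex": 0.0158}
cycle2 = {"Ht": 0.042, "Gest": 0.0001, "Sex": 0.0063}

print(weakest_predictor(cycle1))  # age is deleted after the first cycle
print(weakest_predictor(cycle2))  # None, so deletion stops here
```

Note that in the first cycle maternal height (α=0.0608) is also above 0.05, but only the weakest predictor is removed per cycle, and height becomes significant once age is gone.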
Concepts and Background
Introduction and Theoretical Considerations
This section explains the relationship between multiple regression and the general model of analysis of variance and covariance.
This is done for the following reasons.
- To demonstrate the underlying principles of the least squares statistical approach to the analysis of variance
- To provide an understanding of One Way Analysis of Variance, the Factorial model of Analysis of Variance, and the Analysis
of Covariance
- To provide a guideline on how to conduct complex Analysis of Covariance, step by step, using the algorithm of
multiple regression. Although this may still be of interest to some, it is mostly superseded by the commercially
available statistical packages, which will perform the procedures with check boxes for options, and a click of the button.
For those who do not have a clear understanding of Analysis of Covariance, the following minimal and very basic
terms and descriptions may be useful.
- Variance is the square of the Standard Deviation, and it measures variations in a measurement
- The Analysis of Variance partitions the variance of the dependent variable according to those factors that influence it.
- In the simplest model, with two groups, the analysis of variance reduces to the t test. For example, how is the variance in
birth weight influenced by the sex (male or female) of the baby, a single comparison of the two sexes
- When there are more than two groups, the general model of One Way Analysis of Variance is used. For example, how do
three different ethnic origins (say Greeks, Germans, and Slavs) influence the birth weight of the baby. With three
groups there are 3 comparisons, Greeks vs Germans, Greeks vs Slavs, and Germans vs Slavs.
- When two sets of influences (Factors) are involved (say sex and ethnicity), a Two Way Analysis of Variance is used.
With more, a Multiway Analysis of Variance. However, there may be systematic or accidental correlations between factors
(say Greeks have more girls than Germans), so that their effects overlap, and the effect of one factor may differ between the levels of another, which is called an Interaction between Factors. The Analysis of Variance
which separates those variances unique to each factor from those that overlap between factors is known as the
Factorial Model of Analysis of Variance.
- If, on top of all of this, as is usually the case, there are other influences to be taken into consideration, such as
differences in birth weight that must be corrected for gestational age, then these corrections are
termed covariates, and the combined analysis becomes Covariance Analysis.
- Things now start to become a bit complicated, because each covariate may act differently in different groups of a factor, say
German babies grow faster than Slav babies near term. This is called an Interaction between a factor and a covariate.
- The total number of interactions is therefore a multiple of the numbers of covariates and factors. As these increase, the model becomes
complex and confusing.
- To be correct, the results of a covariance analysis are only valid if all possible interactions are tested and found to be
trivial (not statistically significant). In the literature, however, most authors do not bother, and assume that
interactions are either irrelevant or do not exist.
This panel describes to the reader the organisation of the explanations, and the example data used, in the rest of this section.
The rest of the sections are divided as follows
- One Way Analysis of Variance and covariance, with the following examples
- Analysis using two groups (sex of the baby) and a covariate (gestation)
- Analysis using three groups (ethnicity of the baby) and a covariate (gestation)
- Factorial Analysis of Variance and Covariance, with two factors (sex and ethnicity) and a covariate (gestation).
Sex | Ethnicity | Gest | BWt |
Girl | Greek | 37 | 3048 |
Boy | German | 36 | 2813 |
Girl | French | 41 | 3622 |
Girl | Greek | 36 | 2706 |
Boy | German | 35 | 2581 |
Boy | French | 39 | 3442 |
Girl | Greek | 40 | 3453 |
Boy | German | 37 | 3172 |
Girl | French | 35 | 2386 |
Boy | Greek | 39 | 3555 |
Girl | German | 37 | 3029 |
Boy | French | 37 | 3185 |
Girl | Greek | 36 | 2670 |
Boy | German | 38 | 3314 |
Girl | French | 41 | 3596 |
Boy | Greek | 38 | 3312 |
Girl | German | 39 | 3200 |
Boy | French | 41 | 3667 |
Boy | Greek | 40 | 3643 |
Girl | German | 38 | 3212 |
Girl | French | 38 | 3135 |
Girl | Greek | 39 | 3366 |
The algorithm used to obtain the results will be multiple regression (as entered model), as calculated in the Multiple Regression Program Page. Out of all the results produced, the useful parameters used for
Analysis of Variance and Covariance are
- The constant (a) and regression coefficients (b) of the regression formula
- The degrees of freedom (df) and Sums of Squares (SSq) from the Analysis of Variance table
The dataset used for this exercise, tabulated above and plotted in the accompanying figure, is artificially generated by the computer to demonstrate the procedures, and does not represent reality.
Users should also understand that real analysis requires a much larger number of cases than that presented here.
There are 4 German boys (red) and 3 German girls (maroon), 3 Greek boys (light green) and 5 Greek girls (dark green),
3 French boys (blue) and 4 French girls (navy). All sexes and ethnicities in subsequent plots will be identified by these colors.
Two Groups
Sex | Gest | BWt |
Boy | 36 | 2813 |
Boy | 35 | 2581 |
Boy | 37 | 3172 |
Boy | 38 | 3314 |
Boy | 39 | 3555 |
Boy | 38 | 3312 |
Boy | 40 | 3643 |
Boy | 39 | 3442 |
Boy | 37 | 3185 |
Boy | 41 | 3667 |
Girl | 37 | 3029 |
Girl | 39 | 3200 |
Girl | 38 | 3212 |
Girl | 37 | 3048 |
Girl | 36 | 2706 |
Girl | 40 | 3453 |
Girl | 36 | 2670 |
Girl | 39 | 3366 |
Girl | 41 | 3622 |
Girl | 35 | 2386 |
Girl | 41 | 3596 |
Girl | 38 | 3135 |
We will use the data set and analyse the difference in birth weight between boys and girls, and for the moment forget the
ethnicity. The re-arranged data table is as shown to the right, and the plot as shown to the left.
One way Analysis of Variance
If we ignore the gestational age, then we can use the program in the Unpaired Difference Programs Page. The results would be
- For boys, n=10, mean=3268g, Standard Deviation=351g
- For girls n=12 mean=3119g, Standard Deviation=380g
- The difference = 149g, t=0.95 df=20 p=0.35
However, if we were to use the regression model in the Multiple Regression Program Page, using x=0 for boys and x=1
for girls, and y=birth weight, we will obtain the formula birth weight (y) = 3268 - 149(girls). This means that the birth weight
is 3268g when x=0 (boys), and is reduced by 149g when x=1 (girls). The t for the regression coefficient, -0.95, is also the same in size
as that obtained using the algorithm to compare the two groups.
In other words, the regression algorithm produces the same results as that of analysis of variance for two groups.
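The equivalence can be demonstrated numerically. The following Python sketch, using only the standard library, computes the unpaired t test with pooled variance and then the simple regression of birth weight on the 0/1 sex variable, using the birth weights tabulated above. It is an illustration under the usual least squares formulas, not the code of either StatTools program.

```python
import math

# Birth weights (g) from the two-group table above
boys = [2813, 2581, 3172, 3314, 3555, 3312, 3643, 3442, 3185, 3667]
girls = [3029, 3200, 3212, 3048, 2706, 3453, 2670, 3366, 3622, 2386, 3596, 3135]


def mean(values):
    return sum(values) / len(values)


def sum_sq(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Unpaired t test using the pooled within-group variance
df = len(boys) + len(girls) - 2
pooled_var = (sum_sq(boys) + sum_sq(girls)) / df
se_diff = math.sqrt(pooled_var * (1 / len(boys) + 1 / len(girls)))
t_test = (mean(girls) - mean(boys)) / se_diff

# Simple regression y = a + b*x with x = 0 for boys and x = 1 for girls
x = [0] * len(boys) + [1] * len(girls)
y = boys + girls
mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = sxy / sxx
a = my - b * mx
residual_ssq = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
t_reg = b / math.sqrt(residual_ssq / df / sxx)

print(f"t test: difference = {mean(girls) - mean(boys):.1f}, t = {t_test:.2f}")
print(f"regression: a = {a:.1f}, b = {b:.1f}, t = {t_reg:.2f}")
```

The constant a is the mean for boys, the coefficient b is the difference between the two means, and the two t values are identical.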
One way Analysis of Variance with a covariate
The One Way Analysis of Variance showed that there was no significant difference between the birth weight of boys and girls. This is because a much greater influence, the gestational age, obscured the difference, as can be seen in the diagram.
One method of correcting for the influence of gestational age is to draw two regression lines and compare them, using the program in the Compare Two Regression Lines (Covariance Analysis) Program Page. Submitting the data to that program will produce the following results.
- For boys, Birth weight (y in gram) = -3772 + 185(gestation in weeks)
- For girls, Birth weight (y in gram) = -3999 + 187(gestation in weeks)
- Difference in slope (girls - boys) = 187-185 = 2g per week, t = 0.07, df = 18, p = 0.95
- Assumed common slope = 186g / week
- Difference between sexes (girls - boys) adjusted for gestational age = -165g, t = 4.03, df = 19, p <0.001
- In other words, the growth rates of boys and girls are not significantly different, at 186g/week. Having
corrected for growth rates, girls are 165g lighter than boys, which is statistically significant.
Sex | Gestation | Ia | BWt |
0 | 36 | 0 | 2813 |
0 | 35 | 0 | 2581 |
0 | 37 | 0 | 3172 |
0 | 38 | 0 | 3314 |
0 | 39 | 0 | 3555 |
0 | 38 | 0 | 3312 |
0 | 40 | 0 | 3643 |
0 | 39 | 0 | 3442 |
0 | 37 | 0 | 3185 |
0 | 41 | 0 | 3667 |
1 | 37 | 37 | 3029 |
1 | 39 | 39 | 3200 |
1 | 38 | 38 | 3212 |
1 | 37 | 37 | 3048 |
1 | 36 | 36 | 2706 |
1 | 40 | 40 | 3453 |
1 | 36 | 36 | 2670 |
1 | 39 | 39 | 3366 |
1 | 41 | 41 | 3622 |
1 | 35 | 35 | 2386 |
1 | 41 | 41 | 3596 |
1 | 38 | 38 | 3135 |
We will now use the multiple regression model, and introduce the concept of interaction. Before we combine the influences
of gestational age and sex on birth weight, we must first assure ourselves that the influence of gestation is not different in
the two sexes, that is, that boys do not grow faster or slower than girls near term.
We therefore create a new variable, the interaction (Ia), where Ia = Sex * Gestation, so that the data to be used are as shown in the table above. We then analyse this set of data using multiple regression and obtain the following results (rounded to the nearest whole number).
- Birth weight (g) = -3772 + (-227(girls)) + (185(Gestation in weeks)) + (2(Interaction))
- The interaction = 2, t = 0.07, not statistically significant, is the same as the difference between the two slopes in the previous calculation
Had there been a significant interaction, we would not be able to proceed, as the adjustment for gestation would need to be different in the two sexes. As there is no significant interaction, the multiple regression analysis can now be repeated without the interaction term, and the result is Birth Weight (g) = -3808 + (-165(girls)) + (186(Gestation in weeks)). In other words, having corrected for the
influence of gestation, girls are 165g lighter than boys.
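The link between the two approaches can be checked with the data tabulated above. The Python sketch below relies on two properties of least squares that are assumed here rather than stated on this page: a model containing Sex, Gestation and Ia = Sex * Gestation is equivalent to fitting a separate line to each sex, and in the model without the interaction the gestation coefficient is the pooled within-group slope. All names are illustrative.

```python
# (gestation in weeks, birth weight in g) from the two-group table above
boys = [(36, 2813), (35, 2581), (37, 3172), (38, 3314), (39, 3555),
        (38, 3312), (40, 3643), (39, 3442), (37, 3185), (41, 3667)]
girls = [(37, 3029), (39, 3200), (38, 3212), (37, 3048), (36, 2706), (40, 3453),
         (36, 2670), (39, 3366), (41, 3622), (35, 2386), (41, 3596), (38, 3135)]


def line_stats(points):
    """Return mean x, mean y, sum of squares of x, and sum of products."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return mx, my, sxx, sxy


bx, by, bsxx, bsxy = line_stats(boys)
gx, gy, gsxx, gsxy = line_stats(girls)

# Model with interaction: one regression line per sex
slope_boys, slope_girls = bsxy / bsxx, gsxy / gsxx
intercept_boys = by - slope_boys * bx
intercept_girls = gy - slope_girls * gx
interaction = slope_girls - slope_boys               # coefficient of Ia
sex_coefficient = intercept_girls - intercept_boys   # coefficient of Sex

# Model without interaction: common (pooled within-group) slope
common_slope = (bsxy + gsxy) / (bsxx + gsxx)
adjusted_difference = (gy - by) - common_slope * (gx - bx)
constant = by - common_slope * bx

print(f"with Ia: {intercept_boys:.0f} + {sex_coefficient:.0f}(girls) "
      f"+ {slope_boys:.0f}(gest) + {interaction:.0f}(Ia)")
print(f"without Ia: {constant:.0f} + {adjusted_difference:.0f}(girls) "
      f"+ {common_slope:.0f}(gest)")
```

Rounded to whole numbers, the two printed formulas match those quoted in the text above.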
The whole point of this exercise, to analyse the same data using comparison of two regression lines and using multiple regression,
is to demonstrate the principle underlying covariance analysis, and to demonstrate what an interaction in a multivariate set
of calculation is all about. To summarise
- Multiple regression can be used to analyse multivariate statistical data
- In the multi-variate situation, there is a need to check for interaction, that the influence of one variable on the outcome
is not affected by another influence.
Ethnicity | Gest | BWt |
German | 36 | 2813 |
German | 35 | 2581 |
German | 37 | 3172 |
German | 38 | 3314 |
German | 37 | 3029 |
German | 39 | 3200 |
German | 38 | 3212 |
Greek | 39 | 3555 |
Greek | 38 | 3312 |
Greek | 40 | 3643 |
Greek | 37 | 3048 |
Greek | 36 | 2706 |
Greek | 40 | 3453 |
Greek | 36 | 2670 |
Greek | 39 | 3366 |
French | 39 | 3442 |
French | 37 | 3185 |
French | 41 | 3667 |
French | 41 | 3622 |
French | 35 | 2386 |
French | 41 | 3596 |
French | 38 | 3135 |
We will use the data set and analyse the difference in birth weight between ethnic origins, and for the moment forget sex of the
baby. The re-arranged data table is as shown to the right, and the plot as shown to the left.
One way Analysis of Variance
If we ignore the gestational age, then we can use the program in the Unpaired Difference Programs Page. The results would be
- For Germans, n=7, mean=3046g, Standard Deviation=261g
- For Greeks, n=8, mean=3219g, Standard Deviation=373g
- For French, n=7, mean=3290g, Standard Deviation=451g
- In the analysis of variance, F=0.81, α=0.46, so the groups are not significantly different from each other.
Multiple Regression : Introducing the dummy variable
Multiple regression requires the independent variables to be at least ordered (3>2>1). When there are multiple
groups which are not ordered, there is a need to create dummy variables that are ordered to represent them, using the following procedures.
- The number of dummy variables = 1 less than the number of groups. For the current data of 3 ethnic groups, we will create
2 dummy variables EthnicDummy1 (ED1) and EthnicDummy2 (ED2)
- Each group except the last is assigned its own dummy variable, which is set to 1 for that group while the remaining dummy variables are set to 0. The last group
is set to 0 in all dummy variables. It does not matter which group is assigned to which dummy variable, provided they are identified
when the results are interpreted.
- For Germans, ED1=1, ED2=0 (German and not Greek)
- For Greeks, ED1=0, ED2=1 (Greek and not German)
- For French, ED1=0, ED2=0 (Not German and not Greek)
ED1 (German) | ED2 (Greek) | Birth Weight |
1 | 0 | 2813 |
1 | 0 | 2581 |
1 | 0 | 3172 |
1 | 0 | 3314 |
1 | 0 | 3029 |
1 | 0 | 3200 |
1 | 0 | 3212 |
0 | 1 | 3555 |
0 | 1 | 3312 |
0 | 1 | 3643 |
0 | 1 | 3048 |
0 | 1 | 2706 |
0 | 1 | 3453 |
0 | 1 | 2670 |
0 | 1 | 3366 |
0 | 0 | 3442 |
0 | 0 | 3185 |
0 | 0 | 3667 |
0 | 0 | 3622 |
0 | 0 | 2386 |
0 | 0 | 3596 |
0 | 0 | 3135 |
Multiple Regression now produces the formula Birth Weight (y) = 3290 + (-245ED1) + (-71ED2). This means :
- For German babies, where ED1=1 and ED2=0, the birth weight is 3290 - 245 = 3045g
- For Greek babies, where ED1=0 and ED2=1, the birth weight is 3290 - 71 = 3219g
- For French babies, where ED1=0 and ED2=0, the birth weight is 3290g
- F for the model is 0.81, which is not statistically significant.
- Except for the rounding error of 1g for German babies, these are the same results as that from One Way Analysis of Variance
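The dummy variable procedure can be reproduced with a small least squares routine. The Python sketch below builds ED1 and ED2 from the group labels, then solves the normal equations by Gauss-Jordan elimination. It is a minimal illustration using the birth weights tabulated above; the solver and helper names are hypothetical and are not the algorithm of the StatTools program.

```python
def solve(A, b):
    """Solve A·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(n):
            if r != i:
                factor = M[r][i] / M[i][i]
                M[r] = [v - factor * p for v, p in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]


def least_squares(rows, y):
    """Return [a, b1, b2, ...] for y = a + b1*x1 + b2*x2 + ..."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


weights = {
    "German": [2813, 2581, 3172, 3314, 3029, 3200, 3212],
    "Greek": [3555, 3312, 3643, 3048, 2706, 3453, 2670, 3366],
    "French": [3442, 3185, 3667, 3622, 2386, 3596, 3135],
}


def dummies(group):
    """ED1 marks Germans, ED2 marks Greeks; French is 0 in both."""
    return [1 if group == "German" else 0, 1 if group == "Greek" else 0]


rows, y = [], []
for group, values in weights.items():
    for w in values:
        rows.append(dummies(group))
        y.append(w)

a, b1, b2 = least_squares(rows, y)
print(f"Birth Weight = {a:.0f} + ({b1:.0f} ED1) + ({b2:.0f} ED2)")
```

The constant is the mean of the group coded 0 in both dummy variables (French), and each coefficient is the difference between a group mean and that reference mean.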
Analysis of Covariance for multiple groups.
ED1 (German) | ED2 (Greek) | Gestation | ED1S | ED2S | Birth Weight |
1 | 0 | 36 | 36 | 0 | 2813 |
1 | 0 | 35 | 35 | 0 | 2581 |
1 | 0 | 37 | 37 | 0 | 3172 |
1 | 0 | 38 | 38 | 0 | 3314 |
1 | 0 | 37 | 37 | 0 | 3029 |
1 | 0 | 39 | 39 | 0 | 3200 |
1 | 0 | 38 | 38 | 0 | 3212 |
0 | 1 | 39 | 0 | 39 | 3555 |
0 | 1 | 38 | 0 | 38 | 3312 |
0 | 1 | 40 | 0 | 40 | 3643 |
0 | 1 | 37 | 0 | 37 | 3048 |
0 | 1 | 36 | 0 | 36 | 2706 |
0 | 1 | 40 | 0 | 40 | 3453 |
0 | 1 | 36 | 0 | 36 | 2670 |
0 | 1 | 39 | 0 | 39 | 3366 |
0 | 0 | 39 | 0 | 0 | 3442 |
0 | 0 | 37 | 0 | 0 | 3185 |
0 | 0 | 41 | 0 | 0 | 3667 |
0 | 0 | 41 | 0 | 0 | 3622 |
0 | 0 | 35 | 0 | 0 | 2386 |
0 | 0 | 41 | 0 | 0 | 3596 |
0 | 0 | 38 | 0 | 0 | 3135 |
The differences between ethnic groups have been found to be not statistically significant, but this may be caused by the
much greater influence of gestational age on birth weight, as can be seen in the plot above. The inclusion of gestational age
as a covariate is therefore necessary.
As the three ethnic groups have been converted into two dummy variables ED1 and ED2, the interaction between gestation and both
ED variables will now need to be constructed. These are ED1S=ED1*Gest, and ED2S=ED2*Gest. The data are now as shown in the table above,
and the analysis proceeds in the following steps.
Step 1 : All 5 independent variables, ED1, ED2, Gestation, ED1S and ED2S, plus the dependent variable BWt, are subjected to multiple regression analysis. Although the full data output
is produced by the program, we are only interested in the model degrees of freedom (5) and Sum of Squares (2544655).
Step 2 : The exercise is repeated, excluding the two interaction terms ED1S and ED2S. The 3 independent variables,
ED1, ED2 and Gestation, plus the dependent variable BWt, are subjected to multiple regression analysis. Again, we are interested in
the degrees of freedom (3) and Sum of Squares (2527306).
Step 3 : Analysis of Interaction. Using the combined information from the two steps, we can now reconstruct the
Analysis of Variance Table obtained initially in Step 1, as shown in the table below. The Probability of Type I Error
for F = 0.49, with 2 and 16 degrees of freedom, is α=0.63, and we can conclude at this point that
no significant interaction exists between gestation and ethnic origin of the babies. In other words, the growth rates
of babies near term are not different in the three ethnic groups.
| df | SSq | MSq | F |
Inclusive of Interaction | 5 | 2544655 |
Exclusive of Interaction | 3 | 2527306 |
Attributable to Interaction | 2 | 17349 | 17349/2=8675 | 8675/17535=0.49 |
Residual | 16 | 280560 | 280560/16=17535 |
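The table above is a subtraction of one model from another, and the arithmetic can be sketched as a short Python function. The inputs are the degrees of freedom and Sums of Squares reported in Steps 1 and 2; the function name and argument names are hypothetical.

```python
def interaction_f(ssq_full, df_full, ssq_reduced, df_reduced, ssq_residual, df_residual):
    """F test for the terms present in the full model but not in the reduced one."""
    ssq_interaction = ssq_full - ssq_reduced
    df_interaction = df_full - df_reduced
    msq_interaction = ssq_interaction / df_interaction
    msq_residual = ssq_residual / df_residual
    return ssq_interaction, df_interaction, msq_interaction / msq_residual


# Step 1: model with interaction; Step 2: model without; residual from Step 1
ssq, df, f = interaction_f(2544655, 5, 2527306, 3, 280560, 16)
print(f"Interaction: SSq = {ssq}, df = {df}, F = {f:.2f}")
```

The probability for this F, with 2 and 16 degrees of freedom, still has to be obtained from an F distribution table or a statistical program.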
Step 4 : Covariance Analysis . The Regression Formula obtained in Step 2, excluding interactions, can now be examined.
- The formula is Birth weight (y in g) = -4166 + 84ED1 +69ED2 + 192Gestation (in weeks)
- Birth weight increases by 192g per week near term (t=11.8, α<0.001, statistically significant)
- A French baby, at term (40 weeks), averaged 40*192-4166 = 3514g
- German babies (ED1) are 84g more than French babies (t=1.13, α=0.27, not statistically significant)
- Greek babies are 69g more than French babies (t=1.02, α=0.32, not statistically significant)
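As a small worked illustration of how the Step 2 formula is read, the Python sketch below codes the ethnic group into ED1 and ED2 and evaluates the formula with the rounded coefficients quoted above. The function name is hypothetical.

```python
def expected_birth_weight(gestation_weeks, ethnicity):
    """Birth weight (g) from the Step 2 formula; French is the reference group."""
    ed1 = 1 if ethnicity == "German" else 0
    ed2 = 1 if ethnicity == "Greek" else 0
    return -4166 + 84 * ed1 + 69 * ed2 + 192 * gestation_weeks


for group in ("French", "German", "Greek"):
    print(group, expected_birth_weight(40, group))
```

Because there is no interaction term, the differences between groups are the same at every gestational age.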
Comments : These simple steps demonstrate the mathematical sequence used to handle complex data using the multiple regression algorithm.
- The creation of binary dummy variables to replace variables with multiple groups
- The creation of interaction variables between different factors, where Interaction value = Factor 1 value multiplied by Factor 2 value
- The double analysis of variance, with and without the interaction variables, to isolate the interaction effect. This is necessary,
because some correlation (and therefore overlapping effect) exists between different factors, and this double procedure allows
the overlap to remain with the main effect, so that the uncorrelated interactions can be isolated.
- Only when there is no significant interaction, can the covariance analysis be interpreted.
Two very important concepts involved when handling multivariate data are also demonstrated in this model.
- Interaction, where the influence of one factor on the dependent variable is altered by another factor. Interactions can be
helpful or unhelpful, but they need to be defined, isolated, and interpreted. An example is that an interaction between sex and
gestation means boys and girls have different growth rates
- Confounding, caused by correlations between factors, so that it is difficult or even impossible to identify how much each
factor affects the outcome. Confounding is always bad as it results in misleading interpretations, and the greatest virtue of
multiple regression analysis is its ability to separate the unique and overlapping parts of effects from multiple factors.
An example of correlation and confounding would be if girls are born earlier than boys, so that it is unclear
whether it is the sex or the gestation that affects birth weight.
Factorial Analysis of Variance
Factorial Analysis of Covariance
The Factorial model of Analysis of Variance was initially used in agriculture and animal laboratories, where subjects
(plants or animals) are randomly allocated to groups, which are given a combination of two or more treatments. Such a model has
many advantages
- The same subject is used in a number of experiments simultaneously, thus greatly reducing the cost of research
- In many cases, the combination of two treatments may have a greater (synergism) or lesser (antagonism) effect than the sum
of their individual effects. These are called interactions, and they provide additional useful information.
- Mathematically, the Analysis of Variance calculates the effect of each treatment (single factors), then of groups of
combined treatments (combined factors). The difference between the combined effect and the sum of the single effects
then represents the interaction, which can be numerically presented and statistically tested.
- The two important underlying assumptions in this model are, firstly, that the treatment must be randomly and
independently allocated, so there is no correlation between treatments, and secondly, that all groups and subgroups
at different levels have the same sample size.
The Factorial model is a powerful and efficient model of investigation, so it has gradually been adopted in all aspects of
psychosocial research and in the clinical area, moving from the controlled experiment to the epidemiological model. In these settings,
the important assumptions of Factorial models often cannot be met, as the independent variables are frequently not randomly allocated treatments
but characteristics in the natural environment, and the sample sizes available in subgroups are seldom the same.
- The sample size in the groups can only be controlled to an extent. For example, the number of boys and girls born are
never exactly the same, and to artificially create equal numbers will require removing some cases arbitrarily, and this
process itself will introduce a bias.
- The difference in birth weight between boys and girls amongst Germans may be different to that amongst Greeks (interaction).
Although interaction can be useful information, in clinical investigations it often represents an unwanted distraction that makes
interpretation of data difficult.
- We cannot allocate sex at random to different groups, so correlation between factors is possible. For example, the sex ratio may
differ between ethnic groups, so that the influences of ethnicity and sex cannot be separated (confounding).
When the assumptions of the Factorial model are violated, the results produced become misleading, and sometimes the numbers do not add up. When there is extensive correlation between independent variables, the overlapping influences are counted repeatedly and thus inflated in the single effects, so that the combined effect is less than the sum of the single effects, resulting in a conceptually unacceptable negative interaction.
The mathematics of multiple regression is able to resolve this difficulty, because it separates those influences (in terms of Sums of Squares) that are unique to each independent variable from those that overlap between correlated variables. In short, it treats every factor both as an independent variable and as a covariate. In most modern statistical packages, therefore, the multiple regression algorithm is used for calculation even though the user interface retains the Analysis of Variance format.
Sex | Ethnicity | BWt |
Boy | German | 2813 |
Boy | German | 2581 |
Boy | German | 3172 |
Boy | German | 3314 |
Boy | Greek | 3555 |
Boy | Greek | 3312 |
Boy | Greek | 3643 |
Boy | French | 3442 |
Boy | French | 3185 |
Boy | French | 3667 |
Girl | German | 3029 |
Girl | German | 3200 |
Girl | German | 3212 |
Girl | Greek | 3048 |
Girl | Greek | 2706 |
Girl | Greek | 3453 |
Girl | Greek | 2670 |
Girl | Greek | 3366 |
Girl | French | 3622 |
Girl | French | 2386 |
Girl | French | 3596 |
Girl | French | 3135 |
Factorial Model for Birth Weight
The data, as plotted, is shown in the diagram to the left, but for this analysis, we will ignore gestational age, and only
examine how the two factors, sex and ethnic origin, affect birth weight. The data is as shown in the table to the right.
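The unequal subgroup sizes discussed earlier can be checked directly on this data. The following is a minimal sketch that counts the babies in each sex by ethnicity cell; the variable names (`records`, `cell_counts`, `balanced`) are illustrative and not part of the StatTools program.

```python
from collections import Counter

# (sex, ethnicity) for the 22 babies, in the same order as the data table
records = (
    [("Boy", "German")] * 4 + [("Boy", "Greek")] * 3 + [("Boy", "French")] * 3
    + [("Girl", "German")] * 3 + [("Girl", "Greek")] * 5 + [("Girl", "French")] * 4
)

cell_counts = Counter(records)                    # babies per sex x ethnicity cell
sex_counts = Counter(sex for sex, _ in records)   # babies per sex

for (sex, ethnicity), n in sorted(cell_counts.items()):
    print(f"{sex:4s} {ethnicity:7s} n={n}")

# A classical balanced Factorial design would have the same n in every cell
balanced = len(set(cell_counts.values())) == 1
print("balanced design:", balanced)
```

Because the cell sizes differ, the two factors are not independent of each other in this sample, which is the situation the regression approach is intended to handle.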
To allow multiple regression, sex is coded as Boy=0 and Girl=1, and the 3 groups in the ethnicity factor are converted into two binary variables, as follows
- For Germans, ED1=1, ED2 = 0 (German and not Greek)
- For Greeks, ED1=0, ED2 = 1; (Greek and not German)
- For French, ED1=0, ED2=0; (Not German and not Greek)
To allow the estimation of interaction, two additional interaction variables are created
- Interaction between ED1 and sex ED1S = ED1 * sex
- Interaction between ED2 and sex ED2S = ED2 * sex.
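The coding rules above can be sketched as a small function. This is only an illustration, assuming sex is coded Boy=0 and Girl=1 (as in the data table that follows); the function name `encode` is hypothetical.

```python
def encode(sex, ethnicity):
    """Return (sex, ED1, ED2, ED1S, ED2S) for one baby."""
    s = 1 if sex == "Girl" else 0             # Boy=0, Girl=1
    ed1 = 1 if ethnicity == "German" else 0   # German and not Greek
    ed2 = 1 if ethnicity == "Greek" else 0    # Greek and not German
    # French babies are the reference group: ED1=0 and ED2=0
    ed1s = ed1 * s                            # interaction between ED1 and sex
    ed2s = ed2 * s                            # interaction between ED2 and sex
    return (s, ed1, ed2, ed1s, ed2s)


for sex in ("Boy", "Girl"):
    for ethnicity in ("German", "Greek", "French"):
        print(sex, ethnicity, encode(sex, ethnicity))
```

A factor with k groups needs only k-1 dummy variables, because the remaining group is identified by all dummies being 0.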
Sex | ED1 | ED2 | ED1S | ED2S | BWt |
0 | 1 | 0 | 0 | 0 | 2813 |
0 | 1 | 0 | 0 | 0 | 2581 |
0 | 1 | 0 | 0 | 0 | 3172 |
0 | 1 | 0 | 0 | 0 | 3314 |
0 | 0 | 1 | 0 | 0 | 3555 |
0 | 0 | 1 | 0 | 0 | 3312 |
0 | 0 | 1 | 0 | 0 | 3643 |
0 | 0 | 0 | 0 | 0 | 3442 |
0 | 0 | 0 | 0 | 0 | 3185 |
0 | 0 | 0 | 0 | 0 | 3667 |
1 | 1 | 0 | 1 | 0 | 3029 |
1 | 1 | 0 | 1 | 0 | 3200 |
1 | 1 | 0 | 1 | 0 | 3212 |
1 | 0 | 1 | 0 | 1 | 3048 |
1 | 0 | 1 | 0 | 1 | 2706 |
1 | 0 | 1 | 0 | 1 | 3453 |
1 | 0 | 1 | 0 | 1 | 2670 |
1 | 0 | 1 | 0 | 1 | 3366 |
1 | 0 | 0 | 0 | 0 | 3622 |
1 | 0 | 0 | 0 | 0 | 2386 |
1 | 0 | 0 | 0 | 0 | 3596 |
1 | 0 | 0 | 0 | 0 | 3135 |
The data is then subjected to analysis using steps similar to those for covariance analysis.
Step 1 : A two stage Analysis of Variance using the multiple regression algorithm, with and without the interaction
variables, is carried out. In these analyses, only the degrees of freedom and Sums of Squares for the model are of interest.
- The first calculation uses 5 independent variables (Sex, ED1, ED2, ED1S, ED2S) and the outcome variable BWt.
The degrees of freedom = 5, and the Sums of Squares = 768244
- The second calculation excludes the two interaction variables (ED1S and ED2S). Three independent variables (Sex, ED1, ED2)
and the outcome variable BWt are analysed. The degrees of freedom = 3, and Sums of Squares = 400694
- The Table of Analysis of Variance can now be constructed accordingly, as shown in the table below.
- Probability of Type I Error for F=1.43, with 2 and 16 degrees of freedom, α=0.27, not statistically significant.
Source | df | SSq | MSq | F |
Inclusive of Interaction | 5 | 768244 | | |
Exclusive of Interaction | 3 | 400694 | | |
Attributable to Interaction | 5-3=2 | 768244-400694=367550 | 367550/2=183775 | 183775/128561=1.43 |
Residual | 16 | 2056971 | 2056971/16=128561 | |
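The two model Sums of Squares in Step 1 can be reproduced with a small least squares routine. The sketch below uses only the Python standard library and solves the normal equations by Gaussian elimination. It is not the StatTools implementation, and the helper names (`solve`, `model_ss`) are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def model_ss(xs, y):
    """Sums of Squares explained by regressing y on the columns of xs, with an intercept."""
    rows = [[1.0] + list(r) for r in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    return sum((f - mean_y) ** 2 for f in fitted)


# (Sex, ED1, ED2, BWt) for the 22 babies, as in the coded data table
data = [
    (0, 1, 0, 2813), (0, 1, 0, 2581), (0, 1, 0, 3172), (0, 1, 0, 3314),
    (0, 0, 1, 3555), (0, 0, 1, 3312), (0, 0, 1, 3643),
    (0, 0, 0, 3442), (0, 0, 0, 3185), (0, 0, 0, 3667),
    (1, 1, 0, 3029), (1, 1, 0, 3200), (1, 1, 0, 3212),
    (1, 0, 1, 3048), (1, 0, 1, 2706), (1, 0, 1, 3453), (1, 0, 1, 2670), (1, 0, 1, 3366),
    (1, 0, 0, 3622), (1, 0, 0, 2386), (1, 0, 0, 3596), (1, 0, 0, 3135),
]
y = [d[3] for d in data]
reduced = [(s, e1, e2) for s, e1, e2, _ in data]                  # without interaction
full = [(s, e1, e2, e1 * s, e2 * s) for s, e1, e2, _ in data]     # with ED1S and ED2S

ss_full = model_ss(full, y)          # about 768244
ss_reduced = model_ss(reduced, y)    # about 400694
mean_y = sum(y) / len(y)
ss_total = sum((v - mean_y) ** 2 for v in y)
df_resid = len(y) - 1 - 5            # residual df of the model with interaction
f_value = ((ss_full - ss_reduced) / 2) / ((ss_total - ss_full) / df_resid)

print(round(ss_full), round(ss_reduced), round(f_value, 2))
```

The residual used in the F ratio is taken from the larger model (the one including the interaction variables), which matches the 16 degrees of freedom in the table.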
At this point, therefore, we can conclude that no significant interaction exists between sex and ethnicity. In other words, the difference in birth weight between boys and girls is similar in the different ethnic groups.
Step 2 : The regression formula obtained without the interaction variables can now be used to interpret the data.
- The formula is Birth weight (y in grams) = 3395 -183(Sex) -270(ED1) -61(ED2)
- French (ED1=0 and ED2=0) boys (Sex=0) averaged 3395g
- French girls are 183g lighter (t = 1.15, α=0.27)
- German babies are 270g lighter than French babies of the same sex (t = 1.37, α=0.19)
- Greek babies are 61g lighter than French babies of the same sex (t = 0.32, α=0.75)
- None of these differences are statistically significant
Sex | Ethnicity | Gest | BWt |
Boy | German | 36 | 2813 |
Boy | German | 35 | 2581 |
Boy | German | 37 | 3172 |
Boy | German | 38 | 3314 |
Boy | Greek | 39 | 3555 |
Boy | Greek | 38 | 3312 |
Boy | Greek | 40 | 3643 |
Boy | French | 39 | 3442 |
Boy | French | 37 | 3185 |
Boy | French | 41 | 3667 |
Girl | German | 37 | 3029 |
Girl | German | 39 | 3200 |
Girl | German | 38 | 3212 |
Girl | Greek | 37 | 3048 |
Girl | Greek | 36 | 2706 |
Girl | Greek | 40 | 3453 |
Girl | Greek | 36 | 2670 |
Girl | Greek | 39 | 3366 |
Girl | French | 41 | 3622 |
Girl | French | 35 | 2386 |
Girl | French | 41 | 3596 |
Girl | French | 38 | 3135 |
All previous discussions of Factorial Analysis of Variance and of Covariance Analysis are subsections of the full
Factorial Covariance Model, which is discussed in this section. The data is as presented in the table to the left,
and plotted to the right.
The aim is to analyse the influence of two factors, sex and ethnicity, on the birth weight of a baby, corrected for a
single covariate, the gestational age in weeks. The algorithm used is multiple regression.
As the reasons for the various procedures have already been covered in previous sections, only the stages of computation are listed here.
Sex | ED1 | ED2 | ED1S | ED2S | Gest | ED1G | ED2G | ED1SG | ED2SG | BWt |
0 | 1 | 0 | 0 | 0 | 36 | 36 | 0 | 0 | 0 | 2813 |
0 | 1 | 0 | 0 | 0 | 35 | 35 | 0 | 0 | 0 | 2581 |
0 | 1 | 0 | 0 | 0 | 37 | 37 | 0 | 0 | 0 | 3172 |
0 | 1 | 0 | 0 | 0 | 38 | 38 | 0 | 0 | 0 | 3314 |
0 | 0 | 1 | 0 | 0 | 39 | 0 | 39 | 0 | 0 | 3555 |
0 | 0 | 1 | 0 | 0 | 38 | 0 | 38 | 0 | 0 | 3312 |
0 | 0 | 1 | 0 | 0 | 40 | 0 | 40 | 0 | 0 | 3643 |
0 | 0 | 0 | 0 | 0 | 39 | 0 | 0 | 0 | 0 | 3442 |
0 | 0 | 0 | 0 | 0 | 37 | 0 | 0 | 0 | 0 | 3185 |
0 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 3667 |
1 | 1 | 0 | 1 | 0 | 37 | 37 | 0 | 37 | 0 | 3029 |
1 | 1 | 0 | 1 | 0 | 39 | 39 | 0 | 39 | 0 | 3200 |
1 | 1 | 0 | 1 | 0 | 38 | 38 | 0 | 38 | 0 | 3212 |
1 | 0 | 1 | 0 | 1 | 37 | 0 | 37 | 0 | 37 | 3048 |
1 | 0 | 1 | 0 | 1 | 36 | 0 | 36 | 0 | 36 | 2706 |
1 | 0 | 1 | 0 | 1 | 40 | 0 | 40 | 0 | 40 | 3453 |
1 | 0 | 1 | 0 | 1 | 36 | 0 | 36 | 0 | 36 | 2670 |
1 | 0 | 1 | 0 | 1 | 39 | 0 | 39 | 0 | 39 | 3366 |
1 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 3622 |
1 | 0 | 0 | 0 | 0 | 35 | 0 | 0 | 0 | 0 | 2386 |
1 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 3596 |
1 | 0 | 0 | 0 | 0 | 38 | 0 | 0 | 0 | 0 | 3135 |
Step 1. Preparation of the data
- Sex : Boy=0, Girl=1
- Creation of two dummy variables. ED1=0 for non-German and 1 for German, and ED2=0 for non-Greek and 1 for Greek
- Two interaction variables between Sex and the dummy variables. ED1S = Sex * ED1, and ED2S = Sex * ED2
- Gest : Gestation in weeks
- 4 interaction variables involving Gestation. ED1G = Gest * ED1, ED2G = Gest * ED2, ED1SG = Gest * ED1S, ED2SG = Gest * ED2S
- BWt : Birth weight in grams(g)
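The data preparation listed in Step 1 can be sketched as a function that turns one raw record into the 10 independent variables. The function name `design_row` is hypothetical; the column order follows the prepared data table.

```python
def design_row(sex, ethnicity, gest):
    """Return [Sex, ED1, ED2, ED1S, ED2S, Gest, ED1G, ED2G, ED1SG, ED2SG] for one baby."""
    s = 1 if sex == "Girl" else 0             # Boy=0, Girl=1
    ed1 = 1 if ethnicity == "German" else 0
    ed2 = 1 if ethnicity == "Greek" else 0
    ed1s, ed2s = s * ed1, s * ed2             # sex by ethnicity interactions
    return [
        s, ed1, ed2, ed1s, ed2s, gest,
        gest * ed1, gest * ed2,               # ED1G, ED2G
        gest * ed1s, gest * ed2s,             # ED1SG, ED2SG
    ]


print(design_row("Girl", "Greek", 37))
print(design_row("Boy", "German", 36))
```

Every interaction variable is simply the product of the variables it combines, so the whole table can be generated from the three raw columns of Sex, Ethnicity and Gest.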
Step 2 : Interaction related to Gestation
Two analyses, inclusive and exclusive of the interaction variables involving gestation, are carried out to obtain the degrees
of freedom and Sums of Squares of the two models.
- The first analysis includes the 10 independent variables of Sex, ED1, ED2, ED1S, ED2S, Gestation, ED1G, ED2G, ED1SG, ED2SG,
and the dependent variable Birth weight. The degrees of freedom = 10, and Sums of Squares = 2729846
- The second analysis excludes the 4 gestation related interaction variables (ED1G, ED2G, ED1SG, ED2SG). Six independent variables
of Sex, ED1, ED2, ED1S, ED2S, Gestation, and the dependent variable BWt are analysed. The model degrees of freedom is now 6,
and Sums of Squares = 2682429
- The Analysis of Variance Table can now be constructed, as shown below. Probability of Type I Error for F=1.37,
with 4 and 11 degrees of freedom, α = 0.31, not statistically significant.
Source | df | SSq | MSq | F |
Inclusive of Interaction | 10 | 2729846 | | |
Exclusive of Interaction | 6 | 2682429 | | |
Attributable to Interaction | 10-6=4 | 2729846-2682429=47417 | 47417/4=11854 | 11854/8670=1.37 |
Residual | 11 | 95369 | 95369/11=8670 | |
At this point, we can conclude that there is no significant interaction involving gestation. In other words, growth rates in all groups are similar.
Step 3 : Evaluating Interaction between sex and ethnicity
As with gestation, consideration of interaction between sex and ethnicity also involves two analyses.
- The first analysis includes 6 independent variables of Sex, ED1, ED2, ED1S, ED2S, Gestation, and the dependent variable BWt,
and these are subjected to analysis of variance using the multiple regression algorithm. The model degrees of freedom is 6,
and Sums of Squares = 2682429.
- The second analysis excludes the two interaction variables between sex and ethnicity (ED1S, ED2S). Four independent variables,
Sex, ED1, ED2, Gestation, and the dependent variable BWt, are subjected to analysis of variance using the multiple
regression algorithm. The model degrees of freedom is 4, and Sums of Squares = 2673221.
- The Analysis of Variance Table can now be constructed, as shown below. Probability of Type I Error for F=0.48,
with 2 and 15 degrees of freedom, α = 0.63, not statistically significant.
Source | df | SSq | MSq | F |
Inclusive of Interaction | 6 | 2682429 | | |
Exclusive of Interaction | 4 | 2673221 | | |
Attributable to Interaction | 6-4=2 | 2682429-2673221=9208 | 9208/2=4604 | 4604/9519=0.48 |
Residual | 15 | 142785 | 142785/15=9519 | |
At this point, we can conclude that there is no significant interaction between sex and ethnicity. In other words, once corrected for
gestation, the difference in birth weight between boys and girls is similar in all ethnic groups.
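The arithmetic behind the interaction tables in Step 2 and Step 3 follows one pattern, sketched below from the Sums of Squares quoted in those steps. The function name `interaction_f` is illustrative, and the probability lookup for F is left out because it needs an F distribution routine that is not part of the Python standard library.

```python
def interaction_f(ss_full, df_full, ss_reduced, df_reduced, ss_resid, n):
    """Return (df_interaction, df_residual, F) for a pair of nested regression models.

    ss_resid is the residual Sums of Squares of the larger (full) model,
    and n is the number of cases.
    """
    df_int = df_full - df_reduced
    df_res = n - 1 - df_full
    ms_int = (ss_full - ss_reduced) / df_int   # mean square attributable to interaction
    ms_res = ss_resid / df_res                 # residual mean square
    return df_int, df_res, ms_int / ms_res


# Step 2: interaction involving gestation (10 versus 6 independent variables)
step2 = interaction_f(2729846, 10, 2682429, 6, 95369, 22)
# Step 3: interaction between sex and ethnicity (6 versus 4 independent variables)
step3 = interaction_f(2682429, 6, 2673221, 4, 142785, 22)

for name, (df_int, df_res, f) in (("Step 2", step2), ("Step 3", step3)):
    print(f"{name}: F = {f:.2f} with {df_int} and {df_res} degrees of freedom")
```

In both steps the residual comes from the larger of the two models being compared, which is why the residual degrees of freedom change from 11 to 15.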
Step 4 : Final Analysis
The regression formula in the last analysis, free of any interaction terms, can now be interpreted.
- The formula is Birth Weight (y in g) = -4022 -166Sex + 58ED1 + 77ED2 + 191Gest(week)
- Weight gain is 191g per week near term (t=15.94, α<0.0001, statistically highly significant)
- A French Boy (Sex=0, ED1=0, ED2=0), at 40 weeks gestation, averaged 40*191-4022 = 3618g
- A French girl is 166g lighter (t=4.04, α=0.0009, statistically highly significant)
- German babies are 58g heavier than French babies of the same sex and gestation (t=1.06, α=0.30, not statistically significant)
- Greek babies are 77g heavier than French babies of the same sex and gestation (t=1.55, α=0.14, not statistically significant)
- We have established that gestation and sex of the babies significantly affect birth weight, but ethnic origins do not.
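The final formula can be used to estimate the expected birth weight for any combination of sex, ethnicity and gestation. The sketch below applies the rounded coefficients quoted above, so results are approximate; the function name `predict_bwt` is hypothetical.

```python
def predict_bwt(sex, ed1, ed2, gest):
    """Expected birth weight in grams from the final formula (rounded coefficients).

    sex: Boy=0, Girl=1; ed1: 1 if German; ed2: 1 if Greek; gest: gestation in weeks.
    """
    return -4022 - 166 * sex + 58 * ed1 + 77 * ed2 + 191 * gest


print(predict_bwt(0, 0, 0, 40))   # French boy at 40 weeks
print(predict_bwt(1, 0, 0, 40))   # French girl at 40 weeks
print(predict_bwt(1, 0, 1, 38))   # Greek girl at 38 weeks
```

As the formula was fitted on babies of 35 to 41 weeks gestation, estimates outside that range would be extrapolations.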
Multiple Regression
Steel RGD, Torrie JH, Dickey DA (1997) Principles and Procedures of Statistics.
A Biometrical Approach. 3rd Ed. McGraw-Hill Inc New York NY 10020
ISBN 0-07-061028-2 p. 322-351
Sample Size for Multiple Regression
Cohen J. (1988) Statistical Power Analysis for the Behavioral
Sciences. Second Edition. Lawrence Erlbaum Associates Publishers.
Hillsdale New Jersey USA. ISBN 0-8058-0283-5. p. 407-410; 551.
Covariance Models
Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology.
McGraw Hill Book Company New York. Library of Congress No. 73-147164. ISBN 07-047935-6 p.415-440.
This provides the template used on this page, of how to use multiple regression to carry out the analysis of covariance.
Steel RGD, Torrie JH, Dickey DA (1997) Principles and Procedures of Statistics. A Biometrical Approach. 3rd Ed.
McGraw-Hill Inc New York NY 10020 ISBN 0-07-061028-2 p. 322-351
This provides the mathematics of calculating the coefficients and the analysis of variance for the multiple regression model.
Pedhazur E. (1997) Multiple Regression in Behavioral Research. Explanation and Prediction.(3rd. ed.)
Harcourt Brace College Publishers, Fort Worth, USA. ISBN 0-03-072831-2 p.181-196.
This is a very detailed textbook dealing with multiple regression and the many ways it can be used, and a very useful reference book.
It is included here, however, because it provides an excellent discussion of dummy variables.