
StatTools : Multiple Regression Explained


Multiple Regression Complex Factorial Covariance Models References

Introduction Multiple Regression Example

Multiple regression, available from the program in the Multiple Regression Program Page, is one of the most flexible and powerful statistical tools available to the researcher, as it allows the modelling of multiple influences on an outcome while correcting for the overlapping influences of the independent variables. For those who are familiar with the concepts, the algorithm of multiple regression can also be used to calculate a large number of other parametric statistical procedures.

Most professional statistical packages provide large numbers of complex statistical procedures based on multiple regression, under the broad heading of the General Linear Model. StatTools provides the following algorithms based on multiple regression.

Multiple Regression in the Multiple Regression Program Page, and its sample size in the Sample Size for Multiple Regression Explained and Tables Page
Path Analysis in Path Analysis Explained Page
Polynomial Curve Fitting in Curve Fitting Explained Page

This page has two sections.

This section explains the use of multiple regression, its sample size, and provides an example
A much larger and more complex section on how to use the multiple regression algorithm to conduct complex Factorial models of Covariance Analysis

Multiple regression consists of two or more independent variables (x₁, x₂, x₃, etc) and a single dependent variable (y). The formula produced is y = a + b₁x₁+b₂x₂+b₃x₃... where

y is the single dependent variable, assumed to be a parametric measurement (continuous and normally distributed)
The x in x₁, x₂, x₃, etc, are the independent variables. They need not be parametric, but they need to be ordered (3>2>1). Common independent variable types are :
- Binary of 0/1, (no/yes, false/true, negative/positive, etc)
- Ordinal, such as responses to pain (0=none, 1=some, 2=lots), or the Likert Item (0=SD, 1=D, 2=N, 3=A, 4=SA)
- Poisson distributed counts, such as number of cells in a set volume, number of complaints per month
- Discrete Interval Measurements with unstated distribution, such as height in cms, age in years, Time on waiting list
- Normally distributed measurement
- Log-normally distributed measurements such as ratios.

Data Entry : when using the program in the Multiple Regression Program Page, the data consist of a multi-column table, where

Each row is data from a subject
Each column is a measurement of a variable
The last column is the dependent (y) variable
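The layout described above can be sketched in a few lines of Python. This is only an illustration, assuming whitespace or tab separated text with a header row; the function name `parse_table` is hypothetical and is not part of StatTools. The three sample rows are the first three subjects of Example 2 on this page.

```python
def parse_table(text):
    """Split a whitespace/tab separated table into names, x rows and y.

    Assumes the first line holds the column names, every later line is one
    subject, and the last column is the dependent (y) variable.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    names = lines[0]
    rows = [[float(v) for v in ln] for ln in lines[1:]]
    x = [r[:-1] for r in rows]   # independent variables, one list per subject
    y = [r[-1] for r in rows]    # dependent variable (last column)
    return names, x, y


sample = """age Ht Gest Sex BWt
24 170 37 1 3048
29 161 36 0 2813
29 167 41 1 3622"""

names, x, y = parse_table(sample)
print("independent:", names[:-1], "dependent:", names[-1])
print("first subject:", x[0], "->", y[0])
```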

Terminology

Partial Correlation Coefficient (PCor) is the correlation between an independent variable (x) with the dependent variable (y), having corrected for inter-correlations between all the independent variables
Partial Standardised Regression Coefficient (PSReg) is the regression coefficient between an independent variable (x) and the dependent variable (y), having corrected for inter-correlations between all the independent variables, rescaled to a mean of 0 and Standard Deviation of 1 for both. This is measurement unit free, and used for comparing the relative scale of influence from different independent variables
Partial Regression Coefficient (PReg or b) is the regression coefficient between an independent variable (x) and the dependent variable (y), having corrected for inter-correlations between all the independent variables. This is the b used in the regression formula y = a + b₁x₁+b₂x₂+b₃x₃...
Standard Error of the Partial Regression Coefficient (SE)
t=b/SE, and α is the Probability of Type I Error (two tail) of t with residual degrees of freedom
Constant(a) is the a in the formula y = a + b₁x₁+b₂x₂+b₃x₃...

Please note : in the table of analysis of variance, although the model Degrees of Freedom is the sum of the Degrees of Freedom of the individual regression coefficients, the model Sum of Squares is greater than the sum of the Sums of Squares from all the regression coefficients. This is because each individual Sum of Squares describes the pure influence on y from one x variable, while the model Sum of Squares adds all of them together, plus those Sums of Squares that overlap between the independent x variables. It is this difference that provides the very powerful analysis of variance in complex models, where multiple measurements often have various degrees of correlation with each other, and their pure influences and overlapping influences need to be separately accounted for.

Multiple Regression as Entered, and with Stepwise Deletion

The program in the Multiple Regression Program Page provides two options for conducting multiple regression

The as Entered model calculates multiple regression once, using all the entered data. This is the preferred model if the intention is to provide a description of the relationship between the variables, or if the calculation is used to obtain parameters for other complex statistical purposes.
The Stepwise Deletion model carries out repeated multiple regression analysis on the data entered, deleting the weakest independent variable after each cycle. This is the preferred model when developing a predictive algorithm, where the researcher starts with a large number of plausible predictors, and eliminates the weaker ones serially to obtain the most powerful yet most parsimonious (fewest predictors) formula.
The algorithm from the program continues until only 1 independent variable is left, allowing the user to determine the number of independent variables to retain in the final formula. This can be done arbitrarily by judgement, but in most cases, the decision is to retain only those independent variables where the Partial Regression Coefficient (b) is statistically significant (α<0.05)

Sample size for Multiple Regression

The sample size program for multiple regression in the Multiple Regression Program Page uses a modified version of that for comparing multiple groups of measurement in the Sample Size for Unpaired Differences Tables Page, but using the number of independent variables and the Multiple Correlation Coefficient R to represent the number of groups and the residual variance. The calculations require multiple iterative approximations, so computation time increases exponentially with the number of independent variables, and with decreasing value of R. Users are encouraged to consult the tables in the Sample Size for Multiple Regression Explained and Tables Page for their sample size needs.

Example 1 : Sample Size

We wish to study whether we can predict birthweight from maternal age, height and weight, as well as gestational age and the sex of the baby, 5 independent variables or predictors. We want this model to be clinically useful, so we require a moderate effect size of R=0.5

Using α=0.05, power=0.8, number of independent variables u=5, and anticipated effect size R=0.5, we can obtain from the Sample Size for Multiple Regression Explained and Tables Page that the sample size required is 56 pregnancies.

Example 2 : Multiple Regression as Entered

age	Ht	Gest	Sex	BWt
24	170	37	1	3048
29	161	36	0	2813
29	167	41	1	3622
21	165	36	1	2706
35	168	35	0	2581
27	161	39	0	3442
26	163	40	1	3453
34	167	37	0	3172
25	165	35	1	2386
28	170	39	0	3555
32	167	37	1	3029
31	169	37	0	3185
26	161	36	1	2670
21	165	38	0	3314
21	166	41	1	3596
24	164	38	0	3312
34	169	38	0	3414
25	161	41	0	3667
26	167	40	0	3643
27	162	33	1	1398
27	160	38	1	3135
21	167	39	1	3366

We use the default example data from the Multiple Regression Program Page for this exercise. The data were computer generated to demonstrate the procedure and are not real.

We wish to explain the factors that may influence the birth weight of babies, these being maternal age (years) and height (cms), the gestational age at birth (weeks), and whether the baby is a girl (1) or boy (0). We collected data from 22 subjects, as shown in the table above.

Var	mean	SD
1.age	27.0	4.3
2.Ht	165.2	3.2
3.Gest	37.8	2.1
4.Sex	0.5	0.5
5.BWt	3114	533

Please note : The data are in columns separated by spaces or tabs, and the dependent variable (BWt) is in the last column.

Using the program from the Multiple Regression Program Page, and taking the option of calculating the data as entered, we obtained the following results.

The program first produces the means and standard deviations of all the variables, as shown in the table above. The last variable (5.BWt) is the dependent variable.

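	age	Ht	Gest	Sex	BWt
age	1	0.26	-0.25	-0.38	-0.10
Ht	0.26	1	0.08	-0.13	0.24
Gest	-0.25	0.07	1	-0.11	0.92
Sex	-0.38	-0.13	-0.11	1	-0.32
BWt	-0.10	0.24	0.92	-0.32	1

The correlation matrix is produced next, as shown in the table above. Rows and columns are in the order of the variables as entered.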

The multiple regression analysis now takes place. Please note abbreviations for the coefficients table are as follows.

PCor = Partial Correlation Coefficient. This is the correlation between the variable and the dependent variable after correction for inter-correlation between the independent variables.
PSReg = Partial Standardised Regression Coefficient. This measures the influence of each independent variable on the dependent variable, using z or standardised units. For example, for 1 SD of change in maternal age, 0.01 SD of change occurs in birthweight. For 1 SD of change in gestation, 0.9 SD of change occurs in birthweight.
PReg = Partial Regression Coefficient. This measures the change in the dependent variable for each unit of change in the independent variable. For example, for an increase of 1 year in age, the baby weighs 1.7g more. For each week of maturing, the baby weighs 223g more. Girls are 209g lighter.

var	PCor	PSReg	PReg	SE	t	α
1.age	0.0418	0.0137	1.701	9.8641	0.1724	0.8653
2.Ht	0.4395	0.1417	23.6492	11.7243	2.0171	0.0608
3.Gest	0.9493	0.8952	223.1943	17.9205	12.4547	<0.0001
4.Sex	-0.5476	-0.2009	-209.15	77.5107	-2.6983	0.0158

Const = -9165.48 R = 0.961 R² = 0.9236
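As a quick consistency check on the table above, the sketch below recomputes t = b/SE for each row, and approximates the standardised coefficient as b × SD(x) / SD(y). The second relation is the usual textbook definition of a standardised regression coefficient and is assumed here rather than taken from the StatTools program; because the standard deviations printed on this page are rounded, the recomputed PSReg only agrees approximately.

```python
# (name, PSReg, PReg b, SE, t) copied from the coefficients table
coefficients = [
    ("age", 0.0137, 1.701, 9.8641, 0.1724),
    ("Ht", 0.1417, 23.6492, 11.7243, 2.0171),
    ("Gest", 0.8952, 223.1943, 17.9205, 12.4547),
    ("Sex", -0.2009, -209.15, 77.5107, -2.6983),
]
# Rounded standard deviations from the table of means and SDs
sd_x = {"age": 4.3, "Ht": 3.2, "Gest": 2.1, "Sex": 0.5}
sd_y = 533.0  # SD of birth weight

checks = []
for name, psreg, b, se, t in coefficients:
    t_calc = b / se                       # t is the coefficient over its SE
    psreg_calc = b * sd_x[name] / sd_y    # approximate, SDs are rounded
    checks.append((name, t, t_calc, psreg, psreg_calc))
    print(f"{name:5s} t={t_calc:8.4f} (table {t:8.4f})  "
          f"PSReg~{psreg_calc:7.4f} (table {psreg:7.4f})")
```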

SE = standard error of the Partial Regression Coefficient.
t = t test for that Partial Regression Coefficient
α (p) = the probability of Type I Error (α) for that Partial Regression Coefficient.
Const = the constant of the equation. In this case, BWt in g = -9165 + 1.7(age in years) + 23.6(height in cms) + 223.2(gestation in weeks), and -209.2 if the baby is a girl.
R = the Multiple Correlation Coefficient. This is the effect size of the equation, R Sq is R², the proportion of the total variance that is explained by the regressions.
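The regression formula can be applied directly to a new subject. The following is a minimal sketch using the unrounded coefficients from the table; the function name `predict_bwt` is hypothetical.

```python
def predict_bwt(age_years, height_cm, gest_weeks, girl):
    """Predicted birth weight (g) from the 'as entered' coefficients.

    `girl` is 1 for a girl and 0 for a boy, matching the coding of the data.
    """
    return (-9165.48
            + 1.701 * age_years
            + 23.6492 * height_cm
            + 223.1943 * gest_weeks
            - 209.15 * girl)


# First subject in the data table: age 24, height 170, gestation 37, girl
predicted = predict_bwt(24, 170, 37, 1)
print(f"predicted {predicted:.0f} g, observed 3048 g")
```

The gap between the predicted and the observed value for any one subject is that subject's residual, which contributes to the residual Sum of Squares in the analysis of variance.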

This is followed by the analysis of variance

	df	SSq	MSq	F	α
Var1	1	797	797	0.0297	0.8652
Var2	1	109006	109006	4.0688	0.0598
Var3	1	4155814	4155814	155.1197	<0.0001
Var4	1	195066	195066	7.281	0.0152
Model	4	5503642	1375910	51.3572	<0.0001
Res	17	455447	26791
Tot	21	5959089

The abbreviations for the analysis of variance table are as follows

Var = the source of variation
df = degrees of freedom
SSq = Sum of Squares
MSq = Mean Square (SSq / df), or variance
F = Fisher's F, the ratio of the MSq of the regression (each Var row or the Model row) to the MSq of the residual (Res)
α (p) = Probability of Type I error
Model = Contribution from all independent variables collectively
Res = results related to the residual or random error
Tot = total df and SSq.
Var1-Var4 = individual contributions from each variable after corrections for correlation

It should be noted that, although the sum of the degrees of freedom from all the independent variables equals that of the model as a whole (in this example both = 4), this is not so for the Sums of Squares unless the independent variables are all uncorrelated with each other. Otherwise the sum of all the individual Sums of Squares is usually less than that of the model as a whole (in this example 4460683 against 5503642). This is because, for each variable, the Sum of Squares tabulated is that unique to itself, excluding the part it shares by correlation with other independent variables. The missing value, the difference between the model SSq and the sum of those from the individual variables (5503642-4460683=1042959), is that attributable to the overlaps and correlations between the independent variables.
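The arithmetic behind the analysis of variance table can be reproduced from the tabulated degrees of freedom and Sums of Squares alone. This sketch uses only the numbers printed above; the variable names are illustrative.

```python
# Unique Sums of Squares for each independent variable (Var1 to Var4)
unique_ssq = {"age": 797, "Ht": 109006, "Gest": 4155814, "Sex": 195066}
model_df, model_ssq = 4, 5503642
res_df, res_ssq = 17, 455447
total_ssq = 5959089

unique_sum = sum(unique_ssq.values())     # pure influences only
overlap = model_ssq - unique_sum          # shared between correlated x variables
res_msq = res_ssq / res_df                # residual mean square
f_model = (model_ssq / model_df) / res_msq
r_squared = model_ssq / total_ssq         # proportion of variance explained

print("sum of unique SSq :", unique_sum)
print("overlapping SSq   :", overlap)
print("F for the model   :", round(f_model, 4))
print("R squared         :", round(r_squared, 4))
for name, ssq in unique_ssq.items():
    print(f"F for {name:4s}: {ssq / res_msq:.4f}")   # each has 1 df
```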

Example 3 : Multiple Regression with Stepwise Deletion

Instead of aiming to understand the relationship between independent and dependent variables, we wish to establish the most efficient formula to predict birthweight. Efficiency is defined as the most accurate prediction with the least number of independent variables. We decided to delete, as inefficient predictors, those variables with α (p) > 0.05.

var	PCor	PSReg	PReg	SE	t	α
2.Ht	0.46	0.14	24.18	11.0	2.1985	0.042
3.Gest	0.95	0.89	222.14	16.38	13.5577	<0.0001
4.Sex	-0.59	-0.21	-214.61	68.83	-3.118	0.0063

constant (a) = -9165.37

From the first cycle of calculation in the previous example, we determined that maternal age (PSReg=0.01, t=0.17, α=0.87) can be deleted. In the second cycle, we found the results shown in the table above. All 3 remaining predictors now have statistically significant Partial Regression Coefficients (α<0.05), so no further deletion is necessary, and the final prediction formula is

Birth weight (g) = -9165 + 24 (maternal height in cms) + 222 (gestation in weeks) for boys, and 215g less for girls

Please note : the program in the Multiple Regression Program Page progressively deletes the least significant variable at each cycle of calculations until only one variable is left in the equation. The user however should examine the results at the end of each cycle, and decide when the stepwise deletion should stop. In this example, stepwise deletion is stopped after the first cycle, and only maternal age was deleted, because the decision to delete was based on α>0.05
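The stepwise deletion cycle can be sketched as follows. This is an illustrative, dependency-free implementation and not the StatTools program itself: it fits ordinary least squares on centred data, ranks the variables by the absolute value of t = b/SE, and deletes the weakest after each cycle until one is left. Ranking by |t| is assumed here to stand in for ranking by α, which is reasonable because all coefficients in a cycle share the same residual degrees of freedom. The helper names `solve`, `fit` and `stepwise_deletion` are hypothetical.

```python
import math

NAMES = ["age", "Ht", "Gest", "Sex"]
ROWS = [  # age, Ht, Gest, Sex (1 = girl), BWt (dependent, last column)
    [24, 170, 37, 1, 3048], [29, 161, 36, 0, 2813], [29, 167, 41, 1, 3622],
    [21, 165, 36, 1, 2706], [35, 168, 35, 0, 2581], [27, 161, 39, 0, 3442],
    [26, 163, 40, 1, 3453], [34, 167, 37, 0, 3172], [25, 165, 35, 1, 2386],
    [28, 170, 39, 0, 3555], [32, 167, 37, 1, 3029], [31, 169, 37, 0, 3185],
    [26, 161, 36, 1, 2670], [21, 165, 38, 0, 3314], [21, 166, 41, 1, 3596],
    [24, 164, 38, 0, 3312], [34, 169, 38, 0, 3414], [25, 161, 41, 0, 3667],
    [26, 167, 40, 0, 3643], [27, 162, 33, 1, 1398], [27, 160, 38, 1, 3135],
    [21, 167, 39, 1, 3366],
]


def solve(a, b):
    """Solve the linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, cols):
    """Least squares of the last column on columns `cols`: {col: (b, t)}."""
    n, k = len(rows), len(cols)
    y = [r[-1] for r in rows]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    xc = []
    for c in cols:  # centre every predictor so the constant drops out
        col = [r[c] for r in rows]
        col_mean = sum(col) / n
        xc.append([v - col_mean for v in col])
    sxx = [[sum(p * q for p, q in zip(u, v)) for v in xc] for u in xc]
    sxy = [sum(p * q for p, q in zip(u, yc)) for u in xc]
    b = solve(sxx, sxy)
    resid = [yc[i] - sum(b[j] * xc[j][i] for j in range(k)) for i in range(n)]
    s2 = sum(e * e for e in resid) / (n - k - 1)  # residual mean square
    out = {}
    for j, c in enumerate(cols):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        se = math.sqrt(s2 * solve(sxx, unit)[j])  # SE of coefficient j
        out[c] = (b[j], b[j] / se)
    return out


def stepwise_deletion(rows, names):
    """Refit repeatedly, deleting the variable with the smallest |t|."""
    active = list(range(len(names)))
    history = []
    while True:
        result = fit(rows, active)
        history.append({names[c]: result[c] for c in active})
        if len(active) == 1:
            return history
        active.remove(min(active, key=lambda c: abs(result[c][1])))


history = stepwise_deletion(ROWS, NAMES)
for cycle, coeffs in enumerate(history, 1):
    print(cycle, {k: (round(b, 2), round(t, 2)) for k, (b, t) in coeffs.items()})
```

The sketch prints every cycle, leaving the decision on where to stop to the reader, in the same way as the program described above.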

Concepts and Background OneWay Analysis of Variance and Covariance Factorial Analysis of Variance and Covariance

Introduction and Theoretical Considerations Technical Considerations

This section explains the relationship between multiple regression and the general model of analysis of variance and covariance. This is done for the following reasons.

To demonstrate the underlying principles of the least squares statistical approach to the analysis of variance
To provide an understanding of One Way Analysis of Variance, the Factorial model of Analysis of Variance, and the Analysis of Covariance
To provide a guideline on how to conduct complex Analysis of Covariance, step by step, using the algorithm of multiple regression. Although this may still be of interest to some, it is mostly superseded by the commercially available statistical packages, which will perform the procedures with check boxes for options, and a click of the button.

For those who do not have a clear understanding of Analysis of Covariance, the following minimal and very basic terms and descriptions may be useful.

Variance is the square of the Standard Deviation, and it measures variations in a measurement
The Analysis of Variance partitions the variance of the dependent variable according to those factors that influence it.
- In the simplest model, the analysis of variance is summarized as the t test. For example, how is the variance in birth weight influenced by the sex (male or female) of the baby, a single comparison of the two sexes
- When there are more than two groups, the general model of One Way Analysis of Variance is used. For example, how do three different ethnic origins (say Greeks, Germans, and Slavs) influence the birth weight of the baby. With three groups there are 3 comparisons, Greeks vs Germans, Greeks vs Slavs, and Germans vs Slavs.
- When two sets of influences (Factors) are involved (say sex and ethnicity), then a Two Way Analysis of Variance is used. With more, Multiway Analysis of Variance. However, there may be systematic or accidental correlations between factors, (say Greeks have more girls than Germans), and these are called Interactions between Factors. The Analysis of Variance which separates those variances unique to each factor, and those that overlap between factors, is known as the Factorial Model of Analysis of Variance.
If, on top of all of this, as is usually the case, there are other influences to be taken into consideration, such as the need to correct differences in birth weight for gestational age, then these corrections are termed covariates, and the combined analysis becomes Covariance Analysis.
Things now start to become a bit complicated, because each covariate may act differently in different factors, say German babies grow faster than Slav babies near term. This is called an Interaction between a factor and a covariate.
The total number of possible interactions is therefore a multiple of the numbers of covariates and factors. As these increase, the model becomes complex and confusing.
To be correct, the results of a covariance analysis are only valid if all possible interactions are tested and found to be trivial (not statistically significant). A review of the literature, however, shows that most authors do not bother, and assume that interactions are either irrelevant or do not exist.

This panel describes to the reader the organisation of the explanations, and the example data used, in the rest of this section.

The rest of the sections are divided as follows

One Way Analysis of Variance and covariance, with the following examples
- Analysis using two groups (sex of the baby) and a covariate (gestation)
- Analysis using three groups (ethnicity of the baby) and a covariate (gestation)
Factorial Analysis of Variance and Covariance, with two factors (sex and ethnicity) and a covariate (gestation).

Sex	Ethnicity	Gest	BWt
Girl	Greek	37	3048
Boy	German	36	2813
Girl	French	41	3622
Girl	Greek	36	2706
Boy	German	35	2581
Boy	French	39	3442
Girl	Greek	40	3453
Boy	German	37	3172
Girl	French	35	2386
Boy	Greek	39	3555
Girl	German	37	3029
Boy	French	37	3185
Girl	Greek	36	2670
Boy	German	38	3314
Girl	French	41	3596
Boy	Greek	38	3312
Girl	German	39	3200
Boy	French	41	3667
Boy	Greek	40	3643
Girl	German	38	3212
Girl	French	38	3135
Girl	Greek	39	3366

The algorithm used to obtain the results will be multiple regression (as entered model), as calculated in the Multiple Regression Program Page . Out of all the results produced, the useful parameters used for Analysis of Variance and Covariance are

The constant (a) and regression coefficients (b) of the regression formula
The degrees of freedom (df) and Sums of Squares (SSq) from the Analysis of Variance table

The dataset used for this exercise, as tabulated above, was artificially generated by the computer to demonstrate the procedures, and does not represent reality. Users should also understand that real analysis requires a much larger number of cases than that presented here.

There are 4 German boys (red) and 3 German girls (maroon), 3 Greek boys (light green) and 5 Greek girls (dark green), 3 French boys (blue) and 4 French girls (navy). All sex and ethnicity in subsequent plots will be identified by these colors.

Two Groups Three Groups

Sex	Gest	BWt
Boy	36	2813
Boy	35	2581
Boy	37	3172
Boy	38	3314
Boy	39	3555
Boy	38	3312
Boy	40	3643
Boy	39	3442
Boy	37	3185
Boy	41	3667
Girl	37	3029
Girl	39	3200
Girl	38	3212
Girl	37	3048
Girl	36	2706
Girl	40	3453
Girl	36	2670
Girl	39	3366
Girl	41	3622
Girl	35	2386
Girl	41	3596
Girl	38	3135

We will use the data set and analyse the difference in birth weight between boys and girls, and for the moment forget the ethnicity. The re-arranged data table is as shown to the right, and the plot as shown to the left.

One way Analysis of Variance

If we ignore the gestational age, then we can use the program in the Unpaired Difference Programs Page . The results would be

For boys, n=10, mean=3268g, Standard Deviation=351g
For girls n=12 mean=3119g, Standard Deviation=380g
The difference = 149g, t=0.95 df=20 p=0.35

However, if we were to use the regression model in the Multiple Regression Program Page, using x=0 for boys and x=1 for girls, and y=birth weight, we will obtain the formula birth weight (y) = 3268 - 149(girls). This means that the birth weight is 3268g when x=0 (boys), and reduced by 149g when x=1 (girls). The t for the regression coefficient (-0.95) is also the same, apart from its sign, as that from the algorithm comparing the two groups.

In other words, the regression algorithm produces the same results as that of analysis of variance for two groups.
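The equivalence can be checked with a short sketch: regress birth weight on a 0/1 variable for sex using the closed-form formulas for simple regression, and compare the constant with the mean of the boys and the coefficient with the difference between the group means. The variable names are illustrative only.

```python
import math

boys = [2813, 2581, 3172, 3314, 3555, 3312, 3643, 3442, 3185, 3667]
girls = [3029, 3200, 3212, 3048, 2706, 3453, 2670, 3366, 3622, 2386, 3596, 3135]

x = [0] * len(boys) + [1] * len(girls)   # 0 = boy, 1 = girl
y = boys + girls
n = len(y)

x_mean, y_mean = sum(x) / n, sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

b = sxy / sxx                 # regression coefficient for sex
a = y_mean - b * x_mean       # constant
resid_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(resid_ss / (n - 2) / sxx)
t = b / se_b                  # residual df = n - 2 = 20

print(f"constant a = {a:.1f} (mean of boys = {sum(boys) / len(boys):.1f})")
print(f"coefficient b = {b:.1f} "
      f"(girls - boys = {sum(girls) / len(girls) - sum(boys) / len(boys):.1f})")
print(f"t = {t:.2f} with {n - 2} degrees of freedom")
```

The unrounded difference between the group means is close to 150g; the 149g quoted in the text is the difference between the two rounded means (3268 - 3119).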

One way Analysis of Variance with a covariate

The One Way Analysis of Variance showed that there was no significant difference between the birth weight of boys and girls. This is because a much greater influence, the gestational age, obscured the difference, as can be seen in the diagram. One method of correcting for the influence of gestational age is to draw two regression lines and compare them, using the program in the Compare Two Regression Lines (Covariance Analysis) Program Page. Submitting the data to that program will produce the following results.

For boys, Birth weight (y in grams) = -3772 + 185(gestation in weeks)
For girls, Birth weight (y in grams) = -3999 + 187(gestation in weeks)
Difference in slope (girls - boys) = 187-185 = 2g per week, t = 0.07, df = 18, p = 0.95
Assumed common slope = 186g / week
Difference between sexes (girls - boys) adjusted for gestational age = -165g, t = 4.03, df = 19, p <0.001
In other words, the growth rates of boys and girls are not significantly different, at 186g/week. Having corrected for gestational age, girls are 165g lighter than boys, which is statistically significant.

Sex	Gestation	Ia	BWt
0	36	0	2813
0	35	0	2581
0	37	0	3172
0	38	0	3314
0	39	0	3555
0	38	0	3312
0	40	0	3643
0	39	0	3442
0	37	0	3185
0	41	0	3667
1	37	37	3029
1	39	39	3200
1	38	38	3212
1	37	37	3048
1	36	36	2706
1	40	40	3453
1	36	36	2670
1	39	39	3366
1	41	41	3622
1	35	35	2386
1	41	41	3596
1	38	38	3135

We will now use the multiple regression model, and introduce the concept of interaction. Before we combine the influences of gestational age and sex on birth weight, we must first assure ourselves that the influence of gestation is not different in the two sexes, that is, that boys do not grow faster or slower than girls near term.

We therefore create a new variable, the interaction (Ia), where Ia = Sex * Gestation, so that the data to be used are as shown in the table above. We then analyse this set of data using multiple regression and obtain the following results (rounded to the nearest whole number).

Birth weight (g) = -3772 + (-227(girls)) + (185(Gestation in weeks)) + (2(Interaction))
The interaction = 2, t = 0.07, not statistically significant, is the same as the difference between the two slopes in the previous calculation

Had there been significant interaction, we would not be able to proceed, as the adjustment for gestation will need to be different in the two sexes. As there is no significant interaction, the multiple regression analysis can now be repeated without the interaction term, and the result is Birth Weight (g) = -3808 + (-165(girls)) + (186(Gestation in weeks)). In other words, having corrected for the influence of gestation, girls are 165g lighter than boys.
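The two regression runs described above can be sketched as follows, first with the interaction column Ia = Sex * Gestation and then without it. This is a minimal, dependency-free illustration that solves the centred normal equations; the helper names `solve` and `ols` are hypothetical and do not represent the StatTools program.

```python
boys = [(36, 2813), (35, 2581), (37, 3172), (38, 3314), (39, 3555),
        (38, 3312), (40, 3643), (39, 3442), (37, 3185), (41, 3667)]
girls = [(37, 3029), (39, 3200), (38, 3212), (37, 3048), (36, 2706),
         (40, 3453), (36, 2670), (39, 3366), (41, 3622), (35, 2386),
         (41, 3596), (38, 3135)]

sex = [0] * len(boys) + [1] * len(girls)          # 0 = boy, 1 = girl
gest = [g for g, _ in boys + girls]
bwt = [w for _, w in boys + girls]
ia = [s * g for s, g in zip(sex, gest)]           # interaction column


def solve(a, b):
    """Solve the linear system a.x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least squares fit; returns (constant, [coefficient per column])."""
    n = len(y)
    means = [sum(c) / n for c in columns]
    y_mean = sum(y) / n
    xc = [[v - m for v in c] for c, m in zip(columns, means)]
    yc = [v - y_mean for v in y]
    sxx = [[sum(p * q for p, q in zip(u, v)) for v in xc] for u in xc]
    sxy = [sum(p * q for p, q in zip(u, yc)) for u in xc]
    b = solve(sxx, sxy)
    a = y_mean - sum(bj * m for bj, m in zip(b, means))
    return a, b


a_full, b_full = ols([sex, gest, ia], bwt)        # with interaction
a_red, b_red = ols([sex, gest], bwt)              # without interaction

print("with Ia   : a=%.0f sex=%.0f gest=%.0f Ia=%.1f" % (a_full, *b_full))
print("without Ia: a=%.0f sex=%.0f gest=%.0f" % (a_red, *b_red))
```

In the model with the interaction term, the coefficient for Ia is the difference between the slopes of the two separate regression lines, which is why the two approaches give the same answer.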

The whole point of this exercise, to analyse the same data using comparison of two regression lines and using multiple regression, is to demonstrate the principle underlying covariance analysis, and to demonstrate what an interaction in a multivariate set of calculation is all about. To summarise

Multiple regression can be used to analyse multivariate statistical data
In the multi-variate situation, there is a need to check for interaction, that the influence of one variable on the outcome is not affected by another influence.

Ethnicity	Gest	BWt
German	36	2813
German	35	2581
German	37	3172
German	38	3314
German	37	3029
German	39	3200
German	38	3212
Greek	39	3555
Greek	38	3312
Greek	40	3643
Greek	37	3048
Greek	36	2706
Greek	40	3453
Greek	36	2670
Greek	39	3366
French	39	3442
French	37	3185
French	41	3667
French	41	3622
French	35	2386
French	41	3596
French	38	3135

We will use the data set and analyse the difference in birth weight between ethnic origins, and for the moment forget sex of the baby. The re-arranged data table is as shown to the right, and the plot as shown to the left.

One way Analysis of Variance

If we ignore the gestational age, then we can use the program in the Unpaired Difference Programs Page . The results would be

For Germans, n=7, mean=3046g, Standard Deviation=261g
For Greeks, n=8, mean=3219g, Standard Deviation=373g
For French, n=7, mean=3290g, Standard Deviation=451g
In the analysis of variance, F=0.81, α=0.46, so the groups are not significantly different from each other.

Multiple Regression : Introducing the dummy variable

Multiple regression requires the independent variables to be at least ordered (3>2>1). When there are multiple groups which are not ordered, there is a need to create dummy variables that are ordered to represent them, using the following procedures.

The number of dummy variables = 1 less than the number of groups. For the current data of 3 ethnic groups, we will create 2 dummy variables EthnicDummy1 (ED1) and EthnicDummy2 (ED2)
For each group except the last, we will assign a value of 1 in one of the dummy variables and 0 in the remaining ones. For the last group, we will assign 0 in all the dummy variables. It does not matter which group is assigned to which dummy variable, providing they are identified when the results are interpreted.
- For Germans, ED1=1, ED2 = 0 (German and not Greek)
- For Greeks, ED1=0, ED2 = 1; (Greek and not German)
- For French, ED1=0, ED2=0; (Not German and not Greek)

ED1 (German)	ED2 (Greek)	Birth Weight
1	0	2813
1	0	2581
1	0	3172
1	0	3314
1	0	3029
1	0	3200
1	0	3212
0	1	3555
0	1	3312
0	1	3643
0	1	3048
0	1	2706
0	1	3453
0	1	2670
0	1	3366
0	0	3442
0	0	3185
0	0	3667
0	0	3622
0	0	2386
0	0	3596
0	0	3135

Multiple Regression now produces the formula Birth Weight (y) = 3290 + (-245ED1) + (-71ED2). This means :

For German babies, where ED1=1 and ED2=0, the birth weight is 3290 - 245 = 3045g
For Greek babies, where ED1=0 and ED2=1, the birth weight is 3290 - 71 = 3219g
For French babies, where ED1=0 and ED2=0, the birth weight is 3290g
F for the model is 0.81, which is not statistically significant.
Except for the rounding error of 1g for German babies, these are the same results as that from One Way Analysis of Variance
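The dummy coding and its interpretation can be sketched as below. The sketch relies on a standard property of least squares, assumed here rather than computed by a regression routine: when the model contains only the dummy variables, the constant equals the mean of the reference group (the one coded 0, 0) and each coefficient equals the difference between that group's mean and the reference mean. The names `DUMMY_CODES`, `encode` and `predicted` are hypothetical.

```python
# French is the reference group, coded 0 in both dummy variables
DUMMY_CODES = {"German": (1, 0), "Greek": (0, 1), "French": (0, 0)}

bwt = {
    "German": [2813, 2581, 3172, 3314, 3029, 3200, 3212],
    "Greek": [3555, 3312, 3643, 3048, 2706, 3453, 2670, 3366],
    "French": [3442, 3185, 3667, 3622, 2386, 3596, 3135],
}


def encode(ethnicity):
    """Return (ED1, ED2) for one subject."""
    return DUMMY_CODES[ethnicity]


def mean(values):
    return sum(values) / len(values)


a = mean(bwt["French"])            # constant = reference group mean
b_ed1 = mean(bwt["German"]) - a    # coefficient for ED1 (German)
b_ed2 = mean(bwt["Greek"]) - a     # coefficient for ED2 (Greek)


def predicted(ethnicity):
    ed1, ed2 = encode(ethnicity)
    return a + b_ed1 * ed1 + b_ed2 * ed2


print(f"Birth Weight = {a:.0f} + ({b_ed1:.0f} ED1) + ({b_ed2:.0f} ED2)")
for group in DUMMY_CODES:
    print(group, encode(group), round(predicted(group)))
```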

Analysis of Covariance for multiple groups.

ED1 (German)	ED2 (Greek)	Gestation	ED1S	ED2S	Birth Weight
1	0	36	36	0	2813
1	0	35	35	0	2581
1	0	37	37	0	3172
1	0	38	38	0	3314
1	0	37	37	0	3029
1	0	39	39	0	3200
1	0	38	38	0	3212
0	1	39	0	39	3555
0	1	38	0	38	3312
0	1	40	0	40	3643
0	1	37	0	37	3048
0	1	36	0	36	2706
0	1	40	0	40	3453
0	1	36	0	36	2670
0	1	39	0	39	3366
0	0	39	0	0	3442
0	0	37	0	0	3185
0	0	41	0	0	3667
0	0	41	0	0	3622
0	0	35	0	0	2386
0	0	41	0	0	3596
0	0	38	0	0	3135

The differences between ethnic groups have been found to be not statistically significant, but this may be caused by the much greater influence of gestational age on birth weight, as can be seen in the plot above. The inclusion of gestational age as a covariate is therefore necessary.

As the three ethnic groups have been converted into two dummy variables ED1 and ED2, the interactions between gestation and both ED variables will now need to be constructed. These are ED1S=ED1*Gest, and ED2S=ED2*Gest. The data are now as shown in the table above, and the analysis proceeds in the following steps.

Step 1 : All 5 independent variables, ED1, ED2, gestation, ED1S, ED2S, plus the dependent variable BWt, are subjected to multiple regression analysis. Although the full data output is produced by the program, we are only interested in the model degrees of freedom (5) and Sums of Square (2544655).

Step 2 : The exercise is repeated, excluding the two interaction terms ED1S and ED2S. The 3 independent variables, ED1, ED2, Gestation, plus the dependent variable BWt, are subjected to multiple regression analysis. Again, we are interested in the model degrees of freedom (3) and Sums of Square (2527306)

Step 3 : Analysis of Interaction. Using the combined information from the two steps, we can now reconstruct the Analysis of Variance Table obtained initially in Step 1, as shown in the table below. The Probability of Type I Error for F = 0.49, with 2 and 16 degrees of freedom, is α=0.63, and we can conclude at this point that no significant interaction exists between gestation and ethnic origin of the babies. In other words, the growth rates of babies near term are not different in the three ethnic groups.

	df	SSq	MSq	F
Inclusive of Interaction	5	2544655
Exclusive of Interaction	3	2527306
Attributable to Interaction	2	17349	17349/2=8675	8675/17535=0.49
Residual	16	280560	280560/16=17535
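The reconstruction of the interaction row is simple subtraction and division, and can be sketched with the numbers from the table. The variable names are illustrative; the probability for F is not computed here, as that needs the F distribution from a statistical library.

```python
# Model df and Sums of Squares from the two regression runs
full_df, full_ssq = 5, 2544655        # Step 1: with interaction terms
reduced_df, reduced_ssq = 3, 2527306  # Step 2: without interaction terms
res_df, res_ssq = 16, 280560          # residual from the Step 1 run

ia_df = full_df - reduced_df          # df attributable to interaction
ia_ssq = full_ssq - reduced_ssq       # SSq attributable to interaction
ia_msq = ia_ssq / ia_df
res_msq = res_ssq / res_df
f_ia = ia_msq / res_msq               # F with (ia_df, res_df) degrees of freedom

print(f"interaction: df={ia_df} SSq={ia_ssq} MSq={ia_msq:.1f}")
print(f"residual   : df={res_df} SSq={res_ssq} MSq={res_msq:.1f}")
print(f"F = {f_ia:.2f} with {ia_df} and {res_df} degrees of freedom")
```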

Step 4 : Covariance Analysis . The Regression Formula obtained in Step 2, excluding interactions, can now be examined.

The formula is Birth weight (y in g) = -4166 + 84ED1 +69ED2 + 192Gestation (in weeks)
Birth weight increases by 192g per week near term (t=11.8, α<0.001, statistically significant)
A French baby, at term (40 weeks), averaged 40*192-4166 = 3514g
German babies (ED1) are 84g more than French babies (t=1.13, α=0.27, not statistically significant)
Greek babies are 69g more than French babies (t=1.02, α=0.32, not statistically significant)
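The covariance formula from Step 4 can be written as a small function to show how the dummy variables and the covariate combine. This is only a sketch using the rounded coefficients quoted above; the function name `predict_bwt` is hypothetical.

```python
DUMMY_CODES = {"German": (1, 0), "Greek": (0, 1), "French": (0, 0)}


def predict_bwt(ethnicity, gest_weeks):
    """Birth weight (g) = -4166 + 84 ED1 + 69 ED2 + 192 Gestation."""
    ed1, ed2 = DUMMY_CODES[ethnicity]
    return -4166 + 84 * ed1 + 69 * ed2 + 192 * gest_weeks


for group in ("French", "German", "Greek"):
    print(group, "at 40 weeks:", predict_bwt(group, 40), "g")
```

Because there is no interaction term, the three groups share the same slope of 192g per week, and differ only by a constant amount at every gestational age.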

Comments : These simple steps demonstrate the mathematical sequence used to handle complex data using the multiple regression algorithm.

The creation of binary dummy variables to replace variables with multiple groups
The creation of interaction variables between different factors, where Interaction value = Factor1 value multiplied by Factor 2 value
The double analysis of variance, with and without the interaction variables, to isolate the interaction effect. This is necessary, because some correlation (and therefore overlapping effect) exists between different factors, and this double procedure allows the overlap to remain with the main effect, so that the uncorrelated interactions can be isolated.
Only when there is no significant interaction, can the covariance analysis be interpreted.

Two very important concepts involved when handling multivariate data are also demonstrated in this model.

Interaction, where the influence of one factor on the dependent variable is altered by another factor. Interactions can be helpful or unhelpful, but they need to be defined, isolated, and interpreted. An example is that interaction between sex and gestation means boys and girls have different growth rates
Confounding, caused by correlations between factors, so that it is difficult or even impossible to identify how much each factor affects the outcome. Confounding is always bad as it results in misleading interpretations, and the greatest virtue of multiple regression analysis is its ability to separate the unique and overlapping parts of effects from multiple factors. An example of correlation and confounding would be if girls are born earlier than boys, so that it is unclear whether it is the sex or the gestation that affects birth weight.

Factorial Analysis of Variance Factorial Analysis of Covariance

The Factorial model of Analysis of Variance was initially used in agriculture and animal laboratories, where subjects (plants or animals) are randomly allocated to groups, which are given a combination of two or more treatments. Such a model has many advantages

The same subject is used in a number of experiments simultaneously, thus greatly reducing the cost of research
In many cases, the combination of two treatments may have a greater (synergism) or lesser (antagonism) effect than the sum of their individual effects. These are called interactions, and they provide additional useful information
Mathematically, the Analysis of Variance calculates the effect of each treatment (single factors), then of groups of combined treatments (combined factors). The difference between the combined effect and the sum of the single effects then represents the interaction, which can be numerically presented and statistically tested.
The two important underlying assumptions in this model are, firstly, that the treatment must be randomly and independently allocated, so there is no correlation between treatments, and secondly, that all groups and subgroups at different levels have the same sample size.

The Factorial model is a powerful and efficient model of investigation, so it has gradually been adopted in all aspects of psychosocial research and in the clinical area, moving from the controlled experiment to the epidemiological model. In doing so, the important assumptions of Factorial models often cannot be met, as the independent variables are frequently not randomly allocated treatments but characteristics in the natural environment, and the sample sizes available in subgroups are seldom the same.

The sample size in the groups can only be controlled to an extent. For example, the number of boys and girls born are never exactly the same, and to artificially create equal numbers will require removing some cases arbitrarily, and this process itself will introduce a bias.
The difference in birth weight between boys and girls amongst Germans may be different to that amongst Greeks (interaction). Although interaction can be useful information, in clinical investigations it often represents an unwanted distraction that makes interpretation of the data difficult.
We cannot allocate sex at random to different groups, so a possibility of correlation occurs. For example, the sex ratio may differ in different ethnic groups, so that the influence of ethnicity and sex cannot be separated (confounding).

When the assumptions of the Factorial model are violated, the results produced become misleading, and sometimes the numbers do not add up. When there is extensive correlation between independent variables, the overlapping influences are counted repeatedly and thus inflated in the single effects, so that the combined effect is less than the sum of the single effects, resulting in a conceptually unacceptable negative interaction.

The mathematics of multiple regression is able to resolve this difficulty, because it separates those influences (in terms of Sums of Squares) that are unique to each independent variable from those that overlap between the correlated variables. In short, it treats every factor both as an independent variable and as a covariate. In most modern statistical packages, therefore, the multiple regression algorithm is used for calculation even though the user interface retains the Analysis of Variance format.

Sex	Ethnicity	BWt
Boy	German	2813
Boy	German	2581
Boy	German	3172
Boy	German	3314
Boy	Greek	3555
Boy	Greek	3312
Boy	Greek	3643
Boy	French	3442
Boy	French	3185
Boy	French	3667
Girl	German	3029
Girl	German	3200
Girl	German	3212
Girl	Greek	3048
Girl	Greek	2706
Girl	Greek	3453
Girl	Greek	2670
Girl	Greek	3366
Girl	French	3622
Girl	French	2386
Girl	French	3596
Girl	French	3135

Factorial Model for Birth Weight

The data are plotted in the accompanying diagram, but for this analysis we will ignore gestational age and examine only how the two factors, sex and ethnic origin, affect birth weight. The data are shown in the table above.

To allow multiple regression, the 3 groups in the ethnicity factor are converted into two binary variables, as follows:

For Germans, ED1=1, ED2 = 0 (German and not Greek)
For Greeks, ED1=0, ED2 = 1; (Greek and not German)
For French, ED1=0, ED2=0; (Not German and not Greek)

To allow the estimation of interaction, sex is coded as Boy=0 and Girl=1, and two additional interaction variables are created:

Interaction between ED1 and sex: ED1S = ED1 * sex
Interaction between ED2 and sex: ED2S = ED2 * sex

Sex	ED1	ED2	ED1S	ED2S	BWt
0	1	0	0	0	2813
0	1	0	0	0	2581
0	1	0	0	0	3172
0	1	0	0	0	3314
0	0	1	0	0	3555
0	0	1	0	0	3312
0	0	1	0	0	3643
0	0	0	0	0	3442
0	0	0	0	0	3185
0	0	0	0	0	3667
1	1	0	1	0	3029
1	1	0	1	0	3200
1	1	0	1	0	3212
1	0	1	0	1	3048
1	0	1	0	1	2706
1	0	1	0	1	3453
1	0	1	0	1	2670
1	0	1	0	1	3366
1	0	0	0	0	3622
1	0	0	0	0	2386
1	0	0	0	0	3596
1	0	0	0	0	3135
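The coding in the table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `code_row` and the list layout are assumptions made for illustration, and French is treated as the reference group as in the text.

```python
# Factor levels in the order of the data table (boys first, then girls).
sexes = ["Boy"] * 10 + ["Girl"] * 12
ethnicities = (["German"] * 4 + ["Greek"] * 3 + ["French"] * 3
               + ["German"] * 3 + ["Greek"] * 5 + ["French"] * 4)


def code_row(sex, ethnicity):
    """Return (Sex, ED1, ED2, ED1S, ED2S) for one baby.

    French is the reference group, so it is coded ED1 = ED2 = 0.
    """
    s = 1 if sex == "Girl" else 0            # Boy = 0, Girl = 1
    ed1 = 1 if ethnicity == "German" else 0  # German and not Greek
    ed2 = 1 if ethnicity == "Greek" else 0   # Greek and not German
    return (s, ed1, ed2, ed1 * s, ed2 * s)   # last two are the interactions


coded = [code_row(s, e) for s, e in zip(sexes, ethnicities)]
for row in coded:
    print(*row, sep="\t")
```

Each interaction column is non-zero only for girls of the corresponding ethnic group, which is why ED1S and ED2S are all zero for the boys.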

The data are then analysed using steps similar to those for covariance analysis.

Step 1 : A two stage Analysis of Variance using the multiple regression algorithm, with and without the interaction variables, is carried out. In these analyses, only the degrees of freedom and Sums of Squares for the model are of interest.

The first calculation uses 5 independent variables (Sex, ED1, ED2, ED1S, ED2S) and the outcome variable BWt. The degrees of freedom = 5, and the Sums of Squares = 768244
The second calculation excludes the two interaction variables (ED1S and ED2S), so three independent variables (Sex, ED1, ED2) and the outcome variable BWt are analysed. The degrees of freedom = 3, and the Sums of Squares = 400694
The Table of Analysis of Variance can now be constructed accordingly, as shown in the table below. The residual Sums of Squares and degrees of freedom (22 - 1 - 5 = 16) are those of the first calculation, which includes the interaction variables.
Probability of Type I Error for F=1.43, with 2 and 16 degrees of freedom, α=0.27, not statistically significant.

	df	SSq	MSq	F
Inclusive of Interaction	5	768244
Exclusive of Interaction	3	400694
Attributable to Interaction	5-3=2	768244-400694=367550	367550/2=183775	183775/128561=1.43
Residual	16	2056971	2056971/16=128561
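The two calculations in Step 1 can be sketched in plain Python. The helper name `model_ss` and the use of Gauss-Jordan elimination on the normal equations are choices made for this illustration only; the StatTools program may obtain the Sums of Squares by a different route.

```python
def model_ss(columns, y):
    """Model Sum of Squares from regressing y on the columns plus an intercept.

    Solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination.
    """
    n = len(y)
    X = [[1.0] + [float(col[r]) for col in columns] for r in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(k):
            if r != i:
                factor = A[r][i]
                A[r] = [vr - factor * vi for vr, vi in zip(A[r], A[i])]
    b = [A[i][k] for i in range(k)]
    mean_y = sum(y) / n
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    return sum((f - mean_y) ** 2 for f in fitted)


bwt = [2813, 2581, 3172, 3314, 3555, 3312, 3643, 3442, 3185, 3667, 3029,
       3200, 3212, 3048, 2706, 3453, 2670, 3366, 3622, 2386, 3596, 3135]
sex = [0] * 10 + [1] * 12
ed1 = [1] * 4 + [0] * 6 + [1] * 3 + [0] * 9
ed2 = [0] * 4 + [1] * 3 + [0] * 6 + [1] * 5 + [0] * 4
ed1s = [e * s for e, s in zip(ed1, sex)]
ed2s = [e * s for e, s in zip(ed2, sex)]

mean_bwt = sum(bwt) / len(bwt)
ss_total = sum((v - mean_bwt) ** 2 for v in bwt)

ss_with = model_ss([sex, ed1, ed2, ed1s, ed2s], bwt)  # 5 df
ss_without = model_ss([sex, ed1, ed2], bwt)           # 3 df
ss_residual = ss_total - ss_with                      # 22 - 1 - 5 = 16 df

# F for the interaction: difference between the two models over the residual.
f_interaction = ((ss_with - ss_without) / (5 - 3)) / (ss_residual / 16)
print(ss_with, ss_without, ss_residual, f_interaction)
```

The values printed should agree with the table above to within rounding.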

At this point, therefore, we can conclude that no significant interaction exists between sex and ethnicity. In other words, the difference in birth weight between boys and girls is similar across the ethnic groups.

Step 2 : The regression formula obtained without the interaction variables can now be used to interpret the data.

The formula is Birth weight (y in grams) = 3395 - 183(Sex) - 270(ED1) - 61(ED2)
French (ED1=0 and ED2=0) boys (Sex=0) averaged 3395g
French girls are 183g lighter (t = 1.15, α=0.27)
German babies are 270g lighter than French babies of the same sex (t = 1.37, α=0.19)
Greek babies are 61g lighter than French babies of the same sex (t = 0.32, α=0.75)
None of these differences are statistically significant
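The coefficients and t values in Step 2 can be checked with a small least-squares routine. The sketch below is an assumption-laden illustration, not the StatTools implementation: the function name `ols` is hypothetical, and the standard errors are taken as the square root of the residual mean square times the diagonal of the inverse of X'X.

```python
import math


def ols(columns, y):
    """Fit y on the columns plus an intercept; return (coefficients, standard errors)."""
    n = len(y)
    X = [[1.0] + [float(col[r]) for col in columns] for r in range(n)]
    k = len(X[0])
    # Augment X'X with the identity so Gauss-Jordan elimination yields its inverse.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(k):
            if r != i:
                factor = A[r][i]
                A[r] = [vr - factor * vi for vr, vi in zip(A[r], A[i])]
    inverse = [row[k:] for row in A]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    b = [sum(inverse[i][j] * xty[j] for j in range(k)) for i in range(k)]
    residual_ss = sum((y[r] - sum(bi * xi for bi, xi in zip(b, X[r]))) ** 2
                      for r in range(n))
    mse = residual_ss / (n - k)  # residual mean square on n - k df
    se = [math.sqrt(mse * inverse[i][i]) for i in range(k)]
    return b, se


bwt = [2813, 2581, 3172, 3314, 3555, 3312, 3643, 3442, 3185, 3667, 3029,
       3200, 3212, 3048, 2706, 3453, 2670, 3366, 3622, 2386, 3596, 3135]
sex = [0] * 10 + [1] * 12
ed1 = [1] * 4 + [0] * 6 + [1] * 3 + [0] * 9
ed2 = [0] * 4 + [1] * 3 + [0] * 6 + [1] * 5 + [0] * 4

b, se = ols([sex, ed1, ed2], bwt)
t_values = [abs(coef) / err for coef, err in zip(b, se)]
for name, coef, err, t in zip(["Intercept", "Sex", "ED1", "ED2"], b, se, t_values):
    print(f"{name:9s} {coef:9.1f}  SE {err:7.1f}  t {t:5.2f}")
```

Because the model has no interaction terms, each coefficient is a difference from the reference group (French boys) that is assumed to be the same in every other group.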

Sex	Ethnicity	Gest	BWt
Boy	German	36	2813
Boy	German	35	2581
Boy	German	37	3172
Boy	German	38	3314
Boy	Greek	39	3555
Boy	Greek	38	3312
Boy	Greek	40	3643
Boy	French	39	3442
Boy	French	37	3185
Boy	French	41	3667
Girl	German	37	3029
Girl	German	39	3200
Girl	German	38	3212
Girl	Greek	37	3048
Girl	Greek	36	2706
Girl	Greek	40	3453
Girl	Greek	36	2670
Girl	Greek	39	3366
Girl	French	41	3622
Girl	French	35	2386
Girl	French	41	3596
Girl	French	38	3135

All previous discussions of Factorial Analysis of Variance and of Covariance Analysis are subsets of the full Factorial Covariance Model, which is discussed in this section. The data are presented in the table above and plotted in the accompanying diagram.

The aim is to analyse the influence of two factors, sex and ethnicity, on the birth weight of a baby, corrected for a single covariate, the gestational age in weeks. The algorithm to be used is multiple regression.

As the reasons for the various procedures have already been covered in the previous sections, only the various stages of computation will be listed here.

Sex	ED1	ED2	ED1S	ED2S	Gest	ED1G	ED2G	ED1SG	ED2SG	BWt
0	1	0	0	0	36	36	0	0	0	2813
0	1	0	0	0	35	35	0	0	0	2581
0	1	0	0	0	37	37	0	0	0	3172
0	1	0	0	0	38	38	0	0	0	3314
0	0	1	0	0	39	0	39	0	0	3555
0	0	1	0	0	38	0	38	0	0	3312
0	0	1	0	0	40	0	40	0	0	3643
0	0	0	0	0	39	0	0	0	0	3442
0	0	0	0	0	37	0	0	0	0	3185
0	0	0	0	0	41	0	0	0	0	3667
1	1	0	1	0	37	37	0	37	0	3029
1	1	0	1	0	39	39	0	39	0	3200
1	1	0	1	0	38	38	0	38	0	3212
1	0	1	0	1	37	0	37	0	37	3048
1	0	1	0	1	36	0	36	0	36	2706
1	0	1	0	1	40	0	40	0	40	3453
1	0	1	0	1	36	0	36	0	36	2670
1	0	1	0	1	39	0	39	0	39	3366
1	0	0	0	0	41	0	0	0	0	3622
1	0	0	0	0	35	0	0	0	0	2386
1	0	0	0	0	41	0	0	0	0	3596
1	0	0	0	0	38	0	0	0	0	3135

Step 1 : Preparation of the data

Sex : Boy=0, Girl=1
Two dummy variables for ethnicity: ED1=1 for German and 0 otherwise, ED2=1 for Greek and 0 otherwise (French babies have ED1=0 and ED2=0)
Two interaction variables between Sex and the dummy variables: ED1S = Sex * ED1, and ED2S = Sex * ED2
Gest : Gestation in weeks
Four interaction variables involving Gestation: ED1G = Gest * ED1, ED2G = Gest * ED2, ED1SG = Gest * ED1S, ED2SG = Gest * ED2S
BWt : Birth weight in grams(g)
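The data preparation listed above amounts to one small function per row. In this sketch the function name `covariance_row` and the dictionary output are assumptions made for readability; only the variable names and their definitions come from the list.

```python
def covariance_row(sex, ethnicity, gest, bwt):
    """Build the 10 independent variables and BWt for one baby."""
    s = 1 if sex == "Girl" else 0            # Boy = 0, Girl = 1
    ed1 = 1 if ethnicity == "German" else 0
    ed2 = 1 if ethnicity == "Greek" else 0
    ed1s, ed2s = s * ed1, s * ed2            # sex by ethnicity interactions
    return {
        "Sex": s, "ED1": ed1, "ED2": ed2, "ED1S": ed1s, "ED2S": ed2s,
        "Gest": gest,
        # Interactions involving gestation
        "ED1G": gest * ed1, "ED2G": gest * ed2,
        "ED1SG": gest * ed1s, "ED2SG": gest * ed2s,
        "BWt": bwt,
    }


first_boy = covariance_row("Boy", "German", 36, 2813)
greek_girl = covariance_row("Girl", "Greek", 37, 3048)
print(list(first_boy.values()))
print(list(greek_girl.values()))
```

The two rows printed correspond to the first German boy and the first Greek girl in the coded table.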

Step 2 : Interaction related to Gestation

Two analyses, inclusive and exclusive of the interaction variables involving gestation, are carried out to obtain the degrees of freedom and Sums of Squares of the two models.

The first analysis includes the 10 independent variables of Sex, ED1, ED2, ED1S, ED2S, Gestation, ED1G, ED2G, ED1SG, ED2SG, and the dependent variable Birth weight. The degrees of Freedom = 10, and Sums of Squares = 2729846
The second analysis excludes the 4 gestation related interaction variables (ED1G, ED2G, ED1SG, ED2SG). Six independent variables of Sex, ED1, ED2, ED1S, ED2S, Gestation, and the dependent variable BWt are analysed. The model degrees of freedom is now 6, and Sums of Squares = 2682429
The Analysis of Variance Table can now be constructed, as shown below and to the right. Probability of Type I Error for F=1.37, with 4 and 11 Degrees of Freedom α = 0.31, not statistically significant.

	df	SSq	MSq	F
Inclusive of Interaction	10	2729846
Exclusive of Interaction	6	2682429
Attributable to Interaction	10-6=4	2729846-2682429=47417	47417/4=11854	11854/8670=1.37
Residual	11	95369	95369/11=8670
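The probability of Type I Error quoted for each F can be approximated without a statistics library when the numerator degrees of freedom are even, as they are here (2 and 4). The sketch below uses a finite-sum form of the F tail probability; the function name `f_tail_even` is hypothetical, and in everyday use a library routine such as `scipy.stats.f.sf` would be the more usual choice.

```python
def f_tail_even(f_value, df1, df2):
    """P(F > f_value) for an F distribution with df1 (even) and df2 degrees of freedom.

    Uses the identity P(F > f) = I_x(df2/2, df1/2) with x = df2 / (df2 + df1*f),
    which reduces to a finite sum when df1/2 is a whole number.
    """
    if df1 % 2 != 0:
        raise ValueError("this finite-sum form needs an even numerator df")
    a = df2 / 2.0
    x = df2 / (df2 + df1 * f_value)
    total, term = 0.0, 1.0
    for j in range(df1 // 2):
        total += term
        # Next term: Gamma(a+j+1) / (Gamma(a) (j+1)!) * (1-x)^(j+1)
        term *= (a + j) / (j + 1) * (1.0 - x)
    return x ** a * total


p_gestation = f_tail_even(1.37, 4, 11)    # interaction involving gestation
p_sex_ethnic = f_tail_even(1.43, 2, 16)   # interaction in the factorial model
print(round(p_gestation, 2), round(p_sex_ethnic, 2))
```

When the numerator has 2 degrees of freedom the sum has a single term, so the probability simplifies to (1 + 2F/df2) raised to the power of -df2/2.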

At this point, we can conclude that there is no significant interaction involving gestation. In other words, growth rates in all groups are similar.

Step 3 : Evaluating Interaction between sex and ethnicity

As with gestation, consideration of interaction between sex and ethnicity also involves two analyses.

The first analysis includes 6 independent variables (Sex, ED1, ED2, ED1S, ED2S, Gestation) and the dependent variable BWt, which are subjected to analysis of variance using the multiple regression algorithm. The model degrees of freedom is 6, and Sums of Squares = 2682429.
The second analysis excludes the two interaction variables between sex and ethnicity (ED1S, ED2S). Four independent variables (Sex, ED1, ED2, Gestation) and the dependent variable BWt are subjected to analysis of variance using the multiple regression algorithm. The model degrees of freedom is 4, and Sums of Squares = 2673221.
The Analysis of Variance Table can now be constructed, as shown below and to the right. Probability of Type I Error for F=0.48, with 2 and 15 Degrees of Freedom α = 0.63, not statistically significant.

	df	SSq	MSq	F
Inclusive of Interaction	6	2682429
Exclusive of Interaction	4	2673221
Attributable to Interaction	6-4=2	2682429-2673221=9208	9208/2=4604	4604/9519=0.48
Residual	15	142785	142785/15=9519
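Steps 2 and 3 are the same comparison of a larger and a smaller model, applied twice. The following sketch wraps that comparison in one function. The names `model_ss` and `nested_f` are assumptions made for this illustration, and solving the normal equations by Gauss-Jordan elimination is simply a dependency-free way to obtain the fitted values.

```python
def model_ss(columns, y):
    """Model Sum of Squares from regressing y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [float(col[r]) for col in columns] for r in range(n)]
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(k):
            if r != i:
                factor = A[r][i]
                A[r] = [vr - factor * vi for vr, vi in zip(A[r], A[i])]
    b = [A[i][k] for i in range(k)]
    mean_y = sum(y) / n
    return sum((sum(bi * xi for bi, xi in zip(b, row)) - mean_y) ** 2 for row in X)


def nested_f(full, reduced, y):
    """F for the variables present in `full` but absent from `reduced`."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_full, ss_reduced = model_ss(full, y), model_ss(reduced, y)
    df_effect = len(full) - len(reduced)
    df_residual = len(y) - 1 - len(full)      # residual of the larger model
    return ((ss_full - ss_reduced) / df_effect) / ((ss_total - ss_full) / df_residual)


bwt = [2813, 2581, 3172, 3314, 3555, 3312, 3643, 3442, 3185, 3667, 3029,
       3200, 3212, 3048, 2706, 3453, 2670, 3366, 3622, 2386, 3596, 3135]
gest = [36, 35, 37, 38, 39, 38, 40, 39, 37, 41, 37,
        39, 38, 37, 36, 40, 36, 39, 41, 35, 41, 38]
sex = [0] * 10 + [1] * 12
ed1 = [1] * 4 + [0] * 6 + [1] * 3 + [0] * 9
ed2 = [0] * 4 + [1] * 3 + [0] * 6 + [1] * 5 + [0] * 4


def times(u, v):
    """Element-wise product, used to build interaction variables."""
    return [a * b for a, b in zip(u, v)]


ed1s, ed2s = times(ed1, sex), times(ed2, sex)
four = [sex, ed1, ed2, gest]
six = [sex, ed1, ed2, ed1s, ed2s, gest]
ten = six + [times(gest, ed1), times(gest, ed2), times(gest, ed1s), times(gest, ed2s)]

f_gestation = nested_f(ten, six, bwt)      # Step 2: 4 and 11 df
f_sex_ethnic = nested_f(six, four, bwt)    # Step 3: 2 and 15 df
print(f_gestation, f_sex_ethnic)
```

Note that each F uses the residual of the larger of its two models, which is why the residual degrees of freedom change from 11 in Step 2 to 15 in Step 3.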

At this point, we can conclude that there is no significant interaction between sex and ethnicity. In other words, once corrected for gestation, the difference in birth weight between boys and girls is similar in all ethnic groups.

Step 4 : Final Analysis

The regression formula in the last analysis, free of any interaction terms, can now be interpreted.

The formula is Birth Weight (y in g) = -4022 -166Sex + 58ED1 + 77ED2 + 191Gest(week)
Weight gain is 191g per week near term (t=15.94, α<0.0001, statistically highly significant)
A French Boy (Sex=0, ED1=0, ED2=0), at 40 weeks gestation, averaged 40*191-4022 = 3618g
A French girl is 166g lighter (t=4.04, α=0.0009, statistically highly significant)
German babies are 58g heavier than French babies with respective sex and gestation (t=1.06, α=0.30, not statistically significant)
Greek babies are 77g heavier than French babies with respective sex and gestation (t=1.55, α=0.14, not statistically significant)
We have established that gestation and sex of the babies significantly affect birth weight, but ethnic origins do not.
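The final formula can be applied directly to estimate the expected birth weight for any combination of sex, ethnicity, and gestation. This is a minimal sketch using the rounded coefficients quoted above; the function name `predict_bwt` is hypothetical, and predictions are only meaningful within the range of the data (35 to 41 weeks).

```python
def predict_bwt(sex, ethnicity, gest_weeks):
    """Expected birth weight in grams from the final (rounded) regression formula."""
    s = 1 if sex == "Girl" else 0            # Boy = 0, Girl = 1
    ed1 = 1 if ethnicity == "German" else 0
    ed2 = 1 if ethnicity == "Greek" else 0   # French: ED1 = ED2 = 0
    return -4022 - 166 * s + 58 * ed1 + 77 * ed2 + 191 * gest_weeks


french_boy_40 = predict_bwt("Boy", "French", 40)
german_girl_38 = predict_bwt("Girl", "German", 38)
print(french_boy_40, german_girl_38)
```

The first value is the French boy at 40 weeks worked out in the text.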

Multiple Regression

Steel RGD, Torrie JH, Dickey DA (1997) Principles and procedures of statistics. A biomedical approach. 3rd Ed. McGraw-Hill Inc New York NY 10020 ISBN 0-07-061028-2 p. 322-351

Sample Size for Multiple Regression

Cohen J. (1988) Statistical Power Analysis for the Behavioural Sciences. Second Edition. Lawrence Erlbaum Associates Publishers. Hillsdale New Jersey USA. ISBN 0-8058-0283-5. p. 407-410; 551.

Covariance Models

Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology. McGraw Hill Book Company New York. Library of Congress No. 73-14716407-047935-6 p.415-440.

This provides the template used on this page, of how to use multiple regression to carry out the analysis of covariance.

Steel RGD, Torrie JH, Dickey DA (1997) Principles and procedures of statistics. A biomedical approach. 3rd Ed. McGraw-Hill Inc New York NY 10020 ISBN 0-07-061028-2 p. 322-351

This provides the mathematics of calculating the coefficients and the analysis of variance for the multiple regression model.

Pedhazur E. (1997) Multiple Regression in Behavioral Research. Explanation and Prediction.(3rd. ed.) Harcourt Brace College Publishers, Fort Worth, USA. ISBN 0-03-072831-2 p.181-196.

This is a very detailed textbook dealing with multiple regression and the many ways it can be used, and a very useful reference book. It is included here, however, because it provides an excellent discussion on dummy variables.