
StatTools : Correlation and Regression Explained


Introduction Pearson's Correlation Regression Compare 2 Regression Lines (Covariance Analysis) References

This page introduces 3 commonly used parametric correlation / regression procedures.

Two nonparametric procedures are handled elsewhere. Spearman's Correlation is calculated in the Chi Square for Large Contingency Tables and Spearman Correlation Coefficient Program Page, and Regression using proportions is calculated in the Compare Two Regression Lines (Covariance Analysis) Program Page. As both are discussed in the Unpaired Proportions Explained Page, they are not covered further on this page.

Regression defines the relationship between measurements hierarchically. There is an assumption of order: the variable x comes first and is considered independent, while the variable y comes second and is dependent upon x. The model therefore implies either that knowing x allows a prediction of y, or that changing x changes y.

Statistically, only the y variable needs to be parametric (continuous and normally distributed). The x variable must be at least ordered (3>2>1), so it can be binary, ordinal, interval, or ratio measurements.

The analysis produces the formula y = a + bx, where a is the constant and b is the regression coefficient.

The constant a is the y value when x=0
The coefficient b is how much y is changed for each unit of change in x. As b is estimated from the data, it has an error, the Standard Error of the regression coefficient (SE_b)
Once the formula is established, the total variation of the data can be partitioned into the variation related to the regression and the residual variation around the regression line
However, as the regression line itself also contains an error (variation), calculation of the residual variation around the line must take this into consideration.
The residual error around the line is therefore dependent on the distance from the mean x value, where the Standard Error of y at any x value = sqrt(residual mean square × (1/n + (x - mean x)² / sum of squares of x))
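The quantities in the list above can be sketched in a few lines of Python. This is a minimal illustration using the standard least-squares formulas, not necessarily the code behind the program page; the function name `simple_regression` and the small data set in the usage note are made up for the example.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x (hypothetical helper, for illustration)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Corrected sums of squares and products
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sp_xy / ss_x                 # regression coefficient
    a = mean_y - b * mean_x          # constant: y value when x = 0
    ss_regression = b * sp_xy        # variation related to the regression
    ss_residual = ss_y - ss_regression
    residual_ms = ss_residual / (n - 2)   # residual mean square, df = n - 2
    se_b = math.sqrt(residual_ms / ss_x)  # Standard Error of b
    return {"a": a, "b": b, "se_b": se_b,
            "ss_regression": ss_regression,
            "ss_residual": ss_residual,
            "residual_ms": residual_ms}

result = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(result)
```

With the five made-up points, the total variation (6) splits into 3.6 for the regression and 2.4 for the residual, which is the partition described above.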

Example : Interpretations are best demonstrated in the results of analysis using the example data in the Correlation and Regression Program Page. The data relate how birth weight (y, in grams) depends on gestation (x, in weeks). The data points, the regression line, and the 95% confidence interval around the regression line are shown in the diagram to the right.

Regression : y = -5584.8592 + 230.2935x
SE of b = 21.3624
95% CI of b = 185.7325 to 274.8545
Residual Mean Square (Variance) = 43747.5803 and df=20
Standard error of y from regression value for any x value (SEYx)
= sqrt(residual mean square × (1/n + (x - mean x)² / sum of squares of x))
= sqrt(43747.5803 × (1/22 + (x - 37.7727)² / 95.8636))
95% CI of y for any x value (t = 2.086 for df=20)
= yx ± t × SEYx
= -5584.8592 + 230.2935x ± 2.086 × sqrt(43747.5803 × (1/22 + (x - 37.7727)² / 95.8636))
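As a rough check of the interval formula, the sketch below plugs the reported figures into Python and evaluates the 95% confidence interval at a chosen gestation. All constants are copied from the results above; the function name `ci_of_y` is hypothetical, and the t value of 2.086 (df=20) is hard-coded rather than looked up.

```python
import math

# Figures reported in the example results above
A = -5584.8592          # constant a
B = 230.2935            # regression coefficient b
RESIDUAL_MS = 43747.5803
N = 22
MEAN_X = 37.7727
SS_X = 95.8636          # sum of squares of x
T_CRIT = 2.086          # two-tail t for df = 20, p = 0.05

def ci_of_y(x):
    """Return (lower, predicted, upper) for y at gestation x (illustrative helper)."""
    y_hat = A + B * x
    se_yx = math.sqrt(RESIDUAL_MS * (1 / N + (x - MEAN_X) ** 2 / SS_X))
    return y_hat - T_CRIT * se_yx, y_hat, y_hat + T_CRIT * se_yx

for weeks in (35, MEAN_X, 40):
    lower, predicted, upper = ci_of_y(weeks)
    print(f"x={weeks}: y={predicted:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

Running it at several x values shows the interval is tightest at the mean gestation and wider at 35 or 40 weeks.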

Please note : Because the regression coefficient itself has a Standard Error, this is added to the residual error, so that the confidence interval around the regression line is narrowest at the mean of the x values and widens towards the periphery.

Sample Size : Theoretically, regression differs from correlation in that only the y variable assumes a normal distribution, while the x variable only needs to be ordered. Sample size calculation for regression should therefore follow the Analysis of Variance model, based on the F distribution.

However, the sample sizes so calculated are in most cases similar to those for correlation, and as correlation and regression are often calculated at the same time, the sample size calculated for correlation is also used for regression.

Pearson's Correlation Coefficient (ρ) is a measure of how two normally distributed measurements are related, as shown in the diagram on the left.

When the relationship is precise and in the same order (green dots), ρ=1. When the relationship is precise but in reverse order (red dots), ρ=-1. When there is no relationship at all (blue dots), ρ=0.
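The three cases can be illustrated with a short Python sketch of the usual product-moment formula. The function name `pearson_r` and the tiny data sets are invented for the illustration; the third data set has zero linear correlation by construction.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation coefficient (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4]
print(pearson_r(x, [10, 20, 30, 40]))   # precise, same order
print(pearson_r(x, [40, 30, 20, 10]))   # precise, reverse order
print(pearson_r(x, [1, -1, -1, 1]))     # no linear relationship
```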

The traditional results from analysis are the coefficient (ρ) and its Standard Error, assuming the coefficient to be a normally distributed variable. The statistical significance (the probability that the coefficient differs from the null value of 0) can then be evaluated using the t test.

However, it is increasingly accepted that the correlation coefficient is not truly normally distributed. As it cannot have a value outside of ±1, the variance on the extreme side is always narrower than the variance facing the zero value. The correlation coefficient is therefore only truly parametric when it has a value of 0, and the error increases as its value becomes nearer to ±1, as shown in the diagram to the right.

Increasingly, therefore, Fisher's Z transformation of the coefficient is used, so that Z is normally distributed. After the Standard Error and confidence interval of Z are calculated, the results are transformed back to the original ρ unit. The algorithm for Fisher's Z transformation is as follows.

Z = 0.5 × log((1+r)/(1-r)), where log is the natural logarithm, and the Standard Error is SE = 1/sqrt(n-3). The 95% confidence interval is Z-1.96SE to Z+1.96SE.
Reverse transformation for the coefficient and the confidence interval limits is ρ = (exp(2Z)-1) / (exp(2Z)+1).
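The transformation and its reverse can be sketched in Python as below. This is an illustration of the formulas just given, not the program's own code; the function name `fisher_ci` and the `z_crit` parameter are assumptions made for the example.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Fisher's Z, its SE, and the back-transformed confidence limits of r."""
    z = 0.5 * math.log((1 + r) / (1 - r))   # natural logarithm
    se = 1 / math.sqrt(n - 3)
    z_low, z_high = z - z_crit * se, z + z_crit * se

    def back(value):
        # Reverse transformation from Z to the correlation scale
        return (math.exp(2 * value) - 1) / (math.exp(2 * value) + 1)

    return z, se, back(z_low), back(z_high)

z, se, r_low, r_high = fisher_ci(0.9237, 22)
print(f"Z={z:.4f} SE={se:.4f} 95% CI of r = {r_low:.4f} to {r_high:.4f}")
```

Called with r = 0.9237 and n = 22, the values from the example data, it gives an interval of roughly 0.82 to 0.97; small differences in the last decimal place are expected because r is entered here already rounded. Passing `z_crit=1.645` instead gives one tail limits.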

The algorithm is best demonstrated in the results of analysis from the example data from the Correlation and Regression Program Page , which is a correlation between gestation (in weeks) and birth weight (in grams)

n = 22. There are 22 cases or pairs of measurements
x : mean = 37.7727 SD = 2.1366
y : mean = 3113.9545 SD = 532.697
Correlation Coefficient r (ρ) = 0.9237 SE=0.0857 t=10.7803 df=20. This is the traditional display of results
Fisher's Z Transformation Effect : Fisher's Z = 1.6135 SE=0.2294
Two Tail Model : 95% Confidence Interval of Z = 1.1638 to 2.0631
Reverse transformation : 95% Confidence Interval of correlation coefficient r = 0.8223 to 0.9682
One Tail Model : 95% Confidence Interval of Z >=1.235 or <=1.992
Reverse transformation : 95% Confidence Interval of correlation coefficient r >=0.844 or <=0.9635
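The traditional Standard Error and t value in the results above are consistent with the usual formulas SE = sqrt((1-r²)/(n-2)) and t = r/SE with df = n-2. The page does not state these formulas, so the sketch below is an assumption about how the figures were obtained; the function name `r_t_test` is hypothetical.

```python
import math

def r_t_test(r, n):
    """Traditional SE and t statistic for a correlation coefficient."""
    df = n - 2
    se = math.sqrt((1 - r * r) / df)
    t = r / se
    return se, t, df

se, t, df = r_t_test(0.9237, 22)
print(f"SE={se:.4f} t={t:.2f} df={df}")
```

With r entered to four decimal places, the t value agrees with the reported one to about two decimal places.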

Although results from both one and two tail models are provided, the one tail model is usually used, as in most cases, the researcher is only interested in whether the correlation coefficient is statistically significant, and therefore is only interested in the tail facing the null value (0).

Comparing two regression lines is the simplest model of covariance analysis. It uses the independent variable x as covariate and dependent variable y as outcome in a 2 group analysis of covariance. Two procedures are carried out.

Firstly, it computes the two regression lines y₁ = a₁ + b₁x and y₂ = a₂ + b₂x, then compares the two regression coefficients to see if they are significantly different. This is equivalent to evaluating whether interaction between groups and covariates exists.

Secondly, it assumes that the two regression coefficients are not significantly different (that there is no significant interaction), calculates a common regression slope for the whole set of data, then compares the mean dependent y values (adjusted for the common regression slope). This second part is the same as covariance analysis, where the dependent variable is y and the covariate is x.

Collectively, the first procedure, comparing the two regression coefficients, is equivalent to evaluating the presence of interaction between the covariates and the groups, and the second procedure, comparing the adjusted means of the two groups, is a simple analysis of covariance for two groups with one covariate, assuming no interaction.

Example

sex	Gest	BWt
G	37	3048
B	36	2813
G	41	3622
G	36	2706
B	35	2581
B	39	3442
G	40	3453
B	37	3172
G	35	2386
B	39	3555
G	37	3029
B	37	3185
G	36	2670
B	38	3314
G	41	3596
B	38	3312
G	39	3200
B	41	3667
B	40	3643
G	38	3212
G	38	3135
G	39	3366

The program in the Compare Two Regression Lines (Covariance Analysis) Program Page is best understood by following the default example in the program.

We wish to compare the birthweight of boys and girls, but we need to take into consideration the gestational age at birth. The data are therefore as follows. Sex is the sex of the baby (G for girls and B for boys). Gest is the gestational age (in weeks) at birth, and BWt is the birthweight in grams. The data are shown in the table to the left.

	GrpB	GrpG
n	10	12
meanx	38	38
SDx	1.8	2.0
meany	3268	3119
SDy	351.4	380.3
r	0.96	0.97
t	10.03	12.74
p	<0.0001	<0.0001
Slope (b)	185.3	186.9
Const (a)	-3772	-3998

The initial results are shown in the table to the right.

There were 10 boys and 12 girls in the study. Means and standard deviations for gestation (x) are 38 and 1.8 weeks for boys and 38 and 2.0 weeks for girls respectively. Means and standard deviations for birthweight (y) are 3268 and 351.4 g for boys and 3119 and 380.3 g for girls respectively. Correlation coefficients are high for both sexes, and the regressions are BWt(g) = -3772 + 185.3 Gest (weeks) for boys and BWt(g) = -3998 + 186.9 Gest (weeks) for girls.

The two slopes (b) are then compared

Diff in slope (b_B-b_G) = 1.6
SE_(Diff) = 23.4
t = 0.07 df = 18 p = 0.95

The results show that the two slopes are not significantly different, so that assuming a common slope is valid. In other words, there is no significant interaction between sex and gestation.

Assuming a common slope, the adjusted means are as follows.

Estimated common slope = 186.2
Diff. adj. means_(B-G) = 165
SE_Diff = 41
t = 4.03 df = 19 p = 0.0007

Using a common slope as a correction, the difference between the adjusted means is 165g, and this is statistically significant (p<0.05).

We can therefore draw the following conclusions from this study.

The growth rates in the two sexes are not significantly different (at least during 34-40 weeks).
Having adjusted for gestational age (covariate), boys are 165g heavier than girls at birth.

These results are best illustrated in the diagram to the left.

Please note : These conclusions are only correct if the growth of babies near term is linear (in a straight line at 185g per week).

Looking at the data carefully, this appears not to be true, as growth at earlier gestation appears faster and growth rates flatten nearer to 40 weeks. Users should therefore be aware that, regardless of how elegant the results seem to be, the validity of the conclusions ultimately depends on the validity of the model's assumptions.

Please also note : data in this example are computer generated, based on a published growth model. The data are constructed to demonstrate how the program works, and are not meant to represent actual growth physiology.
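The two procedures in this example (comparing the slopes, then comparing the adjusted means under a common slope) can be sketched in Python using the 22 cases from the data table. The formulas are the standard ones for a two group analysis of covariance and are an assumption about what the program does, although they reproduce the figures reported above. Function names are hypothetical. The p values are not computed, as this would need a t distribution function, and the slope difference is reported with its sign (boys minus girls).

```python
import math

# (sex, gestation in weeks, birthweight in g) from the example data table
DATA = [
    ("G", 37, 3048), ("B", 36, 2813), ("G", 41, 3622), ("G", 36, 2706),
    ("B", 35, 2581), ("B", 39, 3442), ("G", 40, 3453), ("B", 37, 3172),
    ("G", 35, 2386), ("B", 39, 3555), ("G", 37, 3029), ("B", 37, 3185),
    ("G", 36, 2670), ("B", 38, 3314), ("G", 41, 3596), ("B", 38, 3312),
    ("G", 39, 3200), ("B", 41, 3667), ("B", 40, 3643), ("G", 38, 3212),
    ("G", 38, 3135), ("G", 39, 3366),
]

def group_sums(rows):
    """n, means, and corrected sums of squares and products for one group."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    syy = sum((y - my) ** 2 for _, y in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return n, mx, my, sxx, syy, sxy

def compare_two_lines(rows1, rows2):
    n1, mx1, my1, sxx1, syy1, sxy1 = group_sums(rows1)
    n2, mx2, my2, sxx2, syy2, sxy2 = group_sums(rows2)
    b1, b2 = sxy1 / sxx1, sxy2 / sxx2
    # Procedure 1: compare the two slopes (df = n1 + n2 - 4)
    df_slope = n1 + n2 - 4
    resid_separate = (syy1 - b1 * sxy1) + (syy2 - b2 * sxy2)
    se_slope = math.sqrt(resid_separate / df_slope * (1 / sxx1 + 1 / sxx2))
    # Procedure 2: common slope and adjusted means (df = n1 + n2 - 3)
    b_common = (sxy1 + sxy2) / (sxx1 + sxx2)
    df_adj = n1 + n2 - 3
    resid_common = (syy1 + syy2) - b_common * (sxy1 + sxy2)
    diff_adj = (my1 - my2) - b_common * (mx1 - mx2)
    se_adj = math.sqrt(resid_common / df_adj *
                       (1 / n1 + 1 / n2 + (mx1 - mx2) ** 2 / (sxx1 + sxx2)))
    return {"diff_slope": b1 - b2, "se_slope": se_slope,
            "t_slope": (b1 - b2) / se_slope, "df_slope": df_slope,
            "b_common": b_common, "diff_adj": diff_adj, "se_adj": se_adj,
            "t_adj": diff_adj / se_adj, "df_adj": df_adj}

boys = [(x, y) for sex, x, y in DATA if sex == "B"]
girls = [(x, y) for sex, x, y in DATA if sex == "G"]
for name, value in compare_two_lines(boys, girls).items():
    print(f"{name} = {value:.2f}")
```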

A note on graphic plotting : The program is accompanied by a plot, with the following default settings.

Data points and the regression line for group 1 are in blue, and for group 2 in red. Blue dots are shifted slightly to the left, and red dots slightly to the right, to avoid overlapping
Assuming no difference in slope, the common slope line is in black.
The blue and red vertical lines outline the distance between the overall mean and adjusted means of the two groups, at the mean x value.

Comparing two regression coefficients using summary data

The Compare Two Summary Regression Coefficients Program Page is a similar program, but calculations are carried out using summary data. The purpose of this program is to enable covariance analysis, comparing the two regression coefficients and adjusted means, using results published by others, without the need to use the raw data itself.

The input is a two column table

Columns 1 and 2 are groups 1 and 2
Row 1 = sample size (n)
Row 2 = Mean for x
Row 3 = Standard Deviation for x
Row 4 = Mean for y
Row 5 = Standard Deviation for y
Row 6 = Regression Coefficient (b)

The same numerical results are produced, but without original data, no graphical presentations are made.
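One way such a calculation can work from the six input rows is to rebuild the sums of squares and products from the standard deviations and the regression coefficient, then apply the same covariance analysis formulas. The Python sketch below illustrates this under that assumption; it is not taken from the program, and the names `sums_from_summary` and `compare_from_summary` are hypothetical. The input values are the rounded figures from the initial results table of the boys and girls example.

```python
import math

def sums_from_summary(n, sd_x, sd_y, b):
    """Rebuild corrected sums of squares and products from summary rows."""
    sxx = (n - 1) * sd_x ** 2
    syy = (n - 1) * sd_y ** 2
    sxy = b * sxx            # because b = sxy / sxx
    return sxx, syy, sxy

def compare_from_summary(g1, g2):
    """g1, g2: dicts with keys n, mean_x, sd_x, mean_y, sd_y, b."""
    sxx1, syy1, sxy1 = sums_from_summary(g1["n"], g1["sd_x"], g1["sd_y"], g1["b"])
    sxx2, syy2, sxy2 = sums_from_summary(g2["n"], g2["sd_x"], g2["sd_y"], g2["b"])
    n1, n2 = g1["n"], g2["n"]
    # Compare the two regression coefficients
    df_slope = n1 + n2 - 4
    resid = (syy1 - g1["b"] * sxy1) + (syy2 - g2["b"] * sxy2)
    se_slope = math.sqrt(resid / df_slope * (1 / sxx1 + 1 / sxx2))
    t_slope = (g1["b"] - g2["b"]) / se_slope
    # Common slope and difference between adjusted means
    b_common = (sxy1 + sxy2) / (sxx1 + sxx2)
    diff_adj = (g1["mean_y"] - g2["mean_y"]) - b_common * (g1["mean_x"] - g2["mean_x"])
    return {"t_slope": t_slope, "df_slope": df_slope,
            "b_common": b_common, "diff_adj": diff_adj}

boys = {"n": 10, "mean_x": 38, "sd_x": 1.8, "mean_y": 3268, "sd_y": 351.4, "b": 185.3}
girls = {"n": 12, "mean_x": 38, "sd_x": 2.0, "mean_y": 3119, "sd_y": 380.3, "b": 186.9}
print(compare_from_summary(boys, girls))
```

Because the table values are rounded, the output differs from the raw data analysis. In particular, both mean gestations round to 38 weeks, so the adjusted difference comes out as 149g rather than 165g. Summary figures should therefore be entered with as many decimal places as the published source provides.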

Correlation and regression

Armitage P (1980) Statistical Methods in Medical Research. Blackwell Scientific Publications, Oxford UK. ISBN 0-632-05430-1 p. 147-166.

Significance and 95% confidence interval of correlation coefficient

t test : Armitage P. Statistical Methods in Medical Research (1971). Blackwell Scientific Publications. Oxford. P.156-163.

Confidence Interval : Altman DG, Machin D, Bryant TN and Gardner MJ. (2000) Statistics with Confidence Second Edition. BMJ Books ISBN 0 7279 1375 1. p. 89-92

http://www2.sas.com/proceedings/sugi31/170-31.pdf SAS paper discussing the need for Fisher's Transformation

Spearman's Correlation Coefficient

Siegel S and Castellan Jr. N J (2000) Nonparametric Statistics for the Behavioral Sciences (Second Edition). McGraw-Hill International Edition Sydney ISBN 0-07-100326-6 p.235-244, Table Q in p.360-361

Compare Two Regression Lines

Armitage P.(1980) Statistical Methods in Medical Research. Blackwell Scientific Publications. Oxford UK. ISBN 0 632 05430 1. p.279-301