unpaired diff exp

Comparing values between groups is a very common research model in the biomedical domain. The model can be used in epidemiology and surveys, such as comparing birth weight between boys and girls, or in randomized controlled trials, such as allocating different medications to groups of patients and compare their responses.

The research models themselves are complex and sophisticated, requiring careful control of bias. This page does not cover these aspects, but focussed only on the statistical procedures. To select the correct procedure, the following issues are addressed.

The nature of the data
- If the measurements are continuous and normally distributed, the powerful parametric statistical procedures can be used. Examples of parametric measurements are height and weight
- If the measurements are continuous, but not normally distributed, some form of transformation may be needed before parametric statistical procedures can be used. Examples of transformable measurements are ratios and time to events.
- If the measurements are not continuous, or if they are not normally distributed and cannot be transformed, then the nonparametric statistical procedures can be used. Examples of nonparametric measurements are 5 point Likert items, 10 point semantic differential scales, and many psychometric measurements.
- If the data are not measurements, such as counts or classifications, then they cannot be analysed as measurements. Examples of non-measurements are number of adverse events, proportion of surgical complications, sex of newborns.
The sample size. Programs in the Sample Size for Unpaired Differences Program Page and tables in the Sample Size for Unpaired Differences Tables Page can be used to estimate sample size requirements. For proper statistical inference, the exact sample size required should be calculated and used. Approximations useful in the early stages of research planning are :
- Pilot studies requires 6 to 20 subjects per group
- The main study requires 150-200 subjects per group to detect a small effect, 15-20 for a large effect, and in most clinical research situations 60-70 subjects per group for a moderate effect size.
- Nonparametric tests have approximately 95-99% the power efficiency of the equivalent parametric tests. Approximately, the sample size calculated for parametric tests should increase by 5%-10% for equivalent nonparametric tests.

Comparisons Sample Size Example

Comparison of Variances Comparison (2 Groups) Comparison (>2 Groups)

Parametric comparison of values between groups often do so with an assumption that the variations in all the groups under comparison are similar (homogeneous). The following tests are provided to test the homogeneity of variances in the groups. Variance is the square of Standard Deviation and represents variations in measurements in a group, and homogeneity means that difference between variances of groups are not statistically significant

StatTools provides three commonly used tests for homogeneity of variance.

The F Ratio tests for significant difference between variances from two groups. the formula is :
The ratio is the greater value over the smaller value, so it is always >1. The degrees of freedom of the two variances are one less than the sample size of the appropriate group (df = n-1). The probability of this F value, with the two degrees of freedom, are calculated in the Probability of F Program Page . The variances are then accepted as not significantly different, therefore homogeneous, if the probability is greater then 0.05. The F ratio is easy to compute, but it is excessively sensitive, particularly when the sample size is more than 30 per group.
The Bartlett's Test is the most commonly used test of significant difference between variances from two or more groups. The test uses n, mean, and Standard Deviation from each group under comparison, so it can be carried out using summary data. The result is a Chi Square with degrees of freedom one less than the number of groups. If the Probability of the Chi Square is greater than 0.05, then the variances are accepted as not significantly different, therefore homogeneous.
The Levine Test for significant difference between variances is the most precise one, and is the default test offered by SPSS. It is less often used however, as it requires the original data set of values.

Comparisons Sample Size Example

When measurements are not continuous and normally distributed, they are not parametric. Without the assumptions of normal distribution, the powerful methods using partition of variance cannot be applied.

In many cases, the data has a distribution that is mathematically related to normal distribution, such as a squared or exponential, and after some form of mathematical transformation, the parametric statistical tests can be applied.

In other cases, the nature of the data does not allow for such transformations, and the data has to be analysed as ordinal or ordered arrays. Examples of these can be as finely granular as a personality or depression score with a wide range, or more commonly a 10 point semantic differential scale, a 5 point Likert scale, or as coarse as a 3 point pain score (none (0), little (1), lots (2)).

StatTools offers one nonparametric test comparing two or more groups, two comparing two groups, and one for multiple (>2) groups.

Nonparametric comparisons of two or more groups of measurements : The Median Test

When there are only two groups, the Fisher's Exact Probability Test is used if the total sample size is less than 20, otherwise the Chi Square Test with Yates Correction is used.
Where there are more than 2 groups, the standard Chi Square Test for goodness of fit is used

Nonparametric comparisons of two groups of measurements :

The Wilcoxon Mann-Whitney Test is a test of the null hypothesis for two sets of ordinal data, assuming that the distributions in the two groups are similar.

The Mann-Whitney U Test, described in the 1962 edition of Siegal's Nonparametric Stratistics for behavioral Sciences, has been renamed the The Robust Rank Ordered Test , as the term Mann-Whitney U Test now refers to another test, described in Wikipedia. This test is not provided in StatTools, and the term Mann-Whitney U Test is only used to maintain backwards compatibility of this web site. The correct name of the test should be The Robust Rank Ordered Test .

The Robust Rank Ordered Test is considered more robust than the Wilcoxon Mann Whitney Test, because if makes no assumptions that the two groups are from the dsame population, and it is nearly as powerful as the parametric t test. It tests the null hypothesis that the two medians values are not different.

Nonparametric comparisons of three or more groups of measurements : The Kruskall Wallis One Way Analysis of Variance

As well as providing a significance test (α) for the null hypothesis, the program also produces the mean rank values for each group, which can be used in post hoc analysis comparing individual pairs of groups.
The Dunn's Test is one of the post hoc analysis between groups, and is carried out with the main analysis as it requires the original data for computation
The more flexible and commonly used post hoc test is the Least Significant Difference between Mean Ranks. This allows the comparison of the mean ranks between any two groups, assuming that every group is to be compared with every other group. This test is more flexible as it requires only the total sample size, and the sample sizes and mean rank values of the two groups being compared.

Introduction Example

The Permutation Tests are the most basic of statistical tests, from which other models have developed. StatTools presents two models, the significance test for paired differences presented in Paired Difference Programs Page , and the significance test comparing two groups presented in Unpaired Difference Programs Page .

The general principles are that, in a randomly allocated study, the data obtained could have been in either of two groups being compared. The test consists of calculating every possible permutation of the data, and examine the results. If the results from the original data is near the extremes (e.g. less than 5 percentile or more than 95 percentile in a one tail model), then a decision can be made that it is unlikely to be null and therefore statistically significant.

The advantages of using the Permutation tests are :

Exhaustive permutation allows the calculation of the precise probability that the data presented is null, so the tests calculate the Type I Error (α), with a power (1-β) of 100%.
The tests are not dependent on any assumption of data distribution, so they can be used in any regular interval data (where 10-9 is the same as 4-3). The tests can therefore be used on parametric measurements, ratios, variances, and time.
Because of the above two characteristics, the tests can be used with a very small sample size

The disadvantages of using the tests are related to the computation intensity required, both in the large memory use, and the time required for computation. In the unpaired situation, where the sample size of group 1 is n1 and that of group 2 is n2, and the total nt=n1+n2, the total number of permutation is the Binomial coefficient nt and n1,n2 (Number of permutation = nt!/(n1!n2!). Computation time therefore increases exponentially with increasing sample size, and large dataset may either crash the program when available RAM is exhausted, or the computation becomes unacceptably too long.

The Permutation Test is therefore ideal for comparing two groups using small sets of interval data with uncertain distributions. With larger sample size, the more common non-parametric (the Median Test, the Wilcoxon Mann Whitney Test or the Robust Rank Ordered Test), parametric (Unpaired t test) tests should be preferred.

In theory, the Permutation Test can cope with any sample size. However, a probability of <0.05 is not possible with less than 3 subjects in each group, and computation will take an unacceptably long time if the total sample size (n1 + n2) exceed 26 subjects.

The mathematical argument of the Permutation Test is as follows

In two groups of measurements, the null hypothesis is that there is no difference between the means. In other words, that the values observed can be in either of the group.
The Permutation Test therefore consists of examining the difference between the sums of values in the two groups, preserving the original sample sizes of the groups but have the data distributed in all possible permutations. The total number of permutation is nt!/(n1!n2!).
The difference between the sums in the original data is then compared with all possible sums, so that its probability can be estimated.

Tests for Homogeneity of Variances

Bartlett's Test : I have not read this reference, but it is quoted in the nist web site from which I obtained the algorithm.
Snedecor, George W. and Cochran, William G. (1989), Statistical Methods, Eighth Edition, Iowa State University Press.

Levene's Test : I have not read this reference, but it is quoted in the nist web site from which I obtained the algorithm.
Levene, H. (1960). In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, I. Olkin et al. eds., Stanford University Press, pp. 278-292.

Formulae : I obtained these from the National Institute of Science and Technology (NIST) resource website. The urls are handbook index, Bartlett test, and Levene test.

Parametric Comparisons

Confidence Intervals : Altman DG, Machin D, Bryant TN and Gardner MJ. (2000) Statistics with Confidence Second Edition. BMJ Books ISBN 0 7279 1375 1. p. 28-31

One-Way Analysis of Variance : Armitage P. Statistical Methods in Medical Research (1971). Blackwell Scientific Publications. Oxford. P.189-207.

Least significant difference (Tukey) :

Armitage P. Statistical Methods in Medical Research (1971). Blackwell Scientific Publications. Oxford. P.189-207
Steel R.G.D., Torrie J.H., Dickey D.A. Principles and Procedures of Statistics. A Biomedical Approach. 3rd. Ed. (1997)ISBN 0-07-061028-2 p. 191-192
Studentised range tables : Pearson ES, Hartley HO (1966) Biometrika table for statisticians Ed. 3 Table 29.

Least significant difference (Scheffe) :

Scheffe H (1959) The Analysis of Variance NY Wiley (quoted by everyone else but I have not read it)
Pedhazur E.J. Multiple regression in behavioral research explanation and prediction (3rd Ed) 1993. Harcourt Brace College Publishers, Orlando Florida. ISBN 0-03-072831-2 p. 369-371
Portney LG, Watkins MP (2000) Foundations of Clinical Research. Applications to Practice (Second Edition) Prentice Hall Health, New Jersey. ISBN 0-8385-2695-0. p. 460-461 A Biomedical Approach. 3rd. Ed. (1997)ISBN 0-07-061028-2 p. 189-190

Confidence interval of difference between means : Bird KD (2002) Confidence Intervals for Effect Sizes in Analysis of Variance. Educational and Psychological Measurements 62:2:197-226

Nonparametric Tests :

Siegel S and Castellan Jr. N J (1988) Nonparametric Statistics for the Behavioral Sciences 2nd. Ed. McGraw Hill, Inc. Boston Massachusetts. ISBN 0-07-057357-3

Median Test p.124 (2 groups), p.200 (3 or more groups).
Wilcoxon Mann Whitney Test p. 128-137.
Robust Rank Order Test (used to be called the Mann Whitney U Test) p. 137-144.
Kruskall-Wallis One Way Analysis of Variance p. 206-215.
Least significant difference between mean ranks P213-214.

Dunn's Test :

Zar Z.H. (1974) Biostatistical analysis (3rd.Ed) Prentice Hall, New Jersey. ISBN 0-13-084542-6. p 227-228.
Table for Q values for Dunn's Test: App. 106 Dunn O.J. (1964) Multiple contrasts using rank sums. Technometrics 6:241:252

Permutation Test : Siegel S and Castellan Jr. NJ (2000) Nonparametric Statistics for the Behavioral Sciences. Second Edition. McGraw Hill, Sydney. ISBN0-07-100326-6 p. 151-155

sample size

Two Groups : Machin D, Campbell M, Fayers, P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science IBSN 0-86542-870-0 p. 24-25

Multiple (3+) Groups : Cohen J (1988) Statistical power analysis for the behavioral sciences. Second edition. Lawrence Erlbaum Associates, Publishers. London. ISBN 0-8058-0283-5 p. 276-279, p. 550

Equivalence

Rogers JL, Howard KI, Vessey JT. (1993) Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin 113:553-565.

Jones B, Jarvis P, Lewis JA, Ebbutt AF. (1996) Trials to assess equivalence: the importance of rigorous methods. British Medical Journal 313:36-39

Hwang IK, Morikawa T. (1999) Design issues in noninferiority/equivalence trials. Drug Information Journal 33:1205-1218

Example 1 : Comparing two groups

Example 2 : Comparing 3 groups