Introduction
Nomenclature
Approximate Sample Size
Associated Calculations
References
As explained in the Probability Introduction and Explanation Page, statistical decisions became possible with Fisher's development of the Type I Error. A later improvement came with the Type II Error and the statistical significance model of Neyman and Pearson. That model was developed with the intention of providing a statistical decision, but it also provides a theoretical framework that mathematically relates the Type I Error (α), the Type II Error (β), a non-trivial value that represents a difference that matters, the Standard Deviation (SD), and the sample size. If four of these parameters are known, the fifth can be calculated.
From this, the sample size required to compare two groups can be estimated if the other four parameters are available. In other words, researchers can know the sample size they need for their results to be interpreted with confidence. Understanding this model, and the availability of an objective method to calculate the optimal sample size, allowed further development of sample size theories and practices. At the technical level, statistical modelling allows sample size calculations to extend to data with different types of distributions (e.g., proportions, counts, time to events, ranks) and to specialised research situations (e.g., phase II drug trials, post marketing trials, quality testing).

From the researcher's point of view, the availability of sample size estimation greatly assists the planning and evaluation of research. Knowledge of the appropriate sample size allows the researcher to estimate the time and resources required to complete the study, and therefore the feasibility and viability of the project. An undersized study produces either uninterpretable results or results that will not stand the test of time. On the other hand, a study larger than necessary wastes resources, inconveniences colleagues, and imposes unnecessary risks and discomfort on research subjects. The absence of adequate sample size considerations therefore signals poor research design and indicates bad, and possibly unethical, research. Increasingly, if sample size considerations are inadequate, granting bodies will not support, regulating bodies will not approve, and editors of scientific publications will not accept the results of a research project.

The importance of sample size is discussed clearly and comprehensively by Cohen (see References), and his work is very much recommended reading.
Sample size requirement in a two group comparison depends on four parameters:
The Effect Size. Conceptually, the effect size (ES) is the difference to be detected compared with the background variation, and the sample size required is inversely related to the effect size. In other words, the smaller the difference to be detected, or the larger the background variation, the larger the sample size required. Effect sizes for different statistical models are calculated differently.
A power analysis can be carried out at the end of the study using the collected data. This checks whether the difference and SD nominated during planning approximate those actually obtained in the data, and if not, how the interpretation of the data should be modified. Power analysis is particularly important if statistical decisions are based on statistical significance: a statistically non-significant conclusion is validated when the power in the data collected accurately reflects that proposed during planning. With the increased use of confidence intervals, power analysis becomes less important, as much of the information for decision making is conveyed through the confidence intervals. If a difference below which the effect can be considered trivial is defined, along with a tolerance interval, then a conclusion can be drawn as to whether the difference is large enough to be considered significant, small enough to be considered equivalent, or whether the data lacks power.

One Tail or Two Tail Model

One and two tail models are graphically explained in relationship to the t test in the Probability of Student's t Explanation and Tables. The following summarises the concept as it relates to sample size. In comparing two groups, a one tail model tests whether group A has a higher mean value than group B, without any consideration of whether group B has a higher mean value than group A. A two tail model tests whether the means of the two groups are different, without specifying which group has the higher value. Conceptually therefore, in terms of the 95% confidence interval of the difference, a one tail model spans 0% to 95%, while a two tail model spans 2.5% to 97.5% (the 5% divided equally between the two sides). The sample size requirement for the one tail model is therefore much smaller than for the two tail model, but the direction of the difference needs to be pre-specified.

For sample size calculations, the parameters are the same except for the Type I Error, where the p value for the one tail model is twice that for the two tail model. For example, when comparing two means where the effect size (difference / SD) is 0.5, power = 0.8, and Type I Error = 0.05, the required sample size is 64 per group. For the same study using the one tail model, an α of p = 0.1 is used in the calculation, and the sample size is 51 per group for a Type I Error of 0.05. For power calculations, the same adjustment is required. For example, in two equal groups of 64, with a difference between means of 0.5 and a within group SD of 1, the power is 0.8044 for a Type I Error of p = 0.05. For the one tail model, p = 0.05 is replaced with p = 0.1, so the power calculated is 0.8799 for a Type I Error of p = 0.05 (these calculations are sketched below).

In StatTools, the two tail model is used for sample size and power estimation, unless otherwise specified.
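The worked examples above can be reproduced approximately with the standard normal-approximation formulae, n = 2(z(1-α/2) + z(1-β))² / ES² for sample size and power = Φ(ES·√(n/2) - z(1-α/2)) for power. The Python sketch below illustrates these formulae; it is not the algorithm StatTools itself uses. Because it ignores the t distribution, it returns sample sizes about one case smaller (63 and 50 instead of 64 and 51) and powers a fraction higher (about 0.807 and 0.882 instead of 0.8044 and 0.8799).

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_two_means(effect_size, alpha=0.05, power=0.8, tails=2):
    """Approximate n per group for comparing two means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / tails)  # one tail: 1 - alpha; two tail: 1 - alpha/2
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

def power_two_means(effect_size, n_per_group, alpha=0.05, tails=2):
    """Approximate power for comparing two means with n cases per group."""
    z_alpha = norm.ppf(1 - alpha / tails)
    return norm.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)

print(sample_size_two_means(0.5, tails=2))          # 63 (t correction gives 64)
print(sample_size_two_means(0.5, tails=1))          # 50 (t correction gives 51)
print(round(power_two_means(0.5, 64, tails=2), 3))  # ~0.807
print(round(power_two_means(0.5, 64, tails=1), 3))  # ~0.882
```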
Parametric
The precision of a sample size calculation depends very much on a known and accurate population (or within group) Standard Deviation. The difficulty is that this Standard Deviation is mostly unknown. Most users rely on published Standard Deviations, but these are themselves obtained by sampling, so they contain errors.
One way around the difficulty is to recognise that the sample size calculation is based not so much on the within group Standard Deviation as on the Effect Size, the ratio of the difference to be detected to the within group Standard Deviation. Cohen, in his 1992 paper, suggested that for most research situations an approximate Effect Size can be nominated, from which an approximate sample size can be determined. Using the sample size per group in a two group analysis of variance as a demonstration model, the arguments are as follows.
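As a minimal sketch of this approach, assuming Cohen's conventional effect sizes for a two group comparison of means (d = 0.2 small, 0.5 medium, 0.8 large) and the normal-approximation formula used earlier, approximate sample sizes per group can be computed as below. Cohen's published tables, which use the t distribution, give 393, 64, and 26, a case or two larger for the medium and large effects.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.8):
    # normal approximation: n = 2 * (z_alpha + z_beta)^2 / ES^2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * z ** 2 / effect_size ** 2)

for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label:6s} d={d}  n per group = {n_per_group(d)}")
# small  d=0.2  n per group = 393
# medium d=0.5  n per group = 63
# large  d=0.8  n per group = 25
```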
Nonparametric

A number of procedures to estimate sample size requirements for nonparametric comparisons have been proposed, but most of them are suitable only for specific circumstances and difficult to use generally. For example, sample size calculations for nonparametric paired and unpaired ordinal scales are available, but they require users to define the number of ranks in the data and to model the proportion of each group in each rank. These procedures are therefore not presented in StatTools.
Instead, an approximation of the sample size requirement can be computed using an assumption of normal distribution. Users are cautioned, however, that this is only an approximation, and many statisticians do not approve of this approach, as by the very nature of nonparametric measurements any assumption of normal distribution incorporates an unknown error. For those who nevertheless need to calculate a sample size, the reasoning is sketched below.
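The sketch below uses a common textbook approximation, often attributed to Lehmann, rather than the specific algorithm in StatTools: compute the sample size for the corresponding parametric (t) test as if the data were normally distributed, then inflate it by the reciprocal of the asymptotic relative efficiency (ARE) of the rank test. For the Mann-Whitney (Wilcoxon) test against the t test, the ARE is 3/π ≈ 0.955 when the data really are normally distributed, with a worst-case lower bound of 0.864, so the inflation is between roughly 5% and 16%.

```python
from math import ceil, pi
from scipy.stats import norm

def n_wilcoxon(effect_size, alpha=0.05, power=0.8, are=3 / pi):
    """Approximate n per group for a Mann-Whitney test:
    t test sample size inflated by 1/ARE (3/pi under normality, 0.864 worst case)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_t = 2 * z ** 2 / effect_size ** 2   # t test n, normal approximation
    return ceil(n_t / are)

print(n_wilcoxon(0.5))             # ~66 per group (vs 63 for the t test)
print(n_wilcoxon(0.5, are=0.864))  # ~73 per group, conservative bound
```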
The sample size programs in StatTools usually provide one or more of the following associated calculations:
Sample Size Calculation
Power Estimation
Confidence Intervals
Pilot Studies
Sample size calculations are carried out at the planning stage of a research project, and are based on three parameters.
As explained in the Probability Introduction and Explanation Page, the use of statistical significance has been increasingly criticised, because estimates of the background Standard Deviation are so often incorrect. Increasingly, the data obtained is examined to evaluate its power. In this context, power is the probability of correctly rejecting the null hypothesis: the ability to detect a difference if it really exists.
Research results with an insufficiently large effect size will show a probability of Type I Error (p value) that is too large to reject the null hypothesis (p>0.05). However, the effect size is a ratio of the effect (e.g. the difference between two groups) to the background variation (e.g. the within group Standard Deviation). If the effect is smaller than the critical value proposed during research planning, then the correct conclusion can be drawn that it is too small to be statistically significant. However, if the effect is adequate, but the background variation is larger than that envisaged during planning, then the research design, execution, or both are flawed, and no statistical conclusion can be drawn. Power estimation allows the researcher to determine whether the power in the data collected is comparable to that assumed during planning (e.g. power = 0.8). If the power is inadequate, the researcher can decide whether to abandon the results, or to collect more data until the power reaches the level planned. A sketch of such a check follows.
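A minimal sketch of such a post hoc check, using the normal approximation introduced earlier (the function name and the example data are illustrative assumptions, not part of StatTools):

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import norm

def observed_power(group_a, group_b, alpha=0.05):
    """Power implied by the collected data; normal approximation, two tail,
    equal group sizes assumed."""
    n = min(len(group_a), len(group_b))          # n per group
    # pooled within group SD (equal group sizes)
    sd = sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2)
    d = abs(mean(group_a) - mean(group_b)) / sd  # observed effect size
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * sqrt(n / 2) - z_alpha)

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]  # invented illustration data
group_b = [4.6, 4.9, 4.4, 5.0, 4.5]
# compare the result against the planned power (e.g. 0.8)
print(round(observed_power(group_a, group_b), 2))
```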
The confidence interval is another method of evaluating the data collected, without making any assumptions other than normal distribution. Usually the 95% confidence interval is estimated from the effect and its Standard Error, and the null hypothesis is rejected if this interval does not overlap the null value.
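A minimal sketch for the 95% confidence interval of a difference between two means, assuming equal variances and a pooled Standard Deviation (the data are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t

def diff_ci(group_a, group_b, level=0.95):
    """Confidence interval for the difference between two means (pooled SD)."""
    na, nb = len(group_a), len(group_b)
    sd = sqrt(((na - 1) * stdev(group_a) ** 2 + (nb - 1) * stdev(group_b) ** 2)
              / (na + nb - 2))                  # pooled within group SD
    se = sd * sqrt(1 / na + 1 / nb)             # Standard Error of the difference
    diff = mean(group_a) - mean(group_b)
    margin = t.ppf((1 + level) / 2, na + nb - 2) * se
    return diff - margin, diff + margin

# Reject the null hypothesis of no difference if the interval excludes 0.
lo, hi = diff_ci([5.1, 4.8, 5.5, 5.0], [4.2, 4.0, 4.6, 4.3])
print(round(lo, 3), round(hi, 3))   # ~0.353 to ~1.297, excludes 0
```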
Pilot studies are conducted for two main reasons. The first is to test the feasibility and organisational structure of a research project; the second is to establish some ball park figures for the population parameters and for what the results are likely to be, which can be used in the planning of the main research project.
The reference section lists two excellent papers on pilot studies, and those interested should read these papers to get greater detail and insight into pilot studies than can be provided in this introduction. The remainder of this section addresses sample size for pilot studies, adapted from the descriptions provided by Johanson and Brooks (2010).

In hypothesis testing, sample size is estimated according to the tolerance of Type I and Type II Errors, so that the results can be interpreted with confidence. For a pilot study, however, the aim is to obtain some ball park figures of what the main research project is likely to find, so the sample size required depends on a balance between the precision that can be achieved and the cost incurred in data collection. In other words, there is a diminishing return, with smaller and smaller improvements in precision as sample size increases. In planning a pilot study, the researcher has to make a value judgement as to when a very small further improvement in precision is no longer worth the much greater cost of a further increase in sample size. This decision is based on when the sample size/confidence interval curve flattens out, and in most cases this occurs when the sample size is between 10 and 40 cases.

An Example
This example shows the use of sample size estimation in a pilot study, based on the confidence interval of a difference between two means, calculated using the algorithm in the Sample Size for Comparing Two Means Program Page. The example examines the 95% confidence interval of the effect size, in terms of the difference between two means in Standard Deviation units (which is the same as the difference between two means if the background SD is 1), and the relationship between increasing sample size and the width of this interval.
The 95% CI decreases from 0.88 SD units with a sample size of 5 to 0.80 with a sample size of 6, a decrease of 0.08 SDs (9% of 0.88). However, the 95% CI decreases from 0.39 SD units with a sample size of 25 to 0.38 with a sample size of 26, a decrease of 0.01 (2% of 0.39), very much smaller than that occurring at the 5-6 level. In other words, there is a diminishing return, with smaller and smaller improvements in precision as sample size increases. The researcher can therefore reasonably decide in this case that a sample size of between 5 and 10 per group would be appropriate for the pilot study, as further increases improve the precision of the study only trivially.
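The quoted interval widths are consistent with a half-width of z(1-α/2)/√n in SD units; assuming that formula (an inference from the quoted figures, not a documented feature of the program), the diminishing return can be reproduced as follows:

```python
from math import sqrt
from scipy.stats import norm

def ci_width(n, level=0.95):
    """CI width in SD units that reproduces the quoted figures: z / sqrt(n)."""
    return norm.ppf((1 + level) / 2) / sqrt(n)

prev = None
for n in range(5, 27):
    w = ci_width(n)
    gain = (prev - w) / prev if prev else 0.0  # relative improvement per extra case
    print(f"n={n:2d}  width={w:.2f}  gain={gain:.1%}")
    prev = w
# n= 5  width=0.88 ...  n= 6  width=0.80  gain=8.7%
# n=25  width=0.39 ...  n=26  width=0.38  gain=1.9%
```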
Cohen J (1992) A power primer. Psychological Bulletin 112(1):155-159.
Cohen J (1988) Statistical Power Analysis for the Behavioral Sciences. Second edition. Lawrence Erlbaum Associates, London. ISBN 0-8058-0283-5.
Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second edition. Blackwell Science. ISBN 0-86542-870-0. Chapter 1 (p. 1-10) provides a detailed and clear explanation of sample size and power; within it, section 1.3.5 (p. 4) gives a short but authoritative explanation of the one and two tail models.
Chinn S (2000) A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine 19:3127-3131.
Lancaster GA, Dodd S, Williamson PR (2004) Design and analysis of pilot studies: recommendations for good practice. Journal of Evaluation in Clinical Practice 10(2):307-312.
Johanson GA, Brooks GP (2010) Initial scale development: sample size for pilot studies. Educational and Psychological Measurement 70(3):394-400.