StatTools : Equivalence Explained and Sample Size Tables


Related link :
Sample Size Introduction and Explanation Page

Equivalence Between Two Means (Sample Size and Analysis) Program Page
Equivalence Between Two Proportions (Sample Size and Analysis) Program
Bioequivalence (Sample Size and Analysis) Program Page

Historical perspective

Details of the development of probability theory in statistics are provided in the Probability Introduction and Explanation Page, and its relationship to sample size in the Sample Size Introduction and Explanation Page. This page gives only a brief summary, as context for the discussion of equivalence.

In the early 20th century, Fisher developed the idea of Type I Error, based on the Normal Distribution, which allows a probability estimate for whether the null hypothesis can be rejected. This allows decisions to be made in science and industry on whether a new product or process is better or worse than those currently available. However, if the null hypothesis cannot be rejected, the researcher cannot draw any statistical conclusion, as a failure to reject the null hypothesis is not the same as an ability to accept it.

Shortly afterwards, Neyman and Pearson added the idea of Type II Error and statistical power, so that decisions both to reject and to accept the null hypothesis could be made. Although this method was widely used in the twentieth century, it was increasingly criticised because research results are often not reproducible, owing to the difficulty of determining the population Standard Deviation.

To make statistical conclusions more robust, researchers increasingly used the 95% confidence interval of the difference, which is an intuitively easier expression of the Type I Error to understand, and carried out power analysis, which makes no assumption about population parameters. The combination of these two approaches allows researchers to draw confident conclusions on whether two sets of observations can be considered significantly different. However, the problem remains that a failure to demonstrate a significant difference is not the same as a demonstration of similarity, and the ability to demonstrate similarity robustly is increasingly required, particularly in biomedical research.

An example is cancer treatment. The current treatment may have severe side effects, and a new treatment may have much more acceptable side effects, but the researcher needs to know whether the new treatment's effectiveness in controlling the cancer is the same as, or at least not inferior to, that of the current treatment.

The concept of equivalence

Given the random variation inherent in any set of observations, it is very unlikely that two groups can be shown to have exactly the same mean value. The term equivalence is therefore used to represent similarity. It depends on a pre-determined and arbitrarily assigned Critical Difference (CD) or Tolerance Limit (TL), a difference that can be considered trivial in the practical sense. Using the 95% confidence interval (CI) of the difference to illustrate, the various conclusions that can be drawn are as shown in the diagram to the right. In the following, the difference is that of group 1 minus group 2 (diff = mean1 - mean2).

  • If the 95% CI does not include the null value, as shown in the 1st and the 5th lines in the diagram, then a significant difference exists. This is true whether the 95% CI is calculated using the one or the two tail model.
  • If the 95% CI lies to the left (diff<null), includes the null value, but does not reach the positive Critical Difference (CD) value, as shown in the 2nd line in the diagram, then group 1 is significantly not greater than group 2. This is true whether the 95% CI is calculated using the one or the two tail model.
  • If the 95% CI lies to the right (diff>null), includes the null value, but does not reach the negative Critical Difference (CD) value, as shown in the 4th line in the diagram, then group 1 is significantly not less than group 2. This is true whether the 95% CI is calculated using the one or the two tail model.
  • If the 95% CI lies near the center, includes the null value, but reaches neither the negative nor the positive Critical Difference (CD) value, as shown in the 3rd line in the diagram, then the two groups are significantly equivalent. This is true only if the 95% CI is calculated using the two tail model.
  • If the 95% CI reaches both the negative and the positive Critical Difference values, as shown in the 6th line in the diagram, then the data lack sufficient power to draw statistical conclusions, usually because the variance is too large or the sample size is not big enough. This is true whether the 95% CI is calculated using the one or the two tail model.
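The five rules above can be expressed as a short decision function. The following Python sketch is illustrative only and is not StatTools code: the function name classify_ci, its parameters, and the returned labels are all hypothetical. It assumes the interval is that of diff = mean1 - mean2, that cd is a positive Critical Difference, and that an interval touching a limit exactly counts as reaching it.

```python
def classify_ci(lower, upper, cd, null=0.0):
    """Classify a confidence interval of (mean1 - mean2).

    lower, upper : bounds of the interval
    cd           : Critical Difference (positive); limits are null - cd and null + cd
    null         : the null value of the difference (0 by default)
    All names and labels here are illustrative, not part of StatTools.
    """
    if cd <= 0:
        raise ValueError("cd must be positive")
    if lower > upper:
        raise ValueError("lower must not exceed upper")

    # Interval excludes the null value: significant difference.
    if lower > null or upper < null:
        return "significantly different"

    below_positive_cd = upper < null + cd   # does not reach +CD
    above_negative_cd = lower > null - cd   # does not reach -CD

    if below_positive_cd and above_negative_cd:
        # Only a valid conclusion for a two tail interval.
        return "equivalent"
    if below_positive_cd:
        return "group 1 not greater than group 2"
    if above_negative_cd:
        return "group 1 not less than group 2"
    # Interval reaches both limits: no conclusion can be drawn.
    return "inconclusive"
```

For example, with cd = 1.0, an interval of (-0.5, 0.5) falls under the equivalence rule, while (-2.0, 2.0) reaches both limits and is inconclusive.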

One tail or two tail

In most textbooks and published papers, statistics related to equivalence use the one tail model. This is because most equivalence related research is concerned with non-inferiority, so that significantly not greater or significantly not less is the hypothesis to be tested. As the one tail model allows these conclusions and requires a smaller sample size, this is the model to use. StatTools, however, provides calculations for both one and two tails in case any user requires them.
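The practical effect of the choice of tails can be seen in the width of the interval. The sketch below is an assumption-laden illustration, not the StatTools algorithm: it uses a Normal approximation, takes the standard error of the difference (se) as given, and treats a one tail 95% interval as placing the whole 5% in one tail, so that its multiplier is about 1.645 rather than about 1.96. The function name ci_of_difference and its parameters are hypothetical.

```python
from statistics import NormalDist


def ci_of_difference(diff, se, level=0.95, tails=2):
    """Normal-approximation interval for a difference between two means.

    diff  : observed difference (mean1 - mean2)
    se    : standard error of the difference (assumed known)
    level : confidence level, e.g. 0.95
    tails : 1 or 2; with 1, the whole (1 - level) is placed in a single tail
    Illustrative only; names are not taken from StatTools.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    if se <= 0 or not 0 < level < 1:
        raise ValueError("se must be positive and level between 0 and 1")
    tail_area = (1.0 - level) / tails
    z = NormalDist().inv_cdf(1.0 - tail_area)  # Normal quantile for the tail area
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    print(ci_of_difference(0.3, 0.4, tails=1))
    print(ci_of_difference(0.3, 0.4, tails=2))
```

Because the one tail multiplier is smaller, the same data give a narrower interval, which is why the one tail model needs a smaller sample size to keep the interval clear of a Critical Difference limit.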

Sample size and power calculation

Sample size and power calculations for equivalence differ from those for significant difference in two ways.

Firstly, the decision is based not only on the relationship between the observed difference and the null value, but also on its relationship to the positive and negative Critical Difference values.

Secondly, robustness is required for the conclusion of significant equivalence rather than for that of significant difference. The Probability of Type I Error (α) is therefore relaxed, and the common values of 0.1 or 0.2 are used instead of 0.05 or 0.01. The Probability of Type II Error, expressed in terms of power, is made stricter, so power values of 0.9, 0.95, or 0.99 are used instead of 0.8.
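To show how these choices of α and power feed into a sample size, the sketch below uses the common textbook Normal approximation for a one tail (non-inferiority) comparison of two means, n per group = 2((z(1-α) + z(power)) × SD / CD)², which assumes that the true difference is zero and that both groups share a known Standard Deviation. This is an assumption made for illustration and may not match the algorithms StatTools uses; the function name n_per_group_one_tail and its parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_one_tail(sd, cd, alpha=0.1, power=0.9):
    """Approximate sample size per group for a one tail comparison of two means.

    sd    : assumed common population Standard Deviation
    cd    : Critical Difference (positive)
    alpha : Probability of Type I Error (one tail)
    power : 1 - Probability of Type II Error
    Textbook Normal approximation with the true difference assumed to be zero;
    not necessarily the method used by StatTools.
    """
    if sd <= 0 or cd <= 0:
        raise ValueError("sd and cd must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie between 0 and 1")
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha)   # quantile for the Type I Error
    z_beta = z(power)          # quantile for the Type II Error
    return ceil(2.0 * ((z_alpha + z_beta) * sd / cd) ** 2)


if __name__ == "__main__":
    # Compare a relaxed alpha with stricter power against stricter still.
    for alpha, power in [(0.1, 0.9), (0.1, 0.95), (0.2, 0.99)]:
        print(alpha, power, n_per_group_one_tail(sd=1.0, cd=0.5, alpha=alpha, power=power))
```

Under this approximation, the sample size grows with the square of SD / CD, so halving the Critical Difference roughly quadruples the number of subjects required.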

Statistical Decisions

Users should remember that good statistical practice requires the hypothesis to be tested to be defined at the planning stage, with statistical procedures then used to reject or support that hypothesis. The common malpractice of doing the statistical calculations first, then cherry-picking the hypothesis according to how the numbers come together, should be avoided. Whether the study tests significant difference or equivalence, the direction of non-inferiority, and whether the model should be one or two tail must all be determined at the planning stage, before data collection and analysis.

Algorithms available

StatTools provides 3 algorithms for equivalence analysis.