StatTools : Paired Difference Explained

Introduction

The paired difference is a powerful and commonly used model in clinical research and quality control. By examining the difference in outcomes from different causes in the same individual, variation between individuals is greatly reduced, so a smaller difference can be detected with a small sample.

In clinical research, questions such as whether husbands are older than their wives, whether boys are heavier than girls in non-identical twins, or whether two different treatments for the same condition have different effects in the same patient, all use the paired difference model.

Paired difference is also commonly used in quality control, with each pair evaluating a measurement against a standard. Questions such as whether the waiting time for operations exceeds a benchmark, or whether blood loss in operations exceeds that expected, also use the paired difference model.

StatTools provides three programs for evaluating paired differences: the parametric paired t test with the 95% confidence interval of the paired difference, the nonparametric Wilcoxon Paired Signed Rank Test, and the Permutation Test. The algorithms for calculation are in the Paired Difference Programs Page, and each test is discussed in its own panel.

Parametric Paired t Test

Parametric paired difference calculates the difference between the pairs of values (d = v1 - v2), summarizes these as n, mean, Standard Deviation, and Standard Error of the mean. It then evaluates the mean and its Standard Error against the null hypothesis that mean = 0.

Two tests are performed.

The previously standard t test, which gives the probability of Type I error (α). Commonly the mean is taken to deviate significantly from null (0) if α<0.05
The 95% confidence interval of the difference. Commonly the mean is taken to deviate significantly from null (0) if the 95% confidence interval does not include the null value (0).

Both tests can be in the one tail or two tail model.

The two tail model is used when the researcher is interested in whether a significant difference exists in either direction, v1>v2 or v1<v2
The one tail model is used when the researcher is interested in whether a significant difference exists in one direction, v1>v2 or v1<v2, but not both. The one tail model is more powerful, but it requires the researcher to determine the direction of interest before interpreting the results.

The 95% confidence interval is provided by default, as this is the most commonly used. For researchers requiring a different percentage (e.g. 80%, 90%, or 99% confidence intervals), the following calculations can be carried out.

Calculate α, where α = (100-%) / 100. For 80% CI α=0.2, for 90% CI α=0.1, for 95% CI α=0.05, for 99% CI α=0.01
Using either the Probability of Student's t Explanation and Tables Page or the Probability of t Program Page , stipulating the degrees of freedom and whether the one or two tail model is being used, the t value for α is obtained.
The confidence interval is then mean±t(SE)
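The three steps above can be sketched in code for readers without access to the t tables. This is only an illustration: the function names (t_pdf, t_cdf, t_critical, confidence_interval) are hypothetical and are not the StatTools programs, and the t value is found by numerical integration and bisection, which is an assumption made here to keep the sketch free of external libraries.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus the lower half (0.5)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(alpha, df, two_tail=True):
    """t value for a given alpha, found by bisection on the cumulative probability."""
    target = 1 - (alpha / 2 if two_tail else alpha)
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(mean, se, df, percent=95, two_tail=True):
    """mean +/- t(SE), with alpha = (100 - percent) / 100."""
    alpha = (100 - percent) / 100
    t = t_critical(alpha, df, two_tail)
    return mean - t * se, mean + t * se

# t values for 20 pairs (19 degrees of freedom), two tail model
for percent in (80, 90, 95, 99):
    print(percent, round(t_critical((100 - percent) / 100, 19), 3))
```

For 19 degrees of freedom the two tail values are approximately 1.33, 1.73, 2.09 and 2.86, which agree with standard t tables.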

The Sample Size for Mean of Paired Difference Explanations and Tables Page provides 4 programs for sample size issues related to the parametric paired comparison. The theory and interpretation of these programs are discussed in general terms in the Sample Size Introduction and Explanation Page, and only details specific to paired differences are discussed here.

Two programs are available to assist in the planning phase of a research project

Pilot study evaluates the 95% confidence interval from a nominated Standard Deviation of the paired difference and the sample size. The program tabulates the 95% confidence interval with increasing sample size, thereby helping the research planner to determine the most cost effective sample size to use in a pilot study. In most cases 10 to 30 pairs are used, balancing cost against precision.
Sample size estimates the sample size (number of pairs) required, based on the following
- Probability of Type I Error (α) to be used to determine statistical significance, the most common value used is α=0.05
- The power or sensitivity of the data to detect a difference if it really exists, the most common value used is power=0.8
- The mean of paired differences to be detected
- The expected Standard Deviation of the paired differences
- Whether the one tail or two tail model is to be used
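As an illustration of how the five inputs listed for the sample size program combine, the sketch below uses a normal approximation with a small correction term, n = (z_α + z_power)² / (diff/SD)² + z_α²/2. This formula is an assumption: the page does not state which formula the StatTools program uses, and the function name paired_sample_size is hypothetical. It does reproduce the sample sizes quoted in the worked example on this page.

```python
import math
from statistics import NormalDist

def paired_sample_size(diff, sd, alpha=0.05, power=0.8, two_tail=True):
    """Number of pairs needed to detect a mean paired difference `diff`."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tail else z(1 - alpha)
    z_power = z(power)
    effect = diff / sd                      # standardised difference
    n = (z_alpha + z_power) ** 2 / effect ** 2 + z_alpha ** 2 / 2
    return math.ceil(n)                     # round up to whole pairs

print(paired_sample_size(100, 150))   # 20 pairs, planning values of the twin example
print(paired_sample_size(100, 456))   # 166 pairs, with the observed Standard Deviation
```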

Two programs are available to assist in the evaluation of the data in the analysis phase of a research project

Power estimation evaluates the power of the data, to determine whether it reaches the level planned. Power is estimated using
- Probability of Type I Error (α) used to determine statistical significance, the most common value used is α=0.05
- The sample size (number of pairs) in the data
- The observed mean of the paired difference
- The observed Standard Deviation of the paired difference.
- Whether the one tail or two tail model is to be used
Confidence interval of the mean of paired differences is an alternative to Type I Error and statistical significance. The confidence interval calculation uses
- The percent range of the confidence interval, usually the 95% confidence interval is used
- The sample size (number of pairs) in the data
- The observed mean of the paired difference
- The observed Standard Deviation of the paired difference.
- Whether the one tail or two tail model is to be used
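The power estimation can be sketched from the inputs listed above for that program. The sketch assumes a normal approximation, z_power = sqrt(n - z_α²/2) × |mean|/SD - z_α; the page does not state the formula used by the StatTools program, and the function name paired_power is hypothetical. With the values observed in the twin example on this page (mean 101g, Standard Deviation 456g, 20 pairs) it returns the power of 0.15 reported there.

```python
import math
from statistics import NormalDist

def paired_power(mean_diff, sd, n, alpha=0.05, two_tail=True):
    """Approximate power of a paired comparison with n pairs."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_tail else nd.inv_cdf(1 - alpha)
    effect = abs(mean_diff) / sd            # standardised difference
    z_power = math.sqrt(max(n - z_alpha ** 2 / 2, 0.0)) * effect - z_alpha
    return nd.cdf(z_power)

print(round(paired_power(101, 456, 20), 2))   # 0.15
```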

The following example uses computer generated data to demonstrate the processes.

We wish to know, in twin deliveries, whether the first twin is bigger or smaller than the second twin. The paired difference is diff = wt_{twin 1} - wt_{twin 2}

From our records and experience, we estimated the Standard Deviation of the differences between twins to be 150g
We decided a difference of 100g would be clinically meaningful, and our research model needs to be powerful enough to detect this difference
We decided that statistical decision should be based on the 95% confidence interval of the difference, which is equivalent to a probability of Type I Error (α) of 0.05
We decided that the data should have a power of 0.8.
As we are interested to know whether the first twin is larger or smaller, a two tail model is adopted.

Sample size (pairs)	95% CI width (g)	Decrease in width (g)	Decrease per pair (g)	% decrease per pair
5	372
10	215	158	32	8
15	166	48	10	5
20	140	26	5	3
25	124	17	3	2
30	112	12	2	2
35	103	9	2	2
40	96	7	1	1

Pilot Study: Before the main project, we wish to conduct a pilot study to confirm that our estimate of the Standard Deviation is correct, to make sure that the project is feasible, and for other administrative reasons.

Using Sample Size for Mean of Paired Difference Explanations and Tables Page, and providing a Standard Deviation of paired differences of 150g, we obtained the table shown above.

By a sample size of 15 pairs, each additional pair narrows the 95% confidence interval by only about 10g (5%), and beyond 15 pairs the gain falls to 5g (3%) or less per pair. We therefore decided that it would not be cost efficient in a pilot study to exceed this, and conducted a pilot study using 15 pairs of twin deliveries.

The pilot study indicated that there was no insurmountable barrier to mount a successful project, and that the Standard Deviation of the paired differences has not contradicted our initial estimate of 150g. We can decide to proceed with the main project.

Sample size Estimation. We calculate the sample size requirement as follows.

The statistical decision will use the probability of Type I Error (α) of 0.05
The power requirement will be 0.8 (80%)
The expected Standard Deviation of paired differences is 150g
The mean paired difference the study is designed to detect is 100g

Twin 1	Twin 2	Difference
3163	3124	39
3245	2807	438
3391	3014	376
2547	2727	-180
3042	3254	-211
3200	3826	-626
3115	2596	519
3294	2952	343
3019	3279	-260
3222	3325	-103
2831	2984	-153
3043	2765	277
2646	3109	-463
3327	2757	570
3182	3781	-599
2984	3061	-77
2878	3658	-780
2770	3665	-895
3092	2739	353
2735	3324	-589

Data Analysis : We proceeded with the main research project as planned, and collected birth weights from 20 sets of twins. The results are shown in the table above (weight in grams).

Using the first two columns in the Paired t Test in the Paired Difference Programs Page , the results, rounded to the nearest gram, are as follows.

The probability of Type I Error is α=0.35, indicating that the null hypothesis cannot be rejected. This is confirmed by the 95% confidence interval of -320g to +118g, which includes the null value of 0.

The immediate conclusion is that there is no significant difference related to the order of birth. The results are however confusing, as the size of the mean difference found, 101g, exceeds the critical value of 100g determined at the time of planning.
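The summary statistics behind these results can be checked from the Difference column of the twin data. The sketch below is not the StatTools program; it simply applies the definitions given earlier (n, mean, Standard Deviation, Standard Error, and t = mean / SE) using the Python standard library, and the variable names are chosen here for illustration.

```python
import math
from statistics import mean, stdev

# Difference column (Twin 1 - Twin 2, grams) from the data table
diffs = [39, 438, 376, -180, -211, -626, 519, 343, -260, -103,
         -153, 277, -463, 570, -599, -77, -780, -895, 353, -589]

n = len(diffs)
m = mean(diffs)               # mean of paired differences
sd = stdev(diffs)             # sample Standard Deviation (n - 1 denominator)
se = sd / math.sqrt(n)        # Standard Error of the mean
t = m / se                    # t statistic against the null value of 0

print(n, round(m), round(sd), round(se), round(t, 2))
# 20 -101 456 102 -0.99
```

The mean is negative because the difference is defined as Twin 1 minus Twin 2, so the first twin is on average 101g lighter, and the Standard Deviation of 456g is about three times the 150g assumed during planning.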

Returning to the Sample Size for Mean of Paired Difference Program Page , the power estimation shows this result to have a power of 0.15, far short of the 0.8 we stipulated during planning. The reason for this lack of power is that the Standard Deviation of the paired difference, 456g, was far greater than the 150g envisaged at the time of planning.

At this point, a decision is required on how to proceed. The options, as explained in the Probability Introduction and Explanation Page, are as follows.

Option 1 : The correct interpretation, as originally proposed by Pearson, is that the Standard Deviation proposed during planning, 150g, remains a constant, and the Standard Deviation in the data, 456g, is an unstable variation. As the mean of paired differences, 101g, exceeds the critical value of 100g stipulated during planning, the conclusion should be that this difference is statistically significant. The results of the t test and 95% confidence interval should be considered irrelevant. Using this option, we would conclude that, on average, the first twin is 101g smaller than the second twin, and this difference is statistically significant.

If we choose option 2, we will either settle for an inconclusive result, or recalibrate the sample size based on the observed Standard Deviation. To detect a paired difference of 100g, if the Standard Deviation is 456g, with α=0.05 and power=0.8, will require 166 pairs for a two tail model.

Nonparametric Wilcoxon Paired Signed Rank Test

The Wilcoxon Paired Signed Rank test is a nonparametric equivalent of the Paired t test. The procedure is as follows

The difference between the two values of the pair (d = v1 - v2) is calculated, in the same manner as for the paired t test.
All tied values (d=0) are discarded, and the remainder are used for calculation
The remaining differences are ranked by their absolute size, tied values sharing the average rank, and the ranks of the positive and negative differences are summed separately (T+ and T-)
The result is expressed as the probability of Type I Error (α)
- α is only calculated if the sample size is greater than 15 pairs
  - The two tail model is assumed when calculating α
  - If the one tail model is assumed, the α value can be halved (α_{one tail} = α_{two tail} / 2)
- When the sample size is 15 pairs or fewer, the significance is estimated by exact probability calculations, usually obtained from a table
  - The results are expressed as p>0.05 (not significant), p<0.05, p<0.01, or p<0.001
  - The two tail model is assumed when calculating α
  - Results obtained when the sample size is 15 pairs or fewer should not be modified for the one tail model.
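For samples of more than 15 pairs, α is commonly obtained from a normal approximation to the rank sum. The sketch below assumes that approach, without corrections for ties or continuity; the page does not state how the StatTools program calculates α, and the function name wilcoxon_normal_alpha and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def wilcoxon_normal_alpha(t_plus, n, two_tail=True):
    """Normal approximation for the sum of positive ranks from n non-zero pairs."""
    mean_t = n * (n + 1) / 4                          # expected rank sum under null
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # its standard deviation
    z = (t_plus - mean_t) / sd_t
    alpha_two = 2 * (1 - NormalDist().cdf(abs(z)))
    # the one tail value is half the two tail value
    return z, (alpha_two if two_tail else alpha_two / 2)

# Hypothetical input: 20 non-zero pairs with a sum of positive ranks of 150
z, alpha = wilcoxon_normal_alpha(150, 20)
print(round(z, 2), round(alpha, 3))   # 1.68 0.093
```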

Sample size for Wilcoxon Paired Signed Rank Test is discussed in Sample Size for Mean of Paired Difference Explanations and Tables Page and not repeated here.

The data in the examples are made up to demonstrate the methods.

Sample size

We wish to study whether a new analgesic is effective in relieving headaches.

We ask the subjects to describe their headache as none (0), some (1), moderate (2), and severe (3), a 4 point scale (0 to 3), before and after administering the analgesics, and use the paired differences to evaluate the analgesics.

As the paired difference can be from -3 to +3, the range is 6. We can therefore estimate the standard deviation as 6/3.92 = 1.53. We would like our data to be able to detect a paired difference of 1, so our effect size = 1/1.53 = 0.65

We set the power of the study to 0.8. To allow for the lower efficiency of the Wilcoxon test relative to the paired t test (about 0.955), we use a power of 0.8/0.955=0.84 in the calculation of our sample size.

We set α=0.05, power=0.84, and diff/SD=0.65. Using the table of sample size for the paired t test in Sample Size for Mean of Paired Difference Explanations and Tables Page, we find the sample size to be 23 subjects (pairs).

Data analysis

Before	After	Diff
3	1	2
2	1	1
2	0	2
3	0	3
1	1	0
1	0	1
2	2	0
1	2	-1
1	1	0
3	1	2
1	2	-1
1	1	0
1	1	0
3	0	3
3	1	2
1	2	-1
1	1	0
1	3	-2
1	1	0
3	1	2
3	2	1
1	1	0

We did not manage 23, but were able to study 22 subjects with headaches. Their scores before and after receiving the analgesic, and the paired differences, are shown in the table above.

The table of counts constructed from these data is shown below.

Size of paired difference	Number negative (-)	Number positive (+)
1	3	3
2	1	5
3	0	2

There were 8 subjects whose headache scores did not change (0), and these are not included in the table of counts.

On the negative side, there were:

3 subjects that got worse (-1)
1 subject that got a lot worse (-2)

On the positive side, there were:

3 subjects that got better (1)
5 subjects that got a lot better (2)
2 subjects that got completely better (3)

These can be counted in the Diff column of the data table, and are summarised in the table of counts above.

The results are n = 14
T+ = 85
T- = 20
p<0.05

We can therefore conclude that headaches decreased significantly after receiving the analgesic.
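The values of n, T+ and T- can be reproduced from the Before and After columns of the data table. The sketch below is an illustration rather than the StatTools program: it discards zero differences, ranks the remaining absolute differences with tied values sharing the average rank, and sums the ranks by sign. The function name signed_rank_sums is hypothetical.

```python
def signed_rank_sums(before, after):
    """Return (n, T+, T-) for the Wilcoxon Paired Signed Rank Test."""
    diffs = [b - a for b, a in zip(before, after) if b != a]  # drop d = 0
    ordered = sorted(abs(d) for d in diffs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # tied values occupy ranks i+1 .. j and share their average
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return len(diffs), t_plus, t_minus

before = [3, 2, 2, 3, 1, 1, 2, 1, 1, 3, 1, 1, 1, 3, 3, 1, 1, 1, 1, 3, 3, 1]
after = [1, 1, 0, 0, 1, 0, 2, 2, 1, 1, 2, 1, 1, 0, 1, 2, 1, 3, 1, 1, 2, 1]

print(signed_rank_sums(before, after))   # (14, 85.0, 20.0)
```

The six differences of size 1 share rank 3.5, the six of size 2 share rank 9.5, and the two of size 3 share rank 13.5, so T+ = 3(3.5) + 5(9.5) + 2(13.5) = 85 and T- = 3(3.5) + 1(9.5) = 20.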

Permutation Test

The Permutation Tests are the most basic of statistical tests, from which other models have developed. StatTools presents two models, the significance test for paired differences presented in Paired Difference Programs Page , and the significance test comparing two groups presented in Unpaired Difference Programs Page .

The general principle is that, in a randomly allocated study, each observed value could have been in either of the paired measurements. The test consists of calculating every possible permutation of the data and examining the results. If the result from the original data is near the extremes (e.g. below the 5th percentile or above the 95th percentile in a one tail model), then a decision can be made that it is unlikely to be null and is therefore statistically significant.

The advantages of using the Permutation tests are :

Exhaustive permutation allows the calculation of the exact probability of obtaining the observed result under the null hypothesis, so the probability of Type I Error (α) is exact rather than approximated.
The tests are not dependent on any assumption of data distribution, so they can be used in any regular interval data (where 10-9 is the same as 4-3). The tests can therefore be used on parametric measurements, ratios, variances, and time.
Because of the above two characteristics, the tests can be used with a very small sample size

The disadvantages of the tests are related to the intensity of computation required, both in memory use and in computation time. The number of permutations is 2ⁿ, where n is the number of pairs. Computation time therefore increases exponentially with increasing sample size, and a large dataset may either crash the program when available RAM is exhausted, or make the computation unacceptably long.

The Permutation Test is therefore ideal for handling small sets of interval data with uncertain distributions. With larger sample size, the more common non-parametric (Wilcoxon PSRT or Mann-Whitney U Test) and parametric (Paired t test) tests should be preferred.

In theory, the Permutation Test can cope with any number of pairs. However, a probability of <0.05 is not possible with fewer than 6 pairs unless the differences are uniformly in one direction, and computation will take an unacceptably long time with 22 pairs or more.

The mathematical argument of the Permutation Test is as follows

In a pair of measurements, the null hypothesis is that there is no difference between the pair. In other words, the values observed could be in either position of the pair.
The Permutation Test therefore consists of examining the sum of paired differences in all permutations where the values in each pair are in either position. The total number of permutations is therefore 2ⁿ, where n is the number of pairs.
The sum of differences in the original data is then compared with all possible outcomes, so that its probability can be estimated.

v1	v2
1.39	-1.10
-0.25	0.56
0.06	-0.52
0.14	0.66
0.66	1.11
0.34	-1.43
-0.61	1.20
-1.58	1.02
-0.27	1.41
-0.94	-2.64
-1.56	1.93
-0.46	-0.54
-0.33	0.09
0.41	-2.89
-0.66	1.83
1.19	-0.05

We use the default data in the program as an example, as in the table above. These are 16 pairs of measurements.

v1	v2	Diff
1.39	-1.10	2.49
-0.25	0.56	-0.81
0.06	-0.52	0.58
0.14	0.66	-0.52
0.66	1.11	-0.45
0.34	-1.43	1.77
-0.61	1.20	-1.81
-1.58	1.02	-2.6
-0.27	1.41	-1.68
-0.94	-2.64	1.7
-1.56	1.93	-3.49
-0.46	-0.54	0.08
-0.33	0.09	-0.42
0.41	-2.89	3.3
-0.66	1.83	-2.49
1.19	-0.05	1.24
sum diff		-3.11

Step 1. The paired difference (v1-v2) for each case is calculated, as shown in the table above. The first 2 columns are the paired measurements from the data, and the third column is the difference. The paired differences are summed, so that there are 16 pairs, and the sum of paired differences is -3.11

Step 2. Calculating the mathematics of permutation. Given 16 pairs, there are 2¹⁶=65536 possible combinations. If we use a two tail model at α<0.05, then 0.025 (2.5%) of the sums of paired differences lie in the extreme values on either side. One would therefore expect 0.025 x 65536 = 1638 values in either extreme which can be considered unlikely, and therefore significantly deviating from null.

Step 3. The sum of differences for every permutation is calculated, and each is compared with that obtained from the original data to determine whether it is less than, the same as, or greater than it. Of the 65536 permutations, there are 22622 values less than -3.11 and 42880 values greater than -3.11. Both of these are more than the 1638 which defines the decision border for α=0.05 in the 2 tail model. Therefore, we conclude that the null hypothesis cannot be rejected, or that the paired difference is not statistically significant.

Looking at it another way, -3.11 is the 22624th value from the minimum, or 22624/65536 x 100 = 34.52th percentile of all possible values, which is neither below the 2.5 percentile nor above the 97.5 percentile required for an α of 0.05 in the two tail model. In other words, the probability of Type I Error (α) = 0.35 for the one tail model, doubling to 0.69 for the two tail model.
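The three steps can be sketched as follows, using the 16 pairs of default data. This is an illustration and not the StatTools program: swapping the two values of a pair is represented by flipping the sign of its difference, and the differences are held as whole hundredths so that sums equal to the observed value are detected exactly. That handling of ties is an assumption made here, so the counts may differ slightly from those quoted above.

```python
from itertools import product

v1 = [1.39, -0.25, 0.06, 0.14, 0.66, 0.34, -0.61, -1.58,
      -0.27, -0.94, -1.56, -0.46, -0.33, 0.41, -0.66, 1.19]
v2 = [-1.10, 0.56, -0.52, 0.66, 1.11, -1.43, 1.20, 1.02,
      1.41, -2.64, 1.93, -0.54, 0.09, -2.89, 1.83, -0.05]

# Step 1: paired differences, stored as integer hundredths
diffs = [round((a - b) * 100) for a, b in zip(v1, v2)]
observed = sum(diffs)

# Step 2: number of permutations and the two tail decision border
n_perm = 2 ** len(diffs)
border = 0.025 * n_perm

# Step 3: compare the sum from every permutation with the observed sum
less = equal = greater = 0
for signs in product((1, -1), repeat=len(diffs)):
    total = sum(s * d for s, d in zip(signs, diffs))
    if total < observed:
        less += 1
    elif total > observed:
        greater += 1
    else:
        equal += 1

one_tail = (less + equal) / n_perm        # proportion at or below the observed sum
print(observed / 100, n_perm, int(border))
print(less, equal, greater, round(one_tail, 2), round(min(1.0, 2 * one_tail), 2))
```

With 16 pairs the loop runs 65536 times, which takes a fraction of a second. Each additional pair doubles the work, which is why the test becomes impractical beyond about 22 pairs.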

References

Paired t Test :
Armitage P (1971) Statistical Methods in Medical Research. Blackwell Scientific Publications, Oxford. p. 189-207

Sample Size :
Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Edition. Blackwell Science. ISBN 0-86542-870-0 p. 71-72

Wilcoxon Paired Signed Rank Test :
Siegel S and Castellan Jr. NJ (2000) Nonparametric Statistics for the Behavioral Sciences. Second Edition. McGraw Hill, Sydney. ISBN 0-07-100326-6 p. 95

Permutation Test for Paired Differences :
Siegel S and Castellan Jr. NJ (2000) Nonparametric Statistics for the Behavioral Sciences. Second Edition. McGraw Hill, Sydney. ISBN 0-07-100326-6 p. 95-101