A common statistical problem is to describe a relationship between two measurements that are not linearly related
(in a straight line).
When such a relationship can be defined mathematically (e.g. $y = x^2$), the variables can be transformed using the programs in the Numerical Transformation Program Page, and the relatively simple linear relationship retained.
Often, however, a curved relationship appears regular and consistent but no mathematical definition of it is available, and an empirical "best fit" algorithm, such as the polynomial curve fitting from the Curve Fitting Program Page, is required.
The polynomial curve fit uses the formula $y = a + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + \dots$
As each increase in power bends the relationship into a sharper curve, the
combination of all the coefficients will be able to produce a curve of
potentially any level of complexity. In bio-social science, however, curve fitting
beyond the third power is seldom necessary or meaningful.
Curve fitting can be easily accomplished by using multiple regression, as described in the Multiple Regression Explained Page, where the single x variable is transformed into $x^2$, $x^3$, $x^4$, and so on, and the combination subjected to multiple regression analysis.
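As a minimal sketch of this approach, the Python below expands a single x variable into the predictors 1, x, x^2, ..., x^k and solves the resulting multiple regression by ordinary least squares (normal equations with Gaussian elimination). The function names `solve_linear` and `polynomial_fit` are illustrative and are not part of any program on these pages; the data are the example values tabulated later on this page.

```python
def solve_linear(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def polynomial_fit(xs, ys, power):
    """Least squares coefficients [a, b1, ..., b_power], lowest power first."""
    # Each x is expanded into the predictors 1, x, x^2, ..., x^power.
    design = [[x ** p for p in range(power + 1)] for x in xs]
    k = power + 1
    # Normal equations: (X'X) coef = X'y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve_linear(xtx, xty)


xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
ys = [10, 11, 18, 22, 20, 30, 19, 31, 30, 45, 40, 60]
coefs = polynomial_fit(xs, ys, power=3)
print([round(c, 4) for c in coefs])
```

With these data and a power of 3, the coefficients should agree closely with the mean regression line reported in the worked example on this page. Normal equations are adequate for the low powers discussed here; for higher powers an orthogonal decomposition would be numerically safer.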
Curve fitting has been used successfully in laboratories to define the relationship between the result of a test (e.g. the depth of a color reaction) and the amount of a chemical (e.g. sugar) present.
The problem with using curve fitting when more than the mean values of the fit are required is the difficulty of assigning a variance and a confidence interval to the fitted curve. Least squares statistics are seldom useful here, as each of the coefficients has its own variance, and these are difficult to combine.
An even more difficult issue is that, for many biological measurements, variance increases with the scale of measurement, so that the confidence interval around y widens as the x value increases.
Altman (see reference) described a two-stage procedure that solves this problem.
In the first stage, the standard curve fitting for the mean value is carried out.
In the second stage, the distance between the y of each data point and the mean y from the curve fit is obtained, and its absolute value is used to perform another curve fit, so that a confidence interval that varies with x can be defined.
The program in the Curve Fitting Program Page uses Altman's algorithm, and it can be used as follows.
- The data are entered as two columns separated by spaces or tabs. Column 1 is the independent (x) variable, and column 2 the dependent (y) variable. Each data point is in a row.
- The power used to curve fit the mean can be defined: 1 is a straight line, 2 a curve with one hump, 3 a curve with two humps, and so on. The power is capped at 5, as curves fitted beyond that are seldom meaningful in bio-social sciences.
- The power used to curve fit the standard deviation around the mean can also be defined. Unless there is a good theoretical reason, 0 or 1 is usually sufficient. The power is capped at 3.
- The percentage confidence interval required can be specified. The 95% confidence interval is the most commonly used, but the program allows users to change this to any percentage (such as 90% or 99%).
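The input format and limits described above can be sketched in Python as follows. This is only an illustration of the conventions in the list, not the program's own code: the function names and the lower bounds on the powers and the percentage are assumptions.

```python
MAX_MEAN_POWER = 5  # cap stated above for the mean curve
MAX_SD_POWER = 3    # cap stated above for the standard deviation curve


def parse_two_columns(text):
    """Read rows of 'x y' separated by spaces or tabs into two lists."""
    xs, ys = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 columns, got {len(fields)}")
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys


def check_settings(mean_power, sd_power, percent):
    """Reject settings outside the limits in the list above.

    The lower bounds (1, 0, and an open 0-100 range) are assumptions.
    """
    if not 1 <= mean_power <= MAX_MEAN_POWER:
        raise ValueError("mean power must be between 1 and 5")
    if not 0 <= sd_power <= MAX_SD_POWER:
        raise ValueError("SD power must be between 0 and 3")
    if not 0 < percent < 100:
        raise ValueError("percent must be between 0 and 100")


xs, ys = parse_two_columns("1 10\n1\t11\n2 18\n")
check_settings(mean_power=3, sd_power=1, percent=95)
print(xs, ys)
```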
x | y |
1 | 10 |
1 | 11 |
2 | 18 |
2 | 22 |
3 | 20 |
3 | 30 |
4 | 19 |
4 | 31 |
5 | 30 |
5 | 45 |
6 | 40 |
6 | 60 |
The example data in the table above are from the program in the Curve Fitting Program Page. They are computer generated, so that x and y have a curved relationship, and the variance of y increases with x.
We will fit the mean y value to the power of 3, and the standard deviation to the power of 1. We will require the program to draw the 95% confidence interval of the curve over the range of x values.
The results are as follows.
Mean regression line
Term | Coeff |
Cons | -7 |
x | 23.5317 |
x^2 | -6.4881 |
x^3 | 0.6944 |
Standard deviation
Term | Coeff |
Cons | -1.6711 |
x | 2.3276 |
The output is shown above. The first table is the curve for the mean value, and here $y = -7 + 23.53x - 6.49x^2 + 0.69x^3$.
This is followed by the regression line for the standard deviation, $SD = -1.67 + 2.33x$, which defines the standard deviation from the curve-fitted mean for any x value.
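The standard deviation line can be checked by hand. The sketch below takes the rounded mean-curve coefficients from the first table, computes the absolute residual of each data point, and fits a straight line to those absolute residuals. It assumes that, as in Altman's method, the fitted absolute residuals are multiplied by the square root of π/2, which converts a mean absolute deviation into a standard deviation under a normal distribution. This page does not state that step, but the reported SD coefficients are consistent with it. The function names are illustrative.

```python
import math

xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
ys = [10, 11, 18, 22, 20, 30, 19, 31, 30, 45, 40, 60]
mean_coefs = [-7.0, 23.5317, -6.4881, 0.6944]  # rounded, from the first table


def poly_value(coefs, x):
    """Evaluate a + b1*x + b2*x^2 + ... (coefficients in ascending power)."""
    return sum(c * x ** p for p, c in enumerate(coefs))


def straight_line_fit(xs, ys):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Stage 2: regress |y - fitted mean| on x.
abs_resid = [abs(y - poly_value(mean_coefs, x)) for x, y in zip(xs, ys)]
intercept, slope = straight_line_fit(xs, abs_resid)

# Assumed scaling: E|residual| = SD * sqrt(2/pi) for a normal distribution.
scale = math.sqrt(math.pi / 2)
sd_coefs = [intercept * scale, slope * scale]
print([round(c, 4) for c in sd_coefs])
```

The printed intercept and slope should be close to the -1.6711 and 2.3276 reported in the standard deviation table.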
Combining the two formulae gives the two equations that can be used to draw the 95% confidence interval lines.
From the first table, the curve of the mean is $y = -7 + 23.5317x - 6.4881x^2 + 0.6944x^3$
From the second table, the standard deviation from the mean curve is $SD = -1.6711 + 2.3276x$
The 95% confidence interval is mean ± 1.96SD, so by combining the two fitted lines, we can obtain the lower and upper 95% CI lines, as shown in the following table.
95% CI lines
Term | Low | High |
Cons | -3.7247 | -10.2753 |
x | 18.9697 | 28.0938 |
x^2 | -6.4881 | -6.4881 |
x^3 | 0.6944 | 0.6944 |
These are as follows.
- The lower line: $y = -3.7247 + 18.9697x - 6.4881x^2 + 0.6944x^3$
- The upper line: $y = -10.2753 + 28.0938x - 6.4881x^2 + 0.6944x^3$
Please note that the coefficients for the CI lines would be different should a percentage confidence other than 95% be used (such as 90% or 99%).
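The combination of the two fitted lines can be sketched in Python as below. The function name `ci_lines` is illustrative. The critical value is taken from the standard library's `statistics.NormalDist` rather than the rounded 1.96 used in the text, so results may differ from the table in the fourth decimal place.

```python
from statistics import NormalDist


def ci_lines(mean_coefs, sd_coefs, percent=95.0):
    """Return (lower, upper) coefficient lists for mean -/+ z * SD.

    Coefficients are in ascending power: [constant, x, x^2, ...].
    """
    # Two-sided critical value, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + percent / 200.0)
    # Pad the shorter polynomial with zeros so terms line up by power.
    n = max(len(mean_coefs), len(sd_coefs))
    mean = list(mean_coefs) + [0.0] * (n - len(mean_coefs))
    sd = list(sd_coefs) + [0.0] * (n - len(sd_coefs))
    lower = [m - z * s for m, s in zip(mean, sd)]
    upper = [m + z * s for m, s in zip(mean, sd)]
    return lower, upper


mean_coefs = [-7.0, 23.5317, -6.4881, 0.6944]
sd_coefs = [-1.6711, 2.3276]
lower, upper = ci_lines(mean_coefs, sd_coefs, percent=95)
print([round(c, 4) for c in lower])
print([round(c, 4) for c in upper])
```

Calling `ci_lines(mean_coefs, sd_coefs, percent=99)` widens the interval, which changes the constant and x coefficients but leaves the x^2 and x^3 coefficients unchanged, because the standard deviation here is fitted only to the power of 1.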
Data points
X | Y | yx | sd | z | Percentile |
1 | 10 | 10.7381 | 0.6565 | -1.1243 | 13.04 |
1 | 11 | 10.7381 | 0.6565 | 0.3989 | 65.50 |
2 | 18 | 19.6667 | 2.9841 | -0.5585 | 28.82 |
2 | 22 | 19.6667 | 2.9841 | 0.7819 | 78.29 |
3 | 20 | 23.9524 | 5.3117 | -0.7441 | 22.84 |
3 | 30 | 23.9524 | 5.3117 | 1.1386 | 87.26 |
4 | 19 | 27.7619 | 7.6392 | -1.147 | 12.57 |
4 | 31 | 27.7619 | 7.6392 | 0.4239 | 66.42 |
5 | 30 | 35.2619 | 9.9668 | -0.5279 | 29.88 |
5 | 45 | 35.2619 | 9.9668 | 0.9771 | 83.57 |
6 | 40 | 50.619 | 12.2944 | -0.8637 | 19.39 |
6 | 60 | 50.619 | 12.2944 | 0.763 | 77.73 |
The data points and their deviations from the mean line are then presented, as in the table above. The abbreviations are:
- X and Y are the original x and y values of the data point
- yx is the curve-fitted mean y for the x value X
- sd is the standard deviation of y at the x value X
- z = (Y - yx)/sd, and represents the difference between Y and its
curve fitted value yx in standard deviation units.
- Percentile is a transformation of z into probability percentile, assuming a normal distribution.
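The columns defined above can be reproduced for a single data point with the short Python sketch below. It uses the rounded coefficients from the result tables, so values may differ slightly from the program's output in the last decimal places. The function names are illustrative, and the check for a non-positive standard deviation is an added safeguard, not something described on this page.

```python
from statistics import NormalDist

MEAN_COEFS = [-7.0, 23.5317, -6.4881, 0.6944]  # mean curve, ascending power
SD_COEFS = [-1.6711, 2.3276]                   # SD line, ascending power


def poly_value(coefs, x):
    """Evaluate a + b1*x + b2*x^2 + ... (coefficients in ascending power)."""
    return sum(c * x ** p for p, c in enumerate(coefs))


def describe_point(x, y):
    """Return (yx, sd, z, percentile) for one data point."""
    yx = poly_value(MEAN_COEFS, x)   # curve-fitted mean at x
    sd = poly_value(SD_COEFS, x)     # fitted standard deviation at x
    if sd <= 0:
        # The fitted SD line can fall to zero or below outside the data range.
        raise ValueError("fitted standard deviation is not positive at this x")
    z = (y - yx) / sd
    percentile = 100.0 * NormalDist().cdf(z)  # assumes a normal distribution
    return yx, sd, z, percentile


yx, sd, z, pct = describe_point(1, 10)
print(round(yx, 4), round(sd, 4), round(z, 4), round(pct, 2))
```

For the first data point (X = 1, Y = 10) the values should be close to those in the first row of the data points table.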
These coefficients can now be used to transform any x value to a y value, using the polynomial transformation utility available in the Numerical Transformation Program Page.
Finally, the curve fit bitmap is displayed, showing the original data points (black circles) and the 3 curves: the best fit mean, the upper confidence interval, and the lower confidence interval (in this example, the 95% confidence interval).
Altman DG (1993) Constructing age-related reference
centiles using absolute residuals. Statistics in Medicine 12(10):917-924