Introduction
Data
Named group designation
Numerical group designation
References
Logistic regression is an extension of linear regression in which the outcome is the probability of a binary (0/1) variable. The general formula is
z = const + b1x1 + b2x2 + b3x3 + ..., where
x1, x2, x3 and so on are independent variables, either binary (0/1) or ordinal (0, 1, 2, 3, ...)
The product of the coefficient and the group designation for each independent variable is the log odds ratio of that group to the reference group, designated 0
After obtaining z, the probability of the outcome is calculated as y = 1 / (1 + exp(-z))
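As a minimal sketch of the two formulas above, the following R lines compute z and then the probability y. The coefficient values and the names const, b1, b2, x1 and x2 are hypothetical, chosen only for illustration.

```r
# Convert a linear predictor z into a probability with the logistic function.
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, for illustration only.
const <- -2.0
b1 <- 1.5   # coefficient of a binary variable x1 (0/1)
b2 <- 0.5   # coefficient of an ordinal variable x2 (0, 1, 2, ...)

x1 <- 1
x2 <- 2
z <- const + b1 * x1 + b2 * x2   # -2 + 1.5 + 1 = 0.5
y <- logistic(z)

# Base R's plogis() computes the same quantity.
all.equal(y, plogis(z))
```

In practice plogis(z) can be used directly instead of a hand-written function.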
Statistical packages differ in their default approach to coding independent variables for logistic regression
- Where an independent variable is binary (0/1), the coefficient is the log odds ratio of group 1 to group 0
- Where an independent variable has more than 2 groups, e.g. 3 groups of 0, 1, and 2, there are two options
- The first option is to treat the 3 groups together, so that the log odds ratio of each group to group 0 is the product of the group designation and the coefficient: log odds ratio Grp 1 / Grp 0 = b, Grp 2 / Grp 0 = 2b
- The second option is to transform all groups into binary dummy variables, the number of dummy variables being the number of groups minus 1. In the case of 3 groups, two dummy variables d1 and d2 are created, so that d1=0 and d2=0 for group 0, d1=1 and d2=0 for group 1, and d1=0 and d2=1 for group 2.
Both options for handling independent variables with multiple groups are available in R, with the following conventions
- Where groups are represented numerically, such as 0, 1, 2, 3..., R treats each variable as a single numeric predictor with one coefficient (the first option)
- Where groups are represented by names in text, such as one, two, three, R converts each variable to the appropriate number of dummy variables (the second option)
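The two conventions can be inspected without fitting a model by looking at the design matrix R builds. The sketch below uses a hypothetical three-group variable (grp_num and grp_name are illustrative names, not part of the exercise data). The factor levels are set explicitly here; without that, R would order the names alphabetically.

```r
# A hypothetical three-group variable, coded two ways.
grp_num  <- c(0, 1, 2, 0, 1, 2)
grp_name <- factor(c("one", "two", "three", "one", "two", "three"),
                   levels = c("one", "two", "three"))

# Numeric coding: a single column, so a single coefficient.
X_num <- model.matrix(~ grp_num)

# Named (factor) coding: two dummy columns, with "one" as the reference group.
X_name <- model.matrix(~ grp_name)

colnames(X_num)    # "(Intercept)" "grp_num"
colnames(X_name)   # "(Intercept)" "grp_nametwo" "grp_namethree"
```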
Parity | Complications | HeightGrp | Delivery |
Multipara | Diabetes | Tall | VD |
Nullipara | Hypertension | Tall | VD |
Multipara | None | Short | CS |
Multipara | Diabetes | Tall | VD |
Nullipara | Hypertension | Tall | VD |
Nullipara | None | Medium | CS |
Multipara | Diabetes | Short | CS |
Nullipara | Hypertension | Tall | VD |
Multipara | None | Tall | VD |
Nullipara | Diabetes | Medium | CS |
Multipara | Hypertension | Tall | VD |
Nullipara | None | Tall | VD |
Multipara | Diabetes | Tall | VD |
Nullipara | Hypertension | Medium | CS |
Multipara | None | Short | CS |
Nullipara | Diabetes | Medium | CS |
Multipara | Hypertension | Medium | VD |
Nullipara | None | Short | CS |
Nullipara | Diabetes | Short | CS |
Multipara | Hypertension | Medium | VD |
Multipara | None | Medium | VD |
Multipara | Diabetes | Medium | CS |
The table above contains artificial, computer-generated data for the logistic regression exercise.
- Parity is a binary variable, either nullipara or multipara. It can also be represented numerically, where 0=multipara and 1=nullipara
- Complications contains 3 groups: None (0), Diabetes (1) or Hypertension (2). There is a possibility, not present in this set of data, of both (3). The group designations are names even when numbers are used, as there is no assumption that any one complication leads to a higher caesarean section rate than another
- Height also has 3 groups, but these are ordered: Tall (0), Medium (1) and Short (2)
The data can be presented numerically or using their names. The options, their implications, and how they are interpreted are considered in the next two sections.
In numerically designated groups, data input is as follows
Input = ("
Nullipara Complications HeightGrp CS
0 2 0 0
1 1 0 0
0 0 2 1
0 2 0 0
1 1 0 0
1 0 1 1
0 2 2 1
1 1 0 0
0 0 0 0
1 2 1 1
0 1 0 0
1 0 0 0
0 2 0 0
1 1 1 1
0 0 2 1
0 2 1 1
0 1 1 0
1 0 2 1
1 2 2 1
0 1 1 0
0 0 1 0
0 2 1 1")
Data = read.table(textConnection(Input),header=TRUE)
Note that all the data are assumed to be ordered
- Nullipara : No=0 and yes=1
- Complications : None=0, Diabetes=1, hypertension=2, both=3
- Height groups : Tall=0, Medium=1, and Short=2
- Outcome CS : Vaginal delivery=0 caesarean section=1
The same data, stored in an Excel file, can also be imported using
install.packages("XLConnect")
library('XLConnect')
Data = readWorksheetFromFile("MyLogisticRegressionData.xlsx", sheet="NumericalSheet") #header=TRUE is default
Logistic Regression is run using the following commands
fit <- glm(CS~Nullipara+Complications+HeightGrp,data=Data,family=binomial())
summary(fit)
The results are shown as follows
Call:
glm(formula = CS ~ Nullipara + Complications + HeightGrp, family = binomial(),
data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.363e-05 -2.110e-08 -2.110e-08 2.110e-08 1.219e-05
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -164.53 202409.75 -0.001 0.999
Nullipara 93.01 125529.32 0.001 0.999
Complications 46.59 69106.34 0.001 0.999
HeightGrp 94.84 118221.10 0.001 0.999
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.0316e+01 on 21 degrees of freedom
Residual deviance: 9.0924e-10 on 18 degrees of freedom
AIC: 8
Number of Fisher Scoring iterations: 25
Please note the following.
There is only one coefficient for each independent variable
The coefficient is the log odds ratio between the group designated 1 and the group designated 0. The log odds ratio between group 2 and group 0 is 2 x the coefficient, between group 3 and group 0 is 3 x the coefficient, and so on
To calculate the probability of caesarean section
z = -164.5 + 93.0(group number for nullipara) + 46.6(group number for complication) + 94.8(group number for height)
Probability of caesarean section = 1 / (1 + exp(-z))
Results of calculations for all combinations are
Combinations | x1,x2,x3 | z=const+x1b1+x2b2+x3b3 | y=1/(1+Exp(-z)) |
Multip+NoComp+Tall | 0,0,0 | -164.5+0+0+0=-164.5 | 0.0 |
Multip+NoComp+Medium | 0,0,1 | -164.5+0+0+94.8=-69.7 | 0.0 |
Multip+NoComp+Short | 0,0,2 | -164.5+0+0+189.6=25.1 | 1.0 |
Multip+Diabetes+Tall | 0,1,0 | -164.5+0+46.6+0=-117.9 | 0.0 |
Multip+Diabetes+Medium | 0,1,1 | -164.5+0+46.6+94.8=-23.1 | 0.0 |
Multip+Diabetes+Short | 0,1,2 | -164.5+0+46.6+189.6=71.7 | 1.0 |
Multip+Hypertension+Tall | 0,2,0 | -164.5+0+93.2+0=-71.3 | 0.0 |
Multip+Hypertension+Medium | 0,2,1 | -164.5+0+93.2+94.8=23.5 | 1.0 |
Multip+Hypertension+Short | 0,2,2 | -164.5+0+93.2+189.6=118.3 | 1.0 |
Nullip+NoComp+Tall | 1,0,0 | -164.5+93+0+0=-71.5 | 0.0 |
Nullip+NoComp+Medium | 1,0,1 | -164.5+93+0+94.8=23.3 | 1.0 |
Nullip+NoComp+Short | 1,0,2 | -164.5+93+0+189.6=118.1 | 1.0 |
Nullip+Diabetes+Tall | 1,1,0 | -164.5+93+46.6+0=-24.9 | 0.0 |
Nullip+Diabetes+Medium | 1,1,1 | -164.5+93+46.6+94.8=69.9 | 1.0 |
Nullip+Diabetes+Short | 1,1,2 | -164.5+93+46.6+189.6=164.7 | 1.0 |
Nullip+Hypertension+Tall | 1,2,0 | -164.5+93+93.2+0=21.7 | 1.0 |
Nullip+Hypertension+Medium | 1,2,1 | -164.5+93+93.2+94.8=116.5 | 1.0 |
Nullip+Hypertension+Short | 1,2,2 | -164.5+93+93.2+189.6=211.3 | 1.0 |
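The z and probability columns of a table like this can be generated in R rather than by hand. The sketch below copies the four estimates from the summary(fit) output (to two decimal places, so values may differ slightly from hand-rounded ones) and evaluates every combination of group numbers. The names b and grid are illustrative.

```r
# Coefficients copied from the summary(fit) output of the numerical model.
b <- c(const = -164.53, Nullipara = 93.01, Complications = 46.59, HeightGrp = 94.84)

# All 2 x 3 x 3 = 18 combinations of group numbers.
grid <- expand.grid(HeightGrp = 0:2, Complications = 0:2, Nullipara = 0:1)

# Linear predictor z and probability of caesarean section for each combination.
grid$z <- b[["const"]] +
  b[["Nullipara"]]     * grid$Nullipara +
  b[["Complications"]] * grid$Complications +
  b[["HeightGrp"]]     * grid$HeightGrp
grid$pCS <- plogis(grid$z)   # same as 1 / (1 + exp(-z))

print(round(grid, 1))
```

When the fitted object is still available, predict(fit, newdata = grid, type = "response") should give the same probabilities without copying coefficients by hand.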
Advantages and disadvantages of the numerical group designation model
The main advantage is simplicity. Only a single coefficient is produced for each independent variable. The model is similar to multiple regression and intuitively easy to understand
The disadvantage is a possible flaw if the log odds ratios between adjacent groups are not equal. For example, the difference between tall and medium height in their influence on caesarean section may be relatively minor compared with the difference between medium and short, but the model treats them as steps of equal size
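One way to relax the equal-step assumption for a single variable while keeping numeric input is to wrap that variable in factor(), so that R builds dummy variables for it. This is a sketch of that idea, not part of the original exercise. The column HeightF and the object names are illustrative, and the labels assume Tall=0, Medium=1, Short=2. Warnings are suppressed because this small artificial data set produces fitted probabilities of essentially 0 or 1.

```r
# Numerical data from the exercise, entered column by column.
Data <- data.frame(
  Nullipara     = c(0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0),
  Complications = c(2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2),
  HeightGrp     = c(0,0,2,0,0,1,2,0,0,1,0,0,0,1,2,1,1,2,2,1,1,1),
  CS            = c(0,0,1,0,0,1,1,0,0,1,0,0,0,1,1,1,0,1,1,0,0,1)
)

# Height as a factor: Tall becomes the reference group.
Data$HeightF <- factor(Data$HeightGrp, levels = 0:2,
                       labels = c("Tall", "Medium", "Short"))

# Equal-step model: one coefficient for height.
fit_step <- suppressWarnings(
  glm(CS ~ Nullipara + Complications + HeightGrp, data = Data, family = binomial()))

# Dummy-variable model: separate coefficients for Medium and Short.
fit_dummy <- suppressWarnings(
  glm(CS ~ Nullipara + Complications + HeightF, data = Data, family = binomial()))

names(coef(fit_step))
names(coef(fit_dummy))
```

The second model has one more coefficient than the first, which is the price of not assuming equal steps between height groups.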
In named group designation, the independent variables are not represented numerically, but consist of names in text
- Parity is either nullipara or multipara
- Complications are either None, Diabetes, Hypertension, or both (not in this set of data)
- Height groups are Tall, Medium, or Short
- Outcome is vaginal delivery (VD) or caesarean section (CS)
Input = ("
Parity Complications HeightGrp Delivery
Multipara Diabetes Tall VD
Nullipara Hypertension Tall VD
Multipara None Short CS
Multipara Diabetes Tall VD
Nullipara Hypertension Tall VD
Nullipara None Medium CS
Multipara Diabetes Short CS
Nullipara Hypertension Tall VD
Multipara None Tall VD
Nullipara Diabetes Medium CS
Multipara Hypertension Tall VD
Nullipara None Tall VD
Multipara Diabetes Tall VD
Nullipara Hypertension Medium CS
Multipara None Short CS
Nullipara Diabetes Medium CS
Multipara Hypertension Medium VD
Nullipara None Short CS
Nullipara Diabetes Short CS
Multipara Hypertension Medium VD
Multipara None Medium VD
Multipara Diabetes Medium CS")
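# From R 4.0.0, text columns are no longer converted to factors by default.
# glm() needs the outcome Delivery to be a factor, so request the conversion.
Data = read.table(textConnection(Input),header=TRUE,stringsAsFactors=TRUE)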
Again, the data can be placed in an Excel file and imported as follows
install.packages("XLConnect")
library('XLConnect')
Data = readWorksheetFromFile("MyLogisticRegressionData.xlsx", sheet="NamedSheet") #header=TRUE is default
Data$Delivery = factor(Data$Delivery) # glm() needs a factor (or 0/1) outcome
and the commands to perform logistic regression are
fit <- glm(Delivery~Parity+Complications+HeightGrp,data=Data,family=binomial())
summary(fit)
When text names are used instead of ordinal numbers as values for the independent variables, R interprets the data as nominal (names) and not ordinal (in order). The algorithm therefore creates dummy variables for analysis
Call:
glm(formula = Delivery ~ Parity + Complications + HeightGrp,
family = binomial(), data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-8.161e-06 -2.110e-08 2.110e-08 2.931e-06 7.677e-06
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -24.13 96660.83 0 1
ParityNullipara -49.17 117230.92 0 1
ComplicationsHypertension 48.98 134921.65 0 1
ComplicationsNone 48.37 136363.41 0 1
HeightGrpShort -49.35 155080.86 0 1
HeightGrpTall 50.48 129005.27 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.0316e+01 on 21 degrees of freedom
Residual deviance: 3.8803e-10 on 16 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 25
Please note: for every variable, the number of dummy variables created is 1 less than the number of groups. Unless otherwise specified, the groups are ordered alphabetically, and the first in alphabetical order is left out as the reference group. Thus
- The outcome, delivery, is designated caesarean section (CS) or vaginal delivery (VD). The algorithm assigns the second in alphabetical order as the outcome to predict, so that the result is the probability of vaginal delivery.
- In parity, the dummy variable is the second alphabetically, Nullipara: No (0) or Yes (1). The log odds ratio for vaginal delivery, Nullipara/Multipara, is -49.2
- In complications, there are 2 dummy variables: Hypertension No (0) or Yes (1), and None No (0) or Yes (1). The first in alphabetical order, Diabetes, is left out. In the order (Hypertension, None), hypertension is represented as 1 0, no complication as 0 1, and diabetes as 0 0. The log odds ratio for vaginal delivery, relative to diabetes, is 49.0 for hypertension and 48.4 for no complication
- In height groups, there are 2 dummy variables: Short No (0) or Yes (1), and Tall No (0) or Yes (1). The first in alphabetical order, Medium, is left out. In the order (Short, Tall), medium is represented as 0 0, short as 1 0, and tall as 0 1. The log odds ratio for vaginal delivery, relative to medium height, is -49.4 for short and 50.5 for tall
To calculate the probability of vaginal delivery
z = -24.1 - 49.2(if nullipara) + 49.0(if hypertension) + 48.4(if no complication) - 49.4(if short) + 50.5(if tall)
Probability of vaginal delivery y = 1 / (1 + exp(-z))
The following table shows calculations for the probability of vaginal deliveries for all combinations of independent variables
Combinations | Nullip,Hypertension, NoComp,Short,Tall | z=const+sum(xibi) for Nullip, Hypertension,NoComp,Short,Tall | y=1/(1+Exp(-z)) P(VD) |
Multip+NoComp+Tall | 0,0,1,0,1 | -24.1+0+0+48.37+0+50.48=74.72 | 1.0 |
Multip+NoComp+Medium | 0,0,1,0,0 | -24.1+0+0+48.37+0+0=24.24 | 1.0 |
Multip+NoComp+Short | 0,0,1,1,0 | -24.1+0+0+48.37+-49.35+0=-25.11 | 0.0 |
Multip+Diabetes+Tall | 0,0,0,0,1 | -24.1+0+0+0+0+50.48=26.35 | 1.0 |
Multip+Diabetes+Medium | 0,0,0,0,0 | -24.1+0+0+0+0+0=-24.1 | 0.0 |
Multip+Diabetes+Short | 0,0,0,1,0 | -24.1+0+0+0+-49.35+0=-73.48 | 0.0 |
Multip+Hypertension+Tall | 0,1,0,0,1 | -24.1+0+48.98+0+0+50.48=75.33 | 1.0 |
Multip+Hypertension+Medium | 0,1,0,0,0 | -24.1+0+48.98+0+0+0=24.85 | 1.0 |
Multip+Hypertension+Short | 0,1,0,1,0 | -24.1+0+48.98+0+-49.35+0=-24.5 | 0.0 |
Nullip+NoComp+Tall | 1,0,1,0,1 | -24.1+-49.17+0+48.37+0+50.48=25.55 | 1.0 |
Nullip+NoComp+Medium | 1,0,1,0,0 | -24.1+-49.17+0+48.37+0+0=-24.93 | 0.0 |
Nullip+NoComp+Short | 1,0,1,1,0 | -24.1+-49.17+0+48.37+-49.35+0=-74.28 | 0.0 |
Nullip+Diabetes+Tall | 1,0,0,0,1 | -24.1+-49.17+0+0+0+50.48=-22.82 | 0.0 |
Nullip+Diabetes+Medium | 1,0,0,0,0 | -24.1+-49.17+0+0+0+0=-73.3 | 0.0 |
Nullip+Diabetes+Short | 1,0,0,1,0 | -24.1+-49.17+0+0+-49.35+0=-122.65 | 0.0 |
Nullip+Hypertension+Tall | 1,1,0,0,1 | -24.1+-49.17+48.98+0+0+50.48=26.16 | 1.0 |
Nullip+Hypertension+Medium | 1,1,0,0,0 | -24.1+-49.17+48.98+0+0+0=-24.32 | 0.0 |
Nullip+Hypertension+Short | 1,1,0,1,0 | -24.1+-49.17+48.98+0+-49.35+0=-73.67 | 0.0 |
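The dummy-variable calculation in this table can also be reproduced in R. The sketch below copies the six estimates from the summary(fit) output of the named model, lets model.matrix() build the 0/1 dummy columns for every combination of groups, and multiplies them by the coefficients. The names b, grid and X are illustrative.

```r
# Coefficients copied from the summary(fit) output of the named model.
b <- c("(Intercept)" = -24.13, ParityNullipara = -49.17,
       ComplicationsHypertension = 48.98, ComplicationsNone = 48.37,
       HeightGrpShort = -49.35, HeightGrpTall = 50.48)

# All 18 combinations; factor levels default to alphabetical order.
grid <- expand.grid(
  HeightGrp     = factor(c("Medium", "Short", "Tall")),
  Complications = factor(c("Diabetes", "Hypertension", "None")),
  Parity        = factor(c("Multipara", "Nullipara"))
)

# model.matrix() creates the intercept and the five dummy columns.
X <- model.matrix(~ Parity + Complications + HeightGrp, data = grid)
stopifnot(identical(colnames(X), names(b)))   # guard against a column mismatch

grid$z   <- as.vector(X %*% b)   # const + sum(xi * bi)
grid$pVD <- plogis(grid$z)       # probability of vaginal delivery

print(grid)
```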
The major advantage of using name designation is accuracy. There is no assumption of order between groups in any independent variable, and all independent variables are converted to dummy variables before calculation.
The disadvantage is that a large number of dummy variables are created, and in a complex model this may cause some confusion to the inexperienced data analyst.
The names of the groups can be chosen so that they sort alphabetically in a convenient order. For example, naming the outcomes 0VD and 1CS makes 1CS the outcome dummy, so the model estimates the probability of caesarean section.
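As an alternative to renaming groups, base R allows the level order of a factor to be set explicitly with factor(levels = ...) or relevel(). This is offered as a sketch of a different technique from the naming approach described above. The short vectors used here are hypothetical.

```r
# Hypothetical outcome values, for illustration only.
delivery <- c("VD", "VD", "CS", "VD", "CS")

# Default (alphabetical) order: CS is the first level, so glm() would model P(VD).
default_levels <- levels(factor(delivery))

# Explicit order: VD is the first level, so glm() would model P(CS).
Delivery <- factor(delivery, levels = c("VD", "CS"))

# For an independent variable, relevel() chooses the reference group.
complications <- factor(c("Diabetes", "None", "Hypertension", "None"))
complications <- relevel(complications, ref = "None")

default_levels          # "CS" "VD"
levels(Delivery)        # "VD" "CS"
levels(complications)   # "None" "Diabetes" "Hypertension"
```

With a binomial family, glm() treats the first level of a factor outcome as failure and the remaining level as success, which is why the level order decides which probability is estimated.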
Logistic regression. Wikipedia. https://en.wikipedia.org/wiki/Logistic_regression
Cox DR (1958) The regression analysis of binary sequences (with discussion). J Roy Stat Soc B 20(2): 215-242. JSTOR 2983890.
Portney LR, Watkins MP (2000) Foundations of Clinical Research: Applications to Practice, Second Edition. ISBN 0-8385-2695-0. p. 597-603