LogReg

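Logistic regression is an extension of linear regression in which the outcome is the probability of a binomial (0/1) variable. The general formula is

z = const + x1b1 + x2b2 + ... + xkbk
y = 1/(1+Exp(-z))

where x1 to xk are the independent variables, b1 to bk are their coefficients, and y is the probability that the outcome is 1.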

Different statistical packages have different default approaches to calculating logistic regression.

Where an independent variable is binary (0/1), the coefficient is the log odds ratio of group 1 to group 0.
Where an independent variable has more than 2 groups, e.g. 3 groups of 0, 1, and 2, there are two options:
- The first option is to treat the group designation as a number, so that the log odds ratio of each group to group 0 is the product of the group designation and the coefficient: log odds ratio Grp 1 / Grp 0 = b, Grp 2 / Grp 0 = 2b.
- The second option is to transform the groups into binary dummy variables, the number of dummy variables being the number of groups - 1. In the case of 3 groups, two dummy variables d1 and d2 are created, so that d1=0 and d2=0 for group 0, d1=1 and d2=0 for group 1, and d1=0 and d2=1 for group 2.

Both options for handling independent variables with multiple groups are available in R, with the following conventions:

Where groups are represented numerically, such as 0, 1, 2, 3..., R treats the variable as a single number and estimates one coefficient for it.
Where groups are represented by names in text, such as one, two, three, R converts the groups to the appropriate number of dummy variables.
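As a minimal sketch of these two conventions, R's model.matrix function shows the columns that a model formula would generate. The variable names grp_num and grp_name are illustrative only, and factor() is applied explicitly to the text labels so that the result does not depend on how the data were read in.

```r
# Three groups, coded once as numbers and once as text labels
grp_num  <- c(0, 1, 2)
grp_name <- factor(c("one", "two", "three"))  # levels sort alphabetically: one, three, two

# Numeric coding: a single column, so a single coefficient b
# (log odds ratio of group 2 versus group 0 is then 2b)
X_num <- model.matrix(~ grp_num)

# Text (factor) coding: number of groups - 1 dummy columns,
# with the first level alphabetically ("one") left out as the reference
X_name <- model.matrix(~ grp_name)

print(X_num)
print(X_name)
```

The numeric version has one column besides the intercept, while the factor version has two dummy columns, one for each group other than the reference.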

Parity	Complications	HeightGrp	Delivery
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Tall	VD
Multipara	None	Short	CS
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Tall	VD
Nullipara	None	Medium	CS
Multipara	Diabetes	Short	CS
Nullipara	Hypertension	Tall	VD
Multipara	None	Tall	VD
Nullipara	Diabetes	Medium	CS
Multipara	Hypertension	Tall	VD
Nullipara	None	Tall	VD
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Medium	CS
Multipara	None	Short	CS
Nullipara	Diabetes	Medium	CS
Multipara	Hypertension	Medium	VD
Nullipara	None	Short	CS
Nullipara	Diabetes	Short	CS
Multipara	Hypertension	Medium	VD
Multipara	None	Medium	VD
Multipara	Diabetes	Medium	CS

The table above contains artificial data generated by computer for the logistic regression exercise.

Parity is a binary variable, being nullipara or multipara. It can also be represented numerically, where 0=multipara and 1=nullipara.
Complications contains 3 groups: None (0), Diabetes (1) or Hypertension (2). There is a possibility, not in this set of data, of both (3). Group designations are names even when numbers are used, as there is no assumed order among the complications in their effect on the caesarean section rate.
Height also has 3 groups, but these are ordered: Tall (0), Medium (1) and Short (2).

The data can be presented numerically or using their names. The options, their implications, and how they are interpreted are considered in the next two sections

In numerically designated groups, data input is as follows


Input = ("
Nullipara	Complications	HeightGrp	CS
0	2	0	0
1	1	0	0
0	0	2	1
0	2	0	0
1	1	0	0
1	0	1	1
0	2	2	1
1	1	0	0
0	0	0	0
1	2	1	1
0	1	0	0
1	0	0	0
0	2	0	0
1	1	1	1
0	0	2	1
0	2	1	1
0	1	1	0
1	0	2	1
1	2	2	1
0	1	1	0
0	0	1	0
0	2	1	1")
Data = read.table(textConnection(Input),header=TRUE)

Note that all the data are assumed to be ordered

Nullipara : No=0 and yes=1
Complications : None=0, Diabetes=1, hypertension=2, both=3
Height groups : Tall=0, Medium=1, and Short=2
Outcome CS : Vaginal delivery=0 caesarean section=1

The same data, if stored in Excel, can also be imported using


install.packages("XLConnect")
library('XLConnect')
Data = readWorksheetFromFile("MyLogisticRegressionData.xlsx", sheet="NumericalSheet") #header=TRUE is default

Logistic Regression is run using the following commands


fit <- glm(CS~Nullipara+Complications+HeightGrp,data=Data,family=binomial())
summary(fit)

The results are shown as follows


Call:
glm(formula = CS ~ Nullipara + Complications + HeightGrp, family = binomial(), 
    data = Data)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.363e-05  -2.110e-08  -2.110e-08   2.110e-08   1.219e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -164.53  202409.75  -0.001    0.999
Nullipara         93.01  125529.32   0.001    0.999
Complications     46.59   69106.34   0.001    0.999
HeightGrp         94.84  118221.10   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.0316e+01  on 21  degrees of freedom
Residual deviance: 9.0924e-10  on 18  degrees of freedom
AIC: 8

Number of Fisher Scoring iterations: 25
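
The very large standard errors, the residual deviance of almost zero, and the 25 Fisher scoring iterations (the default maximum in glm) are what would be expected when the outcome in a small artificial data set is perfectly predicted by the independent variables, in which case the coefficients are not uniquely determined. The following is a hedged sketch, not part of the original exercise, of how this might be checked. The data vectors reproduce the numerical table above, and the names warns and fit_num are illustrative.

```r
# Numerically coded data, same values as the input table above
Data <- data.frame(
  Nullipara     = c(0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0),
  Complications = c(2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2),
  HeightGrp     = c(0,0,2,0,0,1,2,0,0,1,0,0,0,1,2,1,1,2,2,1,1,1),
  CS            = c(0,0,1,0,0,1,1,0,0,1,0,0,0,1,1,1,0,1,1,0,0,1)
)

# Fit the model while collecting any warnings that glm raises
warns <- character(0)
fit_num <- withCallingHandlers(
  glm(CS ~ Nullipara + Complications + HeightGrp, data = Data, family = binomial()),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warns)                                   # warnings raised during fitting, if any
print(fit_num$converged)                       # FALSE if the iteration limit was reached
print(range(fitted(fit_num)))                  # probabilities near 0 and 1 suggest perfect prediction
print(table(Data$CS, round(fitted(fit_num))))  # observed versus predicted outcome
```

If every observation is classified correctly and the fitted probabilities are all numerically 0 or 1, the size of each coefficient should be read with caution, although the direction of each effect and the calculations that follow still illustrate how the model works.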

Please note the following.

There is only one coefficient for each independent variable.

The coefficient represents the log odds ratio between the group designated 1 and the group designated 0. Between group 2 and group 0 the log odds ratio is the coefficient x 2, between group 3 and group 0 the coefficient x 3, and so on.
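
As a small worked illustration of this scaling, the sketch below takes the HeightGrp coefficient from the output above (94.84) and computes the implied log odds ratio of each height group against group 0. The variable names are illustrative.

```r
b_height <- 94.84            # HeightGrp coefficient from the summary above
group    <- 0:2              # group designations used in the data

log_or <- b_height * group   # log odds ratio of each group versus group 0
print(data.frame(group, log_or, odds_ratio = exp(log_or)))
```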

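To calculate the probability of caesarean section, z = const + x1b1 + x2b2 + x3b3 is first calculated, where x1, x2 and x3 are the values of Nullipara, Complications and HeightGrp, and b1, b2 and b3 are their coefficients. The probability is then y = 1/(1+Exp(-z))

Results of calculations for all combinations are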

Combinations	x1,x2,x3	z=const+x1b1+x2b2+x3b3	y=1/(1+Exp(-z))
Multip+NoComp+Tall	0,0,0	-164.5+0+0+0=-164.5	0.0
Multip+NoComp+Medium	0,0,1	-164.5+0+0+94.8=-69.7	0.0
Multip+NoComp+Short	0,0,2	-164.5+0+0+189.6=25.1	1.0
Multip+Diabetes+Tall	0,1,0	-164.5+0+49.6+0=-114.9	0.0
Multip+Diabetes+Medium	0,1,1	-164.5+0+49.6+94.8=-20.1	0.0
Multip+Diabetes+Short	0,1,2	-164.5+0+49.6+189.6=74.7	1.0
Multip+Hypertension+Tall	0,2,0	-164.5+0+99.2+0=-65.3	0.0
Multip+Hypertension+Medium	0,2,1	-164.5+0+99.2+94.8=29.5	1.0
Multip+Hypertension+Short	0,2,2	-164.5+0+99.2+189.6=124.3	1.0
Nullip+NoComp+Tall	1,0,0	-164.5+93+0+0=-71.5	0.0
Nullip+NoComp+Medium	1,0,1	-164.5+93+0+94.8=23.3	1.0
Nullip+NoComp+Short	1,0,2	-164.5+93+0+189.6=118.1	1.0
Nullip+Diabetes+Tall	1,1,0	-164.5+93+49.6+0=-21.9	0.0
Nullip+Diabetes+Medium	1,1,1	-164.5+93+49.6+94.8=72.9	1.0
Nullip+Diabetes+Short	1,1,2	-164.5+93+49.6+189.6=167.7	1.0
Nullip+Hypertension+Tall	1,2,0	-164.5+93+99.2+0=27.7	1.0
Nullip+Hypertension+Medium	1,2,1	-164.5+93+99.2+94.8=122.5	1.0
Nullip+Hypertension+Short	1,2,2	-164.5+93+99.2+189.6=217.3	1.0
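
The calculations in the table above can be sketched in R as follows. The coefficients are the rounded values that appear in the table, entered by hand rather than re-estimated, and the names b and grid are illustrative.

```r
# Rounded coefficients as used in the table above (assumed, not re-estimated here)
b <- c(const = -164.5, Nullipara = 93.0, Complications = 49.6, HeightGrp = 94.8)

# All 2 x 3 x 3 combinations of the numerically coded independent variables
# (expand.grid varies the first variable fastest, matching the table order)
grid <- expand.grid(HeightGrp = 0:2, Complications = 0:2, Nullipara = 0:1)

# Linear predictor z = const + x1b1 + x2b2 + x3b3
grid$z <- b[["const"]] +
  b[["Nullipara"]]     * grid$Nullipara +
  b[["Complications"]] * grid$Complications +
  b[["HeightGrp"]]     * grid$HeightGrp

# Probability of caesarean section y = 1/(1+exp(-z))
grid$y <- 1 / (1 + exp(-grid$z))

print(round(grid[, c("Nullipara", "Complications", "HeightGrp", "z", "y")], 1))
```

Where a fitted model object is available, predict(fit, newdata = grid, type = "response") should give the same kind of result using the unrounded coefficients.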

Advantages and disadvantages of the numerical group designation model

The main advantage is simplicity. Only a single coefficient is produced for each independent variable. The model is similar to multiple regression and intuitively easy to understand.

The disadvantage is a possible flaw if the log odds ratios between successive groups are not equal. For example, the difference between tall and medium height in their influence on caesarean section may be relatively minor compared with the difference between medium and short, but the model treats them as steps of equal distance.
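
One way to relax the equal-step assumption for a single variable, while keeping the numerical data, is to wrap that variable in factor() in the model formula so that R creates dummy variables for it. This is a sketch of a standard R idiom rather than something described in the original exercise. The data vectors repeat the numerical table above, and the name fit_steps is illustrative.

```r
# Numerically coded data, same values as the input table above
Data <- data.frame(
  Nullipara     = c(0,1,0,0,1,1,0,1,0,1,0,1,0,1,0,0,0,1,1,0,0,0),
  Complications = c(2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2,1,0,2),
  HeightGrp     = c(0,0,2,0,0,1,2,0,0,1,0,0,0,1,2,1,1,2,2,1,1,1),
  CS            = c(0,0,1,0,0,1,1,0,0,1,0,0,0,1,1,1,0,1,1,0,0,1)
)

# HeightGrp is treated as unordered groups; the other variables stay numeric.
# Warnings are suppressed here only because these artificial data tend to
# produce fitted probabilities of 0 or 1.
fit_steps <- suppressWarnings(
  glm(CS ~ Nullipara + Complications + factor(HeightGrp),
      data = Data, family = binomial())
)

# Separate coefficients for HeightGrp 1 and 2, each relative to group 0
print(coef(fit_steps))
```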

In named group designation, the independent variables are not numerically represented, but consist of names in text

Parity is either nullipara or multipara
Complications are either None, Diabetes, Hypertension, or both(not in this set of data)
Height groups are Tall, Medium, or Short
Outcome is vaginal delivery (VD) or caesarean section (CS)


Input = ("
Parity	Complications	HeightGrp	Delivery
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Tall	VD
Multipara	None	Short	CS
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Tall	VD
Nullipara	None	Medium	CS
Multipara	Diabetes	Short	CS
Nullipara	Hypertension	Tall	VD
Multipara	None	Tall	VD
Nullipara	Diabetes	Medium	CS
Multipara	Hypertension	Tall	VD
Nullipara	None	Tall	VD
Multipara	Diabetes	Tall	VD
Nullipara	Hypertension	Medium	CS
Multipara	None	Short	CS
Nullipara	Diabetes	Medium	CS
Multipara	Hypertension	Medium	VD
Nullipara	None	Short	CS
Nullipara	Diabetes	Short	CS
Multipara	Hypertension	Medium	VD
Multipara	None	Medium	VD
Multipara	Diabetes	Medium	CS")
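# stringsAsFactors=TRUE converts the text columns to factors (no longer the default from R 4.0.0)
Data = read.table(textConnection(Input),header=TRUE,stringsAsFactors=TRUE)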

Again, the data can be placed in Excel and imported as follows


install.packages("XLConnect")
library('XLConnect')
Data = readWorksheetFromFile("MyLogisticRegressionData.xlsx", sheet="NamedSheet") #header=TRUE is default

and the command to perform logistic regression


fit <- glm(Delivery~Parity+Complications+HeightGrp,data=Data,family=binomial())
summary(fit)

By using text names instead of ordinal numbers as values for independent variables, R interprets the data as nominal (names) and not ordinal (in order). The algorithm therefore creates dummy variables for the analysis


Call:
glm(formula = Delivery ~ Parity + Complications + HeightGrp, 
    family = binomial(), data = Data)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-8.161e-06  -2.110e-08   2.110e-08   2.931e-06   7.677e-06  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -24.13   96660.83       0        1
ParityNullipara              -49.17  117230.92       0        1
ComplicationsHypertension     48.98  134921.65       0        1
ComplicationsNone             48.37  136363.41       0        1
HeightGrpShort               -49.35  155080.86       0        1
HeightGrpTall                 50.48  129005.27       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.0316e+01  on 21  degrees of freedom
Residual deviance: 3.8803e-10  on 16  degrees of freedom
AIC: 12

Number of Fisher Scoring iterations: 25

Please note: for every variable, the number of dummy variables created is 1 less than the number of groups. Unless otherwise specified, the groups are ordered alphabetically, with the first in alphabetical order left out. Thus

The outcome, Delivery, is designated caesarean section (CS) or vaginal delivery (VD). The algorithm assigns the second in alphabetical order as the outcome to predict, so that the result is the probability of vaginal delivery.
In parity, the dummy variable is the second alphabetically, Nullipara: No (0) or Yes (1). The log odds ratio for vaginal delivery Nullipara/Multipara = -49.2
In complications, there are 2 dummy variables: Hypertension No (0) or Yes (1), and None No (0) or Yes (1). The first in alphabetical order, Diabetes, is left out. Taking the dummy variables in the order Hypertension, None, this means that Diabetes is represented as 0 0, Hypertension as 1 0, and no complication as 0 1. The log odds ratio for vaginal delivery, relative to Diabetes, is 49.0 for Hypertension and 48.4 for no complication
In height groups, there are 2 dummy variables: Short No (0) or Yes (1), and Tall No (0) or Yes (1). The first in alphabetical order, Medium, is left out. Taking the dummy variables in the order Short, Tall, this means that height groups are represented as Medium 0 0, Short 1 0, and Tall 0 1. The log odds ratio for vaginal delivery, relative to Medium, is -49.4 for Short and 50.5 for Tall
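
The alphabetical ordering and the resulting dummy variables can be inspected directly in R. The sketch below builds small factors by hand, with illustrative variable names, rather than using the full data set; levels() and contrasts() are base R functions.

```r
# Small factors holding the group names used in the data
delivery      <- factor(c("VD", "CS"))
complications <- factor(c("None", "Diabetes", "Hypertension"))
height        <- factor(c("Tall", "Medium", "Short"))

# Levels are sorted alphabetically. For a two-level outcome in a binomial glm,
# the first level is the reference and the probability of the other is modelled.
print(levels(delivery))

# Each row is a group and each column a dummy variable.
# The first level (Diabetes, Medium) is the row of zeros.
print(contrasts(complications))
print(contrasts(height))
```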

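To calculate the probability of vaginal delivery, z = const + sum(x_ib_i) is first calculated over the dummy variables, and the probability is then y = 1/(1+Exp(-z)). The probability of caesarean section is 1 - y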

The following table shows calculations for the probability of vaginal deliveries for all combinations of independent variables

Combinations	Nullip,Hypertension, NoComp,Short,Tall	z=const+sum(x_ib_i) for Nullip, Hypertension,NoComp,Short,Tall	y=1/(1+Exp(-z)) P(VD)
Multip+NoComp+Tall	0,0,1,0,1	-24.1+0+0+48.37+0+50.48=74.72	1.0
Multip+NoComp+Medium	0,0,1,0,0	-24.1+0+0+48.37+0+0=24.24	1.0
Multip+NoComp+Short	0,0,1,1,0	-24.1+0+0+48.37+-49.35+0=-25.11	0.0
Multip+Diabetes+Tall	0,0,0,0,1	-24.1+0+0+0+0+50.48=26.35	1.0
Multip+Diabetes+Medium	0,0,0,0,0	-24.1+0+0+0+0+0=-24.1	0.0
Multip+Diabetes+Short	0,0,0,1,0	-24.1+0+0+0+-49.35+0=-73.48	0.0
Multip+Hypertension+Tall	0,1,0,0,1	-24.1+0+48.98+0+0+50.48=75.33	1.0
Multip+Hypertension+Medium	0,1,0,0,0	-24.1+0+48.98+0+0+0=24.85	1.0
Multip+Hypertension+Short	0,1,0,1,0	-24.1+0+48.98+0+-49.35+0=-24.5	0.0
Nullip+NoComp+Tall	1,0,1,0,1	-24.1+-49.17+0+48.37+0+50.48=25.55	1.0
Nullip+NoComp+Medium	1,0,1,0,0	-24.1+-49.17+0+48.37+0+0=-24.93	0.0
Nullip+NoComp+Short	1,0,1,1,0	-24.1+-49.17+0+48.37+-49.35+0=-74.28	0.0
Nullip+Diabetes+Tall	1,0,0,0,1	-24.1+-49.17+0+0+0+50.48=-22.82	0.0
Nullip+Diabetes+Medium	1,0,0,0,0	-24.1+-49.17+0+0+0+0=-73.3	0.0
Nullip+Diabetes+Short	1,0,0,1,0	-24.1+-49.17+0+0+-49.35+0=-122.65	0.0
Nullip+Hypertension+Tall	1,1,0,0,1	-24.1+-49.17+48.98+0+0+50.48=26.16	1.0
Nullip+Hypertension+Medium	1,1,0,0,0	-24.1+-49.17+48.98+0+0+0=-24.32	0.0
Nullip+Hypertension+Short	1,1,0,1,0	-24.1+-49.17+48.98+0+-49.35+0=-73.67	0.0
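
The same calculations can be sketched in R. The coefficients are copied by hand from the summary output above, and the names b, grid, p_vd and p_cs are illustrative.

```r
# Coefficients copied from the summary output above
b <- c(const = -24.13, Nullipara = -49.17, Hypertension = 48.98,
       None = 48.37, Short = -49.35, Tall = 50.48)

# All combinations of the named groups, in the same order as the table
grid <- expand.grid(HeightGrp     = c("Tall", "Medium", "Short"),
                    Complications = c("None", "Diabetes", "Hypertension"),
                    Parity        = c("Multipara", "Nullipara"),
                    stringsAsFactors = FALSE)

# Each comparison yields TRUE/FALSE, which acts as a 1/0 dummy variable
grid$z <- b[["const"]] +
  b[["Nullipara"]]    * (grid$Parity == "Nullipara") +
  b[["Hypertension"]] * (grid$Complications == "Hypertension") +
  b[["None"]]         * (grid$Complications == "None") +
  b[["Short"]]        * (grid$HeightGrp == "Short") +
  b[["Tall"]]         * (grid$HeightGrp == "Tall")

grid$p_vd <- 1 / (1 + exp(-grid$z))  # probability of vaginal delivery
grid$p_cs <- 1 - grid$p_vd           # probability of caesarean section
print(grid)
```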

The major advantage of using name designation is accuracy. There is no assumption of order between groups in any independent variable, as all independent variables are converted to dummy variables before calculation.

The disadvantage is that a large number of dummy variables are created, and in a complex model this may cause some confusion to the inexperienced data analyst.

The names of the groups can be labelled so that they line up alphabetically in a convenient manner. For example, 0VD and 1CS will result in 1CS being the outcome predicted, so that the model estimates the probability of caesarean section.
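
An alternative to renaming the groups, not described in the original exercise, is to set the order of the factor levels explicitly in R. The sketch below uses the base R functions factor() and relevel() on small hand-made vectors; the variable names are illustrative.

```r
delivery <- c("VD", "VD", "CS", "VD", "CS")

# Default: levels sorted alphabetically, so CS is the reference and P(VD) is modelled
d_default <- factor(delivery)

# Explicit level order: VD becomes the reference, so a binomial glm would model P(CS)
d_cs <- factor(delivery, levels = c("VD", "CS"))

# relevel() moves one chosen group to the front as the reference group
height <- relevel(factor(c("Tall", "Medium", "Short")), ref = "Tall")

print(levels(d_default))
print(levels(d_cs))
print(levels(height))
```

With Tall as the reference, the dummy variables for height would then be Medium and Short, each compared with Tall.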
