Glossary
The Fundamental Equation
Working with odds ratios
Why no error term?
Why isn't Y in the fundamental equation?
How does maximum-likelihood estimation work?
Testing the significance of the whole model
Testing the significance of an explanatory variable
Performing a Logistic Regression with Computers
JMP IN
SPSS
SAS
Interpreting the output
Suppose that as the manager of a store's credit department,
you have access to financial information (including categorical and
numeric variables) about customers from credit-reporting agencies.
Furthermore, based on your experience with these customers in the past
you can objectively classify them as good borrowers and those who have
defaulted on a payment to the store. Logistic regression lets
you see how the financial data relate to being a good borrower. With
this information you could tell which new customers should be granted store
credit and which should not.
| Independent variable | One or more numeric variables (may include dummy variables). |
| Dependent variable | A categorical variable with only two values: Yes/No, On/Off, etc. (such a variable is called a dichotomous variable for obvious reasons). CAUTION: if Yes or No refers to whether an event happens after a variable length of time, logistic regression is inappropriate. See time-to-event (survival) models. |
| Null hypothesis (H0) | None of the independent variables affects the probability that the dependent variable will be Yes or No. This implies that ß1, ß2, and ß3 are all zero, so that ß0 is the only parameter left in the model. |
| Research hypothesis | The dependent variable is more likely to be Yes for some values of the independent variables than for others. This implies that at least one of ß1, ß2, and ß3 differs from zero. |
| Test statistic | χ² |
| Rejection region | Right tail (values of χ² that are significantly larger than its d.f.) |
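$$R = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3}$$
or
$$\ln R = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$$
where
R is the odds ratio that an event will occur,
Xi are the independent variables,
Y is the dependent variable, called the response variable,
ßi (written $\beta_i$ in the equations) are the coefficients to be estimated.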
This equation says that the odds that something will happen
(and hence the probability that it will happen) depend on some explanatory
variables, the Xi.
For example, if ß3 is positive, then an increase in X3 will increase the odds that the event will occur.
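The fundamental equation can be sketched in a few lines of Python. The coefficient values and X values below are made up purely for illustration (they are not estimates from any data in these notes), and the function name `odds` is our own.

```python
import math

def odds(betas, xs):
    """Odds R = exp(b0 + b1*x1 + b2*x2 + ...) for one observation.

    betas[0] is the intercept; betas[1:] pair up with the values in xs.
    """
    linear = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(linear)

# Hypothetical coefficients b0, b1, b2, b3 (illustration only)
betas = [-1.0, 0.2, -0.5, 0.8]

r_before = odds(betas, [1.0, 2.0, 3.0])
r_after = odds(betas, [1.0, 2.0, 4.0])   # same case with X3 raised by one unit

# Because b3 is positive, the odds after the increase are larger.
print(r_before, r_after)
```

Changing the sign of the hypothetical b3 makes the same increase in X3 lower the odds instead.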
Working with odds ratios
The odds ratio is the ratio of the probability
that something will happen to the probability that it will not happen.
(Many texts call this quantity simply the odds; these notes call it the odds ratio throughout.)
Let R = the odds ratio and let P = the probability
that something will happen.
Then the odds ratio is R = P ÷ (1 -
P) and P = R ÷ (1 + R).
Furthermore, lnR = ln P - ln(1 - P).
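A small sketch of these conversions in Python may help. The function names are our own, and the probabilities in the loop are arbitrary illustrative values.

```python
import math

def odds_from_prob(p):
    """R = P / (1 - P); requires 0 <= p < 1."""
    return p / (1.0 - p)

def prob_from_odds(r):
    """P = R / (1 + R); requires r >= 0."""
    return r / (1.0 + r)

# Tabulate P, R and ln R for a few arbitrary probabilities.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    r = odds_from_prob(p)
    print(f"P = {p:.2f}   R = {r:.4f}   ln R = {math.log(r):+.4f}")
```

Note that P = 0.5 corresponds to R = 1 and ln R = 0, and that probabilities below 0.5 give a negative ln R.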
Instead of the dichotomous Y itself, we might use the probability that the result would occur.
This would be better than a dichotomous variable in that a probability
can be any real number between 0 and 1 rather than only one of two values. But the regression could
predict that the probability is 1.2 or -0.3, and we know that probabilities
cannot have such values.
The odds ratio is even better because it can have any non-negative real value, and its natural logarithm can have any real value at all, so no predicted value of ln R is out of range.
On the other hand, the logistic regression method discussed here uses maximum-likelihood estimation. For each observation j, the method calculates the probability of observing Yj based on the odds ratio Rj for that observation. It then chooses the set of ßi that maximizes the probability of observing the Yj that we did observe in the sample.
To perform a maximum-likelihood estimation we need a function that tells how the probabilities of the results are determined. The fundamental equation gives us this information by giving us a formula for the odds ratio. These probabilities are applied to the Yj.
The maximum-likelihood criterion for estimating a set of parameters specifies a probability function in terms of a set of parameters and then finds the set of values for those parameters that gives the greatest likelihood of observing the results that we actually did observe.
For example, take a simple one-parameter function: the
probability that one of Acme's
widgets will be good is P and the probability
that it will be defective is (1 - P). We sample four of their widgets
and find that the third one is defective and the others are good.
To apply this probability scheme to the observed results (the Yj) we assume that these widgets are selected randomly and independently.
The probability of finding such a sample is then $L = P \times P \times (1 - P) \times P = P^3 - P^4$. Use calculus to find the P that maximizes this function:
$dL/dP = 3P^2 - 4P^3 = P^2(3 - 4P) = 0$. The root P = 0 gives L = 0, which is a minimum, so the maximum is at P = 3/4,
which happens to be the sample proportion. This illustrates that
the sample proportion is a maximum-likelihood estimator of the probability
that a randomly chosen Y will be a success.
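The widget example can also be checked numerically. The sketch below replaces the calculus with a simple grid search over candidate values of P; the grid spacing of 0.001 is an arbitrary choice of ours.

```python
def likelihood(p):
    """Probability of the observed sample: good, good, defective, good."""
    return p * p * (1.0 - p) * p

# Try every P from 0.000 to 1.000 in steps of 0.001 and keep the best one.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)

print(p_hat, likelihood(p_hat))
```

The search settles on the sample proportion of good widgets, 3 out of 4, in agreement with the calculus.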
Essentially, this is what logistic regression does except
that instead of having a simple
constant P as we had, logistic regression lets
P be a function of the independent variables, Xi.
Interpreting the ßi
Recall that $e^{x+y} = e^x e^y$. So the fundamental equation can be written as
$$R = e^{\beta_0}\, e^{\beta_1 X_1}\, e^{\beta_2 X_2}\, e^{\beta_3 X_3}$$
If one of the ßi is zero, then $e^{\beta_i X_i}$ is 1, so that term drops out of the equation, because multiplying by 1 has no effect.
If ßi is different from zero, then changes in Xi will have
a multiplicative effect on R.
A one-unit increase in Xi will change $e^{\beta_i X_i}$ to
$e^{\beta_i (X_i + 1)}$, which is $e^{\beta_i X_i} e^{\beta_i}$. So a one-unit increase in Xi will multiply R by a factor of $e^{\beta_i}$.
So, if ßi = 2.0 then a one-unit increase in Xi will increase R by a factor of 7.389.
If ßi = 0.5 then a one-unit increase in Xi will increase R by a factor of 1.649.
If ßi = -0.4 then a one-unit increase in Xi will multiply R by a factor of 0.6703, which is really a
decrease of 32.97%.
R would double as a result of a one-unit increase
in Xi if ßi = 0.6931.
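These factors are easy to verify. The short sketch below computes the multiplicative factor and the corresponding percentage change in R for each coefficient value mentioned above; the helper name `factor_for` is our own.

```python
import math

def factor_for(beta):
    """Factor by which R is multiplied when X rises by one unit."""
    return math.exp(beta)

for beta in (2.0, 0.5, -0.4):
    factor = factor_for(beta)
    pct_change = (factor - 1.0) * 100.0
    print(f"beta = {beta:+.1f}: R is multiplied by {factor:.4f} ({pct_change:+.2f}%)")

# The coefficient that exactly doubles R is ln 2.
doubling_beta = math.log(2.0)
print(f"R doubles when beta = {doubling_beta:.4f}")
```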
When ßi = 2.0
then a one-unit change in Xi will increase
R by a factor of 7.389. So a
three-unit change in Xi will increase R by a factor of 3 x 7.389, right?
WRONG! Each one-unit increase
in Xi causes R to increase by a multiplicative
factor of 7.389. So three one-unit increases in Xi will cause R to increase by a factor of 7.389 x 7.389 x 7.389 = $e^{6}$, or about 403.4.
Testing the significance of the whole equation
To test the significance of the whole equation (or the whole model) means to test whether any of the explanatory variables (the Xs) have an effect on the dependent variable (Y). If we find that the answer is "yes", we would also want to test the significance of individual explanatory variables. The approach used here is to measure the "fit" of the model to the data with and without the explanatory variables. The "fit" is measured by the likelihood (probability) that the observed sample data should be observed under the assumptions of the model. Technically, the fit is measured as the negative of twice the natural logarithm of the likelihood, -2LL.
The discussion below uses the following dataset for an example:
| X | Y |
| 2 | Hit |
| 2 | Hit |
| 2 | Miss |
| 2 | Hit |
| 4 | Miss |
| 4 | Hit |
| 4 | Miss |
| 4 | Miss |
| 6 | Miss |
| 6 | Miss |
| 6 | Miss |
| 6 | Miss |
-2LL. The natural logarithm of the likelihood
function can be used to test the null hypothesis that all of the ßi other than ß0 are equal to zero. This is similar to the null hypothesis
of the F test in a familiar multiple linear regression.
If all of the ßi except for ß0 are equal to zero, the best prediction
for the odds ratio is the odds ratio for all Y in the whole sample.
Prediction under the null hypothesis. The
null hypothesis says that none of the X variables has any influence
on the probability or on the odds that Y will be a success. If
this is true, we could get the best estimates of the odds ratio by running
a logistic regression with no X variables. (JMP IN actually lets us do
this.) Under this circumstance, the best estimate of the odds ratio,
R, would be the odds ratio for the whole sample. For example, consider
a sample of 12 observations in which one-third of the Ys are successes
and two-thirds are failures. The naive estimate (using the null
hypothesis) of the odds ratio would be 1/3 ÷ 2/3 = 1/2. R
will be 1/2 if ß0 is ln(1/2)
= -0.69315. (To verify this, note that $e^{-0.69315} = 1/2$.)
If the null hypothesis is true, each success had a 1/3 probability
of occurring and each failure had a 2/3 probability. Based on these
probabilities, the probability of observing four successes and eight
failures (as we did observe in this sample) would be $(1/3)^4 (2/3)^8 = 0.00048170916$.
The natural logarithm of this probability is
-7.63817.
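The arithmetic for the naive model can be reproduced with a few lines of Python. This is only a check of the numbers quoted above, using the counts from the 12-observation sample; the variable names are our own.

```python
import math

n_success, n_failure = 4, 8          # Hits and Misses in the sample
n = n_success + n_failure

p_null = n_success / n               # 1/3 for every observation
odds_null = p_null / (1.0 - p_null)  # 1/2
beta0 = math.log(odds_null)          # intercept of the no-X model

# Log-likelihood of the observed sample when every case has probability p_null
log_lik = n_success * math.log(p_null) + n_failure * math.log(1.0 - p_null)

print(f"beta0 = {beta0:.5f}")
print(f"likelihood = {math.exp(log_lik):.11f}")
print(f"log-likelihood = {log_lik:.5f}")
```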
The JMP IN output for this model--using no X variable--is
as shown below:
| Model | -LogLikelihood | DF | ChiSquare | Prob>ChiSq |
| Difference | 0 | 0 | 0 | 0.0000 |
| Full | 7.63817002 | | | |
| Reduced | 7.63817002 | | | |

| RSquare (U) | 0 |
| Observations | 12 |

| Term | Estimate | Std. Error | Chi Square | Prob>ChiSq |
| Intercept | -0.6931472 | 0.61237244 | 1.28120803 | 0.2577 |
Prediction under the alternative hypothesis. The
alternative hypothesis says that the X variables do influence the
probability of success and the odds ratio. The alternative hypothesis
does not constrain the ß1, ß2, and ß3 to equal zero. In the 12-observation
sample used above, it is apparent that a success ("Hit") is more likely
when X is 2 than when X is 6. The JMP IN results of this logistic
regression are the following:
| Model | -LogLikelihood | DF | Chi Square | Prob>ChiSq |
| Difference | 3.03305336 | 1 | 6.0661067 | 0.01378004 |
| Full | 4.60511666 | | | |
| Reduced | 7.63817002 | | | |

| RSquare (U) | 0.3971 |
| Observations | 12 |

| Term | Estimate | Std. Error | Chi Square | Prob>ChiSq |
| Intercept | 3.75134428 | 2.3346487 | 2.58184856 | 0.10809536 |
| X | -1.2703325 | 0.69619146 | 3.3294876 | 0.06804807 |
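For readers without JMP IN, the estimates can be approximated with a short maximum-likelihood fit in plain Python. This is a minimal sketch that uses Newton-Raphson iteration for one X variable. It is not a description of the algorithm JMP IN itself uses, and the fixed count of 25 iterations is an arbitrary choice of ours that is ample for this small dataset.

```python
import math

# The 12-observation example: X value and outcome (1 = Hit, 0 = Miss)
xs = [2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6]
ys = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

def prob(b0, b1, x):
    """P(success) implied by the fundamental equation, P = R / (1 + R)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def neg_log_lik(b0, b1):
    """-LogLikelihood of the observed sample for given coefficients."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(b0, b1, x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit(iterations=25):
    """Newton-Raphson search for the coefficients that maximize the likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # solve the 2x2 system directly
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit()
print(f"Intercept = {b0_hat:.5f}, X = {b1_hat:.5f}")
print(f"-LogLikelihood (Full) = {neg_log_lik(b0_hat, b1_hat):.5f}")
print(f"-LogLikelihood (Reduced) = {neg_log_lik(math.log(0.5), 0.0):.5f}")
```

The results should agree with the Estimate and -LogLikelihood columns of the output to several decimal places.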
This time, notice that the -LogLikelihoods for the Full
and Reduced models are different. The -LogLikelihood for the Reduced
model (without the X variables, or with ß1 set equal to zero) is the same as before. The -LogLikelihood
for the Full model (with the X variables in the equation) is less, indicating
that the estimated likelihood of getting the sample that we observed is
greater than the likelihood that the naive model gives us. The
-LogLikelihood for the Difference between the models is a measure of
how much adding the X variables to the model improves the model's fit and predicting
power. Statisticians have demonstrated that, for large sample sizes and when the null hypothesis is true,
twice this difference in -LogLikelihood has a probability distribution that is approximately a
chi-square distribution. Note that the chi-square statistic is exactly
twice the -LogLikelihood for the Difference. The number of degrees
of freedom is equal to the number of X variables in the model. The
Prob>ChiSq is the p value for the whole-model test. If there
is a significant improvement in the fit of the model, the difference in
the -LogLikelihoods will be great and the chi-square will be large. The
probability that we would randomly find a chi-square value this large or
larger will therefore be small. So, small p values (less than
5% usually, sometimes less than 1% if we want to be more cautious) indicate
that changes in X do have an effect on the probability of success. (This
is a whole-model test. If there had been more than one X variable,
a small p value would indicate that at least one, but not necessarily
all, of the X variables has an effect on the probability of success.) On
the other hand, if there really is no effect of X on the probability of success,
the difference would be small (it should be zero but there is always some
randomness), the chi-square value would be small, and the p value
would be large: closer to 0.5 or even 1.
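The whole-model test can be reproduced from the two -LogLikelihood values in the output. The sketch below relies on the fact that, with 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x/2)); with more than one X variable a general chi-square routine would be needed instead. The helper name `chi2_upper_tail_1df` is our own.

```python
import math

neg_ll_reduced = 7.63817002   # -LogLikelihood, Reduced model (no X)
neg_ll_full = 4.60511666      # -LogLikelihood, Full model (with X)

# Chi-square statistic: twice the improvement in -LogLikelihood
chi_square = 2.0 * (neg_ll_reduced - neg_ll_full)

def chi2_upper_tail_1df(x):
    """P(chi-square with 1 d.f. >= x). Valid for 1 d.f. only."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_upper_tail_1df(chi_square)

print(f"ChiSquare = {chi_square:.7f}")
print(f"Prob>ChiSq = {p_value:.8f}")
```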
The Logit R² (sometimes denoted by U) is similar
to the R² from a linear regression analysis. It gives the proportion
of the naive model's overall lack of fit that is removed by using the X variables in the model to
change the probabilities. It is calculated as
$$R^2 = \frac{(-2LL_{\text{Reduced}}) - (-2LL_{\text{Full}})}{-2LL_{\text{Reduced}}}$$
or in terms of the JMP IN output
$$R^2 = \frac{\text{-LogLikelihood for Difference}}{\text{-LogLikelihood for Reduced}}$$
In the example above, we have R² = 3.033 ÷
7.638 = 0.3971. If the X variable had not had any influence on the
probability of success, R² would be close to zero. At the other
extreme, if knowing X could let us predict success or failure perfectly,
R² would be 1.00.
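As a quick check, the RSquare (U) figure can be recomputed from the -LogLikelihood values reported in the two outputs. The function name `r_square_u` is our own.

```python
def r_square_u(neg_ll_reduced, neg_ll_full):
    """Share of the Reduced model's -LogLikelihood removed by the Full model."""
    return (neg_ll_reduced - neg_ll_full) / neg_ll_reduced

# Model with X, then the model with no X variable (Full equals Reduced)
with_x = r_square_u(7.63817002, 4.60511666)
without_x = r_square_u(7.63817002, 7.63817002)

print(f"RSquare (U) with X: {with_x:.4f}")
print(f"RSquare (U) without X: {without_x:.4f}")
```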
Making predictions. Once we have estimates
of the parameters (the ßi) of the fundamental equation, we can estimate the
probability of success for any observation or any hypothetical values of
the X variables. A natural criterion is to
predict success only when the estimated probability is greater than 1/2 (equivalently, when the odds ratio,
R, is greater than 1.0, or when ß0 + ß1X1 + ß2X2 +
ß3X3 is greater than zero). However, sometimes
we may be willing to miss a few true successes so that we are less apt to
predict success incorrectly when in fact the case turns out to be a failure.
For example, we might be trying to predict which patients will be suitable
for a new drug. If the drug has fatal side-effects, we would want to
reduce the chance of falsely predicting success, so we would require a probability well above 1/2 before predicting success.
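Prediction with the fitted equation can be sketched as follows, using the Intercept and X estimates from the JMP IN output for the 12-observation example. The function names and the stricter cutoff of 0.8 are our own illustrative choices.

```python
import math

B0, B1 = 3.75134428, -1.2703325   # Intercept and X estimates from the output

def predict_prob(x):
    """Estimated probability of a Hit: P = R / (1 + R)."""
    r = math.exp(B0 + B1 * x)
    return r / (1.0 + r)

def predict(x, cutoff=0.5):
    """Predict a Hit only when the estimated probability exceeds the cutoff."""
    return "Hit" if predict_prob(x) > cutoff else "Miss"

for x in (2, 4, 6):
    print(f"X = {x}: P(Hit) = {predict_prob(x):.4f}, "
          f"cutoff 0.5 -> {predict(x)}, cutoff 0.8 -> {predict(x, 0.8)}")
```

Raising the cutoff makes the rule more reluctant to predict success: with a cutoff of 0.8, even the X = 2 cases are no longer predicted to be Hits.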
Classification matrix: Whatever criterion we use, we can classify our predictions according to the table below. The percentage correctly classified (also called the hit ratio) is the sum of the number of successes correctly predicted and the number of failures correctly predicted divided by the total number of cases. Some researchers choose their criterion so as to maximize their hit ratio. JMP IN's Receiver Operating Characteristic curve (ROC) plots similar information for various criterion levels.
| | Predicted success | Predicted failure |
| Actual success | 3 | 1 |
| Actual failure | 1 | 7 |
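The counts in the classification matrix can be rebuilt from the example data. This sketch assumes the 1/2 probability criterion and the coefficient estimates from the JMP IN output; the dictionary layout keyed by (actual, predicted) is our own choice.

```python
import math

xs = [2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6]
ys = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # 1 = Hit, 0 = Miss
B0, B1 = 3.75134428, -1.2703325             # estimates from the output

def predicted_success(x, cutoff=0.5):
    r = math.exp(B0 + B1 * x)
    return r / (1.0 + r) > cutoff

def classification_matrix(cutoff=0.5):
    """Counts keyed by (actual success?, predicted success?)."""
    counts = {(a, p): 0 for a in (True, False) for p in (True, False)}
    for x, y in zip(xs, ys):
        counts[(y == 1, predicted_success(x, cutoff))] += 1
    return counts

matrix = classification_matrix()
hit_ratio = (matrix[(True, True)] + matrix[(False, False)]) / len(ys)

print(matrix)
print(f"hit ratio = {hit_ratio:.4f}")
```

Passing a different cutoff to `classification_matrix` shows how the hit ratio changes with the criterion.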
Testing the significance of an explanatory variable
The Wald statistic is used to test the null
hypothesis that a ßi is equal
to zero, similar to a t statistic in the familiar multiple linear
regression. If the p value for the estimate of a coefficient ßi is small (less
than or equal to α), the null hypothesis
should be rejected, implying that the variable Xi does influence the probability that Y will be 1.0 (a "success").
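The Chi Square and Prob>ChiSq columns of the parameter table are consistent with a Wald statistic computed as the squared ratio of the estimate to its standard error and referred to a chi-square distribution with 1 d.f. The sketch below checks this against the reported figures; the function name `wald_test` is our own, and the 1 d.f. tail probability uses erfc(sqrt(x/2)).

```python
import math

def wald_test(estimate, std_error):
    """Return (chi-square, p value) for H0: coefficient = 0, with 1 d.f."""
    chi_square = (estimate / std_error) ** 2
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

# Estimates and standard errors taken from the JMP IN output
chi_x, p_x = wald_test(-1.2703325, 0.69619146)
chi_int, p_int = wald_test(3.75134428, 2.3346487)

print(f"X:         Chi Square = {chi_x:.4f}, Prob>ChiSq = {p_x:.4f}")
print(f"Intercept: Chi Square = {chi_int:.4f}, Prob>ChiSq = {p_int:.4f}")
```

In this small sample the Wald p value for X (about 0.068) is larger than the whole-model p value (about 0.014), so the two tests need not agree exactly.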
If we find that some of the variables do not seem to have
an effect on Y, it is usually good practice to reestimate the model without
those variables. (However, if there is a theoretical reason to believe
that a variable should have an effect on Y, then it would be better to leave
the variable in the equation. Failing to prove that a variable
influences Y is not the same as proving that it does not influence
Y.)