Machine learning (ML) develops algorithms to identify patterns in data (unsupervised ML) or make predictions and inferences (supervised ML).
Supervised ML trains the machine to learn from prior examples to predict either a categorical outcome (classification) or a numeric outcome (regression), or to infer the relationships between the outcome and its explanatory variables.
Two early forms of supervised ML are linear regression (OLS) and generalized linear models (GLM) (Poisson regression and logistic regression). These methods have been extended with advanced linear methods, including stepwise selection, regularization (ridge, lasso, elastic net), principal components regression, and partial least squares. With greater computing capacity, non-linear models are now in use, including polynomial regression, step functions, splines, and generalized additive models (GAM). Decision trees (bagging, random forests, and boosting) are additional options for regression and classification, and support vector machines are an additional option for classification.
These notes cover GLMs, including logistic regression, and the similarly structured Poisson regression. Both are generalizations of multiple linear regression, the first for categorical responses, the second for count responses.
In generalized linear models, the modeled response is a function of the mean of Y, not Y itself. This function is called a link function. The purpose of the link function is to 1) convert the response mean from its natural range ($(0, 1)$ in logistic and probit, $(0, \infty)$ for Poisson) to a value ranging from $-\infty$ to $+\infty$, and 2) create a linear relationship with the predictor variables.
For a standard linear regression, the link function is the identity function,

$$g(E(Y)) = E(Y)$$

The standard linear regression is thus a special case of the GLM. For a logistic regression, the link function is the logit,

$$g(E(Y)) = \ln\left(\frac{\pi}{1 - \pi}\right), \quad \pi = E(Y)$$

For a probit regression, the link function is the inverse standard normal CDF,

$$g(E(Y)) = \Phi^{-1}(\pi)$$

For a Poisson regression, the link function is the log,

$$g(E(Y)) = \ln(\lambda), \quad \lambda = E(Y)$$
The difference between the logistic and probit link functions is theoretical - the practical difference is slight. Logistic regression has the advantage that it can be back-transformed from log odds to odds ratios.
In R, specify a GLM just like a linear model, but with the glm() function, specifying the distribution with the family parameter.
family = "gaussian"
: linear regressionfamily = "binomial"
: logistic regressionfamily = binomial(link = "probit")
: probitfamily = "poisson"
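For example, a minimal sketch of each specification (the data frame dat, response y, and predictor x are illustrative placeholders, not from these notes):

glm(y ~ x, data = dat, family = "gaussian")                 # linear regression
glm(y ~ x, data = dat, family = "binomial")                 # logistic regression
glm(y ~ x, data = dat, family = binomial(link = "probit"))  # probit regression
glm(y ~ x, data = dat, family = "poisson")                  # Poisson regression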
Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal (multiple categories), or ordinal (multiple levels).
The binary logistic regression model is

$$\ln\left(\frac{\pi}{1 - \pi}\right) = X\beta$$

where $\pi$ is the “success probability” that an observation is in a specified category of the binary Y variable. You can solve this for $\pi$ to get

$$\pi = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

The model predicts the log odds of the response variable. For a sample of size n, the likelihood function is

$$L(\beta; y, X) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

The log-likelihood is

$$l(\beta) = \sum_{i=1}^{n} \left( y_i \ln(\pi_i) + (1 - y_i) \ln(1 - \pi_i) \right)$$
Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through iteratively reweighted least squares.
Dataset leuk
contains response variable REMISS
indicating whether leukemia remission occurred (1|0) and several explanatory variables.
#download.file(url = "https://newonlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission/index.txt", destfile = "./Data/leukemia_remission.txt")
leuk <- read.delim(file = "./Data/leukemia_remission.txt", sep = "\t")
str(leuk)
## 'data.frame': 27 obs. of 7 variables:
## $ REMISS: int 1 1 0 0 1 0 1 0 0 0 ...
## $ CELL : num 0.8 0.9 0.8 1 0.9 1 0.95 0.95 1 0.95 ...
## $ SMEAR : num 0.83 0.36 0.88 0.87 0.75 0.65 0.97 0.87 0.45 0.36 ...
## $ INFIL : num 0.66 0.32 0.7 0.87 0.68 0.65 0.92 0.83 0.45 0.34 ...
## $ LI : num 1.9 1.4 0.8 0.7 1.3 0.6 1 1.9 0.8 0.5 ...
## $ BLAST : num 1.1 0.74 0.18 1.05 0.52 0.52 1.23 1.35 0.32 0 ...
## $ TEMP : num 1 0.99 0.98 0.99 0.98 0.98 0.99 1.02 1 1.04 ...
I’ll just compare REMISS ~ LI. In a situation like this where the relationship is bivariate, start with a visualization.
ggplot(leuk, aes(x = LI, y = REMISS)) +
geom_jitter() +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(title = "Remmissions by LI")
Fit a logistic regression in R using glm(formula, data, family = binomial), where family = binomial specifies a binomial error distribution. For simplicity, model REMISS ~ LI.
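A sketch of the call that produces the summary below, saving the fitted model as m1 (the name used throughout these notes):

m1 <- glm(REMISS ~ LI, data = leuk, family = binomial)
summary(m1)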
##
## Call:
## glm(formula = REMISS ~ LI, family = binomial, data = leuk)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9448 -0.6465 -0.4947 0.6571 1.6971
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.777 1.379 -2.740 0.00615 **
## LI 2.897 1.187 2.441 0.01464 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.372 on 26 degrees of freedom
## Residual deviance: 26.073 on 25 degrees of freedom
## AIC: 30.073
##
## Number of Fisher Scoring iterations: 4
The predicted value is the estimated log odds of the response variable,

$$\hat{y} = \ln\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = X\hat{\beta}$$

Suppose LI is 1.0; then the log odds of REMISS is $-3.777 + 2.897 \times 1.0 = -0.88$.
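A sketch of that prediction with predict(), which returns the log odds by default:

predict(m1, newdata = data.frame(LI = 1))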
## 1
## -0.8798763
The log odds is not easy to interpret, but it is convenient for updating prior probabilities in Bayesian analyses. See this article in Statistics How To. Exponentiate the log odds to get the more intuitive odds.
In this case, the odds of REMISS when LI = 1 is $e^{-0.88} = 0.41$. You can solve the equation for $\pi$ to get the probability:

$$\pi = \frac{\text{odds}}{1 + \text{odds}} = \frac{e^{X\hat{\beta}}}{1 + e^{X\hat{\beta}}}$$

In this case, the probability of REMISS when LI = 1 is $0.41 / (1 + 0.41) = 0.29$. The predict() function for a logistic model returns log odds by default, but can also return $\pi$ with parameter type = "response".
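A sketch of the probability prediction (exponentiating the default log-odds prediction would give the odds instead):

predict(m1, newdata = data.frame(LI = 1), type = "response")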
## 1
## 0.2932034
It is common to express the results in terms of the odds ratio. The odds ratio is the ratio of the odds before and after an increment to a predictor. It tells you how much the odds are multiplied after a unit increase in $x_j$:

$$\text{OR} = \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\hat{\beta}_j}$$

In this simple bivariate case, increasing LI by 0.1 multiplies the odds of REMISS by a factor of $e^{2.897 \times 0.1} = 1.34$. The odds ratio can be calculated from the model using oddsratio::or_glm(), or by using the tidy() function with exponentiate = TRUE. Note that the default increment is 1, so if you want to see a 0.1 increase, change the model from y ~ x to y ~ I(x*10).
tidy(glm(REMISS ~ I(LI*10), data = leuk, family = "binomial"), exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0229 1.38 -2.74 0.00615 0.000916 0.244
## 2 I(LI * 10) 1.34 0.119 2.44 0.0146 1.09 1.77
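The second table is produced by or_glm(); a minimal sketch of that call, where the incr argument sets the increment for each predictor:

oddsratio::or_glm(data = leuk, model = m1, incr = list(LI = 0.1))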
## # A tibble: 1 x 5
## predictor oddsratio `CI_low (2.5)` `CI_high (97.5)` increment
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 LI 1.34 1.09 1.77 0.1
The predicted values can also be expressed as the probabilities $\pi$. This produces the familiar sigmoidal shape of the binary relationship.
augment(m1, type.predict = "response") %>%
ggplot(aes(x = LI, y = REMISS)) +
geom_point() +
geom_line(aes(y = .fitted), color = "red") +
labs(x = "LI",
y = "Probability of Event",
title = "Binary Fitted Line Plot")
Whereas in linear regression the coefficient p-values are based on the t-test (t statistic), logistic regression coefficient p-values are based on the Wald test (z statistic).
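A sketch of pulling the z statistics and p-values out of the model summary:

round(summary(m1)$coefficients[, "z value"], 3)
round(summary(m1)$coefficients[, "Pr(>|z|)"], 3)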
## (Intercept) LI
## -2.740 2.441
## (Intercept) LI
## 0.006 0.015
Evaluate a logistic model fit with an analysis of deviance. Deviance is defined as $-2$ times the log-likelihood, $D = -2 \, l(\beta)$. The null deviance is the deviance of the null (intercept-only) model and is analogous to SST in ANOVA. The residual deviance is analogous to SSE in ANOVA.
## 'log Lik.' 21.59385 (df=7)
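The analysis of deviance table below comes from anova(); a sketch of the call:

anova(m1)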
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: REMISS
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 26 34.372
## LI 1 8.2988 25 26.073
##
## Call: glm(formula = REMISS ~ LI, family = binomial, data = leuk)
##
## Coefficients:
## (Intercept) LI
## -3.777 2.897
##
## Degrees of Freedom: 26 Total (i.e. Null); 25 Residual
## Null Deviance: 34.37
## Residual Deviance: 26.07 AIC: 30.07
The deviance of the null model (no regressors) is 34.372. The deviance of the full model is 26.073.
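These quantities are also available in a single row with broom's glance() function; a sketch:

glance(m1)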
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 34.4 26 -13.0 30.1 32.7 26.1 25
Use the GainCurvePlot()
function to plot the gain curve (background on gain curves at Data Science Central)
from the model predictions. The x-axis is the fraction of items seen
when sorted by the predicted value, and the y-axis is the cumulative
summed true outcome. The “wizard” curve is the gain curve when the data
is sorted by the true outcome. If the model’s gain curve is close to the
wizard gain curve, then the model sorted the response variable well.
The grey area is the gain over a random sorting.
augment(m1) %>% data.frame() %>%
GainCurvePlot(xvar = ".fitted", truthVar = "REMISS", title = "Logistic Model")
## Warning in ggplot2::geom_abline(mapping = ggplot2::aes(x = pctpop, y = pct_outcome, : Using `intercept` and/or `slope` with `mapping` may not have the desired result as mapping is overwritten if either of these is specified
REMISS
equals 1 in 9 of the 27 responses.
Another way to evaluate the predictive model is the ROC curve. It evaluates all possible thresholds for splitting predicted probabilities into predicted classes. This is often a much more useful metric than simply ranking models by their accuracy at a set threshold, as different models might require different calibration steps (looking at a confusion matrix at each step) to find the optimal classification threshold for that model.
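A sketch of the AUC calculation, using colAUC() from the caTools package on the predicted probabilities:

library(caTools)
colAUC(predict(m1, type = "response"), leuk$REMISS, plotROC = TRUE)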
## Warning: package 'caTools' was built under R version 3.6.1
## [,1]
## 0 vs. 1 0.8549383
Poisson regression models count data, like “traffic tickets per day” or “website hits per day”. The response is an expected rate or intensity. For count data, specify the generalized model, this time with family = poisson or family = quasipoisson.
Recall that the probability of observing a count $y$ when the expected rate is $\lambda$ is distributed

$$P(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!}$$

The Poisson regression model is

$$\ln(\lambda) = X\beta$$

You can solve this for $\lambda$ to get

$$\lambda = e^{X\beta}$$

That is, the model predicts the log of the response rate. For a sample of size n, the likelihood function is

$$L(\beta; y, X) = \prod_{i=1}^{n} \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}$$

The log-likelihood is

$$l(\beta) = \sum_{i=1}^{n} \left( y_i \ln(\lambda_i) - \lambda_i - \ln(y_i!) \right)$$
Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through iteratively reweighted least squares.
Poisson processes assume the variance of the response variable equals its mean. Here, “equals” means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasi-Poisson family instead. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (> 30), linear regression will also work fine. Whereas RMSE is not useful for logistic models, it is a good metric for Poisson models.
Dataset fire
contains response variable injuries
counting the number of injuries during the month and one explanatory variable, the month mo
.
## Parsed with column specification:
## cols(
## ID = col_double(),
## `Injury Date` = col_datetime(format = ""),
## `Total Injuries` = col_double()
## )
fire <- fire %>%
mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>%
rename(dt = `Injury Date`,
injuries = `Total Injuries`)
str(fire)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 300 obs. of 4 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ dt : POSIXct, format: "2005-01-10" "2005-01-11" ...
## $ injuries: num 1 1 1 5 2 1 1 1 1 1 ...
## $ mo : num 1 1 1 1 1 1 2 2 2 4 ...
In a situation like this where the relationship is bivariate, start with a visualization.
ggplot(fire, aes(x = mo, y = injuries)) +
geom_jitter() +
geom_smooth(method = "glm", method.args = list(family = "poisson")) +
labs(title = "Injuries by Month")
Fit a Poisson regression in R using glm(formula, data, family = poisson). But first, check whether the mean and variance of injuries are of the same magnitude. If not, use family = quasipoisson.
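A sketch of the check:

mean(fire$injuries)
var(fire$injuries)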
## [1] 1.36
## [1] 1.020468
They are of the same magnitude, so fit the regression with family = poisson.
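A sketch of the call that produces the summary below, saving the fitted model as m2 (the name used in the rest of this section):

m2 <- glm(injuries ~ mo, data = fire, family = poisson)
summary(m2)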
##
## Call:
## glm(formula = injuries ~ mo, family = poisson, data = fire)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3987 -0.3473 -0.3034 -0.2502 4.3185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22805 0.10482 2.176 0.0296 *
## mo 0.01215 0.01397 0.870 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 139.87 on 299 degrees of freedom
## Residual deviance: 139.11 on 298 degrees of freedom
## AIC: 792.08
##
## Number of Fisher Scoring iterations: 5
The predicted value is the estimated log of the response rate,

$$\hat{y} = \ln(\hat{\lambda}) = X\hat{\beta}$$

Suppose mo is January (mo = 1); then the log of injuries is $0.228 + 0.0122 \times 1 = 0.240$. Or, more intuitively, the expected count of injuries is $e^{0.240} = 1.27$.
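A sketch of both predictions with predict():

predict(m2, newdata = data.frame(mo = 1))                     # log of expected count
predict(m2, newdata = data.frame(mo = 1), type = "response")  # expected count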
## 1
## 0.2401999
## 1
## 1.271503
Here is a plot of the predicted counts in red.
augment(m2, type.predict = "response") %>%
ggplot(aes(x = mo, y = injuries)) +
geom_point() +
geom_point(aes(y = .fitted), color = "red") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Month",
y = "Injuries",
title = "Poisson Fitted Line Plot")
Evaluate the Poisson model fit with an analysis of deviance.
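A sketch of the deviance summary with glance() and a pseudo-R2 computed as the proportional reduction in deviance:

glance(m2)
1 - m2$deviance / m2$null.deviance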
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 140. 299 -394. 792. 799. 139. 298
## [1] 0.005413723
The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 139.1, and the pseudo-R2 is very low at 0.005. How about the RMSE?
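A sketch of the RMSE calculation on the response scale:

sqrt(mean((fire$injuries - predict(m2, type = "response"))^2))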
## [1] 1.006791
The average prediction error is about 1.0. That’s almost as large as the standard deviation of injuries - i.e., just predicting the mean of injuries would be almost as good! Below is a plot of predicted vs actual counts, followed by the gain curve from the GainCurvePlot() function.
augment(m2, type.predict = "response") %>%
ggplot(aes(x = injuries, y = .fitted)) +
geom_point() +
geom_smooth(method ="lm") +
labs(x = "Actual",
y = "Predicted",
title = "Poisson Fitted vs Actual")
augment(m2) %>% data.frame() %>%
GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model")
## Warning in ggplot2::geom_abline(mapping = ggplot2::aes(x = pctpop, y = pct_outcome, : Using `intercept` and/or `slope` with `mapping` may not have the desired result as mapping is overwritten if either of these is specified
It seems that mo
was a poor predictor of injuries
.