13 min read

Lesson 10 | GLMs & GLMMs

Materials

Data

This lesson uses the same data as Lesson 9, on arrests around athletics venues in Cleveland, Ohio:1

AllClevelandCrimeData.csv

Script

GLM_GLMM.R

Video lecture

Video on YouTube

Walking through the script

Packages & data

Two external packages are required. I recommend installing them but not loading them into the session: they either contain functions that conflict with tidyverse functions, or automatically load other packages that have conflicts. Therefore, install them and refer to their functions using the package::function convention.

Remember, only install them on a given machine once; it is a bad idea to constantly re-install packages that are already installed. And never put a call to install.packages() in an .Rmd file.

install.packages(c("car", "lme4"))

And we always need tidyverse!

pacman::p_load(tidyverse)

We use the same data as in Lesson 9, the Cleveland crime data:

clev_d <- read_csv("./data/AllClevelandCrimeData.csv")
clev_d <-
  clev_d %>%
    mutate(GameDay = recode(GameDay, "0" = "No", 
                                     "1" = "Yes")) %>%
      select(GameDay, Venue)

Analyzing count data

Recall the central research question: Are more arrests made on days with games at the two professional athletics facilities in Cleveland, Ohio?

Visualize with a bar chart…

ggplot(clev_d) + theme_bw(16) +
  geom_bar(aes(x=GameDay), 
           stat="count")

… generate a summary table to create the data to test…

GD <- clev_d %>%
        group_by(GameDay) %>%
          summarize(charges=n())
GD
## # A tibble: 2 x 2
##   GameDay charges
##   <chr>     <int>
## 1 No           86
## 2 Yes         155

Fitting a GLM

… and test with a generalized linear model, or GLM. GLMs are what we use to conduct linear regressions on non-continuous data like counts and probabilities, or even continuous data that don’t fit a normal distribution. GLMs assume the data fit non-Gaussian distributions; since there are many, we must specify which one we want the model to use.

The family= argument is required to tell glm() which distribution we want to use. It includes an optional link= argument. Link functions are not optional, but each family in \({\bf\textsf{R}}\) has a default. Link functions are applied to the response variable, which, because it is not normally distributed, requires a transformation to link it to the math that drives the general linear regression. That's why these are called generalized linear models: they use a link function to connect data that follow non-normal distributions to the general linear regression framework.

Choosing appropriate options for family= and link= is beyond the scope of Intro to \({\bf\textsf{R}}\), but Poisson is a common distribution for count data, and we'll use it here:

glm_gd <- glm(charges ~ GameDay, GD, 
              family=poisson(link = "identity"))
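As an aside not in the original script, you could fit the same model with the Poisson family's default log link for comparison; the coefficients are then on the log scale, so exponentiating them recovers the non-game-day count and the game-day/non-game-day ratio:

# Not in the original script: same model with the default log link
glm_gd_log <- glm(charges ~ GameDay, GD, family = poisson)
# Back-transform: intercept becomes the non-game-day count, GameDayYes the ratio
exp(coef(glm_gd_log))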

Evaluating glm objects is similar to lm objects, but the results are a bit different due to differences in the underlying math. The main difference is that instead of the typical \(t\) test based on a \(t\) distribution or ANOVA based on the \(F\) distribution, many GLM functions in \({\bf\textsf{R}}\) conduct an Analysis of Deviance, which basically determines statistically whether the ratio of the counts deviates from a null hypothesis of 50:50.

For example, if there were 100 arrests and game day had no effect, we’d expect about 50 to have occurred on game days, and 50 on days without games. The Analysis of Deviance tests the degree to which the observed data deviate from 50/50.
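As a quick sanity check of that logic (not part of the script), a classic \(\chi^2\) goodness-of-fit test of the two observed counts against an expected 50:50 split gives a very similar answer to the model-based tests below:

# Not in the original script: test observed counts against a 50:50 expectation
chisq.test(c(86, 155))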

Because there is a little more (and different) math involved in the Analysis of Deviance than in an ANOVA, there are more ways to get there. Thus there are different test statistic options, and various ways to code them in \({\bf\textsf{R}}\).

A few examples include:

summary(glm_gd)
## 
## Call:
## glm(formula = charges ~ GameDay, family = poisson(link = "identity"), 
##     data = GD)
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   86.000      9.274   9.274  < 2e-16 ***
## GameDayYes    69.000     15.524   4.445  8.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2.0034e+01  on 1  degrees of freedom
## Residual deviance: 9.7700e-15  on 0  degrees of freedom
## AIC: 17.177
## 
## Number of Fisher Scoring iterations: 2

This is fairly familiar output: the structure is the same, but instead of a \(t\) statistic like summary() on an lm object, the summary of a glm object returns a standardized z statistic.

Like the summary of lm objects, we can identify parts of our data in the coefficients of the glm object. Recall how lm analyzes factors by assigning the first level as the intercept \(\beta_0\) and estimating the slope \(\beta_1\) as the difference between the intercept and the other factor level. Here, \(\beta_0 = 86\), and since \(155 - 86 = 69\), \(\beta_1 = 69\).
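Not in the original script, but coef() pulls the estimates out of the glm object directly if you want to confirm that arithmetic:

# Extract just the coefficients: (Intercept) = 86, GameDayYes = 155 - 86 = 69
coef(glm_gd)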

Let’s try another function to evaluate regression objects:

anova(glm_gd)
## Analysis of Deviance Table
## 
## Model: poisson, link: identity
## 
## Response: charges
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev
## NULL                        1     20.034
## GameDay  1   20.034         0      0.000

The structure of this table is familiar but it is missing an important component: we don’t get a \(P\) value denoting the significance of the GameDay term. This is because there are several options in how to perform the Analysis of Deviance, and anova needs to be told how to do it via the test= argument:

anova(glm_gd, test = "Chisq") 
## Analysis of Deviance Table
## 
## Model: poisson, link: identity
## 
## Response: charges
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        1     20.034              
## GameDay  1   20.034         0      0.000 7.606e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This table confirms that the increase in arrests on game days is statistically significant.
This is based on a \(\chi^2\) statistic instead of the \(z\) score in summary, similar to how, with an lm object, summary returns a \(t\) statistic while anova returns an \(F\) statistic. They are just based on different math.

The car package

At this point we need to meet the Companion to Applied Regression package, car. car contains a number of helpful functions (I’ve only begun to scratch the surface), some of which are discussed more in Analysis of Ecosystems.

Here we focus on the Anova function. It extends the functionality of base stats::anova. For example, when it detects a glm object, it defaults to an Analysis of Deviance based on a \(\chi^2\) statistic; it doesn’t need us to tell it:2

car::Anova(glm_gd)    
## Analysis of Deviance Table (Type II tests)
## 
## Response: charges
##         LR Chisq Df Pr(>Chisq)    
## GameDay   20.034  1  7.606e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analyzing non-independent data

One of the assumptions of statistical analysis is that the observations in a dataset are independent: the value of a given data point is no more related to the value of any particular other data point than to any other. Researchers often satisfy this assumption by randomly assigning treatment groups across multiple replicated plots and ensuring samples are collected at sufficient distances to avoid spatial autocorrelation.

Such randomization is often not an option for observational studies, and many hierarchical sampling designs intentionally build in spatial dependence in order to account for, or even test for, spatial patterns. Data from such designs often do not meet the assumption of independence and are instead often examples of pseudoreplication: when the number of observational units (samples) is greater than the number of experimental units (true replicates).3

Take, for example, a study with 5 plots from which 3 samples each have been collected. The number of observational units = 15, but \(n \neq 15\). Instead, \(n = 5\), the number of experimental units. One cannot run a basic statistical test like lm or glm on the whole dataset, because the model will assume \(n = 15\) and \(df = 14\); this artificially high statistical power will return a problematic \(P\) value that is lower than would be returned by the actual statistical power of five true replicates (\(df = 4\)).

Our present dataset poses this problem. Arrests in the dataset are not independent, as they cluster around the two venues. There are differences between the venues that are not part of our hypotheses. For example, the venues have different capacities, and since they host different sports in different seasons, they might attract different types of fans. For these reasons and likely others, arrests made at one venue cannot be assumed to be independent of one another.

 ggplot(clev_d) + theme_bw(16) +
   geom_bar(aes(x=GameDay, 
                fill=Venue), 
            stat="count",
            position = position_dodge2())

Note the use of position_dodge2 to create side-by-side, rather than stacked, bars.

Indeed, even though arrests at both venues did increase on game days as we expected, the degree of the response differed between the venues.

Among several possible ways to handle this situation, the primary two are:

  1. Calculate summary statistics for each experimental unit from the sub-sampled observations
  2. Use a model that can properly account for the nested sampling scheme (i.e., one that does not assume a simple variance structure in which variance is pooled among all observations, but instead estimates unique variance components for each discrete experimental unit)

Option (1) is the easiest and facilitates a basic (g)lm with the appropriate degrees of freedom. However, it has drawbacks: in a study with many plots, it can be demoralizing to see all of the effort put into collecting multiple sub-samples per plot essentially erased by compositing them into a single summary statistic.
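Here is a minimal sketch of option (1) with made-up data; plot_d, plot, treatment, and y are hypothetical names, not part of the lesson data:

# Hypothetical data: 5 plots, 3 sub-samples each, treatment assigned at the plot level
set.seed(1)
plot_d <- tibble(plot      = rep(paste0("p", 1:5), each = 3),
                 treatment = rep(c("A", "A", "A", "B", "B"), each = 3),
                 y         = rnorm(15))

# Option (1): collapse sub-samples to one mean per plot, so each row is a true replicate
plot_means <- plot_d %>%
                group_by(plot, treatment) %>%
                  summarize(y = mean(y)) %>%
                ungroup()

# A basic lm() on the plot means now uses the appropriate, plot-level degrees of freedom
summary(lm(y ~ treatment, data = plot_means))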

Option (2) allows us to retain not only the sample size provided by the sub-sampled observations, but also the information about the variance structure within each experimental unit. There are a few types of statistical models that can do this, all based on the basic linear regression/ANOVA framework we learned before. One option is to include a blocking factor in the ANOVA model: ?aov gives us the example of aov(yield ~ N*P*K + Error(block), npk).
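That example is runnable as-is, since the npk dataset ships with base \({\bf\textsf{R}}\):

# The blocking-factor example from ?aov: block enters as an Error() stratum
npk_aov <- aov(yield ~ N*P*K + Error(block), npk)
summary(npk_aov)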

In many cases, the better option is to use a mixed-effect model. By mixed-effect, we mean that the model includes terms for the independent predictor variables we are interested in (fixed effects) in addition to independent variables that we know might contribute to variation but for which we have no hypothesis. We add these latter variables to the model as random effects, which allows the model to control for them, rather than test them.

Note that this assumption of independence is a separate issue from having continuous and/or normally-distributed data. Thus, mixed-effect models are available both for cases in which we'd fit a linear regression model to continuous data that follow a normal distribution, and for cases in which we'd turn to a generalized linear model.

Several packages fit mixed-effect models in \({\bf\textsf{R}}\). I use a very popular package called lme4; you can read all about it here.
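For the continuous, normally-distributed case, the analogous lme4 function is lmer(). As a hedged sketch, reusing the hypothetical plot_d data from the option (1) example above (not part of the lesson script):

# A linear mixed-effect model (LMM): keep all sub-samples, add plot as a random intercept
lmm_sketch <- lme4::lmer(y ~ treatment + (1|plot), data = plot_d)
summary(lmm_sketch)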

Fitting a GLMM

GLMM stands for Generalized Linear Mixed-effect Model. We will fit one with lme4::glmer.

But first, let's look at the summary table of arrest counts by game-day status across the venues:

GD_v <- clev_d %>%
          group_by(Venue, GameDay) %>%
            summarize(charges=n()) %>% 
          ungroup
GD_v
## # A tibble: 4 x 3
##   Venue GameDay charges
##   <fct> <chr>     <int>
## 1 FE    No           43
## 2 FE    Yes          70
## 3 GP    No           43
## 4 GP    Yes          85

lme4 functions follow the same Y ~ X formula notation as (g)lm. We simply add the random effect in parentheses. It has two components, separated by the vertical bar, |: on the left, a term that can modify the slope of the regression lines; on the right, the grouping variable across which the intercepts of the regression lines can vary.

Recall from our lesson on multiple regression and interaction terms that it is pretty serious business to start modelling different slopes, so in this case, we’ll leave it alone by entering \(1\) on the left of the bar.

Recall also that there are many reasons why we might allow the intercept to vary among groups. If we have a hypothesis for how a variable that might cause different intercepts affects the response variable and we want to test for it statistically, we’d add it as a fixed effect and build a multiple regression model.

But if we know a variable might cause different intercepts among the groups but we want to control for that variation rather than explicitly test for it, we’d add it to the right side of the random effect term.

In these data, we want to ensure that the model groups variation by the individual venues, rather than pooling it all together (and effectively assuming there is a single intercept). We add the random effect term (1|Venue) to make the model pool variation within each Venue group, and allow the intercepts to vary by venue, as well:

glmer_gd <- lme4::glmer(charges ~ GameDay + (1|Venue), 
                        data = GD_v, 
                        family=  poisson(link = "identity"))
## boundary (singular) fit: see ?isSingular

You can ignore the warning about singular fit.4

Another nice thing about lme4 is that its model objects play well with the same base functions we used to query conventional regression objects:

summary on glmer object

summary(glmer_gd)  
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( identity )
## Formula: charges ~ GameDay + (1 | Venue)
##    Data: GD_v
## 
##      AIC      BIC   logLik deviance df.resid 
##     31.0     29.2    -12.5     25.0        1 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -0.8519 -0.2130  0.0000  0.2130  0.8519 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  Venue  (Intercept) 0        0       
## Number of obs: 4, groups:  Venue, 2
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   43.000      4.637   9.274  < 2e-16 ***
## GameDayYes    34.500      7.762   4.445  8.8e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## GameDayYes -0.597
## convergence code: 0
## boundary (singular) fit: see ?isSingular

Aside from the additional information about the mixed-effect model fitting and random effect stats, the Fixed effects coefficient table looks like our standard (g)lm summary. Note that \(\beta_0=43\), coincidentally the same count of non-game day arrests for both venues.

Can you determine how \(\beta_1\) was derived? Here’s a hint:

mean(c(70, 85)) - 43
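Not in the script, but lme4::fixef() extracts just the fixed-effect coefficients from a glmer object if you want to check your answer:

# Extract only the fixed-effect estimates: (Intercept) and GameDayYes
lme4::fixef(glmer_gd)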

Analysis of Deviance on glmer object

anova gave us trouble on glm objects, and gives us even more on glmer objects:

anova(glmer_gd, test = "Chisq")
## Analysis of Variance Table
##         Df Sum Sq Mean Sq F value
## GameDay  1 19.755  19.755  19.755

This is why I introduced you to car::Anova up above:

car::Anova(glmer_gd)
## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: charges
##          Chisq Df Pr(>Chisq)    
## GameDay 19.755  1  8.804e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova does work if one sets up the null (intercept-only) model for comparison against the model testing the alternative hypothesis:

glmer_0 <- lme4::glmer(charges ~ 1 + (1|Venue), 
                       GD_v, 
                       family=poisson(link = "identity"))
## boundary (singular) fit: see ?isSingular
anova(glmer_0, glmer_gd)
## Data: GD_v
## Models:
## glmer_0: charges ~ 1 + (1 | Venue)
## glmer_gd: charges ~ GameDay + (1 | Venue)
##          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## glmer_0   2 49.065 47.838 -22.533   45.065                             
## glmer_gd  3 31.031 29.190 -12.515   25.031 20.034      1  7.606e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is my preferred approach.


  1. Menaker, BE, DA McGranahan & RD Sheptak Jr. 2019. Game Day Alters Crime Pattern in the Vicinity of Sport Venues in Cleveland, OH. Journal of Sport Safety and Security 4(1):art.1

  2. We can, however, specify other tests via test.statistic=.

  3. Hurlbert, SH. 1984. Pseudoreplication and the design of ecological field experiments. Ecological Monographs 54(2):187-211.

  4. This is basically the model telling us that there wasn’t any variability among the venues; if we were using the mixed-effect model to explore differences in variability across venues, the warning indicates that we won’t find much. But that’s ok, and probably an artifact of comparing count data without replication within the venues.