
In this short vignette, I show how I might use the holland2015 data set to illustrate a posterior predictive distribution. I like to…

  1. Fit the model.
  2. Simulate fake data from the model.
  3. Plot the fake data alongside the real data to understand ways in which the model might not describe the observed data set well.

It’s important to note that deviations between the fake and real data may or may not matter for the conclusions, depending on the quantity of interest. However, I’d suggest it’s usually “good to know” how the model deviates from the data, regardless of whether that affects the conclusions the author chooses to focus on.

In the example below, I focus on the Santiago cases from Holland’s data, which are overdispersed relative to her model. Econometricians have argued that this overdispersion is not a problem, so long as the author uses robust standard errors (which Holland does).

Let me quickly show how I might generate these fake data using {rstanarm} and {tidybayes}, with a little data wrangling mixed in.

First, load the holland2015 data set from my personal {crdata} package on GitHub. Then subset to the Santiago cases. These cases show a bit of overdispersion, so they make a good example.


# load packages
library(tidyverse)

# install (or update if needed) crdata package
devtools::install_github("carlislerainey/crdata")
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> * checking for file ‘/private/var/folders/h9/y3l6d7hs1m5fsvhdzj_jkdfr0000gp/T/Rtmp0cq9P7/remotes2fcf64d17b39/carlislerainey-crdata-911b5e0/DESCRIPTION’ ... OK
#> * preparing ‘crdata’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> Removed empty directory ‘crdata/vignettes’
#> * building ‘crdata_0.0.0.9000.tar.gz’

# load holland's data set and subset to santiago only
santiago <- crdata::holland2015 %>%
  filter(city == "santiago")

# show the santiago observations
glimpse(santiago)
#> Rows: 34
#> Columns: 7
#> $ city       <chr> "santiago", "santiago", "santiago", "santiago", "santiago",…
#> $ district   <chr> "Cerrillos", "Cerro Navia", "Conchali", "El Bosque", "Estac…
#> $ operations <dbl> 0, 0, 0, 0, 12, 0, 0, 0, 1, 1, 0, 10, 1, 5, 0, 0, 0, 4, 4, …
#> $ lower      <dbl> 52.2, 69.8, 54.8, 58.4, 43.6, 58.3, 41.0, 38.3, 36.7, 60.1,…
#> $ vendors    <dbl> 0.50, 0.60, 5.00, 1.20, 1.00, 0.30, 0.05, 1.25, 2.21, 0.70,…
#> $ budget     <dbl> 337.24, 188.87, 210.71, 153.76, 264.43, 430.42, 312.75, 255…
#> $ population <dbl> 6.6160, 13.3943, 10.7246, 16.8302, 11.1702, 8.5761, 5.1277,…
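
As a rough numerical check of the overdispersion mentioned above, we can compare the mean and variance of the counts; a Poisson model implies they are approximately equal, so a variance much larger than the mean hints at overdispersion. (This is just a quick heuristic I'm adding here, not part of Holland's analysis.)

# under a poisson model, the variance of the counts roughly equals the mean;
# a much larger variance suggests overdispersion
mean(santiago$operations)
var(santiago$operations)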

Next, fit the model with {rstanarm}. This model reproduces the results for Model 1 in her Table 2. Note, though, that Holland uses rescaled and exponentiated coefficients, so the coefficients won’t match exactly.

# load packages
library(rstanarm)

# formula corresponds to model 1 for each city in holland (2015) table 2
f <- operations ~ lower + vendors + budget + population

# fit poisson regression model using Stan
fit <- stan_glm(f, family = poisson, data = santiago, 
                cores = 4, chains = 4)
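
As an aside, {rstanarm} also ships with pp_check(), which produces a quick, automatic posterior predictive check; the manual approach below just gives us more control over exactly what we compare.

# quick built-in posterior predictive check (density overlay by default)
pp_check(fit)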

Next, use add_predicted_draws() from the {tidybayes} package to add fake observations to the original data set. Then we do a bit of wrangling to get the fake and observed data ready for plotting.

# load packages
library(tidybayes)

# simulate 11 sets of fake observations (one full data set per posterior draw)
fake <- santiago %>%
  add_predicted_draws(fit, ndraws = 11) %>%
  glimpse()
#> Rows: 374
#> Columns: 12
#> Groups: city, district, operations, lower, vendors, budget, population, .row [34]
#> $ city        <chr> "santiago", "santiago", "santiago", "santiago", "santiago"…
#> $ district    <chr> "Cerrillos", "Cerrillos", "Cerrillos", "Cerrillos", "Cerri…
#> $ operations  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ lower       <dbl> 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2…
#> $ vendors     <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6…
#> $ budget      <dbl> 337.24, 337.24, 337.24, 337.24, 337.24, 337.24, 337.24, 33…
#> $ population  <dbl> 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.…
#> $ .row        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ .chain      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ .iteration  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ .draw       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8,…
#> $ .prediction <int> 1, 2, 4, 0, 1, 2, 0, 1, 1, 1, 1, 3, 0, 0, 2, 1, 1, 0, 0, 0…

# wrangle fake data for plotting
gg_fake <- fake %>%
  ungroup() %>%
  # pivot real and fake data into the same column
  pivot_longer(cols = c(operations, .prediction), 
               names_to = "type", 
               values_to = "operations") %>%
  # label each observation type; real or fake + simulation number
  mutate(obs_type = case_when(type == "operations" ~ "Real Data", 
                              type == ".prediction" ~ paste0("Fake Data #", .draw))) %>%
  # remove duplicated real data observations (above produces one per draw)
  filter(!(obs_type == "Real Data" & .draw > 1)) %>%
  # make .draw = 0 for real data
  mutate(.draw = ifelse(obs_type == "Real Data", 0, .draw)) %>%
  # order obs_type by draw number
  mutate(obs_type = reorder(obs_type, .draw)) %>%
  glimpse()
#> Rows: 408
#> Columns: 13
#> $ city       <chr> "santiago", "santiago", "santiago", "santiago", "santiago",…
#> $ district   <chr> "Cerrillos", "Cerrillos", "Cerrillos", "Cerrillos", "Cerril…
#> $ lower      <dbl> 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2, 52.2,…
#> $ vendors    <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,…
#> $ budget     <dbl> 337.24, 337.24, 337.24, 337.24, 337.24, 337.24, 337.24, 337…
#> $ population <dbl> 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.6160, 6.6…
#> $ .row       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…
#> $ .chain     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ .iteration <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ .draw      <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, …
#> $ type       <chr> "operations", ".prediction", ".prediction", ".prediction", …
#> $ operations <dbl> 0, 1, 2, 4, 0, 1, 2, 0, 1, 1, 1, 1, 0, 3, 0, 0, 2, 1, 1, 0,…
#> $ obs_type   <fct> Real Data, Fake Data #1, Fake Data #2, Fake Data #3, Fake D…

# plot fake data
ggplot(gg_fake, aes(x = lower, y = operations, color = type)) + 
  facet_wrap(vars(obs_type)) + 
  geom_point(shape = 21) + 
  theme_bw() + 
  theme(legend.position = "none") + 
  labs(title = "Illustrating Posterior Predictive Distribution", 
       x = "Share of Lower-Class Residents in a District",
       y = "Number of Enforcement Operations")

This figure shows that the real data noticeably differ from the Poisson model. For districts with few lower-class residents, the real data have a much larger variance than the model suggests. If you repeat this exercise for Lima or Bogotá, you will find much better correspondence between the model and the data.
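
For example, repeating the fit for Lima might look like the sketch below (I'm assuming the city is coded as "lima" in holland2015, by analogy with "santiago" above); the simulation and plotting steps carry over unchanged.

# subset to lima (assumed coding) and refit the same model
lima <- crdata::holland2015 %>%
  filter(city == "lima")
fit_lima <- stan_glm(f, family = poisson, data = lima,
                     cores = 4, chains = 4)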