Multi-Level Regression with Post-Stratification

This blogpost is a reproduction of this. Multi-level regression with post-stratification (MRP) is one of the more powerful predictive strategies for opinion polling and election forecasting. It combines the power of multi-level modeling which anticipate and predict group and population level effects which have good small N predictive properties with post-stratification which helps to temper the predictions to the demographics or other variables of interest. This method has been put out by Andrew Gelman et al and there is a good deal of literature on the subject.

This blog post then seeks to reproduce an example made by Tim Mastny using the brms package and tidyverse tooling.

Reproduction

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✓ ggplot2 3.3.5.9000     ✓ purrr   0.3.4.9000
✓ tibble  3.1.6.9001     ✓ dplyr   1.0.8.9000
✓ tidyr   1.1.4.9000     ✓ stringr 1.4.0.9000
✓ readr   2.1.2          ✓ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

library(lme4)

Loading required package: Matrix


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

library(brms)

Loading required package: Rcpp

Loading 'brms' package (version 2.16.1). Useful instructions
can be found by typing help('brms'). A more detailed introduction
to the package is available through vignette('brms_overview').


Attaching package: 'brms'

The following object is masked from 'package:lme4':

    ngrps

The following object is masked from 'package:stats':

    ar

library(rstan)

Loading required package: StanHeaders


rstan version 2.26.3 (Stan version 2.26.1)

For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
change `threads_per_chain` option:
rstan_options(threads_per_chain = 1)


Attaching package: 'rstan'

The following object is masked from 'package:tidyr':

    extract

devtools::install_github("hrbrmstr/albersusa")

Using github PAT from envvar GITHUB_PAT

Downloading GitHub repo hrbrmstr/albersusa@HEAD

Rcpp     (1.0.7      -> 1.0.8.3   ) [CRAN]
wk       (0.5.0      -> 0.6.0     ) [CRAN]
e1071    (1.7-7      -> 1.7-9     ) [CRAN]
sp       (1.4-5      -> 1.4-7     ) [CRAN]
units    (0.7-2      -> 0.8-0     ) [CRAN]
s2       (1.0.6.9000 -> 1.0.7.9000) [CRAN]
magrittr (2.0.2.9000 -> 2.0.3.9000) [CRAN]
DBI      (1.1.1      -> 1.1.2     ) [CRAN]
maptools (1.1-1      -> 1.1-4     ) [CRAN]
rgdal    (1.5-23     -> 1.5-32    ) [CRAN]
rgeos    (0.5-7      -> 0.5-9     ) [CRAN]
sf       (1.0-2      -> 1.0-8     ) [CRAN]

Installing 12 packages: Rcpp, wk, e1071, sp, units, s2, magrittr, DBI, maptools, rgdal, rgeos, sf

Installing packages into '/usr/local/lib/R/4.1/site-library'
(as 'lib' is unspecified)

Warning in i.p(...): installation of package 'Rcpp' had non-zero exit status

Warning in i.p(...): installation of package 'wk' had non-zero exit status

Warning in i.p(...): installation of package 'e1071' had non-zero exit status

Warning in i.p(...): installation of package 'sp' had non-zero exit status

Warning in i.p(...): installation of package 'magrittr' had non-zero exit status

Warning in i.p(...): installation of package 's2' had non-zero exit status

Warning in i.p(...): installation of package 'maptools' had non-zero exit status

Warning in i.p(...): installation of package 'rgdal' had non-zero exit status

Warning in i.p(...): installation of package 'rgeos' had non-zero exit status

Warning in i.p(...): installation of package 'sf' had non-zero exit status

* checking for file ‘/private/var/folders/0x/6bnjy4n15kz8nbfbbwbyk3pr0000gn/T/Rtmp6Nlkt5/remotes2f4974e2165b/hrbrmstr-albersusa-07aa87f/DESCRIPTION’ ... OK
* preparing ‘albersusa’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘albersusa_0.4.1.tar.gz’

Installing package into '/usr/local/lib/R/4.1/site-library'
(as 'lib' is unspecified)

library(albersusa)

Warning in fun(libname, pkgname): rgeos: versions of GEOS runtime 3.10.1-CAPI-1.16.0
and GEOS at installation 3.9.1-CAPI-1.14.2differ

library(cowplot)

rstan_options(auto_write=TRUE)
options(mc.cores=parallel::detectCores())

Data

All the data required to reproduce this example is located here

marriage.data <- foreign::read.dta('gay_marriage_megapoll.dta',
                                   convert.underscore=TRUE)
Statelevel <- foreign::read.dta("state_level_update.dta",
                                convert.underscore = TRUE)
Census <- foreign::read.dta("poststratification 2000.dta",
                            convert.underscore = TRUE)

Tidy and Munge

In this section the data are being cleaned

# add state level predictors to marriage.data
Statelevel <- Statelevel %>% rename(state = sstate)

marriage.data <- Statelevel %>%
  select(state, p.evang, p.mormon, kerry.04) %>%
  right_join(marriage.data)

Joining, by = "state"

# Combine demographic groups
marriage.data <- marriage.data %>%
  mutate(race.female = (female * 3) + race.wbh) %>%
  mutate(age.edu.cat = 4 * (age.cat - 1) + edu.cat) %>%
  mutate(p.relig = p.evang + p.mormon) %>%
  filter(state != "")

Buildign the Post-stratifications

In this step Tim is

# change column names for natural join with marriage.data
Census <- Census %>% 
  rename(state = cstate,
         age.cat = cage.cat,
         edu.cat = cedu.cat,
         region = cregion)

Census <- Statelevel %>%
  select(state, p.evang, p.mormon, kerry.04) %>%
  right_join(Census)

Joining, by = "state"

Census <- Census %>%
  mutate(race.female = (cfemale * 3 ) + crace.WBH) %>%
  mutate(age.edu.cat = 4 * (age.cat - 1) + edu.cat) %>%
  mutate(p.relig = p.evang + p.mormon)

Now the fully Bayesian model can be built with the associated group level effects specified.

bayes.mod <- brm(yes.of.all ~ (1 | race.female) + (1 | age.cat) + (1 | edu.cat)
  + (1 | age.edu.cat) + (1 | state) + (1 | region)
  + p.relig + kerry.04,
data = marriage.data, family = bernoulli(),
prior = c(
  set_prior("normal(0,0.2)", class = "b"),
  set_prior("normal(0,0.2)", class = "sd", group = "race.female"),
  set_prior("normal(0,0.2)", class = "sd", group = "age.cat"),
  set_prior("normal(0,0.2)", class = "sd", group = "edu.cat"),
  set_prior("normal(0,0.2)", class = "sd", group = "age.edu.cat"),
  set_prior("normal(0,0.2)", class = "sd", group = "state"),
  set_prior("normal(0,0.2)", class = "sd", group = "region")
), iter = 1000, chains = 2, cores = 2, silent = TRUE
)

Running /usr/local/Cellar/r/4.1.2/lib/R/bin/R CMD SHLIB foo.c
/usr/local/opt/llvm/bin/clang  -I"/usr/local/Cellar/r/4.1.2/lib/R/include" -DNDEBUG   -I"/usr/local/lib/R/4.1/site-library/Rcpp/include/"  -I"/usr/local/lib/R/4.1/site-library/RcppEigen/include/"  -I"/usr/local/lib/R/4.1/site-library/RcppEigen/include/unsupported"  -I"/usr/local/lib/R/4.1/site-library/BH/include" -I"/usr/local/lib/R/4.1/site-library/StanHeaders/include/src/"  -I"/usr/local/lib/R/4.1/site-library/StanHeaders/include/"  -I"/usr/local/lib/R/4.1/site-library/RcppParallel/include/"  -I"/usr/local/lib/R/4.1/site-library/rstan/include" -DEIGEN_NO_DEBUG  -DBOOST_DISABLE_ASSERTS  -DBOOST_PENDING_INTEGER_LOG2_HPP  -DSTAN_THREADS  -DUSE_STANC3 -DSTRICT_R_HEADERS  -DBOOST_PHOENIX_NO_VARIADIC_EXPRESSION  -DBOOST_NO_AUTO_PTR  -include '/usr/local/lib/R/4.1/site-library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp'  -D_REENTRANT -DRCPP_PARALLEL_USE_TBB=1   -I/usr/local/opt/gettext/include -I/usr/local/opt/readline/include -I/usr/local/opt/xz/include -I/usr/local/include   -fPIC  -g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe -c foo.c -o foo.o
In file included from <built-in>:1:
In file included from /usr/local/lib/R/4.1/site-library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
In file included from /usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/Dense:1:
In file included from /usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/Core:88:
/usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:1: error: unknown type name 'namespace'
namespace Eigen {
^
/usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/src/Core/util/Macros.h:628:16: error: expected ';' after top level declarator
namespace Eigen {
               ^
               ;
In file included from <built-in>:1:
In file included from /usr/local/lib/R/4.1/site-library/StanHeaders/include/stan/math/prim/fun/Eigen.hpp:22:
In file included from /usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/Dense:1:
/usr/local/lib/R/4.1/site-library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
#include <complex>
         ^~~~~~~~~
3 errors generated.
make: *** [foo.o] Error 1

Examine the model output. If this were for a paper or a real prediction it would be important to run the model through a few more iterations with further diagnostic plots to ensure convergence.

summary(bayes.mod)

Warning in seq.default(from = 1, len = along - 1): partial argument match of
'len' to 'length.out'

Warning in seq.default(to = N - 1, len = N - along): partial argument match of
'len' to 'length.out'

Warning in seq.default(len = N): partial argument match of 'len' to 'length.out'

Warning in seq.default(along = arg.names): partial argument match of 'along' to
'along.with'

Warning in seq.default(len = length(arg.list)): partial argument match of 'len'
to 'length.out'

Warning in seq.default(along = perm): partial argument match of 'along' to
'along.with'

Warning in seq.default(along = perm): partial argument match of 'along' to
'along.with'

Warning in seq.default(along = perm): partial argument match of 'along' to
'along.with'

Warning in seq.default(along = perm): partial argument match of 'along' to
'along.with'

Warning in seq.default(len = ncol(arg.dim)): partial argument match of 'len' to
'length.out'

Warning in seq.default(len = N): partial argument match of 'len' to 'length.out'

Warning in seq.default(along = arg.names): partial argument match of 'along' to
'along.with'

Warning in seq.default(along = arg.names): partial argument match of 'along' to
'along.with'

Warning in seq.default(len = length(arg.names)): partial argument match of 'len'
to 'length.out'

Warning in seq.default(along = perm): partial argument match of 'along' to
'along.with'

 Family: bernoulli 
  Links: mu = logit 
Formula: yes.of.all ~ (1 | race.female) + (1 | age.cat) + (1 | edu.cat) + (1 | age.edu.cat) + (1 | state) + (1 | region) + p.relig + kerry.04 
   Data: marriage.data (Number of observations: 6341) 
  Draws: 2 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 1000

Group-Level Effects: 
~age.cat (Number of levels: 4) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.41      0.10     0.26     0.63 1.00      843      569

~age.edu.cat (Number of levels: 16) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.10      0.06     0.01     0.24 1.00      486      517

~edu.cat (Number of levels: 4) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.32      0.09     0.17     0.54 1.00     1054      818

~race.female (Number of levels: 6) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.24      0.07     0.12     0.42 1.00      623      821

~region (Number of levels: 5) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.22      0.09     0.09     0.43 1.00      633      643

~state (Number of levels: 49) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.06      0.04     0.00     0.16 1.00      425      541

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -1.49      0.49    -2.48    -0.50 1.00      729      636
p.relig      -0.01      0.00    -0.02    -0.00 1.00     1377      920
kerry.04      0.02      0.01     0.01     0.03 1.01     1271      784

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Do some visualisations

library(tidybayes)


Attaching package: 'tidybayes'

The following objects are masked from 'package:brms':

    dstudent_t, pstudent_t, qstudent_t, rstudent_t

bayes.mod %>%
  gather_draws(`sd_.*`, regex=TRUE) %>%
  ungroup() %>%
  mutate(group = stringr::str_replace_all(.variable, c("sd_" = "","__Intercept"=""))) %>%
  ggplot(aes(y=group, x = .value)) + 
  ggridges::geom_density_ridges(aes(height=..density..),
                                rel_min_height = 0.01, stat = "density",
                                scale=1.5)

Warning: as.mcmc.brmsfit is deprecated and will eventually be removed.

Now Bring them Together

The next step then is to combine the predictions from the Bayesian HLM with the postratified values. This can then be used to estimate support.

ps.bayes.mod <- bayes.mod %>%
  add_predicted_draws(newdata=Census, allow_new_levels=TRUE) %>%
  rename(support = .prediction) %>%
  mean_qi() %>%
  mutate(support = support * cpercent.state) %>%
  group_by(state) %>%
  summarise(support = sum(support))

ps.bayes.mod %>% 
  knitr::kable()

state	support
AK	0.3449819
AL	0.1767903
AR	0.1941187
AZ	0.3834098
CA	0.4481412
CO	0.4021573
CT	0.4254337
DC	0.5047281
DE	0.3364499
FL	0.2889513
GA	0.2405605
HI	0.4258809
IA	0.3148850
ID	0.2647354
IL	0.3451166
IN	0.2643830
KS	0.2613921
KY	0.2058567
LA	0.2375723
MA	0.4656944
MD	0.3378291
ME	0.3969487
MI	0.3320497
MN	0.3347855
MO	0.2662837
MS	0.1924042
MT	0.3360362
NC	0.2436259
ND	0.2705786
NE	0.2504293
NH	0.4152502
NJ	0.4137025
NM	0.3944292
NV	0.3876809
NY	0.4362904
OH	0.3147848
OK	0.1767060
OR	0.4027852
PA	0.3680869
RI	0.4445157
SC	0.2167481
SD	0.2619026
TN	0.2099279
TX	0.2374630
UT	0.1921838
VA	0.2869182
VT	0.4445150
WA	0.4199311
WI	0.3183661
WV	0.2624303
WY	0.2859423

With uncertainity

Now we can add some additional quantification of uncertainity by adding multiple draws from the posterior distribution.

pred<-bayes.mod %>%
  add_predicted_draws(newdata=Census, allow_new_levels=TRUE, n = 10) %>%
  rename(support = .prediction) %>%
  mutate(support = support * cpercent.state)%>%
  group_by(state, add = TRUE) %>%
  mean_qi()

Warning: 
In add_predicted_draws(): The `n` argument is a deprecated alias for `ndraws`.
Use the `ndraws` argument instead.
See help("tidybayes-deprecated").

Warning: Argument 'nsamples' is deprecated. Please use argument 'ndraws'
instead.

Warning: The `add` argument of `group_by()` is deprecated as of dplyr 1.0.0.
Please use the `.add` argument instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Reuse

https://creativecommons.org/licenses/by/4.0/

Citation

BibTeX citation:

@online{dewitt2018,
  author = {Michael DeWitt},
  title = {MRP Using Brms},
  date = {2018-11-07},
  url = {https://michaeldewittjr.com/programming/2018-11-07-mrp-using-brms},
  langid = {en}
}

For attribution, please cite this work as:

Michael DeWitt. 2018. “MRP Using Brms.” November 7, 2018. https://michaeldewittjr.com/programming/2018-11-07-mrp-using-brms.