OCTOBER 18, 2018: Algebra and the Missing Oxen - Greta Style

Author: Adam J. Fleischhacker

Work here is very much motivated by Richard McElreath’s blog entry: http://elevanth.org/blog/2018/01/29/algebra-and-missingness/

One of my overarching goals is to empower more people with the ability to use a Bayesian business analytics workflow. While the underlying computational engine for this workflow is frequently Stan, I have found greta to be simpler to use for teaching Bayesian business analytics; three reasons for this are:

  1. greta models are written in R; hence students do not have to learn another language,
  2. small models run faster in greta as there is no time required for compilation, and
  3. the syntax in greta is much easier to digest than the conceptually challenging program blocks found in Stan.

In this blog entry, I share my exploration of using greta when seeking to sample discrete parameters. The computational engines underlying Stan and greta are rooted in variants of the Hamiltonian Monte Carlo algorithm, which relies on gradients and is therefore not suited to sampling discrete parameters. Since neither engine can sample discrete parameters directly, I wanted to see how a known method of adapting Stan for discrete parameters would play out when using greta. Hence, Richard McElreath’s blog entry Algebra and the Missing Oxen served as a starting point for my journey.

0.5 Backstory from Richard McElreath’s Blog Entry

Figure 8: An ox in his enclosed stable.

First off, McElreath is quite funny, so I really enjoy all of his content. So instead of paraphrasing, here is the motivating story as he tells it:

In a particular village in central Asia, children are not allowed to drink tea in the evening, until the family ox is stabled. No one remembers why this is the rule, but it has always been so.

You are the village enforcer of oxen customs. No one remembers why, but it has always been so. Each evening, you must determine which children have properly stabled their oxen. For many houses, you can see whether or not the ox is stabled. But other houses have enclosed stables, so you cannot observe the ox without appearing to accuse that child of violating the rule. To do so and then discover a properly stabled ox would be embarrassing for everyone. No one remembers why, but it has always been so.

… you cannot see inside all the stables [i.e. they are enclosed], but you can see which children are drinking tea. Everyone likes tea, so most children who have stabled their oxen are drinking tea. These children have earned their tea. And since everyone likes tea so much, some children who have not yet stabled their oxen are also drinking tea. These children are cheaters.

Is it worth your time, and potential embarrassment, to check the covered [i.e. enclosed] stables of children who are drinking tea? What about the children who are not drinking tea? Should you investigate them as well? If you had to choose, which nets you more empty stables, drinkers or non-drinkers? To answer these questions, we need a model.

0.6 Setting Up The Statistical Model

On McElreath’s blog, the above example is used to motivate a discussion on how to incorporate discrete model parameters into a modern MCMC algorithm like that used in Stan. We will now similarly use the story, but to motivate its application in greta. It is worth noting that many models with discrete parameters can be made to work in either Stan or greta, but some mathematical manipulation has to be done first. We show this manipulation in a greta workflow.

I really like showing graphical models prior to the statistical formulation; they help my brain. So, a graphical model depicting the oxen story is shown below:

Figure 9: An oxen graphical model - lighter fill is used for latent/unobserved variables, darker fill for observed variables, and the mixed fill used for variables that are partially observed (e.g. whether a particular ox is stabled). The double oval indicates a relationship where the child is purely a deterministic function of its parents.

And now, the statistical model is easier to digest after being visually grounded by Figure 9:

\[ \begin{aligned} TEA_i &\sim \textrm{Bernoulli}(\pi_i) \\ \pi_i &= s_i \times p_s + (1 - s_i) \times p_u \\ s_i &\sim \textrm{Bernoulli}(\sigma) \\ p_s &\sim \textrm{Beta}(2,2) \\ p_u &\sim \textrm{Beta}(2,2) \\ \sigma &\sim \textrm{Beta}(2,2) \end{aligned} \]

where, \(i\) is the index of the observed ox-child pair, and

\[ \begin{aligned} \sigma &\equiv \textrm{Pr}(stabled) \\ p_s &\equiv \textrm{Pr}(drinking|stabled) \\ p_u &\equiv \textrm{Pr}(drinking|unstabled) \\ s_i &\equiv \textrm{Indicator where } s_i = 1 \textrm{ when ox } i \textrm{ is stabled, and 0 otherwise} \\ \pi_i &\equiv \textrm{Pr}(drinking) \textrm{: Probability child } i \textrm{ is drinking tea}. \\ TEA_i &\equiv \textrm{Indicator where } TEA_i = 1 \textrm{ when child } i \textrm{ is drinking, and 0 otherwise} \\ \end{aligned} \]

To finish the problem set-up, some simulated data is created and stored in a data frame (dataDF):

library(dplyr)
# Actual Values That Models Will Try To Recover
actual_p_u = 0.5; actual_p_s = 1; actual_sigma = 0.75
set.seed(1)
N_children <- 51   ## 51 children-ox pairs
s <- rbinom( N_children , size=1 , prob=actual_sigma )
s_obs <- s   # copy to add censoring
## make 21 censored obs (i.e. NA), leave 30 uncensored
s_obs[ sample( 1:N_children , size=21 ) ] <- NA  
tea <- rbinom( N_children , size=1 , prob= s*actual_p_s + (1-s)*actual_p_u )
## create data frame of observations
## NA is used to indicate an enclosed barn 
# (hence, data regarding ox is censored or unobserved)
dataDF = tibble(s_obs,tea)

where dataDF$s_obs == 0 when an ox is observed to be out of its stable, dataDF$s_obs == 1 when an ox is observed in its stable, and dataDF$s_obs is NA when the stable is enclosed and hence the data is considered censored/unobserved.
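
As a quick sanity check, tabulating the simulated data should show the 21 censored (i.e. NA) stables and 30 observed ones created above:

## count stable observations, listing censored (NA) values explicitly
table(dataDF$s_obs, useNA = "always")
## count the (always observed) tea-drinking indicator
table(dataDF$tea)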

0.7 A simpler model: no censoring

I always recommend iterating through potential models and increasing complexity with each iteration. So, let’s start with the simpler model of Figure 10, where we discard all censored data (i.e. keep only the observations where an ox can be seen in or out of its stable).

Figure 10: An uncensored oxen model.

The greta model for this case is shown here:

### this model looks at only the 30 uncensored s_i
library(dplyr)
library(greta)

nonCensoredDataDF = dataDF %>% filter(!is.na(s_obs))

s_i = as_data(nonCensoredDataDF$s_obs) #data
tea_i = as_data(nonCensoredDataDF$tea) #data

p_s = beta(2,2) #prior
p_u = beta(2,2) #prior
sigma = beta(2,2) #prior

pi_i = s_i * p_s + (1 - s_i) * p_u #operation

distribution(s_i) = bernoulli(sigma) #likelihood
distribution(tea_i) = bernoulli(pi_i) #likelihood

m = model(p_s,p_u,sigma) #model

# plot(m) # uncomment to visually verify model
draws = mcmc(m, warmup = 400) #sampling
bayesplot::mcmc_intervals(draws) #estimates

Considering the sample of 30 uncensored observations, we can note that \(p_s\), \(p_u\), and \(\sigma\) are all reasonably recovered; they have credibility allocated near their respective “true” values of 1, 0.5, and 0.75 (note that the \(\textrm{Beta}(2,2)\) prior on \(p_s\) prevents the posterior from including the true value in its credible interval).
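
For a quick numeric check (a sketch assuming the draws object from the block above is still in memory), the posterior means can be compared directly against the simulation’s true values:

## compare posterior means with actual_p_s = 1, actual_p_u = 0.5,
## and actual_sigma = 0.75 from the simulation code
summary(draws)$statistics[, "Mean"]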

0.8 A simpler model: all censoring

Just as we modeled the scenario with no censoring, we can also model the case with complete censoring. As shown in Figure 11, whether an ox is stabled or not is completely unobserved (i.e. the \(s_i\) node is filled with a light shade).

Figure 11: A completely censored oxen model.

While this model is poor because it has no way to discern the components of \(\pi_i\), we could still try to run it in greta. Here is the attempt:

### this model includes only censored data; it gives an error
library(greta)
censoredDataDF = dataDF %>% filter(is.na(s_obs))
# s_i = as_data(s)  # commented out: s_i is unobserved in this model
tea_i = as_data(censoredDataDF$tea) #data
p_s = beta(2,2) #prior
p_u = beta(2,2) #prior
sigma = beta(2,2) # hyperprior
s_i = bernoulli(sigma) # prior: s_i is now a discrete latent variable
pi_i = s_i * p_s + (1 - s_i) * p_u #operation
# distribution(s_i) = bernoulli(sigma)  # commented out: s_i is no longer data
distribution(tea_i) = bernoulli(pi_i) #likelihood
m = model(p_s,p_u,sigma) #model
## Error: model contains a discrete random variable that doesn't have a fixed value, so cannot be sampled from

Note the error: Error: model contains a discrete random variable that doesn't have a fixed value, so cannot be sampled from. That is because \(s_i\) is no longer data, but rather a discrete unobserved random variable. When it comes to unobserved random variables, only continuous ones are readily handled; observed discrete random variables and unobserved continuous random variables are both easily accommodated, but unobserved discrete random variables are not.

So to get around this, we need to eliminate this discrete latent variable, \(s_i\). Fortunately, we can simply use algebra to create a joint marginal distribution which excludes \(s_i\). The way to exclude it from the model is to replace the original formula:

\[ \pi_i = s_i \times p_s + (1 - s_i) \times p_u \]

that works only for uncensored observations with its marginalized counterpart that excludes \(s_i\):

\[ \begin{aligned} \pi_j &= P(stabled) \times P(drinking|stabled) + P(unstabled) \times P(drinking|unstabled) \\ &= \sigma p_s + (1-\sigma) p_u \end{aligned} \]

where we use the index \(j\) instead of \(i\) to refer to the censored observations.
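
To see that this is just the law of total probability, we can write the marginal explicitly as a sum over both values of the discrete variable:

\[ P(TEA_j = 1) = \sum_{s \in \{0,1\}} P(s_j = s) \times P(TEA_j = 1 | s_j = s) = \sigma p_s + (1 - \sigma) p_u \]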

To eliminate the discrete variable, examine the following alternate representation of Figure 11 where our goal is to create a graphical representation of a joint marginal distribution which excludes \(s_i\). Pictorially, we get:

Figure 12: Oxen graphical model assuming completely censored observations (i.e. the oxen are hidden in enclosed stables), with \(s_i\) marginalized out of the model.

And to check our mathematical trick, let’s run the model in Figure 12 using only the censored observations. Subsequently, we will figure out how to combine the censored and uncensored models into one. Here is the working model for censored observations that uses the mathematical trick to get rid of \(s_i\) as a latent random variable:

library(greta)

rm(s_i) # remove reference to s_i so this 
        # model does not include it

tea_j = as_data(censoredDataDF$tea) #data (index j = censored observations)

p_s = beta(2,2) #prior
p_u = beta(2,2) #prior
sigma = beta(2,2) #prior

pi_j = sigma * p_s + (1 - sigma) * p_u #operation

distribution(tea_j) = bernoulli(pi_j) #likelihood

m = model(p_s,p_u,sigma) #model

draws = mcmc(m, warmup = 400)
bayesplot::mcmc_intervals(draws)

Cool, it ran. Uncool, the estimates stink - they have very broad ranges of uncertainty. And that is to be expected: if we do not have any data regarding stabled or unstabled oxen, then we are stuck in a very uncertain world. The key is to intelligently combine the censored data with the uncensored data in one model.
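
To see just how broad, here is a quick check (a sketch assuming the censored-only draws object from above) of the posterior quantiles:

## wide credible intervals confirm how little the censored-only model
## learns about p_s, p_u, and sigma individually
summary(draws)$quantiles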

0.9 Combining the censored and uncensored models

Figure 13 combines the two working models. This method allows for optimal use of the information that is observed - although in this example the extra information does not noticeably improve estimates. In theory, the combined model incorporates both the information regarding \(\sigma\) (i.e. \(\textrm{P}(stabled)\)) gleaned from the observations of stabled/unstabled oxen, as well as the information regarding \(\pi_j\), which creates a better estimate of \(\textrm{P}(drinking)\) by incorporating the entire sample of children’s drinking observations - not just the uncensored subset.

Figure 13: Expanded oxen graphical model to show that some data results from censored observations (i.e. the oxen hidden in enclosed stables) and the rest of the data is non-censored.

Typically, we see all of the data following the same statistical model, but here the model changes depending on whether an observation is censored or uncensored. To put this in greta, we can still follow the standard recipe (after eliminating all partially observed nodes in the graphical model and specifying the appropriate statistical model):

  1. Data: Create data greta arrays for all observed data nodes (dark ovals).
  2. Priors: Create variable greta arrays for all unobserved latent random variables (light ovals).
  3. Operations: Create operation greta arrays for all deterministic nodes (double ovals).
  4. Likelihood: Assign distributions for all data nodes (dark ovals).
  5. Modelling: Create a greta model object representing the statistical model (all ovals).
  6. Inference: Sample from the joint posterior distribution.
library(greta)
s_i = as_data(nonCensoredDataDF$s_obs) #data
tea_i = as_data(nonCensoredDataDF$tea) #data
tea_j = as_data(censoredDataDF$tea) #data

p_s = beta(2,2) #prior
p_u = beta(2,2) #prior
sigma = beta(2,2) #prior (s_i is data again in this model)

pi_i = s_i * p_s + (1 - s_i) * p_u #operation for i
pi_j = sigma * p_s + (1 - sigma) * p_u # operation for j

distribution(s_i) = bernoulli(sigma) #likelihood i
distribution(tea_i) = bernoulli(pi_i) #likelihood i
distribution(tea_j) = bernoulli(pi_j) #likelihood j

m = model(p_s,p_u,sigma) #model
# plot(m)
draws = mcmc(m)
library(bayesplot)
mcmc_trace(draws)

mcmc_intervals(draws)

summary(draws, quantiles = c(0.055,0.945))
## 
## Iterations = 1:1000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##         Mean      SD  Naive SE Time-series SE
## p_s   0.9252 0.04392 0.0006945       0.001082
## p_u   0.4846 0.13815 0.0021844       0.003494
## sigma 0.7591 0.06926 0.0010951       0.002014
## 
## 2. Quantiles for each variable:
## 
##         5.5%  94.5%
## p_s   0.8446 0.9825
## p_u   0.2644 0.7109
## sigma 0.6408 0.8623

Yay, it seemed to work! Not only did it run, but the numbers are very close to those in the original McElreath blog entry.

0.10 Final Predictions

The key quantities we are interested in are \(P(unstabled|drinking)\) and \(P(unstabled|notDrinking)\). These are derived via Bayes rule:

\[ \begin{aligned} P(unstabled|drinking) &= \frac{P(unstabled)P(drinking|unstabled)}{P(drinking)} \\ &= \frac{P(unstabled)P(drinking|unstabled)}{P(unstabled)P(drinking|unstabled) + P(stabled)P(drinking|stabled)} \\ &= \frac{(1 - \sigma) \times p_u}{(1 - \sigma) \times p_u + \sigma \times p_s} \end{aligned} \]

\[ \begin{aligned} P(unstabled|notDrinking) &= \frac{P(unstabled)P(notDrinking|unstabled)}{P(notDrinking)} \\ &= \frac{P(unstabled)P(notDrinking|unstabled)}{P(unstabled)P(notDrinking|unstabled) + P(stabled)P(notDrinking|stabled)} \\ &= \frac{(1 - \sigma) \times (1-p_u)}{(1 - \sigma) \times (1-p_u) + \sigma \times (1 - p_s)} \end{aligned} \]
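
Before turning to the posterior draws, a quick plug-in check using the simulation’s true values from earlier (actual_sigma = 0.75, actual_p_s = 1, actual_p_u = 0.5) shows roughly what to expect:

## P(unstabled|drinking) at the true simulation values: approx 0.143
((1 - actual_sigma) * actual_p_u) /
  ((1 - actual_sigma) * actual_p_u + actual_sigma * actual_p_s)
## P(unstabled|notDrinking): exactly 1, since actual_p_s = 1 means every
## stabled child drinks, so a non-drinker must have an empty stable
((1 - actual_sigma) * (1 - actual_p_u)) /
  ((1 - actual_sigma) * (1 - actual_p_u) + actual_sigma * (1 - actual_p_s))

So, at least under these true values, investigating non-drinkers is the sure bet for netting empty stables.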

And modifying our representative sample of the posterior distribution to include these quantities allows us to then plot the distribution for these derived quantities:

drawsDF = tidybayes::tidy_draws(draws) %>%
  mutate(probUnstabledGivenDrinking = ((1-sigma)*p_u) / ((1-sigma)*p_u+sigma*p_s),
         probUnstabledGivenNotDrinking = ((1-sigma)*(1-p_u)) / ((1-sigma)*(1-p_u)+sigma*(1-p_s)))

### make tidy data frame of just the rv's of interest
library(tidyr)
plotDF = drawsDF %>% 
  select(.draw,probUnstabledGivenDrinking,probUnstabledGivenNotDrinking) %>%
  gather(key = "variable", value = "probability", -.draw)

library(ggplot2)
ggplot(plotDF, aes(x = probability)) +
  geom_density(aes(fill = variable), alpha = 0.5) +
  theme(legend.position = c(0.75,0.75),
        legend.title = element_blank())
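
A compact numeric companion to the density plot (a sketch using the plotDF from above) is to summarize each derived quantity with its posterior mean and an 89% credible interval, matching the quantiles reported earlier:

## posterior mean and 89% interval for each derived probability
plotDF %>%
  group_by(variable) %>%
  summarize(mean = mean(probability),
            q5.5 = quantile(probability, 0.055),
            q94.5 = quantile(probability, 0.945))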

As in McElreath’s blog, we leave the statistical decision theory portion for another day.

0.11 Summary

Using the oxen story that motivated McElreath’s lesson on marginalizing out discrete parameters in Stan, we pursued the analogous greta workflow. Upon comparing the two, the simplicity of greta is much appreciated. There are no for loops and no variable declarations - it just works. I hope you find the above useful.

0.12 Appendix: Alternate Greta Model

We could have also eliminated \(\pi_j\) and instead assigned \(tea_j\) a mixture distribution. An unrun snippet of that model is shown below:

pi_i = s_i * p_s + (1 - s_i) * p_u

distribution(s_i) = bernoulli(sigma)
distribution(tea_i) = bernoulli(pi_i)
distribution(tea_j) = mixture(bernoulli(p_s),
                              bernoulli(p_u),
                              weights = c(sigma, 1 - sigma))
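
This works because the mixture is exactly the marginal distribution we derived earlier; for each censored child,

\[ P(TEA_j = t) = \sigma \times \textrm{Bernoulli}(t | p_s) + (1 - \sigma) \times \textrm{Bernoulli}(t | p_u), \]

which for \(t = 1\) reduces to \(\sigma p_s + (1 - \sigma) p_u = \pi_j\).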