Hierarchical models

Theory: what is a hierarchical model?

In general: a model with hyperparameters, i.e. parameters that probabilistically control other parameters.

E.g.

\[\begin{align*} y_i &\sim N(\alpha_{group(i)} + \beta \cdot x_i, \sigma) \\ \alpha_{group(i)} &\sim N(\mu, \tau) \end{align*}\]

In this model \(\tau\) is a hyperparameter, since it probabilistically controls the \(\alpha\) parameters. Is \(\mu\) also a hyperparameter?

Hierarchical models are great for describing the situation where you know some measurements have something in common (e.g. they come from the same group), but you don’t know how much.
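To make the definition concrete, here is a minimal NumPy sketch that simulates data from the example model above (the group structure and parameter values are invented purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
mu, tau = 0.0, 0.5          # hyperparameters: centre and spread of the group intercepts
beta, sigma = 1.2, 0.3      # shared slope and observation noise
n_group, n_per_group = 4, 25

alpha = rng.normal(mu, tau, size=n_group)           # alpha_group ~ N(mu, tau)
group = np.repeat(np.arange(n_group), n_per_group)  # group(i) for each observation
x = rng.normal(size=group.size)
y = rng.normal(alpha[group] + beta * x, sigma)      # y_i ~ N(alpha_group(i) + beta * x_i, sigma)

The key feature is that the group intercepts alpha are themselves random draws governed by mu and tau, rather than fixed constants.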

Learn more!

Example: always be closing!

Plushycorp employs 10 salespeople who go door to door selling cute plushies. The number of plushies that each salesperson sold on every working day for four weeks was recorded. What can Plushycorp find out from this data?

To see what Plushycorp could learn in a best-case scenario, we can use a hierarchical model to run a “digital twin” of this experiment with known parameters and a known data generating process. Specifically, we can assume that the number \(y_{ij}\) of plushies that salesperson \(i\) sells on day \(j\) depends on a combination of factors:

  • The baseline amount \(\mu\) that a totally average salesperson would sell on a normal day
  • The salesperson’s ability \(ability_i\)
  • An effect \(day\ effect_j\) for the day of the week: people are thought to buy fewer and fewer plushies as the week drags on.
  • Some random variation

A good first step when modelling count data is the Poisson distribution, so let’s assume that the sales measurements follow this Poisson model:1

1 Note the use of the log link function.

\[\begin{align*} y_{ij} &\sim Poisson(\lambda_{ij}) \\ \ln\lambda_{ij} &= \ln\mu + ability_i + day\ effect_j \end{align*}\]

We know that the salespeople have different abilities, but just how different are they? Since this isn’t really clear to Plushycorp, it makes sense to introduce a parameter \(\tau_{ability}\) into the model:

\[\begin{equation*} ability_i \sim N(0, \tau_{ability}) \end{equation*}\]

Now we have a hierarchical model!

We can make a similar argument for the day of the week effects:2

2 Can you think of a better model for day effects, given the information above?

\[\begin{equation*} day\ effect_j \sim N(0, \tau_{day}) \end{equation*}\]

Finally we can complete our model by specifying prior distributions for the non-hierarchical parameters:3

3 \(HN\) here refers to the “half-normal” distribution, a decent default prior for hierarchical standard deviations

\[\begin{align*} \mu &\sim LN(0, 1) \\ \tau_{ability} &\sim HN(0, 1) \\ \tau_{day} &\sim HN(0, 1) \end{align*}\]
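Before simulating with fixed “true” values, it can be worth checking what these priors imply on the outcome scale. Here is a minimal prior predictive sketch in NumPy (not part of the original analysis; the number of draws is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

mu = rng.lognormal(mean=0, sigma=1, size=n_draws)  # mu ~ LN(0, 1)
tau_ability = np.abs(rng.normal(0, 1, n_draws))    # tau_ability ~ HN(0, 1)
tau_day = np.abs(rng.normal(0, 1, n_draws))        # tau_day ~ HN(0, 1)
ability = rng.normal(0, tau_ability)               # one salesperson per draw
day_effect = rng.normal(0, tau_day)                # one day per draw
sales_sim = rng.poisson(np.exp(np.log(mu) + ability + day_effect))
print(np.quantile(sales_sim, [0.05, 0.5, 0.95]))   # plausible daily sales?

If most of the simulated daily sales land in a plausible range for door-to-door plushie selling, the priors are probably reasonable.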

To test out our model, we can use Python to generate a fake set of salespeople and days, then simulate sales consistent with our model’s data generating process:

from pathlib import Path
import json
import numpy as np
import pandas as pd

N_SALESPERSON = 10
N_WEEK = 4
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri"]
BASELINE = 2  # 2 plushies in one day is fine
TAU_ABILITY = 0.35
TAU_DAY = 0.2

SEED = 12345
DATA_DIR = Path("../data")

rng = np.random.default_rng(seed=SEED)

with open(DATA_DIR / "names.json", "r") as f:
    name_directory = json.load(f)

names = [
    f"{first_name} {surname}"
    for first_name, surname in zip(
        *map(
            lambda l: rng.choice(l, size=N_SALESPERSON, replace=False),
            name_directory.values()
        )
    )
]

abilities = rng.normal(loc=0, scale=TAU_ABILITY, size=N_SALESPERSON)

salespeople = pd.DataFrame({"salesperson": names, "ability": abilities})

salespeople
salesperson ability
0 Morten Andersen 0.489643
1 Lene Poulsen 0.462804
2 Rasmus Jensen -0.104894
3 Hanne Madsen 0.316022
4 Mette Rasmussen -0.567554
5 Christian Christensen -0.055366
6 Helle Kristensen 0.157319
7 Charlotte Hansen -0.470260
8 Maria Petersen -0.028591
9 Jette Thomsen 0.603659
day_effects = sorted(
    rng.normal(loc=0, scale=TAU_DAY, size=len(DAY_NAMES)),
    reverse=True,  # descending: sales are thought to decline through the week
)
days = pd.DataFrame({"day": DAY_NAMES, "day_effect": day_effects})
days
day day_effect
0 Mon 0.523632
1 Tue 0.165727
2 Wed 0.155472
3 Thu -0.191798
4 Fri -0.241878
sales = (
    days
    .merge(salespeople, how="cross")
    .merge(pd.DataFrame({"week":[1, 2, 3, 4]}), how="cross")
    .assign(
        sales=lambda df: rng.poisson(
            np.exp(np.log(BASELINE) + df["ability"] + df["day_effect"])
        )
    )
    [["week", "day", "salesperson", "day_effect", "ability", "sales"]]
    .copy()
)
sales.head()
week day salesperson day_effect ability sales
0 1 Mon Morten Andersen 0.523632 0.489643 10
1 2 Mon Morten Andersen 0.523632 0.489643 3
2 3 Mon Morten Andersen 0.523632 0.489643 4
3 4 Mon Morten Andersen 0.523632 0.489643 4
4 1 Mon Lene Poulsen 0.523632 0.462804 4

Here is a chart of each salesperson’s total sales over the four weeks:

total_sales = (
    sales.groupby("salesperson")["sales"].sum().sort_values(ascending=False)
)

total_sales.plot(kind="bar", ylabel="Plushies sold", title="Four-week sales")

It’s pretty straightforward to represent hierarchical models in Stan; it’s almost as if Stan was designed for them!

from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="../src/stan/plushies.stan")
print(model.code())
data {
  int<lower=1> N;
  int<lower=1> N_salesperson;
  int<lower=1> N_day;
  array[N] int<lower=1,upper=N_salesperson> salesperson;
  array[N] int<lower=1,upper=N_day> day;
  array[N] int<lower=0> sales;
  int<lower=0,upper=1> likelihood;
}
parameters {
  real log_mu;
  vector[N_salesperson] ability;
  vector[N_day] day_effect;
  real<lower=0> tau_ability;
  real<lower=0> tau_day;
}
transformed parameters {
  vector[N] log_lambda = log_mu + ability[salesperson] + day_effect[day];
}
model {
  log_mu ~ normal(0, 1);
  ability ~ normal(0, tau_ability);
  day_effect ~ normal(0, tau_day);
  tau_ability ~ normal(0, 0.5);
  tau_day ~ normal(0, 0.5);
  if (likelihood){
    sales ~ poisson_log(log_lambda);
  }
}
generated quantities {
  real mu = exp(log_mu);
  vector[N] lambda = exp(log_lambda);
  array[N] int yrep = poisson_rng(lambda);
  vector[N] llik;
  for (n in 1:N){
    llik[n] = poisson_lpmf(sales[n] | lambda[n]);
  }
}

import arviz as az
from stanio.json import process_dictionary

def one_encode(l):
    """Map a 1d list-like to consecutive integer codes starting at 1."""
    return dict(zip(l, range(1, len(l) + 1)))
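# Example usage (hypothetical input):
# one_encode(["Mon", "Tue", "Wed"]) -> {"Mon": 1, "Tue": 2, "Wed": 3}
# The codes start at 1 to match Stan's 1-based indexing.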


salesperson_codes = one_encode(salespeople["salesperson"])
day_codes = one_encode(days["day"])
data_prior = process_dictionary({
        "N": len(sales),
        "N_salesperson": len(salespeople),
        "N_day": len(days),
        "salesperson": sales["salesperson"].map(salesperson_codes),
        "day": sales["day"].map(day_codes),
        "sales": sales["sales"],
        "likelihood": 0
    }
)
data_posterior = data_prior | {"likelihood": 1}
mcmc_prior = model.sample(data=data_prior)
mcmc_posterior = model.sample(data=data_posterior)
idata = az.from_cmdstanpy(
    posterior=mcmc_posterior,
    prior=mcmc_prior,
    log_likelihood="llik",
    posterior_predictive="yrep",
    observed_data=data_posterior,
    coords={
        "salesperson": salespeople["salesperson"],
        "day": days["day"],
        "observation": sales.index
    },
    dims={
        "lambda": ["observation"],
        "ability": ["salesperson"],
        "day_effect": ["day"],
        "llik": ["observation"],
        "yrep": ["observation"]
    }
)
idata
17:35:22 - cmdstanpy - INFO - CmdStan start processing
17:35:22 - cmdstanpy - INFO - CmdStan done processing.
17:35:22 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 1 had 5 divergent transitions (0.5%)
    Chain 2 had 6 divergent transitions (0.6%)
    Chain 3 had 5 divergent transitions (0.5%)
    Chain 4 had 1 divergent transitions (0.1%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.
17:35:22 - cmdstanpy - INFO - CmdStan start processing
17:35:23 - cmdstanpy - INFO - CmdStan done processing.
17:35:23 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'plushies.stan', line 23, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'plushies.stan', line 23, column 2 to column 34)
Consider re-running with show_console=True if the above output is unclear!
arviz.InferenceData
    • posterior: <xarray.Dataset> Size: 13MB
      Dimensions:           (chain: 4, draw: 1000, salesperson: 10, day: 5,
                             log_lambda_dim_0: 200, observation: 200)
      Coordinates:
        * chain             (chain) int64 32B 0 1 2 3
        * draw              (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * salesperson       (salesperson) object 80B 'Morten Andersen' ... 'Jette T...
        * day               (day) object 40B 'Mon' 'Tue' 'Wed' 'Thu' 'Fri'
        * log_lambda_dim_0  (log_lambda_dim_0) int64 2kB 0 1 2 3 4 ... 196 197 198 199
        * observation       (observation) int64 2kB 0 1 2 3 4 ... 195 196 197 198 199
      Data variables:
          log_mu            (chain, draw) float64 32kB 1.014 0.7154 ... 0.8383 0.9949
          ability           (chain, draw, salesperson) float64 320kB 0.1924 ... 0.2381
          day_effect        (chain, draw, day) float64 160kB 0.1681 0.142 ... -0.4484
          tau_ability       (chain, draw) float64 32kB 0.2873 0.4011 ... 0.4167 0.5569
          tau_day           (chain, draw) float64 32kB 0.2403 0.6087 ... 0.2917 0.2678
          log_lambda        (chain, draw, log_lambda_dim_0) float64 6MB 1.375 ... 0...
          mu                (chain, draw) float64 32kB 2.758 2.045 ... 2.312 2.704
          lambda            (chain, draw, observation) float64 6MB 3.954 ... 2.192
      Attributes:
          created_at:                 2024-05-03T15:35:23.853848
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • posterior_predictive: <xarray.Dataset> Size: 6MB
      Dimensions:      (chain: 4, draw: 1000, observation: 200)
      Coordinates:
        * chain        (chain) int64 32B 0 1 2 3
        * draw         (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
        * observation  (observation) int64 2kB 0 1 2 3 4 5 ... 194 195 196 197 198 199
      Data variables:
          yrep         (chain, draw, observation) float64 6MB 4.0 5.0 3.0 ... 2.0 5.0
      Attributes:
          created_at:                 2024-05-03T15:35:23.859837
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • log_likelihood: <xarray.Dataset> Size: 6MB
      Dimensions:      (chain: 4, draw: 1000, observation: 200)
      Coordinates:
        * chain        (chain) int64 32B 0 1 2 3
        * draw         (draw) int64 8kB 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
        * observation  (observation) int64 2kB 0 1 2 3 4 5 ... 194 195 196 197 198 199
      Data variables:
          llik         (chain, draw, observation) float64 6MB -5.311 -1.622 ... -1.315
      Attributes:
          created_at:                 2024-05-03T15:35:24.203304
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • sample_stats: <xarray.Dataset> Size: 204kB
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
      Data variables:
          lp               (chain, draw) float64 32kB 1.879 -0.4082 ... 4.237 -0.5634
          acceptance_rate  (chain, draw) float64 32kB 1.0 0.8856 ... 0.9984 0.7182
          step_size        (chain, draw) float64 32kB 0.2004 0.2004 ... 0.1985 0.1985
          tree_depth       (chain, draw) int64 32kB 4 4 4 4 4 4 4 4 ... 4 4 4 4 4 3 4
          n_steps          (chain, draw) int64 32kB 15 15 31 15 31 ... 31 15 15 15 15
          diverging        (chain, draw) bool 4kB False False False ... False False
          energy           (chain, draw) float64 32kB 7.271 8.385 10.27 ... 4.7 6.668
      Attributes:
          created_at:                 2024-05-03T15:35:23.857592
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • prior: <xarray.Dataset> Size: 26MB
      Dimensions:           (chain: 4, draw: 1000, salesperson: 10, day: 5,
                             log_lambda_dim_0: 200, observation: 200)
      Coordinates:
        * chain             (chain) int64 32B 0 1 2 3
        * draw              (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * salesperson       (salesperson) object 80B 'Morten Andersen' ... 'Jette T...
        * day               (day) object 40B 'Mon' 'Tue' 'Wed' 'Thu' 'Fri'
        * log_lambda_dim_0  (log_lambda_dim_0) int64 2kB 0 1 2 3 4 ... 196 197 198 199
        * observation       (observation) int64 2kB 0 1 2 3 4 ... 195 196 197 198 199
      Data variables:
          log_mu            (chain, draw) float64 32kB -1.048 -0.888 ... 0.1605 0.8035
          ability           (chain, draw, salesperson) float64 320kB -0.01752 ... -...
          day_effect        (chain, draw, day) float64 160kB 0.1871 0.02782 ... 0.1772
          tau_ability       (chain, draw) float64 32kB 0.0576 0.1239 ... 0.1557 0.1827
          tau_day           (chain, draw) float64 32kB 0.1031 0.1663 ... 0.1255 0.1566
          log_lambda        (chain, draw, log_lambda_dim_0) float64 6MB -0.8786 ......
          mu                (chain, draw) float64 32kB 0.3506 0.4115 ... 1.174 2.233
          lambda            (chain, draw, observation) float64 6MB 0.4154 ... 2.27
          yrep              (chain, draw, observation) float64 6MB 0.0 0.0 ... 4.0 2.0
          llik              (chain, draw, observation) float64 6MB -24.31 ... -1.324
      Attributes:
          created_at:                 2024-05-03T15:35:24.196478
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • sample_stats_prior: <xarray.Dataset> Size: 204kB
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
      Data variables:
          lp               (chain, draw) float64 32kB 28.8 22.64 22.61 ... 17.41 16.46
          acceptance_rate  (chain, draw) float64 32kB 0.4643 0.1705 ... 0.7686 0.8064
          step_size        (chain, draw) float64 32kB 0.1229 0.1229 ... 0.1467 0.1467
          tree_depth       (chain, draw) int64 32kB 4 4 3 4 5 5 4 4 ... 4 4 4 3 3 3 3
          n_steps          (chain, draw) int64 32kB 31 15 7 15 31 31 ... 15 15 7 7 7 7
          diverging        (chain, draw) bool 4kB False False False ... False False
          energy           (chain, draw) float64 32kB -19.31 -16.56 ... -4.051 -7.067
      Attributes:
          created_at:                 2024-05-03T15:35:24.199704
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • observed_data: <xarray.Dataset> Size: 10kB
      Dimensions:              (N_dim_0: 1, N_salesperson_dim_0: 1, N_day_dim_0: 1,
                                salesperson_dim_0: 200, day_dim_0: 200,
                                sales_dim_0: 200, likelihood_dim_0: 1)
      Coordinates:
        * N_dim_0              (N_dim_0) int64 8B 0
        * N_salesperson_dim_0  (N_salesperson_dim_0) int64 8B 0
        * N_day_dim_0          (N_day_dim_0) int64 8B 0
        * salesperson_dim_0    (salesperson_dim_0) int64 2kB 0 1 2 3 ... 197 198 199
        * day_dim_0            (day_dim_0) int64 2kB 0 1 2 3 4 ... 195 196 197 198 199
        * sales_dim_0          (sales_dim_0) int64 2kB 0 1 2 3 4 ... 196 197 198 199
        * likelihood_dim_0     (likelihood_dim_0) int64 8B 0
      Data variables:
          N                    (N_dim_0) int64 8B 200
          N_salesperson        (N_salesperson_dim_0) int64 8B 10
          N_day                (N_day_dim_0) int64 8B 5
          salesperson          (salesperson_dim_0) int64 2kB 1 1 1 1 2 ... 10 10 10 10
          day                  (day_dim_0) int64 2kB 1 1 1 1 1 1 1 1 ... 5 5 5 5 5 5 5
          sales                (sales_dim_0) int64 2kB 10 3 4 4 4 5 6 ... 4 1 1 3 3 2
          likelihood           (likelihood_dim_0) int64 8B 1
      Attributes:
          created_at:                 2024-05-03T15:35:24.201733
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

az.summary(idata, var_names="~lambda", filter_vars="regex")
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
log_mu 0.800 0.217 0.372 1.206 0.008 0.006 748.0 978.0 1.01
ability[Morten Andersen] 0.431 0.179 0.109 0.785 0.005 0.004 1114.0 1359.0 1.00
ability[Lene Poulsen] 0.370 0.177 0.040 0.702 0.006 0.004 1039.0 1275.0 1.00
ability[Rasmus Jensen] -0.175 0.191 -0.526 0.187 0.005 0.004 1260.0 1344.0 1.00
ability[Hanne Madsen] 0.094 0.183 -0.247 0.442 0.005 0.004 1216.0 1405.0 1.00
ability[Mette Rasmussen] -0.546 0.209 -0.945 -0.174 0.005 0.004 1722.0 1618.0 1.00
ability[Christian Christensen] -0.133 0.194 -0.503 0.215 0.005 0.004 1345.0 1521.0 1.00
ability[Helle Kristensen] 0.177 0.186 -0.139 0.564 0.005 0.004 1192.0 1511.0 1.00
ability[Charlotte Hansen] -0.284 0.201 -0.666 0.083 0.005 0.004 1460.0 1835.0 1.00
ability[Maria Petersen] -0.216 0.195 -0.581 0.148 0.005 0.004 1391.0 1644.0 1.00
ability[Jette Thomsen] 0.407 0.179 0.089 0.759 0.006 0.004 1062.0 1269.0 1.00
day_effect[Mon] 0.326 0.179 0.002 0.695 0.006 0.004 1054.0 1190.0 1.00
day_effect[Tue] 0.105 0.181 -0.221 0.465 0.006 0.004 1078.0 1279.0 1.00
day_effect[Wed] 0.113 0.180 -0.224 0.466 0.006 0.004 1089.0 1270.0 1.00
day_effect[Thu] -0.318 0.187 -0.700 0.000 0.005 0.004 1211.0 1137.0 1.00
day_effect[Fri] -0.191 0.182 -0.536 0.166 0.005 0.004 1132.0 1362.0 1.00
tau_ability 0.397 0.116 0.221 0.619 0.003 0.002 1816.0 1979.0 1.00
tau_day 0.345 0.152 0.123 0.625 0.003 0.002 2064.0 2605.0 1.00
mu 2.279 0.496 1.291 3.155 0.018 0.013 748.0 978.0 1.01

The problem with hierarchical models: funnels

Did you notice that cmdstanpy printed some divergent transition warnings above? These warnings illustrate a pervasive problem with hierarchical models: funnel-shaped marginal posterior distributions. The plot below shows the values of the parameter \(\tau_{day}\) and the corresponding day effect values for Monday in the prior samples:

az.plot_pair(
    idata.prior,
    var_names=["tau_day", "day_effect"],
    coords={"day": ["Mon"]},
);

As we discussed previously, funnels are hard to sample because of their inconsistent characteristic lengths. Unfortunately, they are often inevitable in hierarchical models. Can you see why from the graph? When \(\tau_{day}\) is small the day effects are forced close to zero, and when it is large they can spread out, so the joint distribution has a narrow neck and a wide bowl.

There are three main solutions to funnels: adding more information, tuning the HMC algorithm, or reparameterising the model.

Add more information

The posterior samples didn’t have any divergent transitions. This is probably because the extra information in the measurements made the geometry easier to sample. Comparing the prior and posterior marginal distributions illustrates how this can happen: note that the difference in scale between the neck and the bowl of the funnel is less extreme for the posterior samples.

from matplotlib import pyplot as plt
f, ax = plt.subplots()
az.plot_pair(
    idata.prior,
    var_names=["tau_day", "day_effect"],
    coords={"day": ["Mon"]},
    ax=ax,
    scatter_kwargs={"label": "prior"},
);
az.plot_pair(
    idata.posterior,
    var_names=["tau_day", "day_effect"],
    coords={"day": ["Mon"]},
    ax=ax,
    scatter_kwargs={"label": "posterior"},
);
ax.legend(frameon=False);

If better measurements aren’t available, divergences can often be avoided by searching for extra information that can justify narrower priors.

Tune the algorithm

Stan allows increasing the length of the warmup phase (iter_warmup, default 1000), bringing the target acceptance probability closer to 1 (adapt_delta, default 0.8) and increasing the leapfrog integrator’s maximum tree depth (max_treedepth, default 10). All of these changes trade speed for reliability.

mcmc_prior_2 = model.sample(
    data=data_prior,
    iter_warmup=3000,
    adapt_delta=0.99,
    max_treedepth=12
)
17:35:24 - cmdstanpy - INFO - CmdStan start processing
17:35:30 - cmdstanpy - INFO - CmdStan done processing.
17:35:30 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'plushies.stan', line 22, column 2 to column 35)
Consider re-running with show_console=True if the above output is unclear!
17:35:30 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 4 had 1 divergent transitions (0.1%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.

Unfortunately, even quite aggressive tuning doesn’t get rid of all the divergent transitions in this case.

Reparameterise

The idea behind reparameterisation is to define auxiliary parameters that don’t have problematic relationships, then recover the original parameters from them.

“Non-centred” parameterisations take a distribution of the form \(\alpha\sim D(\mu,\sigma)\) and express it as follows:

\[\begin{align*} u &\sim D(0, 1) \\ \alpha &= \mu + u \cdot \sigma \end{align*}\]
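As a quick sanity check that the two parameterisations describe the same distribution, here is a small NumPy sketch (the values of \(\mu\) and \(\sigma\) are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.3

centred = rng.normal(mu, sigma, size=100_000)  # alpha ~ N(mu, sigma)
u = rng.normal(0.0, 1.0, size=100_000)         # u ~ N(0, 1)
non_centred = mu + u * sigma                   # alpha = mu + u * sigma

print(centred.mean(), centred.std())           # approximately 1.5, 0.3
print(non_centred.mean(), non_centred.std())   # approximately 1.5, 0.3

The sampler only ever sees \(u\), whose prior scale does not depend on any other parameter, so the funnel-shaped relationship disappears; \(\alpha\) is recovered deterministically afterwards.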

model_nc = CmdStanModel(stan_file="../src/stan/plushies-nc.stan")
print(model_nc.code())
data {
  int<lower=1> N;
  int<lower=1> N_salesperson;
  int<lower=1> N_day;
  array[N] int<lower=1,upper=N_salesperson> salesperson;
  array[N] int<lower=1,upper=N_day> day;
  array[N] int<lower=0> sales;
  int<lower=0,upper=1> likelihood;
}
parameters {
  real log_mu;
  vector[N_salesperson] ability_z;
  vector[N_day] day_effect_z;
  real<lower=0> tau_ability;
  real<lower=0> tau_day;
}
transformed parameters {
  vector[N_salesperson] ability = ability_z * tau_ability;
  vector[N_day] day_effect = day_effect_z * tau_day;
  vector[N] log_lambda = log_mu + ability[salesperson] + day_effect[day];
}
model {
  log_mu ~ normal(0, 1);
  ability_z ~ normal(0, 1);
  day_effect_z ~ normal(0, 1);
  tau_ability ~ normal(0, 1);
  tau_day ~ normal(0, 1);
  if (likelihood){
    sales ~ poisson_log(log_lambda);
  }
}
generated quantities {
  real mu = exp(log_mu);
  vector[N] lambda = exp(log_lambda);
  array[N] int yrep = poisson_rng(lambda);
  vector[N] llik;
  for (n in 1:N){
    llik[n] = poisson_lpmf(sales[n] | lambda[n]);
  }
}
mcmc_prior_nc = model_nc.sample(
    data=data_prior,
    iter_warmup=3000,
    adapt_delta=0.999,
    max_treedepth=12
)
17:35:30 - cmdstanpy - INFO - CmdStan start processing
17:35:36 - cmdstanpy - INFO - CmdStan done processing.
17:35:36 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'plushies.stan', line 23, column 2 to column 34)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'plushies.stan', line 22, column 2 to column 35)
Consider re-running with show_console=True if the above output is unclear!
17:35:37 - cmdstanpy - WARNING - Some chains may have failed to converge.
    Chain 4 had 2 divergent transitions (0.2%)
    Use the "diagnose()" method on the CmdStanMCMC object to see further information.

Beware of using non-centred parameterisation as a default: it isn’t guaranteed to be better.

So how many plushies do I need to sell?

To put the estimates in context, we can compare each salesperson’s posterior expected daily sales with the true values and the observed daily averages:

f, ax = plt.subplots()
az.plot_forest(
    np.exp(idata.posterior["log_mu"] + idata.posterior["ability"]),
    kind="forestplot",
    combined=True,
    ax=ax,
    show=False,
);
ax.scatter(
    np.exp(np.log(BASELINE) + salespeople["ability"]), 
    ax.get_yticks()[::-1], 
    color="red", 
    label="True expected sales",
    zorder=2
)
ax.scatter(
    sales.groupby("salesperson")["sales"].mean().reindex(salespeople["salesperson"]), 
    ax.get_yticks()[::-1], 
    color="black", 
    label="Observed sales per day",
    zorder=3
)
ax.set(title="", xlabel="Number of plushies sold per day")
ax.axvline(BASELINE, linestyle="--", label="baseline", linewidth=0.8, color="black")
ax.legend(frameon=False);

Takeaways

  • Hierarchical models are a powerful way to capture structural information
  • You may run into problematic sampling, but you have options!
  • There is surprisingly little information in low-expected-value count data (see the note below).
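The last point can be made precise with a little Poisson arithmetic: the variance of a Poisson distribution equals its mean, so the relative noise in a single observation grows as the expected count shrinks.

\[\begin{equation*} CV = \frac{\sqrt{\lambda}}{\lambda} = \frac{1}{\sqrt{\lambda}}, \qquad \lambda = 2 \Rightarrow CV \approx 0.71 \end{equation*}\]

With a baseline of about two plushies per day, the standard deviation of a single day’s sales is roughly 70% of the mean, which is why so many observations are needed to pin down small effects like the day-of-week effects.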