Regression models for describing measurements

What is regression?

Regression is a flexible and convenient way of modelling how measurements depend on known features.

Maybe you have seen this regression model before:

\[ y = \alpha + X\cdot\beta + \epsilon \]

Components:

  • \(y\): a measurement
  • \(\alpha\) and \(\beta\): some unknown parameters, the same for every measurement
  • \(X\): some real-valued features
  • \(\epsilon\): an unknown number quantifying the difference between the prediction and the observation, different for each measurement

Often \(\epsilon\) is assumed to have a normal distribution with location zero and scale parameter \(\sigma\).

Another way of saying the same thing:

\[ y \sim N(\alpha + X\cdot\beta, \sigma) \]

I prefer this notation because it nicely separates the different components, and doesn’t hide the error parameter \(\sigma\). It also makes it clear what you might change. For example \(\alpha + X\cdot\beta\) is just one option: there are many ways in which measurements and predictors can be related. The normal distribution is also not required, in fact it is often inappropriate.
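To make this concrete, here is a minimal sketch of simulating measurements from such a model in Python, assuming a single feature and made-up values for \(\alpha\), \(\beta\) and \(\sigma\):

import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 1.0, 2.5, 0.3                 # made-up parameter values
X = rng.uniform(0, 1, size=10)                     # one real-valued feature per measurement
y = rng.normal(loc=alpha + X * beta, scale=sigma)  # y ~ N(alpha + X*beta, sigma)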

The key features of a regression model:

  • a predictor that is some function of the parameters and features
  • a probabilistic relationship between the predictor and the measurement
Important

In the context of regression modelling, Bayesian inference lets you give free rein to your creativity when designing regressions, so you can make models that represent what you know about the measurement process.

Generalised linear models

This is a popular and powerful class of regression model with the following distinguishing characteristics:

  • The predictor is a linear function of the parameters and features, e.g.  \(\alpha + X\cdot\beta\)

  • The probability distribution describing measurement errors is not necessarily the normal distribution, e.g. log-normal, Poisson, Dirichlet, Gamma, …

  • There is a link function that connects the linear predictor with the probability distribution, e.g. \(\ln(\alpha + X\cdot\beta)\).

An example GLM for measurements that are constrained to be positive:

\[ y \sim LN(\ln(\alpha+X\cdot\beta), \sigma) \]
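As a minimal sketch (again with made-up numbers), simulating from this GLM looks almost the same as the normal linear model above, just with a log link and a log-normal distribution:

import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 1.0, 2.5, 0.3
X = rng.uniform(0, 1, size=10)
y = rng.lognormal(mean=np.log(alpha + X * beta), sigma=sigma)  # strictly positive measurements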

Tips for designing regression models

  • Start with the simplest GLM that obeys all known data constraints.
  • Try log-transforming things.
  • Aim to explicitly represent how you think the measurement apparatus works.
  • Heavy-tailed distributions are often better than the normal distribution.
  • Remember that varying measurement error is an option: the sketch below illustrates both of these ideas.
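Here is a sketch of what the last two tips could look like, with hypothetical numbers: a Student-t error distribution stands in for "heavy tailed", and each measurement gets its own error scale.

import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 1.0, 2.5
X = rng.uniform(0, 1, size=10)
sigma = rng.uniform(0.1, 0.5, size=10)  # a different (known or modelled) error scale per measurement
nu = 4                                  # few degrees of freedom -> heavier tails than the normal
y = alpha + X * beta + sigma * rng.standard_t(nu, size=10)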

Reading

Example

Teddy has never been in the lab and is bad at pipetting. Unfortunately, some label needed to be put into some Eppendorf tubes, and no one else was available to do it or even to supervise.

Each tube had a required volume of label, which Teddy tried to hit, but there was probably some inaccuracy due to bad eyesight, poor hand control or something.

In addition, there was probably some consistent bias: the pipette seemed pretty old, and was likely adding either a little too much or a little too little liquid every time.

Luckily, someone was able to measure how much label ended up in 8 out of the 24 tubes with pretty good accuracy. Now we want to estimate how much label there is in the other 16 tubes, taking into account these measurements as well as what we know about the likely biased pipette and inconsistent pipetter.

To describe this situation we’ll first think of a regression model that describes the measurement setup, then use Python to simulate data from the model given some plausible parameter values. Next we’ll implement the model in Stan, fit the simulated data using MCMC, and finally analyse the results.

Regression model

To model the noise that Teddy introduced by being bad at pipetting and the bias introduced by the bad pipette, we need some parameters that connect the known target volumes with the unknown true volumes. Let’s call them \(noise\) and \(bias\). Since the volumes are constrained positive, a distribution that automatically excludes negative numbers is probably a good idea: the log-normal distribution is usually a good starting point. This equation describes a plausible relationship:

\[ volume \sim LN(\ln{(target\cdot bias)}, noise) \]

To model the helpful measurements, we use another log-normal distribution and assume the measuring device is unbiased and has known log-scale standard error \(cal\ error\):1

1 NB the scale parameter of a lognormal distribution represents multiplicative error

\[ measurements \sim LN(\ln{volume}, cal\ error) \]
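To see what the footnote means in practice, here is a quick, purely illustrative numerical check: a log-normal with scale parameter 0.1 has roughly 10% relative spread, regardless of whether it is centred at 200 or 800 \(\mu\)l.

import numpy as np

rng = np.random.default_rng(1)
for centre in (200.0, 800.0):
    draws = rng.lognormal(mean=np.log(centre), sigma=0.1, size=100_000)
    print(centre, round(draws.std() / draws.mean(), 3))  # ~0.1 for both centres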

To round off the model we can think about the likely values of the unknown parameters \(bias\) and \(noise\). \(bias\) is likely not to be too far away from 1, otherwise someone would have probably thrown the pipette away already. A prior distribution that puts most of its mass between 0.75 and 1.25 therefore seems reasonable. Similarly, a prior for \(noise\) should probably not imply that Teddy’s pipetting errors regularly exceeded 30%. This consideration motivates a prior for \(noise\) that puts most of its mass below 0.15.
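One way to turn these considerations into concrete priors is to pick candidate distributions and check how much mass they put in the plausible region. Here is a rough sketch using numpy draws; the two candidate forms match the priors that appear in the Stan model below (though there the noise prior is set on a standardised scale):

import numpy as np

rng = np.random.default_rng(0)

# Candidate prior for the bias factor: exp(Normal(0, 0.15)).
bias_draws = np.exp(rng.normal(0.0, 0.15, size=100_000))
print(np.mean((bias_draws > 0.75) & (bias_draws < 1.25)))  # ~0.9

# Candidate prior for the noise parameter: LogNormal(log(0.1), 0.5).
noise_draws = rng.lognormal(np.log(0.1), 0.5, size=100_000)
print(np.mean(noise_draws < 0.15))  # ~0.8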

Simulating fake data

First some imports: as usual for this course we’ll be using arviz, matplotlib, cmdstanpy, pandas and numpy. stanio is a handy utility library for Stan: it is a dependency of cmdstanpy so you shouldn’t need to install it explicitly.

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import stanio

from cmdstanpy import CmdStanModel

Now some hardcoded numbers, including true values for the parameters: \(bias\) is 0.88 and \(noise\) is 0.1. Note that \(cal\ error\) is much smaller than \(noise\): the measuring device is much more precise than Teddy’s pipetting.

N = 24
N_CAL = 8
TARGET_VOLUMES = np.array(
    [
      *([200] * 8),
      *([400] * 8),
      *([800] * 8),
    ]
)
MEASUREMENT_IX = np.array([1, 4, 9, 11, 15, 19, 21, 22])
CAL_ERROR = 0.02
BIAS_FACTOR = 0.88
NOISE = 0.1
RNG_SEED = 12345

Simulate the true volumes

rng = np.random.default_rng(seed=RNG_SEED)
ln_mean = [
  np.log(target * BIAS_FACTOR)
  for target in TARGET_VOLUMES
]
volumes = rng.lognormal(mean=ln_mean, sigma=NOISE)
volumes
array([152.64294349, 199.70810807, 161.32449302, 171.49715397,
       174.67894069, 163.43175945, 153.50063823, 187.79919405,
       364.94147086, 289.55488638, 445.13256694, 387.79655766,
       326.2592979 , 385.23402356, 335.94110379, 349.87019831,
       761.78380169, 620.86368119, 745.73037525, 809.71007835,
       803.52489045, 683.21425317, 770.52360505, 598.61585552])

Plot the volumes and the targets.

f, ax = plt.subplots()
bins = np.logspace(np.log10(100), np.log10(1000), 30)
ax.hist(volumes, bins=bins)
for t in (200, 400, 800):
    ax.axvline(t, color="red")
ax.semilogx()
ax.set_xticks([200, 400, 800], [200, 400, 800]);
ax.set(
    xlabel="volume ($\\mu$l)",
    ylabel="Frequency",
    title="How much label ended up in the tubes"
);

Simulate measurements for the tubes in MEASUREMENT_IX.

measurements = [
  rng.lognormal(np.log(vol), CAL_ERROR)
  for vol in volumes[MEASUREMENT_IX]
]
pd.DataFrame({
  "target volume": np.array(TARGET_VOLUMES)[MEASUREMENT_IX], 
  "actual volume": volumes[MEASUREMENT_IX],
  "measured volume": measurements
})
   target volume  actual volume  measured volume
0            200     199.708108       199.077273
1            200     174.678941       176.256328
2            400     289.554886       281.877576
3            400     387.796558       387.163512
4            400     349.870198       362.149468
5            800     809.710078       853.238785
6            800     683.214253       693.919342
7            800     770.523605       783.399634

Writing the model in Stan and sampling the simulated data

I wrote up the implied statistical model in a Stan file at src/stan/pipette.stan. The code below loads this Stan file as a CmdStanModel object, checks its formatting and prints it out.

Note that the model internally puts the data on log scale and then standardises it: this is a bit annoying but makes it way easier to set priors and can ultimately save you a lot of trouble.
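In numpy terms, the transformation in the model’s transformed data block is roughly the following (a sketch for orientation, not code that the chapter runs):

log_y = np.log(measurements)
y_ls = (log_y - log_y.mean()) / log_y.std(ddof=1)  # standardised log measurements (ddof=1 matches Stan's sd())
target_volume_ls = (np.log(TARGET_VOLUMES) - log_y.mean()) / log_y.std(ddof=1)
cal_error_s = CAL_ERROR / log_y.std(ddof=1)        # measurement error on the same standardised scale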

model = CmdStanModel(stan_file="../src/stan/pipette.stan")
model.format(overwrite_file=True, canonicalize=True, backup=False)
print(model.code())
17:34:52 - cmdstanpy - INFO - compiling stan file /Users/tedgro/repos/dtu-qmcm/bayesian_statistics_for_computational_biology/src/stan/pipette.stan to exe file /Users/tedgro/repos/dtu-qmcm/bayesian_statistics_for_computational_biology/src/stan/pipette
17:34:58 - cmdstanpy - INFO - compiled model executable: /Users/tedgro/repos/dtu-qmcm/bayesian_statistics_for_computational_biology/src/stan/pipette
functions {
  vector standardise(vector v, real m, real s) {
    return (v - m) / s;
  }
  real standardise(real v, real m, real s) {
    return (v - m) / s;
  }
  vector unstandardise(vector u, real m, real s) {
    return m + u * s;
  }
  real unstandardise(real u, real m, real s) {
    return m + u * s;
  }
}
data {
  int<lower=1> N;
  int<lower=0> N_cal;
  vector<lower=0>[N] target_volume;
  vector<lower=0>[N_cal] y;
  array[N_cal] int<lower=1, upper=N> measurement_ix;
  real<lower=0> cal_error;
  int<lower=0, upper=1> likelihood;
}
transformed data {
  vector[N_cal] y_ls = standardise(log(y), mean(log(y)), sd(log(y)));
  vector[N] target_volume_ls = standardise(log(target_volume), mean(log(y)),
                                           sd(log(y)));
  real cal_error_s = cal_error / sd(log(y));
}
parameters {
  real<lower=0> volume_noise_s;
  real bias_factor_l;
  vector[N] volume_ls;
}
model {
  volume_noise_s ~ lognormal(log(0.1), 0.5);
  bias_factor_l ~ normal(0, 0.15);
  volume_ls ~ normal(target_volume_ls + bias_factor_l, volume_noise_s);
  if (likelihood) {
    for (i in 1 : N_cal) {
      y_ls[i] ~ normal(volume_ls[measurement_ix[i]], cal_error_s);
    }
  }
}
generated quantities {
  real bias_factor = exp(bias_factor_l);
  real volume_noise = volume_noise_s * sd(log(y));
  vector[N] volume = exp(unstandardise(volume_ls, mean(log(y)), sd(log(y))));
  vector[N_cal] y_rep;
  vector[N_cal] llik;
  for (i in 1 : N_cal) {
    int ms_ix = measurement_ix[i];
    y_rep[i] = lognormal_rng(log(volume[ms_ix]), cal_error);
    llik[i] = lognormal_lpdf(y[i] | log(volume[ms_ix]), cal_error);
  }
}

This code loads some data into a dictionary that is compatible with Stan and carries out two MCMC runs, one in prior mode and one in posterior mode.

stan_input_posterior = stanio.json.process_dictionary(
    {
      "N": N,
      "N_cal": N_CAL,
      "target_volume": TARGET_VOLUMES,
      "y": measurements,
      "measurement_ix": MEASUREMENT_IX + 1,
      "cal_error": CAL_ERROR,
      "likelihood": 1,
  }
)
stan_input_prior = stan_input_posterior | {"likelihood": 0}
mcmc_prior = model.sample(
    data=stan_input_prior,
    adapt_delta=0.999,
    max_treedepth=12,
    seed=RNG_SEED,
)
mcmc_posterior = model.sample(data=stan_input_posterior, seed=RNG_SEED)
mcmc_prior.diagnose()
mcmc_posterior.diagnose()
17:34:58 - cmdstanpy - INFO - CmdStan start processing
17:34:58 - cmdstanpy - INFO - CmdStan done processing.
17:34:58 - cmdstanpy - INFO - CmdStan start processing
17:34:58 - cmdstanpy - INFO - CmdStan done processing.
17:34:58 - cmdstanpy - WARNING - Non-fatal error during sampling:
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'pipette.stan', line 38, column 2 to column 71)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'pipette.stan', line 38, column 2 to column 71)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'pipette.stan', line 38, column 2 to column 71)
Exception: normal_lpdf: Scale parameter is 0, but must be positive! (in 'pipette.stan', line 38, column 2 to column 71)
Consider re-running with show_console=True if the above output is unclear!
'Processing csv files: /var/folders/ql/z_6fb5792v1_8tscf4hg5byc0000gp/T/tmpmwrg_bjx/pipette4esu09xh/pipette-20240503173458_1.csv, /var/folders/ql/z_6fb5792v1_8tscf4hg5byc0000gp/T/tmpmwrg_bjx/pipette4esu09xh/pipette-20240503173458_2.csv, /var/folders/ql/z_6fb5792v1_8tscf4hg5byc0000gp/T/tmpmwrg_bjx/pipette4esu09xh/pipette-20240503173458_3.csv, /var/folders/ql/z_6fb5792v1_8tscf4hg5byc0000gp/T/tmpmwrg_bjx/pipette4esu09xh/pipette-20240503173458_4.csv\n\nChecking sampler transitions treedepth.\nTreedepth satisfactory for all transitions.\n\nChecking sampler transitions for divergences.\nNo divergent transitions found.\n\nChecking E-BFMI - sampler transitions HMC potential energy.\nE-BFMI satisfactory.\n\nEffective sample size satisfactory.\n\nSplit R-hat values satisfactory all parameters.\n\nProcessing complete, no problems detected.\n'

The diagnostics seem ok, though interestingly the prior was pretty tricky to sample accurately.

Loading the MCMC results with arviz

This code loads both MCMC runs into an arviz InferenceData object.

coords = {"obs": MEASUREMENT_IX, "tube": range(N)}
dims={
    "y": ["obs"],
    "y_rep": ["obs"],
    "target_volume": ["tube"],
    "true_volume": ["tube"],
    "volume": ["tube"],
    "tube": ["tube"]
}
idata = az.from_cmdstanpy(
    posterior=mcmc_posterior,
    prior=mcmc_prior,
    log_likelihood="llik",
    observed_data=stan_input_posterior | {
        "true_volume": volumes, "tube": range(N)
    },
    posterior_predictive={"y": "y_rep"},
    coords=coords,
    dims=dims
)
idata
arviz.InferenceData
    • <xarray.Dataset> Size: 2MB
      Dimensions:          (chain: 4, draw: 1000, volume_ls_dim_0: 24, tube: 24)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * volume_ls_dim_0  (volume_ls_dim_0) int64 192B 0 1 2 3 4 5 ... 19 20 21 22 23
        * tube             (tube) int64 192B 0 1 2 3 4 5 6 7 ... 17 18 19 20 21 22 23
      Data variables:
          volume_noise_s   (chain, draw) float64 32kB 0.149 0.2378 ... 0.1396 0.1959
          bias_factor_l    (chain, draw) float64 32kB -0.2014 -0.07906 ... -0.07839
          volume_ls        (chain, draw, volume_ls_dim_0) float64 768kB -1.275 ... ...
          bias_factor      (chain, draw) float64 32kB 0.8176 0.924 ... 0.9842 0.9246
          volume_noise     (chain, draw) float64 32kB 0.09116 0.1455 ... 0.1199
          volume           (chain, draw, tube) float64 768kB 182.9 192.1 ... 855.2
      Attributes:
          created_at:                 2024-05-03T15:34:59.287263
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 264kB
      Dimensions:  (chain: 4, draw: 1000, obs: 8)
      Coordinates:
        * chain    (chain) int64 32B 0 1 2 3
        * draw     (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
        * obs      (obs) int64 64B 1 4 9 11 15 19 21 22
      Data variables:
          y_rep    (chain, draw, obs) float64 256kB 197.1 177.7 290.6 ... 693.9 779.7
      Attributes:
          created_at:                 2024-05-03T15:34:59.293103
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 264kB
      Dimensions:     (chain: 4, draw: 1000, llik_dim_0: 8)
      Coordinates:
        * chain       (chain) int64 32B 0 1 2 3
        * draw        (draw) int64 8kB 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
        * llik_dim_0  (llik_dim_0) int64 64B 0 1 2 3 4 5 6 7
      Data variables:
          llik        (chain, draw, llik_dim_0) float64 256kB -3.875 -2.193 ... -3.675
      Attributes:
          created_at:                 2024-05-03T15:34:59.337397
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 204kB
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
      Data variables:
          lp               (chain, draw) float64 32kB 21.01 18.71 ... 24.94 22.19
          acceptance_rate  (chain, draw) float64 32kB 0.9988 0.7904 ... 0.8739 0.9229
          step_size        (chain, draw) float64 32kB 0.3503 0.3503 ... 0.4368 0.4368
          tree_depth       (chain, draw) int64 32kB 3 4 3 3 3 3 3 3 ... 3 3 3 3 3 3 4
          n_steps          (chain, draw) int64 32kB 7 31 7 15 7 7 7 ... 7 7 7 7 7 7 15
          diverging        (chain, draw) bool 4kB False False False ... False False
          energy           (chain, draw) float64 32kB -6.192 -8.972 ... -15.39 -11.6
      Attributes:
          created_at:                 2024-05-03T15:34:59.291527
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 2MB
      Dimensions:          (chain: 4, draw: 1000, volume_ls_dim_0: 24, tube: 24,
                            obs: 8, llik_dim_0: 8)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
        * volume_ls_dim_0  (volume_ls_dim_0) int64 192B 0 1 2 3 4 5 ... 19 20 21 22 23
        * tube             (tube) int64 192B 0 1 2 3 4 5 6 7 ... 17 18 19 20 21 22 23
        * obs              (obs) int64 64B 1 4 9 11 15 19 21 22
        * llik_dim_0       (llik_dim_0) int64 64B 0 1 2 3 4 5 6 7
      Data variables:
          volume_noise_s   (chain, draw) float64 32kB 0.02974 0.02728 ... 0.2746
          bias_factor_l    (chain, draw) float64 32kB -0.1025 -0.07485 ... -0.1921
          volume_ls        (chain, draw, volume_ls_dim_0) float64 768kB -1.18 ... 0...
          bias_factor      (chain, draw) float64 32kB 0.9026 0.9279 ... 0.9219 0.8252
          volume_noise     (chain, draw) float64 32kB 0.0182 0.01669 ... 0.1236 0.168
          volume           (chain, draw, tube) float64 768kB 193.8 185.0 ... 693.1
          y_rep            (chain, draw, obs) float64 256kB 184.8 189.1 ... 1.002e+03
          llik             (chain, draw, llik_dim_0) float64 256kB -8.981 ... -87.32
      Attributes:
          created_at:                 2024-05-03T15:34:59.331496
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 204kB
      Dimensions:          (chain: 4, draw: 1000)
      Coordinates:
        * chain            (chain) int64 32B 0 1 2 3
        * draw             (draw) int64 8kB 0 1 2 3 4 5 6 ... 994 995 996 997 998 999
      Data variables:
          lp               (chain, draw) float64 32kB 72.09 71.09 ... 23.28 14.77
          acceptance_rate  (chain, draw) float64 32kB 0.9856 0.9735 ... 0.9808 0.99
          step_size        (chain, draw) float64 32kB 0.03423 0.03423 ... 0.04274
          tree_depth       (chain, draw) int64 32kB 4 4 4 4 4 4 4 4 ... 7 6 7 7 6 6 7
          n_steps          (chain, draw) int64 32kB 15 31 15 15 15 ... 127 63 63 127
          diverging        (chain, draw) bool 4kB False False False ... False False
          energy           (chain, draw) float64 32kB -58.38 -61.21 ... -14.09 -3.941
      Attributes:
          created_at:                 2024-05-03T15:34:59.334204
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

    • <xarray.Dataset> Size: 896B
      Dimensions:               (N_dim_0: 1, N_cal_dim_0: 1, tube: 24, obs: 8,
                                 measurement_ix_dim_0: 8, cal_error_dim_0: 1,
                                 likelihood_dim_0: 1)
      Coordinates:
        * N_dim_0               (N_dim_0) int64 8B 0
        * N_cal_dim_0           (N_cal_dim_0) int64 8B 0
        * tube                  (tube) int64 192B 0 1 2 3 4 5 6 ... 18 19 20 21 22 23
        * obs                   (obs) int64 64B 1 4 9 11 15 19 21 22
        * measurement_ix_dim_0  (measurement_ix_dim_0) int64 64B 0 1 2 3 4 5 6 7
        * cal_error_dim_0       (cal_error_dim_0) int64 8B 0
        * likelihood_dim_0      (likelihood_dim_0) int64 8B 0
      Data variables:
          N                     (N_dim_0) int64 8B 24
          N_cal                 (N_cal_dim_0) int64 8B 8
          target_volume         (tube) int64 192B 200 200 200 200 ... 800 800 800 800
          y                     (obs) float64 64B 199.1 176.3 281.9 ... 693.9 783.4
          measurement_ix        (measurement_ix_dim_0) int64 64B 2 5 10 12 16 20 22 23
          cal_error             (cal_error_dim_0) float64 8B 0.02
          likelihood            (likelihood_dim_0) int64 8B 1
          true_volume           (tube) float64 192B 152.6 199.7 161.3 ... 770.5 598.6
      Attributes:
          created_at:                 2024-05-03T15:34:59.336259
          arviz_version:              0.17.1
          inference_library:          cmdstanpy
          inference_library_version:  1.2.1

Next we look at the summaries of both the posterior and prior.

for group_name in ["prior", "posterior"]:
    group = idata.get(group_name)
    group_summary = az.summary(
        group,
        var_names=[
            "volume_noise", "bias_factor", "volume_noise_s", "bias_factor_l"
        ]
    )
    display(group_summary)
Prior:

                 mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
volume_noise    0.066  0.034   0.016    0.128      0.002    0.001     224.0     380.0   1.01
bias_factor     0.994  0.143   0.731    1.252      0.015    0.010      80.0     120.0   1.06
volume_noise_s  0.108  0.055   0.026    0.210      0.003    0.002     224.0     380.0   1.01
bias_factor_l  -0.016  0.141  -0.286    0.242      0.015    0.011      80.0     120.0   1.06

Posterior:

                 mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
volume_noise    0.113  0.026   0.068    0.160      0.001    0.001    1303.0    1943.0    1.0
bias_factor     0.887  0.055   0.781    0.986      0.001    0.001    2493.0    2380.0    1.0
volume_noise_s  0.185  0.043   0.111    0.261      0.001    0.001    1303.0    1943.0    1.0
bias_factor_l  -0.122  0.061  -0.229    0.002      0.001    0.001    2493.0    2380.0    1.0

Investigating the results

This plot compares the observed measurements with the model’s posterior predictive distribution.

az.plot_lm(
    y=idata.observed_data["y"],
    x=idata.observed_data["tube"].sel(tube=MEASUREMENT_IX),
    y_hat=idata.posterior_predictive["y_rep"]
)
ax = plt.gca()
ax.semilogy()

This plot compares the prior (blue) and posterior (orange) samples for volume_noise and bias_factor with the true values (red lines) that we used to simulate the data.

az.plot_posterior(
  idata.prior,
  var_names=["volume_noise", "bias_factor"],
  kind="hist",
  hdi_prob="hide",
  point_estimate=None,
  figsize=[12, 4]
)
f = plt.gcf()
axes = f.axes
az.plot_posterior(
  idata.posterior,
  var_names=["volume_noise", "bias_factor"],
  kind="hist",
  hdi_prob="hide",
  point_estimate=None,
  figsize=[12, 4],
  ax=axes,
  color="tab:orange"
)
for ax, truth in zip(f.axes, [NOISE, BIAS_FACTOR]):
    ax.axvline(truth, color="red")

This plot shows the samples for all the tubes’ volumes, including those that weren’t measured, alongside the true volumes.

az.plot_lm(
    x=idata.observed_data["tube"],
    y=idata.observed_data["true_volume"],
    y_hat=idata.posterior["volume"],
    grid=False,
    y_kwargs={"label": "true volume"},
    figsize=[10, 5],
    legend=False,
)
ax = plt.gca()
for i in MEASUREMENT_IX:
    ax.text(i+0.1, volumes[i], "obs", zorder=1000)
ax.set(xlabel="tube", ylabel="volume ($\\mu$l)");
ax.semilogy()

So, what is the probability that Teddy put less than 350 \(\mu\)l of label into tube 10, even though the target amount was 400 \(\mu\)l?

az.plot_posterior(
  idata.prior,
  var_names=["volume"],
  coords={"tube": [10]},
  kind="hist",
  hdi_prob="hide",
  point_estimate=None,
  ref_val=350,
  figsize=[12, 4],
  bins=np.linspace(250, 600, 30),
)
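If you prefer a number to a picture, the same probability can be computed directly from the draws (same group and tube as in the plot above):

p_under_350 = (idata.prior["volume"].sel(tube=10) < 350).mean().item()
print(f"P(volume in tube 10 < 350 µl) ≈ {p_under_350:.0%}")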

Phew, only about 13%. That’s probably fine, right?

References

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.
Holmes, Susan, and Wolfgang Huber. 2019. Modern Statistics for Modern Biology. Cambridge University Press. https://www.huber.embl.de/msmb/.