5.3 Bayesian Statistics

These notes are based on chapter readings of the textbook A First Course in Bayesian Statistical Methods by Peter D. Hoff. https://link.springer.com/book/10.1007/978-0-387-92407-6

Bayesian statistics is a branch of statistics based on Bayes’ theorem. It describes how we update the probability of a hypothesis as more evidence becomes available. The unknown parameters in our model are treated as random variables, and probability is used to describe uncertainty about the parameters.

This differs from the frequentist approach, in which parameters are treated as fixed values and probability is used to describe the long-run frequency of observed data.

Probabilities can numerically represent a set of rational beliefs. Bayes’ rule provides a rational method for updating belief with new information, i.e. Bayesian Inference.

Bayesian methods are data analysis tools that provide:

- Formal interpretation as a means of induction
- Parameter estimates with good statistical properties
- Parsimonious descriptions of observed data
- Predictors for missing data and forecasts of future data

Bayesian learning is about understanding and quantifying uncertainty. Statistical induction means learning the general characteristics of a population from a subset of that population.

Before seeing data, the numerical characteristics of a population are uncertain. After seeing data, we can use that information to decrease our uncertainty about the population.

Why Bayes?

  1. Bayes is practical for analysis. It can be hard to determine and formulate our prior beliefs, so \(p(\theta)\) is often chosen ad hoc or for computational convenience. Even so:
  • \(p(\theta)\) can approximate our beliefs
  • if \(p(\theta)\) approximates our beliefs, then \(p(\theta|y)\) approximates the optimal posterior
  2. For complicated statistical problems, there may not be an obvious solution. Bayes’ rule can provide approximate solutions in these situations, which can then be evaluated by non-Bayesian criteria.

5.3.1 Properties of Conditional Probability

5.3.2 Application of Bayes Theorem: Examples

5.3.3 Independence

5.3.4 Random Variables; Joint Distributions; Law of Large Numbers (LLN); Central Limit Theorem (CLT)


5.3.5 Common Probability Distributions; Introduction to Bayesian Inference; One-parameter Models


5.3.6 Exponential Family; Frequentist Confidence Interval; Bayesian (Credible) Interval


References:
- Raiffa and Schlaifer (1961): classes of prior distributions
- Diaconis and Ylvisaker (1979, 1985): conjugacy for exponential families
- Welch and Peers (1963), Hartigan (1966), Severini (1991): Bayesian coverage and credible intervals
- Tibshirani (1989), Sweeting (1999, 2001): frequentist coverage for the construction of prior distributions
- Kass and Wasserman (1996): review of formal methods for selecting prior distributions

5.3.7 Sufficiency; Rao-Blackwell Theorem

5.3.8 Monte Carlo Approximation

5.3.8.1 Monte Carlo Expectation

5.3.8.2 Additional information from Monte Carlo approximation, other than estimating parameters

Calculating the log-odds from Monte Carlo approximation:
Fifty-four percent of the respondents in the 1998 General Social Survey reported their religious preference as Protestant, leaving non-Protestants in the minority. Respondents were also asked if they agreed with a Supreme Court ruling that prohibited state or local governments from requiring the reading of religious texts in public schools. Of the \(n = 860\) individuals in the religious minority (non-Protestant), \(y = 441\) (51%) said they agreed with the Supreme Court ruling, whereas 353 of the 1011 Protestants (35%) agreed with the ruling.
Let θ be the population proportion agreeing with the ruling in the minority population. Using a binomial sampling model and a uniform prior distribution, the posterior distribution of θ is beta(442,420). Using the Monte Carlo algorithm described above, we can obtain samples of the log-odds \(Y\) = log[θ/(1-θ)] from both the prior distribution and the posterior distribution of \(Y\). In R, the Monte Carlo algorithm involves only a few commands:
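A minimal sketch of those commands, assuming a uniform (beta(1, 1)) prior and the beta(442, 420) posterior from the text:

```r
set.seed(1)
S <- 10000
theta.prior.mc <- rbeta(S, 1, 1)                              # draws from the uniform prior
gamma.prior.mc <- log(theta.prior.mc / (1 - theta.prior.mc))  # prior log-odds
theta.post.mc <- rbeta(S, 442, 420)                           # draws from the posterior
gamma.post.mc <- log(theta.post.mc / (1 - theta.post.mc))     # posterior log-odds
```

Density plots of `gamma.prior.mc` and `gamma.post.mc` then approximate the prior and posterior distributions of the log-odds \(Y\).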

If you want to compare two parameters numerically, for example to evaluate \(\theta_{1} > \theta_{2}\) or the ratio \(\theta_{1}/\theta_{2}\), Monte Carlo sampling can be used to compute posterior summaries of or comparisons between the two parameters.

Approach:

  1. Draw \(S\) samples from each posterior. For each \(s = 1, \ldots, S\):
    • Sample \(\theta_{1}^{(s)}\) from gamma(219, 112)
    • Sample \(\theta_{2}^{(s)}\) from gamma(68, 45)
  2. Compare the samples. For each pair \((\theta_{1}^{(s)}, \theta_{2}^{(s)})\):
    • Check whether \(\theta_{1}^{(s)} > \theta_{2}^{(s)}\)
    • The proportion of times this is true estimates \(\Pr(\theta_{1} > \theta_{2} \mid \text{data})\)

This can be expressed as:
\[ \Pr(\theta_1 > \theta_2 \mid \text{data}) \approx \frac{1}{S} \sum_{s=1}^S 1(\theta_1^{(s)} > \theta_2^{(s)}) \]

where the indicator function \(1(\cdot)\) equals 1 if the condition is true and 0 otherwise. In R, this can be calculated as:
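A minimal sketch, assuming the gamma(219, 112) and gamma(68, 45) posteriors given above:

```r
set.seed(1)
S <- 10000
theta1.mc <- rgamma(S, 219, 112)    # posterior samples of theta_1
theta2.mc <- rgamma(S, 68, 45)      # posterior samples of theta_2
mean(theta1.mc > theta2.mc)         # estimates Pr(theta_1 > theta_2 | data)
ratio.mc <- theta1.mc / theta2.mc   # samples of the ratio theta_1 / theta_2
plot(density(ratio.mc))             # density plot discussed below
```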
In the density plot of the ratio, a peak above 1 suggests that \(\theta_{1}\) is likely larger than \(\theta_{2}\). For spread and uncertainty, the width of the density plot shows how uncertain you are about the ratio: a narrow peak means high certainty, a wide spread means more uncertainty.

Sampling from a predictive distribution
What does this mean? “Given what I know (my data and prior beliefs), what is the probability distribution for a future or unobserved outcome?” It is a probability distribution for a random variable \(\tilde{Y}\) (a new or future observation). We can’t just plug a point estimate into the model to predict future observations, because we need to account for the uncertainty and variability that the future data will have.

It is constructed so that:
- known quantities have been conditioned on, such as the observed data or fixed covariates
- unknown quantities have been integrated out (like the parameter \(\theta\))

Special Case: Conjugate Priors
Example: \(\tilde{Y}\) = number of children for a randomly chosen woman aged 40 with a college degree. \(\theta\) = mean birthrate for this population.

If we knew \(\theta\), we would model \(\tilde{Y}\) as Poisson: \[ \Pr(\tilde{Y} = \tilde{y} | \theta) = \frac{\theta^{\tilde{y}} e^{-\theta}}{\tilde{y}!} \] But we usually don’t know \(\theta\):
- We have uncertainty about \(\theta\), described by a prior or posterior distribution \(p(\theta)\).
- To predict \(\tilde{Y}\), we must integrate out \(\theta\):
\[ \Pr(\tilde{Y} = \tilde{y}) = \int p(\tilde{y} | \theta) p(\theta) d\theta \]

This is the predictive distribution.
If \(\theta \sim \operatorname{gamma}(a, b)\), the predictive distribution for \(\tilde{Y}\) is a negative binomial distribution with parameters \(a, b\).
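The predictive distribution can also be approximated by Monte Carlo: draw \(\theta\) from \(p(\theta)\) and then \(\tilde{Y}\) from the Poisson model. A minimal sketch, using a hypothetical gamma(219, 112) posterior:

```r
set.seed(1)
S <- 10000
theta.mc <- rgamma(S, 219, 112)   # draw theta from its (hypothetical) gamma posterior
ytilde.mc <- rpois(S, theta.mc)   # draw Y-tilde | theta; marginally negative binomial
table(ytilde.mc) / S              # empirical predictive probabilities
```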

5.3.8.3 Understanding discrepancies

Checking your work in Monte Carlo approximation
If there are discrepancies between the empirical distribution and the predictive distribution, these are due to:
1. sampling variability: the observed data might just be a statistical fluke
2. model misfit: the model may not be flexible enough to capture the true pattern of the data

To check whether a discrepancy is unusual, you can quantify the conflict using simulation (a sketch in R follows this list):

  1. For each simulation \(s\):
    • Draw a parameter value \(\theta^{(s)}\) from the posterior.
    • Generate a new dataset \(\tilde{\boldsymbol{Y}}^{(s)}\) of size 111, using the model (\(\text{Poisson}(\theta^{(s)})\) for each woman).
    • Compute a summary statistic for the simulated dataset (here, the ratio \(t^{(s)}\) of the number of women with 2 children to the number with 1 child).
  2. Repeat for many simulations to get a distribution of \(t^{(s)}\) under the model.
    • Compare your observed statistic \(t_{\text{obs}}\) (here, 2) to the simulated values.
    • If very few simulated datasets have \(t^{(s)}\) as extreme as \(t_{\text{obs}}\), this suggests the model is not consistent with the data.
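A minimal sketch of this check, assuming the gamma(219, 112) posterior and the sample size of 111 from the text:

```r
set.seed(1)
S <- 10000
t.mc <- numeric(S)
for (s in 1:S) {
  theta.s <- rgamma(1, 219, 112)                   # draw theta from the posterior
  ytilde <- rpois(111, theta.s)                    # simulate a dataset of 111 women
  t.mc[s] <- sum(ytilde == 2) / sum(ytilde == 1)   # ratio of twos to ones
}
mean(t.mc >= 2)   # how often the simulated ratio is as extreme as the observed value of 2
```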

References:
- Rubinstein and Kroese (2008): Monte Carlo methods for a wide variety of statistical problems
- Robert and Casella (2004): more coverage of Bayesian applications (and Markov chain Monte Carlo methods as well)
- Guttman (1967) and Rubin (1984): the posterior predictive distribution as a tool to assess model fit
- Gelman et al. (1996) and, more recently, Johnson (2007): evaluating goodness-of-fit using functions that depend on parameters as well as predicted data
- Bayarri and Berger (2000): posterior predictive p-values

5.3.9 Normal Model

5.3.9.1 Conjugate analysis

The normal model is useful and widely used for data analysis because of the central limit theorem and because it is a simple model with separate parameters for the mean and variance. \[ p(y \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2\right), \quad -\infty < y < \infty \]
- the distribution is symmetric about \(\theta\), and the mode, median, and mean are all equal to \(\theta\)
- about 95% of the population lies within two standard deviations (\(\pm 2\sigma\)) of the mean (more precisely, 1.96 standard deviations, \(\pm 1.96\sigma\))
- if \(X \sim \operatorname{Normal}(\mu, \tau^2)\), \(Y \sim \operatorname{Normal}(\theta, \sigma^2)\), and \(X\) and \(Y\) are independent, then for constants \(a, b\):
\[ aX + bY \sim \operatorname{Normal}(a\mu + b\theta, a^2\tau^2 + b^2\sigma^2) \]
- the dnorm, rnorm, pnorm, and qnorm commands in R take the standard deviation \(\sigma\) as their argument, not the variance \(\sigma^2\). Be very careful about this when using R; confusing the two can drastically change your results.
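For example, a small sketch of the sd-versus-variance pitfall:

```r
# dnorm/rnorm/pnorm/qnorm take the standard deviation, not the variance
dnorm(1, mean = 0, sd = 2)          # density of a Normal with variance 4 at y = 1
dnorm(1, mean = 0, sd = sqrt(2))    # density of a Normal with variance 2 at y = 1
rnorm(5, mean = 10, sd = sqrt(9))   # five draws from Normal(10, variance 9)
```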

Inference for the mean, conditional on the variance:
Suppose you observe \(Y_1, Y_2, \ldots, Y_n\) and assume they are i.i.d.: \[ Y_i \sim \operatorname{Normal}(\theta, \sigma^2) \]

The joint probability (likelihood) of seeing your data given \(\theta, \sigma^2\) is:

\[ p(y_1, \ldots, y_n | \theta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{y_i - \theta}{\sigma} \right)^2 \right ) \]

\[ = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^n \left( \frac{y_i - \theta}{\sigma} \right)^2 \right ) \]

Expanding the quadratic term in the exponent gives:
\[ \sum_{i=1}^n \left( \frac{y_i - \theta}{\sigma} \right)^2 = \frac{1}{\sigma^2} \sum y_i^2 - 2\frac{\theta}{\sigma^2} \sum y_i + n\frac{\theta^2}{\sigma^2} \] Therefore, the likelihood depends on the data only through \(\sum y_i\) and \(\sum y_i^2\); these are the sufficient statistics.

Updating based on prior belief

If you started with a normal prior, the posterior is also normal. \[ \theta \sim \operatorname{Normal}(\mu_0, \tau_0^2) \] \[ p(\theta | y_1, \ldots, y_n, \sigma^2) \sim \operatorname{Normal}(\mu_n, \tau_n^2) \] \[ \tau_n^2 = \frac{1}{ \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} } \] \[ \mu_n = \frac{ \frac{1}{\tau_0^2} \mu_0 + \frac{n}{\sigma^2} \bar{y} }{ \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} } \]

\(\mu_0, \tau_0^2\) are the prior mean and variance, and \(\bar{y}\) is the sample mean.
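A minimal sketch of this update in R, with hypothetical prior and data summaries:

```r
mu0 <- 1.9; tau20 <- 0.95^2            # hypothetical prior mean and variance
sigma2 <- 0.04; n <- 9; ybar <- 1.8    # hypothetical data variance, sample size, sample mean
tau2n <- 1 / (1 / tau20 + n / sigma2)             # posterior variance
mun <- (mu0 / tau20 + n * ybar / sigma2) * tau2n  # posterior mean
c(mun, tau2n)
```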

5.3.9.2 Precision and combining information

The posterior variance for the mean, \(\tau_n^2\), is determined by combining the inverse variances (precisions) from the prior and the data: \[ \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \]

- \(\tau_0^2\): prior variance (how uncertain you are about the prior mean)
- \(\sigma^2\): data variance (spread of the data)
- \(n\): sample size

Precision is the inverse of variance:
- Prior precision: \(1/\tau_0^2\)
- Data precision: \(n/\sigma^2\) (since you have \(n\) data points, each with variance \(\sigma^2\))
- Posterior precision: \(1/\tau_n^2\)

The posterior mean is a weighted average of the prior mean and the sample mean, with weights given by the “prior” and “data” precisions:
\[ \mu_n = \frac{1/\tau_0^2}{1/\tau_0^2 + n/\sigma^2} \mu_0 + \frac{n/\sigma^2}{1/\tau_0^2 + n/\sigma^2} \bar{y} \]
- Prior mean weight: \(1/\tau_0^2\)
- Sample mean weight: \(n/\sigma^2\)

If you think of your prior as coming from \(\kappa_0\) “pseudo-observations” with variance \(\sigma^2\), set \(\tau_0^2 = \sigma^2 / \kappa_0\), and the formula simplifies to: \[ \mu_n = \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{n}{\kappa_0 + n} \bar{y} \] The final posterior belief about the mean is a compromise between prior belief and evidence from the data, weighted by reliability (precision). With more data, the posterior relies on the sample mean more than the prior.

5.3.9.3 Prediction for a new observation

Use the posterior distribution for \(\theta\) after seeing the data.

- Posterior mean of \(\theta\): \(\mu_n\)
- Posterior variance of \(\theta\): \(\tau_n^2\); this also represents our uncertainty about the center of the population

Now, the new observation \(\tilde{Y}\) is the sum of:

- \(\theta \sim \operatorname{Normal}(\mu_n, \tau_n^2)\)
- \(\tilde{\epsilon} \sim \operatorname{Normal}(0, \sigma^2)\) (independent noise)

The sum of independent normals is normal: \[ \tilde{Y} \mid y_1, ..., y_n, \sigma^2 \sim \operatorname{Normal}(\mu_n, \tau_n^2 + \sigma^2) \] with mean \(\mathrm{E}[\tilde{Y} \mid y_1, ..., y_n, \sigma^2] = \mu_n\) and variance \(\operatorname{Var}[\tilde{Y} \mid y_1, ..., y_n, \sigma^2] = \tau_n^2 + \sigma^2\).

\(\sigma^2\) is the irreducible noise that can’t be predicted. The total uncertainty about \(\tilde{Y}\) is at least \(\sigma^2\), and larger if we are uncertain about \(\theta\).

Predictive uncertainty is a combination of how well you know the mean and how much natural variability is in the data.
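A minimal sketch of predictive sampling, reusing hypothetical posterior quantities like those computed above:

```r
mun <- 1.8; tau2n <- 0.0044; sigma2 <- 0.04           # hypothetical posterior quantities
ytilde.mc <- rnorm(10000, mun, sqrt(tau2n + sigma2))  # predictive draws
quantile(ytilde.mc, c(0.025, 0.975))                  # 95% predictive interval
```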

5.3.9.4 Joint inference for the mean and variance

This uses a joint prior on \((\theta, \sigma^2)\). Conjugate priors make it easy to compute the posterior, and the parameters of these priors can be interpreted as if they come from previous (pseudo-)data.

Recall from axioms of probability that a joint distribution for two quantities can be expressed as the product of a conditional probability and a marginal probability: \[ p(\theta, \sigma^2) = p(\theta | \sigma^2) \, p(\sigma^2) \]

For a normal distribution, a convenient choice of prior is \(\tau_0^2 = \sigma^2/\kappa_0\), so the prior can be thought of as coming from \(\kappa_0\) pseudo-observations with mean \(\mu_0\) and variance \(\sigma^2\).

Prior for the variance: the variance must be positive, so we use the inverse-gamma prior (a gamma prior for the precision \(1/\sigma^2\)): \[ 1/\sigma^2 \sim \operatorname{gamma}\left(\frac{\nu_0}{2}, \frac{\nu_0}{2}\sigma_0^2\right) \] \(\sigma_0^2\) is the prior guess for the variance, and \(\nu_0\) is the “prior sample size” reflecting your confidence in \(\sigma_0^2\).

\(\mu_0\): prior mean for \(\theta\)
\(\kappa_0\): number of pseudo-observations supporting \(\mu_0\)
\(\sigma_0^2\): prior guess for the variance
\(\nu_0\): “prior sample size” for the variance

Properties of the inverse-gamma prior
mean: \(E[\sigma^2] = \sigma_0^2 \frac{\nu_0/2}{\nu_0/2-1}\)
mode: \(\sigma_0^2 \frac{\nu_0/2}{\nu_0/2+1}\), so that mode \(< \sigma_0^2 <\) mean
Variance decreases as \(\nu_0\) increases (more prior data, more certainty).

5.3.9.5 Posterior inference

We determined:

Data:
\[ Y_1, \ldots, Y_n \mid \theta, \sigma^2 \sim \text{i.i.d. } \operatorname{normal}(\theta, \sigma^2) \]

Prior for the variance (precision):

\[ 1/\sigma^2 \sim \operatorname{gamma}(\nu_0/2, \nu_0 \sigma_0^2/2) \]

Prior for the mean (conditional on the variance):

\[ \theta \mid \sigma^2 \sim \operatorname{normal}(\mu_0, \sigma^2/\kappa_0) \]

Since the prior can be factorized as \[ p(\theta, \sigma^2) = p(\theta \mid \sigma^2)\,p(\sigma^2) \] the posterior can be factored as:
\[ p(\theta, \sigma^2 \mid y_1, ..., y_n) = p(\theta \mid \sigma^2, y_1, ..., y_n) \; p(\sigma^2 \mid y_1, ..., y_n) \]

The posterior for the mean \(\theta\), conditional on \(\sigma^2\), is normal:
\[ \theta \mid \sigma^2, y_1, ..., y_n \sim \text{normal}\left(\mu_n, \frac{\sigma^2}{\kappa_n}\right) \] \(\kappa_n = \kappa_0 + n\)
\(\mu_n = \frac{\kappa_0 \mu_0 + n \bar{y}}{\kappa_0 + n}\)
\(\bar{y}\) = sample mean of the data
\(\mu_n\) is a weighted average of the prior mean and the sample mean, weighted by their respective “sample sizes”.

In other words, the prior acts like \(\kappa_0\) previously observed data points, and everything is weighted by sample size.

The posterior for variance \(\sigma^2\)
The marginal posterior for \(\sigma^2\) is inverse-gamma (or, equivalently, the precision \(1/\sigma^2\) is gamma): \[ 1/\sigma^2 \mid y_1, ..., y_n \sim \operatorname{gamma}(\nu_n/2, \nu_n \sigma_n^2/2) \] \(\nu_n = \nu_0 + n\) (prior sample size + data sample size)
\(\sigma_n^2 = \frac{1}{\nu_n}\left[\nu_0 \sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)^2\right]\)
\(s^2\) is the sample variance
\(\nu_0 \sigma_0^2\) is the prior sum of squares
\((n-1)s^2\) is the data sum of squares

The last term \(\frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)^2\) measures the difference between the prior mean and the sample mean; if they are very different, the posterior variance increases.

This incorporates prior and observed data, adding a penalty if the sample mean and prior mean are far apart.

If \(\mu_0\) is truly the mean of \(\kappa_0\) prior observations, everything works in an “add up the data” fashion. If not, the last term in \(\sigma_n^2\) corrects for disagreement between prior and observed data.

5.3.9.6 Monte Carlo sampling

It is common to want to analyze the population mean, standard deviation, quantiles, probabilities, and credible intervals; however, the posterior for the parameter \(\theta\) depends on the unknown variance \(\sigma^2\), and the joint posterior is not a simple normal distribution.

The posterior has the following structure. The conditional posterior of the mean given the variance is normal: \[ \theta | \sigma^2, y \sim \operatorname{Normal}(\mu_n, \sigma^2 / \kappa_n) \] and the marginal posterior for the variance is inverse-gamma: \[ \sigma^2 | y \sim \text{Inverse-Gamma}(\nu_n/2, \nu_n \sigma_n^2 / 2) \] But we want to make statements about \(\theta\) marginally, not just conditionally. We can perform Monte Carlo sampling to generate samples from the joint posterior \(p(\theta, \sigma^2 | y_1, ..., y_n)\):

  1. Sample \(\sigma^{2(s)}\) from its marginal posterior:
    \[ \sigma^{2(s)} \sim \text{Inverse-Gamma}(\nu_n / 2, \nu_n \sigma_n^2 / 2) \]

  2. Sample \(\theta^{(s)}\) from its conditional posterior given \(\sigma^{2(s)}\)
    \[ \theta^{(s)} \sim \operatorname{Normal}(\mu_n, \sigma^{2(s)} / \kappa_n) \] This is repeated \(S\) times to get \(S\) pairs \((\theta^{(s)}, \sigma^{2(s)})\).
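A minimal sketch of this two-step sampler in R, with hypothetical posterior parameters:

```r
set.seed(1)
mu.n <- 1.81; kappa.n <- 11; nu.n <- 11; s2.n <- 0.018   # hypothetical posterior parameters
S <- 10000
sigma2.mc <- 1 / rgamma(S, nu.n / 2, nu.n * s2.n / 2)    # step 1: sigma^2 from its marginal
theta.mc <- rnorm(S, mu.n, sqrt(sigma2.mc / kappa.n))    # step 2: theta | sampled sigma^2
quantile(theta.mc, c(0.025, 0.5, 0.975))                 # posterior median and 95% credible interval
```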

This works because each \(\theta^{(s)}\) is drawn conditionally on a different sampled value of \(\sigma^{2(s)}\), i.e. from the conditional posterior \(p(\theta | \sigma^{2(s)}, y)\).

The set \(\{\theta^{(1)}, ..., \theta^{(S)}\}\) consists of samples from the marginal posterior of \(\theta\) and represents the full uncertainty about \(\theta\) after accounting for both mean and variance. These can be used to compute anything about the posterior of \(\theta\), such as:

  • Posterior mean: average of samples
  • Posterior sd: standard deviation of samples
  • Credible intervals: quantiles of samples
  • Probabilities: proportion of samples with desired property

The marginal posterior of \(\theta\) is a t-distribution: the standardized quantity

\[ t(\theta) = \frac{\theta - \mu_n}{\sigma_n / \sqrt{\kappa_n}} \]

has a t-distribution with \(\nu_0 + n\) degrees of freedom (if the prior sample size is small, this is close to the classical \(t_{n-1}\) distribution).

This is useful because it allows you to work with complex posteriors that may not have a closed form. It also lets you estimate expectations, standard deviations, quantiles, and probabilities even when the formula is not simple.

5.3.9.7 Improper Priors

Sometimes you want to use Bayesian methods without letting the priors heavily influence the answers; in other words, you want to be objective. In the normal model, the priors for the mean and variance are set using prior sample sizes (\(\kappa_0\) and \(\nu_0\)), which measure how much weight you give to the prior beliefs. Uninformative priors make Bayesian answers almost identical to frequentist answers. The idea is to use Bayesian logic to learn about the parameters while remaining objective.

One reason to approach the question through Bayesian methods is that the approach is more intuitive: it asks “What do I believe about the parameter, given the data?”

It can handle complex models that frequentist methods have a difficult time computing.

As you decrease the prior influence, the posterior mean and variance become:
\[ \mu_n = \frac{\kappa_0 \mu_0 + n\bar{y}}{\kappa_0 + n} \] \[ \sigma_n^2 = \frac{1}{\nu_0 + n} \left[ \nu_0 \sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{y} - \mu_0)^2 \right] \] As \(\kappa_0 \to 0\) and \(\nu_0 \to 0\) (i.e., as you put less weight to the prior information)
- \(\mu_n \to \bar{y}\): the posterior mean becomes the sample mean.
- \(\sigma_n^2 \to \frac{n-1}{n} s^2\): the posterior variance becomes the sample variance (but with divisor \(n\) rather than \(n-1\)).

Setting \(\kappa_0, \nu_0 \to 0\) is equivalent to using an improper prior, a prior that does not integrate to 1, such as \(p(\theta, \sigma^2) \propto 1/\sigma^2\).

The posterior with the improper prior becomes:
\[ 1/\sigma^2 \mid y \sim \text{gamma}\left(\frac{n}{2}, \frac{n}{2} \frac{1}{n} \sum (y_i - \bar{y})^2 \right) \] \[ \theta \mid \sigma^2, y \sim \text{normal}\left(\bar{y}, \frac{\sigma^2}{n}\right) \]

t-distribution: if you integrate out \(\sigma^2\), the posterior distribution of \(\theta\) after observing the data is \[ \frac{\theta - \bar{y}}{s/\sqrt{n}} \mid y \sim t_{n-1} \] This matches the frequentist sampling distribution of the sample mean, which describes uncertainty about the sample mean given the true mean: \[ \frac{\bar{Y} - \theta}{s/\sqrt{n}} \mid \theta \sim t_{n-1} \]

By contrast, the Bayesian result describes uncertainty about the true mean, given the sample mean:
\[ \frac{\theta - \bar{y}}{s/\sqrt{n}} \mid y \sim t_{n-1} \]

These results are not formally Bayesian, as there is no proper probability distribution on \((\theta, \sigma^2)\) that yields this \(t_{n-1}\) posterior for \(\theta\). But the results are reasonable and have desirable properties, so they are admissible from a decision-theoretic standpoint.

References
- Stein (1955) and Berger (1980): any reasonable estimator is a Bayesian estimator or a limit of a sequence of Bayesian estimators, and any Bayesian estimator is reasonable

5.3.9.8 Bias, Variance, and Mean Squared Error (MSE)

A point estimator is a single number calculated from the data to estimate an unknown parameter \(\theta\).

In the normal model with a conjugate prior:

\[ \hat{\theta}_{b}\left(y_{1}, \ldots, y_{n}\right) = \mathrm{E}\left[\theta \mid y_{1}, \ldots, y_{n}\right] = \frac{n}{\kappa_{0} + n} \, \bar{y} + \frac{\kappa_{0}}{\kappa_{0} + n} \, \mu_{0} = w \, \bar{y} + (1 - w)\mu_{0} \]

  • \(w = \frac{n}{\kappa_0 + n}\)
  • \(\mu_0\): prior mean
  • \(\kappa_0\): prior “sample size”
  • \(\bar{y}\): sample mean

Sampling Properties and Bias
Bias measures how close the expected value of the estimator is to the true parameter value. The sampling properties are the behavior of estimators under repeated experiments/surveys/sampling.

The unbiased sample mean is \(\hat{\theta}_e = \bar{y}\):

\[ \mathbb{E}[\hat{\theta}_e \mid \theta = \theta_0] = \theta_0 \]

The Bayesian estimator \(\hat{\theta}_b\):

\[ \mathbb{E}[\hat{\theta}_b \mid \theta = \theta_0] = w\theta_0 + (1-w)\mu_0 \] If \(\mu_0 \neq \theta_0\) this estimator is biased.

The bias tells us how close the center of mass of the estimator’s sampling distribution is to the true value, but it does not tell us how far a given estimate is likely to be from the true value.

Sample mean variance:   \[ \operatorname{Var}[\hat{\theta}_e \mid \theta_0, \sigma^2] = \frac{\sigma^2}{n} \] Bayesian estimator variance:

\[ \operatorname{Var}[\hat{\theta}_b \mid \theta_0, \sigma^2] = w^2 \frac{\sigma^2}{n} < \frac{\sigma^2}{n} \] The Bayesian estimator is less variable than the sample mean.

To measure overall accuracy, we use the mean squared error (MSE), which combines both bias and variance:

\[ \begin{aligned} \operatorname{MSE}\left[\hat{\theta} \mid \theta_{0}\right] &= \mathrm{E}\left[\left(\hat{\theta} - \theta_{0}\right)^2 \mid \theta_{0}\right] \\ &= \mathrm{E}\left[\left(\hat{\theta} - m + m - \theta_{0}\right)^2 \mid \theta_{0}\right] \\ &= \mathrm{E}\left[(\hat{\theta} - m)^2 \mid \theta_{0}\right] + 2\mathrm{E}\left[(\hat{\theta} - m)(m - \theta_{0}) \mid \theta_{0}\right] + \mathrm{E}\left[(m - \theta_{0})^2 \mid \theta_{0}\right]. \end{aligned} \]

Here \(m = \mathrm{E}[\hat{\theta} \mid \theta_0]\) is the mean of the sampling distribution, so the middle (cross) term vanishes, and this is equivalent to:

\[ \operatorname{MSE}[\hat{\theta}\mid \theta_0] = \operatorname{Var}[\hat{\theta}\mid \theta_0] + \text{Bias}^2[\hat{\theta}\mid \theta_0] \]

Recall the variance comparison:

\[ \begin{aligned} \operatorname{Var}\left[\hat{\theta}_{e} \mid \theta=\theta_{0}, \sigma^{2}\right] &= \frac{\sigma^{2}}{n}, \quad \text{whereas} \\ \operatorname{Var}\left[\hat{\theta}_{b} \mid \theta=\theta_{0}, \sigma^{2}\right] &= w^{2} \times \frac{\sigma^{2}}{n} < \frac{\sigma^{2}}{n}, \end{aligned} \]

Sample mean:   \[ \operatorname{MSE}[\hat{\theta}_e \mid \theta_0] = \frac{\sigma^2}{n} \]

Bayesian estimator:  

\[ \operatorname{MSE}[\hat{\theta}_b \mid \theta_0] = w^2 \frac{\sigma^2}{n} + (1-w)^2 (\mu_0 - \theta_0)^2 \]

  • First term: less than the sample mean’s variance (because \(w < 1\))
  • Second term: squared bias (penalty if prior is wrong)

When is the Bayesian estimator better? If your prior mean \(\mu_0\) is “close enough” to the true mean \(\theta_0\), the Bayesian estimator has lower MSE than the sample mean, i.e. it is better on average (lower squared error). Precisely, this holds when:   \[ (\mu_0 - \theta_0)^2 < \frac{\sigma^2}{n} \frac{1+w}{1-w} \]

A good prior guess (\(\mu_0\) close to \(\theta_0\)) reduces the risk (MSE) by shrinking the estimates toward the prior. This can outperform the frequentist approach if the prior information is good. In the bias-variance tradeoff, adding a little bias can reduce variance enough to make the overall MSE smaller. If the prior is very wrong, the MSE will end up higher. As the sample size \(n\) increases, \(w \to 1\), so the estimator is dominated by the data.

If there is no reliable prior, use the sample mean. Otherwise, use the Bayesian estimator.
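A minimal simulation sketch of this comparison, with hypothetical true values and prior:

```r
set.seed(1)
theta0 <- 0; sigma <- 1; n <- 10   # hypothetical true mean, sd, and sample size
mu0 <- 0.5; kappa0 <- 5            # hypothetical prior mean and prior sample size
w <- n / (kappa0 + n)
ybar <- replicate(10000, mean(rnorm(n, theta0, sigma)))   # repeated experiments
mean((ybar - theta0)^2)                       # MSE of sample mean, approx sigma^2/n = 0.1
mean((w * ybar + (1 - w) * mu0 - theta0)^2)   # MSE of Bayes estimator, smaller here
```

Here \((\mu_0 - \theta_0)^2 = 0.25\) is below the threshold \(\frac{\sigma^2}{n}\frac{1+w}{1-w} = 0.5\), so the Bayes estimator wins on average.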

5.3.9.9 Prior specification based on expectations

The normal model is written in exponential family form. The sufficient statistics for a single observation \(y\) are \(t(y) = (y, y^2)\). The natural parameters are \(\phi = (\theta/\sigma^2, -1/(2\sigma^2))\).

For exponential families, the conjugate prior can be written as:
\[ p(\phi|n_0, t_0) \propto c(\phi)^{n_0} \exp(n_0 t_0^T \phi) \]

where \(t_0 = (t_{01}, t_{02}) = (\mathrm{E}[Y], \mathrm{E}[Y^2])\) are the prior expectations of \(Y\) and \(Y^2\), and \(n_0\) is a “prior sample size” (how strongly you believe the prior).

Reparameterize the prior from the natural parameters \(\phi\) to the familiar mean/variance parameters \(\theta\) and \(\sigma^2\). The prior for \((\theta, \sigma^2)\) becomes:

\[ \begin{aligned} p(\theta, \sigma^2 | n_0, t_0) \propto\ & (\sigma^2)^{-1/2} \exp\left\{ \frac{-n_0(\theta-t_{01})^2}{2\sigma^2} \right\} \\ & \times (\sigma^2)^{-(n_0+5)/2} \exp\left\{ \frac{-n_0(t_{02}-t_{01}^2)}{2\sigma^2} \right\} \end{aligned} \]

  • The first term is a normal density for \(\theta\) given \(\sigma^2\).
  • The second term is an inverse-gamma density for \(\sigma^2\).

To set the prior parameters:

  • \(\mathrm{E}[Y] = \mu_0\) (prior mean for the population)
  • \(\mathrm{E}[\mathrm{Var}(Y|\theta, \sigma^2)] = \sigma_0^2\) (prior variance for the population)
  • Set \(t_{01} = \mu_0\)

For \(t_{02}\):
\[ \begin{aligned} t_{02} &= \mathrm{E}[Y^2] = \mathrm{E}[\sigma^2 + \theta^2] \\ &= \sigma_0^2 + \sigma_0^2/n_0 + \mu_0^2 \\ &= \sigma_0^2(n_0+1)/n_0 + \mu_0^2 \end{aligned} \]

So, \(n_0(t_{02} - t_{01}^2) = (n_0+1)\sigma_0^2\)

The resulting priors are then
\[ \begin{aligned} \theta \mid \sigma^2 &\sim \mathrm{Normal}(\mu_0, \sigma^2/n_0) \\ \sigma^2 &\sim \mathrm{Inv\text{-}Gamma}((n_0+3)/2,\, (n_0+1)\sigma_0^2/2) \end{aligned} \]

If you want a weak prior (low information), set \(n_0 = 1\):

  • \(\theta|\sigma^2 \sim \mathrm{Normal}(\mu_0, \sigma^2)\)
  • \(1/\sigma^2 \sim \mathrm{Gamma}(2, \sigma_0^2)\) or equivalently \(\sigma^2 \sim \mathrm{Inv\text{-}Gamma}(2, \sigma_0^2)\)

We then update the posterior:
Given data \(y_1, \dots, y_n\) the posterior distributions are:   For \(\theta\) given \(\sigma^2\) and data:

\[ \theta|\sigma^2, y_1,\dots,y_n \sim \mathrm{Normal}\left(\frac{\mu_0 + n\bar{y}}{1 + n},\, \frac{\sigma^2}{n+1}\right) \]

For \(\sigma^2\) given data:

\[ \sigma^2|y_1,\dots,y_n \sim \mathrm{Inv\text{-}Gamma}\left(2 + \frac{n}{2},\ \sigma_0^2 + \frac{1}{2}\left[(n-1)s^2 + \frac{n}{n+1}(\bar{y}-\mu_0)^2\right]\right) \]

where \(s^2\) is the sample variance.

5.3.9.10 The normal model on non-normal data

We often use the normal model for inference even when the data are not normally distributed, for convenience: although individual observations are not normal, the sample mean is approximately normal due to the central limit theorem.

References
- Lukacs (1942): shows that a characterizing feature of the normal distribution is that the sample mean and the sample variance are independent (see also Rao (1958))
- White (1982): confidence intervals for the population mean based on the normal model will generally be asymptotically correct

5.3.10 Posterior approximation with the Gibbs sampler.

In one-parameter or two-parameter models, there is often a standard or well-known form for the posterior distribution that makes it possible to sample directly. However, this is not the case for multiparameter models. With many coefficients or hierarchical models, the joint posterior distribution can become complex and hard to sample from because:
- it may not have a closed-form expression
- it might not be from a standard distribution
- it may involve complicated dependencies between parameters

It is often possible, however, to sample from the full conditional distribution of each parameter; when this is the case, we can use the Gibbs sampler. The Gibbs sampler is a type of Markov chain Monte Carlo (MCMC) algorithm: an iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution.

The algorithm works as follows:

  1. Start with initial guesses for all parameters.
  2. Update each parameter in turn by sampling from its full conditional distribution, using the current values of all the other parameters.
  3. Repeat this process many times.

Through the iterations, the sequence of parameter values forms a Markov chain. Under certain conditions, the distribution of these parameter values will converge to the joint posterior distribution of interest.

This algorithm works because the Markov chain has the correct stationary distribution (the target posterior). Even though each step only samples from one parameter’s conditional distribution, the process as a whole explores the joint space correctly.

Classic cases are the normal models with conjugate or semiconjugate priors.

5.3.10.1 A semiconjugate prior distribution

Modeling data \(Y_1, \ldots, Y_n\) as i.i.d. observations from a normal distribution with mean \(\theta\) and variance \(\sigma^2\):   \[ Y_i \sim \text{Normal}(\theta, \sigma^2) \]

Conjugate Priors
Normal conjugate prior: \[ p(\theta \mid \sigma^2) = \text{Normal}\left(\mu_0, \frac{\sigma^2}{\kappa_0}\right) \] \[ 1/\sigma^2 \sim \text{Gamma}\left(\nu_0/2, \nu_0 \sigma_0^2/2\right) \] The prior variance of \(\theta\) depends on the unknown variance \(\sigma^2\), and \(\mu_0\) represents the mean of \(\kappa_0\) prior pseudo-observations with variance \(\sigma^2\).

  • clean and easy to calculate with

However, the dependence between \(\theta\) and \(\sigma^2\) may not be desirable, and you may want the prior for the mean to be independent of the uncertainty about the variance.

A semiconjugate prior breaks dependence by specifying independent priors for \(\theta\) and \(\sigma^2\):
\[ \theta \sim \text{Normal}(\mu_0, \tau_0^2) \]

  • The prior variance \(\tau_0^2\) for \(\theta\) is fixed and does not depend on \(\sigma^2\)
  • Therefore, the joint prior is \(p(\theta, \sigma^2) = p(\theta) \times p(\sigma^2)\)
  • This adds model flexibility

Posterior calculations given \(Y_1, \ldots, Y_n\) with likelihood \(Y_i \sim \text{Normal}(\theta, \sigma^2)\):  

  • conditional posterior \(p(\theta \mid \sigma^2, y_1, \ldots, y_n)\) is normal:
    \[ \theta \mid \sigma^2, y_1, \ldots, y_n \sim \text{Normal}(\mu_n, \tau_n^2) \] \[ \mu_n = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2} \quad \text{and} \quad \tau_n^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1} \]  

  • The marginal posterior for \(\sigma^2\) is not a standard distribution.

Conjugate vs. Semiconjugate Posterior Sampling

Conjugate case (\(\tau_0^2 \propto \sigma^2\)):

  • Posterior for \(\sigma^2\) is inverse-gamma.
  • Posterior for \(\theta \mid \sigma^2, \text{data}\) is normal.
  • You can sample directly: draw \(\sigma^2\) from its inverse-gamma posterior and then \(\theta\) from its normal conditional distribution.

Semiconjugate case (\(\tau_0^2\) fixed):
- Posterior for \(\theta \mid \sigma^2, \text{data}\) is still normal.
- But the marginal posterior for \(\sigma^2\) is not a gamma or inverse-gamma; it is more complicated.
- This makes direct sampling or analytic integration more challenging.

With semiconjugate priors you gain flexibility by being able to specify the mean and variance priors separately; sometimes this is also more realistic.

The computation is harder as the marginal for \(\sigma^2\) is not a standard form.

5.3.10.2 Discrete approximations

We want to approximate the joint posterior distribution of our parameters \(\theta\) and \(\tilde{\sigma}^2\) (where \(\tilde{\sigma}^2 = 1/\sigma^2\), the precision), given observed data \(y_1, \ldots, y_n\). The posterior is:  

\[ p(\theta, \tilde{\sigma}^2 \mid y_1, \ldots, y_n) \]

For many models, this posterior does not have a simple, closed form from which we can sample directly or compute probabilities exactly. However, when you have a small number of parameters, a direct approach to understanding the posterior distribution is straightforward and practical: replace the continuous and potentially complicated posterior with a set of probabilities on a grid of parameter values. Instead of formulas, we use tables of numbers that approximate the posterior probabilities at different points.

From the joint distribution, we know from Bayes’ theorem: \[ p(\theta, \tilde{\sigma}^2 \mid y_1, \ldots, y_n) = \frac{p(\theta, \tilde{\sigma}^2, y_1, \ldots, y_n)}{p(y_1, \ldots, y_n)} \]

But the denominator doesn’t depend on the parameters. So, relative posterior probabilities for different parameter values can be computed directly from the joint distribution in the numerator.

This means, for any two sets of parameter values \((\theta_1, \tilde{\sigma}_1^2)\) and \((\theta_2, \tilde{\sigma}_2^2)\):

\[ \frac{p(\theta_1, \tilde{\sigma}_1^2 \mid y)}{p(\theta_2, \tilde{\sigma}_2^2 \mid y)} = \frac{p(\theta_1, \tilde{\sigma}_1^2, y)}{p(\theta_2, \tilde{\sigma}_2^2, y)} \]

We can then compute the joint distribution from:

  • The prior on \(\theta\): \(\text{dnorm}(\theta, \mu_0, \tau_0)\)
  • The prior on \(\tilde{\sigma}^2\): \(\text{dgamma}(\tilde{\sigma}^2, \nu_0/2, \nu_0 \sigma_0^2/2)\)
  • The likelihood: \(\prod_{i=1}^n \text{dnorm}(y_i, \theta, 1/\sqrt{\tilde{\sigma}^2})\)

So:   \[ p(\theta, \tilde{\sigma}^2, y_1, ..., y_n) = \text{dnorm}(\theta, \mu_0, \tau_0) \times \text{dgamma}(\tilde{\sigma}^2, \nu_0/2, \nu_0 \sigma_0^2/2) \times \prod_{i=1}^n \text{dnorm}(y_i, \theta, 1/\sqrt{\tilde{\sigma}^2}) \]

Matrices: Discrete Grid Construction
Because the posterior is difficult to compute analytically, we approximate it discretely:

  • Choose a grid of possible values for \(\theta\): \(\{\theta_1, ..., \theta_G\}\)
  • Choose a grid of possible values for \(\tilde{\sigma}^2\): \(\{\tilde{\sigma}_1^2, ..., \tilde{\sigma}_H^2\}\)

For each combination \((\theta_k, \tilde{\sigma}_l^2)\), compute the joint probability \(p(\theta_k, \tilde{\sigma}_l^2, y_1, ..., y_n)\).

Next, normalize to get posterior probabilities. To get a valid joint posterior probability at each grid point, normalize these values so they sum to 1:
\[ p_D(\theta_k, \tilde{\sigma}_l^2 \mid y_1, ..., y_n) = \frac{p(\theta_k, \tilde{\sigma}_l^2, y_1, ..., y_n)}{\sum_{g=1}^G \sum_{h=1}^H p(\theta_g, \tilde{\sigma}_h^2, y_1, ..., y_n)} \] This is the discrete approximation to the true posterior.

Marginalizing to get marginals: to get the marginal posterior for \(\theta_k\) (across all \(\tilde{\sigma}^2\)): \[ p_D(\theta_k \mid y_1, ..., y_n) = \sum_{h=1}^H p_D(\theta_k, \tilde{\sigma}_h^2 \mid y_1, ..., y_n) \] And similarly for \(\tilde{\sigma}_l^2\).
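A minimal sketch of the grid approximation in R, with hypothetical data and prior values:

```r
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)  # hypothetical data
mu0 <- 1.9; tau0 <- 0.95; nu0 <- 1; s20 <- 0.01               # hypothetical prior
G <- 100; H <- 100
theta.grid <- seq(1.5, 2.0, length = G)     # grid for theta
prec.grid <- seq(1.75, 175, length = H)     # grid for the precision
post.grid <- matrix(nrow = G, ncol = H)
for (g in 1:G) {
  for (h in 1:H) {
    post.grid[g, h] <- dnorm(theta.grid[g], mu0, tau0) *
      dgamma(prec.grid[h], nu0 / 2, s20 * nu0 / 2) *
      prod(dnorm(y, theta.grid[g], 1 / sqrt(prec.grid[h])))
  }
}
post.grid <- post.grid / sum(post.grid)   # normalize over the whole grid
theta.marg <- apply(post.grid, 1, sum)    # marginal posterior for theta
```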

There are limits for discrete approximations. For two parameters, a fine grid (e.g., 100 values for each) is manageable. For more parameters, the number of grid points grows exponentially (aka the “curse of dimensionality”): for \(p\) parameters, \(100^p\) grid points are needed. Thus, discrete grid approximation is practical only for models with very few parameters.

5.3.10.3 Sampling from the conditional distributions

Sampling from the joint posterior distribution of \(\theta\) and \(\sigma^2\) given the data, \(p(\theta, \sigma^2 | y_1, \ldots, y_n)\), is difficult because, with the semiconjugate prior, the joint distribution is not in a standard form.

Based on the Gibbs sampler, instead of sampling from the joint distribution directly, you can sample alternately from the conditional distributions. This allows you to generate samples from the joint posterior. Overall, this also allows Bayesian inference to estimate parameters even when direct analytical or discrete-grid methods are infeasible.

  • \(p(\theta | \sigma^2, y_1, \ldots, y_n)\)
  • \(p(\sigma^2 | \theta, y_1, \ldots, y_n)\)

Here’s an example of deriving the conditional for precision \(\tilde{\sigma}^2\). Suppose you know the value of \(\theta\). Then, the conditional distribution of \(\tilde{\sigma}^2\) (where \(\tilde{\sigma}^2 = 1/\sigma^2\)) given \(\theta\) and the data is:  

\[ p(\tilde{\sigma}^2 | \theta, y_1, ..., y_n) \propto p(y_1, ..., y_n | \theta, \tilde{\sigma}^2) \, p(\tilde{\sigma}^2) \]

Plug in the normal likelihood and gamma prior for \(\tilde{\sigma}^2\):

\[ \propto (\tilde{\sigma}^2)^{n/2} \exp\left(-\tilde{\sigma}^2 \sum_{i=1}^n (y_i - \theta)^2 / 2 \right) \times (\tilde{\sigma}^2)^{\nu_0/2 - 1} \exp(-\tilde{\sigma}^2 \nu_0 \sigma_0^2 / 2) \]

Combine exponents and powers and you get the kernel of a gamma distribution for \(\tilde{\sigma}^2\) (or, equivalently, an inverse-gamma for \(\sigma^2\)):
\[ \propto (\tilde{\sigma}^2)^{(\nu_0 + n)/2 - 1} \exp\left(-\tilde{\sigma}^2 \left[\nu_0 \sigma_0^2 + \sum_{i=1}^n (y_i - \theta)^2 \right] / 2\right) \]

The parameters of the updated posterior are:
Let:
- \(\nu_n = \nu_0 + n\)
- \(\sigma_n^2(\theta) = \frac{1}{\nu_n} \left[ \nu_0 \sigma_0^2 + n s_n^2(\theta) \right]\)
- \(s_n^2(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \theta)^2\)

Therefore:  

\[ \sigma^2 | \theta, y_1, ..., y_n \sim \text{Inv-Gamma}\left(\frac{\nu_n}{2}, \frac{\nu_n \sigma_n^2(\theta)}{2}\right) \]

The Gibbs sampling procedure can now use the full conditional distributions to sample from the joint posterior:

  1. Initialize with a value for \(\sigma^{2(1)}\) (or \(\theta^{(1)}\)). You can use a random value or a value informed by the data (e.g., the sample variance or mean).
  2. Repeat for many iterations:
  • Sample \(\theta^{(t)}\) from \(p(\theta | \sigma^{2(t-1)}, y_1, ..., y_n)\) (this is a normal distribution).
  • Sample \(\sigma^{2(t)}\) from \(p(\sigma^2 | \theta^{(t)}, y_1, ..., y_n)\) (this is an inverse-gamma distribution).

Each pair \((\theta^{(t)}, \sigma^{2(t)})\) after a burn-in period is (approximately) a sample from the joint posterior.
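A minimal Gibbs sampler sketch for this model in R, with hypothetical data and priors:

```r
set.seed(1)
y <- rnorm(20, 1.8, 0.2)       # hypothetical data
n <- length(y); ybar <- mean(y)
mu0 <- 1.9; tau20 <- 0.95^2    # hypothetical prior for theta
nu0 <- 1; s20 <- 0.01          # hypothetical prior for sigma^2
S <- 5000
PHI <- matrix(nrow = S, ncol = 2)
s2 <- var(y)                   # starting value for sigma^2
for (s in 1:S) {
  # sample theta | sigma^2, y (normal full conditional)
  t2n <- 1 / (1 / tau20 + n / s2)
  mun <- (mu0 / tau20 + n * ybar / s2) * t2n
  theta <- rnorm(1, mun, sqrt(t2n))
  # sample sigma^2 | theta, y (inverse-gamma full conditional)
  nun <- nu0 + n
  s2n <- (nu0 * s20 + sum((y - theta)^2)) / nun
  s2 <- 1 / rgamma(1, nun / 2, nun * s2n / 2)
  PHI[s, ] <- c(theta, s2)
}
```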

  • Even though you cannot sample from the joint posterior directly, you can sample from each parameter’s conditional distribution.
  • By alternately sampling from these conditionals (Gibbs sampling), you generate a sequence whose stationary distribution is the desired joint posterior.

5.3.10.4 Gibbs sampling

  • You have data \(y_1, ..., y_n\)
  • You have two unknowns: the mean \(\theta\) and variance \(\sigma^2\).
  • You want to sample from the joint posterior \(p(\theta, \sigma^2 \mid y_1, ..., y_n)\).
  • You know how to sample from the full conditional distributions:
    • \(p(\theta \mid \sigma^2, y_1, ..., y_n)\)
    • \(p(\sigma^2 \mid \theta, y_1, ..., y_n)\)

The Gibbs sampler generates samples from the joint posterior by alternating between the full conditionals:
Given current values \(\theta^{(s)}\) and \(\sigma^{2(s)}\):

  1. Sample \(\theta\) given \(\sigma^2\):   \[ \theta^{(s+1)} \sim p(\theta \mid \sigma^{2(s)}, y_1, ..., y_n) \]
  2. Sample \(\sigma^2\) given new \(\theta\):
    \[ \sigma^{2(s+1)} \sim p(\sigma^2 \mid \theta^{(s+1)}, y_1, ..., y_n) \]
  3. Update your state: \((\theta^{(s+1)}, \sigma^{2(s+1)})\).

Repeat this process for many steps; the sequence of samples \(\{ (\theta^{(s)}, \sigma^{2(s)}) \}\) is called a Markov chain, because each new sample depends on the most recent values.

A useful identity for computing \(n s_{n}^{2}(\theta)\) from the precomputed summaries \(\bar{y}\) and \(s^{2}\):

\[\begin{align*} n s_{n}^{2}(\theta) &= \sum_{i=1}^{n} (y_{i} - \theta)^{2} \\ &= \sum_{i=1}^{n} (y_{i} - \bar{y} + \bar{y} - \theta)^{2} \\ &= \sum_{i=1}^{n} \left[ (y_{i} - \bar{y})^{2} + 2(y_{i} - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^{2} \right] \\ &= \sum_{i=1}^{n} (y_{i} - \bar{y})^{2} + 0 + \sum_{i=1}^{n} (\bar{y} - \theta)^{2} \\ &= (n-1)s^{2} + n(\bar{y} - \theta)^{2}. \end{align*}\]

5.3.10.5 General properties of the Gibbs sampler

Vector of parameters \[ \boldsymbol{\phi} = \{\phi_1, \phi_2, \ldots, \phi_p\} \] where you are interested in sampling from their joint posterior distribution (e.g. \(p(\theta, \sigma^2 \mid y_1, \ldots, y_n)\) in the normal model)

The Gibbs sampler runs the following algorithm.
Given the starting point \(\phi^{(0)} = \{\phi_1^{(0)}, \ldots, \phi_p^{(0)}\}\), for each iteration \(s = 1, 2, ..., S\), sample each parameter from its full conditional distribution:

  1. Sample the first parameter   \[ \phi_1^{(s)} \sim p\left(\phi_1 \mid \phi_2^{(s-1)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)}\right) \]   using the most recent values for all parameters
  2. Sample the second parameter
    \[ \phi_2^{(s)} \sim p\left(\phi_2 \mid \phi_1^{(s)}, \phi_3^{(s-1)}, \ldots, \phi_p^{(s-1)}\right) \] using the value of \(\phi_1\) you just sampled and the most recent values for the rest.
  3. Continue for all parameters
    \[ \phi_p^{(s)} \sim p\left(\phi_p \mid \phi_1^{(s)}, \ldots, \phi_{p-1}^{(s)}\right) \]

New vector generated:
\(\phi^{(s)} = \{\phi_1^{(s)}, ..., \phi_p^{(s)}\}\)

  • The property that the next state depends only on the previous state is called the Markov property, and the sequence is called a Markov chain
  • Under certain mild conditions, the distribution of \(\phi^{(s)}\) will converge to the target posterior \(p(\phi)\), regardless of where you started.

By the Monte Carlo approximation, as \(s \to \infty\) the distribution of the samples approaches the target distribution: for any region \(A\), \[ \text{Pr}(\phi^{(s)} \in A) \to \int_A p(\phi) d\phi \]

For a function of the parameter \(g(\phi)\), the sample mean converges on the posterior mean as \(S \to \infty\):

\[ \frac{1}{S} \sum_{s=1}^S g(\phi^{(s)}) \to \mathbb{E}[g(\phi)] = \int g(\phi) p(\phi) d\phi \]

Once you have a large collection of samples \(\{\phi^{(1)}, \ldots, \phi^{(S)}\}\) you can:
- Estimate posterior means, medians, quantiles, credible intervals, etc.
- Approximate posterior probabilities, e.g. \(\Pr(\phi \leq c \mid y)\).

\[ \frac{1}{S} \sum_{s=1}^S \phi^{(s)} \approx \mathbb{E}[\phi|y] \]

\[ \frac{1}{S} \sum_{s=1}^S 1(\phi^{(s)} \leq c) \approx \Pr(\phi \leq c | y) \]

Distinguishing the model from the approximation:
- Model specification: define your likelihood \(p(y|\phi)\) and prior \(p(\phi)\):
\[ p(\phi \mid \mathbf{y}) = \frac{p(\phi)\,p(\mathbf{y} \mid \phi)}{p(\mathbf{y})} = \frac{p(\phi)\,p(\mathbf{y} \mid \phi)}{\int p(\phi)\,p(\mathbf{y} \mid \phi)\, d\phi} \] After this, there is no more modeling or estimation; all that is left is the posterior summary, which is a description of the posterior distribution \(p(\phi|y)\). This information may include posterior means, medians, modes, predictive probabilities, and confidence regions.

  • Monte Carlo and MCMC sampling algorithms are not models. They do not generate more information; they are simply computational approximation tools used to explore and summarize a given posterior \(p(\phi|y)\).

5.3.10.6 Introduction to MCMC diagnostics

The goal of Monte Carlo (MC) or Markov chain Monte Carlo (MCMC) approximation is to estimate expectations with respect to a target probability distribution \(p(\phi)\). Suppose you want to compute the expected value of some function \(g(\phi)\) under \(p(\phi)\): \[ \int g(\phi)p(\phi) d\phi \] However, this integral is often analytically intractable because \(p(\phi)\) may have a complicated form or the parameter space may be high-dimensional. Instead of direct calculation, you obtain a sequence of samples \(\{\phi^{(1)}, \ldots, \phi^{(S)}\}\) and approximate the expectation with the empirical average: \[ \frac{1}{S} \sum_{s=1}^{S} g(\phi^{(s)}) \approx \int g(\phi) p(\phi) d\phi \] This means: for any function \(g\) of interest, you can approximate its expected value under \(p(\phi)\) by averaging \(g\) over the samples.

What is Required for an Accurate Approximation?

For this method to be effective for many different functions \(g\), the empirical distribution of the samples \(\{\phi^{(1)}, \ldots, \phi^{(S)}\}\) must be a good approximation to the target distribution \(p(\phi)\). That is, the histogram (or kernel density estimate, or empirical CDF, etc.) of the simulated values should closely resemble the true shape of \(p(\phi)\).

Generating a Sequence: Monte Carlo vs. Markov Chain Monte Carlo

Monte Carlo (MC) simulation and Markov chain Monte Carlo (MCMC) simulation are two different ways to generate a sequence of samples.

Monte Carlo Simulation

In “ordinary” MC, each sample \(\phi^{(s)}\) is drawn independently from the target distribution \(p(\phi)\). This means:

  • Each sample is uncorrelated with the others.
  • For any subset \(A\) of the parameter space, the probability that \(\phi^{(s)}\) falls in \(A\) is exactly \(\int_A p(\phi) d\phi\) for every \(s\).
  • The set of samples, collectively, automatically represents the distribution \(p(\phi)\) perfectly in a probabilistic sense.
  • This is considered the “gold standard” of sampling.  

Markov Chain Monte Carlo (MCMC) Simulation

In MCMC, the samples \(\{\phi^{(1)}, \ldots, \phi^{(S)}\}\) are not independent; rather, they are generated by a Markov process, where each sample depends on the previous one.
- For MCMC, the limiting distribution of the chain is \(p(\phi)\), meaning that as \(s \rightarrow \infty\), the distribution of \(\phi^{(s)}\) approaches \(p(\phi)\).
- This is expressed mathematically as: \[ \lim_{s \to \infty} \operatorname{Pr}(\phi^{(s)} \in A) = \int_A p(\phi) d\phi \]
- For any finite \(s\), the sample distribution may not yet match \(p(\phi)\) exactly, especially early in the chain.

Example: Mixture Model with Discrete and Continuous Variables

Let’s consider the following example, which helps illustrate the differences between MC and MCMC.

Target Distribution:
Two variables: \(\phi = (\delta, \theta)\)

  • \(\delta\) is a discrete variable: \(\delta \in \{1,2,3\}\)
  • \(\theta\) is a continuous variable: \(\theta \in \mathbb{R}\)

The joint distribution is specified as:
- \(\Pr(\delta = 1) = 0.45\)
- \(\Pr(\delta = 2) = 0.10\)
- \(\Pr(\delta = 3) = 0.45\)
- \(p(\theta|\delta) = \text{Normal}(\mu_\delta, \sigma_\delta^2)\) with \(\mu_1 = -3, \mu_2 = 0, \mu_3 = 3\) and \(\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 1/3\)

This is a mixture of three normal distributions with means at -3, 0, and +3. The marginal distribution for \(\theta\) is:   \[ p(\theta) = \sum_{\delta=1}^3 p(\theta|\delta)p(\delta) \] which is a mixture with three “bumps” at the group means.

Monte Carlo Sampling (Independent Samples)

To generate independent Monte Carlo samples from the joint distribution \(p(\delta, \theta)\):

  1. Sample \(\delta\) from its marginal distribution.
  2. Given \(\delta\), sample \(\theta\) from \(p(\theta|\delta)\).
  3. The pair \((\delta, \theta)\) is a sample from \(p(\delta, \theta)\).

If you repeat this process 1,000 times, you get 1,000 independent samples. Plotting a histogram of the \(\theta\) values gives a good approximation to the true marginal \(p(\theta)\), as shown in Figure 6.4.
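A minimal sketch of this independent sampler in R:

```r
set.seed(1)
S <- 1000
mu <- c(-3, 0, 3); w <- c(0.45, 0.10, 0.45); s2 <- 1/3
delta.mc <- sample(1:3, S, replace = TRUE, prob = w)  # step 1: delta from its marginal
theta.mc <- rnorm(S, mu[delta.mc], sqrt(s2))          # step 2: theta | delta
hist(theta.mc)   # approximates the three-bump marginal p(theta)
```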

Gibbs Sampling (Dependent Samples, MCMC)

In contrast, to sample from the same distribution using a Gibbs sampler:

  • Alternately sample \(\theta\) and \(\delta\) from their full conditional distributions:
    • Sample \(\theta\) from \(p(\theta | \delta)\) (already specified as normal).
    • Sample \(\delta\) from \(p(\delta | \theta)\), which can be computed by Bayes’ Rule:

\[ \operatorname{Pr}(\delta = d | \theta) = \frac{\operatorname{Pr}(\delta = d) \times \operatorname{dnorm}(\theta, \mu_d, \sigma_d)}{\sum_{d=1}^3 \operatorname{Pr}(\delta = d) \times \operatorname{dnorm}(\theta, \mu_d, \sigma_d)} \]

The sequence of pairs \((\delta^{(s)}, \theta^{(s)})\) forms a Markov chain.
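A minimal sketch of the corresponding Gibbs sampler in R:

```r
set.seed(1)
S <- 1000
mu <- c(-3, 0, 3); w <- c(0.45, 0.10, 0.45); s2 <- 1/3
THETA <- DELTA <- numeric(S)
theta <- 0                                 # arbitrary starting value
for (s in 1:S) {
  p.d <- w * dnorm(theta, mu, sqrt(s2))    # Pr(delta = d | theta), unnormalized
  delta <- sample(1:3, 1, prob = p.d)      # sample() normalizes the probabilities
  theta <- rnorm(1, mu[delta], sqrt(s2))   # theta | delta
  THETA[s] <- theta; DELTA[s] <- delta
}
```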

Problems with MCMC: Autocorrelation and “Stickiness”
When you run the Gibbs sampler, you might notice problems:
- The histogram of \(\theta\) values after 1,000 MCMC samples (Figure 6.5) does not match \(p(\theta)\) well.
- Some regions (e.g., around -3) are underrepresented.
- Other regions (e.g., around 0 or +3) are overrepresented.

The traceplot (plot of \(\theta\) vs iteration) shows that the chain gets “stuck” in certain regions, not moving freely between the three modes.
- This is called autocorrelation: values of \(\theta\) are highly correlated with their previous values in the sequence.
- If \(\theta\) is near 0, it is likely to stay near 0 for many steps because the chain keeps sampling \(\delta = 2\) (which has mean 0), and so on.
- The Markov chain does not mix well; it does not move rapidly between different modes.

Is MCMC Guaranteed to Work?

Yes, in theory the Gibbs sampler is guaranteed to eventually provide samples from the true distribution \(p(\theta)\), but “eventually” can mean a very long time in practice, especially if the chain moves slowly (poor mixing, high autocorrelation).
- Figure 6.6 shows that after 10,000 iterations, the empirical distribution is much closer to the true marginal, but may still not be perfect.
- The traceplot with 10,000 samples shows better movement between regions, but “stickiness” can still be present.

Interpreting the MCMC Chain as a Particle Trajectory
- You can think of the sequence \(\{\phi^{(1)}, ..., \phi^{(S)}\}\) as a particle moving around the parameter space.
- In terms of MCMC integration, the amount of time the particle spends in any set A is proportional to the target probability \(\int_A p(\phi) d\phi\).

Suppose the parameter space has three regions of interest, \(A_1, A_2, A_3\):
- If \(\Pr(A_2) < \Pr(A_1) \approx \Pr(A_3)\), we want the chain to spend less time in \(A_2\) and roughly equal time in \(A_1\) and \(A_3\).
- If the chain starts in \(A_2\), it might take a long time to “escape” to the higher probability regions. This is why burn-in and sufficient iterations are important.

Stationarity and Convergence
- Stationarity (or convergence): the chain has “forgotten” its starting value and samples from the stationary distribution. In essence, the chain’s distribution does not change as you move along the chain and matches the target distribution \(p(\phi)\).
- If the chain starts in a high probability region, stationarity is achieved quickly.
- If not, it may take a long time, and you may not be able to tell whether your chain has converged unless you use diagnostics.
- Assessing convergence is difficult because you do not know \(p(\phi)\) exactly. You can check for stationarity by comparing samples from different parts of the chain.
- For simple models (like the normal model with semiconjugate priors), stationarity is quickly achieved. For complex, highly parameterized models, autocorrelation can be high, and mixing can be poor, so stationarity takes longer and is more difficult to diagnose.
- Thus, it is often necessary to run the MCMC sampler for many iterations to ensure the samples you collect are representative of \(p(\phi)\)

Mixing and Speed of Movement in Parameter Space
- Mixing is a term used to describe how quickly the MCMC chain explores the parameter space. It is closely related to autocorrelation.
- Perfect mixing occurs when each new sample is completely independent of the previous sample (as in independent Monte Carlo sampling); there is zero autocorrelation.
- In MCMC, samples are correlated, and if the chain moves slowly between regions, autocorrelation is high and mixing is poor.
- Poor mixing means the chain may take a long time to move between areas of high probability, reducing the efficiency of the MCMC estimation.

Variance of the Sample Mean: Monte Carlo vs. MCMC

Monte Carlo (Independent Sampling)

Suppose you want to approximate the expectation:
\[ \mathrm{E}[\phi] = \int \phi p(\phi) d\phi = \phi_0 \] using the empirical mean:   \[ \bar{\phi} = \frac{1}{S} \sum_{s=1}^S \phi^{(s)} \] where the \(\phi^{(s)}\) are independent samples from \(p(\phi)\). The variance of \(\bar{\phi}\)​, denoted\(\mathrm{Var}_{\mathrm{MC}}[\bar{\phi}]\), measures how much \(\bar{\phi}\) fluctuates around the true value \(\phi_0\) if you repeated the MC approximation many times: \[ \mathrm{Var}_{\mathrm{MC}}[\bar{\phi}] = \mathrm{E}[(\bar{\phi} - \phi_0)^2] = \frac{\mathrm{Var}[\phi]}{S} \] where   \[ \mathrm{Var}[\phi] = \int \phi^2 p(\phi) d\phi - \phi_0^2 \]

The square root of \(\mathrm{Var}_{\mathrm{MC}}[\bar{\phi}]\) is the Monte Carlo standard error. This tells you how close \(\bar{\phi}\)​ is expected to be to \(\phi_0\)​. If you repeat the MC procedure many times, about 95% of the time the true value \(\phi_0\)​ will be within the interval \(\bar{\phi} \pm 2\sqrt{\mathrm{Var}_{\mathrm{MC}}[\bar{\phi}]}\)​. You can make this interval as narrow as you want by increasing S (the number of samples), because variance decreases as \(1/S\).

Markov Chain Monte Carlo (MCMC, Dependent Sampling)

In MCMC, the samples \(\phi^{(s)}\) are not independent; they are correlated, especially for nearby \(s\).
- This correlation (autocorrelation) increases the variance of the empirical mean compared to independent MC.
- Assuming the chain has reached stationarity, the variance of the MCMC estimate is: \[ \mathrm{Var}_{\mathrm{MCMC}}[\bar{\phi}] = \mathrm{E}[(\bar{\phi} - \phi_0)^2] \] Let’s expand the algebra:   \[ \bar{\phi} = \frac{1}{S} \sum_{s=1}^{S} \phi^{(s)} \] \[ \mathrm{Var}_{\mathrm{MCMC}}[\bar{\phi}] = \mathrm{E}\left[ \left( \bar{\phi} - \phi_0 \right)^2 \right] = \mathrm{E}\left[ \left( \frac{1}{S} \sum_{s=1}^{S} (\phi^{(s)} - \phi_0) \right)^2 \right] \] Expanding the square:   \[ = \frac{1}{S^2} \mathrm{E} \left[ \sum_{s=1}^S (\phi^{(s)} - \phi_0)^2 + \sum_{s \neq t} (\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0) \right] \] \[ = \frac{1}{S^2} \sum_{s=1}^S \mathrm{E}\left[ (\phi^{(s)} - \phi_0)^2 \right] + \frac{1}{S^2} \sum_{s \neq t} \mathrm{E}\left[ (\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0) \right] \] The first term is the same as the MC variance:   \[ \mathrm{Var}_{\mathrm{MC}}[\bar{\phi}] = \frac{1}{S} \mathrm{Var}[\phi] \] The second term is new: it involves the correlation between different samples in the chain. \[ \mathrm{Var}_{\mathrm{MCMC}}[\bar{\phi}] = \mathrm{Var}_{\mathrm{MC}}[\bar{\phi}] + \frac{1}{S^2} \sum_{s \neq t} \mathrm{E}[ (\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0) ] \]  

  • This second sum is zero if the samples are independent (as in MC), but is positive if there is autocorrelation.  
  • The greater the autocorrelation (the more similar nearby samples are), the larger this term will be, and the less efficient the sampler becomes.

Effective Sample Size and Practical Implications
- In practice, we often calculate the effective sample size: the number of independent samples that would give the same variance as our correlated MCMC samples.
- For instance, in the normal model example, the autocorrelation for \(\{\theta^{(1)}, ..., \theta^{(1000)}\}\) is essentially zero, so the effective sample size is 1,000, matching the number of actual samples.
- For \(\sigma^2\), the lag-1 autocorrelation is 0.147, corresponding to an effective sample size of 742. This means that 1,000 MCMC samples with this autocorrelation are about as informative as 742 independent samples.
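A minimal sketch of computing an effective sample size, using the coda package (assumed to be installed) and an artificial chain with lag-1 autocorrelation near 0.147:

```r
library(coda)
set.seed(1)
phi <- as.numeric(arima.sim(list(ar = 0.147), n = 1000))  # chain with lag-1 autocorrelation ~0.147
effectiveSize(as.mcmc(phi))   # roughly 740 to 750, close to the 742 quoted in the text
```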

Summary of the Variance Formulas

MC (Independent):   \[ \mathrm{Var}_{\mathrm{MC}}[\bar{\phi}] = \frac{\mathrm{Var}[\phi]}{S} \]

MCMC (Correlated):
\[ \mathrm{Var}_{\mathrm{MCMC}}[\bar{\phi}] = \mathrm{Var}_{\mathrm{MC}}[\bar{\phi}] + \frac{1}{S^2} \sum_{s \neq t} \mathrm{E}[ (\phi^{(s)} - \phi_0)(\phi^{(t)} - \phi_0) ] \]
- The second term only appears if the samples are correlated (as in MCMC), and increases the variance of your estimate.

Visualizing the Chains
- Traceplots (Fig 6.7) show the sequence of sampled values for parameters like \(\theta\) and \(\sigma^2\) over iterations.
- If the traceplot looks like a “hairy caterpillar,” with the chain moving freely, mixing is good and autocorrelation is low.
- If the traceplot shows long stretches with little movement, mixing is poor and autocorrelation is high.

References
- Geman and Geman (1984): coined the term “Gibbs sampling”
- Besag (1974) and Ripley (1979): related algorithms in spatial statistics
- Gelfand and Smith (1990): general utility of the Gibbs sampler for Bayesian data analysis
- Robert and Casella (2008): historical review of the Gibbs sampler
- Gelman and Rubin (1992), Geweke (1992), Raftery and Lewis (1992), Geyer (1992): MCMC approximation and convergence diagnostics

5.3.11 Multivariate Normal Model

5.3.12 Group comparisons and hierarchical modeling

5.3.13 Linear Regression

5.3.14 Nonconjugate priors and Metropolis-Hastings algorithm

5.3.14.1 Irreducibility, Aperiodicity, and Recurrence

5.3.14.2 Ergodic Theorem