Introduction to Bayesian Inference

Bayesian inference offers a convenient framework for the analysis of missing data, as it draws no distinction between missing values and parameters: both are interpreted as unobserved quantities that are associated with a joint posterior distribution conditional on the observed data. In this section, I review basic concepts of Bayesian inference based on fully observed data, with notation and structure mostly taken from Gelman et al. (2013).

Bayesian Inference for Complete Data

Bayesian inference is the process of fitting a probability model to a set of data $y$ and summarising the results by a probability distribution on the parameters $\theta$ of the model and on unobserved quantities $\tilde{y}$ (e.g. predictions). Bayesian statistical conclusions about $\theta$ (or $\tilde{y}$) are made in terms of probability statements, conditional on the observed data $y$, typically indicated with the notation $p(\theta \mid y)$ or $p(\tilde{y} \mid y)$. Conditioning on the observed data is what distinguishes Bayesian inference from standard statistical approaches, which are instead based on the retrospective evaluation of the procedures used to estimate $\theta$ (or $\tilde{y}$) over the distribution of possible $y$ values conditional on the "true" unknown value of $\theta$.

Bayes’ Rule

In order to make probability statements about $\theta$ given $y$, we start with a model providing a joint probability distribution $p(\theta, y)$. The joint probability mass or density function can be written as the product of two densities that are often referred to as the prior distribution $p(\theta)$ and the sampling distribution $p(y \mid \theta)$, respectively:

$$p(\theta, y) = p(\theta)\,p(y \mid \theta),$$

and conditioning on the observed values of y, using the basic property of conditional probability known as Bayes’ rule, yields the posterior distribution

$$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)\,p(y \mid \theta)}{p(y)},$$

where $p(y) = \sum_{\theta \in \Theta} p(\theta)\,p(y \mid \theta)$ is the sum (or the integral, in the case of continuous $\theta$) over all possible values of $\theta$ in the parameter space $\Theta$. We can work with an unnormalised version of the above equation by omitting the factor $p(y)$, which does not depend on $\theta$ and, given $y$, can be considered as fixed, yielding the unnormalised posterior density

$$p(\theta \mid y) \propto p(\theta)\,p(y \mid \theta),$$

with the purpose of the analysis being to develop the model $p(\theta, y)$ and adequately summarise $p(\theta \mid y)$.
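
As a simple illustration of the unnormalised posterior, the following sketch (not from the source; the Bernoulli data, flat prior and grid values are purely hypothetical) evaluates $p(\theta)\,p(y \mid \theta)$ on a grid of $\theta$ values and normalises it numerically, which amounts to dividing by an approximation of $p(y)$.

```python
import numpy as np

# Hypothetical data: k successes out of n Bernoulli trials
k, n = 7, 10

theta = np.linspace(0.001, 0.999, 999)          # grid over the parameter space
prior = np.ones_like(theta)                     # p(theta): flat prior on (0, 1)
likelihood = theta**k * (1 - theta)**(n - k)    # p(y | theta)

unnorm_post = prior * likelihood                # p(theta) p(y | theta)
posterior = unnorm_post / np.trapz(unnorm_post, theta)   # normalise: divide by ~p(y)
```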

Univariate Normal Example (known variance)

Let $y = (y_1, \dots, y_n)$ denote an independent and identically distributed sample of $n$ units, which are assumed to come from a Normal distribution with mean $\mu$ and variance $\sigma^2$, whose sampling density function is

$$p(y \mid \mu) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}\right),$$

where for the moment we assume the variance $\sigma^2$ to be known (i.e. constant). Consider now a prior probability distribution for the mean parameter $p(\mu)$ which belongs to the family of conjugate prior densities, for example a Normal distribution, parameterised in terms of a prior mean $\mu_0$ and variance $\tau_0^2$. Its prior density function is

$$p(\mu) = \frac{1}{\sqrt{2\pi\tau_0^2}} \exp\left(-\frac{1}{2}\frac{(\mu - \mu_0)^2}{\tau_0^2}\right),$$

under the assumption that the hyperparameters $\mu_0$ and $\tau_0^2$ are known. The conjugate prior density implies that the posterior distribution for $\mu$ (with $\sigma^2$ assumed constant) belongs to the same family of distributions as the prior, that is Normal, but some algebra is required to reveal its specific form. In particular, the posterior density is

$$p(\mu \mid y) = \frac{p(\mu)\,p(y \mid \mu)}{p(y)} \propto \exp\left(-\frac{1}{2}\left[\frac{(\mu - \mu_0)^2}{\tau_0^2} + \sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}\right]\right).$$

Expanding the components, collecting terms and completing the square in $\mu$ gives

$$p(\mu \mid y) \propto \exp\left(-\frac{(\mu - \mu_1)^2}{2\tau_1^2}\right),$$

that is, the posterior distribution of $\mu$ given $y$ is Normal with posterior mean $\mu_1$ and variance $\tau_1^2$, where

$$\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \qquad \text{and} \qquad \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}.$$

We can see that the posterior distribution depends on $y$ only through the sample mean $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$, which is a sufficient statistic in this model. When working with Normal distributions, the inverse of the variance plays a prominent role and is called the precision; from the above expressions it can be seen that, for normal data and a normal prior, the posterior precision $\frac{1}{\tau_1^2}$ equals the sum of the prior precision $\frac{1}{\tau_0^2}$ and the sampling precision $\frac{n}{\sigma^2}$. Thus, when $n$ is large, the posterior distribution is largely determined by $\sigma^2$ and the sample mean $\bar{y}$ rather than by the corresponding prior parameters. In the specific case where $\tau_0^2 = \sigma^2$, the prior has the same weight as one extra observation with the value $\mu_0$ and, as $n \to \infty$, we have that $p(\mu \mid y) \approx \text{N}(\mu \mid \bar{y}, \sigma^2/n)$.
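
A minimal sketch of this conjugate update, with purely illustrative values for the known variance and the prior hyperparameters, is shown below: the posterior precision is obtained by adding the prior and sampling precisions, and the posterior mean is the precision-weighted average of $\mu_0$ and $\bar{y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                        # known sampling variance (illustrative)
mu0, tau0_sq = 0.0, 10.0            # hypothetical prior mean and variance
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)   # simulated data

n, ybar = len(y), y.mean()
prec1 = 1 / tau0_sq + n / sigma2    # posterior precision 1 / tau_1^2
tau1_sq = 1 / prec1
mu1 = tau1_sq * (mu0 / tau0_sq + n * ybar / sigma2)       # posterior mean
print(mu1, tau1_sq)
```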

Univariate Normal Example (unknown variance)

For $p(y \mid \mu, \sigma^2) = \text{N}(y \mid \mu, \sigma^2)$ with $\mu$ known and $\sigma^2$ unknown, the sampling distribution for a vector $y$ of $n$ units is

$$p(y \mid \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n \frac{(y_i - \mu)^2}{\sigma^2}\right),$$

with the corresponding conjugate prior for $\sigma^2$ being the Inverse-Gamma distribution $\Gamma^{-1}(\alpha, \beta)$ with density function

$$p(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} \exp\left(-\frac{\beta}{\sigma^2}\right),$$

indexed by the hyperparameters $\alpha$ and $\beta$. A convenient parameterisation is as a Scaled Inverse-Chi-Squared distribution $\text{Inv-}\chi^2(\nu_0, \sigma_0^2)$ with degrees of freedom and scale parameters $\nu_0$ and $\sigma_0^2$, respectively. This means that the prior on $\sigma^2$ corresponds to the distribution of $\nu_0\sigma_0^2/X$, where $X \sim \chi^2_{\nu_0}$. After some calculations, the resulting posterior for $\sigma^2$ is

$$p(\sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n+\nu_0}{2}+1\right)} \exp\left(-\frac{\nu_0\sigma_0^2 + n v}{2\sigma^2}\right),$$

where $v = \frac{1}{n}\sum_{i=1}^n (y_i - \mu)^2$. This corresponds to saying that

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2\left(\nu_0 + n, \frac{\nu_0\sigma_0^2 + n v}{\nu_0 + n}\right),$$

with scale equal to the degrees-of-freedom-weighted average of the prior and data scales, and degrees of freedom equal to the sum of the prior and data degrees of freedom.
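
The update can be sketched as follows, using the fact (stated above) that $\text{Inv-}\chi^2(\nu, s^2)$ is the distribution of $\nu s^2/X$ with $X \sim \chi^2_\nu$; the prior degrees of freedom and scale below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 1.0                                      # known mean
nu0, sigma0_sq = 5.0, 2.0                     # hypothetical prior df and scale
y = rng.normal(loc=mu, scale=1.5, size=40)    # simulated data

n = len(y)
v = np.mean((y - mu) ** 2)                    # data scale
nu1 = nu0 + n                                 # posterior degrees of freedom
scale1 = (nu0 * sigma0_sq + n * v) / nu1      # posterior scale
# Draws from Inv-chi^2(nu1, scale1) as nu1 * scale1 / X with X ~ chi^2_{nu1}
sigma2_draws = nu1 * scale1 / rng.chisquare(df=nu1, size=10_000)
```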

Univariate Normal Example (unknown mean and variance)

Suppose now that both the mean and variance parameters are unknown such that

$$y \mid \mu, \sigma^2 \sim \text{N}(\mu, \sigma^2),$$

and that the interest is centred on making inference about $\mu$, that is, we seek the marginal posterior distribution of the parameter of interest given the observed data, $p(\mu \mid y)$. This can be derived from the joint posterior density $p(\mu, \sigma^2 \mid y)$ by averaging over all possible values of $\sigma^2$, that is

$$p(\mu \mid y) = \int p(\mu, \sigma^2 \mid y) \, d\sigma^2,$$

or, alternatively, the joint posterior can be factored as the product of the marginal distribution of one parameter and the conditional distribution of the other given the former, with the average then taken over the values of the "nuisance" parameter:

$$p(\mu \mid y) = \int p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y) \, d\sigma^2.$$

The integral forms are rarely computed in practice, but this expression helps us understand that the posterior can be expressed as the product of marginal and conditional densities: first drawing $\sigma^2$ from its marginal posterior and then $\mu$ from its conditional posterior given the drawn value of $\sigma^2$, so that the integration is performed indirectly. For example, consider the Normal model with both mean and variance unknown and assume a vague prior density $p(\mu, \sigma^2) \propto (\sigma^2)^{-1}$ (corresponding to a uniform prior on $(\mu, \log\sigma)$); then the joint posterior distribution is proportional to the sampling distribution multiplied by the factor $(\sigma^2)^{-1}$, that is

$$p(\mu, \sigma^2 \mid y) \propto \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right),$$

where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$ is the sample variance. Next, the conditional posterior density $p(\mu \mid \sigma^2, y)$ can be shown to be equal to

$$\mu \mid \sigma^2, y \sim \text{N}\left(\bar{y}, \frac{\sigma^2}{n}\right),$$

while the marginal posterior $p(\sigma^2 \mid y)$ can be obtained by averaging the joint $p(\mu, \sigma^2 \mid y)$ over $\mu$, that is

$$p(\sigma^2 \mid y) \propto \int \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right) d\mu,$$

which leads to

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2).$$

Typically, $\mu$ represents the estimand of interest and the objective of the analysis is therefore to make inference about the marginal posterior $p(\mu \mid y)$, which can be obtained by integrating $\sigma^2$ out of the joint posterior:

$$p(\mu \mid y) = \int_0^\infty p(\mu, \sigma^2 \mid y)\, d\sigma^2 \propto \left[1 + \frac{n(\mu - \bar{y})^2}{(n-1)s^2}\right]^{-n/2},$$

which corresponds to a Student-$t$ density with $n-1$ degrees of freedom:

$$\mu \mid y \sim t_{n-1}\left(\bar{y}, \frac{s^2}{n}\right).$$
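
The factorisation above suggests simulating from $p(\mu \mid y)$ by composition, and the following sketch (with simulated data; nothing here is from the source) does exactly that: draw $\sigma^2$ from $\text{Inv-}\chi^2(n-1, s^2)$ and then $\mu$ from $\text{N}(\bar{y}, \sigma^2/n)$, so that the resulting draws of $\mu$ follow the $t_{n-1}(\bar{y}, s^2/n)$ marginal posterior.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.0, size=30)     # simulated data

n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
# sigma^2 | y ~ Inv-chi^2(n - 1, s^2), drawn as (n - 1) s^2 / X with X ~ chi^2_{n-1}
sigma2 = (n - 1) * s2 / rng.chisquare(df=n - 1, size=10_000)
# mu | sigma^2, y ~ N(ybar, sigma^2 / n), one draw per sigma^2 draw
mu = rng.normal(loc=ybar, scale=np.sqrt(sigma2 / n))
```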

Multivariate Normal Example

Similar considerations to those applied to the univariate case can be extended to the multivariate case when y is formed by J components coming from the Multivariate Normal distribution

$$y \mid \mu, \Sigma \sim \text{N}(\mu, \Sigma),$$

where $\mu$ is a vector of length $J$ and $\Sigma$ is a $J \times J$ covariance matrix, which is symmetric and positive definite. The sampling distribution for a sample of $n$ units is

$$p(y \mid \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right).$$

As with the univariate normal model, we can derive the posterior distribution for $\mu$ and $\Sigma$ depending on the factorisation of the joint posterior and the prior distributions specified. For example, using the conjugate normal prior for the mean $p(\mu) = \text{N}(\mu_0, \Sigma_0)$, with $\Sigma$ assumed known, the posterior can be shown to be

$$\mu \mid y \sim \text{N}(\mu_1, \Sigma_1),$$

where the posterior mean is a weighted average of the data and prior means with weights given by the data and prior precision matrices, $\mu_1 = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y}\right)$, and the posterior precision is the sum of the prior and data precisions, $\Sigma_1^{-1} = \Sigma_0^{-1} + n\Sigma^{-1}$.
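
In matrix form the update mirrors the univariate case, as in the brief sketch below (the known covariance and prior hyperparameters are purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])                           # known covariance (illustrative)
y = rng.multivariate_normal(mean=[0.5, -1.0], cov=Sigma, size=50)   # simulated data

n, ybar = y.shape[0], y.mean(axis=0)
mu0, Sigma0 = np.zeros(2), 10.0 * np.eye(2)              # hypothetical prior mean and covariance

prec1 = np.linalg.inv(Sigma0) + n * np.linalg.inv(Sigma)             # Sigma_1^{-1}
Sigma1 = np.linalg.inv(prec1)
mu1 = Sigma1 @ (np.linalg.inv(Sigma0) @ mu0 + n * np.linalg.inv(Sigma) @ ybar)
```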

In the situation in which both $\mu$ and $\Sigma$ are unknown, convenient conjugate prior distributions which generalise those used in the univariate case are the Inverse-Wishart for the covariance matrix, $\Sigma \sim \text{Inv-Wishart}(\Lambda_0, \nu_0)$, and the Multivariate Normal for the mean, $\mu \sim \text{N}(\mu_0, \Sigma_0)$, where $\nu_0$ and $\Lambda_0$ represent the degrees of freedom and the scale matrix of the Inverse-Wishart distribution, while $\mu_0$ and $\Sigma_0 = \Sigma/\kappa_0$ are the prior mean and covariance matrix of the Multivariate Normal. Working out the form of the posterior, it can be shown that the joint posterior distribution has the same Normal-Inverse-Wishart form as the prior, with parameters

$$\mu \mid \Sigma, y \sim \text{N}(\mu_1, \Sigma_1) \qquad \text{and} \qquad \Sigma \mid y \sim \text{Inv-Wishart}(\Lambda_1, \nu_1),$$

where $\Sigma_1 = \Sigma/\kappa_1$, $\mu_1 = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{y}$, $\kappa_1 = \kappa_0 + n$, $\nu_1 = \nu_0 + n$, and $\Lambda_1 = \Lambda_0 + \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)(\bar{y} - \mu_0)^T$.
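
A joint draw from this posterior can be sketched as below, drawing $\Sigma$ from its Inverse-Wishart marginal and then $\mu$ given $\Sigma$ from its Normal conditional; the prior hyperparameters are purely illustrative.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(5)
y = rng.multivariate_normal(mean=[0.0, 1.0], cov=np.eye(2), size=50)   # simulated data

n, J = y.shape
ybar = y.mean(axis=0)
S = (y - ybar).T @ (y - ybar)                     # sum of squares around the sample mean

# Hypothetical prior hyperparameters
mu0, kappa0, nu0, Lambda0 = np.zeros(J), 1.0, J + 2, np.eye(J)

kappa1, nu1 = kappa0 + n, nu0 + n
mu1 = (kappa0 * mu0 + n * ybar) / kappa1
Lambda1 = Lambda0 + S + (kappa0 * n / kappa1) * np.outer(ybar - mu0, ybar - mu0)

Sigma = invwishart.rvs(df=nu1, scale=Lambda1, random_state=rng)   # Sigma | y
mu = rng.multivariate_normal(mean=mu1, cov=Sigma / kappa1)        # mu | Sigma, y
```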

Regression Models

Suppose the data consist of $n$ units measured on an outcome variable $y$ and a set of $J$ covariates $X = (x_1, \dots, x_J)$, and assume that the distribution of $y$ given $X$ is Normal with mean $\mu_i = \beta_0 + \sum_{j=1}^J \beta_j x_{ij}$ and variance $\sigma^2$:

$$y \mid \beta, \sigma^2, X \sim \text{N}(X\beta, \sigma^2 I),$$

where $\beta = (\beta_0, \dots, \beta_J)$ is the set of regression coefficients and $I$ is the $n \times n$ identity matrix. Within the normal regression model, a convenient vague prior distribution is uniform on $(\beta, \log\sigma)$:

$$p(\beta, \sigma^2) \propto \sigma^{-2}.$$

As with the normal model with unknown mean and variance, we can factor the joint posterior as $p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y)\,p(\sigma^2 \mid y)$ (omitting $X$ from the conditioning for simplicity) and determine each factor in turn. The conditional distribution $p(\beta \mid \sigma^2, y)$ is Normal,

$$\beta \mid \sigma^2, y \sim \text{N}(\hat{\beta}, V_\beta\sigma^2),$$

where $\hat{\beta} = (X^TX)^{-1}X^Ty$ and $V_\beta = (X^TX)^{-1}$. The marginal posterior $p(\sigma^2 \mid y)$ has a scaled Inverse-$\chi^2$ form,

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n - J, s^2),$$

where $s^2 = \frac{1}{n-J}(y - X\hat{\beta})^T(y - X\hat{\beta})$. Finally, the marginal posterior $p(\beta \mid y)$, obtained by averaging over $\sigma^2$, is multivariate $t$ with $n-J$ degrees of freedom, although in practice we can characterise the joint posterior by drawing first from $p(\sigma^2 \mid y)$ and then from $p(\beta \mid \sigma^2, y)$. When the analysis is based on improper priors (which do not have a finite integral), it is important to check that the posterior is proper. In the case of the regression model, the posterior for $\beta \mid \sigma^2$ is proper only if the number of observations exceeds the number of parameters ($n > J$) and the rank of $X$ equals $J$ (i.e. the columns of $X$ are linearly independent), so that all $J$ coefficients are uniquely identified from the data.
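
Posterior simulation under this vague prior can be sketched as follows (the data are simulated and, purely for illustration, $J$ counts all columns of $X$ including the intercept): draw $\sigma^2$ from $\text{Inv-}\chi^2(n-J, s^2)$ and then $\beta$ from $\text{N}(\hat{\beta}, V_\beta\sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, J - 1))])   # design matrix with intercept
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)                # simulated outcome

V_beta = np.linalg.inv(X.T @ X)                  # (X'X)^{-1}
beta_hat = V_beta @ X.T @ y                      # least-squares estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - J)

sigma2 = (n - J) * s2 / rng.chisquare(df=n - J)                      # draw of sigma^2 | y
beta = rng.multivariate_normal(mean=beta_hat, cov=V_beta * sigma2)   # draw of beta | sigma^2, y
```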

Generalised Linear Models

The purpose of Generalised Linear Models (GLMs) is to extend the idea of linear modelling to cases for which the linear relationship between $X$ and $E[y \mid X]$ or the Normal distribution is not appropriate. GLMs are specified in three stages:

  1. Choose the linear predictor $\eta = X\beta$.

  2. Choose the link function $g(\cdot)$ that relates the linear predictor to the mean of the outcome variable, so that $\mu = g^{-1}(\eta)$.

  3. Choose the random component specifying the distribution of $y$ with mean $E[y \mid X]$.

Thus, the mean of the distribution of $y$ given $X$ is determined as $E[y \mid X] = g^{-1}(X\beta)$. The Normal linear model can be thought of as a special case of GLMs where the link function is the identity, $g(\mu) = \mu$, and the random component is normally distributed. Perhaps the most commonly used GLMs are those based on the Poisson and Binomial distributions, to analyse count and binary data, respectively.

Poisson

Count data are often modelled using Poisson regression models, which assume that $y$ is distributed according to a Poisson distribution with mean $\mu$. The link function is typically chosen to be the logarithm, so that $\log\mu = X\beta$, and the distribution of the data has density

$$p(y \mid \beta) = \prod_{i=1}^n \frac{1}{y_i!} \exp\left(-e^{\eta_i}\right) \left(e^{\eta_i}\right)^{y_i},$$

where $\eta_i = (X\beta)_i$ is the linear predictor for the $i$th unit.
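
For a given $\beta$, this density can be evaluated on the log scale as in the brief sketch below (the simulated data and coefficient values are purely illustrative).

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglik(beta, X, y):
    """Poisson regression log-likelihood with log link:
    sum_i [ y_i * eta_i - exp(eta_i) - log(y_i!) ]."""
    eta = X @ beta                    # linear predictor eta_i = (X beta)_i
    return np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

# Illustrative call with simulated data
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
beta = np.array([0.5, 0.2])
y = rng.poisson(np.exp(X @ beta))
print(poisson_loglik(beta, X, y))
```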

Binomial

Suppose there are some binomial data $y_i \sim \text{Bin}(n_i, \mu_i)$, with $n_i$ known. It is common to specify the model in terms of the mean of the proportions $y_i/n_i$ rather than the mean of $y_i$. Choosing the logit transformation of the probability of success, $g(\mu_i) = \log\left(\frac{\mu_i}{1-\mu_i}\right)$, as the link function leads to logistic regression, where the data have distribution

$$p(y \mid \beta) = \prod_{i=1}^n \binom{n_i}{y_i} \left(\frac{e^{\eta_i}}{1+e^{\eta_i}}\right)^{y_i} \left(\frac{1}{1+e^{\eta_i}}\right)^{n_i - y_i}.$$
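
The corresponding log-likelihood for a given $\beta$ can be sketched as below (a hypothetical helper, not from the source), using the fact that $y_i\log\mu_i + (n_i - y_i)\log(1-\mu_i) = y_i\eta_i - n_i\log(1+e^{\eta_i})$ under the logit link.

```python
import numpy as np
from scipy.special import gammaln

def logistic_loglik(beta, X, y, n_trials):
    """Binomial (logistic regression) log-likelihood under the logit link:
    sum_i [ log C(n_i, y_i) + y_i * eta_i - n_i * log(1 + exp(eta_i)) ]."""
    eta = X @ beta
    log_binom = gammaln(n_trials + 1) - gammaln(y + 1) - gammaln(n_trials - y + 1)
    return np.sum(log_binom + y * eta - n_trials * np.log1p(np.exp(eta)))
```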

The link functions used in the previous models are known as the canonical link functions for each family of distributions, that is, the function of the mean parameter that appears in the exponent of the exponential-family form of the probability density. However, it is also possible to use link functions which are not canonical.

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. Chapman & Hall/CRC.
