Markov Chain Monte Carlo (MCMC) Sampling

A beginners guide to iterative Monte-Carlo methods.

6 min readJan 18, 2021

When I started learning Bayesian Statistics, I had to invest lots of hours dealing with complexities of algorithms. It’s really annoying when you try to make things simpler and you end up making it more complex. If you are reading this article, then I am pretty sure that you must have experienced such exasperating moment. Do not worry , I am here to help you with your statistical journey.

To understand MCMC sampling method, you need to have prior knowledge of Bayes Theorem and you need to be familiar with the terms like Prior, Posterior, likelihoods, and evidence. I am writing this article assuming that you are already familiar with the fundamentals of Bayesian data analysis.

The main objective of Bayesian data analysis is to compute the posterior probability. Posterior probability of an event θ given X is true can be written as

determination of above posterior distribution comes down to evaluation of complex, often high-dimensional integration (i.e.,denominator of above expression). p(X) can be written as

Here,

p(X|θ) is the likelihood
p(θ|X) is the posterior distribution
p(θ) is the prior
p(X) is the evidence. Also known as normalising constant or marginal likelihood.

There are many methods to calculate the p(X).

In case of conjugate priors, we can easily get a closed form solution.
We can use numerical Integration.
we can approximate the function used to calculate posterior distribution and the resulting estimated priori is close to expected priori.
And we can use Monte Carlo methods, of which MCMC (Markov Chain Monte Carlo Method) is the most important one.

Majority of the Bayesian MCMC computing is accomplished using one of the basic two algorithms, the Metropolis-Hastings algorithm and Gibbs sampler algorithm.

Here we will only discuss about the Metropolis-Hastings(M-H) method and its usefulness in calculating the posterior distributions. If you want to explore the Gibbs sampling method, you can go through this paper.

Metropolis-Hasting algorithm has an upper hand over Gibbs sampler when the prior p(θ) and likelihood p(X|θ) are not a conjugate pair or either of these full conditionals are not available in closed form.

How does Metropolis- Hasting algorithm work?

M-H algorithm is a rejection algorithm that aims to accept or reject a posterior sample subjected to certain condition. Don’t get confused by the term rejection . we will see later what rejection algorithm means after we have understood the basic of M-H algorithm.

To apply M-H algorithm, we need to provide a transition kernel to the sampler. A transition kernel is simply a way of moving randomly in space from a given position x to a new position y. That is Q, is a distribution on y given x is true. we will write it as Q(y|x). In many applications, Q will be a continuous distribution, in that case Q(y|x) will be a density on y, and so ∫Q(y|x)dy=1(for all x). This kind of kernel which adds some random number to a current position x to get y is called as ‘‘random walk’’ kernel. In the context of M-H algorithm, Q is also called as proposal distribution or more specifically ‘’random walk’’ proposal.

our main interest in algorithm is for generation from the full conditionals(typically uni-variate). But, it can be easily described and theoretically supported for multivariate θ (here θ is our argument of interest that we are trying to generate from sampling).

Suppose for now, we wish to generate from joint posterior distribution given by p(θ|Y) ∝ h(θ) ≡ (f(y|θ)*π(θ), where f(y|θ) is the likelihood and π(θ) is the prior distribution.

Let’s begin by specifying a proposal distribution which is given by

M-H algorithm -

initialise θ with some value.

for (t=1,2,3…. T):

draw a sample θ* from the proposed distribution .

calculate the rejection ratio r given by
If r > 1, set θᵗ ⁼ θ*
if r < 1, set θᵗ = θ* with probability r otherwise set θᵗ = θ⁽ᵗ⁻¹⁾ with probability 1-r (this is where, we use uniform normal distribution).

for M-H algorithm, any kind of probability distribution can be used as proposal distribution while that is not the case with basic Metropolis algorithm. There we use symmetric proposal distribution such that

Typically a normal distribution θ ~ N(0,σ). Since, the normal distribution is symmetric around centre, most of the samples are densely located around the centre of the posterior distribution.

After some number of iterations k, the samples θᵏ⁺¹,θᵏ⁺² …… will be sampled from the posterior distribution p(θ|Y).

Note that we accept a sample θᵏ⁺¹ if it is density is greater than the target(unnormalised) distribution θᵏ. It means, θ will be more often within the region where the target distribution is denser. If this was the case, then we will get stuck at local mode of target distribution. Therefore we also accepts occasionally lower density if it is greater than the randomly chosen uniform density otherwise we reject it. It will get more clear once we see an example.

We will consider a binomial distribution as likelihood and Beta distribution as prior distribution .

Some facts about Binomial and Beta distribution : -

Beta distribution is conjugate prior of Binomial distribution.
if the prior θ ~ Beta(a,b) where a and b are shape parameters and both a and b are positive, likelihood y ~ Bin(n,θ), then its posterior distribution p(θ|Y) is proportional to Beta(a+y, b+n-y).

The efficiency comes out to be 23.85 % (greater the efficiency, lesser the correlation between samples). We can tune the efficiency by considering only every mᵗʰ sample which reduces the auto-correlation. Also, There is a term called burn-in sample, it avoids oversampling in low probability region. We can verify our generated posterior sample with respect to the theoretical posterior sample by plotting histogram.

Plot is shown below.

In the above plot, we can see that histogram of generated sample looks somewhat similar to the histogram of theoretical generated samples. Under mild conditions, a draw θᵗ always converges in distribution to a draw from the true posterior density p(θ|Y) as t →∞. Again, a full justification of why the algorithm works requires Markov Chain theory. If you are interested, you can go through this link.

Thanks for reading this. Happy learning!!!!!

References:

Bayesian Method for Data Analysis, Third Edition by Carlin B.P., Louis T.A.

chandan5362/Bayesian-Statistics

contains the notebook related with sampling and all. - chandan5362/Bayesian-Statistics

github.com