Thursday, August 10, 2017

Bayesian: Non-informative Priors, Jeffreys Prior & Fisher Information

Priors are what make the Bayesian approach strong. But using a prior that encodes knowledge we are not actually sure about is not recommended. If we want the advantages of the Bayesian approach, such as posterior probabilities, but lack the knowledge needed for an informative prior, we can use a non-informative prior instead.

Using a non-informative prior lets the data, rather than the prior, have the maximum influence on the posterior. What makes a prior non-informative? The answer is its effective sample size: the amount of data the prior distribution is worth. As the effective sample size approaches zero, the prior becomes non-informative.
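To make that concrete, here is a minimal Python sketch (the counts y = 3 heads in n = 10 flips are hypothetical, chosen just for illustration). A Beta(a, b) prior has effective sample size a + b, and the Beta-Binomial posterior mean is (a + y)/(a + b + n):

```python
# Minimal sketch: the prior's effective sample size (a + b for a
# Beta(a, b) prior) controls its influence on the posterior.
y, n = 3, 10  # hypothetical data: 3 heads in 10 flips

for a, b in [(50, 50), (1, 1), (0.01, 0.01)]:
    ess = a + b                        # effective sample size of the prior
    post_mean = (a + y) / (a + b + n)  # Beta-Binomial posterior mean
    print(f"Beta({a}, {b}): ESS = {ess:6.2f}, posterior mean = {post_mean:.3f}")

# As the ESS shrinks toward zero, the posterior mean approaches the
# data-only estimate y/n = 0.3: the prior stops mattering.
```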

Let's say we have a coin and we do not know whether it is fair. A uniform prior on the probability of heads, θ ~ U[0, 1], which is the same as Beta(1, 1), carries as much information as having already flipped the coin twice and seen one head and one tail: its effective sample size is 2. That is not enough to tell us whether the coin is fair, but it is still some information, so we need to get closer to zero. Taking the limit, we eventually arrive at Beta(0, 0), whose density is proportional to θ^(-1)(1-θ)^(-1). If we integrate this, the integral is infinite, so it is not a true density. This is called an ‘improper prior’. It does not have a proper density, but it is still usable: as long as we observe some data later on, we can get a proper posterior from an improper prior. Multiplying the prior by the Bernoulli likelihood θ^y(1-θ)^(n-y), the posterior density is proportional to θ^(y-1)(1-θ)^(n-y-1), which is Beta(y, n-y).

For instance, after choosing the improper Beta(0, 0) prior, suppose we observe one head and one tail. If getting heads counts as a success, the posterior is Beta(1, 1), and the posterior mean is y/n, which is 1/2 in our example. This is exactly the maximum likelihood estimate. So, by using this improper prior, we get a posterior whose point estimate matches the frequentist approach, but at the same time we have a full posterior: if we want to make interval or probability statements, we can find an interval and say there is a 95% probability that θ lies in it.
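As a sketch of that interval statement, the snippet below (mine, not from the original post) uses scipy to build the Beta(y, n − y) posterior and read off an equal-tailed 95% credible interval:

```python
from scipy import stats

y, n = 1, 2                       # one head (success) in two flips
posterior = stats.beta(y, n - y)  # Beta(y, n - y) posterior from the Beta(0, 0) prior

print(posterior.mean())            # y/n = 0.5, matching the MLE
lo, hi = posterior.interval(0.95)  # equal-tailed 95% credible interval
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```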

Improper (and flat) priors have a problem with parameterization. Suppose a parameter Z can take any value between 0 and 1, and we put a uniform U(0, 1) prior on it. But how about Z^200? It still lies between 0 and 1, yet it is not uniform at all; it concentrates near 0. This is the parameterization problem: whether a prior is "non-informative" depends on the particular parameterization we chose, so the choice of parameterization affects the prior. We need a prior that does not depend on the choice of parameterization.
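A quick simulation makes this concrete (a minimal sketch; the uniform samples stand in for the "non-informative" prior on Z):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=100_000)  # uniform prior on Z

# The same prior, pushed through the transformation Z**200, is anything
# but flat: almost all of its mass piles up near 0.
w = z ** 200
print(f"P(Z     < 0.01) = {np.mean(z < 0.01):.3f}")  # about 0.010
print(f"P(Z^200 < 0.01) = {np.mean(w < 0.01):.3f}")  # about 0.977
```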

The Jeffreys prior has this property. It is proportional to the square root of the Fisher information: π(θ) ∝ √I(θ), where I(θ) is the Fisher information. The Fisher information measures the expected amount of information the data carry about the parameter.


The Fisher information formula is

I(θ) = −E[ (∂²/∂θ²) log f(X | θ) ]
where X may consist of more than one data point. If I(θ) is large, the likelihood changes rapidly as we change the parameter. We can draw this conclusion because I(θ) is built from the second derivative of the log-likelihood, which measures the amount of change. A large I(θ) corresponds to a small variance of the estimate: the likelihood is small everywhere except near the true parameter θ, so the posterior is driven by the data generated under θ rather than by the prior. This supports our idea of a non-informative prior, namely that the posterior should be determined by the data, not the prior.
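To illustrate, here is a small sketch (with made-up data, not from the post) comparing the normal log-likelihood over a grid of μ values in a low-information case (σ = 5) and a high-information case (σ = 0.5, so I(μ) = n/σ² is large). The high-information likelihood is sharply peaked near the true mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_mu = 50, 2.0
grid = np.linspace(-2, 6, 9)  # candidate values of mu

for sigma in (5.0, 0.5):  # small vs. large Fisher information n / sigma^2
    x = rng.normal(true_mu, sigma, size=n)
    # Normal log-likelihood over the grid (additive constants dropped)
    loglik = np.array([-np.sum((x - mu) ** 2) / (2 * sigma**2) for mu in grid])
    loglik -= loglik.max()  # normalize so the peak sits at 0
    print(f"sigma = {sigma}: I(mu) = n/sigma^2 = {n / sigma**2:.0f}")
    print("  relative log-likelihood:", np.round(loglik, 1))
```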

The following is an example using the normal model. We will use a single data point from that model and compute its Fisher information.


First, the log-density of a single point from the normal model is, up to an additive constant, −(X − μ)² ∕ (2σ²). Next we take the second derivative with respect to μ, which is −1∕σ², and then the negative of its expected value, giving I(μ) = 1∕σ². Extending this to n points from the normal model gives n∕σ². As can be seen, this is a constant, and its square root is also a constant: it does not vary as the parameter μ varies. Since μ can take any value between minus and plus infinity, a constant prior over the whole real line integrates to infinity, so it is again an improper, non-informative prior. Priors obtained this way are Jeffreys priors.
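The same derivation can be checked symbolically. A minimal sympy sketch (my own, not from the post):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Log-density of a single point X ~ N(mu, sigma^2), dropping the
# additive constant -log(sigma * sqrt(2*pi)), which does not depend on mu.
log_f = -(x - mu)**2 / (2 * sigma**2)

# Fisher information: I(mu) = -E[ d^2/dmu^2 log f(X | mu) ].
second_deriv = sp.diff(log_f, mu, 2)  # constant in x, so the expectation is trivial
fisher_info = -second_deriv

print(second_deriv)  # -1/sigma**2
print(fisher_info)   # 1/sigma**2, constant in mu, as claimed above
```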



A more detailed explanation of the Jeffreys prior is here.


There is a great blog post about the Jeffreys prior here.

