Wednesday, August 23, 2017

Bayesian Data Analysis : Conjugate Prior with R

In the last blog, I talked about non-informative priors. As you might guess, there is one other type of prior: the informative prior. Informative priors are based on expert opinion, historical data and so on. After selecting the prior, we draw from it some number of times and then obtain the posterior distribution through the model. If our analysis is sequential, in other words if new data arrives as we go, we can use the posterior distribution as the prior distribution in the next step and keep updating like that. In this blog I want to use this property in the example.

Conjugate priors are a key concept when working with prior and posterior distributions. A simple definition of a conjugate prior is that the prior distribution and the posterior distribution belong to the same family. For example, if we choose a Beta prior we will also end up with a Beta posterior. This makes our lives easier.

Here is a simple example. Let's say we have a coin and we don't know whether it is fair, so we are not sure that the probability of getting heads equals 1/2. We start with Uniform(0, 1), which says the probability of getting heads could be anywhere between 0 and 1. Uniform(0, 1) is the same as Beta(1, 1), and I will use the Beta form since it is the conjugate prior for this model. After choosing the prior we toss the coin 6 times and get 2 heads and 4 tails, i.e. two successes. With a Beta(α, β) prior, the posterior is Beta(α + Σy, β + n - Σy), where Σy is the number of successes. Here Σy = 2 and n = 6, so the posterior is Beta(1 + 2, 1 + 6 - 2) = Beta(3, 5). We started with a Beta and ended with a Beta.
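Since the update just adds counts to the Beta parameters, it is easy to check in R. Here is a minimal sketch of the coin example; the way the 2 heads are split across batches in the sequential part is made up for illustration, and the final answer does not depend on it:

> # Beta(1, 1) prior, i.e. Uniform(0, 1), updated with 2 heads in 6 tosses
> a <- 1; b <- 1; y <- 2; n <- 6
> curve(dbeta(x, a, b), from = 0, to = 1, ylim = c(0, 2.5), ylab = "density")
> curve(dbeta(x, a + y, b + n - y), add = TRUE, col = "red")   # Beta(3, 5) posterior
> # sequential updating gives the same answer: suppose 1 head arrived in the
> # first 3 tosses and 1 head in the last 3 tosses
> c((a + 1) + 1, (b + 2) + 2)   # Beta(3, 5) again
[1] 3 5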

I want to perform a simple Bayesian data analysis using the property explained in the example. Let's say we have a company called Jacob's Eggs that sells organic eggs. After discussing with the marketing department how to sell them, our CEO comes up with a plan of handing out brochures to houses in specific areas. After handing them out to 25 houses we sold 9 of our 3-month organic egg subscriptions. That is a 36% sales rate. But our CEO, whose name is Jacob John Blacksea, is not sure how to interpret this result. Is that a successful sales rate? Furthermore, a good friend of Jacob John, a very good sales specialist named Bahadirius, told him that sales rates in this business are usually between 20 - 25%. Jacob got confused and wants a simple analysis to interpret the results. Here is the analysis,



> n_draws <- 100000
> prior <- runif(n_draws, 0, 1)
> hist(prior)

> simulated_data <- rep(NA, n_draws)
> # the model takes a probability from the prior and returns a simulated number of successes out of 25
> model <- function(parameter) {
+     rbinom(1, size = 25, prob = parameter)
+ }
> for (i in 1:n_draws) {
+     simulated_data[i] <- model(prior[i])
+ }
> posterior <- prior[simulated_data == 9]
> hist(posterior)
> abline(v = mean(posterior), col = "red")


What happened above is that we drew 100,000 values from a Uniform(0, 1) distribution as our prior. The model function takes a prior probability and returns the number of successes out of 25 trials. For each prior draw we store the simulated number of successes in the simulated_data vector. The posterior is then the set of prior draws whose simulated number of successes equals 9, the value we actually observed.


> mean(posterior)

[1] 0.3692231

> 9/25
[1] 0.36
> quantile(posterior, c(0.025,0.975))
     2.5%     97.5% 
0.2049356 0.5511299 

95% of the posterior probability mass lies between roughly 0.20 and 0.55.

> #95% of the probabilities falls between 0.2 and 0.55
> sum(posterior > 0.3) / length(posterior)
[1] 0.7666064
> sum(posterior > 0.4) / length(posterior)
[1] 0.3528043
> sum(posterior > 0.25) / length(posterior)
[1] 0.9092789

We observed a 36% sales rate. The probability that the underlying sales rate is higher than 0.25 is 0.909, so we are quite certain it is above 0.25. For 0.4, the probability is only 0.35, so the rate is more likely below 0.4 than above it. For now, 36% looks like a good sales rate.
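As a sanity check, we can compare the simulation with the exact conjugate result from the coin example: with a Beta(1, 1) prior and 9 successes in 25 trials, the posterior is Beta(1 + 9, 1 + 25 - 9) = Beta(10, 17). A minimal sketch:

> 10 / (10 + 17)                    # exact posterior mean, ~0.370
> qbeta(c(0.025, 0.975), 10, 17)    # exact 95% interval, roughly 0.20 to 0.56
> 1 - pbeta(0.25, 10, 17)           # P(rate > 0.25), ~0.91

All three agree closely with the simulated values above.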


In the above analysis we used the prior from before Bahadirius shared his information. To add the new information I will select Σy = 4 and n = 17; remember that Uniform(0, 1) is equal to Beta(1, 1), so the updated prior is Beta(1 + 4, 1 + 17 - 4) = Beta(5, 14), whose mean of about 26% sits close to Bahadirius's 20 - 25% estimate.

> # Beta(1 + 4, 1 + 17 - 4)
> priorB <- rbeta(n_draws, 5, 14)
> hist(priorB)
This is more informative than the uniform distribution.
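Before running the simulation, we can check what this prior actually says. A minimal sketch:

> 5 / (5 + 14)                      # prior mean, ~0.26
> qbeta(c(0.025, 0.975), 5, 14)     # 95% prior interval, roughly 0.09 to 0.48

The prior mean sits near Bahadirius's 20 - 25% range, but the prior still leaves room for higher and lower rates.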

> simulated_data2 <- rep(NA,n_draws)
> for(i in 1:n_draws){
+     simulated_data2[i] <- model(priorB[i])
+ }
> posteriorB <- priorB[simulated_data2 == 9]
> mean(posteriorB)
[1] 0.3188646
> hist(posteriorB)
> abline(v = mean(posteriorB), col = "red", lwd = 2)
Here the posterior mean for the sales rate, given our 9 sales, is near 32%. We had 36% in the first analysis. To interpret it we can do something like

> sum(posteriorB > 0.35) / length(posteriorB)
[1] 0.3180726

The probability that the sales rate is higher than 35% is 0.32, but the posterior mean of about 32% is still above the 20 - 25% that Bahadirius considers typical. So, according to this analysis, our sales people have done a good job by selling 9 subscriptions!
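We can again cross-check the simulation against the exact conjugate posterior: with a Beta(5, 14) prior and 9 successes in 25 trials it is Beta(5 + 9, 14 + 25 - 9) = Beta(14, 30). A minimal sketch:

> 14 / (14 + 30)                    # exact posterior mean, ~0.318
> 1 - pbeta(0.35, 14, 30)           # P(rate > 0.35), ~0.32

Both match the simulation, which is exactly what conjugacy promises.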

The example in this blog was inspired by Rasmus Bååth's Research Blog, which has a great series of lectures on Bayesian data analysis. If you are looking for an introductory lecture on Bayesian statistics, Coursera has some effective courses like this one.




Thursday, August 10, 2017

Bayesian : Non-informative priors & Jeffreys Prior & Fisher Information

Priors are what make the Bayesian approach strong. But using priors built on knowledge we are not sure about is not recommended. If we want the advantages of the Bayesian approach, such as posterior probabilities, but do not have the data or expertise needed for an informative prior, we can use a non-informative prior.

Using a non-informative prior lets the data, rather than the prior, have the maximum influence on the posterior. What makes a prior non-informative? The answer is the effective sample size, which is the amount of data the prior distribution is worth; for a Beta(α, β) prior, for example, it is α + β. As the effective sample size approaches zero, the prior becomes non-informative.

Let's say we have a coin and we do not know whether it is fair. We flip the coin two times and get one head and one tail. With this information we can use a uniform distribution, θ ~ U[0, 1], which is equal to Beta(1, 1). The effective sample size of that Beta distribution is 2. This is not much information about whether the coin is fair, but it is still some information. To be non-informative we need to push the effective sample size toward zero; in the limit we get Beta(0, 0). The density of Beta(0, 0) is proportional to θ^(-1)(1-θ)^(-1), which has an infinite integral, so it is not a true density. This is called an 'improper prior'. It does not have a proper density, but it is still usable: as long as we observe some data later on, we can get a proper posterior from an improper prior. With y successes in n trials, the posterior density is proportional to θ^(y-1)(1-θ)^(n-y-1), so the posterior is Beta(y, n-y).

For instance, after choosing the improper Beta(0, 0) prior, let's say we observe one head and one tail. If getting a head is a success, the posterior is Beta(1, 1), and the posterior mean is y/n, which is 1/2 in our example. This is exactly the maximum likelihood estimate. So, by using this improper prior, we get a posterior whose point estimate matches the frequentist approach, but at the same time we have a full posterior: if we want to make interval or probability statements, we can find an interval and say there is a 95% probability that θ is in it.
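A minimal sketch of that idea in R, with made-up data (say 13 heads in 20 flips) and the improper Beta(0, 0) prior, so the posterior is Beta(y, n - y):

> y <- 13; n <- 20                  # hypothetical data: 13 heads in 20 flips
> y / n                             # posterior mean = maximum likelihood estimate
> qbeta(c(0.025, 0.975), y, n - y)  # 95% credible interval for theta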

Flat and improper priors like these have a problem with parameterization. If we have a quantity Z that can take any value between 0 and 1, we can give it a uniform prior, U(0, 1), so Z is equally likely to fall anywhere between 0 and 1. But how about Z^200? It lives in the same range, yet its distribution is no longer uniform: it concentrates near 0. So a prior that is non-informative for Z is quite informative about Z^200. Choosing a prior this way depends on the particular parameterization; the choice of parameterization affects the prior. We would like a prior that does not depend on the choice of parameterization.
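A quick simulation (just a sketch) shows the point: a prior that is flat in Z is far from flat in Z^200.

> z <- runif(100000)
> hist(z)                           # flat between 0 and 1
> hist(z^200)                       # piles up near 0
> mean(z^200 < 0.01)                # = 0.01^(1/200), about 0.98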

The Jeffreys prior does not have this problem: it gives the same prior belief no matter how the parameter is written. The Jeffreys prior is proportional to the square root of the Fisher information, π(θ) ∝ √I(θ), where I(θ) is the Fisher information. The Fisher information measures the expected amount of information the data carry about the parameter.


Fisher information formula: I(θ) = -E[∂² log f(X|θ) ∕ ∂θ²], where the expectation is taken over X given θ.
Here X can be a single observation or a whole sample of data points. If I(θ) is large, the likelihood changes quickly as we change the parameter; we can say this because I(θ) is built from the second derivative of the log likelihood, which measures its curvature. A large I(θ) means the likelihood is small everywhere except near the true parameter θ, so the estimator has small variance and the posterior is determined mostly by the data rather than the prior. This supports the idea of a non-informative prior: the posterior should be determined by the data, not the prior.
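As a concrete, standard case: for a coin-flip (Bernoulli) likelihood the Fisher information is I(θ) = 1 ∕ (θ(1 - θ)), so the Jeffreys prior is proportional to θ^(-1/2)(1 - θ)^(-1/2), which is the Beta(1/2, 1/2) distribution. A quick look at its shape in R:

> # Jeffreys prior for a coin-flip probability: Beta(1/2, 1/2)
> curve(dbeta(x, 0.5, 0.5), from = 0.001, to = 0.999,
+       xlab = expression(theta), ylab = "density")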

Here is an example using the normal model with known variance σ² and unknown mean μ. We will use a single observation from that model and compute its Fisher information.


Up to an additive constant, the log density of a single point from the normal model is -(X - μ)² ∕ (2σ²). Taking the second derivative with respect to μ and then the negative of its expected value gives 1/σ². Extending this to n observations gives n/σ². As can be seen, this is a constant in μ, and so is its square root: it doesn't vary as the parameter varies. A constant prior over the whole real line puts equal weight on every value of μ (and is improper), so it is a non-informative prior. Priors obtained this way are Jeffreys priors.
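Here is a minimal numerical sketch of that claim (σ = 2 and the single simulated observation are arbitrary choices for illustration): the curvature of the normal log-likelihood in μ depends on neither the data nor μ, and equals 1/σ².

> sigma <- 2
> x <- rnorm(1, mean = 3, sd = sigma)   # one observation; the true mean 3 is arbitrary
> loglik <- function(mu) dnorm(x, mean = mu, sd = sigma, log = TRUE)
> h <- 1e-3
> # negative second derivative of the log-likelihood at mu = 0, by finite differences;
> # it should be ~1/sigma^2 = 0.25 no matter which x or mu we pick
> -(loglik(h) - 2 * loglik(0) + loglik(-h)) / h^2
> 1 / sigma^2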



A more detailed explanation of the Jeffreys prior is here.


There is a great blog post about the Jeffreys prior here.
