Conjugate Priors

Posted by JongHyun on September 5, 2017

Bernoulli / Binomial likelihood with a uniform prior

As shown in a previous post, if we put a uniform prior on the parameter of a Bernoulli / binomial likelihood, the posterior is a beta distribution. In general, for n observations $\tilde{y} = (y_1, \dots, y_n)$, the likelihood and the uniform prior are

$$f(\tilde{y} \mid \theta) = \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}, \qquad f(\theta) = I_{\{0 \le \theta \le 1\}}$$

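$$
\begin{aligned}
f(\theta \mid \tilde{y})
&= \frac{f(\tilde{y} \mid \theta)\, f(\theta)}{\int f(\tilde{y} \mid \theta)\, f(\theta)\, d\theta}
 = \frac{\theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, I_{\{0 \le \theta \le 1\}}}{\int \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, I_{\{0 \le \theta \le 1\}}\, d\theta} \\
&= \frac{\theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, I_{\{0 \le \theta \le 1\}}}{\dfrac{\Gamma(\sum y_i + 1)\, \Gamma(n - \sum y_i + 1)}{\Gamma(n+2)} \displaystyle\int_0^1 \frac{\Gamma(n+2)}{\Gamma(\sum y_i + 1)\, \Gamma(n - \sum y_i + 1)}\, \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, d\theta} \\
&= \frac{\Gamma(n+2)}{\Gamma(\sum y_i + 1)\, \Gamma(n - \sum y_i + 1)}\, \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, I_{\{0 \le \theta \le 1\}}
\end{aligned}
$$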
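The key step in this derivation is that the integral in the denominator has a closed form in terms of gamma functions. As a rough sanity check, the sketch below compares that closed form with a simple midpoint-rule approximation. The function names and the ten data points are made up for illustration; only the Python standard library is assumed.

```python
import math

def beta_normalizer(s, n):
    """Closed form of the integral of theta^s * (1 - theta)^(n - s) over [0, 1]."""
    return math.gamma(s + 1) * math.gamma(n - s + 1) / math.gamma(n + 2)

def numeric_normalizer(s, n, steps=100_000):
    """Midpoint-rule approximation of the same integral."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        total += theta ** s * (1.0 - theta) ** (n - s)
    return total * h

# Hypothetical data: 10 Bernoulli trials with 7 successes.
y = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
n, s = len(y), sum(y)

closed = beta_normalizer(s, n)
approx = numeric_normalizer(s, n)
print(closed, approx)
```

If the two numbers agree, dividing the likelihood by this constant gives a density that integrates to one, which is exactly the Beta(Σy_i + 1, n − Σy_i + 1) density.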

The derivation above shows that the posterior for θ given ỹ is a beta distribution, namely Beta(Σy_i + 1, n − Σy_i + 1). We can use this to make a more general statement. The uniform distribution is itself a beta distribution with parameters (1, 1), and any beta distribution is conjugate for the Bernoulli likelihood. A general beta prior has density

$$f(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}\, I_{\{0 \le \theta \le 1\}}$$

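$$
\begin{aligned}
f(\theta \mid \tilde{y})
&\propto f(\tilde{y} \mid \theta)\, f(\theta)
 = \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}\, I_{\{0 \le \theta \le 1\}} \\
&\propto \theta^{\alpha + \sum y_i - 1} (1-\theta)^{\beta + n - \sum y_i - 1}\, I_{\{0 \le \theta \le 1\}}
\end{aligned}
$$

So, $\theta \mid \tilde{y} \sim \text{Beta}\left(\alpha + \sum y_i,\ \beta + n - \sum y_i\right)$.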
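If the prior has α = β = 1, we start from the uniform distribution. The order of the observations does not matter in this calculation, because the likelihood depends on the data only through Σy_i and n. We also only need to keep track of the terms that involve the parameter θ; every other factor is absorbed into the normalizing constant.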
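In code, the conjugate update amounts to adding the number of successes to α and the number of failures to β. The following is a minimal sketch; `update_beta` and the example data are hypothetical names and values chosen for illustration.

```python
def update_beta(alpha, beta, y):
    """Return the posterior (alpha, beta) of a beta prior after Bernoulli outcomes y."""
    successes = sum(y)
    failures = len(y) - successes
    return alpha + successes, beta + failures

# Hypothetical data: 10 Bernoulli trials with 7 successes.
y = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

uniform_post = update_beta(1, 1, y)           # uniform prior, i.e. Beta(1, 1)
informative_post = update_beta(2, 5, y)       # an arbitrary non-uniform prior
reordered_post = update_beta(1, 1, sorted(y)) # same data in a different order

print(uniform_post, informative_post, reordered_post)
```

Reordering the data leaves the posterior unchanged, since only the counts of successes and failures enter the update.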

Now we can define a conjugate family. A family of distributions is called conjugate for a likelihood if, whenever we use a member of that family as the prior, the posterior is another member of the same family. In the previous case, the beta distribution is conjugate for the Bernoulli likelihood. Because this relationship makes the calculation much simpler, it is convenient to use a conjugate distribution as the prior. Note that when we use a beta distribution (or another distribution) as a prior, the prior has parameters of its own, such as α and β. We call these hyperparameters.

Posterior mean and effective sample size

For a Bernoulli likelihood with n observations and a beta prior, we can now write down the posterior distribution and its mean. We can also define the effective sample size of the prior in terms of the hyperparameters.

$$\text{prior} \sim \text{Beta}(\alpha, \beta)$$

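$$\text{effective sample size} = \alpha + \beta$$

$$\text{posterior} \sim \text{Beta}\left(\alpha + \sum y_i,\ \beta + n - \sum y_i\right)$$

$$\text{mean of Beta}(\alpha, \beta) = \frac{\alpha}{\alpha+\beta}$$

$$
\begin{aligned}
\text{posterior mean}
&= \frac{\alpha + \sum y_i}{\alpha + \beta + n}
 = \frac{\alpha+\beta}{\alpha+\beta+n} \cdot \frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n} \cdot \frac{\sum y_i}{n} \\
&= \text{prior weight} \times \text{prior mean} + \text{data weight} \times \text{data mean}
\end{aligned}
$$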
Here the effective sample size α + β measures how much the prior counts for in units of observations: the prior carries the same weight as α + β data points. Comparing it with the number of observations n tells us how strongly the prior influences the posterior. When n is much larger than α + β, the data weight is close to 1 and the posterior is almost unaffected by the prior.
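The weighted-average form of the posterior mean is easy to check numerically. The sketch below uses a hypothetical Beta(2, 2) prior and made-up data; `posterior_mean_parts` is an illustrative helper, not a library function.

```python
import math

def posterior_mean_parts(alpha, beta, y):
    """Return (prior_weight, data_weight, weighted_mean, direct_mean) for a beta prior."""
    n = len(y)
    ess = alpha + beta                      # effective sample size of the prior
    prior_weight = ess / (ess + n)
    data_weight = n / (ess + n)
    prior_mean = alpha / ess
    data_mean = sum(y) / n
    weighted_mean = prior_weight * prior_mean + data_weight * data_mean
    direct_mean = (alpha + sum(y)) / (ess + n)  # mean of the posterior beta directly
    return prior_weight, data_weight, weighted_mean, direct_mean

# Hypothetical data: 10 Bernoulli trials with 7 successes, prior Beta(2, 2).
y = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
prior_weight, data_weight, weighted_mean, direct_mean = posterior_mean_parts(2, 2, y)

print(prior_weight, data_weight, weighted_mean, direct_mean)
```

With an effective sample size of 4 against n = 10 observations, the prior gets weight 4/14 and the data 10/14, so the posterior mean sits between the prior mean 0.5 and the data mean 0.7, closer to the latter.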

If we observe events at different times, that is not a problem in the Bayesian paradigm. All we need to do is update the posterior sequentially, using the current posterior as the prior for the next batch of data. The result is identical whether we process all the data at once or in separate batches. In the frequentist paradigm, by contrast, collecting the data on different days can change the analysis.
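The equivalence of sequential and one-shot updating can be illustrated with the beta-Bernoulli update. This is a small sketch under assumed data: `day1` and `day2` are hypothetical batches, and `update_beta` is an illustrative helper.

```python
def update_beta(alpha, beta, y):
    """Return the posterior (alpha, beta) of a beta prior after Bernoulli outcomes y."""
    successes = sum(y)
    failures = len(y) - successes
    return alpha + successes, beta + failures

# Hypothetical observations collected on two different days.
day1 = [1, 0, 1, 1]
day2 = [0, 1, 1, 1, 0, 1]

# One-shot update on all the data, starting from a uniform Beta(1, 1) prior.
batch = update_beta(1, 1, day1 + day2)

# Sequential update: the posterior after day 1 becomes the prior for day 2.
a1, b1 = update_beta(1, 1, day1)
sequential = update_beta(a1, b1, day2)

print(batch, sequential)
```

Both routes end at the same beta posterior, because the update only accumulates counts of successes and failures.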