**Introduction**

In this post, I am going to talk about the Beta distribution and some intuitive interpretations behind it.

**An Example**

Suppose we have two coins (A and B), and we run a statistical experiment to identify whether these coins are biased. For coin A, we tossed it 5 times and the results are: 1,0,0,0,0 (1 indicates Tail and 0 indicates Head). For coin B, we tossed it 10 times and the results are: 1,1,0,0,0,0,0,0,0,0. The observed proportion of Tails is identical for both coins: 0.2. *Is it safe to say both coins equally favour the Tail?*

The answer is *no*, from the perspective of the *law of large numbers* (LLN):

> the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

That means that although the standard deviations of the two empirical distributions are the same, the standard error of A will be larger than that of B, because of the smaller sample size.

(Please note that the standard error of a sample average measures the rough size of the difference between the population average and the sample average.)
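As a quick illustration (my own sketch, using the usual plug-in formula \( \sqrt{p(1-p)/n}\) for the standard error of a sample proportion, with the two coins above):

```python
from math import sqrt

def std_err(p, n):
    """Plug-in standard error of a sample proportion p observed over n trials."""
    return sqrt(p * (1 - p) / n)

print(std_err(0.2, 5))   # coin A: 5 tosses
print(std_err(0.2, 10))  # coin B: 10 tosses -> smaller standard error
```

Both coins show the same proportion 0.2, but the estimate from 10 tosses is tighter than the one from 5.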

So we now know that the expected outcomes for coins A and B are the same, while our confidence in these two estimates differs. Let \( P_{A}\) denote the probability for A to be Tail, and \( P_{B}\) the probability for B to be Tail. What we want to know is, for each candidate value between 0 and 1, how likely it is that \( P_{A}\) (or \( P_{B}\)) takes that value. In other words, we want a distribution over the possible values of \( P_{A}\) and \( P_{B}\). That is, *the probability (uncertainty) of a probability*.

Below is the probability (uncertainty) that we wanted.

(Please find the code that generates the below image here)

In the above graph, the red line corresponds to coin A, the green line to coin B, and the blue line to a third coin with 80 Heads and 20 Tails. From the red curve (coin A), we can see that even though one of the five tosses was a Tail, the distribution peaks near zero: the most probable probability for coin A to be Tail is close to 0. The green curve (coin B) peaks near 0.15, so its most probable probability of Tail is close to 0.15. The blue curve peaks close to 0.2. Notice that although the expected probability of Tail is 0.2 for all three coins, the shapes of the distributions differ, and the more data we collect, the more the distribution concentrates in a small region. *That is the Beta distribution.*
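The original code link above is lost, but a minimal reconstruction is possible. Assuming each curve is the Beta density with parameters (tails, heads) — my reading of the peaks described above — we can locate each peak numerically:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, with B(a, b) = Γ(a)Γ(b)/Γ(a+b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

# Evaluate each curve on a fine grid and locate its peak.
grid = [i / 1000 for i in range(1, 1000)]
for label, (a, b) in [("coin A", (1, 4)), ("coin B", (2, 8)), ("80H/20T", (20, 80))]:
    peak = max(grid, key=lambda x: beta_pdf(x, a, b))
    print(label, round(peak, 3))
```

The peaks come out near 0, 0.125, and 0.194, matching the red, green, and blue curves described above.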

**Definition**

The pdf of the Beta distribution is:

\( P(x)=\begin{cases}
\frac{x^{\alpha -1}(1-x)^{\beta -1}}{B(\alpha ,\beta )}, & x\in [0,1]\\
0, & \text{otherwise}
\end{cases}\)

where \( B(\alpha ,\beta )\) is a normalizing constant that makes the density integrate to 1 over \( [0,1]\).

\( B(\alpha ,\beta ) = \int_{0}^{1}y^{\alpha -1}(1-y)^{\beta -1}\,dy \\
= y^{\alpha -1}\left(\frac{-(1-y)^{\beta }}{\beta }\right)\bigg\vert_{0}^{1}+\frac{\alpha -1}{\beta }\int_{0}^{1}y^{\alpha -2}(1-y)^{\beta}\,dy \\
= 0+\frac{\alpha -1}{\beta }\int_{0}^{1}y^{\alpha -2}(1-y)^{\beta}\,dy \\
= \frac{\alpha -1}{\beta }B(\alpha -1,\beta +1) \\
= \frac{(\alpha -1)(\alpha -2)\cdots 1}{\beta (\beta +1)\cdots(\beta +\alpha -2)}\int_{0}^{1}(1-y)^{\alpha +\beta -2}\,dy \\
= \frac{(\alpha -1)(\alpha -2)\cdots 1}{\beta (\beta +1)\cdots(\beta +\alpha -1)} \\
= \frac{\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}\)

where \( \Gamma(x)\) is the Gamma Function.

\( \Gamma(x)=(x-1)!\) for positive integer \( x\). (The repeated integration by parts above assumed integer parameters; the Gamma-function form holds for all \( \alpha ,\beta >0\).)
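As a quick sanity check of my own (not from the post), the closed form agrees with a direct numerical integration of the defining integral:

```python
from math import gamma

def B(a, b):
    """B(α, β) via the Gamma-function identity derived above."""
    return gamma(a) * gamma(b) / gamma(a + b)

# Midpoint-rule approximation of ∫ y^(α-1) (1-y)^(β-1) dy for α=3, β=5.
n = 100_000
approx = sum(((i + 0.5) / n) ** 2 * (1 - (i + 0.5) / n) ** 4 for i in range(n)) / n
print(B(3, 5), approx)  # both should be close to 1/105
```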

The Beta distribution can express a wide range of different shapes for its pdf; the above graph shows a variety of Beta pdfs.
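The graph itself is not reproduced here, but a few illustrative parameter choices (mine, not the post's) show how varied the shapes can be:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta(a, b) density at x."""
    return x ** (a - 1) * (1 - x) ** (b - 1) * gamma(a + b) / (gamma(a) * gamma(b))

print(beta_pdf(0.5, 1, 1))       # Beta(1, 1) is the uniform distribution: pdf = 1
print(beta_pdf(0.5, 2, 2))       # Beta(2, 2): symmetric bell peaking at 0.5
print(beta_pdf(0.01, 0.5, 0.5))  # Beta(0.5, 0.5): U-shaped, mass piles up near 0 and 1
```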

**Mean**

The expected value of the Beta distribution is \( \frac{\alpha }{\alpha +\beta }\), which explains why coin A and coin B have the same expected value.

\( \eta = \int_{0}^{1}xP(x)\,dx=\frac{1}{B(\alpha ,\beta )}\int_{0}^{1}x^{\alpha }(1-x)^{\beta -1}\,dx=\frac{B(\alpha +1,\beta )}{B(\alpha ,\beta )}=\frac{\Gamma (\alpha +1)\,\Gamma (\alpha +\beta )}{\Gamma (\alpha )\,\Gamma (\alpha +\beta +1)}=\frac{\alpha }{\alpha +\beta }\)

**Variance**

The variance of a Beta distribution is:

\( \mathrm{var}(X)=E[(X-\eta )^2]=\frac{\alpha \beta }{(\alpha +\beta )^2(\alpha +\beta +1)}\)

This explains why, even though the expected values are the same, the dispersion of the Beta distribution becomes smaller and smaller as the number of trials grows.
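To make this concrete, here is a small sketch (again assuming the three coins are modeled as Beta(tails, heads), my reading of the earlier figure): all three share the mean 0.2, while the variance shrinks with sample size.

```python
def beta_mean(a, b):
    """Mean α/(α+β) of Beta(a, b)."""
    return a / (a + b)

def beta_var(a, b):
    """Variance αβ/((α+β)² (α+β+1)) of Beta(a, b)."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

for label, (a, b) in [("coin A (5 tosses)", (1, 4)),
                      ("coin B (10 tosses)", (2, 8)),
                      ("100 tosses", (20, 80))]:
    print(label, beta_mean(a, b), round(beta_var(a, b), 5))
```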

**Conjugate Prior**

One important application of the Beta distribution is that it can serve as a conjugate prior for the binomial distribution in Bayesian analysis.

In Bayesian probability theory, if the posterior distributions $P(h|D)$ are in the same family as the prior probability distribution $P(h)$, the prior and posterior are then called **conjugate distributions**, and the prior is called a **conjugate prior**. A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior: otherwise a difficult numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.

For a binomial distribution, suppose we have already observed \( \alpha\) successes and \( \beta\) failures; we use this information as a prior to model a further \( s\) successes and \( f\) failures.

The prior is a Beta distribution:

\( P(x)=\frac{x^{\alpha -1}(1-x)^{\beta -1}}{B(\alpha ,\beta )}\)

The likelihood is a Binomial distribution:

\( P(s,f \vert x)=\binom{s+f}{s}x^s(1-x)^f\)

The posterior is another Beta distribution:

\( P(x|s,f)=\frac{P(s,f|x)P(x)}{\int P(s,f|x)P(x)\,dx}=\frac{x^{s+\alpha -1}(1-x)^{f+\beta -1}}{B(s+\alpha ,f+\beta )}=\mathrm{Beta}(s+\alpha ,f+\beta )\)

This posterior distribution can then be used as the prior for further samples, with the hyperparameters simply accumulating each extra piece of information as it arrives.
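That sequential updating can be sketched in a few lines (the prior choice and toss counts here are illustrative, not from the post):

```python
def update(a, b, s, f):
    """Beta(a, b) prior + s successes, f failures -> Beta(a+s, b+f) posterior."""
    return a + s, b + f

a, b = 1, 1                  # start from a uniform Beta(1, 1) prior
a, b = update(a, b, 2, 8)    # observe 2 successes and 8 failures
a, b = update(a, b, 18, 72)  # observe 90 more trials
print(a, b, a / (a + b))     # posterior Beta(21, 81), posterior mean 21/102
```

Each batch of data just adds its counts to the hyperparameters, which is exactly the closed-form convenience the conjugacy provides.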