Probability Review (2)

Probability Distributions Review & Cheatsheet (2): Discrete Distributions, Continuous Distributions, Exponential Family Distributions, Conjugate Priors.

1. Discrete Distributions

1.1 Bernoulli Distribution

Suppose \(X\) has a Bernoulli distribution, \(X\sim Bernoulli(p)\). It models a single trial that succeeds with probability \(p\): \(X\) takes the value 1 on success and 0 on failure.

\[
P(X=k) = p^{k}(1-p)^{1-k}, k\in\{0, 1\}.
\]

The expectation and variance of \(Bernoulli(p)\) are \(p\) and \(p(1-p)\), respectively.

1.2 Binomial Distribution

Suppose \(X\) has a Binomial distribution, \(X\sim Binomial(n, p)\). It models the number of successes in \(n\) independent trials, each succeeding with probability \(p\). It can therefore be regarded as the sum of \(n\) i.i.d. \(Bernoulli(p)\) random variables.

\[
P(X=k) = {n \choose k}p^{k}(1-p)^{n-k}, k = 0,1,2,\cdots, n.
\]

The sum of \(m\) independent Binomial random variables with parameters \(n_{i}\) and a common \(p\) is another Binomial random variable with parameters \(\sum^{m}_{i=1}n_{i}\) and \(p\).

The expectation and variance of \(Binomial(n, p)\) are \(np\) and \(np(1-p)\), respectively.
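The PMF and moments above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `binomial_pmf` and the values of `n` and `p` are illustrative assumptions, not part of the original notes.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative parameters (assumed for this example).
n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                           # should be 1
mean = sum(k * q for k, q in enumerate(pmf))               # should be n * p
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # should be n * p * (1 - p)

print(total, mean, var)
```

Summing the PMF over \(k=0,\cdots,n\) recovers total probability 1, and the moment sums agree with \(np\) and \(np(1-p)\) up to floating-point error.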

1.3 Poisson Distribution

Suppose \(X\) has a Poisson distribution, \(X\sim Poisson(\lambda)\). It models the number of times a rare event occurs in a unit of time, given an average rate of \(\lambda\) occurrences per unit time.

\[
P(X=k) = \frac{\lambda^{k}}{k!}e^{-\lambda}, k=0,1,2,\cdots.
\]

The Poisson distribution approximates \(Binomial(n, p)\) when \(n\) is large and \(p\) is very small, with \(\lambda = np\). The sum of \(n\) independent Poisson random variables with parameters \(\lambda_{i}\) is another Poisson random variable with parameter \(\sum^{n}_{i=1}\lambda_{i}\).

The expectation and variance of \(Poisson(\lambda)\) are both \(\lambda\).
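To illustrate the Binomial approximation, the sketch below compares the two PMFs for a large \(n\) and small \(p\), taking \(\lambda = np\). The function names and the specific values of `n` and `p` are assumptions chosen for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k / factorial(k) * exp(-lam)

# Illustrative parameters (assumed): many trials, small success probability.
n, p = 1000, 0.003
lam = n * p

# Largest pointwise difference between the two PMFs over k = 0..20.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print(max_gap)
```

The gap shrinks further as \(n\) grows with \(np\) held fixed.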

1.4 Geometric Distribution

Suppose \(X\) has a Geometric distribution, \(X\sim Geometric(p)\). It models the number of independent trials needed to obtain the first success. There are two conventions: (1) \(X\) is the number of trials before the first success, i.e. the total number of failures before the first success; (2) \(X\) is the total number of trials up to and including the first success. The distribution under the second convention is a "shifted" version of the first (shifted by one). The PMF under the first convention is
\[
P(X=k) = (1-p)^{k}p, k=0,1,2,\cdots.
\]

Under the first convention, the expectation and variance of the Geometric distribution with parameter \(p\) are

\[
E(X) = \frac{1-p}{p},~~ Var(X) = \frac{1-p}{p^{2}}.
\]

If the Geometric distribution counts the total number of trials (second convention), it is the first convention shifted by one, so the expectation increases by one, i.e.
\[
E(X) = \frac{1}{p},
\]

and the variance remains the same.
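The relationship between the two conventions can be checked with a truncated sum over the PMF. This is a rough sketch: the helper name, the value of `p`, and the truncation point `K` are assumptions for illustration.

```python
def geometric_failures_pmf(k: int, p: float) -> float:
    """P(X = k) where X counts failures before the first success (k = 0, 1, 2, ...)."""
    return (1 - p) ** k * p

p = 0.25   # illustrative success probability (assumed)
K = 2000   # truncation point; the neglected tail mass (1 - p)**K is negligible here

pmf = [geometric_failures_pmf(k, p) for k in range(K)]

# Convention 1: X = number of failures before the first success.
mean_failures = sum(k * q for k, q in enumerate(pmf))
var_failures = sum((k - mean_failures) ** 2 * q for k, q in enumerate(pmf))

# Convention 2: Y = X + 1 = total number of trials until the first success.
mean_trials = sum((k + 1) * q for k, q in enumerate(pmf))
var_trials = sum((k + 1 - mean_trials) ** 2 * q for k, q in enumerate(pmf))

print(mean_failures, var_failures)  # close to (1 - p) / p and (1 - p) / p**2
print(mean_trials, var_trials)      # close to 1 / p and (1 - p) / p**2
```

Shifting by a constant moves the mean by that constant and leaves the variance unchanged, which is what the two printed lines show.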

1.5 Negative Binomial Distribution

Suppose \(X\) has a Negative Binomial distribution, \(X\sim NegBin(r, p)\). It models the number of failures before the \(r\)-th success.

\[
P(X=k) = {r+k-1\choose k}(1-p)^{k}p^{r}, k=0,1,2,\cdots.
\]

Just as i.i.d. Bernoulli random variables sum to a Binomial random variable, \(r\) i.i.d. \(Geometric(p)\) random variables (each counting failures before a success) sum to a \(NegBin(r, p)\) random variable.

The expectation and variance of \(NegBin(r, p)\) are

\[
E(X) = \frac{r(1-p)}{p},~~Var(X) = \frac{r(1-p)}{p^{2}}.
\]
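As a numerical check, the sketch below convolves \(r\) copies of the Geometric PMF (failures convention) and compares the result with the Negative Binomial PMF, then recovers the mean and variance from the PMF. The helper names, parameter values, and truncation point `K` are assumptions made for this example.

```python
from math import comb, isclose

def geometric_pmf(k: int, p: float) -> float:
    """P(X = k), X = number of failures before the first success."""
    return (1 - p) ** k * p

def negbin_pmf(k: int, r: int, p: float) -> float:
    """P(X = k), X = number of failures before the r-th success."""
    return comb(r + k - 1, k) * (1 - p) ** k * p**r

def convolve(a: list, b: list) -> list:
    """PMF of the sum of two independent non-negative integer variables.

    Entry k only needs a[0..k] and b[0..k], so truncating to len(a) terms is exact.
    """
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(len(a))]

r, p, K = 3, 0.4, 200  # illustrative values (assumed)

geom = [geometric_pmf(k, p) for k in range(K)]
summed = geom
for _ in range(r - 1):
    summed = convolve(summed, geom)  # PMF of the sum of r i.i.d. geometrics

matches = all(isclose(summed[k], negbin_pmf(k, r, p)) for k in range(K))

mean = sum(k * negbin_pmf(k, r, p) for k in range(K))
var = sum((k - mean) ** 2 * negbin_pmf(k, r, p) for k in range(K))
print(matches, mean, var)  # mean near r(1-p)/p, var near r(1-p)/p**2
```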

2. Continuous Distributions

2.1 Uniform Distribution

Suppose \(X\) has a Uniform distribution, \(X\sim Unif(\alpha, \beta)\), with PDF

\[
f(x) = \left\{
\begin{aligned}
&\frac{1}{\beta-\alpha}, & \text{for }\alpha<x<\beta \\
&0, & \text{otherwise}
\end{aligned}\right..
\]

The expectation and variance of \(Unif(\alpha, \beta)\) are

\[
E(X)=\frac{\alpha+\beta}{2},~~ Var(X) = \frac{(\beta - \alpha)^{2}}{12}.
\]

2.2 Exponential Distribution

Suppose \(X\) has an Exponential distribution, \(X\sim Exp(\lambda)\), with PDF

\[
f(x) = \left\{
\begin{aligned}
&\lambda e^{-\lambda x}, & \text{for }x > 0 \\
&0, & \text{otherwise}
\end{aligned}\right..
\]

The PDF is strictly decreasing on \(x > 0\), with decay rate \(\lambda\). For \(x > 0\), the CDF is given by

\[
F(x) = \int^{x}_{0} \lambda e^{-\lambda t} dt = 1 - e^{-\lambda x}.
\]

The Exponential distribution can be used to model lifetimes and the time between events. Its expectation and variance are

\[
E(X) = \frac{1}{\lambda}, ~~ Var(X) = \frac{1}{\lambda^{2}}.
\]

The Exponential distribution has the memoryless property, i.e. \(P(X > s + t | X > s) = P(X > t)\) for all \(s, t \geq 0\).
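The memoryless property follows from the survival function \(P(X > x) = 1 - F(x) = e^{-\lambda x}\), since \(e^{-\lambda(s+t)}/e^{-\lambda s} = e^{-\lambda t}\). A small numerical sketch, with assumed helper names and illustrative values of \(\lambda\), \(s\) and \(t\):

```python
from math import exp

def exp_cdf(x: float, lam: float) -> float:
    """F(x) = 1 - exp(-lam * x) for x > 0, and 0 otherwise."""
    return 1.0 - exp(-lam * x) if x > 0 else 0.0

def survival(x: float, lam: float) -> float:
    """P(X > x) = 1 - F(x)."""
    return 1.0 - exp_cdf(x, lam)

lam, s, t = 0.5, 2.0, 3.0  # illustrative values (assumed)

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, lam) / survival(s, lam)
unconditional = survival(t, lam)

print(conditional, unconditional)  # the two agree up to rounding
```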

2.3 Gamma Distribution

Suppose \(X\) has a Gamma distribution with shape \(\alpha\) and rate \(\lambda\), \(X\sim Gamma(\alpha, \lambda)\), with PDF

\[
f(x) = \left\{
\begin{aligned}
&\frac{\lambda^{\alpha}}{\Gamma(\alpha)}e^{-\lambda x}x^{\alpha-1}, & \text{for }x > 0 \\
&0, & \text{otherwise}
\end{aligned}\right..
\]

It is easy to see that when \(\alpha=1\) this reduces to the Exponential distribution \(Exp(\lambda)\). Here the Gamma function is defined as

\[
\Gamma(\alpha) = \int^{\infty}_{0}x^{\alpha-1}e^{-x} dx,
\]

which satisfies the recursion \(\Gamma(\alpha) = (\alpha-1)\Gamma(\alpha-1)\) for \(\alpha > 1\), and hence \(\Gamma(\alpha) = (\alpha - 1)!\) when \(\alpha\) is a positive integer.

Also, \(Gamma(n/2, 1/2)\) is the \(\chi^{2}(n)\) distribution.
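These identities are easy to verify with `math.gamma` from the Python standard library. The sketch below is illustrative only: the helper names `gamma_pdf`, `exp_pdf` and `chi2_pdf` and the test points are assumptions, and the chi-square density is written out from its standard formula \(x^{n/2-1}e^{-x/2}/(2^{n/2}\Gamma(n/2))\).

```python
from math import exp, factorial, gamma, isclose

def gamma_pdf(x: float, alpha: float, lam: float) -> float:
    """Gamma(alpha, lam) density with shape alpha and rate lam."""
    if x <= 0:
        return 0.0
    return lam**alpha / gamma(alpha) * exp(-lam * x) * x ** (alpha - 1)

def exp_pdf(x: float, lam: float) -> float:
    """Exp(lam) density."""
    return lam * exp(-lam * x) if x > 0 else 0.0

def chi2_pdf(x: float, n: int) -> float:
    """Chi-square density with n degrees of freedom (standard formula)."""
    if x <= 0:
        return 0.0
    return x ** (n / 2 - 1) * exp(-x / 2) / (2 ** (n / 2) * gamma(n / 2))

# Gamma(m) = (m - 1)! for positive integers m.
factorial_ok = all(isclose(gamma(m), factorial(m - 1)) for m in range(1, 10))

# Recursion Gamma(a) = (a - 1) * Gamma(a - 1) for a non-integer a.
a = 3.7
recursion_ok = isclose(gamma(a), (a - 1) * gamma(a - 1))

xs = [0.1, 0.5, 1.0, 2.5, 7.0]  # illustrative evaluation points (assumed)

# alpha = 1 gives the Exponential density.
exp_ok = all(isclose(gamma_pdf(x, 1.0, 2.0), exp_pdf(x, 2.0)) for x in xs)

# Gamma(n/2, 1/2) matches the chi-square density with n degrees of freedom.
chi2_ok = all(isclose(gamma_pdf(x, n / 2, 0.5), chi2_pdf(x, n))
              for n in (1, 2, 5) for x in xs)

print(factorial_ok, recursion_ok, exp_ok, chi2_ok)
```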

2.4 Beta Distribution

Suppose \(X\) has a Beta distribution, \(X\sim Beta(\alpha, \beta)\), with PDF

\[
f(x) = \left\{
\begin{aligned}
&\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta -1}, & \text{for }x \in [0, 1] \\
&0, & \text{otherwise}
\end{aligned}\right..
\]

where

\[
B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.
\]
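The role of \(B(\alpha, \beta)\) as a normalizing constant can be seen numerically. The sketch below integrates the density with a simple midpoint rule and also computes the mean, which for the Beta distribution is \(\alpha/(\alpha+\beta)\). The helper names, the parameter values, and the number of integration steps are assumptions for illustration.

```python
from math import gamma

def beta_fn(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x: float, a: float, b: float) -> float:
    """Beta(a, b) density on [0, 1]."""
    if not 0.0 <= x <= 1.0:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def midpoint_integral(f, lo: float, hi: float, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

a, b = 2.5, 4.0  # illustrative shape parameters (assumed)

area = midpoint_integral(lambda x: beta_pdf(x, a, b), 0.0, 1.0)
mean = midpoint_integral(lambda x: x * beta_pdf(x, a, b), 0.0, 1.0)

print(area)  # close to 1
print(mean)  # close to a / (a + b)
```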

2.5 Normal Distribution

Suppose \(X\) has a Normal distribution, \(X\sim N(\mu, \sigma^{2})\), with PDF

\[
f(x) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}, x\in \mathbb{R}.
\]

2.6 Exponential Family Distributions

The exponential family is a class of distributions that share a common form. The Bernoulli, Poisson, Exponential, Gamma, Beta and Normal distributions all belong to it. The general form is

\[
p(x|\theta) = \frac{h(x)}{Z(\theta)}e^{S(x)^{\top}\theta}.
\]

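Here \(S(x)\) is the sufficient statistic, \(\theta\) is the natural parameter, \(h(x)\) is a base measure that does not depend on \(\theta\), and \(Z(\theta)\) is the normalizing constant. The data \(x\) and the parameter \(\theta\) interact only through the linear term \(S(x)^{\top}\theta\) in the exponent. Given i.i.d. samples \(x_{1},\cdots,x_{n}\), the MLE \(\hat{\theta}\) satisfies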

\[
\frac{\partial}{\partial\theta}\log Z(\hat{\theta}) = \frac{1}{n}\sum^{n}_{i=1}S(x_{i}) = E_{p(x|\hat{\theta})}\left[S(x)\right].
\]
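As a concrete instance, the Bernoulli distribution can be written in this form with \(h(x)=1\), \(S(x)=x\), \(\theta=\log\frac{p}{1-p}\) and \(Z(\theta)=1+e^{\theta}\). The moment-matching condition then reads \(e^{\hat{\theta}}/(1+e^{\hat{\theta}}) = \bar{x}\). The sketch below checks this on a small made-up sample; the data and function names are assumptions for illustration only.

```python
from math import exp, log

def log_Z(theta: float) -> float:
    """Log-normalizer of the Bernoulli in natural form: Z(theta) = 1 + e^theta."""
    return log(1.0 + exp(theta))

def grad_log_Z(theta: float) -> float:
    """d/dtheta log Z(theta) = e^theta / (1 + e^theta), which equals E[S(x)]."""
    return exp(theta) / (1.0 + exp(theta))

def avg_log_likelihood(theta: float, data: list) -> float:
    """(1/n) * sum_i log p(x_i | theta), using h(x) = 1 and S(x) = x."""
    mean_S = sum(data) / len(data)
    return theta * mean_S - log_Z(theta)

data = [1, 0, 1, 1, 0, 1, 0, 1]     # made-up sample (assumed)
mean_S = sum(data) / len(data)       # average sufficient statistic

# Solving grad_log_Z(theta) = mean_S gives the logit of the sample mean.
theta_hat = log(mean_S / (1.0 - mean_S))

moment_gap = abs(grad_log_Z(theta_hat) - mean_S)

# The likelihood at theta_hat is at least as large as at nearby values.
is_local_max = all(
    avg_log_likelihood(theta_hat, data) >= avg_log_likelihood(theta_hat + d, data)
    for d in (-0.1, 0.1)
)

print(theta_hat, moment_gap, is_local_max)
```

Mapping back with \(p = e^{\hat{\theta}}/(1+e^{\hat{\theta}})\) gives the familiar estimate \(\hat{p} = \bar{x}\).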

3. Conjugate Prior (Bayesian Statistics)

This section lists some commonly used conjugate priors.

3.1 Model Binomial Data

If \(\theta\sim Beta(\alpha, \beta)\) and \(y|\theta\sim Binomial(n, \theta)\), then \(\theta|y\sim Beta(\alpha + y, \beta + n - y)\).

The Beta prior is conjugate for the Binomial likelihood, meaning that the posterior has the same parametric form as the prior. The Beta prior can be interpreted as "prior data" consisting of \(\alpha\) successes in \(\alpha+\beta\) trials.

The posterior mean is a weighted average of the prior mean \(\alpha/(\alpha+\beta)\) and the sample proportion \(y/n\) (the maximum likelihood estimate):

\[
E(\theta|y) = \frac{\alpha+\beta}{\alpha+\beta+n}\frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n}\frac{y}{n}
\]
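The update and the weighted-average identity can be sketched in a few lines. The function name `beta_binomial_update` and the prior and data values are hypothetical, chosen only to illustrate the formulas above.

```python
def beta_binomial_update(alpha: float, beta: float, y: int, n: int):
    """Posterior Beta parameters after observing y successes in n trials."""
    return alpha + y, beta + n - y

# Illustrative prior and data (assumed): Beta(2, 3) prior, 14 successes in 20 trials.
alpha, beta, n, y = 2.0, 3.0, 20, 14

post_alpha, post_beta = beta_binomial_update(alpha, beta, y, n)
posterior_mean = post_alpha / (post_alpha + post_beta)

# Weighted average of the prior mean and the sample proportion.
w_prior = (alpha + beta) / (alpha + beta + n)
w_data = n / (alpha + beta + n)
weighted_average = w_prior * alpha / (alpha + beta) + w_data * y / n

print(post_alpha, post_beta)             # 16.0 9.0
print(posterior_mean, weighted_average)  # both approximately 0.64
```

As \(n\) grows, the weight on the data approaches 1 and the posterior mean moves toward \(y/n\).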

3.2 Model Event Count Data

If \(\theta\sim Gamma(\alpha, \beta)\) (shape \(\alpha\), rate \(\beta\)) and \(y_{1},\cdots,y_{n}|\theta\) are i.i.d. \(Poisson(\theta)\), then
\[
\theta|y_{1},\cdots,y_{n}\sim Gamma(\alpha+n\bar{y}, \beta + n).
\]

The posterior mean is a weighted average of the prior mean \(\alpha/\beta\) and the sample mean \(\bar{y}\) (the maximum likelihood estimate):

\[
E(\theta|y) = \frac{\beta}{\beta+n}\frac{\alpha}{\beta} + \frac{n}{\beta+n}\bar{y}.
\]
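The same pattern applies to count data. Below is a minimal sketch under the shape and rate parameterization used above; the function name `gamma_poisson_update` and the counts are made up for illustration.

```python
def gamma_poisson_update(alpha: float, beta: float, counts: list):
    """Posterior Gamma (shape, rate) after observing i.i.d. Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

# Illustrative prior and data (assumed): Gamma(3, 2) prior and five observed counts.
alpha, beta = 3.0, 2.0
counts = [4, 2, 5, 3, 6]
n = len(counts)
y_bar = sum(counts) / n

post_alpha, post_beta = gamma_poisson_update(alpha, beta, counts)
posterior_mean = post_alpha / post_beta  # mean of Gamma(shape, rate) is shape / rate

# Weighted average of the prior mean alpha / beta and the sample mean.
weighted_average = beta / (beta + n) * (alpha / beta) + n / (beta + n) * y_bar

print(post_alpha, post_beta)             # 23.0 7.0
print(posterior_mean, weighted_average)  # both approximately 23 / 7
```

Here \(\beta\) plays the role of a prior sample size and \(\alpha\) the role of a prior total count, mirroring the "prior data" reading of the Beta prior.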