
Statistical Inference Review

Statistical Inference Quick Review & Cheatsheet: Law of Large Numbers, Central Limit Theorem, Sampling Distributions, Estimation, Hypothesis Test.

Statistical inference is a procedure that produces probabilistic statements about some or all parts of a statistical model.

1. Preliminary

  • Law of Large Numbers

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be i.i.d. with mean \(\mu\) and variance \(\sigma^{2}\). The LLN states that as the sample size increases, the sample mean converges (in probability) to the population mean, i.e.

\[
P\left( \Bigg| \mu - \frac{1}{n}\sum^{n}_{i=1} X_{i} \Bigg| < \epsilon \right)\rightarrow 1, \text{ for any }\epsilon >0 \text{ as } n\rightarrow\infty.
\]
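
A quick simulation illustrates this; below is a minimal sketch (using NumPy; the Exponential(1) population is an arbitrary choice) showing the running sample mean approaching the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0                                        # population mean of Exponential(1)
x = rng.exponential(scale=1.0, size=100_000)

# running sample mean (1/n) * sum_{i<=n} X_i for n = 1, ..., N
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 1_000, 100_000):
    print(n, abs(running_mean[n - 1] - mu))     # the gap shrinks as n grows
```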

  • Central Limit Theorem

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be i.i.d. with mean \(\mu\) and variance \(\sigma^{2}\). The CLT states that as the sample size increases, the distribution of the normalized sample mean approaches the standard normal distribution, i.e.

\[
\lim_{n\rightarrow\infty}P\left( \frac{\bar{X}_{n} - \mu}{\sigma/\sqrt{n}} \leq x \right) = \Phi(x)
\]

where \(\Phi(x)\) is the CDF of standard normal distribution.
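
A minimal simulation sketch (NumPy/SciPy; Uniform(0, 1) samples and \(n = 50\) are arbitrary choices) comparing the empirical CDF of the normalized sample mean with \(\Phi\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 50, 20_000
mu, sigma = 0.5, np.sqrt(1 / 12)        # mean and std of Uniform(0, 1)

# normalized sample means (X_bar - mu) / (sigma / sqrt(n)) over many replications
samples = rng.uniform(size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# empirical P(Z <= 1) should be close to Phi(1) ~ 0.8413
print((z <= 1.0).mean(), norm.cdf(1.0))
```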

  • Random Variables

A random variable is a function that maps the sample space \(S\) (the set of possible outcomes) to the real numbers, \(X: S\rightarrow R\).

  • Statistic of Observed Data

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be a random sample from a population and let \(r\) be an arbitrary real-valued function; then the random variable \(T=r(X_{1},X_{2},\cdots,X_{n})\) is called a statistic.

  • Unbiased Statistic

A statistic whose expectation equals the true value of the parameter is an unbiased statistic (or estimator), i.e.

\[
E\left[ r(X_{1}, X_{2}, \cdots, X_{n}) \right] = \theta,
\]

where \(r(X_{1}, X_{2}, \cdots, X_{n})\) is used to estimate \(\theta\).
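
For example, the sample mean is an unbiased estimator of the population mean \(\mu\):

\[
E\left[\bar{X}_{n}\right] = \frac{1}{n}\sum^{n}_{i=1}E[X_{i}] = \mu.
\]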

  • Consistent Statistic

A statistic that converges to the true value of the parameter with probability 1 as the sample size increases is a consistent statistic, i.e.

\[
\hat{\theta} = r(X_{1}, X_{2}, \cdots, X_{n}) \rightarrow \theta \text{ as }n\rightarrow \infty.
\]

  • Sufficient Statistic

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be a random sample from \(f(x;\theta)\) and let \(Y=r(X_{1}, X_{2}, \cdots, X_{n})\) have density \(g(y;\theta)\); then \(Y\) is a sufficient statistic for \(\theta\) if and only if

\[
\frac{f(x_{1};\theta) f(x_{2};\theta) \cdots f(x_{n};\theta)}{g(r(x_{1}, x_{2}, \cdots, x_{n}); \theta)} = H(x_{1}, x_{2}, \cdots, x_{n}),
\]

where \(H(x_{1}, x_{2}, \cdots, x_{n})\) does not depend on \(\theta\).

Intuitively, a sufficient statistic reduces the data while keeping all the information about the parameter of the data distribution.
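
For example (the standard Bernoulli case), let \(X_{1}, \cdots, X_{n}\) be i.i.d. \(Bernoulli(p)\) and \(Y = \sum^{n}_{i=1}X_{i} \sim Binomial(n, p)\); writing \(y = \sum^{n}_{i=1}x_{i}\),

\[
\frac{\prod^{n}_{i=1}p^{x_{i}}(1-p)^{1-x_{i}}}{\binom{n}{y}p^{y}(1-p)^{n-y}} = \frac{p^{y}(1-p)^{n-y}}{\binom{n}{y}p^{y}(1-p)^{n-y}} = \frac{1}{\binom{n}{y}},
\]

which does not depend on \(p\), so \(Y\) is sufficient for \(p\).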

2. Parameter Estimation

Estimate the parameters of the data distribution from the observed data.

2.1 Point Estimation

  • Maximum Likelihood Estimation

Estimate the parameters by maximizing the likelihood function; the likelihood function measures the “likelihood” of the parameters given the observed data:

\[
\hat{\theta} = \arg\max_{\theta} L(\theta; x).
\]

The MLE satisfies the invariance principle: if \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\).
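
As a quick worked example (the standard Bernoulli case), the log-likelihood and its maximizer are

\[
\ell(p; x) = \sum^{n}_{i=1}\left[x_{i}\log p + (1-x_{i})\log(1-p)\right], \quad \frac{d\ell}{dp} = \frac{\sum_{i}x_{i}}{p} - \frac{n-\sum_{i}x_{i}}{1-p} = 0 \;\Rightarrow\; \hat{p} = \frac{1}{n}\sum^{n}_{i=1}x_{i}.
\]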

  • Moments Estimation

Equate the sample moments to the theoretical moments and solve for the unknown parameters to get the estimates:

\[
\frac{1}{n}\sum^{n}_{i=1}\left( X_{i} - \bar{X}_{n} \right)^{k} = E(X-\mu)^{k}.
\]
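
For example, for \(N(\mu, \sigma^{2})\), matching the first moment and the second central moment gives

\[
\hat{\mu} = \bar{X}_{n}, \quad \hat{\sigma}^{2} = \frac{1}{n}\sum^{n}_{i=1}\left(X_{i} - \bar{X}_{n}\right)^{2}.
\]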

2.2 Interval Estimation

  • Sampling distributions

A sampling distribution is the distribution of a statistic. Three important sampling distributions:

Chi-square distribution. If \(Z\sim N(0,1)\) is standard normal, then \(Z^{2}\sim \chi^{2}(1)\), a chi-square distribution with 1 degree of freedom; the sum of independent chi-square random variables is still chi-square, with the degrees of freedom added.

t distribution. Let \(Z\sim N(0,1)\) be standard normal and \(V\sim\chi^{2}(n)\) be chi-square with \(n\) degrees of freedom; if \(Z\) and \(V\) are independent, then

\[
T = \frac{Z}{\sqrt{V/n}}
\]

follows a t distribution with \(n\) degrees of freedom.

F distribution. Given two independent chi-square random variables \(V\sim\chi^{2}(n_{1})\) and \(U\sim\chi^{2}(n_{2})\), then

\[
F = \frac{V/n_{1}}{U/n_{2}}
\]

follows an F distribution with degrees of freedom \(n_{1}\) and \(n_{2}\).

Summary of sampling distributions:

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be a random sample from \(N(\mu, \sigma^{2})\); the sample mean and (unbiased) sample variance are:

\[
\bar{X}_{n} = \frac{1}{n}\sum^{n}_{i=1}X_{i}, \quad s_{n}^{2} = \frac{1}{n-1}\sum^{n}_{i=1}\left(X_{i} - \bar{X}_{n} \right)^{2}.
\]

(1) The sample mean also follows a normal distribution:
\[
\bar{X}_{n}\sim N\left( \mu, \frac{\sigma^{2}}{n} \right).
\]

(2) \(\bar{X}_{n}\) and \(s_{n}^{2}\) are independent.

(3) \(s_{n}^{2}\) relates to the chi-square distribution:
\[
\frac{(n-1)s_{n}^{2}}{\sigma^{2}}\sim \chi^{2}(n-1).
\]

(4) Normalizing the sample mean by the sample standard deviation gives a t distribution:
\[
\frac{\bar{X}_{n} - \mu}{s_{n}/\sqrt{n}} \sim t(n-1).
\]
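
A small Monte Carlo sketch (NumPy/SciPy; \(\mu = 5\), \(\sigma = 2\), \(n = 10\) are arbitrary choices) checking items (3) and (4) via tail probabilities:

```python
import numpy as np
from scipy.stats import chi2, t

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 10, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)                        # unbiased sample variance

# (3) (n-1) s^2 / sigma^2 ~ chi2(n-1): upper 5% tail should be hit ~5% of the time
stat3 = (n - 1) * s2 / sigma**2
print((stat3 > chi2.ppf(0.95, df=n - 1)).mean())  # ~ 0.05

# (4) (xbar - mu) / (s / sqrt(n)) ~ t(n-1): same check with the t quantile
stat4 = (xbar - mu) / (np.sqrt(s2) / np.sqrt(n))
print((stat4 > t.ppf(0.95, df=n - 1)).mean())     # ~ 0.05
```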

  • Confidence Interval

A CI is an interval estimate for a parameter: an interval that might contain the true value of the parameter. A \(100(1-\alpha)\%\) confidence interval \([f_{1}(x), f_{2}(x)]\) for \(\theta\) means

\[
P\left( f_{1}(x) < \theta < f_{2}(x) \right) = 1 - \alpha,
\]

intuitively, if we keep drawing samples and computing this CI, \(100(1-\alpha)\%\) of these CIs will contain the true value of the parameter.

The general procedure to get a \(100(1-\alpha)\%\) confidence interval is:
(1) find a pivotal quantity \(r(x;\theta)\) (a function of the data and \(\theta\)) whose distribution doesn't depend on \(\theta\), such as a standard normal or t distribution;
(2) find \(a\) and \(b\) (quantiles of that distribution) such that \(P(a < r(x;\theta) < b) = 1-\alpha\), then solve for \(\theta\) to derive the CI.

The four items in the sampling distribution summary and the central limit theorem are the theoretical basis for deriving confidence intervals.

  • Frequently Used Confidence Interval

    • Binomial Proportion

    Suppose \(X_{1},\cdots,X_{n}\) are i.i.d. \(Bernoulli(p)\) and \(Y = \sum^{n}_{i=1}X_{i}\); the estimated proportion is \(\hat{p} = Y / n\). Based on the CLT:

    \[
    \frac{Y/n - p}{\sqrt{\frac{p(1-p)}{n}}}\rightarrow N(0,1),
    \]
    therefore, replacing the unknown \(p\) by \(\hat{p}\) in the standard error, the \(100(1-\alpha)\%\) CI for \(p\) is
    \[
    p\in\left[\hat{p} - Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p} + Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right].
    \]

    • Normal Distribution Mean (unknown population variance)

    Suppose \(X_{1},\cdots,X_{n}\) are i.i.d. \(N(\mu, \sigma^{2})\), and
    \[
    \bar{X} = \frac{1}{n}\sum^{n}_{i=1}X_{i}, s^{2} = \frac{1}{n-1}\sum^{n}_{i=1}(X_{i} - \bar{X})^{2},
    \]
    then, applying the properties of the sampling distributions,
    \[
    T = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)s^{2}}{\sigma^{2}}\frac{1}{(n-1)}}} = \frac{\bar{X} - \mu}{s / \sqrt{n}}\sim t(n-1)
    \]
    therefore the \(100(1-\alpha)\%\) CI for \(\mu\) is
    \[
    \mu\in\left[\bar{X} - \frac{s}{\sqrt{n}}t_{\alpha/2}(n-1), \bar{X} + \frac{s}{\sqrt{n}}t_{\alpha/2}(n-1)\right].
    \]
    \]
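
A minimal sketch of both intervals in Python (SciPy quantiles; the data below are simulated and the 95% level is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(3)
alpha = 0.05

# Binomial proportion (Wald interval, plugging p_hat into the standard error)
x = rng.binomial(1, 0.3, size=200)          # simulated Bernoulli(0.3) sample
p_hat, n = x.mean(), x.size
z = norm.ppf(1 - alpha / 2)
half = z * np.sqrt(p_hat * (1 - p_hat) / n)
print("proportion CI:", (p_hat - half, p_hat + half))

# Normal mean with unknown variance (t interval)
y = rng.normal(10.0, 2.0, size=30)          # simulated N(10, 4) sample
ybar, s, m = y.mean(), y.std(ddof=1), y.size
tq = t.ppf(1 - alpha / 2, df=m - 1)
half = tq * s / np.sqrt(m)
print("mean CI:", (ybar - half, ybar + half))
```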

3. Hypothesis Test

Hypothesis testing is a procedure for assessing assertions about data distributions based on observed data.

3.1 Basic Elements

  • Null Hypothesis \(H_{0}\) and Alternative Hypothesis \(H_{A}\)

(1) The null hypothesis is the default statement, which tends to be believed.
(2) The alternative hypothesis is by default not believed unless there is evidence in its favor.

  • Type I Error and Type II Error

(1) Type I error: rejecting \(H_{0}\) when \(H_{0}\) is true; its probability is \(P(\text{reject }H_{0} | H_{0}\text{ is true})\).
(2) Type II error: failing to reject \(H_{0}\) when \(H_{A}\) is true; its probability is \(P(\text{fail to reject }H_{0} | H_{A}\text{ is true})\).

  • Rejection Region

A subset of the space of outcomes that leads to rejecting \(H_{0}\).

  • Power of Test

Power is the probability of rejecting \(H_{0}\) when \(H_{A}\) is true, i.e.

\[
\text{power} = P(\text{reject }H_{0} | H_{A}\text{ is true}) = 1 - P(\text{type II error}),
\]

therefore as the power of the test increases, the type II error decreases.
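
As an illustration (a hand-calculation sketch with assumed numbers: one-sided z-test of \(H_{0}:\mu=0\) vs \(H_{A}:\mu=0.5\), known \(\sigma=1\), \(n=25\), \(\alpha=0.05\)):

```python
from math import sqrt
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 0.0, 0.5, 1.0, 25, 0.05
se = sigma / sqrt(n)

# rejection region {xbar > c}, with c chosen so the type I error equals alpha
c = mu0 + norm.ppf(1 - alpha) * se

# power = P(xbar > c | mu = mu1); type II error = 1 - power
power = 1 - norm.cdf((c - mu1) / se)
print(c, power)                              # c ~ 0.33, power ~ 0.80
```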

  • Size of Test

Note this does not refer to the sample size needed for the test; it is the probability of rejecting \(H_{0}\) when \(H_{0}\) is true, i.e. the type I error.

  • Significance level

Usually specified before the test, as the maximum allowable type I error probability.

  • p-value

Assuming \(H_{0}\) is true, the p-value is the probability (area) of the smallest rejection region that contains the observed value; if this area is smaller than the significance level, the observed value must lie in the rejection region of size \(\alpha\).

  • Procedure of Hypothesis Test

(1) state the null hypothesis and alternative hypothesis;
(2) specify the significance level \(\alpha\);
(3) construct a test;
(4) perform the test;
(5) draw conclusions.

Constructing and performing the test can be done either by
(1) constructing the rejection region for the specified \(\alpha\) and checking whether the observation falls in it; or
(2) finding the smallest rejection region that contains the observation and comparing its probability (the p-value) with \(\alpha\).
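
A minimal sketch of route (2) using SciPy's one-sample t-test (the data are simulated; \(H_{0}: \mu = 10\) and \(\alpha = 0.05\) are assumed for illustration):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(4)
x = rng.normal(10.5, 2.0, size=40)           # simulated sample

alpha = 0.05
result = ttest_1samp(x, popmean=10.0)        # H0: mu = 10 vs HA: mu != 10
print(result.statistic, result.pvalue)

# reject H0 when the p-value falls below the significance level
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```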

  • Simple and Composite Hypotheses

(1) A simple hypothesis completely specifies the distribution, e.g. \(\mu=3\);
(2) a composite hypothesis does not, e.g. \(\mu > 3\).

3.2 Constructing Hypothesis Test

  • Best Rejection Region of size \(\alpha\)

\(C\) is the best rejection region of size \(\alpha\) for testing a simple \(H_{0}: \theta = \theta'\) against a simple \(H_{A}: \theta = \theta''\) if

\[
P(X\in C|H_{0}\text{ is true}) = \alpha,
\]

which means the type I error is fixed; and for every subset \(A\) such that

\[
P(X\in A|H_{0}\text{ is true}) = \alpha
\]

we have

\[
P(X\in C|H_{A}\text{ is true}) \geq P(X\in A|H_{A}\text{ is true}),
\]

which means

\[
P(X\in C^{c}|H_{A}\text{ is true}) \leq P(X\in A^{c}|H_{A}\text{ is true}),
\]

the left-hand side is the type II error, so the best rejection region minimizes the type II error with the type I error fixed.

  • Neyman-Pearson Lemma

Let \(X_{1}, X_{2}, \cdots, X_{n}\) be i.i.d. with density \(f(x;\theta)\). For testing a simple \(H_{0}: \theta = \theta'\) against a simple \(H_{A}: \theta = \theta''\), let \(C\) be the rejection region such that

\[
\begin{aligned}
\frac{L(\theta')}{L(\theta'')} &\leq k \text{ for all } x\in C \\
\frac{L(\theta')}{L(\theta'')} &\geq k \text{ for all } x\in C^{c} \\
\end{aligned}
\]

where \(L\) is the likelihood function; then \(C\) is the best rejection region. Intuitively, the N-P Lemma finds the best rejection region for a simple vs. simple test.

For other tests:
(1) the Uniformly Most Powerful (UMP) test finds the best rejection region for a simple vs. composite test;
(2) the Likelihood Ratio Test is used to construct rejection regions for composite vs. composite tests.
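
A standard example: for \(X_{1}, X_{2}, \cdots, X_{n}\) i.i.d. \(N(\theta, \sigma^{2})\) with \(\sigma^{2}\) known, testing \(H_{0}:\theta=\theta'\) against \(H_{A}:\theta=\theta''\) with \(\theta''>\theta'\), the likelihood ratio condition reduces to a threshold on the sample mean:

\[
\frac{L(\theta')}{L(\theta'')} = \exp\left\{-\frac{1}{2\sigma^{2}}\left[\sum^{n}_{i=1}(x_{i}-\theta')^{2} - \sum^{n}_{i=1}(x_{i}-\theta'')^{2}\right]\right\} \leq k \;\Longleftrightarrow\; \bar{x}_{n} \geq c,
\]

so the best rejection region is \(\{\bar{x}_{n} \geq c\}\), where \(c\) is chosen such that \(P(\bar{X}_{n} \geq c \mid \theta = \theta') = \alpha\).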