STATISTICS FORMULA REFERENCE
A comprehensive review guide – theory and formulas only

Topics: Shapiro-Wilk · Levene · t-test · z-score · ANOVA · Chi-square · A/B testing · Regression · Bayesian · Power analysis
1. Descriptive Statistics

Measures of Central Tendency

| Measure | Formula | When? |
|---|---|---|
| Mean | $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ | Symmetric distributions |
| Median | Middle of sorted data | Skewed data / outliers |
| Mode | Most frequent value | Categorical data |
Measures of Spread

Sample Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Standard Deviation
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

💡 Bessel's correction ($n-1$): since the sample mean is estimated from the data, one degree of freedom is lost.
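A minimal sketch of the formulas above using only Python's standard library; the data values are invented for illustration.

```python
# Sample mean, Bessel-corrected variance, and standard deviation.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)      # x-bar
var = statistics.variance(data)   # divides by n-1 (Bessel's correction)
std = statistics.stdev(data)      # square root of the sample variance

print(mean, var, std)
```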
Shape Measures

| Measure | = 0 | > 0 | < 0 |
|---|---|---|---|
| Skewness | Symmetric | Right-skewed | Left-skewed |
| Kurtosis (excess) | Normal (mesokurtic) | Heavy-tailed (leptokurtic) | Light-tailed (platykurtic) |
2. Probability Distributions

Normal Distribution (Gaussian)

Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
| Interval | Coverage |
|---|---|
| $\mu \pm 1\sigma$ | 68% |
| $\mu \pm 2\sigma$ | 95% |
| $\mu \pm 3\sigma$ | 99.7% |
💡 Central Limit Theorem: when $n \geq 30$, sample means are approximately $\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$ regardless of the population distribution.
Binomial Distribution

$n$ independent trials, each with success probability $p$.

PMF
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

$E[X] = np \qquad Var(X) = np(1-p)$
Poisson Distribution

Counts of rare events per unit time/area ($\lambda$ = average rate).

PMF
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

$E[X] = \lambda \qquad Var(X) = \lambda$
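A sketch cross-checking the two PMFs above against `scipy.stats` (assumed available); the parameters $n=10$, $p=0.3$, $\lambda=4$ are arbitrary.

```python
# Manual PMF evaluation vs scipy.stats, plus the mean/variance identities.
from math import comb, exp, factorial
from scipy.stats import binom, poisson

n, p, k = 10, 0.3, 3
manual_binom = comb(n, k) * p**k * (1 - p)**(n - k)
assert abs(manual_binom - binom.pmf(k, n, p)) < 1e-12

lam = 4
manual_pois = lam**k * exp(-lam) / factorial(k)
assert abs(manual_pois - poisson.pmf(k, lam)) < 1e-12

# E[X] = np and Var(X) = np(1-p); for Poisson both equal lambda
assert abs(binom.mean(n, p) - n * p) < 1e-12
assert abs(poisson.var(lam) - lam) < 1e-12
print(manual_binom, manual_pois)
```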
3. Z-Score

Measures how many standard deviations a value lies from the mean, enabling comparison across different scales.

Single Value
$$z = \frac{x - \mu}{\sigma}$$

Sample Mean
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Critical Z Values

| $z$ | One-tailed $P$ | Two-tailed $P$ |
|---|---|---|
| $1.645$ | $0.050$ | $0.100$ |
| $1.960$ | $0.025$ | $0.050$ |
| $2.576$ | $0.005$ | $0.010$ |
Z to probability: $P(Z < z) = \Phi(z)$. Probability to Z: $z = \Phi^{-1}(p)$.
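The critical-value table can be reproduced with the standard normal CDF $\Phi$ and its inverse; a minimal sketch assuming scipy is available.

```python
# Phi and Phi^{-1} via scipy's standard normal distribution.
from scipy.stats import norm

# Probability -> z (inverse CDF): one-tailed alpha = 0.05
z_05 = norm.ppf(1 - 0.05)
print(round(z_05, 3))     # ~1.645

# z -> probability (CDF): two-tailed P for z = 1.96
p_two = 2 * (1 - norm.cdf(1.96))
print(round(p_two, 3))    # ~0.05
```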
4. Confidence Intervals

$\sigma$ known or $n \geq 30$
$$\text{CI} = \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$$

$\sigma$ unknown and $n < 30$
$$\text{CI} = \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$$
| Confidence Level | $z^*$ |
|---|---|
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
Proportion CI
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
💡 Interpretation: a "95% CI" means that if we repeated this procedure many times, 95% of the constructed intervals would contain the true parameter.
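A sketch of the z-interval and the proportion interval above; the sample figures ($\bar{x}=50$, $\sigma=10$, $n=100$; $\hat{p}=0.4$, $n=500$) are invented.

```python
# 95% z-interval for a mean and for a proportion.
from math import sqrt
from scipy.stats import norm

xbar, sigma, n = 50.0, 10.0, 100
zstar = norm.ppf(0.975)                  # ~1.96 for 95% confidence
half = zstar * sigma / sqrt(n)
ci = (xbar - half, xbar + half)

phat, m = 0.4, 500
half_p = zstar * sqrt(phat * (1 - phat) / m)
prop_ci = (phat - half_p, phat + half_p)
print(ci, prop_ci)
```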
5. Hypothesis Testing

Steps

- $H_0$ (Null): no effect / no difference
- $H_1$ (Alternative): an effect or difference exists
- Set the significance level: $\alpha = 0.05$
- Compute the test statistic ($z$, $t$, $\chi^2$, $F$, ...)
- Find the p-value
- Decision: $p < \alpha \Rightarrow$ reject $H_0$; $p \geq \alpha \Rightarrow$ fail to reject
Error Types

| | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | ❌ Type I ($\alpha$) | ✅ Correct (Power $= 1-\beta$) |
| Fail to Reject | ✅ Correct | ❌ Type II ($\beta$) |
Test Directions

| Direction | $H_1$ | When? |
|---|---|---|
| Two-tailed | $\mu \neq \mu_0$ | Direction doesn't matter |
| Right-tailed | $\mu > \mu_0$ | Expecting an increase |
| Left-tailed | $\mu < \mu_0$ | Expecting a decrease |
6. Normality Tests

Shapiro-Wilk Test

Among the most reliable normality tests; recommended for $n < 5000$.

Test Statistic
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

- $H_0$: Data comes from a normal distribution
- $H_1$: Data is not normally distributed
- $p > 0.05 \Rightarrow$ assume normality ✅
D'Agostino K² Test

Tests skewness and kurtosis jointly. More suitable for larger samples.

QQ-Plot (Visual)

Compares data quantiles against theoretical normal quantiles. Points on the line ⇒ normal.

💡 In practice: when $n > 30$, parametric tests are generally robust to mild normality violations (CLT).
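Both tests above are available in `scipy.stats`; a minimal sketch on synthetic data drawn from a true normal (seed and sample size are arbitrary choices).

```python
# Shapiro-Wilk and D'Agostino K^2 on a synthetic normal sample.
import numpy as np
from scipy.stats import shapiro, normaltest

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)

w, p_sw = shapiro(x)        # H0: data is normal
k2, p_k2 = normaltest(x)    # D'Agostino: skewness + kurtosis jointly
print(p_sw, p_k2)           # p > 0.05 -> fail to reject normality
```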
7. Homogeneity of Variance – Levene's Test

Tests whether groups have equal variances. Prerequisite for the classical t-test and ANOVA.

Levene Statistic
$$W = \frac{(N-k)}{(k-1)} \cdot \frac{\sum_{i=1}^{k} n_i (\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Z_{ij} - \bar{Z}_{i\cdot})^2}$$

$Z_{ij} = |x_{ij} - \tilde{x}_i|$ (absolute deviation from the group median)

- $H_0$: $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$
- $p > 0.05 \Rightarrow$ variances are homogeneous ✅
| Test | Advantage | Disadvantage |
|---|---|---|
| Levene | Doesn't assume normality | Slightly less powerful |
| Bartlett | More powerful under normality | Sensitive to violations |

⚠️ If variances are not homogeneous: use Welch's t-test (equal_var=False) for two groups, or Kruskal-Wallis instead of ANOVA.
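A sketch comparing Levene's and Bartlett's tests via scipy; the two groups are invented, with group `b` deliberately more spread out.

```python
# Levene (median-centered) vs Bartlett on two invented groups.
from scipy.stats import levene, bartlett

a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
b = [11.0, 13.2, 10.5, 13.8, 12.6, 10.9]

w, p_lev = levene(a, b, center='median')  # robust to non-normality
t, p_bar = bartlett(a, b)                 # assumes normality
print(p_lev, p_bar)  # p < 0.05 -> variances differ -> use Welch's t-test
```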
8. T-Test

8.1 One-Sample T-Test

Compares a group's mean against a known value.

Formula
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \qquad df = n - 1$$

8.2 Independent Two-Sample T-Test

Prerequisites: ① normality ② homogeneity of variance ③ independence

Equal Variance
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Welch's T-Test (unequal variances)
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

8.3 Paired T-Test

Before-after comparison of the same group, using differences $d_i = x_{1i} - x_{2i}$.

Formula
$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad df = n - 1$$
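The three t-test flavours map directly onto scipy functions; a minimal sketch with invented before/after scores.

```python
# One-sample, Welch's two-sample, and paired t-tests via scipy.
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

before = [85, 90, 78, 92, 88, 76, 95, 89]
after  = [88, 94, 80, 95, 91, 79, 97, 90]

t1, p1 = ttest_1samp(before, popmean=80)            # mean vs known value
t2, p2 = ttest_ind(before, after, equal_var=False)  # Welch's (unequal var)
t3, p3 = ttest_rel(before, after)                   # paired (same subjects)
print(t1, t2, t3)
```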
9. Z-Test

Used when $\sigma$ is known; with large samples ($n \geq 30$) it serves as the large-sample counterpart of the t-test.

One-Sample
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Two-Proportion
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
| Feature | Z-Test | T-Test |
|---|---|---|
| $\sigma$ known? | Yes | No |
| Sample size | $n \geq 30$ | Any |
| Distribution | $N(0,1)$ | $t(df)$, heavier tails |
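A hand-rolled one-sample z-test following the formula above (scipy has no dedicated z-test function, so the p-value comes from the normal CDF); the numbers are invented.

```python
# Manual one-sample z-test: sigma assumed known.
from math import sqrt
from scipy.stats import norm

xbar, mu0, sigma, n = 103.0, 100.0, 15.0, 100
z = (xbar - mu0) / (sigma / sqrt(n))
p_two_tailed = 2 * (1 - norm.cdf(abs(z)))
print(z, p_two_tailed)   # z = 2.0, p ~ 0.0455 -> reject H0 at alpha = 0.05
```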
10. ANOVA

One-Way ANOVA

Compares the means of $k$ independent groups.

- $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
- $H_1:$ at least one mean differs

F Statistic
$$F = \frac{MSB}{MSW} = \frac{SS_B / (k-1)}{SS_W / (N-k)}$$

Sums of Squares
$$SS_B = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \qquad SS_W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$$
Post-hoc Tests

| Test | Use Case |
|---|---|
| Tukey HSD | All pairwise comparisons, equal samples |
| Bonferroni | Conservative, few comparisons |
| Scheffé | Flexible, unequal samples |
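One-way ANOVA is a single scipy call; a sketch with three invented groups where the second clearly differs.

```python
# One-way ANOVA across three invented groups.
from scipy.stats import f_oneway

g1 = [23, 25, 21, 24, 26]
g2 = [30, 31, 29, 32, 28]
g3 = [24, 23, 25, 22, 26]

F, p = f_oneway(g1, g2, g3)
print(F, p)  # small p -> at least one mean differs; follow up with Tukey HSD
```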
11. Chi-Square Test

Test of Independence

Is there a relationship between two categorical variables?

Test Statistic
$$\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad E_{ij} = \frac{R_i \cdot C_j}{N}$$

$df = (r-1)(c-1)$

Goodness of Fit

Does the observed distribution match the expected one?

Test Statistic
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad df = k - 1$$
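The independence test, including the expected-count table $E_{ij} = R_i C_j / N$, is handled by `chi2_contingency`; the 2x2 table below is invented.

```python
# Chi-square test of independence on an invented 2x2 contingency table.
from scipy.stats import chi2_contingency

observed = [[30, 10],
            [20, 40]]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)      # dof = (2-1)(2-1) = 1
print(expected)          # E_ij = row_total * col_total / N
```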
12. Correlation & Regression

Pearson Correlation Coefficient

Formula
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \cdot \sum(y_i-\bar{y})^2}}$$

| $\lvert r \rvert$ | Interpretation |
|---|---|
| 0.00 – 0.29 | Weak |
| 0.30 – 0.69 | Moderate |
| 0.70 – 1.00 | Strong |
Simple Linear Regression

Model
$$\hat{y} = \beta_0 + \beta_1 x$$

OLS Coefficients
$$\beta_1 = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2} \qquad \beta_0 = \bar{y} - \beta_1\bar{x}$$

Coefficient of Determination
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Spearman Rank Correlation

Formula
$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \qquad d_i = \text{rank}(x_i) - \text{rank}(y_i)$$
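A sketch of Pearson, Spearman, and OLS via scipy; the toy data is perfectly linear ($y = 2x$), so both correlations come out at 1 and the fitted slope at 2.

```python
# Correlation coefficients and simple OLS on perfectly linear toy data.
from scipy.stats import pearsonr, spearmanr, linregress

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x exactly

r, p_r = pearsonr(x, y)
rho, p_s = spearmanr(x, y)
fit = linregress(x, y)  # slope = beta_1, intercept = beta_0
print(r, rho, fit.slope, fit.intercept)
```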
13. Non-Parametric Tests

Used when normality is violated or the data is ordinal.

| Parametric | Non-Parametric | Scenario |
|---|---|---|
| Independent t-test | Mann-Whitney U | 2 independent groups |
| Paired t-test | Wilcoxon Signed-Rank | 2 dependent groups |
| One-way ANOVA | Kruskal-Wallis | 3+ independent groups |
Mann-Whitney U

Test Statistic
$$U = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1$$

Kruskal-Wallis

H Statistic
$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$$
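The mapping in the table above translates directly to scipy calls; a sketch with invented samples where every value in `a` is below every value in `b`.

```python
# Non-parametric counterparts of the t-tests and ANOVA.
from scipy.stats import mannwhitneyu, wilcoxon, kruskal

a = [1.1, 2.3, 1.9, 3.0, 2.2]
b = [3.5, 4.1, 3.9, 5.0, 4.4]
c = [2.0, 2.5, 3.1, 2.8, 2.6]

u, p_u = mannwhitneyu(a, b, alternative='two-sided')  # indep. t alternative
p_w = wilcoxon(a, b).pvalue                           # paired t alternative
h, p_h = kruskal(a, b, c)                             # ANOVA alternative
print(p_u, p_w, p_h)
```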
14. Effect Size

The p-value answers "is there a difference?"; effect size answers "how big is the difference?"

Cohen's d (T-Test)

Formula
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

| $\lvert d \rvert$ | Interpretation |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |
Eta-Squared (ANOVA)

Formula
$$\eta^2 = \frac{SS_{between}}{SS_{total}}$$

| $\eta^2$ | Interpretation |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |
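Scipy has no built-in Cohen's d, so a small helper following the pooled-SD formula above is the usual approach; a minimal sketch with invented samples.

```python
# Hand-rolled Cohen's d with pooled standard deviation.
from math import sqrt
from statistics import mean, variance

def cohens_d(x1, x2):
    n1, n2 = len(x1), len(x2)
    # Pooled SD from the Bessel-corrected sample variances
    sp = sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2))
              / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / sp

d = cohens_d([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(d)   # > 0.8, so a "large" effect by the table above
```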
15. Power Analysis

Done BEFORE the test. Determines the required sample size to detect the target effect.

4 Components (specify 3, compute the 4th)

| Component | Symbol | Typical |
|---|---|---|
| Effect size | $d$ | 0.2 / 0.5 / 0.8 |
| Significance | $\alpha$ | 0.05 |
| Power | $1-\beta$ | 0.80 |
| Sample size | $n$ | Computed |

Power
$$\text{Power} = 1 - \beta = P(\text{detect a true effect})$$

Power increases when $n$ ↑, effect size $d$ ↑, or $\alpha$ ↑.
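A sample-size sketch using the normal approximation $n \approx 2(z_{\alpha/2} + z_\beta)^2 / d^2$ per group for a two-sample t-test. This is an approximation of my choosing, not exact power analysis; statsmodels' `TTestIndPower` gives exact values.

```python
# Approximate per-group sample size for a two-sample t-test.
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_b = norm.ppf(power)           # power quantile
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_group(0.5))   # medium effect: ~63 per group by this approximation
```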
16. A/B Testing

Workflow

- State the hypothesis: $H_0: p_A = p_B$
- Define the success metric (conversion, CTR, revenue, ...)
- Calculate the minimum detectable effect (MDE) and required sample size
- Run the experiment and collect data
- Apply the statistical test and evaluate

Two-Proportion Z-Test

Test Statistic
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$$

Lift
$$\text{Lift} = \frac{\hat{p}_{test} - \hat{p}_{control}}{\hat{p}_{control}} \times 100\%$$

Sample Size (approx.)
$$n \approx \frac{(z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$$
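The two-proportion z-test and lift formulas above in code; the conversion counts are invented for illustration.

```python
# Two-proportion z-test and lift for an invented A/B experiment.
from math import sqrt
from scipy.stats import norm

x_c, n_c = 200, 4000    # control: conversions / visitors
x_t, n_t = 260, 4000    # treatment

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)                     # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
lift = (p_t - p_c) / p_c * 100
print(z, p_value, lift)
```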
Common Pitfalls

| Pitfall | Solution |
|---|---|
| Peeking | Pre-determine $n$, wait until completion |
| Multiple testing | Bonferroni: $\alpha_{adj} = \alpha / k$ |
| Simpson's paradox | Segment analysis |
| Novelty effect | Run for 2+ weeks |
| Selection bias | Proper randomization |
17. Bayesian Basics

Bayes' Theorem
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

General Form
$$\underbrace{P(\theta|X)}_{\text{Posterior}} = \frac{\overbrace{P(X|\theta)}^{\text{Likelihood}} \cdot \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(X)}_{\text{Evidence}}}$$

Example: Medical Test Paradox

With test accuracy 99% (both sensitivity and specificity) and disease prevalence 1%:

$$P(D|+) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \textbf{50\%}$$

⚠️ Even a 99% accurate test can be misleading for rare conditions!
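The medical-test computation spelled out as Bayes' theorem, using the numbers from the example above.

```python
# Posterior probability of disease given a positive test.
sensitivity = 0.99   # P(+ | disease)
false_pos = 0.01     # P(+ | no disease) = 1 - specificity
prevalence = 0.01    # P(disease)

numerator = sensitivity * prevalence              # P(+ | D) * P(D)
evidence = numerator + false_pos * (1 - prevalence)  # P(+)
posterior = numerator / evidence                  # P(D | +)
print(posterior)   # 0.5 despite the 99% "accuracy"
```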
Frequentist vs Bayesian
Frequentist
- Probability = long-run frequency
- The parameter is fixed but unknown
- Result: p-value, confidence interval
- No prior information used
Bayesian
- Probability = degree of belief
- The parameter is a random variable
- Result: posterior, credible interval
- Incorporates prior knowledge
18. Which Test to Use? – Decision Tree

```
WHAT IS YOUR DATA TYPE?
│
├── Numerical (Continuous)
│   ├── 1 Group → One-sample t-test
│   ├── 2 Groups
│   │   ├── Independent → Normal? → Yes: Independent t | No: Mann-Whitney U
│   │   └── Dependent → Normal? → Yes: Paired t | No: Wilcoxon
│   └── 3+ Groups
│       ├── Independent → Normal? → Yes: ANOVA + Tukey | No: Kruskal-Wallis
│       └── Dependent → Repeated Measures ANOVA / Friedman
│
├── Categorical (Counts)
│   ├── One variable → Chi-Square Goodness of Fit
│   └── Two variables → Chi-Square Independence
│
└── Relationship
    ├── Linear? → Normal? → Pearson r | Spearman rho
    └── Prediction? → Regression (Simple / Multiple)
```
Quick Checklist

| # | Step | Method |
|---|---|---|
| 1 | Identify data type | Continuous / Categorical / Ordinal |
| 2 | Explore distribution | Histogram, QQ-Plot |
| 3 | Test normality | Shapiro-Wilk |
| 4 | Test variance homogeneity | Levene's test |
| 5 | Apply the right test | Decision tree above |
| 6 | Calculate effect size | Cohen's $d$, $\eta^2$ |
| 7 | Report results | $p$ + effect + CI |
⚠️ Golden Rule: a p-value alone is NOT enough. Always report it alongside effect size and confidence intervals!