Ordinary Least Squares
Assumptions:
Linear Relationship: This may go without saying, but you implicitly assume that there is a linear relationship between $\mathbf{X}$ and $\mathbf{Y}$.
Strict Exogeneity: $\mathrm{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = 0$
- Consequences: $\mathrm{E}[\boldsymbol{\varepsilon}] = 0$ and $Cov(X,\varepsilon) = \mathrm{E}[\mathbf{X}^T\boldsymbol{\varepsilon}] = 0$
- If it holds, regressors are exogenous. If violated, regressors are endogenous, OLS becomes biased, and instrumental variables may be needed.
No Perfect Multicollinearity: $\Pr[\mathrm{rank}(\mathbf{X}) = p] = 1$
- Regressors must be linearly independent
- If violated, $\boldsymbol{\beta}$ is not identified (no unique estimate exists), though prediction may still be possible.
Spherical Errors: $\mathrm{Var}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \sigma^2 \mathbf{I}_n$
- Homoscedasticity: $\mathrm{E}[\varepsilon_i^2 \mid \mathbf{X}] = \sigma^2$ for all $i$
- No Autocorrelation: $\mathrm{E}[\varepsilon_i\varepsilon_j \mid \mathbf{X}] = 0$ for $i \neq j$
- If violated, OLS estimates are unbiased but inefficient; use GLS or robust estimation.
Normality: $\boldsymbol{\varepsilon} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$
Estimator Distribution:
\[\begin{align*} \boldsymbol{\hat{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} \end{align*}\]Estimator Mean:
\[\begin{align*} E[\boldsymbol{\hat{\beta}} \mid \mathbf{X}] &= E[\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\varepsilon} \mid \mathbf{X}] \\ &= \boldsymbol{\beta} \end{align*}\]Estimator Variance:
\[\begin{align*} \operatorname{Var}(\boldsymbol{\hat{\beta}} \mid \mathbf{X}) &= \operatorname{Var}\big(\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} \mid \mathbf{X}\big) \\ &= \operatorname{Var}\big((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon} \mid \mathbf{X}\big) \qquad \text{($\boldsymbol{\beta}$ is constant)} \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \operatorname{Var}(\boldsymbol{\varepsilon} \mid \mathbf{X}) \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \end{align*}\]Under the normality assumption $\boldsymbol{\varepsilon} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$, $\qquad \qquad\boldsymbol{\hat{\beta}} \mid \mathbf{X} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1})$.
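The derivation above can be checked by simulation. The sketch below (numpy; the design matrix, true $\beta$, and $\sigma$ are made up for illustration) refits $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ on many draws of $\boldsymbol{\varepsilon}$ and compares the Monte Carlo mean and covariance to $\boldsymbol{\beta}$ and $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$:

```python
import numpy as np

# Monte Carlo check of the estimator distribution (illustrative values only).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
sigma = 1.0

XtX_inv = np.linalg.inv(X.T @ X)
betas = []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    betas.append(XtX_inv @ X.T @ y)   # beta_hat = (X'X)^{-1} X'y
betas = np.array(betas)

print(betas.mean(axis=0))             # approximately beta (unbiasedness)
print(np.cov(betas.T))                # approximately sigma^2 (X'X)^{-1}
```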
Hypothesis Testing:
\(\text{RSS} = \varepsilon^\top \varepsilon = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
\[\qquad \qquad \text{estimate variance } \sigma^2 \text{ by } \qquad \qquad \hat{\sigma}^2 = \frac{\text{RSS}}{n - p} \quad \qquad \text{($n - p$ makes it unbiased } \mathbb{E}[\hat{\sigma}^2] = \sigma^2\text{)}\] \[\text{RSE} = \hat{\sigma} = \sqrt{\frac{\varepsilon^\top \varepsilon}{n - p}}\]Testing $\beta_j = 0$: Define the z-score \(z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{[(X^\top X)^{-1}]_{jj}}} \sim t_{n-p}\)
One can construct a confidence interval:
\[\hat{\beta}_j \pm t_{n - p,\, 1 - \alpha/2} \; \text{SE}(\hat{\beta}_j)\] \[\hat{\beta}_j \pm z_{1 - \alpha/2} \; \text{SE}(\hat{\beta}_j) \qquad \text{as $t_{n - p} \to \mathcal{N}(0,1)$ when $n \to \infty$}\]Testing a smaller Model 0 with $p_0$ parameters against a larger nested Model 1 with $p_1 > p_0$ parameters:
\[F = \frac{(RSS_0 - RSS_1) / (p_1-p_0)}{RSS_1/(n-p_1)} \sim F_{p_1-p_0,\, n-p_1}\]Metrics
- $\mathbf{R^2} = 1 - \frac{\text{RSS}}{\text{TSS}} \quad (= \rho_{xy}^2 \text{ in simple linear regression})$
- Adjusted $\mathbf{R^2} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$
- AIC $= 2k - 2 \ln(L)$
- BIC $= \ln(n)k - 2 \ln(L)$
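A worked sketch of the inference quantities above (z-scores, confidence intervals, $R^2$) on simulated data; the data-generating process is invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Inference on a simulated regression: intercept 2, slope 3, unit-variance noise.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one regressor
p = X.shape[1]
y = X @ np.array([2.0, 3.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
rss = resid @ resid
sigma2_hat = rss / (n - p)                       # unbiased variance estimate
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
z = beta_hat / se                                # ~ t_{n-p} under H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(z), df=n - p)

tss = ((y - y.mean()) ** 2).sum()
r2 = 1 - rss / tss

# 95% confidence intervals using the t_{n-p} quantile
crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - crit * se, beta_hat + crit * se])
print(beta_hat, p_values, r2)
```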
Gauss–Markov Theorem
Theorem (Gauss–Markov): If the following assumptions hold:
- Linearity: $y = X\beta + \varepsilon$
- $\mathbb{E}[\varepsilon] = 0$
- $\operatorname{Var}(\varepsilon) = \sigma^2 I$
- $X$ has full column rank
then the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Proof:
Let $\tilde{\beta} = C y$ be another linear estimator of $\beta$, and write $C = (X^\top X)^{-1} X^\top + D$ for some matrix $D$.
\[\begin{aligned} \mathbb{E}[\tilde{\beta}] &= \mathbb{E}[C y] \\ &= \mathbb{E}\!\left[\big((X^\top X)^{-1} X^\top + D\big)(X\beta + \varepsilon)\right] \\ &= \big((X^\top X)^{-1} X^\top + D\big) X \beta + \big((X^\top X)^{-1} X^\top + D\big)\mathbb{E}\! \left[\varepsilon\right] \\ &= (X^\top X)^{-1} X^\top X \beta + D X \beta \\ &= (I + D X)\beta. \end{aligned}\]So $\tilde{\beta}$ is unbiased if and only if $D X = 0$.
\[\begin{aligned} \operatorname{Var}(\tilde{\beta}) &= \operatorname{Var}(C y) \\ &= C \operatorname{Var}(y) C^\top \\ &= \sigma^2 C C^\top \\ &= \sigma^2 \big( (X^\top X)^{-1} X^\top + D \big) \big( X (X^\top X)^{-1} + D^\top \big) \\ &= \sigma^2 \left( (X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top + D X (X^\top X)^{-1} + D D^\top \right). \end{aligned}\]Imposing the unbiasedness condition $D X = 0$ (hence $X^\top D^\top = 0$), the cross terms vanish: \(\operatorname{Var}(\tilde{\beta}) = \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top.\)
Since $D D^\top$ is positive semi-definite, \[\operatorname{Var}(\tilde{\beta}) \succeq \operatorname{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1},\] so $\hat{\beta}$ has the smallest variance among all linear unbiased estimators and is therefore BLUE.
Frisch–Waugh–Lovell Theorem
If you run the OLS regression \(Y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), then $\hat{\beta}_2$ will be the same as in \(M_1Y = M_1X_2\beta_2 + M_1\varepsilon\), where $M_1$ is the residual-maker (or annihilator) matrix, which projects any vector onto the space orthogonal to the column space of $X_1$: \(M_1 = I - X_1 (X_1^\top X_1)^{-1} X_1^\top\)
Procedure:
- Residuals of $y$ on $X_1$: $ \tilde{y} = M_1 y = y - X_1 (X_1^\top X_1)^{-1} X_1^\top y. $
- Orthogonal component of $X_2$ wrt $X_1$: $ \tilde{X}_2 = M_1 X_2 = X_2 - X_1 (X_1^\top X_1)^{-1} X_1^\top X_2. $
- Regress $\tilde{y}$ on $\tilde{X}_2$: $ \tilde{\beta}_2 = (\tilde{X}_2^\top \tilde{X}_2)^{-1} \tilde{X}_2^\top \tilde{y}. $
- Then $\tilde{\beta}_2= \hat{\beta}_2$
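The procedure above can be verified numerically. In this sketch (numpy; the correlated design and coefficients are made up for illustration), the last two coefficients of the full regression coincide with those from the residualized FWL regression:

```python
import numpy as np

# Numerical check of the Frisch-Waugh-Lovell theorem on simulated data.
rng = np.random.default_rng(2)
n = 120
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, 1:]   # correlated with X1
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Full regression: the last two coefficients belong to X2
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# FWL route: residualize y and X2 on X1, then regress residuals on residuals
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

print(beta_full[2:])  # matches beta_fwl up to floating-point error
```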
Violated Assumptions in OLS
Model Misspecification / Functional Form
Issue: Wrong functional form, omitted variables, or nonlinearity. Violates $\mathbb{E}[\varepsilon|X] = 0$ if true relationship is nonlinear but model is linear.
Consequences:
- Biased Estimates: $\hat{\beta}$ biased and inconsistent when functional form is misspecified. Bias magnitude depends on degree of misspecification.
- Omitted Variable Bias: If $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ but only regress on $x_1$, then $\mathbb{E}[\hat{\beta}_1] = \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\text{Var}(x_1)}$ (biased unless $x_1 \perp x_2$ or $\beta_2 = 0$).
Diagnostics:
- Residual Plots: Plot residuals $\varepsilon$ vs. fitted values $\hat{y}$ or vs. individual predictors $x_j$. Should see no pattern (random scatter around zero). Systematic patterns (U-shape, curves, trends) indicate misspecification.
- Partial Residual Plots: Plot $e + \hat{\beta}_j x_j$ vs. $x_j$ to detect nonlinearity in variable $x_j$ while controlling for others.
Remedies:
- Nonlinear Transformations of Predictors: Use $\log(x)$, $\sqrt{x}$, $x^2$, $1/x$
- Interaction Terms: Include $x_1 \cdot x_2$ if effect of $x_1$ depends on level of $x_2$.
- Box-Cox Transformation: Transform dependent variable: $y^{(\lambda)} = \frac{y^\lambda - 1}{\lambda}$ (or $\log(y)$ if $\lambda = 0$). Choose $\lambda$ via MLE to improve model fit.
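As a sketch of the Box-Cox remedy, `scipy.stats.boxcox` chooses $\lambda$ by maximum likelihood; the right-skewed data below are simulated lognormals purely for illustration, so the fitted $\lambda$ should land near 0 (a log transform):

```python
import numpy as np
from scipy import stats

# Box-Cox on a positive, right-skewed variable (simulated for illustration).
rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=0.7, size=500)   # strictly positive, skewed

y_bc, lam = stats.boxcox(y)   # lambda chosen by maximum likelihood
# For lognormal data the MLE lambda is near 0, i.e. close to log(y)
print(lam)
```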
Multicollinearity
Issue: High correlation among regressors causes an ill-conditioned $X^\top X$ matrix, leading to:
- Inflated Variance & Unstable Estimates: The sampling variance, $\text{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$, becomes inflated. This instability in $\hat{\beta}$ leads to deflated t-statistics, increasing the risk of Type II error (failing to identify a significant effect).
- Unreliable Interpretation: It becomes difficult to disentangle the individual effect of each predictor, making coefficient interpretation unreliable.
Diagnostics:
- Variance Inflation Factor (VIF): $\text{VIF}_j = \frac{1}{1-R_j^2}$, where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other regressors. A high VIF indicates multicollinearity: $\text{Var}(\hat{\beta}_j) = \text{Var}(\hat{\beta}_j)^{\text{orth}} \times \text{VIF}_j$, i.e. the variance is inflated relative to the case where $x_j$ is orthogonal to the other regressors.
- Condition Number: $\kappa = \sqrt{\lambda_\text{max} / \lambda_\text{min}}$ of $X$ or $X^\top X$. Values > 30 suggest significant multicollinearity.
Remedies: Shrinkage (Ridge, Lasso), Dimensionality Reduction (Principal Component Regression), FWL Theorem.
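The VIF diagnostic can be computed straight from its definition. In this sketch (numpy; the nearly collinear design is invented), the two collinear columns get large VIFs while the independent column stays near 1:

```python
import numpy as np

# VIF from the definition: regress each column on the others, take 1/(1-R^2).
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)    # nearly collinear with x1
x3 = rng.normal(size=n)               # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Regress column j on the remaining columns (plus intercept),
    return 1 / (1 - R_j^2)."""
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    coef = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    resid = X[:, j] - Z @ coef
    r2 = 1 - resid @ resid / ((X[:, j] - X[:, j].mean()) ** 2).sum()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 large; x3 near 1
```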
Heteroskedasticity
Issue: Non-constant error variance, $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$. OLS remains unbiased but is inefficient (no longer BLUE). The standard covariance matrix $\sigma^2(X^\top X)^{-1}$ is incorrect, leading to biased standard errors and invalid inference (t-statistics, confidence intervals).
Diagnostics:
- Graphics: Plot residuals ($e_i$) vs. predictors ($X_i$) or fitted values ($\hat{y}_i$) (should be around 0 with constant variance). Also plot squared residuals vs. predictors (need a flat line around $\sigma^2$).
- Tests: Breusch-Pagan (tests whether variance depends on predictors), White (general test for heteroskedasticity), Goldfeld-Quandt (tests for differing variance between two data subsets)
Remedies:
- Robust Standard Errors (Eicker–Huber–White): The preferred solution for inference. Instead of the standard covariance matrix $\sigma^2(X^\top X)^{-1}$, it uses a consistent estimator: \(\widehat{V}_{ehw} = (X^\top X)^{-1} (X^\top \hat{\Omega} X) (X^\top X)^{-1}, \quad \hat{\Omega} = \text{diag}(\hat{\varepsilon}_1^2, \ldots, \hat{\varepsilon}_n^2)\) where \(\widehat{V}_{ehw}\) is the Eicker–Huber–White (EHW) robust covariance matrix. We can now do test using \(\hat{\beta} \overset{a}{\sim} \mathcal{N}(\beta, \widehat{V}_{ehw})\)
- Weighted Least Squares (WLS): If $\sigma_i^2$ is known for each measurement, we can set $w_i = 1 / \sigma_i^2$
- Transformations: Log, Box-Cox, etc., to stabilize variance (may also affect functional form).
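The EHW sandwich formula above can be assembled directly in numpy. This sketch simulates a heteroskedastic model (the error standard deviation grows with the regressor; the design and coefficients are made up) and compares robust to naive standard errors:

```python
import numpy as np

# Eicker-Huber-White robust standard errors via the sandwich formula.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + x * rng.normal(size=n)      # error sd grows with x

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * e[:, None] ** 2)              # X' diag(e_i^2) X
V_ehw = XtX_inv @ meat @ XtX_inv                # sandwich estimator
se_robust = np.sqrt(np.diag(V_ehw))

# Naive (homoskedastic) standard errors for comparison
sigma2 = e @ e / (n - X.shape[1])
se_naive = np.sqrt(np.diag(sigma2 * XtX_inv))
print(se_naive, se_robust)
```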
Autocorrelated Errors
Issue: Non-zero covariance between errors, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) \neq 0$ for $i \neq j$. This violates the Gauss-Markov assumption that errors are independent. In time series data, this commonly follows an AR(1) process: $\varepsilon_t = \rho \varepsilon_{t-1} + u_t$ where $u_t \sim \text{IID}(0, \sigma_u^2)$.
Consequences:
- Reduced Effective Sample Size: With positive autocorrelation ($\rho > 0$), consecutive observations are not fully independent. For AR(1) errors, the variance inflation factor is approximately $\frac{1+\rho}{1-\rho}$, so \(\text{Var}(\hat{\beta})_{AR(1)} \approx \frac{1+\rho}{1-\rho} \cdot \text{Var}(\hat{\beta})_{IID}\).
- Biased Inference: OLS estimators remain unbiased and consistent, but are inefficient. The conventional OLS standard errors are inconsistent:
- For $\rho > 0$: Standard errors are typically underestimated, making confidence intervals artificially narrow and inflating t-statistics (increased Type I error rates).
- For $\rho < 0$: Standard errors are typically overestimated, leading to conservative inference (increased Type II error rates).
Diagnostics:
- Residual Plots: Plot residuals against time; look for tracking patterns where adjacent residuals tend to have similar signs and magnitudes. ACF / PACF plots also.
- Durbin-Watson Test: \(d = \frac{\sum_{t=2}^T (e_t - e_{t-1})^2}{\sum_{t=1}^T e_t^2} \approx 2(1-\hat{\rho})\) Tests $H_0: \rho = 0$ vs $H_1: \rho \neq 0$ (or one-sided alternatives). Values significantly below 2 suggest positive autocorrelation; values significantly above 2 suggest negative autocorrelation. Limitation: Only tests for AR(1) autocorrelation and is inconclusive near 2.
- Box-Pierce Test: General test for autocorrelation up to lag $h$: \(Q_{BP} = n\sum_{k=1}^h \hat{\rho}_k^2 \sim \chi^2_h\) where $\hat{\rho}_k$ is the sample autocorrelation at lag $k$.
- Ljung-Box Test: Modified version with better small-sample properties: \(Q_{LB} = n(n+2)\sum_{k=1}^h \frac{\hat{\rho}_k^2}{n-k} \sim \chi^2_h\) Generally preferred over Box-Pierce in practice.
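Both diagnostics can be computed from their definitions. This sketch simulates AR(1) residuals with $\rho = 0.6$ (an invented series for illustration), so the Durbin-Watson statistic should land well below 2 and the Ljung-Box test should reject:

```python
import numpy as np
from scipy import stats

# Durbin-Watson and Ljung-Box on a simulated AR(1) residual series.
rng = np.random.default_rng(6)
T, rho = 300, 0.6
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + rng.normal()

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # approx 2(1 - rho_hat)

def ljung_box(e, h):
    """Q_LB = n(n+2) * sum_k rho_k^2 / (n-k), compared to chi^2_h."""
    n = len(e)
    e_c = e - e.mean()
    denom = e_c @ e_c
    acf = np.array([e_c[k:] @ e_c[:-k] / denom for k in range(1, h + 1)])
    q = n * (n + 2) * np.sum(acf ** 2 / (n - np.arange(1, h + 1)))
    return q, stats.chi2.sf(q, df=h)

q, pval = ljung_box(e, h=10)
print(dw, q, pval)   # dw well below 2; tiny p-value
```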
Remedies:
- Generalized Least Squares (GLS): For AR(1) errors, transform data via $y_t^* = y_t - \rho y_{t-1}$ and $X_t^* = X_t - \rho X_{t-1}$. The Cochrane-Orcutt procedure iteratively estimates $\rho$ from OLS residuals and applies the transformation until convergence.
- Newey-West HAC Standard Errors: Heteroskedasticity and Autocorrelation Consistent (HAC) robust covariance estimator: \(\widehat{V}_{NW} = (X^\top X)^{-1} (X^\top \hat{\Omega} X) (X^\top X)^{-1}\) where $\hat{\Omega}$ accounts for autocorrelation up to lag $L$ using a kernel weighting scheme (e.g. Bartlett weights).
- Include Lagged Dependent Variable: For time series models, adding $y_{t-1}$ as a regressor may capture autocorrelation in the original error term, transforming it into proper model dynamics. Note: This creates a dynamic panel model with different asymptotic properties.
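The GLS quasi-differencing step can be sketched as a single Cochrane-Orcutt pass (the AR(1) data-generating process is invented, and we stop after one iteration rather than iterating to convergence):

```python
import numpy as np

# One Cochrane-Orcutt pass on simulated data with AR(1) errors (rho = 0.7).
rng = np.random.default_rng(7)
T, rho = 400, 0.7
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(T), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols

# Estimate rho from a lag-1 regression of the OLS residuals
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Quasi-difference y_t* = y_t - rho y_{t-1}, X_t* = X_t - rho X_{t-1}
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_gls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
print(rho_hat, b_gls)
```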
Heavy-tailed / Non-normal Errors
Issue: Errors $\varepsilon_i$ deviate from normality, particularly with heavy tails (high kurtosis). Normality not required for OLS unbiasedness, but departures affect inference and efficiency.
Consequences:
- Variance Overestimation: $\hat{\sigma}^2$ highly sensitive to extreme values, typically overestimated with heavy-tailed outliers. This inflates standard errors, making CIs artificially wide and $\hat{\beta}$ appear less significant (reduces power, increases Type II error).
- Inference Validity: t-tests and F-tests assume normality for exact finite-sample validity. Under non-normality, these rely on CLT for asymptotic approximations. With heavy tails (Cauchy, Pareto with $\alpha < 2$), CLT fails and inference is invalid even asymptotically.
Diagnostics:
- Q-Q Plot: Sample quantiles vs. theoretical normal quantiles. Heavy tails appear as deviations at extremes (both ends curving away).
- Sample Kurtosis: Excess kurtosis $> 0$ indicates heavier tails. Kurtosis $> 10$ suggests severe heavy-tailedness.
- Shapiro-Wilk: Most powerful normality test for $n < 2000$. $W = \frac{(\sum_{i=1}^n a_i x_{(i)})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$. P-value $< \alpha$ rejects normality.
- Goodness of fit tests: Kolmogorov-Smirnov (compares empirical CDF to theoretical normal CDF), Anderson-Darling (similar to KS but more weight on the tails)
Remedies:
- Robust SEs: HC or HAC standard errors remain valid under non-normality (rely on asymptotic approximations).
- Bootstrap SEs: Non-parametric bootstrap provides valid inference without distributional assumptions. Use residual bootstrap or pairs bootstrap.
- Studentized Residuals: $r_i = \frac{\varepsilon_i}{\hat{\sigma}\sqrt{1-H_{ii}}}$ with $H = X(X^TX)^{-1}X^T$; useful for flagging outliers and high-leverage points.
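The pairs bootstrap remedy can be sketched in a few lines; the heavy-tailed $t_3$ errors below are simulated purely for illustration:

```python
import numpy as np

# Pairs bootstrap standard errors under heavy-tailed (t_3) errors.
rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)  # heavy-tailed noise

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

boot = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)   # resample (x_i, y_i) pairs with replacement
    boot.append(fit(X[idx], y[idx]))
se_boot = np.array(boot).std(axis=0, ddof=1)
print(se_boot)   # bootstrap SEs for intercept and slope
```

No distributional assumption on the errors is needed; resampling whole $(x_i, y_i)$ pairs also keeps the method valid under heteroskedasticity, unlike the residual bootstrap.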
References
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009)
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Book
The Truth about Linear Regression (2015)
Cosma Rohilla Shalizi
Book
Linear Model and Extensions (2024)
Peng Ding
Book