# Endogeneity and IV Estimation

Endogeneity arises when exogeneity is violated, i.e., $E[\varepsilon \mid X] \neq 0$.

The following are some sources of endogeneity.

  • Omitted variable bias
  • Simultaneous/Reverse causality
  • Measurement error
  • Sample selection bias

# Assumption

$[x_i, z_i, \varepsilon_i],\ i = 1, \dots, n$, is an i.i.d. sequence of random variables.

# Explanatory variables

When exogeneity is violated, we assume

$$E[\varepsilon \mid X] = \eta,$$

(where $\eta$ is a function of $X$) and

$$E[x_i \varepsilon_i] = \gamma \quad \Longrightarrow \quad \operatorname{plim} \frac{1}{n} X'\varepsilon = \gamma,$$

then the estimator b is biased,

$$E[b \mid X] = \beta + (X'X)^{-1}X'\eta \neq \beta,$$

and is inconsistent,

$$\operatorname{plim} b = \beta + \operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} \operatorname{plim}\left(\frac{X'\varepsilon}{n}\right) = \beta + Q_{XX}^{-1}\gamma \neq \beta.$$
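As a numerical sketch of this inconsistency (the data-generating process and all parameter values below are illustrative assumptions, not from the text), a common shock entering both the regressor and the disturbance drives the OLS estimate away from $\beta$:

```python
# Sketch: OLS is inconsistent when the regressor is correlated with the
# disturbance. The DGP below (a shared shock w) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0

w = rng.standard_normal(n)            # unobserved common shock
x = w + rng.standard_normal(n)        # regressor: Cov(x, eps) = Var(w) = 1
eps = w + rng.standard_normal(n)      # disturbance shares w with x
y = beta * x + eps

b_ols = (x @ y) / (x @ x)             # OLS slope, no constant
# Here Q_XX = Var(x) = 2 and gamma = Cov(x, eps) = 1, so
# plim b = beta + gamma / Q_XX = 2.5, not 2.
print(b_ols)
```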

# Instrumental variables

Instrumental variables $Z$ should satisfy:

  1. Exogeneity: they are uncorrelated with the disturbance $\varepsilon$.
  2. Relevance: they are correlated with the regressors $X$.

These conditions can be stated formally as:

  1. $\operatorname{plim} \frac{1}{n} Z'Z = Q_{ZZ}$, a finite, positive definite matrix (well-behaved data).
  2. $\operatorname{plim} \frac{1}{n} Z'X = Q_{ZX}$, a finite $L \times K$ matrix with rank $K$ (relevance).
  3. $\operatorname{plim} \frac{1}{n} Z'\varepsilon = 0$ (exogeneity).

# IV Estimation

# Situation when L=K

We partition $X$ into $x_1$, a set of $K_1$ exogenous variables, and $x_2$, a set of $K_2$ endogenous variables. Then $Z = [x_1, z_2]$, where $z_2$ are the instrumental variables for $x_2$, and $x_1$ serve as instruments for themselves.

$$b_{IV} = (Z'X)^{-1} Z'y.$$

$b_{IV}$ is consistent.

  • The asymptotic distribution is

    $$b_{IV} \overset{a}{\sim} N\!\left[\beta, \frac{\sigma^2}{n} Q_{ZX}^{-1} Q_{ZZ} Q_{XZ}^{-1}\right].$$

  • The asymptotic covariance matrix is estimated as

    $$\text{Est. Asy. Var}[b_{IV}] = \hat{\sigma}^2 (Z'X)^{-1}(Z'Z)(X'Z)^{-1},$$

    where

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i' b_{IV}\right)^2.$$
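The just-identified IV estimator and its estimated asymptotic variance can be sketched as follows (the instrument and all DGP values are illustrative assumptions):

```python
# Sketch of the just-identified IV estimator (L = K = 1) and the variance
# estimator above. The instrument and DGP values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 1.5

w = rng.standard_normal(n)              # unobserved confounder
z = rng.standard_normal(n)              # instrument: exogenous and relevant
x = z + w + rng.standard_normal(n)
eps = w + rng.standard_normal(n)
y = beta * x + eps

Z, X = z[:, None], x[:, None]
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)          # (Z'X)^{-1} Z'y
resid = y - X @ b_iv
sigma2_hat = resid @ resid / n
avar = sigma2_hat * np.linalg.inv(Z.T @ X) @ (Z.T @ Z) @ np.linalg.inv(X.T @ Z)
print(b_iv[0], np.sqrt(avar[0, 0]))     # estimate near 1.5, with its std. error
```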

In general, we have

$$Q_{Xy} = Q_{XX}\beta + \gamma,$$

so $\beta$ and $\gamma$ cannot be jointly identified from the data alone. There are two ways forward:

  • Assume $\gamma = 0$, which is the standard OLS assumption.
  • Find instrumental variables $Z$ and use $Q_{Zy} = Q_{ZX}\beta$ to estimate $\beta$.

# Situation when L>K

Estimator:

$$b_{IV} = (\hat{X}'X)^{-1}\hat{X}'y = \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1} X'Z(Z'Z)^{-1}Z'y,$$

where $\hat{X} = Z(Z'Z)^{-1}Z'X$.

$\hat{X}$ is the most efficient choice of instruments.

In practice, $b_{IV}$ can be computed in two steps (two-stage least squares, 2SLS): first form $\hat{X}$, and then compute $b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.

The two expressions agree because

$$\hat{X}'\hat{X} = X'Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'X = X'Z(Z'Z)^{-1}Z'X = \hat{X}'X.$$
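A minimal 2SLS sketch under these formulas, assuming one endogenous regressor and two instruments (all DGP values are illustrative):

```python
# Sketch of 2SLS in the overidentified case: L = 2 instruments for K = 1
# endogenous regressor. The DGP values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = 0.8

w = rng.standard_normal(n)
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x = z1 + 0.5 * z2 + w + rng.standard_normal(n)
y = beta * x + w + rng.standard_normal(n)

Z = np.column_stack([z1, z2])
X = x[:, None]
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # X_hat = Z (Z'Z)^{-1} Z'X
# By the identity X_hat'X_hat = X_hat'X, regressing y on X_hat gives b_IV.
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(b_2sls[0])                                # approaches beta = 0.8
```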

# Relevant Tests

We want to test whether the regressors are correlated with the disturbances.

# Hausman Test

The statistic is

$$H = \frac{(b_{IV} - b_{LS})'\left[(\hat{X}'\hat{X})^{-1} - (X'X)^{-1}\right]^{-1}(b_{IV} - b_{LS})}{s^2} \sim \chi^2(K^*),$$

where $K^*$ is the number of endogenous variables.

# Hausman Statistic

The covariance between an efficient estimator $b_E$ of a parameter vector $\beta$ and its difference from an inefficient estimator $b_I$ of the same parameter vector, $b_E - b_I$, is zero:

$$\operatorname{Cov}(b_E, b_E - b_I) = 0,$$

and thus

$$\operatorname{Cov}(b_E, b_I) = \operatorname{Var}(b_E).$$

Suppose we have a pair of estimators $\hat{\theta}_E$ and $\hat{\theta}_I$ such that under $H_0$, both are consistent and $\hat{\theta}_E$ is efficient relative to $\hat{\theta}_I$, while under $H_1$, $\hat{\theta}_I$ remains consistent but $\hat{\theta}_E$ is inconsistent. We can then test the hypothesis with the Hausman statistic

$$H = (\hat{\theta}_I - \hat{\theta}_E)'\left\{\text{Est. Asy. Var}[\hat{\theta}_I] - \text{Est. Asy. Var}[\hat{\theta}_E]\right\}^{-1}(\hat{\theta}_I - \hat{\theta}_E) \xrightarrow{d} \chi^2(J).$$
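A sketch of the test applied to OLS (efficient under $H_0$) versus IV (consistent under $H_1$), with an assumed endogenous DGP; here $J = 1$ and the $\chi^2(1)$ 5% critical value is 3.84:

```python
# Sketch of the Hausman test: compare OLS (efficient under H0: exogeneity)
# with IV (consistent under H1). The endogenous DGP is an assumption.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
w = rng.standard_normal(n)              # unobserved confounder
z = rng.standard_normal(n)              # instrument
x = z + w + rng.standard_normal(n)      # endogenous regressor
y = 1.0 * x + w + rng.standard_normal(n)

X, Z = x[:, None], z[:, None]
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)    # first-stage fitted values
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

s2 = np.sum((y - X @ b_iv) ** 2) / n             # disturbance variance estimate
d = b_iv - b_ols
V = s2 * (np.linalg.inv(X_hat.T @ X_hat) - np.linalg.inv(X.T @ X))
H = float(d @ np.linalg.solve(V, d))
print(H > 3.84)    # True: exceeds the chi2(1) 5% critical value, reject H0
```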

# Problems

# Weak Instrument

The estimated asymptotic covariance matrix of the IV estimator,

$$\text{Est. Asy. Var}[b_{IV}] = \hat{\sigma}^2 (Z'X)^{-1}(Z'Z)(X'Z)^{-1},$$

will be "large" when the instruments are weak: if $Z$ is only slightly correlated with $X$, then $Z'X/n$ is close to zero and $(Z'X)^{-1}$ blows up.
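A small Monte Carlo sketch of this effect: shrinking the first-stage coefficient (values below are illustrative assumptions) inflates the sampling spread of $b_{IV}$:

```python
# Monte Carlo sketch: the spread of the IV estimator explodes as the
# instrument weakens. First-stage strengths pi are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 2_000, 500

def iv_draws(pi):
    """IV estimates over `reps` simulated samples, first stage x = pi*z + noise."""
    est = np.empty(reps)
    for r in range(reps):
        w = rng.standard_normal(n)
        z = rng.standard_normal(n)
        x = pi * z + w + rng.standard_normal(n)
        y = 1.0 * x + w + rng.standard_normal(n)
        est[r] = (z @ y) / (z @ x)     # just-identified IV slope
    return est

strong, weak = iv_draws(1.0), iv_draws(0.05)
print(strong.std(), weak.std())        # the weak-instrument spread is far larger
```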

# Measurement Error

# Single Regressor Model

A regression model with a single regressor and no constant term,

$$y^* = \beta x^* + \varepsilon,$$

where $y^*$ and $x^*$ are not observable; we can only observe $y$ and $x$,

$$y = y^* + v \text{ with } v \sim N[0, \sigma_v^2], \qquad x = x^* + u \text{ with } u \sim N[0, \sigma_u^2].$$

While the measurement error in $y^*$ can be absorbed into the disturbance, the measurement error in $x^*$ causes

$$\operatorname{plim} b = \frac{\beta}{1 + \sigma_u^2 / Q^*},$$

where $Q^* = \operatorname{plim} \frac{1}{n} \sum_i x_i^{*2}$.
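A sketch of this attenuation result, with illustrative values ($\beta = 2$, $Q^* = 1$, $\sigma_u^2 = 1$, so $\operatorname{plim} b = 1$):

```python
# Sketch of attenuation bias with illustrative values: Q* = 1, sigma_u = 1,
# so plim b = beta / (1 + sigma_u^2 / Q*) = 2 / 2 = 1.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta, sigma_u = 2.0, 1.0

x_star = rng.standard_normal(n)                 # true regressor, Q* = 1
y = beta * x_star + rng.standard_normal(n)      # y observed without error here
x = x_star + sigma_u * rng.standard_normal(n)   # observed regressor with error

b = (x @ y) / (x @ x)                           # OLS on the mismeasured x
print(b)                                        # close to 1.0, not 2.0
```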

# Multiple Regression Model

In a multiple regression model,

$$\operatorname{plim} b = [Q^* + \Sigma_{uu}]^{-1} Q^* \beta = \beta - [Q^* + \Sigma_{uu}]^{-1} \Sigma_{uu} \beta.$$

When only a single variable (say the first) is measured with error, we have

$$\operatorname{plim} b_1 = \frac{\beta_1}{1 + \sigma_u^2 q^{*11}},$$

and for $k \neq 1$,

$$\operatorname{plim} b_k = \beta_k - \frac{\beta_1 \sigma_u^2 q^{*k1}}{1 + \sigma_u^2 q^{*11}},$$

where $q^{*k1}$ is the $(k,1)$th element of $(Q^*)^{-1}$.
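These formulas can be checked numerically (the matrix $Q^*$ and all coefficients below are illustrative assumptions):

```python
# Sketch: measurement error in x1 alone attenuates b1 and contaminates b2.
# Q*, beta, and sigma_u^2 below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta = np.array([1.0, 2.0])
Q = np.array([[1.0, 0.5],
              [0.5, 1.0]])                       # Q* = plim X*'X*/n
sigma_u2 = 0.5

X_star = rng.multivariate_normal([0.0, 0.0], Q, size=n)
y = X_star @ beta + rng.standard_normal(n)
X = X_star.copy()
X[:, 0] += np.sqrt(sigma_u2) * rng.standard_normal(n)  # error in x1 only

b = np.linalg.solve(X.T @ X, X.T @ y)
Qinv = np.linalg.inv(Q)                          # (Q*)^{-1}
denom = 1 + sigma_u2 * Qinv[0, 0]
pred = np.array([beta[0] / denom,
                 beta[1] - beta[0] * sigma_u2 * Qinv[1, 0] / denom])
print(b, pred)                                   # both close to [0.6, 2.2]
```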