# Endogeneity and IV Estimation

Endogeneity arises when exogeneity is violated, i.e., $E[\varepsilon \mid X] \neq 0$.

The following are some sources of endogeneity.

  • Omitted variable bias
  • Simultaneous/Reverse causality
  • Measurement error
  • Sample selection bias

# Assumption

$[x_i, z_i, \varepsilon_i],\ i = 1, \dots, n$, is an i.i.d. sequence of random variables.

# Explanatory variables

When exogeneity is violated, we assume

$$E[\varepsilon \mid X] = \eta,$$

(where $\eta$ is a function of $X$) and

$$E[x_i \varepsilon_i] = \gamma \quad \Longrightarrow \quad \operatorname{plim} \frac{1}{n} X'\varepsilon = \gamma,$$

then the estimator b is biased,

$$E[b \mid X] = \beta + (X'X)^{-1}X'\eta \neq \beta,$$

and is inconsistent,

$$\operatorname{plim} b = \beta + \operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} \operatorname{plim}\left(\frac{X'\varepsilon}{n}\right) = \beta + Q_{XX}^{-1}\gamma \neq \beta.$$
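As a numerical sketch of this inconsistency (the data-generating process and all parameter values below are illustrative assumptions, not from the text), a common shock entering both the regressor and the disturbance drives the OLS estimate away from $\beta$:

```python
# Sketch: OLS is inconsistent when the regressor is correlated with the
# disturbance. The DGP below (a shared shock w) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 2.0

w = rng.standard_normal(n)            # unobserved common shock
x = w + rng.standard_normal(n)        # regressor: Cov(x, eps) = Var(w) = 1
eps = w + rng.standard_normal(n)      # disturbance shares w with x
y = beta * x + eps

b_ols = (x @ y) / (x @ x)             # OLS slope, no constant
# Here Q_XX = Var(x) = 2 and gamma = Cov(x, eps) = 1, so
# plim b = beta + gamma / Q_XX = 2.5, not 2.
print(b_ols)
```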

# Instrumental variables

Instrumental variables $Z$ should satisfy:

  1. Exogeneity: they are uncorrelated with the disturbance $\varepsilon$.
  2. Relevance: they are correlated with the regressors $X$.

These conditions can be stated formally as:

  1. $\operatorname{plim} \frac{1}{n} Z'Z = Q_{ZZ}$, a finite, positive definite matrix (well-behaved data).
  2. $\operatorname{plim} \frac{1}{n} Z'X = Q_{ZX}$, a finite $L \times K$ matrix with rank $K$ (relevance).
  3. $\operatorname{plim} \frac{1}{n} Z'\varepsilon = 0$ (exogeneity).

# IV Estimation

# Situation when L=K

We partition $X$ into $x_1$, a set of $K_1$ exogenous variables, and $x_2$, a set of $K_2$ endogenous variables. Then $Z = [x_1, z_2]$, where $z_2$ are the instrumental variables for $x_2$, and $x_1$ serve as instruments for themselves.

$$b_{IV} = (Z'X)^{-1} Z'y.$$

$b_{IV}$ is consistent.

  • The asymptotic distribution is

    $$b_{IV} \overset{a}{\sim} N\!\left[\beta, \frac{\sigma^2}{n} Q_{ZX}^{-1} Q_{ZZ} Q_{XZ}^{-1}\right].$$

  • The asymptotic covariance matrix is estimated as

    $$\text{Est. Asy. Var}[b_{IV}] = \hat{\sigma}^2 (Z'X)^{-1}(Z'Z)(X'Z)^{-1},$$

    where

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - x_i' b_{IV}\right)^2.$$
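The just-identified IV estimator and its estimated asymptotic variance can be sketched as follows (the instrument and all DGP values are illustrative assumptions):

```python
# Sketch of the just-identified IV estimator (L = K = 1) and the variance
# estimator above. The instrument and DGP values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 1.5

w = rng.standard_normal(n)              # unobserved confounder
z = rng.standard_normal(n)              # instrument: exogenous and relevant
x = z + w + rng.standard_normal(n)
eps = w + rng.standard_normal(n)
y = beta * x + eps

Z, X = z[:, None], x[:, None]
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)          # (Z'X)^{-1} Z'y
resid = y - X @ b_iv
sigma2_hat = resid @ resid / n
avar = sigma2_hat * np.linalg.inv(Z.T @ X) @ (Z.T @ Z) @ np.linalg.inv(X.T @ Z)
print(b_iv[0], np.sqrt(avar[0, 0]))     # estimate near 1.5, with its std. error
```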

In general, we have

$$Q_{Xy} = Q_{XX}\beta + \gamma,$$

so $\beta$ and $\gamma$ cannot be jointly identified from the data alone. There are two ways forward:

  • Assume $\gamma = 0$, which is the standard OLS assumption.
  • Find instrumental variables $Z$ and use $Q_{Zy} = Q_{ZX}\beta$ to estimate $\beta$.

# Situation when L>K

Estimator:

$$b_{IV} = (\hat{X}'X)^{-1}\hat{X}'y = \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1} X'Z(Z'Z)^{-1}Z'y,$$

where $\hat{X} = Z(Z'Z)^{-1}Z'X$.

$\hat{X}$ is the most efficient choice of instruments.

In practice, $b_{IV}$ can be computed in two steps (two-stage least squares, 2SLS): first form $\hat{X}$, and then compute $b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.

The two expressions agree because

$$\hat{X}'\hat{X} = X'Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z'X = X'Z(Z'Z)^{-1}Z'X = \hat{X}'X.$$
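A minimal 2SLS sketch under these formulas, assuming one endogenous regressor and two instruments (all DGP values are illustrative):

```python
# Sketch of 2SLS in the overidentified case: L = 2 instruments for K = 1
# endogenous regressor. The DGP values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = 0.8

w = rng.standard_normal(n)
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x = z1 + 0.5 * z2 + w + rng.standard_normal(n)
y = beta * x + w + rng.standard_normal(n)

Z = np.column_stack([z1, z2])
X = x[:, None]
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # X_hat = Z (Z'Z)^{-1} Z'X
# By the identity X_hat'X_hat = X_hat'X, regressing y on X_hat gives b_IV.
b_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
print(b_2sls[0])                                # approaches beta = 0.8
```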

# Relevant Tests

We want to test whether the regressors are correlated with the disturbances.

# Hausman Test

The statistic is

$$H = \frac{(b_{IV} - b_{LS})'\left[(\hat{X}'\hat{X})^{-1} - (X'X)^{-1}\right]^{-1}(b_{IV} - b_{LS})}{s^2} \sim \chi^2(K^*),$$

where $K^*$ is the number of endogenous variables.

# Hausman Statistic

The covariance between an efficient estimator $b_E$ of a parameter vector $\beta$ and its difference from an inefficient estimator $b_I$ of the same parameter vector, $b_E - b_I$, is zero:

$$\operatorname{Cov}(b_E, b_E - b_I) = 0,$$

and thus

$$\operatorname{Cov}(b_E, b_I) = \operatorname{Var}(b_E).$$

Suppose we have a pair of estimators $\hat{\theta}_E$ and $\hat{\theta}_I$ such that under $H_0$, both are consistent and $\hat{\theta}_E$ is efficient relative to $\hat{\theta}_I$, while under $H_1$, $\hat{\theta}_I$ remains consistent but $\hat{\theta}_E$ is inconsistent. We can then test the hypothesis with the Hausman statistic

$$H = (\hat{\theta}_I - \hat{\theta}_E)'\left\{\text{Est. Asy. Var}[\hat{\theta}_I] - \text{Est. Asy. Var}[\hat{\theta}_E]\right\}^{-1}(\hat{\theta}_I - \hat{\theta}_E) \xrightarrow{d} \chi^2(J).$$
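A sketch of the test applied to OLS (efficient under $H_0$) versus IV (consistent under $H_1$), with an assumed endogenous DGP; here $J = 1$ and the $\chi^2(1)$ 5% critical value is 3.84:

```python
# Sketch of the Hausman test: compare OLS (efficient under H0: exogeneity)
# with IV (consistent under H1). The endogenous DGP is an assumption.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
w = rng.standard_normal(n)              # unobserved confounder
z = rng.standard_normal(n)              # instrument
x = z + w + rng.standard_normal(n)      # endogenous regressor
y = 1.0 * x + w + rng.standard_normal(n)

X, Z = x[:, None], z[:, None]
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)    # first-stage fitted values
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

s2 = np.sum((y - X @ b_iv) ** 2) / n             # disturbance variance estimate
d = b_iv - b_ols
V = s2 * (np.linalg.inv(X_hat.T @ X_hat) - np.linalg.inv(X.T @ X))
H = float(d @ np.linalg.solve(V, d))
print(H > 3.84)    # True: exceeds the chi2(1) 5% critical value, reject H0
```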

# Problems

# Weak Instrument

The estimated asymptotic covariance matrix of the IV estimator,

$$\text{Est. Asy. Var}[b_{IV}] = \hat{\sigma}^2 (Z'X)^{-1}(Z'Z)(X'Z)^{-1},$$

will be "large" when the instruments are weak: if $Z$ is only slightly correlated with $X$, then $Z'X/n$ is close to zero and $(Z'X)^{-1}$ blows up.
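A small Monte Carlo sketch of this effect: shrinking the first-stage coefficient (values below are illustrative assumptions) inflates the sampling spread of $b_{IV}$:

```python
# Monte Carlo sketch: the spread of the IV estimator explodes as the
# instrument weakens. First-stage strengths pi are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 2_000, 500

def iv_draws(pi):
    """IV estimates over `reps` simulated samples, first stage x = pi*z + noise."""
    est = np.empty(reps)
    for r in range(reps):
        w = rng.standard_normal(n)
        z = rng.standard_normal(n)
        x = pi * z + w + rng.standard_normal(n)
        y = 1.0 * x + w + rng.standard_normal(n)
        est[r] = (z @ y) / (z @ x)     # just-identified IV slope
    return est

strong, weak = iv_draws(1.0), iv_draws(0.05)
print(strong.std(), weak.std())        # the weak-instrument spread is far larger
```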

# Measurement Error

# Single Regressor Model

A regression model with a single regressor and no constant term,

$$y^* = \beta x^* + \varepsilon,$$

where $y^*$ and $x^*$ are not observable; we can only observe $y$ and $x$,

$$y = y^* + v \text{ with } v \sim N[0, \sigma_v^2], \qquad x = x^* + u \text{ with } u \sim N[0, \sigma_u^2].$$

While the measurement error in $y^*$ can be absorbed into the disturbance, the measurement error in $x^*$ causes

$$\operatorname{plim} b = \frac{\beta}{1 + \sigma_u^2 / Q^*},$$

where $Q^* = \operatorname{plim} \frac{1}{n} \sum_i x_i^{*2}$.
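A sketch of this attenuation result, with illustrative values ($\beta = 2$, $Q^* = 1$, $\sigma_u^2 = 1$, so $\operatorname{plim} b = 1$):

```python
# Sketch of attenuation bias with illustrative values: Q* = 1, sigma_u = 1,
# so plim b = beta / (1 + sigma_u^2 / Q*) = 2 / 2 = 1.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta, sigma_u = 2.0, 1.0

x_star = rng.standard_normal(n)                 # true regressor, Q* = 1
y = beta * x_star + rng.standard_normal(n)      # y observed without error here
x = x_star + sigma_u * rng.standard_normal(n)   # observed regressor with error

b = (x @ y) / (x @ x)                           # OLS on the mismeasured x
print(b)                                        # close to 1.0, not 2.0
```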

# Multiple Regression Model

In a multiple regression model,

$$\operatorname{plim} b = [Q^* + \Sigma_{uu}]^{-1} Q^* \beta = \beta - [Q^* + \Sigma_{uu}]^{-1} \Sigma_{uu} \beta.$$

When only a single variable (say the first) is measured with error, we have

$$\operatorname{plim} b_1 = \frac{\beta_1}{1 + \sigma_u^2 q^{*11}},$$

and for $k \neq 1$,

$$\operatorname{plim} b_k = \beta_k - \frac{\beta_1 \sigma_u^2 q^{*k1}}{1 + \sigma_u^2 q^{*11}},$$

where $q^{*k1}$ is the $(k,1)$th element of $(Q^*)^{-1}$.
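These formulas can be checked numerically (the matrix $Q^*$ and all coefficients below are illustrative assumptions):

```python
# Sketch: measurement error in x1 alone attenuates b1 and contaminates b2.
# Q*, beta, and sigma_u^2 below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
beta = np.array([1.0, 2.0])
Q = np.array([[1.0, 0.5],
              [0.5, 1.0]])                       # Q* = plim X*'X*/n
sigma_u2 = 0.5

X_star = rng.multivariate_normal([0.0, 0.0], Q, size=n)
y = X_star @ beta + rng.standard_normal(n)
X = X_star.copy()
X[:, 0] += np.sqrt(sigma_u2) * rng.standard_normal(n)  # error in x1 only

b = np.linalg.solve(X.T @ X, X.T @ y)
Qinv = np.linalg.inv(Q)                          # (Q*)^{-1}
denom = 1 + sigma_u2 * Qinv[0, 0]
pred = np.array([beta[0] / denom,
                 beta[1] - beta[0] * sigma_u2 * Qinv[1, 0] / denom])
print(b, pred)                                   # both close to [0.6, 2.2]
```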