Tuesday, 3 February 2015

Maximum Likelihood - with examples of regressions of continuous and discrete dependent variables

The idea of Maximum Likelihood is very simple, yet powerful. A regression returns a residual for every case, telling us how far the estimation is off. In Ordinary Least Squares, we minimize the sum of squared residuals. In Maximum Likelihood, we ... maximize the likelihood of the observed data under the model. It sounds like the same thing, and in some cases it is the same thing. But whereas OLS computes the parameters of the model in closed form, ML uses a maximization algorithm (generally Newton-Raphson) that tries a sequence of candidate parameter values. These steps are called iterations, and when the model works well, they will 'converge'. If there is no maximum (i.e. the maximand is not concave), or there are several maxima, you are out of luck.
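
To make those iterations concrete, here is a minimal Newton-Raphson sketch in Python for a toy likelihood: estimating the success probability p from 37 successes in 100 Bernoulli trials. The data, starting value and tolerance are made up for illustration; the analytical answer is k/n = 0.37, and the iterations converge to it.

# Newton-Raphson on a toy log-likelihood: log L(p) = k*log(p) + (n-k)*log(1-p)
n, k = 100, 37                                     # illustrative data
p = 0.9                                            # deliberately poor starting value

for it in range(25):
    score = k / p - (n - k) / (1 - p)              # first derivative of log L
    hessian = -k / p**2 - (n - k) / (1 - p)**2     # second derivative (negative: concave)
    step = score / hessian
    p = p - step                                   # Newton-Raphson update
    print(f"iteration {it + 1}: p = {p:.6f}")
    if abs(step) < 1e-10:                          # the iterations have 'converged'
        break

print("ML estimate:", round(p, 4), "- analytical answer k/n =", k / n)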

Continuous dependent variables

Take a simple linear regression, y = Xb + e, and suppose - as in classical OLS - that the error term is normally distributed. The residual is then e = y - Xb.

Now let f(e) = (2.pi.sigma^2)^(-1/2) . exp( -e^2 / (2.sigma^2) )

In this case f is the normal density function. If the error is 0, the density is highest (about 0.40 when sigma = 1, since 1/sqrt(2.pi) is roughly 0.399), so a good estimation needs small errors.
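
A quick numerical check of that peak value, taking sigma = 1 for concreteness:

import math
from scipy.stats import norm

# Peak of the normal density at e = 0 when sigma = 1: 1/sqrt(2*pi)
print((2 * math.pi) ** -0.5)    # 0.3989...
print(norm.pdf(0))              # same value via scipy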

The likelihood function we want to maximize is the joint density of the sample. This means we compute the density of the error for every case and multiply all of these densities together. We may just as well take logs and sum the log densities of the errors. When this sum of logs is maximized, we obtain the same set of coefficients b as in OLS. Note that this likelihood is the right one only if the error term is indeed normally distributed.
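
As a rough sketch of this in Python (the simulated data, the use of scipy.optimize.minimize and the BFGS method are illustrative assumptions, not part of the original post): we minimize the negative of the sum of log densities of the residuals and compare the result with closed-form OLS.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated data: y = 1 + 2*x + normal error (illustrative values)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                # intercept and slope
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

def neg_log_likelihood(params):
    b = params[:2]
    sigma = np.exp(params[2])                       # keeps sigma positive
    e = y - X @ b                                   # residual for every case
    return -np.sum(norm.logpdf(e, scale=sigma))     # minus the sum of log densities

# BFGS is a quasi-Newton method, in the same family as Newton-Raphson
res = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
b_ml = res.x[:2]

# Closed-form OLS for comparison
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("ML  coefficients:", b_ml.round(3))
print("OLS coefficients:", b_ols.round(3))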

Discrete dependent variables

When the outcome variable y_obs is binary (0,1), such a linear model will not work. Instead we assume that a latent (unobserved) variable y* lies behind the observed outcome. If that latent variable is greater than 0, we observe a success (1), and if not, a failure (0). Hence again we have a linear model, now for the latent variable:

y* = xb + e

Yet in this case we cannot simply compute a residual, because y* itself is never observed. In the probit case we assume the error e is normally distributed, so the probability of a success is the normal cumulative probability of xb. In principle this depends on a scaling parameter sigma, but since only the sign of y* is observed, sigma cannot be identified and is conventionally fixed at 1.
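
A small sketch of this latent-variable setup on simulated data (the coefficients and sample size are made up for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Latent-variable formulation of the probit model (illustrative coefficients)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([0.3, 1.2])

y_star = X @ b_true + rng.normal(size=n)     # latent variable, sigma fixed at 1
y_obs = (y_star > 0).astype(int)             # we only observe the sign

# Probability of a success for each case: P(y_star > 0) = P(xb)
p_success = norm.cdf(X @ b_true)
print(y_obs[:10])
print(p_success[:10].round(2))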

The likelihood contribution of a case is then P(xb) if y_obs = 1, and 1-P(xb) if y_obs = 0. Another way to express this is:

L_i = P(xb)^y_obs * (1-P(xb))^(1-y_obs).

Maximizing the product of all L_i or the sum of the logs yields good estimates of b.
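
Continuing the sketch above, the probit log-likelihood can be maximized numerically; again the simulated data and the choice of scipy.optimize.minimize with BFGS are illustrative assumptions. The recovered coefficients should be close to the ones used to generate the data.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulated probit data (illustrative values)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([0.3, 1.2])
y_obs = (X @ b_true + rng.normal(size=n) > 0).astype(int)

def neg_log_likelihood(b):
    p = norm.cdf(X @ b)
    p = np.clip(p, 1e-12, 1 - 1e-12)               # guard against log(0)
    # sum of log L_i = y_obs*log(P) + (1 - y_obs)*log(1 - P)
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print("ML probit coefficients:", res.x.round(3))   # should be close to b_true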

Note: an alternative to probit is the logit or logistic regression. Then the function P is not the normal cumulative distribution function, but instead it is the inverse logit:

P(xb) = exp(xb) / (1+ exp(xb) )
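
The inverse logit is a one-liner; as a sketch, swapping it in for the normal CDF in the probit likelihood above gives a logistic regression (the variable names mirror the earlier illustrative snippets):

import numpy as np

def inverse_logit(z):
    # P(xb) = exp(xb) / (1 + exp(xb)), written in a numerically safer form
    return 1.0 / (1.0 + np.exp(-z))

# Replacing norm.cdf(X @ b) by inverse_logit(X @ b) in the probit likelihood
# above turns it into the logit (logistic regression) likelihood.
print(inverse_logit(np.array([-2.0, 0.0, 2.0])))   # approx. [0.12, 0.5, 0.88]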

The advantage here is that the function has a simple closed form, so no normal integral (with its scaling sigma) has to be evaluated. The logit used to be popular in the early days of sociology, before econometrics became dominant in the social sciences.

Links

The last link has a nice picture comparing the normal cumulative distribution function and the inverse logit (or logistic) function. The latter has 'fatter' tails, even though the difference is negligible in practice.
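
For a rough idea of that picture, here is a minimal matplotlib sketch; the rescaling factor of about 1.7 is a common rule of thumb for putting the two curves on a comparable scale and is an assumption here, not something taken from the linked page.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

z = np.linspace(-4, 4, 400)
normal_cdf = norm.cdf(z)
# Rescaling by ~1.7 puts the logistic curve on roughly the same scale as the
# normal CDF; without it the logistic curve is visibly flatter.
logistic = 1.0 / (1.0 + np.exp(-1.7 * z))

plt.plot(z, normal_cdf, label="normal CDF (probit)")
plt.plot(z, logistic, "--", label="inverse logit (rescaled)")
plt.xlabel("xb")
plt.ylabel("P(success)")
plt.legend()
plt.title("Probit versus logit link")
plt.show()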