The idea of Maximum Likelihood is very simple, yet powerful. A regression returns a residual for every case, telling us how far the estimation is off. In Ordinary Least Squares, we minimize the sum of squared residuals. In Maximum Likelihood, we ... maximize the likelihood of the prediction. It sounds like the same thing, and in some cases it is. But whereas OLS computes the parameters of the model analytically, ML uses a maximization algorithm (generally Newton-Raphson) that checks multiple candidate values. These steps are called iterations, and when the model works well, they will 'converge'. If there is no maximum (i.e. the maximizand is not concave), or if there are several maxima, you are out of luck.
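To make the idea of iterations concrete, here is a minimal Python sketch (the data and function names are illustrative, not any package's actual implementation) of Newton-Raphson applied to the log-likelihood of a normal sample with unknown mean:

```python
import numpy as np

# Illustrative Newton-Raphson maximization of a log-likelihood.
# Toy model: y_i ~ Normal(mu, 1); we estimate mu by maximum likelihood.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)

def score(mu):
    # First derivative of the log-likelihood with respect to mu
    return np.sum(y - mu)

def hessian(mu):
    # Second derivative (constant here): -n
    return -len(y)

mu = 0.0  # starting value
for iteration in range(50):
    step = score(mu) / hessian(mu)
    mu = mu - step                      # Newton-Raphson update
    print(f"iteration {iteration}: mu = {mu:.6f}")
    if abs(step) < 1e-8:                # 'convergence'
        break

print("ML estimate:", mu, " sample mean:", y.mean())
```

For this toy likelihood the maximizand is concave, so the iterations converge (here in a single step, to the sample mean); with a flat or multi-peaked likelihood the same loop would never settle, which is what non-convergence messages signal in practice.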
Continuous dependent variables
Take a simple regression like y = Xb + e, and suppose - as in the classical OLS set-up - that the error term is normally distributed, so that e = y - Xb. Now let

f(e) = (2.pi.sigma^2)^(-1/2) . exp( -e^2 / (2.sigma^2) )
In this case f is the normal density function (the density of the normal distribution). If the error is 0, the density is at its highest (about .40 when sigma = 1), so a good estimation will need to have small errors.
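To check that peak value (a quick sketch assuming scipy is available; norm.pdf is the normal density):

```python
from scipy.stats import norm

# Density of a standard normal (sigma = 1) error at e = 0
print(norm.pdf(0))             # about 0.3989
# A smaller sigma gives a higher, narrower peak, e.g. sigma = 0.5
print(norm.pdf(0, scale=0.5))  # about 0.7979
```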
The likelihood function we want to maximize is the joint density of the sample. This means we compute the density of the error for every case and multiply all of them. We may just as well take the log and sum the logs of the error densities. When this sum of logs is maximized, we have the same set of coefficients b as in OLS. Note that this will be the true b only if the error term is indeed normally distributed.
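A sketch of this equivalence in Python, with simulated data and illustrative names (nothing here beyond the model y = Xb + e comes from the text): maximizing the summed log densities of the residuals reproduces the OLS coefficients.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated data: y = 1 + 2*x + normal error
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.8, size=n)

def neg_log_likelihood(params):
    b = params[:2]
    sigma = np.exp(params[2])        # keep sigma positive
    e = y - X @ b                    # residuals
    return -np.sum(norm.logpdf(e, scale=sigma))

# Maximize the likelihood (= minimize its negative)
res = minimize(neg_log_likelihood, x0=np.zeros(3))
b_ml = res.x[:2]

# OLS for comparison
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("ML coefficients: ", b_ml)
print("OLS coefficients:", b_ols)
```

The estimated sigma (exp(res.x[2])) also comes out close to the simulated 0.8, which OLS would recover separately from the residual variance.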
Discrete dependent variables
When the outcome variable y_obs is binary (0,1), such a linear model will not work. Instead we estimate a latent function that lies behind the observed outcome. If that function results in a number greater than 0, we observe a success (1); if not, a failure (0). Hence again we have:

y = xb
Yet in this case we cannot simply compute a residual. In the probit case, we take the cumulative normal probability of xb, written P(xb) below. Obviously this depends on a scaling parameter sigma.
The likelihood contribution of a case is then P(xb) if y_obs = 1, and 1 - P(xb) if y_obs = 0. Another way to express this is:

L_i = P(xb)^y_obs * (1-P(xb))^(1-y_obs)

Maximizing the product of all L_i, or the sum of their logs, yields good estimates of b.
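A minimal probit sketch along the same lines (simulated data, illustrative names; the success/failure rule follows the latent-index description above):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated binary data: latent index xb + normal error, success if > 0
rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([0.5, 1.5])
y_obs = (X @ b_true + rng.normal(size=n) > 0).astype(float)

def neg_log_likelihood(b):
    p = norm.cdf(X @ b)                   # P(xb): probability of a success
    p = np.clip(p, 1e-12, 1 - 1e-12)      # guard against log(0)
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=np.zeros(2))
print("probit estimates:", res.x)         # should be close to b_true
```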
Note: an alternative to probit is the logit or logistic regression. Then the function P is not the normal cumulative distribution function, but the inverse logit:

P(xb) = exp(xb) / (1 + exp(xb))

The advantage here is that no scaling parameter sigma enters the function. Logit used to be popular in the early days of sociology, before econometrics became dominant in the social sciences.
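The same maximization carries over to logit by swapping the link function; a self-contained sketch (again with simulated data and illustrative names):

```python
import numpy as np
from scipy.optimize import minimize

def inv_logit(z):
    # P(xb) = exp(xb) / (1 + exp(xb))
    return 1.0 / (1.0 + np.exp(-z))

# Simulated binary data with a logistic link
rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y_obs = (rng.uniform(size=n) < inv_logit(X @ np.array([0.5, 1.5]))).astype(float)

def neg_log_likelihood(b):
    p = np.clip(inv_logit(X @ b), 1e-12, 1 - 1e-12)
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=np.zeros(2))
print("logit estimates:", res.x)   # close to (0.5, 1.5)
```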
Links
The last link has a nice picture comparing the normal cumulative distribution function and the inverse logit (or logistic) function. The latter has 'fatter' tails, even though the difference is negligible in practice.
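A quick numerical look at that comparison (assuming scipy; the logistic curve is rescaled to unit variance so the two are on the same footing):

```python
import numpy as np
from scipy.stats import norm, logistic

# Standardize the logistic to unit variance so the curves are comparable
s = np.sqrt(3) / np.pi
for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    tail_normal = norm.sf(x)               # 1 - Phi(x)
    tail_logit = logistic.sf(x, scale=s)   # 1 - standardized inverse logit
    print(f"x={x}: normal tail {tail_normal:.5f}, logistic tail {tail_logit:.5f}")
```

Beyond about two standard deviations the logistic tail is several times larger than the normal one, but both are already tiny, which is why the choice between the two rarely matters in practice.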