Continuous dependent variables
Take a simple regression like y = Xb, and suppose - as in classical OLS - that the error term is normally distributed. Then e = y - Xb. Now let f(e) = (2*pi*sigma^2)^(-1/2) * exp(-e^2 / (2*sigma^2)).
Here f is the normal density function. The density is highest when the error is 0 (about 0.40 when sigma = 1), so coefficients that keep the errors small make the observed data more likely.
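A quick check with scipy confirms the peak of the density (this snippet assumes sigma = 1):

```python
from scipy.stats import norm

print(norm.pdf(0.0))   # ~0.3989: the density is highest when the error is 0
print(norm.pdf(2.0))   # ~0.0540: a large error gets a much lower density
```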
The likelihood function we want to maximize is the joint probability density of the sample. This means we compute the density of the error for every case and multiply all of them; equivalently, we take the log of each density and sum the logs. When this sum of logs is maximized, we obtain the same coefficients b as in OLS. Note that these are maximum-likelihood estimates of the true b only if the error term is indeed normally distributed.
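A minimal sketch of this idea in Python, using simulated data purely for illustration: we write the log-likelihood as the sum of log normal densities of the residuals, maximize it numerically, and compare the result with the closed-form OLS solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # intercept + one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

def neg_log_likelihood(params):
    b, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                # keep sigma positive
    e = y - X @ b                            # residuals
    # sum of log normal densities of the residuals
    ll = -0.5 * np.sum(np.log(2 * np.pi * sigma**2) + e**2 / sigma**2)
    return -ll                               # minimize the negative log-likelihood

res = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
b_ml = res.x[:-1]
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(b_ml, b_ols)   # the two coefficient vectors agree up to optimizer tolerance
```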
Discrete dependent variables
When the outcome variable y_obs is binary (0, 1), such a linear model will not work. Instead we estimate a latent function that lies behind the observed outcome: if that function results in a number greater than 0, we observe a success (1), and otherwise a failure (0). Hence again we have a linear index, y* = xb + e, with y_obs = 1 whenever y* > 0.
Yet in this case we cannot simply compute a residual, because y* is not observed. In the probit case we take the cumulative normal probability of xb: P(y_obs = 1) = P(e > -xb) = Phi(xb/sigma). Only the ratio b/sigma is identified, so the scaling parameter sigma is conventionally set to 1.
The likelihood contribution is then P(xb) if y_obs = 1, and 1 - P(xb) if y_obs = 0. Another way to express this is:
L_i = P(xb)^y_obs * (1-P(xb))^(1-y_obs).
Maximizing the product of all L_i, or equivalently the sum of their logs, yields the maximum-likelihood estimate of b.
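A sketch of the same recipe for probit, again with simulated data and the error scale fixed at 1 (numpy and scipy assumed available):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([0.5, 1.5]) + rng.normal(size=n)   # latent index, sigma fixed at 1
y_obs = (y_star > 0).astype(float)                       # we only observe success/failure

def neg_log_likelihood(b):
    p = norm.cdf(X @ b)                                  # P(xb), the normal cdf
    p = np.clip(p, 1e-12, 1 - 1e-12)                     # avoid log(0)
    # log L_i = y_obs*log P(xb) + (1 - y_obs)*log(1 - P(xb))
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print(res.x)   # estimates of b, close to (0.5, 1.5)
```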
Note: an alternative to probit is the logit or logistic regression. Then the function P is not the normal cumulative distribution function, but instead it is the inverse logit:
P(xb) = exp(xb) / (1 + exp(xb))
The advantage here is that P has a simple closed form and, as in probit, no scaling sigma has to be estimated. Logit used to be popular in the early days of sociology, before econometrics became dominant in the social sciences.
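In the probit sketch above, switching to logit only means swapping the link function; a hypothetical variant reusing the same X and y_obs:

```python
def neg_log_likelihood_logit(b):
    p = 1.0 / (1.0 + np.exp(-(X @ b)))   # inverse logit, equal to exp(xb) / (1 + exp(xb))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))
```

Because the standard logistic error has a larger standard deviation (about 1.8) than the standard normal, the logit coefficients come out roughly 1.6 to 1.8 times larger than the probit ones, while the fitted probabilities are nearly identical.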
Links
The last link has a nice picture comparing the normal cumulative distribution function with the inverse logit (logistic) function. The latter has 'fatter' tails, though the difference is negligible in practice.
- http://en.wikipedia.org/wiki/Logit
- http://en.wikipedia.org/wiki/Probit
- http://en.wikipedia.org/wiki/Logistic_regression#Bayesian_logistic_regression