I try to explain sample selection bias in the simplest of words so that even I can understand it.
Typically in OLS, we run a regression through the mean. With an intercept in the model, this ensures that the mean of the predicted values (RHS) equals the mean of the observed outcome (LHS). This is equivalent to saying the expected error is zero. This is an essential point.
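A minimal sketch of this point, using made-up data (the coefficients 1.0 and 2.0 and the sample size are assumptions for illustration only): once an intercept is included, the OLS residuals average to zero, so the mean prediction matches the mean outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: y = 1 + 2x + noise
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# OLS with an intercept column
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ b
mean_residual = residuals.mean()           # zero up to rounding error
mean_gap = (X @ b).mean() - y.mean()       # mean prediction minus mean outcome

print(mean_residual, mean_gap)
```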
Now suppose that the real error does not have a zero mean, for instance because you are missing a part of the sample, and this part has some unobserved characteristics (error u in the selection model) that correlate with the real error in the model. What will happen then?
Y = Xb + error
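Say that those who are not selected may be out of the sample because of the same unobserved characteristics that lead to a lower Y. In that case the error in the selected sample (e) is on average larger than the real error (e*), so e > e*, and the error terms e* and u will be positively correlated. The estimate of b will be biased as well. The direction of that bias depends on how each variable in X relates to the selection chance.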
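To see what this does in practice, here is a small simulation sketch. Everything in it is an assumption made for illustration: the true model y = 1 + 2x + e, the selection rule x + u > 0, and a correlation of 0.8 between e and u. In this particular setup x raises the selection chance, so the slope in the selected sample comes out too low; a different setup could push it the other way.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
rho = 0.8  # assumed correlation between the two error terms

x = rng.normal(size=n)
u = rng.normal(size=n)                                   # selection error
e = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)   # outcome error
y = 1.0 + 2.0 * x + e

# An observation is only in the sample if x + u > 0
selected = (x + u) > 0


def ols(x, y):
    """Return [intercept, slope] from a simple OLS fit."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b


b_full = ols(x, y)                      # uses everyone
b_sel = ols(x[selected], y[selected])   # uses the selected sample only
mean_error_selected = e[selected].mean()

print("full sample:    ", b_full)
print("selected sample:", b_sel)
print("mean error among the selected:", mean_error_selected)
```

The mean error among the selected observations is clearly above zero, which is exactly the violation of the zero-mean assumption described above.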
It is possible to control for the sample selection bias by including the inverse Mills ratio (IMR or lambda), which reflects how unlikely it is for an observation to be in the sample. It is the normal density of the predicted selection index divided by its cumulative probability. If you plot this, you will notice that it is a non-linear function. However, it has a near-linear part. As a result, the IMR can end up almost collinear with the variables in the model, so when you predict the selection chance you need to include more than just the variables in the model to be able to identify the selection bias. However, you cannot leave the model's variables out of the selection equation if they determine the selection chance. If you do, the estimated IMR will be wrong and you get omitted variable bias (OVB). This is similar to the case of two correlated variables: leaving out one of them will bias the effect of the other.
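As a rough sketch, the IMR can be computed with nothing more than the standard normal density and cumulative distribution. The function names below are my own, not from any package. Printing a few values shows the shape: for clearly negative values of the selection index the IMR is close to a straight line (roughly minus the index), and it flattens towards zero for large positive values.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def inverse_mills_ratio(z):
    """Density over cumulative probability at selection index z."""
    return normal_pdf(z) / normal_cdf(z)


for z in (-3, -2, -1, 0, 1, 2, 3):
    print(z, round(inverse_mills_ratio(z), 3))
```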
Just in case you also want to interpret the results: the coefficient of the IMR (or lambda), say b_m, is in fact the product of the correlation between the error terms (rho) and a scaling parameter (sigma), which is always positive. So the sign of b_m tells you the sign of rho. If you use the heckman command, rho is shown straight away.
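The interpretation of b_m can be checked with a simulation sketch. All numbers here are assumptions: rho = 0.8, sigma = 1.5, a true slope of 2, and an extra variable z that only affects selection. To keep it short, the sketch uses the true selection index x + z instead of estimating it with a probit first, which is a shortcut you do not have with real data.

```python
import math
import numpy as np

rng = np.random.default_rng(7)
n = 50000
rho, sigma = 0.8, 1.5   # assumed values, so rho * sigma = 1.2

x = rng.normal(size=n)
z = rng.normal(size=n)  # affects selection only (hypothetical extra variable)
u = rng.normal(size=n)  # selection error
e = sigma * (rho * u + math.sqrt(1 - rho**2) * rng.normal(size=n))
y = 1.0 + 2.0 * x + e

# Shortcut: the true selection index is treated as known
index = x + z
selected = (index + u) > 0

# Inverse Mills ratio: normal density over cumulative probability
pdf = np.exp(-0.5 * index**2) / math.sqrt(2 * math.pi)
cdf = 0.5 * (1 + np.vectorize(math.erf)(index / math.sqrt(2)))
imr = pdf / cdf

# Regress y on x and the IMR, using the selected sample only
X = np.column_stack([np.ones(n), x, imr])[selected]
coef, *_ = np.linalg.lstsq(X, y[selected], rcond=None)
b_x, b_m = coef[1], coef[2]

print("slope on x:", b_x)            # should be close to the true 2.0
print("coefficient on IMR:", b_m)    # should be close to rho * sigma
```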
Now the tricky question is: why is it that a good estimate of the IMR solves the sample selection problem? This is a matter of working through Heckman's derivation. I will add this later.