Wednesday 21 January 2015

Sample selection bias: intuition

I try to explain sample selection bias in the simplest of words so that even I can understand it.

Typically in OLS, we run a regression at the mean. This ensures that the mean of the predicted right-hand side (RHS) equals the mean of the observed left-hand side (LHS). This is equivalent to saying the expected error is zero. This is an essential point.

Now suppose that the real error does not have a zero mean, for instance because you are missing a part of the sample, and this part has some unobserved characteristics (error u in the selection model) that correlate with the real error in the model. What will happen then?

Y = Xb + error

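Say that those who are not selected are out of the sample because of the same characteristics that lead to a lower Y. Write e* for the real error (mean zero in the full population) and e for the error among the selected observations. In that case the mean of e exceeds the mean of e*, and the error terms e* and u are positively correlated. The estimate of b will be biased as well; the sign of that bias depends on how X relates to the selection chance.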
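To see the shift in the error mean with numbers, here is a minimal simulation sketch in Python (not Stata). Everything in it is an assumption made for illustration: the errors are jointly normal, the correlation rho is set to 0.7, and an observation is selected whenever the selection error u is positive. The function name simulate_error_means is made up.

```python
import random
from statistics import fmean


def simulate_error_means(n=20000, rho=0.7, seed=12345):
    """Return (mean real error in the full population, mean in the selected sample).

    Assumptions for illustration only: u and the real error are standard
    normal with correlation rho, and selection happens when u > 0.
    """
    rng = random.Random(seed)
    all_errors, selected_errors = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)        # error in the selection model
        noise = rng.gauss(0.0, 1.0)    # part of the real error unrelated to u
        e_star = rho * u + (1.0 - rho ** 2) ** 0.5 * noise
        all_errors.append(e_star)
        if u > 0.0:                    # only these observations are observed
            selected_errors.append(e_star)
    return fmean(all_errors), fmean(selected_errors)


mean_all, mean_selected = simulate_error_means()
print(f"mean real error, full population: {mean_all:.3f}")
print(f"mean real error, selected sample: {mean_selected:.3f}")
```

In the full population the mean error is close to zero, while in the selected sample it is clearly positive. That is exactly the violation of the zero-mean assumption described above.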

It is possible to control for sample selection bias by including the inverse Mills ratio (IMR or lambda), which measures how unlikely an observation is to be in the sample. It is the standard normal density of the predicted selection index divided by the cumulative probability at that index, that is, the predicted selection chance. If you plot it, you will notice that it is a non-linear function. However, it has a near-linear part. As a result, when you predict the selection chance, you need to include more than just the variables in the model to be able to identify the selection bias; otherwise the IMR is close to a linear combination of those same variables. However, you cannot leave out the variables from the model if they determine the selection chance. If you do, the estimated IMR will be wrong and you introduce omitted variable bias (OVB). This is similar to the case of collinearity: leaving out one of two correlated variables will bias the effect of the other.
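A small sketch in Python (standard library only) of how the IMR can be computed, assuming the selection model is a probit so that the density and cumulative probability are those of the standard normal. The function name inverse_mills_ratio and the grid of index values are my own choices for illustration.

```python
from statistics import NormalDist


def inverse_mills_ratio(z):
    """Standard normal density at z divided by the cumulative probability at z."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.pdf(z) / std_normal.cdf(z)


# z is the predicted selection index: a low z means a low selection chance.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  IMR = {inverse_mills_ratio(z):.3f}")
```

The IMR falls as the selection chance rises, and over a good part of the range the values change almost in a straight line. This is why extra variables in the selection equation help to tell the IMR apart from the variables already in the model.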

Just in case you also want to interpret the results: the coefficient of the IMR (or lambda), say b_m, is in fact the product of the correlation between the error terms (rho) and a scaling parameter (sigma), which is always positive. The sign of b_m therefore tells you the sign of rho. If you use Stata's heckman command, rho is reported directly.
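Written out, the relation between b_m, rho and sigma looks as follows. This is only a statement of the standard result under the usual assumptions (jointly normal errors, selection error u with variance one), not a derivation. The symbols S, Z and gamma for the selection indicator, the selection variables and their coefficients are my notation and do not appear in the text above.

```latex
\begin{aligned}
Y &= Xb + e^{*}, \\
S &= 1 \ \text{(selected) if } Z\gamma + u > 0, \\
E[\,Y \mid X, S = 1\,] &= Xb + \rho\,\sigma\,\lambda(Z\gamma),
\qquad \lambda(z) = \frac{\phi(z)}{\Phi(z)},
\qquad b_m = \rho\,\sigma .
\end{aligned}
```

Here phi and Phi are the standard normal density and cumulative distribution, and sigma is the standard deviation of e*.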

Now the tricky question is: why does a good estimate of the IMR solve the sample selection problem? Answering this requires working through Heckman's derivation. I will add this later.

Sunday 4 January 2015

Stata user command -reindex- for rapid conversion between levels, factor changes, percentage changes and index

This program expresses level variables as changes and allows you to convert between three indicators of trends: factors, percentage changes and an index. You must specify a name for the newly constructed variable.

Syntax

reindex varname, generate(name) [i(name)] j(name) from(string) to(string) [fromscale(integer 1)] [toscale(integer 1)] [tobase(integer -999)]

varname is the name of the variable you want to manipulate

Options

generate()  Required: give a name for the new variable.
i()         Set the panel id variable if there are panels. You can use just one variable and it needs to be numeric. Pretty much like xtset. If there is no panel id, you can leave it out.
j()         Required: time variable. Intervals must always be the same.
from()      Required: choose between levels, percent, factor, and index
to()        Required: choose between percent, factor, and index. You can't go back to levels.
fromscale() In case your variable was multiplied by some factor (say 100 for percentages or an index).
toscale()   Same for the target variable.
tobase()    Set the base time period (required with the option to(index))
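To make the indicators concrete, here is a minimal sketch in Python of the arithmetic behind them. It is not the code of the ado. It assumes a single series sorted by time, a factor defined as this period's level over the previous period's level, and a scale that works like fromscale() and toscale() (100 for percentages). The function names are made up.

```python
def levels_to_factors(levels):
    """Growth factor per period: level_t / level_(t-1). The first period has none."""
    return [None] + [cur / prev for prev, cur in zip(levels, levels[1:])]


def factor_to_percent(factor, scale=100.0):
    """A factor of 1.02 becomes 2 (percent) when scale is 100."""
    return (factor - 1.0) * scale


def percent_to_factor(percent, scale=100.0):
    """A value of 2 (percent, scale 100) becomes a factor of 1.02."""
    return 1.0 + percent / scale


levels = [100.0, 102.0, 105.06]
factors = levels_to_factors(levels)
percents = [None if f is None else factor_to_percent(f) for f in factors]
print(factors)
print(percents)
```

Going from changes back to levels is not possible without knowing at least one level, which is why to() does not offer levels.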

Working example

This command will convert GDP growth figures from factors (1.02 for 2 percent growth) to an index with base 2000 = 100.

reindex gdp_growth, generate(gdp_index2000) from(factor) to(index) tobase(2000) toscale(100) i(country) j(year)
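What such a conversion amounts to can be sketched in Python for a single country. This is an illustration of the chaining logic, not the code of the ado. It assumes consecutive years and that the factor stored in a year is the growth from the previous year to that year. The function name factors_to_index and the growth figures are made up.

```python
def factors_to_index(years, factors, base_year, scale=100.0):
    """Chain growth factors into an index that equals `scale` in `base_year`.

    Assumes `years` is sorted with equal intervals and factors[k] is the
    growth from years[k-1] to years[k], so factors[0] is never used.
    """
    base_pos = years.index(base_year)
    index = [None] * len(years)
    index[base_pos] = scale
    # Forward from the base year: multiply by each year's factor.
    for k in range(base_pos + 1, len(years)):
        index[k] = index[k - 1] * factors[k]
    # Backward from the base year: divide by the following year's factor.
    for k in range(base_pos - 1, -1, -1):
        index[k] = index[k + 1] / factors[k + 1]
    return index


years = [1999, 2000, 2001, 2002]
factors = [None, 1.02, 1.03, 0.99]   # made-up growth factors
index = factors_to_index(years, factors, base_year=2000)
for year, value in zip(years, index):
    print(year, round(value, 2))
```

With panels, the same chaining would be done separately within each value of the panel id.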

Install

Download the following file: reindex.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder r\. Stata will now search this directory for programs and have the command ready when you call it.

After a few weeks of using the program, please send me an email with your remarks so that I can improve the code and iron out bugs. My address is in the ado file.

Stata user command -eurostat- for downloading Eurostat data

Note: feel free to use the program, but you may want to follow the developments, as I'm merging this work with some work Sébastien Fontenay is doing. Basically, we will add labels and allow datasets with monthly or quarterly data (something Diego José Torres already included in his syntax).

I have written a simple program to bulk download Eurostat data from http://ec.europa.eu/eurostat/data/bulkdownload. The output is a file with the same name as the data set. Flags are erased from the data. Execution could take a while because of the reshape command that is used if you don't specify the wide option (and you shouldn't).
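For readers who want to see what erasing the flags and reshaping involve, here is a rough sketch in Python. It is not the code of the ado. It assumes the usual layout of the bulk download .tsv files as I understand it: the first column holds comma-separated dimension codes, the remaining columns are time periods, a value can be followed by a flag letter, and a colon marks a missing value. The function name parse_eurostat_tsv and the sample values are made up.

```python
def parse_eurostat_tsv(text):
    """Turn wide Eurostat-style TSV text into long rows with the flags removed."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    # The first header cell looks like "unit,geo\time": keep the dimension names.
    dim_names = header[0].split("\\")[0].split(",")
    periods = [period.strip() for period in header[1:]]
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        dims = dict(zip(dim_names, cells[0].split(",")))
        for period, cell in zip(periods, cells[1:]):
            parts = cell.split()              # e.g. ["1.3", "p"] or [":"]
            token = parts[0] if parts else ":"
            value = None if token == ":" else float(token)  # the flag is dropped
            rows.append({**dims, "time": period, "value": value})
    return rows


# Made-up sample in the assumed layout, not real Eurostat figures.
sample = (
    "unit,geo\\time\t2014 \t2013 \n"
    "PC,BE\t1.3 p\t0.2 \n"
    "PC,NL\t: \t-0.5 b\n"
)
rows = parse_eurostat_tsv(sample)
for row in rows:
    print(row)
```

Each wide line becomes one row per time period, which is the long format described under the options.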

Syntax

eurostat namelist [, long wide keep tab excel save clear]

namelist should include one Eurostat data file to be downloaded, unzipped and processed. You should just specify the name, not the .tsv or .gz extension.

Options

keep  saves the original .tsv file in the active folder
long  creates output in the long format (time in rows) - default
wide  creates output in the wide format (time in columns) - when saving, 'wide' is added as a suffix to the filename
tab   saves output in a tab separated text (.txt) file
excel saves output in an Excel (.xlsx) file
save  saves output in a Stata data (.dta) file - default when neither tab nor excel is specified
clear clears data memory before proceeding

Install

Download the following file: eurostat.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder e\ to keep the folder orderly. Stata will now search this directory for programs and have the command ready when you call it.

If you use Windows, you also need to install 7-Zip into the Program Files directory (C:\Program Files\7-Zip\7zG.exe). If you install it elsewhere, the path in the ado needs to be changed; you can do that yourself. Mac users don't need to do anything. A Linux shell call should also be straightforward to add, but it is not in the ado.

After a few weeks of using the program, please send me an email with your remarks so that I can improve the code and iron out bugs. My address is in the ado file.