I try to explain sample selection bias in the simplest of words so that even I can understand it.
Typically in OLS, we run a regression at the mean. This ensures that the mean of the predicted values (the RHS) equals the mean of the observed outcome (the LHS), which is equivalent to saying that the expected error is zero. This is an essential point.
Now suppose that the real error does not have a zero mean, for instance because you are missing a part of the sample, and this part has some unobserved characteristics (error u in the selection model) that correlate with the real error in the model. What will happen then?
Y = Xb + error
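Say that those who are not selected are out of the sample because of the same characteristics that lead to a lower Y. In that case the error e in the selected sample is on average larger than the real error e* (e > e*), and the error terms e* and u will be positively correlated. B will be biased as well. The sign of that bias depends on how X affects the chance of selection; if selection works through Y itself (low values of Y drop out), the slope is pulled towards zero.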
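To make this concrete, here is a minimal simulation sketch in Python (standard library only). It assumes the simplest possible selection rule, namely that an observation is only kept when Y is above a cutoff. The coefficients, the cutoff and the function names are made up for illustration.

```python
import random

def ols_slope(xs, ys):
    """Slope of a simple OLS regression of ys on xs (with intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate(n=20000, b=2.0, cutoff=1.0, seed=1):
    """Draw Y = 1 + b*X + e, then keep only observations with Y > cutoff."""
    rng = random.Random(seed)
    full = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        e = rng.gauss(0, 1)  # the real error, mean zero in the full sample
        full.append((x, 1.0 + b * x + e, e))
    selected = [row for row in full if row[1] > cutoff]
    return {
        "slope_full": ols_slope([r[0] for r in full], [r[1] for r in full]),
        "slope_selected": ols_slope([r[0] for r in selected], [r[1] for r in selected]),
        "mean_error_full": sum(r[2] for r in full) / len(full),
        "mean_error_selected": sum(r[2] for r in selected) / len(selected),
    }

result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In the full sample the mean error is close to zero and the slope is close to the true b. In the selected sample the mean error is clearly positive and the slope is smaller than b.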
It is possible to control for the sample selection bias by including the inverse Mills ratio (IMR or lambda), which measures how unlikely an observation was to be in the sample. It is the standard normal density at the predicted selection index divided by the cumulative probability at that index, that is, the predicted selection chance. If you plot this, you will notice that it is a non-linear function. However, it has a near-linear part. Over that part the IMR is almost a linear combination of the variables that predict selection, so if these are the same as the variables in the model, the IMR is nearly collinear with them. As a result, when you predict the selection chance, you need to include more than just the variables in the model to be able to identify the selection bias. However, you cannot leave out the variables from the model if they determine the selection chance. If you do, the estimated IMR will be wrong and you get omitted variable bias (OVB). This is similar to leaving out one of two correlated variables, which biases the effect of the other.
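The shape of the IMR is easy to check yourself. The sketch below assumes a probit selection model, so the selection index is on the standard normal scale; the function name is my own.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def inverse_mills_ratio(index):
    """Density at the selection index divided by the selection probability."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)

for index in (-3, -2, -1, 0, 1, 2, 3):
    prob = STD_NORMAL.cdf(index)
    print(f"index={index:+d}  P(selected)={prob:.3f}  IMR={inverse_mills_ratio(index):.3f}")
```

Where selection is unlikely (negative index) the ratio falls almost in a straight line. Where selection is nearly certain it flattens out towards zero.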
Just in case you also want to interpret the results: the coefficient of the IMR (or lambda), say b_m, is in fact the product of the correlation between the error terms (rho) and a scaling parameter (sigma), the standard deviation of the error in the outcome model, which is always positive. The sign of b_m therefore tells you the sign of rho. If you use the heckman command, rho is shown straight away.
Now the tricky question is: why does a good estimate of the IMR solve the sample selection problem? Answering this requires working through Heckman's derivation. I will add this later.
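A small simulation can illustrate that the IMR coefficient picks up rho times sigma. This is only a sketch under strong assumptions: it uses the true selection index instead of estimating it with a probit first (which a real two-step procedure would do), all parameter values are invented, and it is not how the heckman command is implemented.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def imr(index):
    """Inverse Mills ratio at a probit selection index."""
    return STD_NORMAL.pdf(index) / STD_NORMAL.cdf(index)

def ols(rows, ys):
    """OLS coefficients; every row of regressors must include the constant."""
    k = len(rows[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def simulate(n=40000, b=2.0, rho=0.6, sigma=2.0, seed=7):
    rng = random.Random(seed)
    naive_rows, corrected_rows, ys = [], [], []
    for _ in range(n):
        x, z = rng.gauss(0, 1), rng.gauss(0, 1)
        u = rng.gauss(0, 1)  # error in the selection model
        w = rng.gauss(0, 1)
        # Outcome error with standard deviation sigma and corr(e, u) = rho.
        e = sigma * (rho * u + (1 - rho ** 2) ** 0.5 * w)
        index = 0.5 * x + 1.0 * z  # z affects selection only, not Y
        if index + u > 0:  # the observation is selected
            ys.append(1.0 + b * x + e)
            naive_rows.append([1.0, x])
            corrected_rows.append([1.0, x, imr(index)])
    return ols(naive_rows, ys), ols(corrected_rows, ys)

naive, corrected = simulate()
print("slope without IMR:", round(naive[1], 3))
print("slope with IMR:   ", round(corrected[1], 3))
print("IMR coefficient:  ", round(corrected[2], 3), "(rho * sigma = 1.2)")
```

Without the IMR the slope on x is off. With the IMR included it is close to the true value of 2, and the IMR coefficient is close to rho times sigma.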
Wednesday, 21 January 2015
Sunday, 4 January 2015
Stata user command -reindex- for rapid conversion between levels, factor changes, percentage changes and index
This program expresses level variables as changes and lets you convert between three indicators of trends: factors, percentage changes and an index. You have to give a name for the newly constructed variable.
Please, after a few weeks of using the program, send me an email with your remarks so that I can improve the code and iron out bugs. My address is in the ado file.
Syntax
reindex varname, generate(name) [i(name)] j(name) from(string) to(string) [fromscale(integer 1)] [toscale(integer 1)] [tobase(integer -999)]
varname is the name of the variable you want to manipulate
Options
generate() Required: give a name for the new variable.
i() Set the panel id variable if there are panels. You can use just one variable and it needs to be numeric. Pretty much like xtset. If there is no panel id, you can leave it out.
j() Required: time variable. Intervals must always be the same.
from() Required: choose between levels, percent, factor, and index
to() Required: choose between percent, factor, and index. You can't go back to levels.
fromscale() In case your variable was multiplied by some factor (say 100 for percentages or an index).
toscale() Same for the target variable.
tobase() Set the base time period (required with the option to(index))
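For readers who want to see the arithmetic behind these options, here is a rough Python sketch of the conversions between levels, factors and percentages. It is not the code of the ado. The function names are hypothetical, and the way fromscale and toscale are applied (divide the input by fromscale, multiply the output by toscale) is my reading of the option descriptions above.

```python
def levels_to_factors(levels):
    """Growth factor per period: level_t / level_(t-1); the first period has none."""
    return [None] + [cur / prev for prev, cur in zip(levels, levels[1:])]

def factor_to_percent(factor, fromscale=1, toscale=1):
    """1.02 becomes 0.02 with toscale=1, or 2.0 with toscale=100."""
    return (factor / fromscale - 1) * toscale

def percent_to_factor(percent, fromscale=1, toscale=1):
    """2.0 with fromscale=100 becomes 1.02."""
    return (percent / fromscale + 1) * toscale

gdp = [100.0, 102.0, 107.1]
factors = levels_to_factors(gdp)
print(factors)
print([None if f is None else factor_to_percent(f, toscale=100) for f in factors])
```

A factor cannot be turned back into a level without knowing the starting level, which is why to() does not offer levels.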
Working example
This command will convert GDP growth figures from factors (1.02 for 2 percent growth) to an index with base 2000 = 100.
reindex gdp_growth, generate(gdp_index2000) i(country) j(year) from(factor) to(index) toscale(100) tobase(2000)
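What such a conversion does can be sketched in a few lines of Python. This is an illustration for a single series without panels, assuming the factor of a year is the level of that year divided by the level of the year before. The function name and the growth figures are invented, and the ado may handle things differently internally.

```python
def factors_to_index(factors, base, scale=100):
    """factors maps year -> level_t / level_(t-1); the result has index[base] == scale."""
    periods = sorted(factors)
    if base not in factors:
        raise ValueError("base period is not in the data")
    pos = periods.index(base)
    index = {base: float(scale)}
    # Forward from the base: multiply by each later year's own factor.
    for i in range(pos + 1, len(periods)):
        index[periods[i]] = index[periods[i - 1]] * factors[periods[i]]
    # Backward from the base: divide by the factor that led into the later year.
    for i in range(pos, 0, -1):
        index[periods[i - 1]] = index[periods[i]] / factors[periods[i]]
    return dict(sorted(index.items()))

growth = {1999: 1.01, 2000: 1.25, 2001: 1.10, 2002: 1.05}
print(factors_to_index(growth, base=2000))
```

With panels, the same calculation would be repeated for every value of the i() variable.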
Install
Download the following file: reindex.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder r\. Stata will now search this directory for programs and have the command ready when you call it.
Stata user command -eurostat- for downloading Eurostat data
Note: feel free to use the program but you may want to follow up on the developments as I'm merging the job with some work Sébastien Fontenay is doing. Basically, we will add labels and allow datasets with monthly or quarterly data (something Diego José Torres already included in his syntax).
I have written a simple program to bulk download Eurostat data from http://ec.europa.eu/eurostat/data/bulkdownload. The output is a file with the same name as the data set. Flags are erased from the data. Execution could take a while because of the reshape command that is used if you don't specify the wide option (and you shouldn't).
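To give an idea of what erasing the flags involves, here is a small Python sketch. It assumes that a cell in the bulk download file holds a number that may be followed by a space and one or more flag letters (for example "12.3 p"), and that a colon marks a missing value. The function name is hypothetical and this is not the code in the ado.

```python
def strip_flag(cell):
    """Return the numeric value of a cell such as '12.3 p', or None if it is missing."""
    parts = cell.split()
    if not parts or parts[0] == ":":
        return None  # empty cell or ':' (optionally followed by a flag)
    return float(parts[0])

for raw in ("12.3 p", "7", ": ", ": c"):
    print(repr(raw), "->", strip_flag(raw))
```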
If you use Windows, you also need to install 7-zip into the program files directory (C:\Program Files\7-Zip\7zG.exe). If you install it elsewhere, the path in the ado needs to be changed, which you can do yourself. Mac users don't need to do anything. A Linux shell should also be straightforward to add, but it is not in the ado.
Please, after a few weeks of using the program, send me an email with your remarks so that I can improve the code and iron out bugs. My address is in the ado file.
Syntax
eurostat namelist [, long wide keep tab excel save clear]
namelist should include one Eurostat data file to be downloaded, unzipped and processed. You should just specify the name, not the .tsv or .gz extension.
Options
keep saves the original .tsv file in the active folder
long creates output in the long format (time in rows) - default
wide creates output in the wide format (time in columns) - when saving, 'wide' is added as a suffix to the filename
tab saves output in a tab separated text (.txt) file
excel saves output in an Excel (.xlsx) file
save saves output in a Stata data (.dta) file - default when neither tab nor excel is specified
clear clears data memory before proceeding
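The difference between the long and wide options can be shown with a toy table. The Python sketch below only illustrates the two layouts; the column names and numbers are invented and it has nothing to do with how the ado calls reshape.

```python
def wide_to_long(header, rows):
    """Turn one row per unit (time in columns) into one row per unit and period."""
    id_name, periods = header[0], header[1:]
    long_rows = []
    for row in rows:
        for period, value in zip(periods, row[1:]):
            long_rows.append({id_name: row[0], "time": period, "value": value})
    return long_rows

header = ["geo", "2013", "2014"]
wide = [["BE", 1.0, 2.0], ["NL", 3.0, 4.0]]
for record in wide_to_long(header, wide):
    print(record)
```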
Install
Download the following file: eurostat.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder e\ to keep the folder orderly. Stata will now search this directory for programs and have the command ready when you call it.