Friday, 23 October 2015

Stata float detail

Beware: Stata may be the most comfy and reliable software for STATistical analysis, but it sucks in comparison with your handheld calculator if you want to reproduce even the simplest of figures. Why? Because computers do not see 0.1 as we do (a decimal-to-binary issue), and because there is a trade-off between precision on the one hand and speed or efficient storage on the other (a rounding issue).

Below is an experiment that shows the second issue. Make three variables holding a long number and store them as float, double, and the default type. As the number has more than 7 significant digits, the float variable will be rounded.

clear
set obs 1
gene float  varfloat = 2567987654 
gene double vardouble = 2567987654
gene vardefault         = 2567987654

gene float varfloatchange = 10
replace varfloatchange   = 2567987654
format * %15.0f
list

     +---------------------------------------------------+
     |   varfloat    vardouble   vardefault   varfloat~e |
     |---------------------------------------------------|
  1. | 2567987712   2567987654   2567987712   2567987712 |
     +---------------------------------------------------+

As you can see, the default type was float, and Stata did not store any float variable with the same precision as the input. Instead, we get another figure, which happens to be 2567987712. That is not random: if you rerun the syntax it is always the same. It is the closest value a float can represent. A float has a 24-bit binary significand (roughly 7 decimal digits), so for numbers of this size consecutive floats are 256 apart, and 2567987654 is rounded to the nearest representable value, 2567987712.
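
A quick way to see the rounding without creating any variables is Stata's float() function, which rounds a number to float precision (a small sketch; the last line only changes the default storage type for new variables):

display %15.0f 2567987654          // display works in double precision: 2567987654
display %15.0f float(2567987654)   // rounded to float precision: 2567987712
set type double                    // make double the default storage type for -generate-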


Thursday, 22 October 2015

Stata code for Herfindahl concentration index

Herfindahl concentration index

The Herfindahl concentration index is a score between 0 (total dispersion) and 1 (total concentration), computed as the sum of all squared shares within a unit.

The working example uses hours per job: for each worker and year, we measure how concentrated the worker's hours are across jobs. For instance, a worker with two jobs of 30 and 10 hours has shares 0.75 and 0.25, giving an index of 0.75^2 + 0.25^2 = 0.625.

tempvar t1
egen `t1' = total(hours), by(workerid year)       // total hours per worker-year
tempvar t2
gene `t2' = (hours / `t1')^2                      // squared share of each job
egen herfindahl = total(`t2'), by(workerid year)  // sum of squared shares
drop `t1' `t2'
label var herfindahl "Herfindahl concentration index"

Monday, 21 September 2015

Strongly balanced sample (Stata simulation)

clear
local base = 4      /* Set nr. of observations per unit */
local size = 200    /* Set nr. of units */

*****
local tot = `size' * `base' + `base' - 1
di `tot'
set obs `tot'
gene id = floor(_n / `base')                  // unit identifier 1..`size' (0 for the partial first block)
gene time = `base'*(_n/`base' - id) + 1       // within-unit time 1..`base'
drop if _n < `base'                           // drop the incomplete unit 0

*****
xtset id time
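
As a quick check (just a sketch): -xtset- itself reports whether the panel is strongly balanced, and -xtdescribe- shows the participation pattern, which for a strongly balanced panel is a single pattern like 1111 covering 100 percent of the units.

xtdescribe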

Tuesday, 7 July 2015

Using -reindex- to calculate real trends

We live in a real world, not in a nominal fiction. Hence it is practical and sensible to express amounts of money in real terms. I will show how my Stata user command -reindex- can help you with this.

There is no straightforward way of deflating. When the monetary base expands, inflation may result. But it is also possible that economic activity increases, in which case you can still buy one can of coke with one euro, but as you happen to have two euros, you can buy two.

There are four standard ways of deflating prices:

  1. The GDP deflator is the ratio between nominal GDP and 'real' GDP, the counterfactual valued at the prices of the reference year. Even though it seems a logical way of calculating price increases, there are numerous biases in this measure - for instance, it is unclear how to treat new goods that weren't priced the year before. The standard GDP deflator is based on the goods produced in the economy.
  2. The deflator of consumed goods is exactly the same, but based on the goods that are currently purchased. Importantly, the basket of goods changes every year.
  3. The (national) consumer price index tracks the prices of a fixed basket of goods corresponding to the average household's consumption. Only over the longer term is the basket changed, often when goods have already been out of fashion for a while. Sometimes there are minor adjustments to the weights of goods.
  4. The harmonised consumer price index is the same as above, but with an internationally harmonised basket of goods. Of all deflators, I find this one captures inflation best.
Deflator data may be monthly or quarterly, and may come as an index or as a yearly change. Before continuing, make sure you convert it to the time frequency of the nominal variable you want to deflate. Then merge the deflator onto your data by unit and year; with -reindex-, the unit goes in i() and the time variable in j().

Now use -reindex- to set both the deflator and the nominal variable to the same type. Reindex can work with four types:
  • Levels
  • Index
  • Percentages
  • Factors
Note that levels can only be converted into the trend types, not the other way around. Note also that percentages and factors are actually the same thing, but the percentage change is the factor change minus one.

For percentage trends, we commonly find an expression like the following:

real growth in % = nominal growth in % - inflation in %

This is a rule of thumb that is ok for small numbers. However, the right calculation is in factors:

nominal growth / real growth = inflation,

which in logs would be ln(nominal) - ln(real) = ln(inflation), and since the log of a factor between 0.90 and 1.10 is very close to the percentage change (i.e. the factor minus one), the 'easy' formula above approximately holds.

We can rearrange the factors to obtain:

real = nominal / inflation

The same relation works with indices: set the base for all indices to 100 in one and the same year (the year marked by tobase(), with the scale set by toscale()), except for the deflator (inflation), which you set to 1. The reindex program handles both the base time and the scale.
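
A minimal sketch of this workflow, assuming a panel identified by country and year, a nominal variable nominal_gdp, and a deflator stored as yearly factors (deflator_factor); all variable names and the base year are hypothetical:

* Put the deflator on a base of 1 in the reference year 2010
reindex deflator_factor, generate(deflator_index) from(factor) to(index) tobase(2010) toscale(1) i(country) j(year)

* Deflate: real = nominal / inflation index
gene real_gdp = nominal_gdp / deflator_index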

Wednesday, 10 June 2015

Stata user commands

Here's a list of Stata user commands I have found valuable:

Data manipulation

  • eurostatuse - our Eurostat download plugin (get it through ssc install eurostatuse)
  • usespss and savespss - spss conversion (be careful with variable precision and rounding)
  • gsample - extended random sampling

Data cleaning

  • copydesc - copy labels between variables
  • nmissing and npresent - to check the number of valid and missing cases on a variable
  • isko - conversion of ISCO codes into EGP and SEI
  • kountry - ISO country codes
  • labvars - extended labeling options
  • varlab - extended labeling options

Calculations

  • domin - decomposition of explained variance by regressors
  • nldecompose - non-linear decomposition methods
  • oaxaca - decompositions
  • polychoric - for calculating polychoric correlation matrices and performing polychoric factor analysis

Output creation

  • md - log file parser
  • grc1leg - graph combine with 1 legend
  • profileplot - plots with means on several variables
  • markup - log file parser
  • markdown - log file parser
  • synlight - log file parser
  • weave - syntax parser

I forgot why I found these useful

  • quote - no idea
  • results - no idea
  • txt - no idea

Thursday, 30 April 2015

Markdown

I'm a sucker for simplicity. I found Word too multifunctional, I found LaTeX too hard. Then Markdown appeared, the lightweight markup language used on sites like GitHub and Reddit, in my Trello app, and - commonly - in emails.

Markdown syntax

There are some complications, like hyperlinks, that I won't cover here, in order not to obscure the absolute simplicity of Markdown. This is what you'll use:
# Heading 1
## Heading 2
### Heading 3 (you get the point)
*italic*
**bold**
![Graph alt text](./path.png) 
Nothing more.

Markdown on OS X

One address: MacDown. It is a beautiful program with a split-screen set-up: on the left you have a WYSIWYM editing pane, on the right a live preview. Did I say it is beautiful? It's free.

Markdown on Windows

I have tried a few on Windows, with no luck. There is always some premium function for sale, and I don't like that. WriteMonkey is a good idea, but PDF support is ... not included. There are also a bunch of web apps, most notably Dillinger and StackEdit. They look good, but I like to open a file from Windows Explorer, and then that's just a pain.

In fact, gVim will do all you want, but there's a learning curve, as it doesn't behave like normal text editors (I like Notepad++, for instance, as an in-between solution).

Two programs will do the job, however, and they are statistical packages! Obviously, R is free and Stata is not. You will know that I love Stata - I think of it as the best, albeit expensive, statistics textbook that happens to come with free software.

Markdown in R and Rstudio

RStudio is brilliant software. Use the -knitr- package to enjoy writing Markdown files and exporting to PDF. It's made for that and it's used by millions.

Markdown in Stata

Stata is a different beast. In a way it is less flexible and the markdown integration doesn't feel as native as in R. However, it is also easier.

You output an SMCL log file, then you parse the file with the user command -markdoc- (which converts SMCL to Markdown and other formats using Pandoc). That's all there is to it. Remember to put your Markdown text between comment markers /* and */, each placed on a separate line. Also use -quietly- to reduce output.
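
A minimal sketch of such a do-file (the file name and the export option are illustrative; check the MarkDoc help file for the options of your version):

quietly log using report.smcl, replace smcl

/*
# A tiny report
The text between the comment markers is written in Markdown.
*/

sysuse auto, clear
quietly summarize price
display "Mean price: " %9.2f r(mean)

quietly log close
markdoc report, export(pdf)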

There are a couple of other Stata programs doing a similar job:
  • Weaver: HTML and PDF dynamic report producer
  • Ketchup: HTML and PDF dynamic report producer
  • Synlight: SMCL to HTML convertor and syntax highlighter

Wednesday, 29 April 2015

Weights in Stata

Stata has four kinds of weights; your average statistical software has one. Still, Stata is right, as I will try to explain in even simpler words than elsewhere. Let's ignore the iweight, which is meant for programmers, and focus on the other three:

fweight or frequency weight - is probably the easiest, but the most abused. It says that one observation represents the number of cases indicated by the weight. Imagine you collapse a dataset by gender, region, and educational attainment and then regress education on gender: the count behind each line would be your fweight. Data are commonly stored this way to avoid duplicate lines. It follows that fweights should be integers, because there is no such thing as half a response.

pweight or sampling weight. For instance, 20% of the population are women with an academic degree, but for some reason the sampling only gathered 10%, so you will want to double their weight. Using fweight would overestimate the number of cases and underestimate the variance. In other software, you might rescale the weight so that it sums to the original n, but using pweight is better. Stata leaves you no choice anyway, because fweight does not work with nonintegers.

aweight provides analytical weights. Imagine that your data are collapsed, now including the mean of another variable, say wage. The count still works as in fweight or pweight for the point estimates, but precision increases with higher weights, since a mean based on more cases is estimated more precisely. Both pweight and aweight rescale the weights so that the number of cases is not inflated above the actual sample size, in contrast to fweight.

In sum, all weights return exactly the same coefficients, but different standard errors, depending on the kind of data we're dealing with. One further note of caution: pweights and aweights are nonintegers, so precision matters. I recommend storing such weights in double precision, not float. Also convert data from other formats using the 'double' option.
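
A small sketch illustrating the point with the built-in auto data (the weight here is artificial, purely to show that point estimates coincide while standard errors differ):

sysuse auto, clear
gene int w = 1 + mod(_n, 3)        // artificial integer weights 1, 2, 3
regress price mpg [fweight = w]
regress price mpg [aweight = w]
regress price mpg [pweight = w]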


Tuesday, 3 February 2015

Maximum Likelihood - with examples of regressions of continuous and discrete dependent variables

The idea of Maximum Likelihood is very simple, yet powerful. A regression returns a residual for every case, telling us how far the estimation is off. In Ordinary Least Squares, we minimize the sum of squared residuals. In Maximum Likelihood, we ... maximize the likelihood of the prediction. It sounds like the same thing, and in some cases it is the same thing. But whereas OLS computes the parameters of the model analytically, ML uses a maximization algorithm (generally Newton-Raphson) that tries successive candidate solutions. These are called iterations, and when the model works well, they will 'converge'. If there is no maximum (i.e. the maximand is non-concave), or there are several maxima, you are out of luck.

Continuous dependent variables

Take a simple regression like y = Xb + e, and suppose - as in textbook OLS - that the error term is normally distributed. Then e = y - Xb.

Now let f(e) = (2*pi*sigma^2)^(-1/2) * exp(-e^2 / (2*sigma^2))

In this case f is the normal density function. If the error is 0, the density is highest (about .40 when sigma = 1), so a good estimation will need to have small errors.

The likelihood function we want to maximize is the joint probability density of the sample. This means we compute the density of the error for every case and multiply them all. We may just as well take logs and sum the log densities. When this sum of logs is maximized, we obtain the same set of coefficients b as in OLS. Note that ML and OLS coincide here because the error term is assumed to be normally distributed.
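
A minimal sketch of this in Stata's -ml- command, using the built-in auto data (the evaluator name is made up; this is the textbook linear-form setup):

capture program drop mynormal
program define mynormal
    args lnf xb lnsigma
    // log density of y given the linear predictor and the log of sigma
    quietly replace `lnf' = ln(normalden($ML_y1, `xb', exp(`lnsigma')))
end

sysuse auto, clear
ml model lf mynormal (xb: price = mpg weight) (lnsigma:)
ml maximize
// Coefficients match -regress price mpg weight-; sigma differs slightly
// because ML does not apply the degrees-of-freedom correction.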

Discrete dependent variables

When the outcome variable y_obs is binary (0,1), such a linear model will not work. Instead we estimate a latent function that lies behind the observed outcome. If that latent function results in a number greater than 0, we observe a success (1), and if not, a failure (0). Hence again, we have:

y* = xb + e

Yet in this case we cannot simply compute a residual, because y* is never observed. In the probit case, we take the cumulative normal probability of xb, i.e. the chance that xb + e ends up above 0. It is obvious that this depends on a scaling parameter sigma, the standard deviation of e.

The likelihood contribution of a case is then P(xb) if y_obs = 1, and 1-P(xb) if y_obs = 0. Another way to express this is:

L_i = P(xb)^y_obs * (1-P(xb))^(1-y_obs).

Maximizing the product of all L_i or the sum of the logs yields good estimates of b.
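
The same idea as a minimal -ml- sketch for probit (again on the auto data; the built-in -probit- command gives the same result directly):

capture program drop myprobit
program define myprobit
    args lnf xb
    // log likelihood contribution: ln P(xb) for successes, ln(1 - P(xb)) for failures
    quietly replace `lnf' = ln(normal(`xb'))     if $ML_y1 == 1
    quietly replace `lnf' = ln(1 - normal(`xb')) if $ML_y1 == 0
end

sysuse auto, clear
ml model lf myprobit (foreign = mpg weight)
ml maximize
// Compare with:  probit foreign mpg weight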

Note: an alternative to probit is the logit or logistic regression. Then the function P is not the normal cumulative distribution function, but instead it is the inverse logit:

P(xb) = exp(xb) / (1+ exp(xb) )

The advantage here is that no scaling sigma has to be estimated. It used to be popular in the early days of sociology, before econometrics became dominant in the social sciences.

Links

The last link has a nice picture comparing the normal cumulative distribution function and the inverse logit (or logistic) function. The latter has 'fatter' tails, even though the difference is negligible in practice.

Wednesday, 21 January 2015

Sample selection bias: intuition

I try to explain sample selection bias in the simplest of words so that even I can understand it.

Typically in OLS, we do a regression at the mean. This ensures that the mean of the predictions (the RHS) equals the mean of the observed outcome (the LHS), which is equivalent to saying that the expected error is zero. This is an essential point.

Now suppose that the real error does not have a zero mean, for instance because you are missing a part of the sample, and this part has some unobserved characteristics (error u in the selection model) that correlate with the real error in the model. What will happen then?

Y = Xb + error

Say that those who are not selected drop out of the sample because of the same unobserved characteristics that lead to a lower Y. In that case the error terms of the outcome equation (e) and the selection equation (u) will be positively correlated, and the errors we observe in the selected sample will be above zero on average. b will then also have a positive bias (towards zero if the true effect is negative).

It is possible to control for the sample selection bias by including the Inverse Mills Ratio (IMR or lambda), which captures how unlikely it is to be in the sample. It is the normal density at the predicted selection index divided by the cumulative probability at that index. If you plot this, you will notice that it is a non-linear function, although it has a nearly linear part. As a result, when you predict the selection chance, you need to include more than just the variables in the outcome model (an exclusion restriction) to be able to identify the selection bias. At the same time, you cannot leave variables out of the selection equation if they determine the selection chance; if you do, the estimated IMR will be wrong and you get omitted variable bias (OVB). This is similar to the case of collinearity: leaving out one of two correlated variables will bias the effect of the other.

Just in case you also want to interpret results: the coefficient of the IMR (or lambda), say b_m, is in fact the product of the correlation between the error terms (rho) and a scaling parameter (sigma), which is always positive. If you use the heckman command, rho is shown straight away.
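
A minimal sketch with Stata's built-in -heckman- command (all variable names here are hypothetical: wage is only observed for people who work, and married and children serve as exclusion restrictions that affect selection but not the wage):

* Full information maximum likelihood: reports rho directly (hypothetical variables)
heckman wage educ age, select(works = married children educ age)

* Two-step estimator: makes the estimated IMR (lambda) explicit
heckman wage educ age, select(works = married children educ age) twostep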

Now the tricky question is: why does a good estimate of the IMR solve the sample selection problem? This is a matter of working through Heckman's derivation. I will add this later.

Sunday, 4 January 2015

Stata user command -reindex- for rapid conversion between levels, factor changes, percentage changes and index

This program expresses level variables as changes and allows you to convert between three indicators of trends: factors, percentage changes, and an index. The name of the newly constructed variable must be supplied with generate().

Syntax

reindex varname, generate(name) [i(name)] j(name) from(string) to(string) [fromscale(integer 1)] [toscale(integer 1)] [tobase(integer -999)]

varname is the name of the variable you want to convert.

Options

generate()  Required: give a name for the new variable.
i()         Set the panel id variable if there are panels. You can use just one variable and it needs to be numeric. Pretty much like xtset. If there is no panel id, you can leave it out.
j()         Required: time variable. Intervals must always be the same.
from()      Required: choose between levels, percent, factor, and index
to()        Required: choose between percent, factor, and index. You can't go back to levels.
fromscale() In case your variable was multiplied by some factor (say 100 for percentages or an index).
toscale()   Same for the target variable.
tobase()    Set the base time period (required with the option to(index))

Working example

This command converts GDP growth figures from factors (1.02 for 2 percent growth) to an index with base 2000 = 100.

reindex gdp_growth, generate(gdp_index2000) from(factor) to(index) tobase(2000) toscale(100) i(country) j(year)

Install

Download the following file: reindex.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder r\. Stata will now search this directory for programs and have the command ready when you call it.

Please, after a few weeks of using the program, send me a mail with your remarks, so I can improve the code and iron out bugs. My address is in the ado file.

Stata user command -eurostat- for downloading Eurostat data

Note: feel free to use the program, but you may want to follow up on developments, as I'm merging this work with what Sébastien Fontenay is doing. Basically, we will add labels and allow datasets with monthly or quarterly data (something Diego José Torres already included in his syntax).

I have written a simple program to bulk download Eurostat data from http://ec.europa.eu/eurostat/data/bulkdownload. The output is a file with the same name as the data set. Flags are erased from the data. Execution can take a while because of the reshape command, which is used unless you specify the wide option (and you shouldn't).

Syntax

eurostat namelist [, long wide keep tab excel save clear]

namelist should include one Eurostat data file to be downloaded, unzipped and processed. You should just specify the name, not the .tsv or .gz extension.

Options

keep  saves the original .tsv file in the active folder
long  creates output in the long format (time in rows) - default
wide  creates output in the wide format (time in columns) - when saving 'wide' is added as a suffix to the filename
tab   saves output in a tab separated text (.txt) file
excel saves output in an Excel (.xlsx) file
save  saves output in a Stata data (.dta) file - default when neither tab nor excel is specified
clear clears data memory before proceeding
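
Putting it together, a call could look like this (the dataset name is just an example of a code from the Eurostat bulk download site; any valid code works):

eurostat une_rt_a, long save clear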

Install

Download the following file: eurostat.ado and put it in your personal ado folder (on Windows this is generally C:\ado\personal\). Put it in the subfolder e\ to keep the folder orderly. Stata will now search this directory for programs and have the command ready when you call it.

If you use Windows, you also need to install 7-Zip in the Program Files directory (C:\Program Files\7-Zip\7zG.exe). If you install it elsewhere, the ado needs to be changed - you can do that yourself. Mac users don't need to do anything; a Linux shell command should be straightforward to add, but it is not in the ado.

Please, after a few weeks of using the program, send me a mail with your remarks, so I can improve the code and iron out bugs. My address is in the ado file.