Wednesday 29 April 2015

Weights in Stata

Stata has four kinds of weights, where your average statistical software has one. Still, Stata is right, as I will try to explain in even simpler words than elsewhere. Let's ignore iweight, which is meant for programmers, and focus on the other three:

fweight or frequency weight - is probably the easiest, but most abused. It says that one observation stands for as many identical cases as the weight indicates. Imagine you collapse a dataset by gender, region, and educational attainment, and then regress education on gender: the count behind each line would be your fweight. Data is commonly stored this way to avoid duplicate lines. It follows that fweights should be integers, because there is no such thing as half a response.
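To make the "one line stands for several cases" idea concrete, here is a minimal sketch in plain Python rather than Stata. The data and the function name are invented for illustration. It checks that reading the count as a frequency weight gives the same mean and variance as physically duplicating the lines.

```python
from statistics import mean, variance

# Hypothetical collapsed data: (years of education, count of identical lines).
collapsed = [(10, 3), (12, 5), (16, 2)]

def fweight_mean_var(rows):
    """Weighted mean and sample variance, reading each weight as a case count."""
    n = sum(w for _, w in rows)                 # number of cases = sum of weights
    m = sum(x * w for x, w in rows) / n
    var = sum(w * (x - m) ** 2 for x, w in rows) / (n - 1)
    return n, m, var

# Duplicating every line by its count rebuilds the uncollapsed dataset.
expanded = [x for x, w in collapsed for _ in range(w)]

n, m, var = fweight_mean_var(collapsed)
print(n, m, var)
print(len(expanded), mean(expanded), variance(expanded))
```

Both printed lines should agree up to floating-point rounding, which is exactly why a frequency weight only makes sense as a whole number of cases.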

pweight or sampling weight - corrects for groups that are over- or underrepresented in the sample. For instance, you need to have 20% of women with an academic degree, but for some reason the sampling only gathered 10%, so you will want to double their weight. Using fweight here would overestimate the number of cases and underestimate the variance. In other software, you might rescale the weight so that it sums to the original n, but using pweight is better. Stata leaves you no choice anyway, because fweight does not accept nonintegers.
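The difference can be sketched for the simplest estimate, a weighted mean. This is an illustration in plain Python under stated assumptions, not a reproduction of Stata's internals. The wages and weights are invented, and the "linearized" formula is the usual robust standard error for a weighted mean, chosen here as a stand-in for what sampling weights call for.

```python
from math import sqrt

def weighted_mean(xs, ws):
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def se_as_frequencies(xs, ws):
    """SE of the mean if the weights are (wrongly) read as case counts."""
    total = sum(ws)
    m = weighted_mean(xs, ws)
    s2 = sum(w * (x - m) ** 2 for x, w in zip(xs, ws)) / (total - 1)
    return sqrt(s2 / total)

def se_linearized(xs, ws):
    """Robust (linearization) SE of the weighted mean; n is the number of lines."""
    n, total = len(xs), sum(ws)
    m = weighted_mean(xs, ws)
    return sqrt(n / (n - 1) * sum((w * (x - m)) ** 2 for x, w in zip(xs, ws))) / total

# Hypothetical sample of 10: the first respondent belongs to the
# underrepresented group and gets double weight; weights sum to 10.
xs = [30, 18, 20, 22, 19, 25, 21, 17, 23, 24]
ws = [2.0] + [8 / 9] * 9
scaled = [w * 100 for w in ws]   # same weights on a "population" scale

print(weighted_mean(xs, ws), weighted_mean(xs, scaled))
print(se_linearized(xs, ws), se_linearized(xs, scaled))
print(se_as_frequencies(xs, ws), se_as_frequencies(xs, scaled))
```

The point estimate and the robust standard error do not care about the scale of the weights. The frequency-style standard error shrinks as the weights grow, because it believes there are more cases than were actually interviewed.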

aweight or analytic weight. Imagine that your data is collapsed and each line holds the mean of another variable, say wage. In that case the count still works as in fweight or pweight for point estimates, but precision increases with higher weights: a mean is estimated more precisely the more cases it is based on. Neither pweight nor aweight inflates the number of cases above the number of lines in the data, in contrast to fweight: aweights are rescaled to sum to that number, and pweight results do not depend on the scale of the weights.
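Rescaling weights so that they sum to the number of lines is one way to keep the case count from inflating. A minimal sketch in plain Python, with invented cells and hypothetical function names:

```python
# Hypothetical collapsed cells: (mean wage in the cell, number of cases behind it).
cells = [(2100.0, 40), (2500.0, 10), (3200.0, 50)]

def rescale_to_row_count(weights):
    """Scale weights so they sum to the number of lines, keeping their ratios."""
    n_rows = len(weights)
    total = sum(weights)
    return [w * n_rows / total for w in weights]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

means = [m for m, _ in cells]
counts = [c for _, c in cells]
rescaled = rescale_to_row_count(counts)

# The point estimate is the same with raw counts or rescaled weights.
raw_estimate = weighted_mean(means, counts)
rescaled_estimate = weighted_mean(means, rescaled)

# A cell mean built from c cases has variance sigma^2 / c, so larger cells
# are more precise; shown here as variance relative to sigma^2.
relative_variance = [1 / c for c in counts]

print(rescaled, sum(rescaled))
print(raw_estimate, rescaled_estimate)
print(relative_variance)
```

The rescaled weights keep their ratios, so the estimate does not move, while their sum equals the three lines actually present rather than the 100 cases behind them.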

In sum, all three weights return exactly the same coefficients, but different standard errors, depending on the kind of data we're dealing with. One further note of caution: pweights and aweights are usually nonintegers, so precision is very important. I recommend storing such weights at double precision, not float. Also use the 'double' option when converting data from other formats.
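How much a 4-byte float loses on a noninteger weight can be sketched in plain Python, whose own numbers are 8-byte doubles. The weight value and the row count below are invented for illustration.

```python
import struct

def as_float32(x):
    """Round-trip a value through 4-byte storage, as a float variable would hold it."""
    return struct.unpack("f", struct.pack("f", x))[0]

weight = 8 / 9                       # hypothetical noninteger weight, held as a double
stored = as_float32(weight)          # the same weight after being stored as a float
error_per_row = abs(stored - weight)

# The rounding error is tiny per line but adds up in the sum of weights.
rows = 1_000_000
drift = abs(stored * rows - weight * rows)

print(weight, stored)
print(error_per_row, drift)
```

A value such as 0.5 survives the round trip unchanged, but most noninteger weights do not, and the error grows with the number of lines.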
