Wednesday 25 May 2011

Cluster analysis: yeah yeah yeah

Context
For a recent working paper, I did some cluster analyses. Plural, my friend, because there is no single way to do it. Before reading further, please understand that cluster analysis is an exploratory, non-inferential method.

Problem
The Stata manual says it all: "Some researchers claim that there are as many clustering methods as there are researchers. This is untrue, there are many more methods than researchers." The wording may not be exact, but I agree with the statement. In this post, I will address some difficulties.

Issues

  • Linkage method: there are several linkage methods, which define how cases get grouped, i.e. which between-cluster distances are compared. There is no optimal method: single linkage causes chaining (one case after another joining the same cluster), average linkage and Ward's linkage are sensitive to outliers, the distances in Ward's linkage are hard to interpret, and centroid linkage will even refuse to return dendrograms.
  • Similarity or distance measure: there are many distance measures: simple, Euclidean, city block, Mahalanobis, ... Which one to choose? There's no truth.
  • Cases: excluding cases may shift the central values of clusters, so that other cases end up in the wrong cluster.
  • Order: in two-step cluster analysis, some pre-clustering is done before tackling the full data set (because it may be too large). Shuffle your deck and you'll get different results.
  • Variables: ideally you cluster on orthogonal factors. Even then, not every variable or factor is equally important for the grouping in general, yet they all get the same a priori weight in the clustering.
  • Success guaranteed: you'll always get a result. Does that mean you found something? No.
  • Stopping rules: there are many rules to determine the optimal number of clusters (see Milligan and Cooper, 1985). The Calinski-Harabasz rule is the default in Stata. There are some issues though, such as multiple optima or none at all. And does it make sense to choose a 23-cluster solution?
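
To make the first two issues concrete, here is a minimal sketch in plain Python (not Stata, and not the analysis from the working paper). It assumes a naive agglomerative procedure; the function names `agglomerate` and `city_block` and both toy data sets are made up for illustration. The same points are clustered with single and complete linkage, and then with Euclidean and city block distance.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def city_block(p, q):
    """City block (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def agglomerate(points, k, metric=dist, linkage=min):
    """Merge the two closest clusters until k clusters remain.

    linkage=min gives single linkage, linkage=max complete linkage.
    Returns the clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        def gap(pair):
            # Distance between two clusters under the chosen linkage.
            a, b = pair
            return linkage(metric(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical one-dimensional data with slowly growing gaps.
chain = [(0,), (10,), (21,), (33,), (46,), (70,)]
single = agglomerate(chain, 2, linkage=min)
complete = agglomerate(chain, 2, linkage=max)

# Hypothetical two-dimensional data: same linkage, different distance measure.
plane = [(0, 0), (3, 3), (0, -5)]
euclid = agglomerate(plane, 2, metric=dist)
manhattan = agglomerate(plane, 2, metric=city_block)

print(single, complete)
print(euclid, manhattan)
```

On this toy data, single linkage chains the first five points together and leaves the last one alone, while complete linkage splits off the last two. Switching from Euclidean to city block distance changes which pair of points merges first. Neither answer is the true one.
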
Conclusion

Cluster analysis is a wonderful way to reduce and explore data. However, I would recommend experimenting with different methods and keeping your hands off the results if they bear no similarity to each other. Heck, a cluster analysis of outcomes would be useful!
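
One way to check whether two solutions "bear similarity" is to count how often they agree on whether a pair of cases belongs together. The sketch below uses the Rand index for that. This is my choice of measure for illustration, not something the Stata manual prescribes, and the two label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of case pairs on which two cluster solutions agree.

    A pair counts as agreement when both solutions put the two cases
    in the same cluster, or both put them in different clusters.
    Cluster numbering itself does not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both solutions must label the same cases")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster memberships of six cases under two methods.
solution_1 = [0, 0, 0, 0, 0, 1]
solution_2 = [0, 0, 0, 0, 1, 1]

print(round(rand_index(solution_1, solution_2), 3))
```

A value of 1 means the two solutions group the cases identically. The raw index is not corrected for chance agreement, so treat it as a rough screening number only.
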

Judgement: not to be trusted

Monday 2 May 2011

Do people actually read them too?

Nick Cox is the incarnation of Stata. Here is his view on table machines, when asked for more user-friendly Stata tables:


http://www.stata.com/statalist/archive/2010-11/msg00071.html


I think that is a good summary of a widely held view. I have no axe to grind here as I am not a provider in the main territory that Thomas has in mind, but on behalf of fellow user-programmers I suggest that the descriptor "ad hoc" does not quite fit the situation.


Of the programs implied here, and that I know about, I'd say that they all have a clear vision of what they want to do which has been maintained throughout their development. It can seem ad hoc if you want to do something else, but that is a different matter. As I've already remarked in this thread, user-programmers tend to write programs for themselves, with no guarantee of meeting anyone else's needs.


The overall problem here is describable in two words "better tables" and lots of users want to second that. But some want more unified syntax for tables within Stata, some want more detailed control, some want greater support for export to their own favourite foreign programs, standard or otherwise, and some want two or three of those. All understandable enough, but don't complain if this all turns into a [T] manual several hundred pages long to meet not only your reasonable requests, but most other people's too!


Emphasis here varies depending on where you come from. Some people seem routinely to be producing tens or hundreds of tables in rigid formats full of coefficients, standard errors and P-values and those awful stars. Do people actually read them too?


Nick
n.j.cox@durham.ac.uk


P.S. On a key question of intellectual priority, I lay claim to "Some Alternative Software", as indeed could anyone else who came up with it earlier or later. But (with thanks to Maarten for the compliment) the joke about there being so many standards to choose from is certainly not mine. Andrew Tanenbaum got there much earlier.