Friday 21 September 2012

Labour Stats goes through a linguistic crisis

I thought about changing the name of this blog to 'Labor Stats'. Decent American spelling. It is said to be more logical. But just how logical do languages have to be? Does logic imply internal consistency, respecting etymology or rather streamlining common practice? Then which spelling is preferable: Merriam-Webster, Cambridge Dictionary, Oxford English Dictionary, BBC practice, MS Worst?

Having studied classical languages, I first considered etymology and the infamous -ising/-izing debate. The Greek ending is -idzein, Latin -izare. So why make it French when it isn't? Even old English uses the -izing form, although mixed with the s-spelling. There are other similar cases. Color is Latin too; it means colour. Why not simply write it the way it has always been (like the Americans do)? Merriam-Webster is on your side.

Second: internal consistency. Humour leads to humorous. We could spell it humor right away. It is more logical and easier for foreigners. In particular I like the use of the suffix -ize for everything you 'make'. That keeps you from writing analize, unless you have perverse motives. It is analyse or analyze. There's a preference for the former, because it comes from analysis-ize, largely omitting the suffix. Common British spelling (en-GB) thinks about it differently.

Etymology and consistency. Jolly good, but we must not forget that language is an independent cultural object. Even strict grammar and spelling rules cannot enter people's living rooms and decide which dialect they speak or which words they take up. For some period of time, French influenced English. It changed words English already had, and introduced words that had a history before they were French. These words had a history before they were Latin or Greek too. It is a cultural bias to assume civilization started in 1000 BC. Maybe writing did, but then again this must have been more evolutionary than revolutionary. If you study classical languages, you quickly learn a new vocabulary for every Greek poet you read. Don't try to find the original writings, because you will not even understand most signs.

This is exactly why I keep Labour in Labour Stats. It looks familiar, it has history and it's not that far from logical. If English evolved towards 'labor', I would not mind at all, but for the time being labour ends in -our, error in -or and labourer in -er. It has been different before and it will be in the future, but the changes are minor and reflect the history of a word.

Then again, when the European Union tries to -ise English, primarily to please the French, I will be the first to oppose. For me, the Oxford English Dictionary is a far more important authority, one that nicely balances logic, etymology and consistency. It interacts with language as it is used, without standardizing fads. Here is someone who agrees.

P.S. Don't try to make Microsoft Office talk Oxford English. You just bumped into the limits of closed source software.

Tuesday 11 September 2012

Making the Pay Gap look good (Tableau)

Here's a nice illustration of the capabilities of Tableau Software. Quite useful for the stakeholders in some projects I did.
http://www.tableausoftware.com/public/gallery/making-pay-gap-look-good



Now everybody's impressed. But actually Tableau does not help you that much. It is a nice tool, but the quintessential thing about making graphs and tables is to convey a message. These Tableau tables just provide a way to explore, yet exploring is what researchers are paid for. Using the interactive Tableau charts, you basically provide a fun tool to viewers. I agree there is a lot of demand for useless stuff too.

I liked the Pay Gap visualization anyway, but let's look at far worse graphs:
http://www.tableausoftware.com/public/gallery/best-dutch-restaurants --> Is this a restaurant guide or a statistical tool? Main conclusion is that fancy restaurants are more likely to be found near the coast and the capital. Oh no...

Or this graph, which has wonderful colours, except you don't know what they stand for. Also, they do linear regression by eye, suggesting a discrete point where effects kick in. Could be an exponential effect though. http://www.tableausoftware.com/public/gallery/game-reviews
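To make the "could be exponential" point concrete, here is a minimal sketch. It does not use the game review data behind the Tableau chart; the numbers are synthetic and the helper names (`linear_fit`, `r_squared`) are my own. It fits a straight line to y and a straight line to log(y), then compares how well each explains the same points.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def r_squared(ys, fitted):
    """Share of the variance in ys explained by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# SYNTHETIC data (an assumption for illustration): pure exponential growth.
xs = list(range(1, 11))
ys = [2.0 * math.exp(0.4 * x) for x in xs]

# Model 1: straight line through the raw values.
a_lin, b_lin = linear_fit(xs, ys)
fit_lin = [a_lin + b_lin * x for x in xs]

# Model 2: straight line through log(y), i.e. an exponential curve in y.
a_log, b_log = linear_fit(xs, [math.log(y) for y in ys])
fit_exp = [math.exp(a_log + b_log * x) for x in xs]

r2_lin = r_squared(ys, fit_lin)
r2_exp = r_squared(ys, fit_exp)
print(f"R2 linear: {r2_lin:.3f}, R2 exponential: {r2_exp:.3f}")
```

On data like this the straight line still looks "okay" at a glance, which is exactly why eyeballing a kink in the chart is risky; the log fit shows there is no discrete switch point at all.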

Yet, let us finish on a good note. While being easy to make with any statistical program, this graph and the interactive menus confront us with some astonishing insight. Very appropriate point of view.
http://www.tableausoftware.com/public/gallery/great-divide


Thursday 6 September 2012

Poor West Flemings

A while ago a journalist asked me to explain a striking statistic. The Structure of Earnings Survey showed that West Flemings earn less than most Walloons (!). I can't immediately find the article she relied on, but something similar can be found here.

After some explanation about labour mobility and the methodology (the SES records wages by place of work and only for firms with more than ten employees), I concluded that the wealth in West Flanders probably does not sit with the employees.

This holiday I then met a poor American from Montana. So I thought I would compare the regional GDP of Montana with that of West Flanders. Montana, with about 900,000 inhabitants, reaches 36 billion USD. West Flanders, with 1,160,000 inhabitants, reaches 34 billion EUR (in 2009). Close match, also per capita. But this Eurostat report also shows right away that the regional GDP of West Flanders is not at all lower than that of East Flanders and Limburg. Antwerp, Flemish Brabant and Walloon Brabant do reach greater heights, but those West Flemings are not that poor after all. I think this is the proof of my famous quote.
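A quick back-of-the-envelope check of that per capita comparison, as a sketch. The GDP and population figures are the ones quoted above; the exchange rate is my own assumption (a rough 2009 average of about 1.39 USD per EUR) and is not taken from the post or from the Eurostat report.

```python
# Figures as quoted in the post (2009).
MONTANA_GDP_USD = 36e9
MONTANA_POP = 900_000
WEST_FLANDERS_GDP_EUR = 34e9
WEST_FLANDERS_POP = 1_160_000

# ASSUMPTION: rough 2009 average exchange rate, not from the post.
USD_PER_EUR = 1.39

def per_capita(gdp, population):
    """GDP per inhabitant, in the currency of the GDP figure."""
    return gdp / population

montana_pc_usd = per_capita(MONTANA_GDP_USD, MONTANA_POP)
west_flanders_pc_eur = per_capita(WEST_FLANDERS_GDP_EUR, WEST_FLANDERS_POP)

# Convert to a common currency before comparing.
west_flanders_pc_usd = west_flanders_pc_eur * USD_PER_EUR
ratio = west_flanders_pc_usd / montana_pc_usd

print(f"Montana:       {montana_pc_usd:,.0f} USD per capita")
print(f"West Flanders: {west_flanders_pc_eur:,.0f} EUR per capita")
print(f"West Flanders: {west_flanders_pc_usd:,.0f} USD per capita (assumed rate)")
print(f"Ratio West Flanders / Montana: {ratio:.2f}")
```

The match only holds after the currency conversion, so the conclusion depends on the assumed rate; with a noticeably different rate the two regions would drift apart.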

Even more graphics ... Excel to the rescue?

After a while, you get so bored with Excel that you eventually forget it is capable of doing some good things too. That is, with the right plug-ins.

The must-have plug-in is Daniel's XL Toolbox. It has some statistical options that may save you some bucks on the basic SPSS package. I particularly like the option to make charts publication-ready.

Not really a plug-in, but simply stunning, is Juice Labs' Chart Chooser: an online gallery of templates you can download to get started with better-than-average graphs. The guys made Chart Cleaner in the past, which improves your charts, BUT it forces you to enable macros.

The people at Juice Labs have some more goods in store, amongst which I am simply stunned by the Excel Geocoding Tool. It lets you take a field describing an address or a town (don't try regions or countries) and, with one press of a button, obtain the longitude and the latitude. There is even a link to Google Maps. What's the deal? Well, I'm thinking of the Stata tmap user command. This plots borders on your graph, and with the geocoding tool you can overlay the names of regional capitals, for instance, or add values at exactly that spot. I realize the tmap geographic projection is not the best in the world, but at least this procedure is logical, and I like that.

Wednesday 5 September 2012

More graphics ... JavaScript and HTML

As I see it, three teams compete for the King of Charts title:
  • Illustrators
  • Programmers
  • Statisticians
However powerful our statistical languages, we're still far from the code that drives the Internet. A few websites of our online friends provide JavaScript code for very neat graphical output.
  • Google Chart Tools: you must love the investment Google makes in products that won't bring in money in the short run. Very good graphics, intuitive syntax. There is even this playground.
  • FusionCharts: it's free for schools and it does charts... but kind of an overload to me
  • Highcharts: didn't really bother, yet the graphs look good
All in all, I believe these programmer tools are pretty useless for now. For instance, I managed to make a geochart with the Google syntax, but it's not possible to display city names unless you hover over the dots. Nice online, but not what we'd print out. Too bad, as Google's simplicity and respect for the ISO country codes is what we need.

Cornell on statistical packages and Ethan Fosse's blog

This guy, Ethan Fosse, is a sociologist who uses R. Instead of Stata. Picture this. He does know a lot about graphics, but I particularly liked this corner of the blog where he points to unsolved questions in sociology. Rising inequality in any field is one such question that intrigues me too. Interesting readings. Found it through this R blog (Revolutions), when I was actually looking for sociologists using Stata. And yes, it's what people at Cornell recommend to learn first. I quote:


General Purpose Software

Quantitative data analysis in sociology is dominated by three all-purpose programs, Stata, SPSS, and SAS. All three are available on Cornell's Athena computer cluster, and all three are excellent.
We typically recommend that graduate students learn Stata first. SPSS has some attractive features, but it has been losing ground to the other two for the past 15 years. Although SAS is the most comprehensive, it has a relatively inefficient programming language. Stata is almost as comprehensive as SAS, and it has a much more efficient programming language. And, if you need one of the special routines that SAS offers but Stata does not, you will often end up using a more specialized computer program anyway because even SAS is not quite as good as the specialized software (see below). Nonetheless, SAS is particularly well-suited for very large datasets, and SAS is also the dominant package for the federal government. If you work with many types of government data, you will find that the best supporting documentation is written for SAS users.
UCLA statistical computing has the best (we think) on-line set of resources for these three programs. See http://www.ats.ucla.edu/stat/. Also, Cornell's CISER offers tutorials for all three programs. See http://www.ciser.cornell.edu/ASPs/workshops.aspx.



Infographics

To us researchers, two opposing demands always pop up:

  • Be precise
  • Make science look fun
In order to lean a bit more towards the latter, people created infographics and infographics tools. These are the kind of graphs you see in newspapers or on posters. You may get close to the chart itself if you're really handy with Stata, but it takes way too much time for the simple chart you get. There are many tools that may do that easier and better. There are also many more artistic contributions you'll read about on the cool infographics blog. And be sure to check out this Swiss website on data visualizations.

The drawback: which tool will do what you want and exactly what you want? Basically, you need to be able to customize and add data. That brings you terribly close to Stata, doesn't it. Yet sometimes, these tools will really make it happen:
  • Tableau: it's the more sophisticated package, available in free ("public"), paid desktop and server versions. You get it. Consultancy stuff. Basically it's all about two axes and a third dimension expressed in colours or size of dots. Also has annoying interactive dashboards we do not need.
  • Many eyes: it's IBM software, so it should eventually find its way into SPSS. Nice charts, hard way to get there though. Worth a try and looks professional.
  • Infogr.am: I haven't really figured it out. It makes nice graphs, most importantly including the matrix charts I love (coloured little people representing counts). It's beta and buggy at the moment, and quite inflexible. Like how on earth do you change the shapes in the matrix chart? I do not know.

  • Piktochart: this is only the layout part: nice fonts, few charts really.

Word clouds and social media
  • wordle: looks very good
  • visual.ly: the sole purpose of this website is to show off your work. Boring. Oh, there are a few tools to combine with your Facebook or Twitter website. Fancy stuff. Far from what we do.

Network visualizations
  • Gephi: network visualizations. I don't do that. Much like Graphviz, but I suppose more intuitive.

(I got the overview here, here and mostly here, still plenty of tools I haven't tried yet)

Olympic Games of graphics - Junk Chart

This is one hell of a good graphical overview, on the NYT website. Just take a look and be amazed.

http://www.nytimes.com/interactive/2012/08/05/sports/olympics/the-100-meter-dash-one-race-every-medalist-ever.html

They invested a lot of money, it seems, as here you have the same engine applied to swimming.



And that's just an introduction to a wonderful blog about charts, called Junk Charts. The discussion of the Olympic charts is found here. And this is a nice appreciation of Kenworthy's work.