Wednesday, June 08, 2005

So I’m a sucker for whizzy visualisations

Here’s another: WordCount, an interactive visualisation of word frequency in British English.

The data is from the British National Corpus, built between 1991 and 1994 from samples of written and spoken English. It’s interesting how fast the corpus has aged in the 10 years since then, which saw the rise of the Internet out of the realms of academics and hobbyists and into the general population: look how low internet (30525), email (44758) and browser (51513) rank. Website and webpage don’t appear at all. The World Wide Web was just starting to emerge while the corpus was being compiled, but hadn’t yet hit the public eye much.

Modem (13751) comes in a lot higher than broadband (45214) — a ratio now reversed, if Google is anything to go by. (64.5 million hits for “broadband”; 24.9 million for “modem”, and that includes 2.3 million for “cable modem”.)

WordCount results for “kew”: rank 19914.

In the obligatory vanity search, my reasonably unusual surname, common enough that there are a few in every phonebook but uncommon enough that I usually have to spell it out (“That’s K-E-W.”), comes in higher than most of these internet-related keywords: james (1000), kew (19914). I would guess the usage is inflated a bit by sharing a surname with a district of London, not to mention a prominent botanic garden. (“Yes, Kew, like the Gardens.”)

And a nice bit of data collection: the WordCount people keep count of the queries made and use that tally to generate QueryCount, applying the same visualisation to queries. The results confirm what we already know: given a dictionary, the first thing most people will do is look up naughty words. They also suggest that the second impulse is to look up your own name: forenames rank a lot higher in QueryCount than in WordCount. James, ranked 1000 in WordCount, leaps to 70 in QueryCount; melinda from 41090 to 2437. Every name I tried shows the same order-of-magnitude increase. People are more interested in themselves than the world at large is.
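For anyone who wants to check the "order-of-magnitude" claim, here is a rough sketch of the arithmetic in Python, using only the two pairs of ranks quoted above. The names `ranks` and `rank_jump` are made up for this illustration and have nothing to do with how WordCount or QueryCount work internally.

```python
import math

# Ranks quoted in the post: (WordCount rank, QueryCount rank).
# A smaller rank number means a more frequent word or query.
ranks = {
    "james": (1000, 70),
    "melinda": (41090, 2437),
}


def rank_jump(word_rank, query_rank):
    """Return the factor by which a name's rank improves in QueryCount."""
    return word_rank / query_rank


for name, (word_rank, query_rank) in ranks.items():
    factor = rank_jump(word_rank, query_rank)
    # log10 of the factor is the jump expressed in orders of magnitude.
    print(f"{name}: {factor:.1f}x, about {math.log10(factor):.2f} orders of magnitude")
```

Both names jump by a factor somewhere between 14 and 17, which is a little over one order of magnitude.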


I entered "knowles" into the word count website and it came back with "knowles, detested, bullshit" as the next two most common entries. I feel proud to be in such company and at being just slightly more popular than BS.