Friday, December 17, 2010

Google's n-gram viewer - pros and cons

Google just released an n-gram viewing tool for tracking trends in half a trillion words across 400 years of text. Check it out. It's super fast and interesting, kind of like Google Trends on timescale steroids. Even better, they released the data.

Assorted thoughts poached from conversations with friends:
  • Someone should tell Fox about "Christmas"
  • I predict that in the year 165166 AD, fully one hundred percent of the words written in English will be the word "dance." (This will make 500-word APSA applications much more straightforward.)

My take
(In 18 words: For scientists, a small, representative sample usually beats a large, non-random sample. Google has given us the latter.)

The n-gram viewer will be a lot of fun for journalists, who often want a quick graph to illustrate a trend. It's a gold mine for linguists, who want to understand how syntax and semantics have changed over time. For both of these fields, lots of data equals better.

As a bonus for me, maybe this tool will popularize the term n-gram, so I won't have to keep explaining it in conference talks.

I'm not sure about the impact the n-gram viewer will have on the areas where the researchers seem to want to apply it most: culture, sociology, and social science in general. The reason is that in those fields we tend to care a lot about who is speaking, not just how much speech we have on record. This is why the field of sampling theory has been a consistent cornerstone in social science for almost a century. From the press releases, we can see Google's claim that this is 5.2% of all books ever written. But we don't know if it's a random sample.
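To make the sampling point concrete, here is a toy simulation. It is only a sketch under made-up assumptions: a hypothetical population where each person has an "attitude" score between 0 and 1, and where the chance of ending up in the big sample rises with that score (a stand-in for "the people who wrote books"). None of the names or numbers come from Google's data.

```python
import random

def simulate(seed=0, population_size=200_000, small_n=500):
    """Compare a small random sample with a large biased one (toy model)."""
    rng = random.Random(seed)

    # Hypothetical population: one "attitude" score per person, uniform on [0, 1].
    population = [rng.random() for _ in range(population_size)]
    true_mean = sum(population) / population_size

    # Small simple random sample: everyone is equally likely to be picked.
    small = rng.sample(population, small_n)
    small_est = sum(small) / small_n

    # Large non-random sample: a person with score x is included with
    # probability x, so high scorers are over-represented (assumed bias).
    large = [x for x in population if rng.random() < x]
    large_est = sum(large) / len(large)

    return true_mean, small_est, len(large), large_est

true_mean, small_est, large_n, large_est = simulate()
print(f"true mean:              {true_mean:.3f}")
print(f"random sample (n=500):  {small_est:.3f}")
print(f"biased sample (n={large_n}): {large_est:.3f}")
```

In this setup the biased sample is a couple of hundred times larger, yet its estimate should land much further from the true mean, and adding more biased data doesn't help. Whether Google's corpus is biased in any comparable way is exactly what we don't know.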

And in any case, books have never been written by a random sample of people. I suspect that the n-gram viewer will have very little to offer to researchers hoping to study the culture and sociology of, say, blacks in the Jim Crow South, or the early phases of Lenin's revolution. By and large, we already know what cultural elites were saying in those periods, and the rest weren't writing books.

This means that there are two filters in place for any social scientist who wants to pull cultural trends out of these data. First, there's Google's method for sampling books, for which I haven't seen any documentation. (No offense, but this is pretty typical of computer scientists, who think in terms of scale more than sampling.) Second, there's the authorship filter: you have to keep in mind that any trends are derived from written language, produced by whatever subset of the population was literate in a given period.

Example
As a political scientist, I'm interested in conflict. If you go to the site and punch in "war," from 1500 to 2000, you get a graph showing quite a few interesting trends. Here are some stories that I could naively tell from this data. I suspect that many are false.
  • In general, humanity has been far more interested in war since 1750.
  • That said, interest in war has generally declined since then.
  • Interest in wars spikes during wars.
  • Proportionally, wars since 1750 affect people far less than they did previously: the use of the term "war" jumped to 5 times the period baseline during the English Civil War, but to less than double the period baseline during the World Wars.
  • World Wars I and II were the historical high points for human interest in war, but only slightly higher than the sustained interest of the peak period around 1750.
Interesting conjectures, and I can spin a story for each of these. In several cases, I'm pretty sure we're seeing the effects of cultural or linguistic artifacts in the data. Potential confounders include changes in literacy rates, the introduction of public schooling, the American Revolution, improvements in printing technology, linguistic drift, etc.
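For what it's worth, here is one way to pin down a phrase like "5 times the period baseline." This is a sketch, not how I (or Google) actually computed anything: I'm assuming the baseline is the median frequency over the surrounding period with the war years excluded, and the numbers in `toy` are invented for illustration rather than read off the viewer.

```python
from statistics import median

def spike_ratio(freq_by_year, period, event):
    """Peak frequency during `event` divided by the median frequency
    over the rest of `period`. Both ranges are inclusive (start, end) years."""
    p0, p1 = period
    e0, e1 = event
    # Baseline: years inside the period but outside the event window.
    baseline = [f for y, f in freq_by_year.items()
                if p0 <= y <= p1 and not (e0 <= y <= e1)]
    # Event: years inside the event window.
    during = [f for y, f in freq_by_year.items() if e0 <= y <= e1]
    return max(during) / median(baseline)

# Invented relative frequencies of "war" (arbitrary units), NOT real n-gram data.
toy = {1900: 2.0, 1905: 2.0, 1910: 2.0, 1915: 3.5, 1920: 2.0, 1925: 2.0}
ratio = spike_ratio(toy, period=(1900, 1925), event=(1914, 1918))
print(ratio)
```

Note that a ratio like this inherits every confounder listed above: if the mix of books being printed changes between periods, the baseline moves for reasons that have nothing to do with how much people care about war.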

I'm cautiously pessimistic here. Given all the potential confounding variables, it's hard to see how we can sort out what's really going on in the line graphs. Maybe we can do it with fancy statistics. But we're not there yet.

I'll give the last word to BJP:
My instinct suggests that what is talking here is less the folds of culture and more the wrinkles of the record of the technology expressing that culture.
