
Monday, April 18, 2011

Another data resource

I'm on a real data kick this week. Here's another one that should be useful for people training language classifiers.

lang_samples.tar.gz (1.5G gzipped)
This archive contains language samples from the 2008 static wikipedia dumps available at http://static.wikipedia.org/downloads/2008-06/. I downloaded all 261 archives, and extracted samples of text from each.
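
For anyone who wants to roll their own from the same dumps, the sketch below shows the general shape of the sampling step. The directory layout, file extension, and sample size are illustrative assumptions -- it is not the exact script behind the archive above.

```python
# Rough sketch: pull plain-text samples from unpacked per-language dumps.
# Assumes dumps/<lang>/ directories of HTML files; paths and sizes are illustrative.
import os
import random
import re

DUMP_ROOT = "dumps"        # hypothetical root of unpacked static dumps
OUT_ROOT = "lang_samples"  # one plain-text sample file per language
FILES_PER_LANG = 1000      # illustrative sample size

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper

os.makedirs(OUT_ROOT, exist_ok=True)
for lang in sorted(os.listdir(DUMP_ROOT)):
    lang_dir = os.path.join(DUMP_ROOT, lang)
    html_files = [os.path.join(dirpath, name)
                  for dirpath, _, names in os.walk(lang_dir)
                  for name in names if name.endswith(".html")]
    sample = random.sample(html_files, min(FILES_PER_LANG, len(html_files)))
    with open(os.path.join(OUT_ROOT, lang + ".txt"), "w", encoding="utf-8") as out:
        for path in sample:
            with open(path, encoding="utf-8", errors="ignore") as page:
                out.write(TAG_RE.sub(" ", page.read()) + "\n")
```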

Friday, December 17, 2010

Google's n-gram viewer - pros and cons

Google just released an n-gram viewing tool for tracking trends in half a trillion words across 400 years of text. Check it out. It's super fast and interesting, kind of like Google trends on timescale steroids. Even better, they released the data.
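
Since the released tables are just (very large) tab-separated text files, you can poke at the counts directly. Here's a rough sketch of tallying yearly counts for a single word; I'm assuming each line starts with the ngram, the year, and a match count, and the file name is purely illustrative.

```python
# Rough sketch: tally yearly match counts for one word from a released 1-gram file.
# Assumes tab-separated lines beginning with ngram, year, match_count (extra
# columns, if any, are ignored); the file name is illustrative.
import csv
from collections import defaultdict

counts = defaultdict(int)
with open("eng-1gram-sample.tsv", encoding="utf-8") as fh:
    for row in csv.reader(fh, delimiter="\t"):
        ngram, year, match_count = row[0], int(row[1]), int(row[2])
        if ngram.lower() == "war":
            counts[year] += match_count

for year in sorted(counts):
    print(year, counts[year])
```

(To get anything like the viewer's curves you'd also want to normalize by the total word count per year, which I believe Google ships alongside the n-gram files.)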

Assorted thoughts poached from conversations with friends:
  • Someone should tell Fox about "Christmas"
  • I predict that in the year 165166 AD, fully one hundred percent of the words written in English will be the word "dance." (This will make 500-word APSA applications much more straightforward.)

My take
(In 18 words: For scientists, a small, representative sample usually beats a large, non-random sample. Google has given us the latter.)

The n-gram viewer will be a lot of fun for journalists, who often want a quick graph to illustrate a trend. It's a gold mine for linguists, who want to understand how syntax and semantics have changed over time. For both of these fields, more data is simply better.

As a bonus for me, maybe this tool will popularize the term n-gram, so I won't have to keep explaining it in conference talks.

I'm not sure what impact the n-gram viewer will have on the areas where the researchers seem to want to apply it most: culture, sociology, and social science in general. The reason is that in those fields we tend to care a lot about who is speaking, not just how much speech we have on record. This is why sampling theory has been a cornerstone of social science for almost a century. The press releases carry Google's claim that this is 5.2% of all books ever written, but we don't know whether it's a random sample.

And in any case, books have never been written by a random sample of people. I suspect that the n-gram viewer will have very little to offer to researchers hoping to study the culture and sociology of, say, blacks in the Jim Crow south, or the early phases of Lenin's revolution. By and large, we already know what cultural elites were saying in those periods, and the rest weren't writing books.

This means that there are two filters in place for any social scientist who wants to pull cultural trends out of these data. First, there's Google's method for sampling books, for which I haven't seen any documentation. (No offense, but this is pretty typical of computer scientists, who think in terms of scale more than sampling.) Second, there's the authorship filter: you have to keep in mind that any trends are derived from written language, produced by whatever subset of the population was literate in a given period.

Example
As a political scientist, I'm interested in conflict. If you go to the site and punch in "war" from 1500 to 2000, you get a graph showing quite a few interesting trends. Here are some stories that I could naively tell from these data. I suspect that many are false.
  • In general, humanity has been far more interested in war since 1750.
  • That said, interest in war has generally declined since then.
  • Interest in wars spikes during wars.
  • Proportionally, wars since 1750 have affected people far less than earlier wars did -- the use of the term "war" jumped to five times the period baseline during the English Civil War, but to less than double the period baseline during the World Wars.
  • World Wars I and II were the historical high points for human interest in war, but only slightly higher than the sustained interest of the peak period around 1750.
Interesting conjectures, and I can spin a story for each of these. In several cases, I'm pretty sure we're seeing the effects of cultural or linguistic artifacts in the data. Potential confounders include: changes in literacy rates, introduction of public schooling, the American revolution, improvements in printing technology, linguistic drift, etc.

I'm cautiously pessimistic here. Given all the potential confounding variables, it's hard to see how we can sort out what's really going on in the line graphs. Maybe we can do it with fancy statistics. But we're not there yet.

I'll give the last word to BJP:
My instinct suggests that what is talking here is less the folds of culture and more the wrinkles of the record of the technology expressing that culture.

Thursday, November 4, 2010

Starting to get some dissertation results...

Apologies for the long delay between posts. Stock excuse: "Dissertation... blah blah blah..."

Actually, I'm starting to get some nifty results from my dissertation. I've spent a long summer writing surveys and software, and in the next few weeks I hope to have something to show for it. Exhibit A: a word cloud for an automated classifier of political content.


Orange words are associated with political content, and blue words are negatively associated. The size of a word denotes the strength of association -- essentially, it corresponds to the absolute value of the word's beta in a logistic regression with "political-ness" as the dependent variable. The layout of the words is done by a computer algorithm to conserve space; it doesn't carry any important information.

I used Wordle for the layout. The classifier runs regularized logistic regression using the scikits.learn package for Python. The training data come from a team of undergraduate research assistants.
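
For the curious, here's a minimal sketch of the training step and of pulling out the per-word betas that drive the word cloud. It uses the modern sklearn namespace rather than the old scikits.learn one, and the training file, features, and settings are illustrative rather than my exact configuration.

```python
# Minimal sketch: regularized logistic regression over word counts, then
# extract per-word betas (size = |beta|, sign = orange vs. blue).
# The training file and settings are illustrative.
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training file: one hand-coded text per row, label in {0, 1}
# for "political-ness".
texts, labels = [], []
with open("coded_posts.csv", encoding="utf-8") as fh:
    for text, label in csv.reader(fh):
        texts.append(text)
        labels.append(int(label))

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(texts)

# L2-regularized logistic regression; C controls regularization strength.
clf = LogisticRegression(C=1.0).fit(X, labels)

betas = clf.coef_[0]
for word, beta in sorted(zip(vec.get_feature_names_out(), betas),
                         key=lambda wb: -abs(wb[1]))[:25]:
    print(f"{word:<20s} {beta:+.3f}")
```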

Thursday, August 5, 2010

Link mishmash

Four links worth a quick visit.

1. "Minimum parking requirements act like a fertility drug for cars." Interesting NYTimes editorial on the consequences of government-mandated parking. I'm interesting in getting the libertarian take on this one. Do minimum parking requirements distort the market for land use by overproducing parking lots and roads, or do they facilitate commerce by lower transaction costs? Hat tip: David Smith.

2. An infographic from Bloomberg Businessweek on negative buzz, doping, and Lance Armstrong's reputation. The interesting thing here is the source: automated buzz tracking services applying sentiment analysis to the blogosphere. This is a technology I plan to use in my dissertation.

3. The myth of a conservative corporate America - A nice infographic on corporate donations to politics. Good use of FEC data.

4. Some good resources for screen scraping

Friday, May 28, 2010

Software for text mining, esp unsupervised document clustering

A friend in the department recently asked me about software for text mining. Among other things, she was looking for programs that do "unsupervised document clustering." I went through my notes and did some web searching and came up with some promising options.

I haven't worked with any of these directly (unsupervised learning is a step removed from the stuff I do), but I figured the results of the search were worth passing on.

One option close to hand is WordStat on the computer in the [UM political science] bullpen. It supports clustering and is pretty easy to use.

Another option is Justin Grimmer's Galileo package. I don't know if he's made this publicly available yet. Last I heard he was trying to patent and maybe market it. Grimmer is one of Gary King's students; he was on the market this last year. One plus to using Grimmer's work is that he's published in polisci journals, so his methods already have good credibility within the field.

A third option: RapidMiner. I haven't used it, but it's free, well-documented, and fits the bill for what she's trying to do.

Like I said, I haven't worked with any of these directly. Anybody have good/bad experiences with this kind of software?
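
For the truly do-it-yourself route, the usual recipe is to turn each document into a TF-IDF vector and then run an off-the-shelf clustering algorithm over the vectors. Here's a minimal sketch in Python with scikit-learn (not one of the packages above); the toy corpus and the number of clusters are obviously illustrative.

```python
# Minimal sketch of unsupervised document clustering: TF-IDF vectors + k-means.
# The toy corpus and k are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the senate passed the budget bill",
    "voters head to the polls next week",
    "the striker scored twice in the final",
    "the goalkeeper saved a late penalty",
]

# Represent each document as a TF-IDF vector over its words.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Group the documents into k clusters; k = 2 for this toy corpus.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, doc in zip(km.labels_, docs):
    print(label, doc)
```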