Monday, April 18, 2011

Another data resource

I'm on a real data kick this week. Here's another one that should be useful for people training language classifiers.

lang_samples.tar.gz (1.5G gzipped)
This archive contains language samples from the 2008 static Wikipedia dumps. I downloaded all 261 archives and extracted samples of text from each.
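If you want a quick look at what's inside before unpacking 1.5G, here is a minimal Python sketch that streams through the archive and counts files by top-level path component. The function name `count_files_per_top_dir` is just something I made up for illustration, and the grouping is a guess: I'm assuming the samples sit under some top-level directories, which may or may not map to one directory per language.

```python
import tarfile
from collections import Counter


def count_files_per_top_dir(archive_path):
    """Count regular files under each top-level path component of a .tar.gz.

    The archive is read as a stream, so nothing is extracted to disk.
    Assumption: grouping by the first path component is meaningful for
    this archive; the real layout is not documented here.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, symlinks, and other non-file entries.
            if not member.isfile():
                continue
            # Drop empty and "." components so "./en/a.txt" groups under "en".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            top = parts[0] if parts else ""
            counts[top] += 1
    return counts
```

Calling `count_files_per_top_dir("lang_samples.tar.gz")` returns a `Counter`, so `.most_common(10)` shows the ten largest groups.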

