Thursday, April 14, 2011

Data release: 75K political web sites

Data release:
pol_site_sample.tar.gz (1.2GB, gzipped)
This archive contains results from a front-page-only crawl of ~75,000 political web sites, performed on April 9, 2011.

These sites span a large portion of the political web, but they are not a representative sample. Each was classified as very likely to be political (p > 0.9) in a crawl performed in August 2010. In other words, these are pages from sites that featured political content 8 months ago. Presumably, most -- but certainly not all -- of the sites still feature political content today.
I'm going to use this to train content classifiers. It might be useful to others as well.
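
If you want to poke at the data without unpacking all 1.2GB first, something like the following might work. It's only a rough sketch: it makes no assumptions about the directory layout or file formats inside the archive, and `iter_pages` is a hypothetical helper name, not something shipped with the release. It streams through the gzipped tar with Python's standard `tarfile` module and yields each regular file's name and raw bytes.

```python
import tarfile


def iter_pages(archive_path, max_members=None):
    """Yield (member_name, raw_bytes) for each regular file in a gzipped tar.

    Hypothetical helper: nothing here depends on how the archive is
    organized internally, only on it being a .tar.gz file.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        count = 0
        for member in tar:
            # Skip directories, symlinks, and other non-file entries.
            if not member.isfile():
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            yield member.name, fileobj.read()
            count += 1
            # Optionally stop early, e.g. to peek at the first few entries.
            if max_members is not None and count >= max_members:
                break
```

Calling `iter_pages("pol_site_sample.tar.gz", max_members=5)` and printing each name with the length of its bytes should be enough to see how the files are organized before committing to a full pass.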
