Thursday, May 5, 2011

Here's my paper for the JITP Future of Computational Social Science conference, coming up in a couple of weeks. The paper describes my process for using SnowCrawl and a highly trained text classifier to search out political web sites -- pretty much all of them -- on the web.
Final census results are available here. I'm planning to run another iteration of this census before too long. I welcome comments and suggestions.
Tuesday, April 19, 2011
Code: a random web crawler
This code crawls the web to generate a pseudo-random sample of web sites. Not my prettiest code, but it works and may save somebody an afternoon of coding.
random_crawler.tar.gz (14MB, zipped)
This script explores the web for a pseudo-random sample of sites. The crawl proceeds in a series of 100 (by default) waves. In each wave, 2,000 (by default) crawlers attempt to download pages. When page downloads are successful, all the unique outbound hyperlinks are stored to a master list. Then sites for the next round are sampled.
Sites from common domain names are undersampled to avoid getting stuck (e.g., within specific social networking sites). When sites are selected for the next round, sampling weights equal x^(3/4), where x is the number of sites in the same domain.
After several waves, the sample should be well-mixed -- a pseudo-random sample of the web.
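For anyone who wants the gist without unpacking the archive, here's a minimal sketch of the wave-and-sample loop in Python. This is not the released script: the seed list, the regex link extraction, and all the names are my shorthand, and I'm reading the x^(3/4) rule as a domain-level weight, so each individual site in a domain with x candidates carries weight x^(-1/4).

import random
import re
import urllib.parse
import urllib.request
from collections import Counter

SEEDS = ["http://www.example.com/"]  # hypothetical seed list
WAVES, CRAWLERS = 100, 2000          # the defaults described above
HREF = re.compile(r'href="(https?://[^"]+)"', re.I)

def fetch_links(url):
    # Download one page and return its unique off-domain links.
    try:
        page = urllib.request.urlopen(url, timeout=10).read()
    except Exception:
        return set()
    html = page.decode("utf-8", "replace")
    host = urllib.parse.urlparse(url).netloc
    return {u for u in HREF.findall(html)
            if urllib.parse.urlparse(u).netloc != host}

def sample_next(pool, k):
    # Weight each site x**-0.25, where x is the number of pooled sites
    # sharing its domain, so a whole domain's weight sums to x**0.75.
    counts = Counter(urllib.parse.urlparse(u).netloc for u in pool)
    weights = [counts[urllib.parse.urlparse(u).netloc] ** -0.25 for u in pool]
    return random.choices(pool, weights=weights, k=min(k, len(pool)))

master = set()
current = list(SEEDS)
for wave in range(WAVES):
    found = set()
    for url in current:  # the real script runs many crawlers at once
        found |= fetch_links(url)
    master |= found
    if not master:
        break
    current = sample_next(list(master), CRAWLERS)

Note that random.choices draws with replacement, which is fine for a sketch but means a popular site can be visited twice in the same wave.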
Note: This code is kind of sloppy, does not scale well, and is poorly commented. Bad code! Bad!
Friday, April 15, 2011
More text archives
Three more data sets (108MB total, gzipped) for training classifiers. All of these files are text-only, crawled during the week of 12/17/2010; it just took me a while to get around to releasing them.
political (115M):
Front pages from ~2,500 political sites. These sites were rated extremely likely to be political (p > .99) in an early version of my census of the political web. I revisited the same sites several months later and downloaded the new front pages to create this data set. They should be appropriate for case-control training of a political classifier (see the sketch after this list).
random (471M):
Front pages from ~8,000 "random" sites. These are a pseudo-representative sample of the web. I ran several web spiders in parallel, recording all outbound links from visited sites. I deliberately undersampled common namespaces, in order to avoid getting trapped in densely-linked social networking sites. The 8,000 sites are a random sample from a list of ~2 million sites generated in a crawl of this type.
porn (43M):
Front pages from ~1,200 pornography sites. I never thought I'd be curating pornography as part of my dissertation, but so much of the web is porn that I've had to build special classifiers to screen it out before sending sites to my undergrad research team for coding. These 1,200 sites were all linked from http://research.vision-options.com/research.php.
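If it helps, here's a minimal sketch of the case-control setup with scikit-learn. The directory layout (one plain-text file per front page) and the file patterns are assumptions on my part, not the actual structure of the archives.

import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def load(pattern):
    # Read every matching file as one document.
    return [open(path, encoding="utf-8", errors="replace").read()
            for path in glob.glob(pattern)]

cases = load("political/*.txt")                      # likely-political fronts
controls = load("random/*.txt") + load("porn/*.txt") # everything else
texts = cases + controls
labels = [1] * len(cases) + [0] * len(controls)

clf = make_pipeline(TfidfVectorizer(sublinear_tf=True, max_features=50000),
                    LogisticRegression(max_iter=1000))
print("5-fold AUC:", cross_val_score(clf, texts, labels,
                                     cv=5, scoring="roc_auc").mean())
clf.fit(texts, labels)  # refit on everything for downstream scoring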
Thursday, April 14, 2011
Data release: 75K political web sites
pol_site_sample.tar.gz (1.2GB, gzipped)
This archive contains results from a front-page-only crawl of ~75,000 political web sites on 4/9/2011. I'm going to use this to train content classifiers. It might be useful to others as well.
These sites span a large portion of the political web, but they are not a representative sample. Each was classified as very likely to be political (p > 0.9) in a crawl performed in August 2010. In other words, these are pages from sites that featured political content 8 months ago. Presumably, most -- but certainly not all -- of the sites still feature political content today.
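For what it's worth, here's a minimal sketch of the kind of p > 0.9 screen described above, assuming a fitted classifier along the lines of the training sketch in the previous post. The pickle file name and the directory layout are hypothetical.

import glob
import pickle

# A fitted pipeline saved after training (file name is hypothetical).
with open("political_clf.pkl", "rb") as f:
    clf = pickle.load(f)

paths = glob.glob("pol_site_sample/*.txt")  # assumed archive layout
texts = [open(p, encoding="utf-8", errors="replace").read() for p in paths]

probs = clf.predict_proba(texts)[:, 1]  # estimated P(political)
keep = [path for path, p in zip(paths, probs) if p > 0.9]
print(len(keep), "of", len(paths), "sites pass the p > 0.9 cut")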