random_crawler.tar.gz (14MB, zipped)
This script explores the web to collect a pseudo-random sample of sites. The crawl proceeds in a series of 100 (by default) waves. In each wave, 2,000 (by default) crawlers attempt to download pages. When a page download succeeds, all of its unique outbound hyperlinks are added to a master list. Sites for the next wave are then sampled from that list.
Sites from common domains are undersampled, to avoid getting stuck (e.g. within a single large social networking site). When sites are selected for the next wave, each domain is weighted as x^(3/4) rather than x, where x is the number of candidate sites from that domain, so heavily represented domains contribute proportionally fewer picks.
After several waves, the sample should be well-mixed -- a pseudo-random sample of the web.
Note: This code is kind of sloppy, does not scale well, and is poorly commented. Bad code! Bad!
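The tarball is the real thing; for readers who just want the shape of the algorithm, here is a minimal sketch of the wave loop and the x^(3/4) domain weighting. The use of requests and BeautifulSoup, the function names, and details like timeouts are my choices for the sketch, not necessarily what the archive does.

import random
from collections import Counter
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

WAVES = 100       # number of waves (the post's default)
CRAWLERS = 2000   # pages attempted per wave (the post's default)

def outbound_links(url, timeout=10):
    """Fetch a page and return its unique outbound hyperlinks."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(resp.text, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        href = urljoin(url, a["href"])
        if urlparse(href).scheme in ("http", "https"):
            links.add(href)
    return links

def sample_next_wave(candidates, k):
    """Sample k sites, giving each domain total weight x**0.75
    (x = candidate sites in that domain), i.e. x**-0.25 per site,
    so heavily represented domains are undersampled."""
    domain_counts = Counter(urlparse(u).netloc for u in candidates)
    weights = [domain_counts[urlparse(u).netloc] ** -0.25 for u in candidates]
    # Sampling with replacement for simplicity; the real script may differ.
    return random.choices(candidates, weights=weights, k=min(k, len(candidates)))

def crawl(seed_urls):
    current = list(seed_urls)
    master = set(seed_urls)
    for _ in range(WAVES):
        found = set()
        for url in current[:CRAWLERS]:
            found |= outbound_links(url)
        master |= found
        if not found:
            break
        current = sample_next_wave(sorted(found), CRAWLERS)
    return master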
3 comments:
Interesting. What is the purpose of such a pseudo-random sample? Thanks.
Aside from the curiosity value ("Find a random website!"), pseudo-random content can be helpful as a baseline sample for training text classifiers. That is, suppose you have 500 sites with content you're interested in. You can take these "yes" sites, grab 500 (or more) pseudo-random "no" sites, and use standard machine learning techniques to parse out the differences. The classifier will then be able to recognize the difference between sites like the ones you started with and everything else on the web. Moreover, a classifier trained this way is likely to be more robust than one trained on a convenience sample of sites. Technical, I know, but it's been very helpful in my work.
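To make that concrete, here is a minimal sketch of the "yes vs. pseudo-random no" idea using scikit-learn; the library choice, function name, and the assumption that page text has already been scraped into lists are mine, and any standard text-classification stack would do.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_yes_vs_random(yes_texts, no_texts):
    """Fit a binary classifier separating 'yes' pages from pseudo-random 'no' pages.

    yes_texts / no_texts are lists of page text already scraped from the two samples.
    """
    texts = list(yes_texts) + list(no_texts)
    labels = [1] * len(yes_texts) + [0] * len(no_texts)
    clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf  # clf.predict(new_texts) -> 1 for pages like the "yes" set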
There doesn't appear to be a link-- where can I get a copy of your code?