Showing posts with label politics. Show all posts

Friday, April 29, 2011

Computational politics: U.S. House legislation may move to XML?

http://thehill.com/blogs/hillicon-valley/technology/158339-boehner-cantor-want-house-to-use-open-data-formats

This would be a huge boon to computational social scientists and to groups like the Sunlight Foundation.

The economics and politics of the Death Star

"What’s the economic calculus behind the Empire’s tactic of A) building a Death Star, B) intimidating planets into submission with the threat of destruction, and C) actually carrying through with said destruction if the planet doesn’t comply?"

Fun discussion, with lots of analogies to history and current politics.

Friday, April 15, 2011

More text archives

Three more data sets (108MB, gzipped) for training classifiers. All these files are text-only, crawled in the week of 12/17/2010. It just took me a while to get around to releasing them.

political (115M):
Front pages from ~2,500 political sites. These sites were rated extremely likely to be political (p > 0.99) in an early version of my census of the political web. I revisited the same sites several months later and downloaded the new front pages to create this dataset. They should be appropriate for case-control training of a political classifier.

random (471M):
Front pages from ~8,000 "random" sites. These are a pseudo-representative sample of the web. I ran several web spiders in parallel, recording all outbound links from visited sites. I deliberately undersampled common namespaces, in order to avoid getting trapped in densely-linked social networking sites. The 8,000 sites are a random sample from a list of ~2 million sites generated in a crawl of this type.

porn (43M):
Front pages from ~1,200 pornography sites. I never thought I'd be curating pornography as part of my dissertation, but so much of the web is porn that I've had to build special classifiers to screen it out before sending pages to my undergrad research team for coding. These 1,200 sites were all linked from http://research.vision-options.com/research.php.
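
For anyone curious about the undersampling step in the "random" crawl, here is a minimal sketch of one way to do it. It is not the crawler I actually ran. It assumes that "namespace" means a URL's hostname and that each link is weighted by one over the number of recorded links sharing its hostname; the function names (namespace, sampling_weights, sample_links) are made up for illustration.

```python
import random
from collections import Counter
from urllib.parse import urlsplit


def namespace(url):
    """Return the lowercased hostname of a URL (assumed stand-in for 'namespace')."""
    return (urlsplit(url).hostname or "").lower()


def sampling_weights(links):
    """Weight each link by 1 / (number of recorded links in its namespace)."""
    counts = Counter(namespace(u) for u in links)
    return [1.0 / counts[namespace(u)] for u in links]


def sample_links(links, k, seed=None):
    """Draw k links (with replacement) so heavily linked namespaces are undersampled."""
    rng = random.Random(seed)
    return rng.choices(links, weights=sampling_weights(links), k=k)


if __name__ == "__main__":
    recorded = [
        "http://facebook.com/a",
        "http://facebook.com/b",
        "http://facebook.com/c",
        "http://example.org/",
    ]
    # Each namespace carries the same total weight, however many links it has.
    print(sample_links(recorded, k=5, seed=0))
```

With these weights, a site with three recorded links is exactly as likely to be chosen next as a site with one, which keeps a spider from spending all its time inside a densely linked network.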

Thursday, April 14, 2011

Data release: 75K political web sites

Data release:
pol_site_sample.tar.gz (1.2GB, gzipped)
This archive contains results from a front-page-only crawl of ~75,000 political web sites on 4/9/2011.

These sites span a large portion of the political web, but they are not a representative sample. Each was classified as very likely to be political (p > 0.9) in a crawl performed in August 2010. In other words, these are pages from sites that featured political content 8 months ago. Presumably, most -- but certainly not all -- of the sites still feature political content today.
I'm going to use this to train content classifiers. Might be useful to others as well.
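
If you want to poke at the archive without unpacking all of it, something like the sketch below should work. It uses only Python's standard tarfile module and makes no assumptions about the directory layout inside the archive; iter_pages is a hypothetical helper name, and the UTF-8 decoding with replacement characters is an assumption about the page encoding.

```python
import io
import tarfile


def iter_pages(archive, encoding="utf-8"):
    """Yield (member_name, text) for every regular file in a gzipped tar archive.

    `archive` may be a filesystem path or an open binary file object.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r:gz", **kwargs) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            handle = tar.extractfile(member)
            if handle is None:
                continue
            # Assumed encoding; undecodable bytes become replacement characters.
            yield member.name, handle.read().decode(encoding, errors="replace")


if __name__ == "__main__":
    # Tiny in-memory archive standing in for the real download.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"example front page text"
        info = tarfile.TarInfo(name="example-site/index.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    for name, text in iter_pages(buf):
        print(name, len(text))
```

For the real file, pass the path instead: iter_pages("pol_site_sample.tar.gz").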

Monday, February 21, 2011

Glenn Beck conspiracy generator

A Glenn Beck conspiracy generator.

How does this thing work? I'm guessing mturk or some mailing list. The phrases don't seem quite formulaic enough for Markov generation or automated madlibs.
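
For comparison, here is roughly what plain Markov generation looks like. This is a toy sketch of the general technique, not a claim about how the generator is actually built; build_chain and generate are illustrative names, and the training sentence is invented.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, start, length=12, seed=None):
    """Walk the chain from `start`, picking a random observed follower at each step."""
    rng = random.Random(seed)
    state = tuple(start)
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state never had a follower in the training text
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


if __name__ == "__main__":
    sample = "the left wants to control the media and the media wants to control you"
    print(generate(build_chain(sample), start=["the"], seed=1))
```

Output from a chain like this tends to wander and repeat in a recognizable way, which is why more polished phrases suggest human authors.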

Thursday, January 27, 2011

Inspiring, hopeful, education, and salmon -- Not in that order

This is everywhere already, so why not put it up here too? NPR asked people what they heard in the State of the Union, then put the 4,000+ responses into a word cloud. Here it is.
Remind me to include a couple of jokes next time I give a talk.

Monday, June 28, 2010

Reasons to study political blogging


I'm working like crazy on my dissertation prospectus. Data work, lit reviews, etc. To escape from early research purgatory, I plan to blog parts of the prospectus as I write them.

I'll kick off today with introductory definitions and motivation. Feedback is much appreciated. Beware of dry, academic writing!

What is a blog?
Paraphrasing Wikipedia, a blog is a website containing regular entries ("posts") of commentary, links, or other material such as photos or video. On most blogs, posts are displayed in reverse-chronological order -- the most recent post appears first. Although most blogs are maintained by individuals, some are run by small groups, and blogs speaking on behalf of corporations, churches, newspapers, political campaigns, etc. are increasingly common. Many blogs focus on a specific topic, ranging from broad to narrow: entertainment, cooking, astronomy, the Detroit Tigers, cold fusion. For my dissertation, I plan to focus on political blogs.

Why study political blogs?
Here are five reasons to study political blogs.
  1. Blogs are public-facing. Lots of people read them, including politicians and journalists. The extent to which blogs are replacing mainstream media is an open question, but it's certain that blogs have come to play an important role in public discourse, with real impact on politics.
  2. Bloggers span a wide variety of opinions. The blogosphere embraces everyone from conservative wingnuts to liberal moonbats to political moderates. Some political bloggers are politically omnivorous, writing about anything political. Others focus on specific issues and topics: foreign policy, Congress, feminism, etc.
  3. Bloggers include both experts and amateurs. Dividing the same pie in a different direction, many A-list bloggers (e.g. Andrew Sullivan, Arianna Huffington, Glenn Reynolds, Michelle Malkin) clearly qualify as political elites: they are experts, immersed in politics, well-informed and well-connected. Other political bloggers are more obscure and casual -- closer to the average Joes who make up the "mass public."
  4. Blogs are updated frequently. This has two nice consequences. First, frequent posts allow us to replay bloggers' reactions to events as they unfold. Second, frequent posts mean we have a lot of posts to work with.
  5. Blogs are archived publicly. Unlike most forms of political speech and action, blogging leaves a permanent data trail.
The combination of these attributes creates a kind of perfect storm for social science. Understanding the flow of opinions and ideas has always been difficult for social scientists, because most of our data have come from surveys.