Friday, April 29, 2011

Computational politics: U.S. House legislation may move to XML?

http://thehill.com/blogs/hillicon-valley/technology/158339-boehner-cantor-want-house-to-use-open-data-formats

This would be a huge boon to computational social scientists and groups like the Sunlight Foundation.

The economics and politics of the Death Star


"What’s the economic calculus behind the Empire’s tactic of A) building a Death Star, B) intimidating planets into submission with the threat of destruction, and C) actually carrying through with said destruction if the planet doesn’t comply?"

Fun discussion with lots of analogies to history and current politics.

Tuesday, April 26, 2011

Is bitcoin a decentralizing, democratizing agent?

A response to Duane on the importance of bitcoin as a new, decentralized currency. (This has been a fun running debate among friends over the last couple weeks.)

I read his recent post as claiming "decentralized is better." My response: "sometimes, but not always." Here's my reasoning.

Point 1: Money is a figment of our shared imagination. It has value because we all collectively accept that it has value. There's nothing particularly special about green paper, or gold rocks, or any given set of bits, except for the common knowledge that other people will also accept those currencies in exchange for goods and services. In that sense, the valuation of any currency is already "democratic" -- its real value exists in the minds of a distributed network of people. People were using rare stones as currency long before governments got involved.

Point 2: Enlightened self-interest leads us to centralize some of the responsibility for maintaining currencies. One of the main threats to any currency is counterfeiting, so it makes sense to centralize responsibility for preventing counterfeiting to mints and enforcement agencies like the Secret Service. Another threat is inflation (or deflation), which is why every modern state with a large economy uses a central bank to manage inflation via the money supply. A third "threat" is transaction costs, which is why we use credit cards for so many things -- the artificial "currencies" maintained by these corporations are so convenient that they have displaced state-authorized currency in many transactions*.

Point 3: As a side-effect of centralization, governments gain some power to regulate other uses of money. (Note that governments, like currencies, are also a figment of collective imagination. The Constitution is law because we all agree that it's the law.) Thus, the imposition of taxes, tariffs, and embargoes; and continuing efforts by organized criminals to circumvent the system by laundering money. These things are not necessarily good or bad. They depend on whether we approve of the use of government power in those areas. By and large, I imagine most people approve of governments shutting down mobsters and sex trafficking rings. Using the same suite of tools for defense spending, Planned Parenthood, agricultural subsidies, "bailouts," etc. is more controversial.

On the whole, I'd argue that people are better off because governments have these capabilities at their disposal -- especially people living in places where government is reasonably transparent and accountable to its citizens. We're better off because we've decided to pay taxes for roadways, police stations, and schools. We're better off because the value of a dollar is reasonably stable. We're better off because corporations (especially publicly traded firms) are forced to keep a strict accounting of their transactions. We're better off because the FBI and CIA can use financial information to crack down on terrorists and organized criminals.

Putting that all together, I see bitcoin as an attempt to float an unregulatable currency using P2P technology and cryptography. By construction, such a currency would make it very difficult for government to intervene in the ways discussed in points two and three. Although I don't agree with everything our government does**, I don't see compelling reasons to deny ourselves the ability to use those tools as a society. Together, we're made better off by careful use of centralized financial regulation and lawmaking. That being the case, bitcoin seems partly radical, partly old hat, and partly a step backwards.


*Semantic question: are credit cards a centralized or decentralized currency? What about frequent flyer miles? What about Subway sandwich discount punch cards? At some level, the labels "centralized" and "decentralized" are too blunt to be really useful. This seems related to Duane's comments about needing both more and less regulation.

**Nobody does. That's the nature of compromise.

Monday, April 25, 2011

The best $5 I've spent all year

I finally started using Amazon's EC2 last Thursday. I've been meaning to learn it forever, but assumed it would be time-consuming to get registered, set up an instance, and so on.

Not true. Thursday morning at 10am, I registered for EC2. By 10:30 I had an instance of Drew Conway's Py/R AMI up and running, with several additional libraries installed, and a few GB of data I wanted to crunch uploaded to the server. Very fast turnaround.

Eight hours and $4.77 later, I'd crunched a lot of numbers -- by far my most productive workflow all week. Highly recommend it.

Friday, April 22, 2011

Scalable education

I got a lot of great responses on my previous post about scalable education, and wanted to share them back out here. I also got a lot of questions on what I meant by "scalable" education. Let me speak to that first.

Thesis: the number of students getting a good education today is about 20 times the number of good teachers. Under our current classroom model, "number of high-quality teachers" is the limiting variable. By scalable education, I mean models of teaching that could feasibly push that ratio orders of magnitude higher -- systems that would allow one good teacher (plus the right support system) to teach 2,000 or 2,000,000 kids.


Here's a loosely annotated list of links to proposed models of "scalable education," sorted from least to most revolutionary:

Jump - New curriculum deployment. Not bad, but not revolutionary.

PBS documentary on "digital learning" - This is a mixed bag. Some models (e.g. the Smithsonian scavenger hunt) are neat, engaging students in new ways. I was less impressed with several segments that are just trading classrooms for classrooms plus computers.

Teach for America - TFA's teachers do a lot of good, but they also tend to move on quickly. (This may be changing.) On the other hand, TFA is very proactive about maintaining their alumni network. If this is going to be revolutionary, it will be as a policymaking network, more than a teaching force.

iTunes U - Possibly revolutionary. But content delivery < education.

TED - "The first major educational brand to emerge in a century." Very cool topics, but see previous.

Harvard's Distance Education - Lectures are free; course credit with the Harvard brand costs money. Many universities are doing this; Harvard is probably the best known. This is changing things, but I'm not sure how competition is going to play out in this space.

The Khan Academy - My number one vote for a potentially disruptive model of education.

Thursday, April 21, 2011

Wednesday, April 20, 2011

What I've been watching lately

Ralph Langner's fascinating TED talk on reverse-engineering the Stuxnet worm. This is the best presentation on cyber-terrorism/security I've seen so far.



The bubble sort algorithm, illustrated as a Hungarian line dance. (via flowingdata) Fun, but a little long. What sort of dance could illustrate a heap sort?
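For reference, the choreography the dancers are acting out fits in a few lines of Python -- a quick sketch of the textbook algorithm, nothing more:

```python
# Bubble sort: repeatedly walk the list, swapping adjacent neighbors
# that are out of order. Each pass "bubbles" the largest remaining
# element to the end of the unsorted region.
def bubble_sort(items):
    items = list(items)  # work on a copy
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:  # already sorted; the dancers can sit down early
            break
    return items

print(bubble_sort([5, 1, 4, 2, 8]))  # → [1, 2, 4, 5, 8]
```

The early-exit check is the only subtlety: a pass with no swaps means the line is already in order.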


A quick (4min) TED talk on Google's driverless cars. Nifty.

Tuesday, April 19, 2011

Code: a random web crawler

This code crawls the web to generate a pseudo-random sample of web sites. Not my prettiest code, but it works and may save somebody an afternoon of coding.

random_crawler.tar.gz (14MB, zipped)
This script explores the web for a pseudo-random sample of sites. The crawl proceeds in a series of 100 (by default) waves. In each wave, 2,000 (by default) crawlers attempt to download pages. When page downloads are successful, all the unique outbound hyperlinks are stored to a master list. Then sites for the next round are sampled.

Sites from common domain names are undersampled, to avoid getting stuck (e.g. within specific social networking sites). When sites are selected for the next round, each domain's total weight is proportional to x^(3/4), where x is the number of candidate sites in that domain.

After several waves, the sample should be well-mixed -- a pseudo-random sample of the web.
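For the curious, the domain weighting can be sketched in a few lines (a hedged illustration -- the function and variable names here are mine, not from the released script):

```python
import random
from collections import Counter
from urllib.parse import urlparse

def sample_next_wave(candidate_urls, k=2000):
    """Sample k URLs for the next wave, undersampling common domains.

    Each domain's total selection weight is x**0.75 (x = number of
    candidate URLs in that domain), so each individual URL's weight
    works out to x**0.75 / x = x**-0.25.
    """
    counts = Counter(urlparse(u).netloc for u in candidate_urls)
    weights = [counts[urlparse(u).netloc] ** -0.25 for u in candidate_urls]
    return random.choices(candidate_urls, weights=weights, k=k)
```

A domain with 10,000 candidate pages gets 1,000x the pages of a single-page domain, but only 10000^0.75 ≈ 1000^(4/3)/... roughly 178x its selection weight -- enough to escape densely linked sites over several waves.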

Note: This code is kind of sloppy, does not scale well, and is poorly commented. Bad code! Bad!

Monday, April 18, 2011

Another data resource

I'm on a real data kick this week. Here's another one that should be useful for people training language classifiers.

lang_samples.tar.gz (1.5G gzipped)
This archive contains language samples from the 2008 static wikipedia dumps available at http://static.wikipedia.org/downloads/2008-06/. I downloaded all 261 archives, and extracted samples of text from each.
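To give a sense of what "training a language classifier" on samples like these might look like, here is a toy character-trigram sketch (pure standard library; the tiny inline samples are stand-ins for the actual wikipedia dumps):

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts, with padding spaces at the edges."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def train(samples):
    """samples: {lang: [texts]} -> {lang: aggregate trigram profile}."""
    return {lang: sum((trigrams(t) for t in texts), Counter())
            for lang, texts in samples.items()}

def classify(text, profiles):
    """Score each language by trigram overlap; return the best match."""
    tg = trigrams(text)
    return max(profiles, key=lambda lang:
               sum(min(n, profiles[lang][g]) for g, n in tg.items()))

samples = {"en": ["the quick brown fox jumps over the lazy dog"],
           "de": ["der schnelle braune fuchs springt"],
           "fr": ["le renard brun rapide saute"]}
profiles = train(samples)
print(classify("the dog jumps", profiles))  # → en
```

Real systems use smoothed n-gram models or discriminative classifiers, but with a gigabyte and a half of samples even simple profiles like these separate languages surprisingly well.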

Saturday, April 16, 2011

What does it take to be a data scientist?

Conway on what it takes to be a data scientist (@ ZIA, ht: Mike B).


The full article is here. It's short and sweet, and offers a nice counterpoint to some of the claims made by people with a more computer-science-centric view of the world. Turns out that modeling assumptions (i.e. math and statistics) and theory (i.e. substantive expertise) matter. You ignore them at your own risk.

PS: The title makes it sound like this is about U.S. intelligence, but almost all the points in the article apply to business and academia as well.

Friday, April 15, 2011

More text archives

Three more data sets (108MB, gzipped) for training classifiers. All these files are text-only, crawled in the week of 12/17/2010. It just took me a while to get around to releasing them.
political (115M):
Front pages from ~2,500 political sites. These sites were rated extremely likely to be political (p>.99) in an early version of my census of the political web. I revisited the same sites several months later and downloaded the new front pages to create this dataset. They should be appropriate for case-control training of a political classifier.

random (471M):
Front pages from ~8,000 "random" sites. These are a pseudo-representative sample of the web. I ran several web spiders in parallel, recording all outbound links from visited sites. I deliberately undersampled common namespaces, in order to avoid getting trapped in densely-linked social networking sites. The 8,000 sites are a random sample from a list of ~2 million sites generated in a crawl of this type.

porn (43M):
Front pages from ~1,200 pornography sites. I never thought I'd be curating pornography as part of my dissertation, but so much of the web is porn that I've had to build special classifiers to screen it out before sending to my undergrad research team for coding. These 1,200 sites were all linked from http://research.vision-options.com/research.php.

Thursday, April 14, 2011

It's data dump week!

In case this wasn't clear already, it's data dump week. I'm gearing up for another crawl of the political web, and posting lots of bits of code and data along the way. If nothing else, I'll be able to find them here in the future. If these (or similar) resources are useful to you, please let me know.

Cheers!

Data release: 75K political web sites

Data release:
pol_site_sample.tar.gz (1.2GB, gzipped)
This archive contains results from a front-page-only crawl of ~75,000 political web sites on 4/9/2011.

These sites span a large portion of the political web, but they are not a representative sample. Each was classified as very likely to be political (p > 0.9) in a crawl performed in August 2010. In other words, these are pages from sites that featured political content 8 months ago. Presumably, most -- but certainly not all -- of the sites still feature political content today.
I'm going to use this to train content classifiers. Might be useful to others as well.

Tuesday, April 12, 2011

Lightweight pdf renderers

I'm finishing up dissertation data collection in the next ~6 weeks, which means I'm going to be spending a lot less time writing code, and a lot more time analyzing data and writing papers. So R and LaTeX are going to be my new best friends.

Taking a good look at my workflow around these packages, I realized that viewing pdfs was really slowing me down. Every time I generate a graph or paper, I have to open up the pdf version and see what it looks like. Adobe's very bulky software takes several seconds to load -- very frustrating when you're playing with margins or table formatting and want to iterate quickly.

So I went out looking for a lightweight pdf viewer. Here's what I found:

http://www.downloadmunkey.net/2008/04/random-monday-foxit-reader-vs-pdf-xchange-viewer-vs-sumatra/
http://portableapps.com/node/17260
http://www.techsupportalert.com/best-free-non-adobe-pdf-reader.htm

Any other advice?

Based on those reviews, I'm going to give PDF-XChange a shot. I'll let you know how it goes.

Monday, April 11, 2011

Redesigned my layout

I've been meaning to do this for a while. tpmotd's nudge finally got me to set aside the necessary 15 minutes.

Now all my pictures should actually fit!

Saturday, April 9, 2011

Prezi: AI and games

A few months ago I first tried out prezi. Since then, I've seen it trickling into presentations here and there. The novelty is great, and the usability is a little better than last time I played with it. So I went ahead and made my latest slides with it.

These are slides for my (mock) TED talk from the Hill Street TED activity yesterday.

Thursday, April 7, 2011

Crowdsourcing and buzzword lumping

In general, I'm a big supporter of crowdsourcing, but I worry about lumping together too many things under one popular buzzword. A few NYTimes articles have spoken to this issue recently (ht Gloria). This one is pretty starry-eyed. This one unpacks things (a little) more.

Let me push on this idea of lumping. Wikipedia defines crowdsourcing as
the act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a "crowd"), through an open call.
As I read that definition, all of these are crowdsourcing:
  • Wikipedia
  • Ideo's open design lab
  • Innocentive's innovation contests
  • 99designs' graphic design sweatshop
  • Elections
  • Spam farms
  • Penny stock pump-and-dump marketing
  • Bounty hunters and privateers
Trick question: so is crowdsourcing a good thing or a bad thing?

My position: We're at a place where technology is enabling new institutions. It would be backward to ignore that potential. But there are all kinds of issues with corruption, lack of expertise, bias in who participates, etc. that "crowdsourcing" doesn't solve automatically. Just like other institutions, crowdsourced institutions have to be designed carefully to head off those problems. I don't think it's a magic bullet, but I do think it can help.


Tuesday, April 5, 2011

Announcing SnowCrawl!

Announcing the beta release of SnowCrawl, a python library for directed webcrawls. Nice features include saved state for backup, support for threading and a client-server architecture, and lots of flexibility.

The project is open source, hosted on Google Code. More details to follow.