Monday, December 20, 2010

A graphic design challenge...

Next semester, I'm going to be conducting a large-scale survey of bloggers for my dissertation. (I've had a pilot wave in the field for the last several weeks.) For credibility's sake, I need a public face for the project -- a web site people can visit to reassure themselves that my survey request isn't some kind of phishing scam.

I'm branding the project as the Online Political Speech Project, with the goal of "understanding the voices of the Internet." I've worked out some rough ideas for visual motifs, but I'm no graphic designer. If you are, or know someone who is, I'd love to be in touch. I'm hoping this can be a fun project for some talented graphic designer...



Project specs: I need to design and lay out this web site by the second week in January. I'm imagining about five pages (splash page, FAQ, contact, blog, data repository). Scope of work would be limited to designing templates -- I can plug in content later.

This should be reasonably quick work, with a lot of creative control for the designer. I can offer a little cash, but probably a lot less than going professional rates. (That's because, as a grad student, I get paid a lot less than going professional rates, too.) Maybe a good portfolio piece for a student web designer...? Or else a fun project for someone who wants to play with some of the styles and themes that have emerged lately on the web...?

Please email me (agong at umich dot edu) if you're interested.

Friday, December 17, 2010

Google's n-gram viewer - pros and cons

Google just released an n-gram viewing tool for tracking trends in half a trillion words across five centuries of text. Check it out. It's super fast and interesting, kind of like Google Trends on timescale steroids. Even better, they released the data.

Assorted thoughts poached from conversations with friends:
  • Someone should tell Fox about "Christmas"
  • I predict that in the year 165166 AD, fully one hundred percent of the words written in English will be the word "dance." (This will make 500-word APSA applications much more straightforward.)

My take
(In 18 words: For scientists, a small, representative sample usually beats a large, non-random sample. Google has given us the latter.)

The n-gram viewer will be a lot of fun for journalists, who often want a quick graph to illustrate a trend. It's a gold mine for linguists, who want to understand how syntax and semantics have changed over time. For both fields, more data is simply better.

As a bonus for me, maybe this tool will popularize the term n-gram, so I won't have to keep explaining it in conference talks.
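(For anyone who hasn't run into the term yet: an n-gram is just a run of n consecutive words. The whole idea fits in a few lines of Python; the sample sentence is mine.)

    def ngrams(text, n):
        """Return every run of n consecutive words in a text."""
        words = text.split()
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(ngrams("the dogs of war", 2))
    # [('the', 'dogs'), ('dogs', 'of'), ('of', 'war')]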

I'm less sure about the impact the n-gram viewer will have on the areas where the researchers seem most eager to apply it: culture, sociology, and social science in general. In those fields, we care a lot about who is speaking, not just how much speech we have on record. That's why sampling theory has been a cornerstone of social science for almost a century. According to the press releases, Google claims the corpus covers 5.2% of all books ever written. But we don't know whether it's a random sample.
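If that sounds abstract, here's a toy Python simulation of the point (all numbers invented): a small random sample nails the population mean, while a sample 300 times larger, drawn only from the loudest voices, misses badly.

    import random

    random.seed(1)

    # Toy population of a million opinions, true mean 0.
    population = [random.gauss(0, 1) for _ in range(1_000_000)]

    small_random = random.sample(population, 1000)    # n = 1,000, random
    big_biased = [x for x in population if x > 0.5]   # n ~ 300,000, non-random

    print(sum(small_random) / len(small_random))  # ~0.0, within a few hundredths
    print(sum(big_biased) / len(big_biased))      # ~1.14, wildly off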

And in any case, books have never been written by a random sample of people. I suspect that the n-gram viewer will have very little to offer to researchers hoping to study the culture and sociology of, say, blacks in the Jim Crow South, or the early phases of Lenin's revolution. By and large, we already know what cultural elites were saying in those periods, and the rest weren't writing books.

This means that there are two filters in place for any social scientist who wants to pull cultural trends out of these data. First, there's Google's method for sampling books, for which I haven't seen any documentation. (No offense, but this is pretty typical of computer scientists, who think in terms of scale more than sampling.) Second, there's the authorship filter: any trends are derived from written language, produced by whatever subset of the population was literate in a given period.

Example
As a political scientist, I'm interested in conflict. If you go to the site and punch in "war" from 1500 to 2000, you get a graph showing quite a few interesting trends. Here are some stories that I could naively tell from these data. I suspect that many are false.
  • In general, humanity has been far more interested in war since 1750.
  • That said, interest in war has generally declined since then.
  • Interest in wars spikes during wars.
  • Proportionally, wars since 1750 affect people far less than they did previously -- use of the term "war" jumped to five times the period baseline during the English Civil War, but less than double the period baseline during the World Wars.
  • World Wars I and II were the historical high points for human interest in war, but only slightly higher than the sustained interest of the peak period around 1750.
Interesting conjectures, and I can spin a story for each of them. In several cases, I'm pretty sure we're seeing the effects of cultural or linguistic artifacts in the data. Potential confounders include changes in literacy rates, the introduction of public schooling, the American Revolution, improvements in printing technology, linguistic drift, etc.

I'm cautiously pessimistic here. Given all the potential confounding variables, it's hard to see how we can sort out what's really going on in the line graphs. Maybe we can do it with fancy statistics. But we're not there yet.
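In the meantime, the released files at least make the counting part easy. Here's a sketch of tallying raw counts for "war" from a local copy of the 1-gram data. The filename is hypothetical, and the column layout is my reading of Google's documentation -- double-check it before trusting the output.

    from collections import defaultdict

    # Each line of the 1-gram files should look like:
    #   ngram TAB year TAB match_count TAB page_count TAB volume_count
    counts = defaultdict(int)
    with open("googlebooks-eng-all-1gram.csv") as f:  # hypothetical filename
        for line in f:
            ngram, year, matches, pages, volumes = line.rstrip("\n").split("\t")
            if ngram.lower() == "war":
                counts[int(year)] += int(matches)

    # The viewer plots proportions, so you'd still want to divide by
    # total words per year before telling any stories.
    for year in sorted(counts):
        print(year, counts[year])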

I'll give the last word to BJP:
My instinct suggests that what is talking here is less the folds of culture and more the wrinkles of the record of the technology expressing that culture.

Thursday, December 16, 2010

Link mish-mash

All kinds of interesting stuff on the web today. Not that this wasn't also true yesterday, or the day before...

  • IBM says its Watson AI is ready to take on human Jeopardy champs for a $1 million prize. The showdown is scheduled for Feb 14. This is reminiscent of the Kasparov/Deep Blue showdown, except that Watson will be competing on human home turf: making sense of the linguistic ambiguity in the hints, phrasing, puns, etc. of Jeopardy prompts. (The AI has one advantage: I bet Watson will always remember to phrase its answer in the form of a question.)
  • A rehash of physicist Aaron Clauset's work on "the physics of terrorism." I'm not a big fan of his stuff, to be honest. My view: Clauset showed that the severity of terrorist attacks follows a power-law distribution (see the sketch after this list), and has been wildly extrapolating from that single finding ever since.
  • A jeremiad about the state of journalism, from Pulitzer Prize winner David Cay Johnston. He talks trends (reporters know less and less about government; papers keep cutting content and raising prices) and hints at causes. The last paragraph is especially intriguing to me. Read it as a claim about how good reporting is supposed to uncover truth.
  • Predictions about 2011 from 1931. Eighty years ago, the NYTimes gathered a brain trust of experts in various fields and asked them what the world would look like today. Follow the link for predictions and some commentary. (Hat tip Marginal Revolution.)
  • This paper should depress my libertarian friends. Evidently, profit is evil, with an r-value of -.62.
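On the power-law item above, as promised: for anyone who hasn't met one, here's a quick Python sketch of why power-law distributions are striking. The exponent is illustrative, not Clauset's actual estimate.

    import random

    random.seed(0)

    # Draw 100,000 event "severities" from a power law via inverse-
    # transform sampling: P(X > x) = (x / xmin) ** -(alpha - 1).
    alpha, xmin = 2.5, 1.0
    draws = sorted(xmin * (1 - random.random()) ** (-1 / (alpha - 1))
                   for _ in range(100_000))

    print("median:", draws[len(draws) // 2])              # ~1.6
    print("99.9th pct:", draws[int(0.999 * len(draws))])  # ~100
    print("max:", draws[-1])                              # often in the thousands
    # The biggest events dwarf the typical one -- the power-law signature.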

Monday, December 13, 2010

HPN health prize: $3,000,000

The Heritage Provider Network is offering a $3 million prize for predicting hospitalization -- which patients are likely to be hospitalized or rehospitalized?
In 2006 well over $30 billion was spent on unnecessary hospital admissions. How many of those hospital admissions could have been avoided if only we had real-time information as to which patients were at risk for future hospitalization?
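If the contest data looks anything like standard claims data, a first entry would presumably boil down to something like this sketch: a classifier over claims-history features. Everything here (file name, column names) is invented, since the real data dictionary isn't out yet.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical feature table; column names are made up.
    df = pd.read_csv("claims_features.csv")
    X = df[["age", "claims_last_year", "er_visits", "chronic_conditions"]]
    y = df["hospitalized_next_year"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))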

Saturday, December 11, 2010

Programming Language Popularity Contest

I keep telling people that Python is the language to learn. Here's proof. Half a million GitHub and StackOverflow programmers can't be wrong, right?

(Ignore JavaScript. It's an in-browser language only. Fun if you want to animate a web page; useless if you want to write a file.)
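Case in point, the kind of two-liner that makes Python my default recommendation:

    # Writing a file: trivial in Python, off-limits to in-browser JavaScript.
    with open("hello.txt", "w") as f:
        f.write("Hello from the filesystem\n")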



via: Programming Language Popularity Contest

Tuesday, December 7, 2010

Journalism and activism: Stylized facts


Trying this idea on for size. Reactions?

(Also, I'm attempting to post to Facebook via Notes. Let's hope the feed actually works now...)

Thursday, December 2, 2010

Arsenic-based life and autocatalytic sets

NASA has just announced the discovery of a new kind of extremophile bacteria. If I'm reading the (sketchy) early release info correctly, these critters use arsenic in their DNA, instead of phosphorus. This is a biggish deal because every other known life form uses phosphorus.

How big a deal? Setting aside shades of the Andromeda Strain*, people seem pretty underwhelmed. Partly it's a bait-and-switch involving the NASA brand: the microbe in question, Gammaproteobacteria GFAJ-1, isn't an alien visitor. It's a mutant strain of a run-of-the-mill terrestrial bug.

So the real question is: How surprised should we be that little GFAJ-1 managed to assimilate arsenic into its DNA? Or the converse: if it's so easy, should we be surprised that no other life form bothered to do the same thing?

Taking a conversational step sideways, this seems like a good moment to put in a plug for a really fascinating theory on the origin of life: autocatalytic sets. This theory -- which I find persuasive -- argues that life isn't rare or unexpected; it's virtually inevitable. I can't find a good non-technical description of this underappreciated set of ideas online, so I thought I'd take a crack at it here. (My explanation is based largely on the opening chapters of Kauffman's At Home in the Universe**.)

The puzzle: all forms of life we know of are pretty complicated. There are a couple hundred distinct cell types in a human body; thousands of enzymes in a working cell; millions of base pairs in even the simplest DNA. (Prions and viruses, which are often simpler, don't count, because they can't replicate without hijacking a cell's machinery.) With numbers like these, the odds of hitting just the right combination for life are astronomical. Try flipping 220 million coins until they line up with the base pairs in a single human chromosome. Practically speaking, it will never happen.
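Just how astronomical? Python can at least count the digits, using the 220 million figure from above:

    from math import log10

    # 220 million coins all matching a target sequence: odds of
    # 1 in 2^220,000,000 -- a number with roughly 66 million digits.
    print(220_000_000 * log10(2))  # ~6.6e7 digits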

In most explanations of the origin of life, complexity is a stumbling block. We just got lucky, or God intervened, or maybe an asteroid seeded the planet with something capable of self-replication. Each account has a leap in its logic at the same point: explaining the incredibly low probability of life emerging spontaneously from a lifeless chemical soup.

The autocatalytic theory turns this logic around: it argues that life exists because of its chemical complexity, not in spite of it.

The theory builds from three simple assumptions. First, some chemicals react; most don't, with roughly constant odds for any given pair of chemicals reacting. Second, chemical reactions produce new chemicals. Third, the number of possible chemicals is very large.

Thought experiment: suppose you find yourself in front of a chemistry set containing a rack of n beakers, each filled with an unknown chemical. Channeling your inner mad scientist, you begin mixing them at random. Suppose 1 in 100 pairs creates a reaction. How many total reactions should you be able to get out of your chemistry set?

If you have two chemicals, the odds of a reaction are just 1 in 100.

If you have three chemicals, you have three potential pairs, and the odds of a reaction are about 3 in 100.

If you have four chemicals, you actually have six potential pairs, so the odds of a reaction are just under 6 in 100.

At this point, exponentials start working rapidly in your favor. With five chemicals, you have 10 potential pairs, for a 9.5% chance of at least one reaction. Twelve chemicals gets you 66 pairs with 48.4% odds of at least one reaction. The deluxe 30-chemical starter kit has 435 potential pairs, with 98.7% odds of at least one reaction.

What does this prove? The number of likely reactions in a pool increases faster than the number of chemicals in the pool.
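Those numbers are easy to verify: with n chemicals there are C(n, 2) = n(n-1)/2 pairs, and the chance that at least one pair reacts is 1 - 0.99^pairs. In Python:

    from math import comb

    p = 0.01  # odds that any given pair reacts
    for n in (2, 3, 4, 5, 12, 30):
        pairs = comb(n, 2)
        chance = 1 - (1 - p) ** pairs
        print(f"{n:2d} chemicals: {pairs:3d} pairs, {chance:6.1%} chance of a reaction")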

It keeps going, too. With 1 in 100 odds, you would expect to get about 4 reactions out of your 30-chemical kit. If each reaction creates a new chemical, you now have 34 chemicals in your pool, with correspondingly greater odds of additional reactions. Eventually, you pass a tipping point, and the expected number of compounds becomes infinite. If I've got my math right, it happens around 80 chemicals in this scenario, because the expected number of new reactions exceeds the number of reactions in the existing set. The more you mix, the more potential mixtures you get.
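Here's my back-of-the-envelope version of that tipping-point check. The model is my own guess at what the criterion means (every reaction, whether between newcomers and old-timers or among newcomers, yields one new chemical), so treat the threshold as ballpark:

    p = 0.01  # odds that any given pair reacts

    def next_cohort(n, m):
        """Expected new reactions when m newcomers join a pool of n chemicals."""
        return p * (m * n + m * (m - 1) / 2)

    # Tipping point: the first batch of reaction products spawns an even
    # bigger second batch.
    for n in range(60, 121, 10):
        m = p * n * (n - 1) / 2  # expected reactions among the originals
        print(n, "tips over" if next_cohort(n, m) > m else "settles down")
    # Crosses over in the 80s -- consistent with the ~80 estimate above.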

A quick pause: when we talk about chemicals, we're not talking about atoms in the periodic table of the elements. Except during fission and fusion, atoms themselves don't change into new kinds of atoms. Instead, we're talking about molecules -- new combinations of atoms.

In particular, organic molecules -- the ones that ended up supporting life -- fit this model very well. Enzymes, RNA, and DNA are all organic molecules. Believe it or not, a strand of DNA is one enormous molecule. Organic molecules are all built mainly from the same base atoms: carbon, hydrogen, oxygen, nitrogen, phosphorus -- and now arsenic(!) These atoms happen to be good at linking to form long chains. Because of the way these chains fold, they react with each other in often unpredictable ways. Most organic molecules don't react with each other, but quite a few do. And because the chains can get very long, the set of potential molecules made of CHONP atoms is essentially infinite.

Now getting back to the main story... We're halfway through. We've shown how simple rules for reactions can get you from a small-ish starting set to an infinite variety of chemicals to play with. It seems very reasonable to suppose that the primordial organic soup included enough organic reactants to pass the tipping point into infinite variety. But that just means a more flavorful soup. How do we get to life?

Setting aside the transcendental, life is defined by sustainable reproduction. A cell is a bag of chemicals and reactions that keeps working long enough to make at least one copy of itself. As part of the deal, the cell has to be in the right sort of environment, with whatever energy sources and nutrients are necessary.

Our cells achieve sustainability by using enzymes to catalyze other reactions. It turns out that the same logic that applies to pairs of reactants also applies to catalysts: the expected number of catalyzed reactions in a pool grows faster than the number of chemicals in the pool. Once you get enough chemicals, it's virtually certain that you'll have quite a few catalyzed reactions.

Here's the really nifty bit. As the number of catalyzed reactions in the set increases, eventually some of them will form an autocatalytic set -- a loop of reactions catalyzing each other. Reaction A creates the catalyst that enables reaction B, which creates the catalyst for C, and so on, until some reaction Z creates the catalyst that enables reaction A again.

Based on the same logic we saw earlier, these loops always appear once the pool of chemicals gets large enough. They are typically long and complicated, cycling through a seemingly random group of chemicals among a much larger set of nutrients and byproducts. They tap nutrients and energy sources in the environment, making more of themselves the longer they run. In other words, autocatalytic sets look a whole lot like life as we know it.
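To see the "loops always appear" claim in action, here's a toy Python model (mine, not Kauffman's actual simulations): wire up a random catalysis graph, where an edge i -> j means reaction i's product catalyzes reaction j, and check for a directed cycle -- i.e., an autocatalytic set.

    import random

    def has_autocatalytic_loop(n_reactions, p_catalyze, seed):
        """True if a random catalysis graph contains a directed cycle."""
        rng = random.Random(seed)
        edges = {i: [j for j in range(n_reactions)
                     if j != i and rng.random() < p_catalyze]
                 for i in range(n_reactions)}
        WHITE, GRAY, BLACK = 0, 1, 2  # DFS colors: GRAY = on current path
        color = [WHITE] * n_reactions

        def dfs(u):
            color[u] = GRAY
            for v in edges[u]:
                if color[v] == GRAY or (color[v] == WHITE and dfs(v)):
                    return True
            color[u] = BLACK
            return False

        return any(color[u] == WHITE and dfs(u) for u in range(n_reactions))

    for n in (10, 50, 100, 500):
        hits = sum(has_autocatalytic_loop(n, 0.01, seed) for seed in range(100))
        print(f"{n:4d} reactions: loops in {hits}/100 random graphs")

Loops go from rare to near-certain as the pool grows, which is the whole argument in miniature.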

I find this theory compelling. It takes the biggest objection to prevailing theories -- the inherent complexity of life -- and makes it the cornerstone of a different approach.

And as a bonus, it makes arsenic-based life forms seem very plausible. Given NASA's results, it seems reasonable to say that arsenic-based DNA is simply another evolutionary path open to viable autocatalytic sets. Bill Bryson says it well in A Short History of Nearly Everything:
"It may be that the things that make [Earth] so splendid to us---well-proportioned Sun, doting Moon, sociable carbon, more molten magma than you can shake a stick at and all the rest---seem splendid simply because they are what we were born no count on. No one can altogether say. ... Other worlds may harbor beings thankful for their silvery lakes of mercury and drifting clouds of ammonia."


PS: Autocatalytic sets don't have much to do with the evolution/intelligent design debate. They propose a mechanism that could be responsible for jumpstarting evolution. So if you're comfortable with the idea that God would choose cosmological constants and direct evolutionary processes with some goal in mind, it probably won't bother you to add the idea that he would use chemical soups and catalysis networks along the way.

PPS: The main difference between an autocatalytic set and life as we know it is the absence of a cell wall. It's not hard to close the gap conceptually. Once a catalytic loop gets started, other loops usually form as well. At this point, competition and natural selection between autocatalytic cycles can kick in. If one autocatalytic loop happened to produce a hydrophobic byproduct (like a fat or lipid), it could easily act as a primitive cell wall. This kind of barrier would enable the autocatalytic loop to increase its concentration, and therefore its reaction rate. This kind of pseudocell would reproduce faster and very likely evolve into more sophisticated organisms.

* A ten-word review of The Andromeda Strain: Typical Crichton -- some interesting ideas; mildly annoying narration; mostly plotless.

** A wonderful book for putting some color on the messy process of scientific discovery.

Wednesday, December 1, 2010

Prezi - A quick review

I spent a couple hours playing around with Prezi, an online presentation builder being promoted as an alternative to the slideshow format of PowerPoint, Keynote, etc. Instead of a series of slides, Prezi presentations consist of a series of views over one large image. The format parallels drawing on a whiteboard instead of clicking through slides on a projector. A good concept, but the execution is a little clunky.

My review: Preparing good presentations is time consuming, for two reasons: 1) it takes some trial and error to figure out the best way to express an idea, verbally and visually, and 2) presentation software is clunky, requiring a lot of fiddling to get things right. In my experience, spending time on (1) is fun and creative; spending time on (2) is frustrating and stressful.

Because of the "whiteboard" metaphor and brand emphasis on good design, I was hoping Prezi would deliver a slick and streamlined user experience. Being free from interface hassles and able focus on creative expression would be wonderful. Alas, I quickly ran into many GUI annoyances.
  • The interface for importing images is very clunky. You have to download or save the image to your desktop, then upload. On the plus side, you can batch upload several images at a time.
  • The whole image is static, which means that you can't mark up images over the course of a presentation. To some extent this makes sense -- dynamic images would mess up the concept of arranging your display in space rather than time. However, it breaks the whiteboard metaphor. When I do whiteboard presentations, I often have an agenda that I revisit, adding checkmarks and lines to relevant content. I can't do that in Prezi.
  • Rudimentary tools for grouping objects are not available. This one really gets me. You can accomplish the same thing (visually) by putting several objects together in an invisible frame. But every time you want to move the group, it takes several extra clicks to select everything and drag it around. Poor usability.
  • You can only use a handful of presentation styles. Your only alternative is to hire Prezi staff to build a custom style for $450.
Summary: I would really like a tool that lets me express myself clearly, fast. Prezi offers some advantages for clarity, but not really for speed. Overall, I'm mildly impressed, but not overwhelmed. For the moment, the main benefit of Prezi seems to be novelty.