Tuesday, November 30, 2010

Forums and platforms for crowdsourcing

I've always thought of crowdsourcing as a narrow, niche thing. But then I ran across this excellent list of forums and platforms for crowdsourcing. Skimming down the list, I realized that crowdsourcing is actually starting to play an important role in quite a few industries.

It looks like we're just starting to climb the adoption curve here. Where should we expect crowdsourcing to stop? What sorts of problems will it solve (and not solve)? Who won't ever use it?

PS: In case you haven't run across the term before, here's Wikipedia's definition of crowdsourcing:
Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a crowd), through an open call.

Wednesday, November 24, 2010

How long are books in political science?

As a polisci student who has passed his prelims, I have a very good idea how long books take to read. But I realized this week that I didn't have a good idea how long books take to write.

As I was setting up the template to write a book-format dissertation, I decided to give it a look. Here are some rough stats for five well-known polisci books*, pulled from my reading shelf more or less at random:

                   Hindman   Zaller     Mutz  Lupia and  Huber and
                                              McCubbins     Shipan
Lines/page              38       45       35         42         39
Words/line              13       13       11         12         11
Total pages            141      310      125        229        210
Total chapters           7       12        5         10          8
~Pages/chapter        20.1     25.8       25       22.9       26.2
~Total words        69,654  181,350   48,125    115,416     90,090


These estimates are probably biased upwards, because I based them on pages of full text without any tables, graphs, or chapter breaks. But still, they give a pretty good idea of the rough scale of the project.

To put it in perspective, the current working draft for my dissertation includes 10,653 words. If I'm shooting for 70,000 total, that means I'm 15% done already!
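For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the estimate: total words is just lines per page times words per line times total pages. The per-book numbers are the ones from the table above; the variable and function names are my own, picked for illustration.

```python
# Per-book layout numbers from the table above:
# (lines per page, words per line, total pages, total chapters)
books = {
    "Hindman": (38, 13, 141, 7),
    "Zaller": (45, 13, 310, 12),
    "Mutz": (35, 11, 125, 5),
    "Lupia and McCubbins": (42, 12, 229, 10),
    "Huber and Shipan": (39, 11, 210, 8),
}


def estimate_words(lines_per_page, words_per_line, pages):
    """Rough word count, assuming every page is a full page of text."""
    return lines_per_page * words_per_line * pages


estimates = {
    name: estimate_words(lines, words, pages)
    for name, (lines, words, pages, _chapters) in books.items()
}
pages_per_chapter = {
    name: round(pages / chapters, 1)
    for name, (_lines, _words, pages, chapters) in books.items()
}

# Progress toward a 70,000-word target, using the draft count quoted above.
draft_words = 10_653
target_words = 70_000
percent_done = round(100 * draft_words / target_words)

for name, total in estimates.items():
    print(f"{name}: ~{total:,} words, ~{pages_per_chapter[name]} pages/chapter")
print(f"Draft is about {percent_done}% of the target")
```

Because the estimate treats every page as full text, it overstates the true count for the same reason given above.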


*These are the books I used for estimates:
Matthew Hindman, The Myth of Digital Democracy
John Zaller, The Nature and Origins of Mass Opinion
Diana Mutz, Hearing the Other Side
Arthur Lupia and Matthew McCubbins, The Democratic Dilemma
John Huber and Chuck Shipan, Deliberate Discretion

Tuesday, November 23, 2010

An update on voice recognition software

Thanks to everyone who sent comments about voice recognition via Dragon, Microsoft, and Mac. I'm still playing with different options to see what will work.

In the meantime, I've discovered another option: Google Voice. It turns out there's an easy way to set up Google Voice so that you can dictate messages to yourself (info here). GV will then automatically transcribe them and send them to your email or mobile device. Nifty!

Since it's free and didn't require any software installation or training, I decided to give this a shot first. The result: laughably bad transcription (see below), but a substantial boost in productivity. How's that? For me, drafting is the most time-consuming part of writing. Once I have some basic ideas on paper, I can edit and elaborate reasonably quickly. But it takes me a long time to get that first version out.

As a result, dictating an early version has been very helpful. It forces me to say something. Since I'll be editing soon, it doesn't matter if that something is bad (it is, usually). So dictation, even with GV's horribly inaccurate transcription, has worked pretty well for me. Bottom line: I'm certainly going to be investing in a voice recognition package in the near future.

An example of GV transcription. Here's what I said:
This chapter describes methods and data sources for the book. My goal is to describe the logic behind the research design. The focus here is validity: what types of conclusions can we draw from these data? Technical details---of which there are many---are saved for appendices.

Here's what Google thought I said:
This chapter just tries messages and data sources for the book. My goal is to describe the logic behind the research design. The focus here is validity. What types of conclusions. Can we draw from the Yeah, technical details. I wish there are many her saved for the appendicitis.
In fairness, Google just got into this business in the last year or so. I'm sure their transcription will get better over time. But for the moment, "I wish there are many her saved for the appendicitis."

Saturday, November 20, 2010

Anybody have any experience with dictation software?

Faced with writing a 200-page thesis in the next year, I'm toying with the idea of getting myself a voice recognition software package. Dragon seems to be the industry leader. Here's a demo.

The software seems to work, is getting good reviews on Amazon, and is not terribly expensive. On the other hand, it seems like it might be hard to find a place to talk loudly to oneself and a computer for long periods of time. It also seems likely that training the software and editing its mistakes could be pretty time-consuming. Also, Dragon seems built with MS Office in mind. Given that I lean open source (and don't even have Office on my laptop), would it work for me?

Bottom line: would this be more like the digital stylus I got a few months ago (and use all the time)? Or the extra external hard drives I got a year before that (and have only booted up to make sure they work)?


So I'm on the fence. I really wish Dragon had a trial version. Anybody have anything to add here?

PS: Yes, I do know that one episode of The Office. (Dwight: "Cancel card. Can-suhl car-Duh.")

PPS: How long do you think until a viable open-source option opens up? Probably at least a couple years, unless someone like Google decides to launch a free version. They seem to be making moves in that direction (see here for an easy application, and here for a broader one).

Friday, November 19, 2010

Training a text classifier

I've been doing a lot of work with text classifiers lately. (See this post for an example.) It's an interesting process blending intuition, text, and math in some nifty ways.

Anyway, I was going to spell out how it all works in a blog post, but I got sidetracked making an illustration, and now I'm out of time for blogging. But I like the illustration. More to come...

Thursday, November 4, 2010

Starting to get some dissertation results...

Apologies for the long delay between posts. Stock excuse: "Dissertation... blah blah blah..."

Actually, I'm starting to get some nifty results from my dissertation. I've spent a long summer writing surveys and software, and in the next few weeks I hope to have something to show for it. Exhibit A: a word cloud for an automated classifier of political content.


Orange words are associated with political content, and blue words are disassociated. The size of a word denotes the strength of association -- essentially, it corresponds to the absolute value of the word's coefficient (beta) in a logistic regression with "political-ness" as the dependent variable. The layout of the words is done by computer algorithm to conserve space; it doesn't carry any important information.
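To make the encoding concrete, here is a small Python sketch of how a coefficient could be turned into a color and a font size. The coefficients are made up for illustration, the function name and point sizes are hypothetical, and the linear scaling is my assumption for the sketch (I haven't checked how wordle actually scales its fonts).

```python
def word_style(beta, max_abs_beta, min_pt=12, max_pt=72):
    """Map one regression coefficient to a (font size, color) pair.

    Sign picks the color; absolute value picks the size.
    Linear scaling between min_pt and max_pt is an assumption.
    """
    color = "orange" if beta > 0 else "blue"
    scale = abs(beta) / max_abs_beta if max_abs_beta else 0.0
    size = min_pt + scale * (max_pt - min_pt)
    return round(size), color


# Hypothetical coefficients, invented for this example only.
betas = {"senate": 2.0, "election": 1.0, "recipe": -1.5, "touchdown": -0.5}

max_abs = max(abs(b) for b in betas.values())
styles = {word: word_style(b, max_abs) for word, b in betas.items()}

for word, (size, color) in styles.items():
    print(f"{word}: {size}pt, {color}")
```

Note that a strongly non-political word such as "recipe" in this toy data ends up large and blue: size tracks the strength of the association, not its direction.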

I used Wordle for the layout. The classifier runs regularized logistic regression using the scikits.learn package for Python. The training data is from a team of undergraduate research assistants.
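For readers who haven't seen one, here is a toy version of that kind of classifier in plain Python. To be clear about the assumptions: this is not my dissertation code and it doesn't call scikits.learn. It is a hand-rolled stand-in that uses bag-of-words counts, an L2 penalty, and batch gradient descent, and the four documents and their labels are invented. The real regularization settings aren't shown here.

```python
import math


def train_logistic(docs, labels, l2=0.1, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression on bag-of-words counts.

    Returns (vocabulary, weights, bias). Batch gradient descent is used
    only as a simple stand-in for a library solver.
    """
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    # One count vector per document.
    X = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for w in doc.lower().split():
            row[index[w]] += 1.0
        X.append(row)

    weights = [0.0] * len(vocab)
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad_w = [l2 * w for w in weights]  # gradient of the L2 penalty
        grad_b = 0.0
        for row, y in zip(X, labels):
            z = bias + sum(w * x for w, x in zip(weights, row))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(political)
            err = p - y
            for j, x in enumerate(row):
                grad_w[j] += err * x / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return vocab, weights, bias


# Invented toy data: 1 = political, 0 = not political.
docs = [
    "senate passes budget bill",
    "president signs tax bill",
    "team wins final game",
    "new recipe for apple pie",
]
labels = [1, 1, 0, 0]

vocab, weights, bias = train_logistic(docs, labels)
betas = dict(zip(vocab, weights))

# Words sorted by strength of association, strongest first.
for word in sorted(betas, key=lambda w: abs(betas[w]), reverse=True):
    print(f"{word}: {betas[word]:+.3f}")
```

The betas that come out are what feed the word cloud: positive ones would be drawn in orange, negative ones in blue, with size set by the absolute value.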