Friday, May 28, 2010

Software for text mining, especially unsupervised document clustering

A friend in the department recently asked me about software for text mining. Among other things, she was looking for programs that do "unsupervised document clustering." I went through my notes and did some web searching and came up with some promising options.

I haven't worked with any of these directly (unsupervised learning is a step removed from the stuff I do), but I figured the results of the search were worth passing on.

One option close to hand is WordStat on the computer in the [UM political science] bullpen. It supports clustering and is pretty easy to use.

Another option is Justin Grimmer's Galileo package. I don't know if he's made this publicly available yet. Last I heard he was trying to patent and maybe market it. Grimmer is one of Gary King's students; he was on the job market this past year. One plus to using Grimmer's work is that he's published in polisci journals, so his methods already have good credibility within the field.

A third option: RapidMiner. I haven't used this, but it's free, well-documented, and fits the bill for unsupervised document clustering.

Like I said, I haven't worked with any of these directly. Anybody have good/bad experiences with this kind of software?

1 comment:

Anonymous said...

I have used RapidMiner a lot for text mining during my PhD. Really a great piece of software (especially the latest version). Thumbs up for this one.

Best,
Marc