
Monday, August 1, 2011

Mining and visualizing Twitter from RStudio in EC2

Here's the code I'm going to use for my ICPSR class today. It's set up to run immediately from an EC2 instance of my AMI, agongRStudio2 (ID: ami-1bb47272). It's the simplest introduction to text mining I've been able to pull together so far.

Step-by-step instructions for getting started in EC2 are here (pdf and docx). These are intended to get you started in command-line R. For this exercise, we want to use the RStudio GUI instead, so there are a few changes.

1. On step 6, use this Community AMI:
agongRStudio2 / ami-1bb47272

2. On step 8, you don't need to download a keypair; choose "Proceed without a keypair" instead.

3. On step 9, you also need to open port 8787, the port the RStudio server listens on.

4. On step 11, stop following the tutorial. Instead, open your EC2 URL in your browser with port 8787 appended. It will look something like this:
http://ec2-123-45-67-890.compute-1.amazonaws.com:8787/

5. I'll give out the username and password in class. If you're not in the class, email me and I can clue you in.

6. Here's a first script to run:

library(twitteR)
library(tm)
library(wordcloud)
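#These packages are already baked into the AMI; if you're running elsewhere,
#install them first with install.packages(c("twitteR", "tm", "wordcloud"))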

#Grab the 200 most recent tweets about #bachmann
#http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
k = 200
my_tweets <- searchTwitter("#bachmann", n=k)
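#(Sanity check, not in the original script:) each element of my_tweets is a
#twitteR status object; peek at one to see the $text field we pull out below
my_tweets[[1]]$text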

#Convert tweet status objects to text
#http://www.r-bloggers.com/word-cloud-in-r/
my_text <- data.frame(text = sapply(my_tweets, function(tweet) tweet$text))

#Convert text to a tm corpus object
my_corpus <- Corpus( DataframeSource( my_text ) )
my_corpus <- tm_map(my_corpus, removePunctuation)
my_corpus <- tm_map(my_corpus, tolower)
my_corpus <- tm_map(my_corpus, function(x) removeWords(x, stopwords("english")))
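#Note: newer releases of tm expect base functions to be wrapped in
#content_transformer(), e.g. tm_map(my_corpus, content_transformer(tolower));
#the calls above match the tm API as of this writing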

#Convert corpus to matrix
tdm <- TermDocumentMatrix(my_corpus) #to use tf-idf weighting instead, pass control = list(weighting = weightTfIdf)
m <- as.matrix(tdm)

#Get features and frequencies
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
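#(Optional, not in the original:) sanity-check the top terms before plotting
head(d, 10)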

#Display as a word cloud
wordcloud(d$word, d$freq, min.freq=5, use.r.layout=TRUE, vfont=c("sans serif","plain"))
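#(Optional variation, not in the original:) wordcloud's colors argument takes
#a palette, e.g. brewer.pal(8, "Dark2") from the RColorBrewer package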


#Basic cluster analysis of words
#From: http://www.statmethods.net/advstats/cluster.html
m2 <- m[rowSums(m) > 15,] #keep only words that appear more than 15 times

dist_matrix <- dist(m2, method = "euclidean") # distance matrix
fit <- hclust(dist_matrix, method="ward") #newer versions of R call this method "ward.D"
plot(fit) # display dendrogram
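If you want to pull actual groups out of the tree, base R's cutree and rect.hclust will do it. A minimal sketch, with five clusters as an arbitrary illustrative choice:

#Cut the tree into 5 groups (5 is arbitrary; eyeball the dendrogram to choose)
groups <- cutree(fit, k=5)
#Outline those 5 groups on the dendrogram that's already plotted
rect.hclust(fit, k=5, border="red")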
We're going to try this in class. I have 20 minutes budgeted, so hopefully it really is this easy.

PS - Don't forget to terminate your EC2 instance when you're done, or you'll use up your free hours and then run up a smallish (~50 cents/day) Amazon bill until you remember.

Wednesday, June 22, 2011

Political science and big data resources from JITP

I'm cleaning out papers from my files, tired of carrying around all these dead trees. Here are notes on nifty resources mentioned at the JITP conference a few weeks ago.

TDT: Topic Detection and Tracking (http://projects.ldc.upenn.edu/TDT/)

Socrata, the Open data company (http://www.socrata.com/)

Google's Data Liberation Front (http://www.dataliberation.org/)

TESS: Time-sharing Experiments for the Social Sciences (http://www.tessexperiments.org/)

TREC (Text Retrieval Conference) benchmark data sets (http://trec.nist.gov/data.html)

And the good old American National Election Studies (ANES) (http://www.electionstudies.org/)

Saturday, April 16, 2011

What does it take to be a data scientist?

Conway on what it takes to be a data scientist (@ ZIA, ht: Mike B).


The full article is here. It's short and sweet, and offers a nice counterpoint to some of the claims made by people with a more computer-science-centric view of the world. Turns out that modeling assumptions (i.e. math and statistics) and theory (i.e. substantive expertise) matter. You ignore them at your own risk.

PS: The title makes it sound like this is about U.S. intelligence, but almost all the points in the article apply to business and academia as well.