Last week I posted instructions to get started with RStudio on Amazon's Elastic Compute Cloud. Here are new and improved* instructions (docx, pdf). These steps should be enough to get you into the cloud in 15 minutes or less, for free.
Please let me know in the comments if you have any trouble or questions with this demo. I'm trying to lower the startup costs for people to do computational social science, so I'm happy to be a resource for others working their way down the cloud computing path.
Cheers!
PS - The instance comes with several fun R libraries pre-installed: tm, igraph, and twitteR.
*I've 1) dropped several steps that aren't necessary for running RStudio, 2) added a few screenshots, and 3) clarified a few steps that were giving people trouble. Thanks again to Kevin J. for putting together the original slides.
Politics, lifehacking, data mining, and a dash of the scientific method from an up-and-coming policy wonk.
Showing posts with label cloud computing. Show all posts
Monday, August 8, 2011
Monday, August 1, 2011
Mining and visualizing twitter from RStudio in EC2
Here's code I'm going to use for my ICPSR class today. This is set up to run immediately from an EC2 instance of my AMI agongRStudio2 (ID:ami-1bb47272). It's the simplest introduction to text mining I've been able to pull together so far.
Step-by-step instructions for getting started in EC2 are here (pdf and docx). These are intended to get you started in command-line R. For this exercise, we want to use the RStudio GUI instead, so there are a few changes.
1. On step 6, use this Community AMI: agongRStudio2 / ami-1bb47272
2. On step 8, you don't need to download the keypair. "Proceed without a keypair" instead.
3. On step 9, you also need to enable port 8787, the port the RStudio server uses.
4. On step 11, stop following the tutorial. Instead, open up your EC2 URL in your browser, with port 8787. It will look something like this:
http://ec2-123-45-67-890.compute-1.amazonaws.com:8787/
5. I'll give out the username and password in class. If you're not in the class, email me and I can clue you in.
6. Here's a first script to run:
We're going to be trying this in class. I have 20 minutes budgeted, so hopefully it's really this easy.
library(twitteR)
library(tm)
library(wordcloud)
#Grab the 200 most recent tweets about #bachmann
#http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for
k = 200
my_tweets <- searchTwitter("#bachmann", n=k)
#Convert tweet status objects to text
#http://www.r-bloggers.com/word-cloud-in-r/
#Loop over the tweets actually returned, since the search can come back with fewer than k
my_text <- data.frame( text=unlist( lapply( my_tweets, function(x){x$text} ) ) )
#Convert text to a tm corpus object
my_corpus <- Corpus( DataframeSource( my_text ) )
my_corpus <- tm_map(my_corpus, removePunctuation)
my_corpus <- tm_map(my_corpus, tolower)
my_corpus <- tm_map(my_corpus, function(x) removeWords(x, stopwords("english")))
#Convert corpus to matrix
tdm <- TermDocumentMatrix(my_corpus)#, control = list(weighting = weightTfIdf))
m <- as.matrix(tdm)
#Get features and frequencies
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#Display as a word cloud
wordcloud(d$word,d$freq,min.freq=5,use.r.layout=T,vfont=c("sans serif","plain"))
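#Basic cluster analysis of words
#From: http://www.statmethods.net/advstats/cluster.html
#Rows of m are words, so filter on rowSums to keep words used more than 15 times
m2 <- m[rowSums(m)>15,]
dist_matrix <- dist(m2, method = "euclidean") # distance matrix
fit <- hclust(dist_matrix, method="ward") # this method is called "ward.D" in R 3.1.0 and later
plot(fit) # display dendrogram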
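If you want to see what the matrix and clustering steps are doing without a Twitter connection, here's a minimal sketch in base R. It is not the tm pipeline itself: the four "tweets" and the two-word stopword list are made up for illustration, and the counting is done by hand with tabulate() instead of TermDocumentMatrix(). I've also used "ward.D", the newer name for the Ward method, on the assumption that you're running a recent version of R.

```r
# Made-up stand-ins for tweet text (hypothetical, not real search results)
toy_tweets <- c(
  "Bachmann wins the Iowa straw poll!",
  "Iowa straw poll: Bachmann wins",
  "Debate tonight in Iowa",
  "Watching the debate tonight"
)

# Lowercase and strip punctuation, roughly what the tm_map() calls do
clean <- gsub("[[:punct:]]", "", tolower(toy_tweets))

# Split into words and drop a tiny hand-picked stopword list (an assumption)
toy_stopwords <- c("the", "in")
words <- lapply(strsplit(clean, "[[:space:]]+"),
                function(w) w[!(w %in% toy_stopwords)])

# Term-by-document count matrix: one row per word, one column per tweet
vocab <- sort(unique(unlist(words)))
m <- vapply(words,
            function(w) tabulate(match(w, vocab), nbins = length(vocab)),
            integer(length(vocab)))
dimnames(m) <- list(vocab, paste0("tweet", seq_along(toy_tweets)))

# Word frequencies, most common first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# Keep words that appear at least twice, then cluster them by
# which tweets they show up in
m2 <- m[rowSums(m) >= 2, ]
fit <- hclust(dist(m2, method = "euclidean"), method = "ward.D")

# Cut the tree into two groups of words
groups <- cutree(fit, k = 2)
print(d)
print(groups)
```

Words that always appear in the same tweets have distance zero and land in the same branch, which is the pattern to look for in the dendrogram when you call plot(fit) on real data.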
PS - Don't forget to terminate your EC2 instance when you're done, or you will use up your free hours, then run up a smallish (~50 cents/day) Amazon bill until you remember.
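For a rough sense of how that adds up, here's a back-of-the-envelope sketch. The function name estimate_bill is made up, and the default rate is just my ~50 cents/day ballpark from above, not an official Amazon price, so check your own billing page for real numbers.

```r
# Rough bill for an instance left running after the free hours are gone.
# daily_rate is the ballpark figure from the post (an assumption, not an AWS price).
estimate_bill <- function(days_forgotten, daily_rate = 0.50) {
  stopifnot(days_forgotten >= 0, daily_rate >= 0)
  days_forgotten * daily_rate
}

# A month of forgetting, in dollars
month_cost <- estimate_bill(30)
print(month_cost)
```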
Monday, July 11, 2011
+Computation: Got an AWS in Education grant!
I just received a generous grant for usage on Amazon's Web Services -- cloud computing, storage space, and bandwidth. This is just in time for a bunch of heavy-duty text crunching I've been planning to do. Thank you, Amazon!
Monday, April 25, 2011
The best $5 I've spent all year
I finally started using Amazon's EC2 yesterday. I've been meaning to learn it forever, but assumed it would be time-consuming to get registered, set up an instance, and so on.
Not true. Thursday morning at 10am, I registered for EC2. By 10:30 I had an instance of Drew Conway's Py/R AMI up and running, with several additional libraries installed, and a few GB of data I wanted to crunch uploaded to the server. Very fast turnaround.
Eight hours and $4.77 later, I'd crunched a lot of numbers -- by far my most productive workflow all week. Highly recommend it.