Thursday, March 19, 2015

Twitter Sentiment Analysis in R

This has been one of the hardest projects I've taken on.

I've never been asked to do this. It's just for fun, but it was challenging. I'm used to coding in Java, and since I figured using R might help me in the long run, I thought it would be nice to be able to do some things worth mentioning.
With that being said, I got interested in using Twitter's API to analyze tweets, and I ultimately came across this YouTube video on Twitter Mining and Sentiment Analysis.



I liked the video. Michael Herman admitted he wasn't very experienced with R at the time he made it, but he still managed to execute the code and get sentiment analysis from the tweets he extracted. This particular method of sentiment analysis seems to be widely used as far as R-Twitter tutorials are concerned. There are a few problems, though. One, this video was published in 2012, which is important to note because both the R versions and the Twitter API have changed since then. Two, the R code for the sentiment scores function could use a few tweaks, if not a complete redo. And three, there aren't many channels with an updated version of this tutorial.

The sentiment scores function (or method) is just an R function someone wrote to help the user with sentiment analysis, a process aimed at discerning the widespread opinion or sentiment on any given topic, idea, product, or person. The scores function lets us treat the group of words we're interested in analyzing in a quantifiable way, identifying and categorizing opinions numerically, especially in order to determine whether someone's attitude toward a particular topic, product, etc., is positive (greater than zero), negative (less than zero), or neutral (zero).

I've been looking around for good, functional code that would reproduce what Herman did in his video, because I ran into some problems. It wasn't easy, but after playing around with the code and doing a lot of searching (I still consider myself a beginner with R), I successfully analyzed the Twitter data. You should watch the video to see where the changes were made.

Here is my R code for the Sentiment Analysis:

Importing the data

The code provided in the video is outdated and thus will not work because Twitter changed the way a user can access its API. Now you're going to need authentication in order to grab tweets. In order to get that authentication, you're going to need to create a Twitter app (it's pretty easy). The authentication comes in the form of keys and tokens. Luckily, the twitteR package has been updated to accommodate the change.
 library(twitteR); library(plyr)

 setup_twitter_oauth("API key", "API secret", "Access Token", "Access Token Secret")     
Like the example in the video, we will search for '#abortion', but instead of searching for 1,500 tweets, I will search for 200 because (1) R can take a while just to grab 30 tweets, and (2) I don't feel like waiting too long; I just want to show that the code works.
tweets = searchTwitter("abortion", n=200) 

length(tweets) 

The Algorithm

The next thing to do, if you haven't already, is to download the documents that Michael mentioned in the video. I did this, and it turned out to be messy. I don't know why, but when I went to the links, the text files I downloaded had HTML formatting in them, which complicated the process considerably. Instead, I used this link provided by Bing Liu, which gives you a rar file containing the two files you need. These files are just lists of positive (positive-words.txt) and negative (negative-words.txt) words. We will need them to match the list of tweets against the words from both files.

To make these easier to call in R, save or place them in the same folder as your R working directory. For instance, you can set your desktop to be the working directory (the place R expects to find files), and then you won't have to worry about typing each file's exact location, because R already knows where to look.
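Here's a quick sketch of that setup, assuming the two word lists from Bing Liu's rar file were saved to the desktop (the comment.char = ";" argument tells scan() to skip the header lines those files begin with):

```r
setwd("~/Desktop")   # make the desktop the working directory
getwd()              # confirm where R will look for files

# Read each list into a character vector, skipping the ';' header lines
pos <- scan("positive-words.txt", what = "character", comment.char = ";")
neg <- scan("negative-words.txt", what = "character", comment.char = ";")
```

After this, pos and neg are plain character vectors, which is exactly the form the scoring function below expects.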

Now, the meaty part of this tutorial is the score.sentiment() function. This is what actually gave me the most problems, because if it isn't right, your analysis (for this tutorial) isn't going anywhere. I checked Michael's version (the code I saw circulating around the internet the most), and I tried Silvia Planella's updated version; both times I encountered errors, and they were incredibly frustrating. With Michael's code, I got confused because there were no words I wished to exclude, but the function appeared to require them (through its 'exc.words' argument); Michael didn't need to provide any, yet R wouldn't let me continue unless I did. Planella's version removed the 'exc.words' argument, which eliminates that problem, but her code never accounted for characters that R cannot recognize.

For instance, here's one of the tweet messages I had extracted: 'Awarded €2,000 & incited change. 2 yrs later #abortion was legalized in Portugal í ½í¹Œ @Vesselthefilm @BostonDoulas #reprojustice'. When I try to process the messages in the score.sentiment() function, the entire process would provide an error that looked like this:
 Error during wrapup: invalid input 'Awarded €2,000 & incited change. 2 yrs later #abortion was legalized in Portugal í ½í¹Œ @Vesselthefilm @BostonDoulas #reprojustice' in 'utf8towcs'   

If you're not used to this kind of thing, then I bet you're going to look at it like I did... heck, maybe even experts have problems with this from time to time.


The problem is that some characters are not valid, so if you come across this error, you'll have to find a way to exclude the invalid characters from the analysis, either from within R or before importing the files for processing.
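One way to do the latter (a sketch, assuming the tweet text arrives as UTF-8) is to transcode each tweet to plain ASCII with base R's iconv(), dropping anything that won't convert:

```r
# sub = "" deletes characters that cannot be represented in ASCII
# instead of raising an error
clean_text <- function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = "")

clean_text("Awarded \u20ac2,000 & incited change")  # the euro sign is dropped
```

This removes the problem characters up front; the try-catch fix below instead catches the error when it happens, and either approach unblocks the analysis.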

Thankfully, I found someone with valuable experience in this situation: Gaston Sanchez. From his blog post, I learned that the error occurs when these unrecognized characters pass through the tolower() function used in the sentiment score function, so I updated the code by adding a try-catch to account for these potential errors; problem fixed.
 score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  
  # we got a vector of sentences. plyr will handle a list
  # or a vector as an "l" for us
  # we want a simple array of scores back, so we use
  # "l" + "a" + "ply" = "laply":
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    
    # convert to lower case:
    # Instead of a regular tolower function, make a try-catch function
    tryTolower = function(sentence)
    {
      # create missing value
      # this is where the returned value will be
      y = NA
      # tryCatch error
      try_error = tryCatch(tolower(sentence), error = function(e) e)
      # if not an error
      if (!inherits(try_error, "error"))
        y = tolower(sentence)
      return(y)
    }
    sentence = tryTolower(sentence)
    
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
} 
The score.sentiment() function returns tabular data with multiple columns and multiple rows. In R, the data.frame is the workhorse for such spreadsheet-like data.

Subsequent Analyses

After you use the function I provided, you should be able to get similar results and the same functionality as it is in the video. Cheers.
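One note before the call below: tweets.text, pos, and neg aren't defined earlier in this post. Here is one way to build them (a sketch; getText() is the accessor on the status objects that twitteR's searchTwitter() returns, and the word lists are the Bing Liu files from earlier):

```r
# Pull the raw text out of each status object returned by searchTwitter()
tweets.text <- sapply(tweets, function(t) t$getText())

# Load the positive and negative word lists, skipping the ';' header lines
pos <- scan("positive-words.txt", what = "character", comment.char = ";")
neg <- scan("negative-words.txt", what = "character", comment.char = ";")
```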
 analysis = score.sentiment(tweets.text, pos, neg, .progress="text")    
> table(analysis$score)

-3 -2 -1  0  1  2  3 
 1  5 86 54 31 21  2 
> median(analysis$score)
[1] 0
> mean(analysis$score)
[1] -0.1
> hist(analysis$score)

I think there is more that can be done with sentiment analysis, but for now this is good enough. I checked out this Villanova University paper, and it provided a neat template for a sentiment analysis function. Maybe I'll be able to contribute to this someday.
