Introduction: I completed this project for my Social Media Analytics course during my MS in Business Analytics program. The goal was to classify airline tweets as complaints or non-complaints; the sections below walk through the full workflow.
The first block of code loads all required packages for the rest of the project.
rm(list=ls()) #clear the workspace
library(DBI) #database interface
library(RMySQL) #MySQL driver
library(tm) #text mining: corpus and document-term matrix tools
library(e1071) #SVM implementation
library(pROC) #ROC utilities (loaded here, though not used in the code shown)
library(readr) #fast CSV reading
library(data.table) #data.table structures
library(plyr) #rbind.fill, used later to align term matrices
Next, I needed to pull the tweets from a SQL database.
Note: the SQL credentials are no longer active, so the pull below is shown for reference; the rest of the project uses locally saved data.
driver <- dbDriver("MySQL")
myhost <- "localhost"
mydb <- "studb"
myacct <- "cis434"
mypwd <- "LLhtFPbdwiJans8F@S207"
conn <- dbConnect(driver, host=myhost, dbname=mydb, username=myacct, password=mypwd)
airline_data <- dbGetQuery(conn, "SELECT id, airline, tweet FROM AirlineTweets WHERE tag='bZiq4j98$Vru'")
dbDisconnect(conn)
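Since those credentials are dead, the pull above can no longer be re-run. A minimal local fallback might look like the following (the file name is a hypothetical placeholder; the actual saved data is loaded later from an .RData file):
#Hypothetical fallback: read the same three columns from a locally saved CSV
airline_data <- data.table(read_csv("airline_tweets_backup.csv")) #placeholder file name, not an actual project file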
Below, I define a simple F1 evaluation function to assess model performance.
Evaluation <- function(pred, true, class)
{
  tp <- sum( pred==class & true==class ) #true positives
  fp <- sum( pred==class & true!=class ) #false positives
  tn <- sum( pred!=class & true!=class ) #true negatives (not needed for F1)
  fn <- sum( pred!=class & true==class ) #false negatives
  precision <- tp/(tp+fp)
  recall <- tp/(tp+fn)
  F1 <- 2/(1/precision + 1/recall) #harmonic mean of precision and recall
  F1
}
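As a quick sanity check on toy vectors (not part of the original script): one false positive and no false negatives gives precision 1/2, recall 1, and F1 of 2/3.
pred_toy <- c(1, 1, 0, 0) #toy predictions, for illustration only
true_toy <- c(1, 0, 0, 0) #toy labels
Evaluation(pred_toy, true_toy, 1) #precision = 0.5, recall = 1, F1 ≈ 0.667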
The next block of code creates the training and test data. Specifically, it reads the labeled tweets, builds a document-term matrix, and splits the rows 75/25:
#Training Data
noncomplaints <- data.table(read_csv("C:/Users/isaac/Desktop/School/Social Media Analytics/noncomplaint1700.csv"))
complaints <- data.table(read_csv("C:/Users/isaac/Desktop/School/Social Media Analytics/complaint1700.csv"))
#Assign Y_labels
noncomplaints$sentiment = 1
complaints$sentiment = 0
#Create Corpus for Training
testTweets = rbind(noncomplaints, complaints) #all 3,400 labeled tweets (despite the name, this holds both the training and test rows)
docs <- Corpus(VectorSource(testTweets$tweet))
#Create Doc Term Matrix to use as X Variable
dtm.control = list(tolower=T, removePunctuation=T, removeNumbers=T,
stopwords=c(stopwords("english")), stripWhitespace=T)
dtm.full <- DocumentTermMatrix(docs, control=dtm.control)
dtm <- removeSparseTerms(dtm.full,0.99)
#Create X and Y Values
X <- as.matrix(dtm)
Y = testTweets$sentiment
#Create Training and Test Data (75/25 Ratio)
set.seed(1) # fixing the seed value for the random selection guarantees the same results in repeated runs
n=length(Y)
n1=round(n*0.75) #training set size (2550)
n2=n-n1 #test set size (850)
train=sample(1:n,n1) #row indices of the training sample
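As the file names suggest, each CSV holds 1,700 tweets, so the split works out to 2,550 training rows and 850 test rows. A quick check (not in the original script):
length(train) #2550 training rows (75% of 3400)
n2 #850 held-out test rows
table(Y[train]) #class balance within the training sample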
With the data split and processed, I can train a Support Vector Machine to predict tweet sentiment.
Note: I also tested other models (logistic regression, Naive Bayes, etc.), and the SVM performed best. I then tuned the model's parameters to maximize performance; a sketch of that search follows.
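The tuning code itself isn't shown above; this is a sketch of how the gamma search could be reproduced with e1071's tune.svm (the grid values are illustrative, not the grid I actually searched):
#Illustrative cross-validated search over gamma (grid values are examples only)
tuned <- tune.svm(x = X[train,], y = Y[train], kernel = 'radial',
                  gamma = c(1e-5, 5e-5, 1e-4))
summary(tuned) #reports the cross-validation error for each gamma value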
svm.model <- svm(Y[train] ~ ., data = X[train,], kernel='radial', gamma = .000058) #numeric Y, so svm() runs in regression mode and returns continuous scores
pred <- predict( svm.model, X[-train,] )
pred.class <- as.numeric( pred>0.6 ) #label scores above 0.6 as non-complaint; try varying the threshold
table(pred.class, Y[-train])
##
## pred.class 0 1
## 0 206 31
## 1 233 380
Evaluation( pred.class, Y[-train], 1 )
## [1] 0.7421875
#Best model found; retrain it on the full labeled data
svm.model <- svm(Y ~ ., data = X, kernel='radial', gamma = .000058)
The model achieved an F1 score of 0.742 for the non-complaint class on the held-out test set (about 69% accuracy from the confusion matrix above), well above chance on this roughly balanced sample.
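For reference, both numbers can be read off the confusion matrix printed above (a quick check, not in the original script):
cm <- table(pred.class, Y[-train]) #confusion matrix on the test rows
sum(diag(cm)) / sum(cm) #accuracy: (206 + 380) / 850 ≈ 0.689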
Now, I can use the model to predict airline sentiments.
The code block below repeats most of the preprocessing used for the training data, with a few differences: it loads the unlabeled project tweets, and it pads and aligns the new document-term matrix so its columns match the training vocabulary.
load("C:/Users/isaac/Desktop/School/Social Media Analytics/socialProject_isaac.RData") #load tweets data for project predicts
docs = Corpus(VectorSource(tweets$tweet))
#Create Doc Term Matrix to use as X Variable
dtm.control = list(tolower=T, removePunctuation=T, removeNumbers=T,
stopwords=c(stopwords("english")), stripWhitespace=T)
dtm.full <- DocumentTermMatrix(docs, control=dtm.control)
dtm <- removeSparseTerms(dtm.full,0.99)
#Create X values for prediction (these tweets are unlabeled, so there is no Y)
X_final <- as.matrix(dtm)
xx <- data.frame(X_final[,intersect(colnames(X_final),
colnames(X))]) #keep only the terms that also appeared in the training DTM
yy <- read.table(textConnection(""), col.names = colnames(X),
colClasses = "integer") #empty table whose columns are all terms in the training DTM
xx = data.table(xx)
zz <- rbind.fill(xx, data.table(yy)) #combine xx and yy so the result has every training-DTM term as a column
colnames(zz)[colnames(zz) == 'next.'] = 'next' #data.frame() renamed the term next to next.; restore the original name
zz = data.table(zz) #convert to data.table so it can then be converted to a matrix (as.matrix failed on the plain data.frame; unclear why)
zz = as.matrix(zz)
zz[is.na(zz)] = 0 #replace the NAs introduced by rbind.fill with zeros
#make the predictions
pred <- predict( svm.model, zz )
pred.class <- as.numeric( pred>0.7 ) #label scores above 0.7 as non-complaint (threshold raised from 0.6; see below)
#create the CSV file of predicted non-complaints
isaac_revette = tweets
colnames(isaac_revette) = c('id', 'my_eval', 'tweet')
isaac_revette = isaac_revette[pred.class == 1,] #keep only the tweets predicted as non-complaints
isaac_revette$my_eval = NA #clear the placeholder evaluation column
#write.csv(isaac_revette, 'isaac_revette.csv', row.names = FALSE)
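For what it's worth, the xx/yy/zz steps above simply align the new document-term matrix with the training vocabulary. An equivalent, more direct sketch (not the code I actually ran) builds a zero matrix with the training columns and copies over the shared terms, which also sidesteps the next/next. name mangling:
#Equivalent alignment sketch: zero matrix with the training-DTM columns
Z <- matrix(0, nrow = nrow(X_final), ncol = ncol(X),
            dimnames = list(NULL, colnames(X)))
common <- intersect(colnames(X_final), colnames(X)) #terms present in both DTMs
Z[, common] <- X_final[, common] #copy counts for the shared terms
#predict(svm.model, Z) should then score the same tweets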
Going into the project, I had a rough expectation of how many of the validation tweets should be complaints. After reviewing the predictions, the model was labeling too many tweets as non-complaints, so I raised the threshold score for classifying a tweet as a non-complaint from 0.6 to 0.7. With the new threshold, I identified 323 non-complaint tweets, which seemed more plausible.
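A quick way to see how sensitive that count is to the cutoff (illustrative, not part of the original script):
sapply(c(0.5, 0.6, 0.7), function(th) sum(pred > th)) #predicted non-complaints at each cutoff; 323 at 0.7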
The resulting precision (of the tweets predicted as non-complaints, the share that truly were) was around 61%, and the recall (of the actual non-complaint tweets, the share the model found) was around 90%.