About the data

Our project concerns the mobilization of the Islamic State militant group, ISIS. While many aspects of ISIS are integral to understanding its growth and mobilization, we will focus on its growth as an organization and on how its supporters spread information. On November 13, 2015, coordinated attacks in Paris, including suicide bombings outside a crowded sports stadium and shootings across the city, killed 130 people and injured approximately 400 others. Since those attacks, ISIS has claimed responsibility for various other attacks in both Europe and America. What stands out is the way ISIS has been mobilizing and the difficulty intelligence agencies have had in tracking it. The group makes heavy use of social media, specifically Twitter, which is where our dataset derives from.

Since the Paris attacks, over 17,000 tweets from more than 100 pro-ISIS Twitter accounts have been scraped; this is the data that we will be using. The data consists of the following attributes: name, username, description, location, followers, numberstatuses, time, and tweets. Numberstatuses is the user's total number of statuses at the time the tweet was downloaded, and followers is the number of followers at that time. The last attribute, tweets, is the character string of the tweet itself.

Preprocessing and Exploring the Data

We began by exploring our dataset to understand more thoroughly what sort of information we had been given. We knew that the majority of our work would involve text analysis, but we first wanted to see what we could learn about the data before delving into processing individual tweets.
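Throughout the analysis we rely on a handful of packages loaded in a setup chunk that the knit does not show. A reconstruction of that chunk, for reference:

library(stringr)   # str_conv()
library(ggplot2)   # all plots
library(maps)      # world map polygons used by borders()
library(ggmap)     # geocode()
library(tm)        # VCorpus(), tm_map(), TermDocumentMatrix(), findFreqTerms()
library(qdap)      # strip(), polarity(), qdapDictionaries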

tweets <- read.csv("tweets.csv")

summary(tweets)
##                 name                username
##  Rami             : 1475   Uncle_SamCoco: 1580
##  War BreakingNews : 1191   RamiAlLolah  : 1475
##  Conflict Reporter: 1095   warrnews     : 1191
##  Salahuddin Ayubi : 1056   WarReporter1 : 1095
##  Ibni Haneefah    :  709   mobi_ayubi   : 1056
##  wayf44rer        :  420   _IshfaqAhmad :  709
##  (Other)          :11464   (Other)      :10304
##                                                                                                                                                          description
##                                                                                                                                                                :2682
##  Here to defend the  American freedom and also the freedom of coconut . Cat Lover or Hater. Kebab Fan . We're all living in America, America ist wunderbar #USA:1580
##  Real-Time News, Exclusives, Intelligence & Classified Information/Reports from the ME. Forecasted many Israeli strikes in Syria/Lebanon. Graphic content.     :1475
##  we provide fresh news from every battlefield                                                                                                                  :1191
##  Journalist, specialize in ongoing war against terrorism.\nRetweet is not endorsement.                                                                         :1056
##  Reporting on conflicts in the MENA and Asia regions. Not affiliated to any group or movement.                                                                 : 718
##  (Other)                                                                                                                                                       :8708
##                                           location      followers
##                                               :5978   Min.   :   16
##  Read my blog                                 :1475   1st Qu.:  266
##  world                                        :1191   Median :  928
##  Worldwide contributions                      : 998   Mean   : 3975
##  Texas, USA                                   : 993   3rd Qu.: 1791
##  [mis-encoded Urdu location]              : 709   Max.   :34692
##  (Other)                                      :6066
##  numberstatuses               time
##  Min.   :    1   4/14/2016 20:10:   18
##  1st Qu.:  207   4/20/2016 22:52:   17
##  Median :  908   4/17/2016 0:19 :   16
##  Mean   : 4761   4/25/2016 2:34 :   15
##  3rd Qu.: 6865   2/12/2016 19:06:   13
##  Max.   :33091   4/25/2016 2:33 :   13
##                  (Other)        :17318
##                                                                                                                                                                                                   tweets
##  'Free Whores of #Kurdistan' claims responsibility of #Ankara blast killed tens of innocents.. #Turkey #TwitterKurds https://t.co/6u6DsoDu5t                                                         :    1
##  'Inspected' VSO mercenaries (aka. FSA) reportedly looting al-Rai border town after #ISIS withdrawal.. #Syria https://t.co/zV2JMW2TaY                                                                :    1
##  'Muslim' leaders of later generation give kuffars sword as a gift for kiling Muslims while Muslims of earlier generation used it differently                                                        :    1
##  [mis-encoded Arabic tweet]\nhttps://t.co/4jnjMrwkKs:    1
##  'Terrorism' was fighting #Iran &amp; its sectarian brutal proxies in #Iraq since 2003 when majority of dumb #Syria|ns were cheering #Hezbollah..                                                    :    1
##  'tis thus with the pupil of the eye; men think it black, though merely (concentrated) light.                                                                                                        :    1
##  (Other)                                                                                                                                                                                             :17404
head(tweets)
##            name        username
## 1 GunsandCoffee GunsandCoffee70
## 2 GunsandCoffee GunsandCoffee70
## 3 GunsandCoffee GunsandCoffee70
## 4 GunsandCoffee GunsandCoffee70
## 5 GunsandCoffee GunsandCoffee70
## 6 GunsandCoffee GunsandCoffee70
##                                    description location followers
## 1 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
## 2 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
## 3 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
## 4 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
## 5 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
## 6 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews                640
##   numberstatuses           time
## 1             49 1/6/2015 21:07
## 2             49 1/6/2015 21:27
## 3             49 1/6/2015 21:29
## 4             49 1/6/2015 21:37
## 5             49 1/6/2015 21:45
## 6             49 1/6/2015 21:51
##                                                                                                                                         tweets
## 1     ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI: http://t.co/73xFszsjvr http://t.co/x8BZcscXzq
## 2 ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS  EASY' http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr
## 3                    ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA): http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6
## 4  ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: 'THE PROMISE OF VICTORY': http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC
## 5            ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT 'ALTHOUGH THE DISBELIEVERS DISLIKE IT.' http://t.co/2EYm9EymTe
## 6                             THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIER OF JN: Video Link :http://t.co/EPaPRlph5W http://t.co/4VUYszairt
tweets$tweets = as.character(tweets$tweets)        # factor -> character
tweets$tweets = str_conv(tweets$tweets, "UTF-8")   # normalize the encoding
tweets$location = as.character(tweets$location)
# Parse the timestamp; as.Date() keeps only the date, which is all we need
tweets$time <- as.Date(tweets$time, format = "%m/%d/%Y %H:%M")

sum(complete.cases(tweets))
## [1] 17410
# Blank locations become NA so complete.cases() can isolate the usable rows
for(i in 1:length(tweets$location)){
  if(tweets[i, "location"] ==  "") {
    tweets[i, "location"] = NA
  }
}
rm(i)

fullLoc <- tweets[which(complete.cases(tweets)),]
ggplot(fullLoc) + geom_bar(aes(x = location, fill = numberstatuses)) + coord_flip() +
  ggtitle("Number of Tweets Sent by location")

rm(fullLoc)


# Flag tweets mentioning terms tied to known events so we can plot them over time
tweets$containsKill = grepl("kill", tweets$tweets)
tweets$containsBomb = grepl("bomb", tweets$tweets)
tweets$alawite = grepl("Alawite", tweets$tweets)
tweets$zaynab = grepl("Zaynab", tweets$tweets)

We knew that there had been two ISIS-claimed bombings in February of 2016, one near the Sayyidah Zaynab shrine and one in an Alawite district, so we explored our dataset by graphing word frequency over time, expecting spikes in mentions of these two attacks around their respective dates. The graphs showed exactly the spikes we expected, so the data seems to make sense. Additionally, we graphed the frequency of the words "kill" and "bomb" over time; these curves matched the overall shape of the data, which was also expected, since it makes sense for such common words to be distributed evenly among the tweets.
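The figures themselves are omitted from this write-up, but each came from a call along the following lines (a sketch; the binwidth and labels are illustrative, not our exact code):

# Weekly histogram of tweets, colored by whether they mention "kill";
# swap in containsBomb, alawite, or zaynab for the other figures
ggplot(tweets, aes(x = time, fill = containsKill)) +
  geom_histogram(binwidth = 7) +   # time is a Date, so binwidth 7 = one week
  ggtitle("Tweets mentioning 'kill' over time")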

Next, we processed the tweets using the tm (text mining) package. Although no entries were missing, many tweets were written in Arabic, and we had to remove those. We then converted the data into a corpus, which treats each tweet as a document, and applied tm's built-in transformations: removing numbers, removing English stopwords (common function words such as "the", "of", "myself", and "theirs" that carry little meaning on their own), stripping punctuation while keeping hashtags, stemming each word to its root, and converting everything to lowercase. We also removed any links cited in the tweets; many tweets contained links, but most of them appear to have since been disabled by Twitter, so they are useless to us.

# Substitute non-ASCII characters with the marker "tw", then drop every tweet
# in which the marker appears; this removes the Arabic-language tweets.
# (Caveat: the marker also matches a literal "tw", as in "twitter", so some
# English tweets are discarded as well.)
tw = tweets$tweets
dat3 <- grep("tw", iconv(tw, "latin1", "ASCII", sub="tw"))
tweets = tweets[-dat3,]
rm(tw)
rm(dat3)
head(tweets)
##            name        username
## 1 GunsandCoffee GunsandCoffee70
## 2 GunsandCoffee GunsandCoffee70
## 3 GunsandCoffee GunsandCoffee70
## 4 GunsandCoffee GunsandCoffee70
## 5 GunsandCoffee GunsandCoffee70
## 6 GunsandCoffee GunsandCoffee70
##                                    description location followers
## 1 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 2 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 3 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 4 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 5 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 6 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
##   numberstatuses       time
## 1             49 2015-01-06
## 2             49 2015-01-06
## 3             49 2015-01-06
## 4             49 2015-01-06
## 5             49 2015-01-06
## 6             49 2015-01-06
##                                                                                                                                         tweets
## 1     ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI: http://t.co/73xFszsjvr http://t.co/x8BZcscXzq
## 2 ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS  EASY' http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr
## 3                    ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA): http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6
## 4  ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: 'THE PROMISE OF VICTORY': http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC
## 5            ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT 'ALTHOUGH THE DISBELIEVERS DISLIKE IT.' http://t.co/2EYm9EymTe
## 6                             THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIER OF JN: Video Link :http://t.co/EPaPRlph5W http://t.co/4VUYszairt
##   containsKill containsBomb alawite zaynab
## 1        FALSE        FALSE   FALSE  FALSE
## 2        FALSE        FALSE   FALSE  FALSE
## 3        FALSE        FALSE   FALSE  FALSE
## 4        FALSE        FALSE   FALSE  FALSE
## 5        FALSE        FALSE   FALSE  FALSE
## 6        FALSE        FALSE   FALSE  FALSE
dfCorpus = VCorpus(VectorSource(tweets$tweets))   # one document per tweet

dfCorpus <- tm_map(dfCorpus, removeNumbers)
dfCorpus <- tm_map(dfCorpus, removeWords, stopwords("english"))
# strip() is qdap's punctuation remover; char.keep = "#" preserves hashtags
dfCorpus <- tm_map(dfCorpus, content_transformer(strip), char.keep = "#")
dfCorpus <- tm_map(dfCorpus, stemDocument)        # reduce words to their stems
dfCorpus <- tm_map(dfCorpus, content_transformer(tolower))

# Remove URLs; tm v0.6 requires wrapping custom functions in content_transformer()
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)

dfCorpus <- tm_map(dfCorpus, content_transformer(removeURL))

# Pull the cleaned text back out of the corpus and attach it to the data frame
df = data.frame(text = unlist(sapply(dfCorpus, `[`, "content")),
    stringsAsFactors = F)
tweets$cleanedTweets = df$text
rm(removeURL)
rm(df)

writeLines(as.character(dfCorpus[[203]]))
## wilayatninawa answer allah swt khilafa group  pic nowlet see goe hell

N-Gram Analysis

After placing our data into a cleaned corpus, we could run n-gram analysis on the tweets. N-gram analysis helped us understand the data better by exposing frequent patterns in the tweets. We ran unigram, bigram, and trigram analyses and looked at the frequencies of the words and phrases most commonly used by the users in our dataset. We built a TermDocumentMatrix to count all sequences of one, two, and three consecutive words across all the tweets, then used findFreqTerms from the tm package to find the terms that occur at least lowfreq times. With the frequent terms and their occurrence counts, we can draw a bar graph for each n-gram analysis.

#unigrams
tdm <- TermDocumentMatrix(dfCorpus, control = list(wordLengths = c(1, Inf)))

(freq.terms <- findFreqTerms(tdm, lowfreq = 300))
##  [1] "#iraq"   "#is"     "#isi"    "#syria"  "allah"   "amp"     "armi"
##  [8] "assad"   "attack"  "citi"    "fight"   "forc"    "i"       "is"
## [15] "isi"     "islam"   "kill"    "muslim"  "near"    "now"     "one"
## [22] "report"  "rt"      "soldier" "state"   "the"     "today"   "us"
## [29] "will"
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 300)
df <- data.frame(term = names(term.freq), freq = term.freq)

ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  ggtitle("Frequent words used over 300 times") + xlab("Terms") + ylab("Count") + coord_flip()

rm(tdm)

#bigram
BigramTokenizer <-function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

bigramtdm <- TermDocumentMatrix(dfCorpus, control = list(tokenize = BigramTokenizer))

(freq.termsbi <- findFreqTerms(bigramtdm, lowfreq = 50))
##  [1] "#iraq_armi"        "#turkey_armi"      "deir_ezzor"
##  [4] "i_think"           "islam_state"       "kill_today"
##  [7] "may_allah"         "north_#aleppo"     "rt_nidalgazaui"
## [10] "rt_ramiallolah"    "rt_sparksofirhabi" "shia_militia"
## [13] "soldier_kill"      "terror_group"      "the_islam"
# as.matrix() (rather than inspect(), which prints) yields the counts matrix
z <- as.matrix(bigramtdm[freq.termsbi, ])
bigramdf <- data.frame(freq = rowSums(z))
bigramdf$bigrams <- row.names(bigramdf)
ggplot(bigramdf, aes(x = bigrams, y = freq)) + geom_bar(stat = "identity") +
  ggtitle("Frequent bigrams used over 50 times") + xlab("Bigrams") + ylab("Count") + coord_flip()

#trigram
TrigramTokenizer <-function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

trigramtdm <- TermDocumentMatrix(dfCorpus, control = list(tokenize = TrigramTokenizer))

(freq.termstri <- findFreqTerms(trigramtdm, lowfreq = 20))
## [1] "#amaqag_#islamicst_fighter" "#ypg_terror_group"
## [3] "fight_islam_state"          "islam_state_fighter"
## [5] "may_allah_accept"           "rt_afp_#break"
## [7] "rt_nidalgazaui_#break"      "rt_sparksofirhabi_the"
## [9] "the_islam_state"
a <- as.matrix(trigramtdm[freq.termstri, ])
trigramdf <- data.frame(freq = rowSums(a))
trigramdf$trigrams <- row.names(trigramdf)
ggplot(trigramdf, aes(x = trigrams, y = freq)) + geom_bar(stat = "identity") +
  ggtitle("Frequent trigrams used over 20 times") + xlab("Trigrams") + ylab("Count") + coord_flip()

The frequencies of the n-grams drop drastically from unigrams to bigrams to trigrams, since it is far less likely for the same two or three words to appear next to each other. To keep the graphs clean and show only the most frequent n-grams, we kept unigrams that occurred at least 300 times, bigrams that occurred at least 50 times, and trigrams that occurred at least 20 times. Looking at the three bar graphs above, we see exactly the kinds of terms we would expect in tweets from ISIS supporters. #isi (the stemmed form of #isis) is the second most common unigram, used 988 times, and could be one way to identify possible ISIS supporters. The frightening part is the prevalence of n-grams like "soldier kill", "terror group", and "attack"; seeing these terms used so often raises the question of how dangerous these users really are.
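Individual counts can be read straight off the unigram frequency vector computed above, for example:

term.freq["#isi"]   # stemmed form of #isis; about 988 occurrences in our run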

Sentiment Analysis

Given the nature of our data, we thought it would be interesting to measure the sentiment of the tweets in terms of positive and negative words. Using the qdap package, we obtained a sentiment score for each individual tweet. Initially we used qdap's built-in dictionary, but then realized it was missing words that are clearly relevant in the context of our data, such as behead, #is (ISIS), and #isis. Because we call polarity() with constrain = TRUE, the scores fall on a scale from -1 (extremely negative) to 1 (extremely positive).

# Extend qdap's default polarity dictionary with our domain-specific word lists
negativeWords = scan("negativeWords.txt", what = "character", sep = " ")
positiveWords = scan("positiveWords.txt", what = "character", sep = " ")
negWordsDf = data.frame(x = negativeWords, y = rep.int(-1, length(negativeWords)))
posWordsDf = data.frame(x = positiveWords, y = rep.int(1, length(positiveWords)))
newDictionary = qdapDictionaries::key.pol
newDictionary[which(newDictionary$x == "mercy"), "y"] = -1  # re-score "mercy" as negative in this context
newDictionary = rbind(newDictionary, negWordsDf)
newDictionary = rbind(newDictionary, posWordsDf)
rm(negativeWords)
rm(positiveWords)
rm(negWordsDf)
rm(posWordsDf)

modifiedPolarityOfTweets = polarity(tweets$cleanedTweets, polarity.frame = newDictionary, constrain = TRUE)
rawPolarityOfTweets = polarity(tweets$cleanedTweets, constrain = TRUE)
# polarity() returns NaN for tweets with no scorable words; drop those rows
tweets$rawPolarity = rawPolarityOfTweets$all$polarity
tweets$modifiedPolarity = modifiedPolarityOfTweets$all$polarity
nanRows = which(is.nan(tweets$rawPolarity) | is.nan(tweets$modifiedPolarity))
if (length(nanRows) > 0) {
  tweets = tweets[-nanRows, ]
}
head(tweets)
##            name        username
## 1 GunsandCoffee GunsandCoffee70
## 2 GunsandCoffee GunsandCoffee70
## 3 GunsandCoffee GunsandCoffee70
## 4 GunsandCoffee GunsandCoffee70
## 5 GunsandCoffee GunsandCoffee70
## 6 GunsandCoffee GunsandCoffee70
##                                    description location followers
## 1 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 2 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 3 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 4 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 5 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
## 6 ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews     <NA>       640
##   numberstatuses       time
## 1             49 2015-01-06
## 2             49 2015-01-06
## 3             49 2015-01-06
## 4             49 2015-01-06
## 5             49 2015-01-06
## 6             49 2015-01-06
##                                                                                                                                         tweets
## 1     ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI: http://t.co/73xFszsjvr http://t.co/x8BZcscXzq
## 2 ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS  EASY' http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr
## 3                    ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA): http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6
## 4  ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: 'THE PROMISE OF VICTORY': http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC
## 5            ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT 'ALTHOUGH THE DISBELIEVERS DISLIKE IT.' http://t.co/2EYm9EymTe
## 6                             THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIER OF JN: Video Link :http://t.co/EPaPRlph5W http://t.co/4VUYszairt
##   containsKill containsBomb alawite zaynab
## 1        FALSE        FALSE   FALSE  FALSE
## 2        FALSE        FALSE   FALSE  FALSE
## 3        FALSE        FALSE   FALSE  FALSE
## 4        FALSE        FALSE   FALSE  FALSE
## 5        FALSE        FALSE   FALSE  FALSE
## 6        FALSE        FALSE   FALSE  FALSE
##                                                                                  cleanedTweets
## 1                english translat a messag to the truth in syria sheikh abu muham al maqdisi
## 2          english translat sheikh fatih al jawlani for the peopl of integr sacrific is easi
## 3                          english translat first audio meet with sheikh fatih al jawlani ha
## 4          english translat sheikh nasir al wuhayshi ha leader of aqap the promis of victori
## 5 english translat aqap respons to sheikh baghdadi statement although the disbeliev dislik it
## 6                              the second clip in a dawah seri by a soldier of jn video link
##   rawPolarity modifiedPolarity
## 1           0                0
## 2           0                0
## 3           0                0
## 4           0                0
## 5           0                0
## 6           0                0
usernames = levels(tweets$username)
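The helper findTopNUsersByVar lives elsewhere in our full script; a minimal sketch of its behavior, assuming username is a factor column of tweets, would be:

# Return the usernames of the n users with the largest value of var:
# tweet count for var = "tweets", otherwise the user's maximum of that column
findTopNUsersByVar = function(n, var) {
  if (var == "tweets") {
    counts = sort(table(tweets$username), decreasing = TRUE)
  } else {
    counts = sort(tapply(tweets[[var]], tweets$username, max), decreasing = TRUE)
  }
  names(head(counts, n))
}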

# Usernames of the 20 most active users (findTopNUsersByVar is sketched above)
top20ActiveUsers = findTopNUsersByVar(20, "tweets")

# Pass in "raw" or "modified"
avgPol = function(type) {
  # Accumulate each user's polarity total, then divide by their tweet count
  averagePolarity = data.frame(rep(0, times = length(usernames)), usernames)
  colnames(averagePolarity) = c("polarity", "username")
  numTweetsPerUsername = rep(0, times = length(usernames))
  names(numTweetsPerUsername) = usernames
  column = paste(type, "Polarity", sep = "")
  for (i in 1:nrow(tweets)) {
    user = tweets$username[i]
    averagePolarity[which(averagePolarity$username == user), "polarity"] =
      averagePolarity[which(averagePolarity$username == user), "polarity"] + tweets[i, column]
    numTweetsPerUsername[user] = numTweetsPerUsername[user] + 1
  }
  averagePolarity$polarity = averagePolarity$polarity / numTweetsPerUsername
  return(averagePolarity)
}

rawAveragePolarity = avgPol("raw")
modifiedAveragePolarity = avgPol("modified")
rawSubsetOfActiveUsers = rawAveragePolarity[which(rawAveragePolarity$username %in% top20ActiveUsers), ]
modifiedSubsetOfActiveUsers = modifiedAveragePolarity[which(modifiedAveragePolarity$username %in% top20ActiveUsers), ]

top20Followers = findTopNUsersByVar(20, "followers")
rawSubsetOfActiveFollowers = rawAveragePolarity[which(rawAveragePolarity$username %in% top20Followers), ]
modifiedSubsetOfActiveFollowers = modifiedAveragePolarity[which(modifiedAveragePolarity$username %in% top20Followers), ]

ggplot(rawSubsetOfActiveUsers, aes(x = username, y = polarity)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Raw Polarity of 20 Most Active Users")

ggplot(modifiedSubsetOfActiveUsers, aes(x = username, y = polarity)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Modified Polarity of 20 Most Active Users")

ggplot(rawSubsetOfActiveFollowers, aes(x = username, y = polarity)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Raw Polarity of 20 Most Followed Users")

ggplot(modifiedSubsetOfActiveFollowers, aes(x = username, y = polarity)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Modified Polarity of 20 Most Followed Users")

In the graphs above we reference modified and raw polarities as well as most-followed and most-active users. Modified polarity refers to scores calculated with our modified dictionary; raw polarity was calculated with the default dictionary. We also subsetted the data to the 20 users who tweeted the most (most active) and the 20 users with the most followers (most followed). It is interesting that the most-followed users tend to have higher polarity than the most-active users under both raw and modified scoring. It is also interesting, comparing the modified and raw graphs, that the modified dictionary affected the negative users far more than the positive ones. The most positive user in the graphs, ansarakhilafa, remained at about the same average polarity. On the other hand, not one of the 20 most active users has an average raw polarity below -0.1, whereas under modified scoring three users fall below -0.1 and another sits very close to it.
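One quick way to quantify that asymmetry (a check we did not include in the original analysis) is to look at how far each active user's average moved between dictionaries:

# Per-user change in average polarity caused by the modified dictionary
shift = modifiedAveragePolarity$polarity - rawAveragePolarity$polarity
summary(shift[match(top20ActiveUsers, rawAveragePolarity$username)])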

Understanding Locations

Another interesting way to analyze this data was with maps. We used both R's maps package (via ggplot2's borders()) and the ggmap package, which wraps the Google Maps geocoding API. The end product is a map of the number of tweets sent per location. This turned out to be more tedious than we had originally thought, because the location data is not geotagged; it is user-entered free text. We therefore had to go through all of the locations actually reported and remove the nonsense entries (ranging anywhere from "I hate snitches" and "Land of Allah" to "Among the muslims" and "Don't need to know"). We also had to drop locations that were not in ASCII characters. We then obtained geocodes from ggmap, plotted the longitude and latitude onto a world map, and sized each point by the number of tweets sent from that location.

tweets$location = as.character(tweets$location)
tweets$location = str_conv(tweets$location,"UTF-8")

removeLocs <- c("","Read my blog", "Don't need to know", "Land of Allah", "Among The Muslims", "Guetto",
                "world","Worldwide contributions","Middle of Nowhere", "I hate snitches", "Nowhere",
                "40+Suspension for the truth!","noway", "Among the mushrikeen","Prison ( Darul Kufr )",
                "AP","Al-Battar Media Foundatiom","dar al-kufr","darl mushrequeen","Among mushrikeen","Earth")

## ISIS declared its state in Iraq, so we geocode "Islamic State" to Iraq
for(i in 1:length(tweets$location)){
  if(is.na(tweets$location[i]) || is.element(tweets$location[i], removeLocs)) {
    tweets$realLocation[i] = NA
  }else if(tweets[i, "location"] == "Islamic State"){
    tweets$realLocation[i]= "Iraq"
  }else if(tweets[i,"location"] == "Male'. Maldives."){
    tweets$realLocation[i] = "Maldives"
  }else if(tweets[i,"location"] == "Wilayah Twitter" || tweets[i,"location"] == "Wilayah Kashmir"){
    tweets$realLocation[i] = "Wilayah"
  } else{
    tweets$realLocation[i] = tweets$location[i]
  }
}

tweets$realLocation[grep("Sirte", tweets$realLocation, ignore.case = T)] = "Sirte"
tweets$realLocation[grep("Deutschland", tweets$realLocation, ignore.case = T)] = "Munchen, Deutschland"
tweets$realLocation[grep("Wilayat Hadramaut", tweets$realLocation, ignore.case = T)] = "Wilayat"
tweets$realLocation[grep("Wazirstan",tweets$realLocation,ignore.case = T)] = "Wazirstan"

sum(is.na(tweets$realLocation))
## [1] 7294
# Collect the distinct cleaned locations, preserving first-seen order
uniqueLocs = NA
ind <- 1
for(i in 1:length(tweets$realLocation)){
  if(!is.element(tweets$realLocation[i], uniqueLocs)){
    uniqueLocs[ind] = tweets$realLocation[i]
    ind <- ind +1
  }
}
uniqueLocs
##  [1] "Iraq"
##  [2] NA
##  [3] "Munchen, Deutschland"
##  [4] "Maldives"
##  [5] "Dunya"
##  [6] "Wilayat"
##  [7] "EU"
##  [8] "."
##  [9] "Wazirstan"
## [10] "England, United Kingdom"
## [11] "[mis-encoded Arabic text]"
## [12] "yamin, yasar raqum [mis-encoded numerals]"
## [13] "Antas, Bahia"
## [14] "Wilayah"
## [15] "[mis-encoded Urdu text]"
## [16] "United States"
## [17] "28th Street, Qamar Precint"
## [18] "Punch, Jammu And Kashmir"
## [19] "Amsterdam, The Netherlands"
## [20] "Dar Al Kufr"
## [21] "Gaziantep, Turkey"
## [22] "Texas, USA"
## [23] "Geneva, Switzerland"
## [24] "Sirte"
## [25] "Ghuraba"
## [26] "Lake City, GA"
## [27] "Singaparna, Indonesia"
## [28] "Germany"
## [29] "Bandar Seri Begawan, Negara Brunei Darussalam"
## [30] "[mis-encoded Arabic text]"
uniqueLocs <- c("Iraq","Munchen, Deutschland", "Maldives", "Dunya", "Wilayat","EU","United States",
                "Punch, Jammu And Kashmir","Gaziantep, Turkey","Texas, USA","Geneva, Switzerland",
                "Lake City, GA","Singaparna, Indonesia","Germany",
                "Bandar Seri Begawan, Negara Brunei Darussalam", "Sirte","Amsterdam, The Netherlands",
                "Antas, Bahia","England, United Kingdom","Wazirstan")


mapWorld <- borders("world", colour="gray50", fill="gray50") # create a layer of borders
locs <- geocode(uniqueLocs)   # ggmap's geocoder; queries the Google Maps API
loc.x <- locs$lon
loc.y <- locs$lat

# Count tweets per location; a few entries are then re-counted against the
# original location strings that were renamed above
freq = NA
for(i in 1:length(uniqueLocs)){
  freq[i] = length(which(tweets$realLocation == uniqueLocs[i]))
}

freq[2] <- length(which(tweets$realLocation == "Munchen, Deutschland"))
freq[3] <- length(which(tweets$location == "Male'. Maldives."))
freq[5] <- length(which(tweets$location == "Wilayah Kashmir"))

mp <- ggplot() +  mapWorld + geom_point(aes(x = loc.x, y = loc.y, size = freq), color = "red") +
  ggtitle("Locations recorded by twitter users")
mp

This graph is a little unexpected: the large bubbles sit in the United States and Europe, whereas one would expect more, and larger, bubbles in the Middle East. An explanation is the self-reported location field. Looking at the output of sum(is.na(tweets$realLocation)), more than half of the location data gets removed because it has no true mapping. If all of the tweets were geotagged with their true locations, we would get a far better picture of what our data really represents.

Clustering

We decided to build our own k-means-style (k-medoids) clustering algorithm for this project. We did this because we knew it would be difficult to cluster Twitter usernames on sentiment alone, so we wanted to factor n-gram analysis into our tweet-distance calculations, and writing our own algorithm let us tweak those distances freely to improve our results. The drawback was speed and efficiency: in the end, we had to remove the n-gram distance component, as it made the algorithm far too slow to run. After clustering, we mapped clusters to usernames by finding the most common username in each cluster, as sketched below.
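A minimal version of that mapping step (names here are ours, not our exact code):

# Label each cluster with the most common username among its members
mapClusterToUser = function(assignment, usernames) {
  sapply(split(usernames, assignment),
         function(u) names(sort(table(u), decreasing = TRUE))[1])
}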

We first subsetted the data to the top 20 most active users. This was done in the same fashion as before for sentiment analysis.
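Our full implementation is too long to reproduce here, but its core loop looked roughly like the following sketch, which clusters on polarity alone (the real run also printed the progress log below and seeded one starting medoid per top username rather than sampling at random):

# Simplified k-medoids: distance is absolute difference in polarity
clusterByPolarity = function(polarity, k, iterations = 5) {
  d = abs(outer(polarity, polarity, "-"))       # full pairwise distance matrix
  medoids = sample(length(polarity), k)         # random starting medoids
  for (iter in 1:iterations) {
    # Assign every tweet to its nearest medoid
    assignment = apply(d[, medoids, drop = FALSE], 1, which.min)
    # New medoid = the member minimizing total within-cluster distance
    for (j in 1:k) {
      members = which(assignment == j)
      if (length(members) > 0) {
        medoids[j] = members[which.min(rowSums(d[members, members, drop = FALSE]))]
      }
    }
  }
  assignment
}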

## [1] "Clustering the top 20 users."
## [1] "Computing distance matrix"
## [1] "Distance matrix computed"
## [1] "ITERATION:  1"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  2"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  3"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  4"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  5"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] ""
## [1] "Accuracy: 0.0319335083114611"
## [1] ""
## [1] "Results for P =  melvynlion"
## [1] "Precision: 0.0189075630252101"
## [1] "Recall: 0.2"
## [1] "F1: 0.0345489443378119"
## [1] ""
## [1] "Results for P =  RamiAlLolah"
## [1] "Precision: 0.046831955922865"
## [1] "Recall: 0.105590062111801"
## [1] "F1: 0.0648854961832061"
## [1] ""
## [1] "Results for P =  Uncle_SamCoco"
## [1] "Precision: 0.0311501597444089"
## [1] "Recall: 0.684210526315789"
## [1] "F1: 0.0595874713521772"
## [1] ""
## [1] "Results for P =  WarReporter1"
## [1] "Precision: 0"
## [1] "Recall: 0"
## [1] "F1: NaN"
## [1] ""
## [1] "Results for P =  warrnews"
## [1] "Precision: 0.0571428571428571"
## [1] "Recall: 0.133333333333333"
## [1] "F1: 0.08"
## [1] ""

We can see that our results are terrible. We have an accuracy of 0.0319 and F1 measures below 0.1 for all mapped usernames. Another very interesting note: we ran the algorithm with 20 clusters, each starting medoid coming from a distinct username among the top 20 most active users, yet after clustering only 5 usernames end up being mapped to. This can be explained by the nature of our data, which is dominated by a few very active users. Even if a user had a fairly distinctive tweet style that produced its own cluster, more active users may still have dominated that cluster and the rest. It is also possible that the tweets simply do not cluster well on polarity, so our prediction function mostly outputs the most active users.

We then ran our clustering algorithm again, but this time subsetting our data set to the top five most active users.

## [1] "Clustering the top 5 users."
## [1] "Computing distance matrix"
## [1] "Distance matrix computed"
## [1] "ITERATION:  1"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  2"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  3"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  4"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] ""
## [1] "Accuracy: 0.19132455460883"
## [1] ""
## [1] "Results for P =  RamiAlLolah"
## [1] "Precision: 0.218978102189781"
## [1] "Recall: 0.193548387096774"
## [1] "F1: 0.205479452054795"
## [1] ""
## [1] "Results for P =  Uncle_SamCoco"
## [1] "Precision: 0.318777292576419"
## [1] "Recall: 0.226006191950464"
## [1] "F1: 0.264492753623188"
## [1] ""
## [1] "Results for P =  WarReporter1"
## [1] "Precision: 0.128125"
## [1] "Recall: 0.144876325088339"
## [1] "F1: 0.135986733001658"
## [1] ""
## [1] "Results for P =  warrnews"
## [1] "Precision: 0.170247933884298"
## [1] "Recall: 0.425619834710744"
## [1] "F1: 0.243211334120425"
## [1] ""

Again, we get awful results. Our accuracy has improved to 0.1913, with the highest F1 measure at 0.2432. What is more reassuring about this run is that 4 usernames are mapped to from 5 clusters, which makes it less likely that a few of the many usernames are dominating our clusters. We still lose one cluster, so that remains possible, but it now appears more likely that our dataset simply can't be separated by polarity alone.

Finally, we attempted to cluster just the top two most active users.

## [1] "Clustering the top 2 users."
## [1] "Computing distance matrix"
## [1] "Distance matrix computed"
## [1] "ITERATION:  1"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] "ITERATION:  2"
## [1] "Assigning Clusters"
## [1] "Clusters Assigned"
## [1] "Recomputing medoids"
## [1] "Medoids Recomputed"
## [1] ""
## [1] "Accuracy: 0.425081433224756"
## [1] ""
## [1] "Results for P =  RamiAlLolah"
## [1] "Precision: 0.431372549019608"
## [1] "Recall: 0.0635838150289017"
## [1] "F1: 0.110831234256927"
## [1] ""
## [1] "Results for P =  Uncle_SamCoco"
## [1] "Precision: 0.424511545293073"
## [1] "Recall: 0.891791044776119"
## [1] "F1: 0.575210589651023"
## [1] ""

Unexpectedly, we got results that were worse than random guessing: for just the top two most active users, even using 10 clusters, we still get an accuracy of only 0.4251. Combining this with the per-username measures, we can see that our clusters tend to be a mix of usernames, with a few usernames dominating. This speaks to how poorly the data clusters on polarity alone.

To validate this result, we ran R's built-in kmeans algorithm on the modified polarity scores. We subsetted the data to the top five most active users and ran kmeans with a preset k = 5. Below is a printout of the usernames falling in cluster 1 of the resulting model.
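A sketch of that validation run (variable names here are illustrative, not our exact code):

top5 = findTopNUsersByVar(5, "tweets")
subset5 = tweets[tweets$username %in% top5, ]
set.seed(1)                                   # kmeans picks random starting centers
km = kmeans(subset5$modifiedPolarity, centers = 5)
subset5$username[km$cluster == 1]             # the printout shown below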

##   [1] _IshfaqAhmad  RamiAlLolah   Uncle_SamCoco WarReporter1  _IshfaqAhmad
##   [6] WarReporter1  warrnews      warrnews      _IshfaqAhmad  RamiAlLolah
##  [11] WarReporter1  Uncle_SamCoco _IshfaqAhmad  warrnews      _IshfaqAhmad
##  [16] _IshfaqAhmad  Uncle_SamCoco RamiAlLolah   Uncle_SamCoco RamiAlLolah
##  [21] warrnews      warrnews      RamiAlLolah   RamiAlLolah   Uncle_SamCoco
##  [26] WarReporter1  _IshfaqAhmad  WarReporter1  _IshfaqAhmad  _IshfaqAhmad
##  [31] Uncle_SamCoco WarReporter1  WarReporter1  _IshfaqAhmad  warrnews
##  [36] WarReporter1  WarReporter1  WarReporter1  WarReporter1  warrnews
##  [41] Uncle_SamCoco Uncle_SamCoco warrnews      RamiAlLolah   _IshfaqAhmad
##  [46] _IshfaqAhmad  warrnews      WarReporter1  WarReporter1  RamiAlLolah
##  [51] WarReporter1  warrnews      Uncle_SamCoco _IshfaqAhmad  WarReporter1
##  [56] Uncle_SamCoco Uncle_SamCoco _IshfaqAhmad  WarReporter1  warrnews
##  [61] WarReporter1  Uncle_SamCoco _IshfaqAhmad  WarReporter1  _IshfaqAhmad
##  [66] warrnews      _IshfaqAhmad  WarReporter1  Uncle_SamCoco _IshfaqAhmad
##  [71] _IshfaqAhmad  WarReporter1  Uncle_SamCoco Uncle_SamCoco Uncle_SamCoco
##  [76] _IshfaqAhmad  Uncle_SamCoco RamiAlLolah   _IshfaqAhmad  _IshfaqAhmad
##  [81] WarReporter1  Uncle_SamCoco _IshfaqAhmad  Uncle_SamCoco _IshfaqAhmad
##  [86] _IshfaqAhmad  warrnews      _IshfaqAhmad  Uncle_SamCoco RamiAlLolah
##  [91] Uncle_SamCoco WarReporter1  _IshfaqAhmad  warrnews      Uncle_SamCoco
##  [96] WarReporter1  RamiAlLolah   Uncle_SamCoco RamiAlLolah   WarReporter1
## [101] Uncle_SamCoco _IshfaqAhmad  _IshfaqAhmad  _IshfaqAhmad  WarReporter1
## [106] warrnews      _IshfaqAhmad  _IshfaqAhmad  RamiAlLolah   WarReporter1
## [111] WarReporter1  Uncle_SamCoco warrnews      Uncle_SamCoco RamiAlLolah
## [116] Uncle_SamCoco RamiAlLolah   warrnews      RamiAlLolah   Uncle_SamCoco
## [121] WarReporter1  WarReporter1  Uncle_SamCoco WarReporter1  _IshfaqAhmad
## [126] _IshfaqAhmad  Uncle_SamCoco Uncle_SamCoco RamiAlLolah   RamiAlLolah
## [131] WarReporter1  RamiAlLolah   WarReporter1  RamiAlLolah   Uncle_SamCoco
## [136] _IshfaqAhmad  _IshfaqAhmad  WarReporter1  Uncle_SamCoco WarReporter1
## [141] WarReporter1  Uncle_SamCoco warrnews      RamiAlLolah   WarReporter1
## [146] WarReporter1  WarReporter1  WarReporter1  RamiAlLolah   Uncle_SamCoco
## [151] Uncle_SamCoco _IshfaqAhmad  _IshfaqAhmad  WarReporter1  RamiAlLolah
## [156] WarReporter1  warrnews      WarReporter1  WarReporter1  WarReporter1
## [161] Uncle_SamCoco RamiAlLolah   _IshfaqAhmad  WarReporter1  Uncle_SamCoco
## [166] Uncle_SamCoco RamiAlLolah   _IshfaqAhmad  WarReporter1  RamiAlLolah
## [171] Uncle_SamCoco warrnews      _IshfaqAhmad  RamiAlLolah   Uncle_SamCoco
## [176] WarReporter1  WarReporter1  warrnews      WarReporter1  WarReporter1
## [181] _IshfaqAhmad  RamiAlLolah   warrnews      _IshfaqAhmad  WarReporter1
## [186] Uncle_SamCoco RamiAlLolah   _IshfaqAhmad  WarReporter1  RamiAlLolah
## [191] Uncle_SamCoco _IshfaqAhmad  Uncle_SamCoco Uncle_SamCoco warrnews
## [196] warrnews      warrnews      WarReporter1  RamiAlLolah   WarReporter1
## [201] Uncle_SamCoco RamiAlLolah   _IshfaqAhmad  _IshfaqAhmad  Uncle_SamCoco
## [206] RamiAlLolah   RamiAlLolah   Uncle_SamCoco WarReporter1  Uncle_SamCoco
## [211] RamiAlLolah   RamiAlLolah   RamiAlLolah   RamiAlLolah   Uncle_SamCoco
## [216] RamiAlLolah   WarReporter1  _IshfaqAhmad  WarReporter1  RamiAlLolah
## [221] Uncle_SamCoco RamiAlLolah   RamiAlLolah   Uncle_SamCoco _IshfaqAhmad
## [226] WarReporter1  warrnews      WarReporter1  WarReporter1  _IshfaqAhmad
## [231] _IshfaqAhmad  RamiAlLolah   WarReporter1  WarReporter1  _IshfaqAhmad
## [236] WarReporter1  WarReporter1  _IshfaqAhmad
## 112 Levels: ___KU217_y __alfresco__ ... YazeedDhardaa25

The output supports our conclusion that, based purely on polarity, our dataset is not easily clusterable. The mix of usernames in the first cluster is fairly uniform; it would clearly be very difficult to accurately map from cluster 1 to any specific username.

Conclusion

We started with a dataset containing upwards of 17,000 tweets. After careful preprocessing we brought the number of tweets to analyze down to roughly 11,000 observations. To gain a better understanding of the data, we examined word frequency over time and were able to validate the data by cross-referencing the dates of terror attacks with the density of references to those attacks in the tweets. We cleaned the tweets using text-mining packages and then performed n-gram analysis as well as sentiment analysis on the cleaned data. We also cleaned the location data down to real places and visualized where all of these tweets were coming from.
This project turned out to be much more difficult and time-consuming than we initially planned. At first it seemed possible that we would be able to classify a username based on a tweet and previous patterns. While that is likely not impossible, our time constraints and the limits of our text-mining knowledge help explain our poor results. Given more time, we would like to properly factor n-gram analysis into our model creation.
The tweet data we found also didn't include the follower lists of the main Twitter handles. A dataset that included those would be primed for social network analysis and would be easier to classify.

Some resources we used: https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html
https://78462f86-a-e2d7344e-s-sites.googlegroups.com/a/rdatamining.com/www/docs/RDataMining-slides-text-mining.pdf?attachauth=ANoY7co8mBLqvOze1kKofiT9YDmck83ZUFOp5wHe7tRyyNSXBw8jMh4nHhLakZg7qcEOQsxaYqX6YyMFVjZkjzxOi9ktOVykEy2ZwqMqi7w9pCS0Z7x-J3_zEo8GMlWgm1XI6-7xZ4DTLUhvmqkyS5u95d7qu0aQ_-1bMgyYT5aAVKib-Q8FUvUNk3XWD3mqMYeZLcX_gQl8UInRrK1QMR4fyxO4SqAUcgHczMuOBfhCzXwkS-1FVbY%3D&attredirects=0