<img src="./RladiesLima2019/titlepage-DataFestTbilisi.png" width="140%" style="display: block; margin: auto;" /> --- # Why analyzing text? * Can't read / won't read that much * Can't remember all I read and synthetize the information * *Increase* objectivity and *let the data* tell the story --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * Sentiment analysis --- # What type of text analysis? * **Word frequency** * Word clouds * Topic modeling * Sentiment analysis <img src="./RladiesLima2019/Bigram.png" width="140%" style="display: block; margin: auto;" /> * I do not endorse J.K.R.'s transphobic opinions --- # What type of text analysis? * Word frequency * **Word clouds** * Topic modeling * Sentiment analysis <div class="figure" style="text-align: center"> <img src="./RladiesLima2019/HP-wordcloud2.png" alt="Wordcloud of Harry Potter and the Sorcerer's Stone" width="140%" /> <p class="caption">Wordcloud of Harry Potter and the Sorcerer's Stone</p> </div> Wordcloud of Harry Potter and the Sorcerer's Stone --- # What type of text analysis? * Word frequency * Word clouds * **Topic modeling** * Sentiment analysis `\(~\)` <img src="./RladiesLima2019/Topics-DF-1.png" width="140%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * **Topic modeling** * Sentiment analysis `\(~\)` <img src="./RladiesLima2019/Topics-DF-2.png" width="140%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * **Sentiment analysis** `\(~\)` <img src="./RladiesLima2019/tweet1.png" width="120%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * **Sentiment analysis** `\(~\)` <img src="./RladiesLima2019/tweet2.png" width="120%" style="display: block; margin: auto;" /> -- More in the book [Text Mining with R](https://www.tidytextmining.com/sentiment.html) by Julia Silge and David Robinson --- # First, let's talk about data processing Key and a pain. * Remove redudant words * Standardize English (BrE or AmE) * Lemmatize * Filter out words with very low frequency --- # Time to play with R! Let's start with an example of tweets about DataFest Tbilisi. Calling all the packages needed and auxiliary functions ```r library(tidyverse) library(tidytext) library(stringr) library(tm) # removing words in Spanish library(textstem) # lemmatizing library(rtweet) library(ggwordcloud) library(topicmodels) path_tools <- "./AuxiliaryTextMining/" # path to some auxiliary files for text mining # calling auxiliary functions source(paste0(path_tools, "Americanizing.R")) source(paste0(path_tools,"cleaning_words_abstract.R")) ``` --- I've already created a token to download tweets. More info in [this link](https://towardsdatascience.com/a-guide-to-mining-and-analysing-tweets-with-r-2f56818fdd16). 
Now, searching tweets:

```r
df_tbilisi <- search_tweets("DataFestTbilisi", n = 1000, include_rts = TRUE) # from the past 6-9 days
```

There are many fields of information, and we'll just keep the actual tweets and their IDs

```r
text_tb <- df_tbilisi %>% select(text, status_id)
```

--

Using the function to clean the text

```r
df_words <- cleaning_words_abstract(text_tb, path_tools)
```

```
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
```

```
## Joining, by = "word_am_lem"
```

Let's take a [closer look](./AuxiliaryTextMining/cleaning_words_abstract.R)
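The file itself is not reproduced in the slides, so here is a minimal sketch of the kind of pipeline it implements (tokenize, drop stop words, lemmatize). The function name `clean_words_sketch` is illustrative, and the Americanizing step (the `word_am` column) is omitted:

```r
# Sketch only -- not the actual contents of cleaning_words_abstract.R
clean_words_sketch <- function(df_text) {
  df_text %>%
    unnest_tokens(word, text) %>%               # one row per (tweet, word)
    anti_join(stop_words, by = "word") %>%      # remove common English stop words
    filter(!str_detect(word, "^[0-9]+$")) %>%   # drop tokens that are just numbers
    mutate(word_am_lem = lemmatize_words(word)) # lemmatize, e.g. "rights" -> "right"
}
```

Judging by the output columns (`word`, `word_am`, `word_am_lem`), the real function also standardizes to American English (via the sourced `Americanizing.R`) before lemmatizing.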
---

Taking a look at the processed data frame

```r
head(df_words)
```

```
## # A tibble: 6 x 4
##   status_id           word            word_am         word_am_lem    
##   <chr>               <chr>           <chr>           <chr>          
## 1 1338934511061622785 day             day             day            
## 2 1338934511061622785 datafesttbilisi datafesttbilisi datafesttbilisi
## 3 1338934511061622785 live            live            live           
## 4 1338437103848448003 meet            meet            meet           
## 5 1338437103848448003 human           human           human          
## 6 1338437103848448003 rights          rights          right          
```

---

# Word frequency

Counting words and computing relative frequencies

```r
word_freq <- df_words %>% count(word_am_lem)
word_freq <- word_freq %>% mutate(prop = n/sum(n)) %>% arrange(desc(n))
head(word_freq)
```

```
## # A tibble: 6 x 3
##   word_am_lem         n   prop
##   <chr>           <int>  <dbl>
## 1 data              191 0.0445
## 2 datafesttbilisi   155 0.0361
## 3 talk              133 0.0310
## 4 track             124 0.0289
## 5 speaker            93 0.0217
## 6 meet               89 0.0207
```

---

# Word frequency

Making a table with the 8 most frequent words

```r
word_freq_most <- word_freq %>%
  mutate(perc = round(prop*100, 2)) %>%
  select(word_am_lem, n, perc) %>%
  slice(1:8)

knitr::kable(word_freq_most, format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> word_am_lem </th>
   <th style="text-align:right;"> n </th>
   <th style="text-align:right;"> perc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> data </td>
   <td style="text-align:right;"> 191 </td>
   <td style="text-align:right;"> 4.45 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> datafesttbilisi </td>
   <td style="text-align:right;"> 155 </td>
   <td style="text-align:right;"> 3.61 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> talk </td>
   <td style="text-align:right;"> 133 </td>
   <td style="text-align:right;"> 3.10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> track </td>
   <td style="text-align:right;"> 124 </td>
   <td style="text-align:right;"> 2.89 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> speaker </td>
   <td style="text-align:right;"> 93 </td>
   <td style="text-align:right;"> 2.17 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meet </td>
   <td style="text-align:right;"> 89 </td>
   <td style="text-align:right;"> 2.07 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> workshop </td>
   <td style="text-align:right;"> 74 </td>
   <td style="text-align:right;"> 1.72 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> design </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 1.40 </td>
  </tr>
</tbody>
</table>

---

# Word cloud

We filter again based on frequency

```r
word_freq_most2 <- word_freq %>% filter(n > 5) # larger clouds are difficult to read
```

And use `ggwordcloud` to make the cloud

```r
ggplot(data = word_freq_most2, aes(label = word_am_lem, size = prop, color = prop)) +
  geom_text_wordcloud_area(eccentricity = 1, grid_margin = 0) +
  scale_size_area(max_size = 30) +
  theme_bw() +
  scale_colour_gradientn(colors = c('#253494', 'palegoldenrod', 'orangered'), values = c(0, .25, 1)) +
  theme(strip.text = element_text(size = 45),
        plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
        strip.background = element_rect(fill = 'white'),
        panel.border = element_blank())
```

---

# Word cloud

<img src="./RladiesLima2019/wordcloud_df_tb.png" width="65%" style="display: block; margin: auto;" />

---

# Topic modeling

* There are several ways to model topics.
* Here I show an example of Latent Dirichlet Allocation (LDA) models
* Bayesian mixture model:
  * Each topic is composed of a mixture of words
  * Each document (tweet) is composed of words drawn from a mixture of topics
* Fixed number of topics, chosen in advance

--

We will use another example, extracting tweets about Tbilisi.

--

```r
tbilisi <- search_tweets("tbilisi", n = 10000, include_rts = TRUE, lang = "en")
```

```r
dim(tbilisi)
```

```
## [1] 1099   90
```

---

# Topic modeling
## Data cleaning

Selecting the columns of interest and cleaning the data

```r
text_tb <- tbilisi %>% select(text, status_id)
df_words <- cleaning_words_abstract(text_tb, path_tools)
```

---

# Topic modeling
## Data cleaning

To reduce noise for the model, we filter out the words that appear only once throughout the data set

```r
# Recompute the word frequencies for this data set first
word_freq <- df_words %>% count(word_am_lem)

terms_extract <- data.frame(word_am_lem = word_freq$word_am_lem[which(word_freq$n == 1)],
                            stringsAsFactors = FALSE)
df_words <- df_words %>% anti_join(terms_extract)
```

And the word `tbilisi`, since it appears in all of the tweets

```r
terms_extract <- data.frame(word_am_lem = "tbilisi", stringsAsFactors = FALSE)
df_words <- df_words %>% anti_join(terms_extract)
```

---

# Topic modeling
## Formatting the cleaned data

```r
# Counting how many times a word appears in each tweet
datosFreq <- df_words %>% count(status_id, word_am_lem)
head(datosFreq)
```

```
## # A tibble: 6 x 3
##   status_id           word_am_lem      n
##   <chr>               <chr>        <int>
## 1 1335883242021543938 calculator       1
## 2 1335883242021543938 come             1
## 3 1335883242021543938 dec              1
## 4 1335883242021543938 event            1
## 5 1335883242021543938 facebook         1
## 6 1335883242021543938 georgialabor     1
```

```r
# Transforming into a Document Term Matrix class
data_dtm <- datosFreq %>% cast_dtm(document = status_id, term = word_am_lem, value = n)
```

---

# Topic modeling

Fitting an LDA with **4** hidden topics, with the `topicmodels` package.

```r
res1 <- LDA(x = data_dtm, k = 4, control = list(alpha = 1))
```

One output of this model is the estimated distribution of words per topic (beta). We use these values to create word clouds per topic `\(\rightarrow\)` interpret the topics

--

```r
# beta: the probability of that term being generated from that topic
papers_beta <- tidytext::tidy(res1, matrix = "beta")
head(papers_beta)
```

```
## # A tibble: 6 x 3
##   topic term            beta
##   <int> <chr>          <dbl>
## 1     1 calculator 5.09e- 29
## 2     2 calculator 1.58e-138
## 3     3 calculator 2.21e-139
## 4     4 calculator 1.51e-  3
## 5     1 come       1.94e-  3
## 6     2 come       2.49e-133
```
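---

# Topic modeling

Before thresholding, it can help to glance at the strongest terms in each topic. A quick sketch (assuming a dplyr version with `slice_max()`; `top_terms` is just an illustrative name):

```r
# Top 10 terms per topic, ranked by beta
top_terms <- papers_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```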
---

# Topic modeling

We filter on beta so that only a few words appear in each cloud

```r
papers_beta$topic_lab <- as.factor(papers_beta$topic)
values_thresh <- 0.003
topic_sample <- papers_beta %>% filter(beta > values_thresh) #%>% select(term, beta)
```

```r
ggplot(topic_sample, aes(label = term, size = beta, color = beta)) +
  geom_text_wordcloud_area(eccentricity = 1, grid_margin = 0) +
  scale_size_area(max_size = 40) +
  theme_bw() +
  scale_colour_gradientn(colors = c('#253494', 'palegoldenrod', 'orangered'), values = c(0, .25, 1)) +
  facet_wrap(~topic_lab, scales = "free", shrink = FALSE) +
  theme(strip.text = element_text(size = 45),
        plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
        strip.background = element_rect(fill = 'white'))
```

---

# Topic modeling

<img src="./RladiesLima2019/wordcloud_tb.png" width="90%" style="display: block; margin: auto;" />
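---

# Topic modeling

The model also estimates, for each tweet, its mixture of topics (gamma). A minimal sketch of pulling the dominant topic per tweet, assuming `res1` from above (`papers_gamma` and `main_topic` are illustrative names):

```r
# gamma: the estimated proportion of each topic within each document (tweet)
papers_gamma <- tidytext::tidy(res1, matrix = "gamma")

# Dominant topic per tweet
main_topic <- papers_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

head(main_topic)
```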