<img src="./RladiesLima2019/titlepage-DataFestTbilisi.png" width="140%" style="display: block; margin: auto;" /> --- # Why analyzing text? * Can't read / won't read that much * Can't remember all I read and synthetize the information * *Increase* objectivity and *let the data* tell the story --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * Sentiment analysis --- # What type of text analysis? * **Word frequency** * Word clouds * Topic modeling * Sentiment analysis <img src="./RladiesLima2019/Bigram.png" width="140%" style="display: block; margin: auto;" /> * I do not endorse J.K.R.'s transphobic opinions --- # What type of text analysis? * Word frequency * **Word clouds** * Topic modeling * Sentiment analysis <div class="figure" style="text-align: center"> <img src="./RladiesLima2019/HP-wordcloud2.png" alt="Wordcloud of Harry Potter and the Sorcerer's Stone" width="140%" /> <p class="caption">Wordcloud of Harry Potter and the Sorcerer's Stone</p> </div> Wordcloud of Harry Potter and the Sorcerer's Stone --- # What type of text analysis? * Word frequency * Word clouds * **Topic modeling** * Sentiment analysis `\(~\)` <img src="./RladiesLima2019/Topics-DF-1.png" width="140%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * **Topic modeling** * Sentiment analysis `\(~\)` <img src="./RladiesLima2019/Topics-DF-2.png" width="140%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * **Sentiment analysis** `\(~\)` <img src="./RladiesLima2019/tweet1.png" width="120%" style="display: block; margin: auto;" /> --- # What type of text analysis? * Word frequency * Word clouds * Topic modeling * **Sentiment analysis** `\(~\)` <img src="./RladiesLima2019/tweet2.png" width="120%" style="display: block; margin: auto;" /> -- More in the book [Text Mining with R](https://www.tidytextmining.com/sentiment.html) by Julia Silge and David Robinson --- # First, let's talk about data processing Key and a pain. * Remove redudant words * Standardize English (BrE or AmE) * Lemmatize * Filter out words with very low frequency --- # Time to play with R! Let's start with an example of tweets about DataFest Tbilisi. Calling all the packages needed and auxiliary functions ```r library(tidyverse) library(tidytext) library(stringr) library(tm) # removing words in Spanish library(textstem) # lemmatizing library(rtweet) library(ggwordcloud) library(topicmodels) path_tools <- "./AuxiliaryTextMining/" # path to some auxiliary files for text mining # calling auxiliary functions source(paste0(path_tools, "Americanizing.R")) source(paste0(path_tools,"cleaning_words_abstract.R")) ``` --- I've already created a token to download tweets. More info in [this link](https://towardsdatascience.com/a-guide-to-mining-and-analysing-tweets-with-r-2f56818fdd16). 
Now, searching tweets:

```r
df_tbilisi <- search_tweets("DataFestTbilisi", n = 1000, include_rts = TRUE) # from the past 6-9 days
```

There are many fields of information, and we'll just keep the actual tweets and their IDs

```r
text_tb <- df_tbilisi %>% select(text, status_id)
```

--

Using the function to clean the text

```r
df_words <- cleaning_words_abstract(text_tb, path_tools)
```

```
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
```

```
## Joining, by = "word_am_lem"
```

Let's take a [closer look](./AuxiliaryTextMining/cleaning_words_abstract.R)
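The file itself is not reproduced in the slides, so here is a minimal sketch of the kind of pipeline it implements (tokenize, drop stop words, lemmatize). The function name `clean_words_sketch` is illustrative, and the Americanizing step (the `word_am` column) is omitted:

```r
# Sketch only -- not the actual contents of cleaning_words_abstract.R
clean_words_sketch <- function(df_text) {
  df_text %>%
    unnest_tokens(word, text) %>%               # one row per (tweet, word)
    anti_join(stop_words, by = "word") %>%      # remove common English stop words
    filter(!str_detect(word, "^[0-9]+$")) %>%   # drop tokens that are just numbers
    mutate(word_am_lem = lemmatize_words(word)) # lemmatize, e.g. "rights" -> "right"
}
```

Judging by the output columns (`word`, `word_am`, `word_am_lem`), the real function also standardizes to American English (via the sourced `Americanizing.R`) before lemmatizing.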
---

Taking a look at the processed data frame

```r
head(df_words)
```

```
## # A tibble: 6 x 4
##   status_id           word            word_am         word_am_lem    
##   <chr>               <chr>           <chr>           <chr>          
## 1 1338934511061622785 day             day             day            
## 2 1338934511061622785 datafesttbilisi datafesttbilisi datafesttbilisi
## 3 1338934511061622785 live            live            live           
## 4 1338437103848448003 meet            meet            meet           
## 5 1338437103848448003 human           human           human          
## 6 1338437103848448003 rights          rights          right          
```

---

# Word frequency

Counting words and computing relative frequencies

```r
word_freq <- df_words %>% count(word_am_lem)
word_freq <- word_freq %>% mutate(prop = n/sum(n)) %>% arrange(desc(n))
head(word_freq)
```

```
## # A tibble: 6 x 3
##   word_am_lem         n   prop
##   <chr>           <int>  <dbl>
## 1 data              191 0.0445
## 2 datafesttbilisi   155 0.0361
## 3 talk              133 0.0310
## 4 track             124 0.0289
## 5 speaker            93 0.0217
## 6 meet               89 0.0207
```

---

# Word frequency

Making a table with the 8 most frequent words

```r
word_freq_most <- word_freq %>%
  mutate(perc = round(prop*100, 2)) %>%
  select(word_am_lem, n, perc) %>%
  slice(1:8)

knitr::kable(word_freq_most, format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> word_am_lem </th>
   <th style="text-align:right;"> n </th>
   <th style="text-align:right;"> perc </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> data </td>
   <td style="text-align:right;"> 191 </td>
   <td style="text-align:right;"> 4.45 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> datafesttbilisi </td>
   <td style="text-align:right;"> 155 </td>
   <td style="text-align:right;"> 3.61 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> talk </td>
   <td style="text-align:right;"> 133 </td>
   <td style="text-align:right;"> 3.10 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> track </td>
   <td style="text-align:right;"> 124 </td>
   <td style="text-align:right;"> 2.89 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> speaker </td>
   <td style="text-align:right;"> 93 </td>
   <td style="text-align:right;"> 2.17 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> meet </td>
   <td style="text-align:right;"> 89 </td>
   <td style="text-align:right;"> 2.07 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> workshop </td>
   <td style="text-align:right;"> 74 </td>
   <td style="text-align:right;"> 1.72 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> design </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 1.40 </td>
  </tr>
</tbody>
</table>

---

# Word cloud

We filter again based on frequency

```r
word_freq_most2 <- word_freq %>% filter(n > 5) # larger clouds are difficult to read
```

And use `ggwordcloud` to make the cloud

```r
ggplot(data = word_freq_most2, aes(label = word_am_lem, size = prop, color = prop)) +
  geom_text_wordcloud_area(eccentricity = 1, grid_margin = 0) +
  scale_size_area(max_size = 30) +
  theme_bw() +
  scale_colour_gradientn(colors = c('#253494', 'palegoldenrod', 'orangered'), values = c(0, .25, 1)) +
  theme(strip.text = element_text(size = 45),
        plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
        strip.background = element_rect(fill = 'white'),
        panel.border = element_blank())
```

---

# Word cloud

<img src="./RladiesLima2019/wordcloud_df_tb.png" width="65%" style="display: block; margin: auto;" />

---

# Topic modeling

* There are several ways to model topics.
* Here I show an example of Latent Dirichlet Allocation (LDA) models
* Bayesian mixture model:
  * Each topic is composed of a mixture of words
  * Each document (tweet) is composed of words drawn from a mixture of topics
* Fixed number of topics, chosen in advance

--

We will use another example, extracting tweets about Tbilisi.

--

```r
tbilisi <- search_tweets("tbilisi", n = 10000, include_rts = TRUE, lang = "en")
```

```r
dim(tbilisi)
```

```
## [1] 1099   90
```

---

# Topic modeling
## Data cleaning

Selecting the columns of interest and cleaning the data

```r
text_tb <- tbilisi %>% select(text, status_id)
df_words <- cleaning_words_abstract(text_tb, path_tools)
```

---

# Topic modeling
## Data cleaning

To reduce noise for the model, we filter out the words that appear only once throughout the data set

```r
# Recompute the word frequencies for this data set first
word_freq <- df_words %>% count(word_am_lem)

terms_extract <- data.frame(word_am_lem = word_freq$word_am_lem[which(word_freq$n == 1)],
                            stringsAsFactors = FALSE)
df_words <- df_words %>% anti_join(terms_extract)
```

And the word `tbilisi`, since it appears in all of the tweets

```r
terms_extract <- data.frame(word_am_lem = "tbilisi", stringsAsFactors = FALSE)
df_words <- df_words %>% anti_join(terms_extract)
```

---

# Topic modeling
## Formatting the cleaned data

```r
# Counting how many times a word appears in each tweet
datosFreq <- df_words %>% count(status_id, word_am_lem)
head(datosFreq)
```

```
## # A tibble: 6 x 3
##   status_id           word_am_lem      n
##   <chr>               <chr>        <int>
## 1 1335883242021543938 calculator       1
## 2 1335883242021543938 come             1
## 3 1335883242021543938 dec              1
## 4 1335883242021543938 event            1
## 5 1335883242021543938 facebook         1
## 6 1335883242021543938 georgialabor     1
```

```r
# Transforming into a Document Term Matrix class
data_dtm <- datosFreq %>% cast_dtm(document = status_id, term = word_am_lem, value = n)
```

---

# Topic modeling

Fitting an LDA with **4** hidden topics, with the `topicmodels` package.

```r
res1 <- LDA(x = data_dtm, k = 4, control = list(alpha = 1))
```

One output of this model is the estimated distribution of words per topic (beta). We use these values to create word clouds per topic `\(\rightarrow\)` interpret the topics

--

```r
# beta: the probability of that term being generated from that topic
papers_beta <- tidytext::tidy(res1, matrix = "beta")
head(papers_beta)
```

```
## # A tibble: 6 x 3
##   topic term            beta
##   <int> <chr>          <dbl>
## 1     1 calculator 5.09e- 29
## 2     2 calculator 1.58e-138
## 3     3 calculator 2.21e-139
## 4     4 calculator 1.51e-  3
## 5     1 come       1.94e-  3
## 6     2 come       2.49e-133
```
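---

# Topic modeling

Before thresholding, it can help to glance at the strongest terms in each topic. A quick sketch (assuming a dplyr version with `slice_max()`; `top_terms` is just an illustrative name):

```r
# Top 10 terms per topic, ranked by beta
top_terms <- papers_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```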
---

# Topic modeling

We filter on beta so that only a few words appear in each cloud

```r
papers_beta$topic_lab <- as.factor(papers_beta$topic)
values_thresh <- 0.003
topic_sample <- papers_beta %>% filter(beta > values_thresh) #%>% select(term, beta)
```

```r
ggplot(topic_sample, aes(label = term, size = beta, color = beta)) +
  geom_text_wordcloud_area(eccentricity = 1, grid_margin = 0) +
  scale_size_area(max_size = 40) +
  theme_bw() +
  scale_colour_gradientn(colors = c('#253494', 'palegoldenrod', 'orangered'), values = c(0, .25, 1)) +
  facet_wrap(~topic_lab, scales = "free", shrink = FALSE) +
  theme(strip.text = element_text(size = 45),
        plot.margin = margin(t = 0, r = 0, b = 0, l = 0, unit = "pt"),
        strip.background = element_rect(fill = 'white'))
```

---

# Topic modeling

<img src="./RladiesLima2019/wordcloud_tb.png" width="90%" style="display: block; margin: auto;" />
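---

# Topic modeling

The model also estimates, for each tweet, its mixture of topics (gamma). A minimal sketch of pulling the dominant topic per tweet, assuming `res1` from above (`papers_gamma` and `main_topic` are illustrative names):

```r
# gamma: the estimated proportion of each topic within each document (tweet)
papers_gamma <- tidytext::tidy(res1, matrix = "gamma")

# Dominant topic per tweet
main_topic <- papers_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

head(main_topic)
```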