# Application of Natural Language Processing (NLP) in Transportation Studies
### Subasish Das, Associate Transportation Researcher, TTI (February 21, 2019)
 <a href="https://twitter.com/subasish_das">@subasish_das</a> <a href="https://github.com/subasish">@subasish</a> <a href="mailto:s-das@tti.tamu.com" class="email">s-das@tti.tamu.com</a> <a href="http://subasish.github.io" class="uri">http://subasish.github.io</a>

---

background-image: url(tti_lg.jpg)
background-size: 350px
background-position: 95% 5%
class: principles

### About me

* Started at TTI in August 2015
  * Member of the **Roadway Safety** team
  * Leading one of the four USDOT **Safety Data Initiative (SDI)** projects
  * Passion: Interactive Data Visualization

* Previous Life
  * Ph.D. student for 5 years (2010-2015)
  * Roadway Engineer in Dubai, UAE (2008-2009)
  
* PhD in Systems Engineering at UL Lafayette (July 2015)
  * Fun Fact: Tried to get another PhD in Statistics. Failed to do so. Completed 16 Graduate level courses in Statistics.

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### What is NLP?

```r
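library(dplyr)     # provides the %>% pipe
library(tidytext)

# text_df: a data frame with one row per line of text in a `text` column
text_df %>%
  unnest_tokens(word, text)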
```

---

### Some of My NLP Papers

---
background-image: url(CC01.gif)
background-size: 280px
background-position: 95% 10%
class: inverse, bottom, principles

## Basic Workflow

* General guideline
  * Using *tidyverse*
  * Familiarity with dplyr, markdown, and git. 
  * Use 'RProject' from raw data through model development.
   
* dplyr Package
  * A great tool for data manipulation

* Markdown
  * Can create html and pdf files
  * Makes the work reproducible

* Git
  * Share code with others
  * Can reach a broad community

---

### dplyr

---

### Why RMD?

---

### Framework (for example: Twitter Mining)

---

### Framework (for example: Topic Modeling)

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### Token

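```r
library(dplyr)
library(tidytext)

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

# One row per line of the poem
text_df <- tibble(line = 1:4, text = text)

# One row per word (lowercased, punctuation stripped)
text_df %>%
  unnest_tokens(word, text)
```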

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### Most Frequent Word

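```r
library(dplyr)
library(tidytext)
library(ggplot2)

# tidy_books: one row per word, as produced by unnest_tokens()
tidy_books <- tidy_books %>%
* anti_join(stop_words)

# Word counts, most frequent first
tidy_books %>%
  count(word, sort = TRUE)

# Bar chart of the words that occur more than 600 times
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
```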

---

### Uni-gram Example 1

---

### Uni-gram Example 2

---
background-image: url(cnarr.jpg)
background-size: 70%

### Raw Text

---

### Example from a Project

---

### Bi-gram
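
A minimal base R sketch of what n-gram tokenization does, assuming a made-up crash narrative sentence. The helper `make_ngrams()` is hypothetical and only for illustration; in the tidy workflow used in these slides, `unnest_tokens(bigram, text, token = "ngrams", n = 2)` plays the same role.

```r
# Hypothetical helper: split text into lowercase words, then glue
# each run of n neighbouring words into one n-gram.
make_ngrams <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up sentence, not taken from any project data
narrative <- "Driver lost control and the vehicle ran off the road"

bigrams  <- make_ngrams(narrative, n = 2)  # pairs of neighbouring words
trigrams <- make_ngrams(narrative, n = 3)  # triples of neighbouring words

head(bigrams, 3)
sort(table(bigrams), decreasing = TRUE)
```

Setting `n = 3` gives the tri-grams; counting the resulting strings works the same way as counting single words.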

---

### Tri-gram

---

### Word Cloud

```r

library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```

---

### Sentiment Analysis
---

### Related Code

```r
library(dplyr)
library(tidyr)
library(tidytext)

* get_sentiments("afinn") ## AFINN lexicon (Finn Årup Nielsen): word scores from -5 to 5

# "joy" is an NRC emotion category; AFINN only has numeric scores
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "MyBook1") %>%
*  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

# Bing labels words positive/negative, which spread() turns into columns
MyBook1_sentiment <- tidy_books %>%
* inner_join(get_sentiments("bing")) %>%
* count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

```
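
To see what the `linenumber %/% 80` index and the positive minus negative score do, here is a small base R sketch. The lexicon and the words are made up for illustration, and the chunk size is 3 lines instead of 80 so the result fits on a slide.

```r
# Toy lexicon and toy text, both made up for illustration
toy_lexicon <- data.frame(
  word = c("safe", "clear", "crash", "fatal"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)
toy_words <- data.frame(
  linenumber = c(1, 2, 2, 3, 4, 5, 5, 6),
  word = c("safe", "clear", "road", "crash", "fatal", "crash", "safe", "clear"),
  stringsAsFactors = FALSE
)

# Integer division groups consecutive lines into chunks
# (3 lines per chunk here, 80 on the slide)
toy_words$index <- toy_words$linenumber %/% 3

# merge() acts as an inner join: words missing from the lexicon are dropped
scored <- merge(toy_words, toy_lexicon, by = "word")

# Count positive and negative words per chunk, then take the difference
counts <- table(scored$index, scored$sentiment)
net <- counts[, "positive"] - counts[, "negative"]
net
```

A chunk with more negative than positive words gets a negative net score, which is what makes the sentiment trajectory plottable along the text.
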
---
### Package cleanNLP

```r

library(cleanNLP)
cnlp_get_token(sotu) %>%
  group_by(id) %>%
  summarize(n = n()) %>%
  left_join(cnlp_get_document(sotu)) %>%
  ggplot(aes(year, n)) +
    geom_line(color = grey(0.8)) +
*    geom_point(aes(color = sotu_type)) +
    geom_smooth()
```

---
### Principal Component Analysis

```r

library(cleanNLP)
pca <- cnlp_get_token(sotu) %>%
 filter(pos %in% c("NN", "NNS")) %>%
* cnlp_get_tfidf(min_df = 0.05, max_df = 0.95, type = "tfidf", tf_weight = "dnorm") %>%
 cnlp_pca(cnlp_get_document(sotu))
```
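
The PCA above runs on a tf-idf matrix. As a rough base R sketch of that weighting, assuming three made-up tokenized documents and the plain definition tf × log(N / df): the exact weighting that `cnlp_get_tfidf()` applies with `tf_weight = "dnorm"` may differ from this.

```r
# Three made-up documents, already tokenized
docs <- list(
  d1 = c("crash", "road", "wet", "road"),
  d2 = c("crash", "driver", "speed"),
  d3 = c("crash", "pedestrian", "road")
)

vocab <- sort(unique(unlist(docs)))

# Term frequency: share of each document's tokens taken by each term
tf <- t(sapply(docs, function(d) {
  as.numeric(table(factor(d, levels = vocab))) / length(d)
}))
colnames(tf) <- vocab

# Inverse document frequency: log(number of documents / documents containing the term)
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# tf-idf: rescale each column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, "*")
round(tfidf, 3)
```

A word that appears in every document ("crash" here) gets a weight of zero, while a word concentrated in one document ("wet") gets a high weight.
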

---
background-image: url(whyl.gif)
background-size: 400px
background-position: 95% 10%
class: inverse, bottom, principles

## What have we learnt so far?

* General rule of thumb: data cleaning, frequency, and knowledge extraction.

* Some new words: corpus, corpora, stop words, senti-lexicon
   
* Things you can do to be a proactive R coder:
   * Use *dplyr* and *tidyverse*
   * Create *.RMD* and .html for reproducibility
   * Use *git* to push your code

---

### Black Box Model = Artificial Intelligence

---

### Git Repository

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### Text Mining with Interpretable Machine Learning

```r
library(lime)
library(xgboost) # the classifier
library(text2vec) # used to build the BoW matrix
data(train_sentences)
data(test_sentences)

# Tokenize data
get_matrix <- function(text) {
* it <- itoken(text, progressbar = FALSE)
* create_dtm(it, vectorizer = hash_vectorizer())
}

# BoW matrix generation
dtm_train = get_matrix(train_sentences$text)
dtm_test = get_matrix(test_sentences$text)

```

1. `itoken()` creates a token iterator; `create_dtm()` turns it into a *document-term matrix*
2. `train_sentences` and `test_sentences` are built-in datasets from the *lime* package
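
For intuition, a document-term (bag-of-words) matrix can be sketched in base R with two made-up sentences. This sketch keeps an explicit vocabulary; `hash_vectorizer()` in the code above instead hashes words into a fixed number of columns, so its output will not look like this.

```r
# Made-up example sentences
sentences <- c("the road was wet", "the driver left the road")

# Tokenize: one character vector of words per sentence
tokens <- strsplit(tolower(sentences), "\\s+")

# Vocabulary = every distinct word; it defines the matrix columns
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each sentence
# (rows = documents, columns = terms)
dtm <- t(sapply(tokens, function(tk) {
  as.numeric(table(factor(tk, levels = vocab)))
}))
colnames(dtm) <- vocab
dtm
```

Each row is one sentence and each cell is a word count, which is the numeric input a classifier such as XGBoost needs.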

---
### XGBoost Model

```r
# Create boosting model for binary classification (-> logistic loss)
# Other parameters are quite standard
param <- list(max_depth = 7, 
 eta = 0.1, 
 objective = "binary:logistic", 
 eval_metric = "error", 
 nthread = 1)

xgb_model <- xgb.train(
 param, 
 xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
 nrounds = 50
)

# We use a (standard) threshold of 0.5
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"

# Accuracy
print(mean(predictions == test_labels))
# 0.84

```

1. Develop model from train data
2. Apply the model on test data
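
The thresholding and accuracy steps can be checked by hand on a toy case. This is a minimal sketch with made-up probabilities and labels, not output from the model above.

```r
# Made-up predicted probabilities and true labels for six test sentences
probs <- c(0.91, 0.15, 0.62, 0.48, 0.08, 0.77)
truth <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)

# Same rule as on the slide: probability above 0.5 means the positive class
predicted <- probs > 0.5

# Accuracy is the share of predictions that match the true labels
accuracy <- mean(predicted == truth)

# A confusion matrix shows which kind of error was made
confusion <- table(predicted = predicted, actual = truth)

accuracy
confusion
```

Accuracy alone hides whether the errors are false positives or false negatives, so the confusion matrix is worth a look when the classes are unbalanced.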

---
### Interpretation

.pull-left[
```r
sentence_to_explain <- head(test_sentences[test_labels,]$text, 5)
explainer <- lime(sentence_to_explain, model = xgb_model, 
 preprocess = get_matrix)
explanation <- explain(sentence_to_explain, explainer, n_labels = 1, 
 n_features = 2)
plot_features(explanation)

```
]

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### Code from RPubs 1
<iframe src="https://rpubs.com/subasish/94112" frameborder="0" scrolling="yes" height="100%" width="100%" marginheight="0" marginwidth="0"></iframe>

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### Code from RPubs 2
<iframe src="https://rpubs.com/subasish/469641" frameborder="0" scrolling="yes" height="100%" width="100%" marginheight="0" marginwidth="0"></iframe>

---
background-image: url( )
background-size: 160px
background-position: 95% 5%

### What about Punctuation? 
<iframe src="https://www.c82.net/work/?id=347" frameborder="0" scrolling="yes" height="100%" width="100%" marginheight="0" marginwidth="0"></iframe>

---
background-image: url(ayd.gif)
background-size: 350px
background-position: 95% 10%
class: inverse, bottom, principles

* Almost!

* Glimpse of the following:
   * text mining
   * n-gram, sentiment analysis
   * dplyr and rmd
  
* More in the future:
   * tf-idf
   * structural topic modeling
   * keras and tensorflow

---
background-image: url(Ceric6.gif)
background-size: 25%
class: principles, center

## Thanks for listening! Questions?

<a href="https://twitter.com/subasish_das">@subasish_das</a> 
 <a href="https://github.com/subasish">@subasish</a> 
 <s-das@tti.tamu.com> 
 <http://subasish.github.io>