Analysis of Umbrella Academy's scripts - Part 1

23 Feb 2019

text mining / umbrella academy / series

tidytext

The bible of text mining with R (by Julia Silge and David Robinson) is very, very helpful !

I would like to see the evolution of the characters through the episodes but the scripts that I found no permit to do this because they don’t precise who is speaking 😞.

Well, you have the scripts in a data frame, but not very tidy.

library(tidyverse)
library(tidytext)

Words frequency

df <- readRDS("df.RDS") %>%
  mutate(episode_number = str_extract(episodes_titles, "\\d*"), # extract 1 chiffre ou plus
         episode_title = str_extract(episodes_titles, "(?<=\\.).*")) %>%     #, #extract tout .* ce qui suit un point \\.
        # script = str_remove_all(episodes_scripts, "\\[.*\\]")) %>% # enleve tous les [ blablabla ]
  mutate(script = str_remove(episodes_scripts, "1")) %>% # enleve le 
  select(-episodes_scripts, -episodes_titles) 

df <- df %>%
  unnest_tokens(word, script, token = "words", to_lower = TRUE) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n>20)

## Joining, by = "word"

ggplot(df[1:25,], aes(reorder(word,n), n, fill = "red")) + 
  geom_bar(stat = "identity") +
  #facet_wrap(~episode_number, nrow = 1) +
  geom_text(aes(label= as.character(n)), check_overlap = TRUE, size = 4) + 
  coord_flip() + 
  xlab(" ")

Yes, that’s all for today, tomorrow I’ll analyze them, if 👶 wants it 😀

Analysis of Umbrella Academy's scripts - Part 1

Words frequency

Share!