Analysis of Umbrella Academy's scripts - Part 1
The bible of text mining with R (by Julia Silge and David Robinson) is very, very helpful !
I would like to see the evolution of the characters through the episodes but the scripts that I found no permit to do this because they don’t precise who is speaking 😞.
Well, you have the scripts in a data frame, but not very tidy.
library(tidyverse)
library(tidytext)
Words frequency
df <- readRDS("df.RDS") %>%
mutate(episode_number = str_extract(episodes_titles, "\\d*"), # extract 1 chiffre ou plus
episode_title = str_extract(episodes_titles, "(?<=\\.).*")) %>% #, #extract tout .* ce qui suit un point \\.
# script = str_remove_all(episodes_scripts, "\\[.*\\]")) %>% # enleve tous les [ blablabla ]
mutate(script = str_remove(episodes_scripts, "1")) %>% # enleve le
select(-episodes_scripts, -episodes_titles)
df <- df %>%
unnest_tokens(word, script, token = "words", to_lower = TRUE) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n>20)
## Joining, by = "word"
ggplot(df[1:25,], aes(reorder(word,n), n, fill = "red")) +
geom_bar(stat = "identity") +
#facet_wrap(~episode_number, nrow = 1) +
geom_text(aes(label= as.character(n)), check_overlap = TRUE, size = 4) +
coord_flip() +
xlab(" ")
Yes, that’s all for today, tomorrow I’ll analyze them, if 👶 wants it 😀