Scraping Umbrella Academy's scripts

Why Umbrella Academy?

I’m a big fan of Ellen Page! I really love her 😍. Juno is one of my favorite movies; no, actually, it’s my favorite movie. Besides, its soundtrack is my ⏰ 🎶.
So I had to watch this series, and it is awesome! (Okay, maybe I’m a little biased.) I’ve only seen the first five episodes so far. I’m taking my time because I know I’ll be super sad 😭 when I’m done with the season. I’ll try not to spoil too much for myself while analyzing the scripts… challenging!


Scraping the scripts

To start, load the essential packages: tidyverse to keep everything tidy, rvest to scrape web pages, and glue to paste strings together the tidy way.

library(tidyverse)
library(rvest)
library(glue)

Then find a web page that lists the scripts of the series, and store its address in url.

url <- ''


I played a little with SelectorGadget to find the class of what I wanted. The class .season-episode-title looked like a good candidate, so let’s see what it contains:

url %>%
  read_html() %>%
  html_nodes('.season-episode-title')
## {xml_nodeset (10)}
##  [1] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [2] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [3] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [4] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [5] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [6] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [7] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [8] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
##  [9] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...
## [10] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019&amp; ...

Good: we can get the episode titles from these links, and the href attribute gives the end of each episode’s URL, so we just have to paste it to “”.

#titles of the episodes
episodes_titles <- url %>%
  read_html() %>%
  html_nodes('.season-episode-title') %>%
  html_text() #assumed final step: keep the text of each link

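#url of the episodes
#base_url is an assumed name for the site address the hrefs are pasted to (left blank, like url)
base_url <- ''
episodes_urls <- url %>%
  read_html() %>%
  html_nodes('.season-episode-title') %>%
  html_attr("href") %>%
  glue("{base_url}{href}", href = .)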
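
If you want to see what these two pipelines give back without hitting the site, here is a minimal sketch on a hand-written HTML snippet. The markup, the titles and the example.com address are made up for illustration; only the .season-episode-title class and the view_episode_scripts.php pattern come from the output above. It assumes html_text() for the titles and uses xml2's url_absolute() as one possible way to join each href to a base address.

```r
library(rvest)  # also provides the %>% pipe
library(xml2)   # url_absolute()

# Hypothetical page: two links carrying the class seen above
page <- read_html('
  <div>
    <a class="season-episode-title" href="view_episode_scripts.php?episode=s01e01">1. First episode</a>
    <a class="season-episode-title" href="view_episode_scripts.php?episode=s01e02">2. Second episode</a>
  </div>')

links <- page %>% html_nodes(".season-episode-title")

# The link text gives the titles
titles <- links %>% html_text()

# The href gives the end of each url; resolve it against a placeholder base
base_url <- "https://example.com/"  # assumption, not the real site
urls <- links %>% html_attr("href") %>% url_absolute(base_url)

titles
urls
```

url_absolute() resolves relative links the way a browser would, so it also copes with hrefs that start with a slash, which plain pasting does not.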

If you go to the web page of the first episode, for example, you’ll notice that the script sits in an element with the class .scrolling-script-container. So I wrote a little function to get the script of an episode, then applied it to the list of episode urls with map.

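#function to get the script of an episode
get_script <- function(url){
  url %>%
    read_html() %>%
    html_nodes(".scrolling-script-container") %>%
    html_text() #assumed final step: extract the script text
}

#apply it to every episode url
episodes_scripts <- episodes_urls %>%
  map(~get_script(.)) %>%
  unlist() #assumed final step: flatten to a character vector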
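
The same "one function, then map" pattern can be tried offline. This is only a sketch: the two HTML strings are made-up stand-ins for episode pages, get_script_demo is a hypothetical name, and it swaps in html_node() (first match only) and map_chr() so the result is directly a character vector.

```r
library(rvest)  # also provides the %>% pipe
library(purrr)

# Hypothetical stand-ins for two episode pages (made-up text)
pages <- c(
  '<div class="scrolling-script-container">Script of episode one.</div>',
  '<div class="scrolling-script-container">Script of episode two.</div>'
)

# Same shape as get_script(), but read_html() receives literal HTML instead of a url
get_script_demo <- function(html) {
  html %>%
    read_html() %>%
    html_node(".scrolling-script-container") %>%  # first match only
    html_text(trim = TRUE)                        # strip surrounding whitespace
}

# map_chr() returns a character vector with one script per page
scripts <- pages %>% map_chr(get_script_demo)
scripts
```

map_chr() fails loudly if a page does not yield exactly one string, which is a handy check before putting everything in a tibble.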

Then I put titles and scripts in a tibble.

df <- tibble(episodes_titles, episodes_scripts)

saveRDS(df, "df.RDS")
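
To pick the data up again in a later session, readRDS() is the counterpart of saveRDS(). A small sketch with made-up rows and a temporary file (demo, path and restored are names chosen for the illustration):

```r
library(tibble)

# Made-up rows standing in for the scraped titles and scripts
demo <- tibble(
  episodes_titles  = c("1. First episode", "2. Second episode"),
  episodes_scripts = c("Script of episode one.", "Script of episode two.")
)

# Write to a temporary file, then read it back
path <- tempfile(fileext = ".RDS")
saveRDS(demo, path)
restored <- readRDS(path)

restored
```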

Now we have a tidy data frame with all the scripts…


Yes, that’s all for today. Tomorrow I’ll analyze them, if 👶 lets me 😀

