Scraping Umbrella Academy's scripts
Why Umbrella Academy?
I’m a big fan of Ellen Page! I really love her 😍. Juno is one of my favorite movies; no, actually, it’s my favorite movie. Besides, the soundtrack is my ⏰ 🎶.
So I had to watch this series, and it is awesome! (Okay, maybe I’m a little biased.) I’ve only seen the first five episodes so far. I’m taking my time because I know I’ll be super sad 😭 when I’m done with the season. I’ll try not to spoil too much for myself while analyzing the scripts… challenging!
![](https://tel.img.pmdstatic.net/fit/http.3A.2F.2Fprd2-bone-image.2Es3-website-eu-west-1.2Eamazonaws.2Ecom.2Ftel.2F2018.2F10.2F08.2Ffc869c14-0bff-4651-aa98-37775ce1695f.2Ejpeg/540x400/quality/80/thumbnail.jpeg)
Scraping the scripts
To start, load the essential packages: tidyverse to keep everything tidy, rvest to scrape web pages, and glue to paste strings together tidily.
library(tidyverse)
library(rvest)
library(glue)
Then, find a web page with the scripts of the series, and store its URL in `url`.
url <- 'https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=the-umbrella-academy-2019&season=1'
![](https://media.giphy.com/media/3fibASpOsR4s94M1v4/giphy.gif)
I played a little with SelectorGadget to find the class of what I wanted. `.season-episode-title` seemed a good one, so let’s see what it contains:
url %>%
  read_html() %>%
  html_nodes('.season-episode-title')
## {xml_nodeset (10)}
## [1] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [2] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [3] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [4] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [5] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [6] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [7] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [8] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [9] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
## [10] <a href="view_episode_scripts.php?tv-show=the-umbrella-academy-2019& ...
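If you want to experiment with the selector without requesting the site, here is a minimal offline sketch. The HTML fragment is made up for illustration (the titles and hrefs are placeholders, not the real page content); only the class name comes from the page above.

```r
library(rvest)

# Hypothetical fragment imitating the structure of the season page
page <- read_html('
  <div>
    <a class="season-episode-title" href="view_episode_scripts.php?episode=s01e01">Episode one</a>
    <a class="season-episode-title" href="view_episode_scripts.php?episode=s01e02">Episode two</a>
    <a class="other-link" href="index.php">Home</a>
  </div>')

# Keep only the nodes carrying the class we care about
nodes <- html_nodes(page, ".season-episode-title")

length(nodes)             # number of matching links
html_text(nodes)          # visible text of each link
html_attr(nodes, "href")  # value of the href attribute of each link
```

The link with the class `other-link` is ignored, which is the behaviour we rely on when selecting episode links on the real page.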
Good, we can get the episode titles, and the `href` attribute gives the end of the URL of each episode. We just have to paste it onto "https://www.springfieldspringfield.co.uk/".
# titles of the episodes
episodes_titles <- url %>%
  read_html() %>%
  html_nodes('.season-episode-title') %>%
  html_text()
# URLs of the episodes
episodes_urls <- url %>%
  read_html() %>%
  html_nodes('.season-episode-title') %>%
  html_attr("href") %>%
  map(~glue("https://www.springfieldspringfield.co.uk/{.}"))
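Since the `href` values are relative, another option is to resolve them against the site root with `xml2::url_absolute()` rather than pasting strings by hand. This is only a sketch of that alternative: the `hrefs` vector below is a made-up stand-in for the scraped values, and `base_url` and `episode_links` are names chosen for the example.

```r
library(xml2)

base_url <- "https://www.springfieldspringfield.co.uk/"

# Hypothetical relative links, shaped like the href values scraped above
hrefs <- c(
  "view_episode_scripts.php?episode=s01e01",
  "view_episode_scripts.php?episode=s01e02"
)

# url_absolute() works on the whole vector at once and returns a character vector
episode_links <- url_absolute(hrefs, base_url)
episode_links
```

The result is a plain character vector rather than a list, which can be handy if you later want to store the URLs as a column next to the titles.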
If you go to the web page of the first episode, for example, you’ll see that the script is in the class `.scrolling-script-container`. So I made a little function to get the script of an episode, then I applied it to the list of episode URLs with `map`.
# function to get the script of an episode
get_script <- function(url){
  url %>%
    read_html() %>%
    html_nodes(".scrolling-script-container") %>%
    html_text()
}
episodes_scripts <- episodes_urls %>%
  map(~get_script(.)) %>%
  unlist()
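When looping over several pages, a single failed request can abort the whole `map()` call. One possible safeguard, sketched here under assumptions, is to wrap the function with `purrr::possibly()` so a failure returns `NA` instead of an error, and to pause briefly between requests. `fake_get_script()` and the example URLs are hypothetical stand-ins so that the snippet runs offline; in practice you would wrap `get_script()` itself.

```r
library(purrr)

# Hypothetical stand-in for get_script(): fails on one URL to mimic a broken page
fake_get_script <- function(url) {
  if (grepl("broken", url)) stop("could not read page")
  paste("script text from", url)
}

# possibly() returns a version of the function that yields `otherwise` on error
safe_get_script <- possibly(fake_get_script, otherwise = NA_character_)

# Made-up URLs for illustration
urls <- c(
  "https://example.com/ep1",
  "https://example.com/broken",
  "https://example.com/ep3"
)

scripts <- map_chr(urls, function(u) {
  Sys.sleep(0.1)  # short pause between requests, to be gentle with the server
  safe_get_script(u)
})

is.na(scripts)  # flags the episodes that would need a retry
```

Returning `NA_character_` keeps the result the same length as the input, so titles and scripts still line up afterwards.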
Then I put titles and scripts in a tibble.
df <- tibble(episodes_titles, episodes_scripts)
saveRDS(df, "df.RDS")
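Before building the tibble, it can be worth checking that every title has a matching script, since `tibble()` needs columns of compatible lengths. The sketch below uses placeholder vectors instead of the scraped ones, and the column names `title` and `script` as well as the temporary file are choices made for the example.

```r
library(tibble)

# Placeholder vectors standing in for episodes_titles and episodes_scripts
titles  <- c("Episode one", "Episode two")
scripts <- c("first script text", "second script text")

# Fail early if a script is missing
stopifnot(length(titles) == length(scripts))

# Naming the columns explicitly keeps them short and predictable
df_demo <- tibble(title = titles, script = scripts)

# Round trip through a temporary file to check that the object reloads as saved
path <- tempfile(fileext = ".RDS")
saveRDS(df_demo, path)
all.equal(readRDS(path), df_demo)
```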
Now we have a tidy data frame with all the scripts…
![](https://media.giphy.com/media/28FpALDPHljpalhdRa/giphy.gif)
Yes, that’s all for today. Tomorrow I’ll analyze them, if 👶 allows it 😀