Long time no see, R…. Well, not really, but I’ve been heads down working on some larger projects lately and didn’t have time to blog. One of the projects should be published soon (plenty of hill-shading, maps & data viz), another is still in the making (something JS-intense with Google Apps Script & clasp), but one - my biggest gig so far - is online now with all features (incl. my first-ever scrollytelling implementation) and will serve as my use case in this post.

2 Theory: Crawl - Extract - Archive

In theory, the approach is a simple 3-step process (sketched in code right after the list):

  1. crawl all (or a subset of) pages of a single website / web project (let’s focus on HTML content first; JS requires more work, but solutions for scraping dynamic / JS-rendered content do exist)
  2. extract all (or a subset of) external / outbound URLs (http/https)
  3. archive each external URL with a POST request to the Wayback Machine API
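
In code, the whole flow boils down to the two helper functions built below; the target URL here is just a placeholder:

external_urls <- fetchURLs("https://www.example.com")  # steps 1 & 2: crawl the site & extract outbound links
archiveURLs(external_urls)                              # step 3: POST each link to the Wayback Machine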

Fortunately, #Rstats-Twitter is always there to help out, and Peter Meissner pointed me to Salim Khalil’s Rcrawler package (cf. Khalil/Fakir 2017).

3 “Minimum Viable Demo” - fetchURLs(target) & archiveURLs(URLs)

Rcrawler’s syntax might seem a bit unorthodox, but the package is really feature-rich, and we can actually solve steps 1 & 2 with it (and purrr). Having crawled all external links, all that’s left is to figure out how to send a POST request to the Wayback API; after that we can work on refinement and parallelization.

3.1 Setup

library(tidyverse)
library(Rcrawler)
# use all CPU cores (~threads) minus 1 to parallelize scraping
cores <- parallel::detectCores() - 1
print(paste0("CPU threads available: ", cores))
## [1] "CPU threads available: 7"

3.2 fetchURLs(url) - Function to crawl & scrape a single Website

Rcrawler() implicitly assigns an INDEX object to the global environment. INDEX$Url contains the vector of crawled page URLs.
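
A minimal sketch of what that means in practice (placeholder site, crawl settings as in the setup above):

Rcrawler(Website = "https://www.example.com", no_cores = cores, no_conn = 8, saveOnDisk = FALSE)
head(INDEX$Url)  # INDEX now sits in the global environment; Url holds the crawled page URLs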

fetchURLs <- function(url, ignoreStrings = NULL) {
  # crawl our target website's pages
  Rcrawler(
    Website = url,
    no_cores = cores,
    no_conn = 8,        # don't abuse this ;)
    saveOnDisk = FALSE  # we don't need the HTML files of our own website
  )
  # extract the external URLs from each crawled page
  # (capitalization of "ExternalLInks" follows the package's argument name)
  urls <- purrr::map(INDEX$Url, ~ LinkExtractor(.x,
                                                ExternalLInks = TRUE,
                                                removeAllparams = TRUE))
  # only keep the external links of each result
  # (rvest::pluck() extracts the "ExternalLinks" element from every list item)
  urls_df <- urls %>%
    rvest::pluck("ExternalLinks") %>%
    map_df(tibble)
  # helper: optional string/regex pattern (e.g. "foo|bar") to filter out specific unwanted links
  if (!is.null(ignoreStrings)) {
    urls_df <- urls_df %>%
      filter(!str_detect(`<chr>`, ignoreStrings))
  }
  urls_df <- urls_df %>%
    select(`<chr>`) %>%              # the column which contains the links
    pull() %>%                       # we just need a single vector of URLs
    unique() %>%                     # keep only unique URLs
    str_replace("http:", "https:")   # make all links https
  return(urls_df)
}
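
Called against a real site, the helper returns a deduplicated character vector of outbound https links; the URL and the ignoreStrings pattern below are purely illustrative:

external_urls <- fetchURLs("https://www.example.com", ignoreStrings = "twitter\\.com")
head(external_urls)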

3.3 archiveURL() - Function to archive a single URL with a POST request

TODO: Check first with the internetarchive package whether a link has already been archived and, if so, only re-archive it if the existing snapshot is older than n days.

TODO: return 2 items: originalURL & archivedURL (see the sketch after the function below).

archiveURL <- function(target_url) {
  # community-built Wayback Machine API / endpoint
  endpoint <- "https://pragma.archivelab.org"
  # an alternative approach might be a quick & dirty GET request:
  # https://github.com/motherboardgithub/mass_archive/blob/master/mass_archive.py
  # the POST request to the API
  response <- httr::POST(url = endpoint,
                         body = list(url = target_url),
                         encode = "json")
  status_code <- response$status_code
  # only archive if the POST request was successful
  # TODO: needs more robustness, i.e. some returned paths are not valid
  # hypothesis: not valid because the URL was archived recently
  # TODO: add check whether the URL has been archived already
  if (status_code == 200) {
    wayback_path <- httr::content(response)$wayback_id # returns a path
    wayback_url <- paste0("https://web.archive.org", wayback_path)
    print(paste0("Success - Archived: ", target_url))
  } else {
    # return NA (not NULL) so that map_chr() in the wrapper below doesn't fail
    wayback_url <- NA_character_
    print(paste0("Error Code: ", status_code, " - couldn't archive ", target_url))
  }
  return(wayback_url)
}
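
As a first sketch for the second TODO above, a thin (hypothetical) wrapper could return both items by bundling the input and the result of archiveURL() into a one-row tibble:

archiveURLPair <- function(target_url) {
  tibble::tibble(
    original_url = target_url,
    wayback_url  = archiveURL(target_url)
  )
}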

3.4 archiveURLs() - Wrapper function to archive a vector of URLs

archiveURLs <- function(urls) {
  result <- purrr::map_chr(urls, archiveURL) %>%
    tibble(wayback_url = .)
  return(result)
}

3.5 Proof-of-Concept - Execution
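
With both helpers in place, a proof-of-concept run boils down to two calls; the target URL below is a placeholder for the actual client project:

external_urls <- fetchURLs("https://www.example.com")
archived_df <- archiveURLs(external_urls) %>%
  mutate(original_url = external_urls)  # keep the original URLs next to their snapshots
archived_df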

4 Questions/TBD

  • How to parallelize archival? (async calls don’t seem to be supported by httr, only by curl)
  • How to obey quota limitations (with something like ~setTimeout(), i.e. Sys.sleep() in R)?
  • maybe use the internetarchive package to check first whether a URL has already been archived
  • or use generic API GET requests to retrieve URL, status and ID https://archive.readme.io/docs/website-snapshots (see the sketch below)
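
For the last two bullets, the Wayback Machine’s availability endpoint answers a plain GET request with the closest existing snapshot, which could serve as a pre-check before POSTing. A minimal sketch (isArchived() is a hypothetical helper, not part of the code above):

isArchived <- function(target_url) {
  resp <- httr::GET("https://archive.org/wayback/available",
                    query = list(url = target_url))
  closest <- httr::content(resp)$archived_snapshots$closest
  if (is.null(closest)) {
    return(NA_character_)  # no snapshot yet
  }
  closest$timestamp  # e.g. "20190321094837" - could be compared against Sys.time() for an age check
}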

That’s it for today. I hope this is useful to some of you! If you have a proof-of-concept for making parallel async GET/POST requests, hit me up!