I'll explain the spn.sh script line by line. This is a bash script for automating submissions to the Internet Archive's Save Page Now (SPN) service. spn.sh:1
The script starts with a shebang to run with bash. spn.sh:1
It sets up a trap to catch interrupt signals (Ctrl+C) and termination signals, calling the abort function when triggered. spn.sh:3
The abort function prints the data folder location and exits with status 1. spn.sh:5-12
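This is the usual bash cleanup-on-signal pattern; a minimal standalone sketch of the same idea (the data folder path is a placeholder):

```bash
#!/bin/bash
# Minimal sketch of the trap/abort pattern spn.sh uses: on Ctrl+C or SIGTERM,
# report where the session data lives, then exit with a non-zero status.
dir="/tmp/example-spn-data"     # placeholder for the real data folder

abort() {
    echo "== Aborting $(basename "$0") =="
    echo "Data folder: $dir"
    exit 1
}

trap "abort" SIGINT SIGTERM

sleep 60    # stand-in for the script's real work
```

Because trap stores the string "abort" and only evaluates it when a signal arrives, it is fine that spn.sh registers the trap before the function is defined.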
Lines 14-28 initialize all script variables with default values: spn.sh:14-28
- auth: S3 API authentication credentials
- curl_args: Additional curl arguments
- post_data: Capture request options
- custom_dir: Custom data folder location
- dir_suffix: Suffix for data folder name
- no_errors: Flag to exclude errors from archiving
- outlinks: Flag to save detected outlinks
- parallel: Maximum parallel jobs (default 20)
- quiet: Flag to discard JSON logs
- resume: Folder path for resuming sessions
- ssl_only: Flag to force HTTPS
- list_update_rate: Seconds between list updates (default 3600)
- capture_job_rate: Seconds between job starts (default 2.5)
- include_pattern and exclude_pattern: Regex patterns for outlinks

The print_usage function displays help text explaining all command-line options. spn.sh:30-70
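Putting the options together, typical invocations look like this (the keys, URLs, and paths below are placeholders):

```bash
# Archive every URL listed in urls.txt, authenticated, 10 capture jobs at a time
./spn.sh -a myaccesskey:mysecret -p 10 urls.txt

# Archive two URLs given directly, forcing HTTPS and not saving error pages
./spn.sh -s -n https://example.com/a http://example.com/b

# Also save outlinks matching a regex, and pass extra arguments through to curl
./spn.sh -c '--proxy socks5h://127.0.0.1:9050' -o '^https?://example\.com/' urls.txt

# Resume an aborted session from its data folder
./spn.sh -r ~/.local/share/spn-data/2024-05/1714000000
```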
Lines 72-92 use getopts to parse command-line flags and set corresponding variables. spn.sh:72-92
Lines 94-158 handle resuming an aborted session: spn.sh:94-158
- Checks that the required session files exist (index.txt, success.log)
- If outlinks.txt exists, merges it with index.txt and removes already-captured URLs
- Converts URLs to HTTPS if ssl_only is set

For new sessions, the script determines if the argument is a file or direct URLs. spn.sh:146-158
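One detail of the resume merge: captures.log stores Wayback paths of the form /web/&lt;timestamp&gt;/&lt;url&gt;, so the script strips that prefix before comparing those entries against plain URLs (spn.sh:114). In isolation:

```bash
# captures.log lines look like /web/20240101000000/https://example.com/page;
# stripping the /web/<timestamp>/ prefix makes them comparable to plain URLs.
printf '%s\n' \
    '/web/20240101000000/https://example.com/page' \
    'https://example.com/other' |
    sed -Ee 's|^/web/[0-9]+/||g'
# Output:
#   https://example.com/page
#   https://example.com/other
```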
Lines 160-226 create the data folder: spn.sh:160-226
- If a custom location is specified with -f, uses that location
- Otherwise, creates a timestamped folder in a platform-specific default location (macOS: ~/Library/spn-data, Linux: ~/.local/share/spn-data or ~/spn-data)

Converts URLs to HTTPS if the ssl_only flag is set. spn.sh:228-231
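The HTTPS conversion is a single sed expression: it prepends https:// to each URL (dropping any existing http:// or https:// scheme and leading whitespace), then undoes the change for ftp:// URLs so they pass through untouched. A quick demonstration with sample input:

```bash
# Same sed expression as spn.sh:230, applied to three sample lines
printf '%s\n' \
    'http://example.com/a' \
    '  example.com/b' \
    'ftp://example.com/c' |
    sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g'
# Output:
#   https://example.com/a
#   https://example.com/b
#   ftp://example.com/c
```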
Sets POST data options, adding capture_all=on unless no_errors is specified. spn.sh:233-241
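In the default case post_data ends up as just capture_all=on; with a hypothetical -d 'foo=bar' it would become foo=bar&capture_all=on. That string is then sent as ordinary form data when a capture is submitted. A sketch of the authenticated request the script builds (spn.sh:279), with placeholder keys and URL:

```bash
# Placeholder credentials and URL; the real script fills these from -a and its URL list
auth="myaccesskey:mysecret"
post_data="capture_all=on"

curl -s -m 60 -X POST \
    --data-urlencode "url=https://example.com/page" \
    -d "${post_data}" \
    -H "Accept: application/json" \
    -H "Authorization: LOW ${auth}" \
    "https://web.archive.org/save/"
```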
Creates tracking files and deduplicates the URL list. spn.sh:243-262
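Deduplication relies on the classic awk idiom !seen[$0]++, which prints each line only the first time it appears while preserving order:

```bash
# Prints a, b, c: later duplicates are dropped, original order is kept
printf '%s\n' a b a c b | awk '!seen[$0]++'
```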
The capture function (lines 265-547) is the core logic: spn.sh:265-547
It submits a URL to SPN and monitors the job status through multiple retry attempts. spn.sh:270-543
The function handles various error conditions including rate limits, timeouts, and server errors. spn.sh:296-397
On success, it logs the capture and extracts outlinks if enabled. spn.sh:460-483
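A reduced sketch of the polling half of that cycle, with the retries, stall detection, and most error branches omitted (the job ID is a placeholder; the endpoint and JSON fields are the ones the script itself parses):

```bash
job_id='spn2-0123456789abcdef'    # placeholder; really returned by the submission request

while :; do
    sleep 5
    request=$(curl -s -m 60 "https://web.archive.org/save/status/$job_id")
    status=$(echo "$request" | grep -Eo '"status":"[^"]*"' | head -1)
    if [[ "$status" == '"status":"success"' ]]; then
        # Build the same /web/<timestamp>/<url> line the script appends to captures.log
        timestamp=$(echo "$request" | grep -Eo '"timestamp":"[0-9]*"' | sed -Ee 's/^"timestamp":"(.*)"/\1/')
        url=$(echo "$request" | grep -Eo '"original_url":"[^"]*"' | sed -Ee 's/^"original_url":"(.*)"/\1/')
        echo "/web/$timestamp/$url"
        break
    elif [[ "$status" == '"status":"error"' ]]; then
        echo "capture failed" >&2
        break
    fi
    # any other status (notably "pending") means keep polling
done
```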
The get_list function processes failed captures and new outlinks, returning URLs to retry. spn.sh:549-587
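Both here and in the resume logic, "remove lines that are already in index.txt" is an awk anti-join: the first pass loads one input into a lookup table, and the second pass prints only lines absent from it. The same construct with a hypothetical candidates.txt:

```bash
# Print the lines of candidates.txt that do not appear in index.txt
# (f marks which input awk is currently reading; same pattern as spn.sh:566)
awk '{ if (f==1) { r[$0] } else if (!($0 in r)) { print $0 } }' \
    f=1 index.txt f=2 candidates.txt
```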
Lines 589-725 implement parallel processing mode, used when parallel is at least 2: capture jobs run as background processes (capped at 60), and the loop waits whenever the job limit, the lock file, or the daily capture limit is hit. spn.sh:597-725
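The throttle is built from plain background jobs: each capture is launched with &, and the loop then polls jobs -p until the number of running children drops below the limit read from max_parallel_jobs.txt. A stripped-down sketch of the pattern with a stand-in work function:

```bash
max_jobs=5        # spn.sh reads this from max_parallel_jobs.txt
job_rate=2.5      # spn.sh reads this from capture_job_rate.txt

work() { sleep 10; }    # stand-in for the capture function

while IFS='' read -r url || [[ -n "$url" ]]; do
    work "$url" & sleep "$job_rate"
    # Block while the number of running background jobs is at or above the limit
    while ! (( $(jobs -p | wc -l) < max_jobs )); do
        sleep 1
    done
done < urls.txt    # placeholder input file

wait    # let the remaining background jobs finish before moving on
```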
Lines 727-776 implement linear (sequential) processing mode: captures run one at a time, with bc used to enforce the minimum delay between job starts. spn.sh:732-776
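Because capture_job_rate may be fractional (default 2.5) and bash arithmetic is integer-only, the linear loop delegates the comparison and subtraction to bc. The same pacing logic in isolation:

```bash
rate=2.5            # minimum seconds between consecutive job starts
start="$SECONDS"

sleep 1             # stand-in for a capture that finished quickly

elapsed=$(( SECONDS - start ))
if [[ $(bc <<< "$elapsed < $rate") == "1" ]]; then
    # Sleep for the remainder so the next job starts no sooner than $rate seconds later
    sleep "$(bc <<< "$rate - $elapsed")"
fi
```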
Finally, the script removes its temporary control files (when a custom data folder was used) and prints a completion message. spn.sh:778-796
This script is designed for bulk archiving of URLs to the Internet Archive's Wayback Machine. It handles rate limiting, retries, parallel processing, and can resume interrupted sessions. The script creates various log files to track successes, failures, and captured outlinks.
1#!/bin/bash2 3trap "abort" SIGINT SIGTERM4 5function abort(){
6 echo "
7
8== Aborting $(basename "$0") ==
9Data folder: $dir
10"
11 exit 1
12}
13
14auth=''
15curl_args=()
16post_data=''
17custom_dir=''
18dir_suffix=''
19no_errors=''
20outlinks=''
21parallel='20'
22quiet=''
23resume=''
24ssl_only=''
25list_update_rate='3600'
26capture_job_rate='2.5'
27include_pattern=''
28exclude_pattern=''
29
30print_usage() {
31 echo "Usage: $(basename "$0") [options] file
32 $(basename "$0") [options] url [url]...
33 $(basename "$0") [options] -r folder
34
35Options:
36 -a auth S3 API keys, in the form accesskey:secret
37 (get account keys at https://archive.org/account/s3.php)
38
39 -c args pass additional arguments to curl
40
41 -d data capture request options, or other arbitrary POST data
42
43 -f folder use a custom location for the data folder
44 (some files will be overwritten or deleted during the session)
45
46 -i suffix add a suffix to the name of the data folder
47 (if -f is used, -i is ignored)
48
49 -n tell Save Page Now not to save errors into the Wayback Machine
50
51 -o pattern save detected capture outlinks matching regex (ERE) pattern
52
53 -p N run at most N capture jobs in parallel (default: 20)
54
55 -q discard JSON for completed jobs instead of writing to log file
56
57 -r folder resume with the remaining URLs of an aborted session
58 (settings are not carried over, except for outlinks options)
59
60 -s use HTTPS for all captures and change HTTP input URLs to HTTPS
61
62 -t N wait at least N seconds before updating the main list of URLs
63 with outlinks and failed capture jobs (default: 3600)
64
65 -w N wait at least N seconds after starting a capture job before
66 starting another capture job (default: 2.5)
67
68 -x pattern save detected capture outlinks not matching regex (ERE) pattern
69 (if -o is also used, outlinks are filtered using both regexes)"
70}
71
72while getopts 'a:c:d:f:i:no:p:qr:st:w:x:' flag; do
73 case "${flag}" in
74 a) auth="$OPTARG" ;;
75 c) declare -a "curl_args=($OPTARG)" ;;
76 d) post_data="$OPTARG" ;;
77 f) custom_dir="$OPTARG" ;;
78 i) dir_suffix="-$OPTARG" ;;
79 n) no_errors='true' ;;
80 o) outlinks='true'; include_pattern="$OPTARG" ;;
81 p) parallel="$OPTARG" ;;
82 q) quiet='true' ;;
83 r) resume="$OPTARG" ;;
84 s) ssl_only='true' ;;
85 t) list_update_rate="$OPTARG" ;;
86 w) capture_job_rate="$OPTARG" ;;
87 x) outlinks='true'; exclude_pattern="$OPTARG" ;;
88 *) print_usage
89 exit 1 ;;
90 esac
91done
92shift "$((OPTIND-1))"
93
94if [[ -n "$resume" ]]; then
95 # There should not be any arguments
96 if [[ -n "$1" ]]; then
97 print_usage
98 exit 1
99 fi
100 # Get list
101 # List will be constructed from the specified folder
102 if [[ ! -d "$resume" ]]; then
103 echo "The folder $resume could not be found"
104 exit 1
105 fi
106 cd "$resume"
107 if ! [[ -f "index.txt" && -f "success.log" ]]; then
108 echo "Could not resume session; required files not found"
109 exit 1
110 fi
111 if [[ -f "outlinks.txt" ]]; then
112 # Index will also include successful redirects, which should be logged in captures.log
113 if [[ -f "captures.log" ]]; then
114 success=$(cat success.log captures.log | sed -Ee 's|^/web/[0-9]+/||g')
115 else
116 success=$(<success.log)
117 fi
118 index=$(cat index.txt outlinks.txt)
119 # Convert links to HTTPS
120 if [[ -n "$ssl_only" ]]; then
121 index=$(echo "$index" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
122 success=$(echo "$success" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
123 fi
124
125 # Remove duplicate lines from new index
126 index=$(awk '!seen [$0]++' <<< "$index")
127 # Remove links that are in success.log and captures.log from new index
128 list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 <(echo "$success") f=2 <(echo "$index"))
129
130 # If -o and -x are not specified, then retain original values
131 if [[ -z "$outlinks" ]]; then
132 outlinks='true'
133 include_pattern=$(<include_pattern.txt)
134 exclude_pattern=$(<exclude_pattern.txt)
135 fi
136 else
137 # Remove links that are in success.log from index.txt
138 list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 success.log f=2 index.txt)
139 fi
140 if [[ -z "$list" ]]; then
141 echo "Session already complete; not resuming"
142 exit 1
143 fi
144 cd
145else
146 # File or at least one URL must be provided
147 if [[ -z "$1" ]]; then
148 print_usage
149 exit 1
150 fi
151 # Get list
152 # Treat as filename if only one argument and file exists, and as URLs otherwise
153 if [[ -n "$2" || ! -f "$1" ]]; then
154 list=$(for i in "$@"; do echo "$i"; done)
155 else
156 list=$(<"$1")
157 fi
158fi
159
160if [[ -n "$custom_dir" ]]; then
161 f="-$$"
162 dir="$custom_dir"
163 if [[ ! -d "$dir" ]]; then
164 mkdir "$dir" || { echo "The folder $dir could not be created"; exit 1; }
165 echo "
166
167== Starting $(basename "$0") ==
168Data folder: $dir
169"
170 else
171 echo "
172
173== Starting $(basename "$0") ==
174Using existing data folder: $dir
175"
176 fi
177 cd "$dir"
178
179 for i in max_parallel_jobs$f.txt status_rate$f.txt list_update_rate$f.txt capture_job_rate$f.txt lock$f.txt daily_limit$f.txt quit$f.txt; do
180 if [[ -f "$i" ]]; then
181 rm "$i"
182 fi
183 done
184else
185 f=''
186 # Setting base directory on parent variable allows discarding redundant '~/' expansions
187 if [ "$(uname)" == "Darwin" ]; then
188 # macOS platform
189 parent="${HOME}/Library/spn-data"
190 else
191 # Use XDG directory specification; if variable is not set then default to ~/.local/share/spn-data
192 parent="${XDG_DATA_HOME:-$HOME/.local/share}/spn-data"
193 # If the folder doesn't exist, use ~/spn-data instead
194 if [[ ! -d "${XDG_DATA_HOME:-$HOME/.local/share}" ]]; then
195 parent="${HOME}/spn-data"
196 fi
197 fi
198
199 month=$(date -u +%Y-%m)
200 now=$(date +%s)
201
202 for i in "$parent" "$parent/$month"; do
203 if [[ ! -d "$i" ]]; then
204 mkdir "$i" || { echo "The folder $i could not be created"; exit 1; }
205 fi
206 done
207
208 # Wait between 0 and 0.07 seconds to try to avoid a collision, in case another session is started at exactly the same time
209 sleep ".0$((RANDOM % 8))"
210
211 # Wait between 0.1 and 0.73 seconds if the folder already exists
212 while [[ -d "$parent/$month/$now$dir_suffix" ]]; do
213 sleep ".$((10 + RANDOM % 64))"
214 now=$(date +%s)
215 done
216 dir="$parent/$month/$now$dir_suffix"
217
218 # Try to create the folder
219 mkdir "$dir" || { echo "The folder $dir could not be created"; exit 1; }
220 echo "
221
222== Starting $(basename "$0") ==
223Data folder: $dir
224"
225 cd "$dir"
226fi
227
228# Convert links to HTTPS
229if [[ -n "$ssl_only" ]]; then
230 list=$(echo "$list" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
231fi
232
233# Set POST options
234# The web form sets capture_all=on by default; this replicates the default behavior
235if [[ -z "$no_errors" ]]; then
236 if [[ -n "$post_data" ]]; then
237 post_data="${post_data}&capture_all=on"
238 else
239 post_data="capture_all=on"
240 fi
241fi
242
243# Create data files
244# max_parallel_jobs.txt and status_rate.txt are created later
245touch failed.txt
246echo "$list_update_rate" > list_update_rate$f.txt
247echo "$capture_job_rate" > capture_job_rate$f.txt
248# Add successful capture URLs from previous session, if any, to the index and the list of captures
249# This is to prevent redundant captures in the current session and in future ones
250if [[ -n "$success" ]]; then
251 success=$(echo "$success" | awk '!seen [$0]++')
252 echo "$success" >> index.txt
253 echo "$success" >> success.log
254fi
255# Dedupe list, then send to index.txt
256list=$(awk '!seen [$0]++' <<< "$list") && echo "$list" >> index.txt
257if [[ -n "$outlinks" ]]; then
258 touch outlinks.txt
259 # Create both files even if one of them would be empty
260 echo "$include_pattern" > include_pattern.txt
261 echo "$exclude_pattern" > exclude_pattern.txt
262fi
263
264# Submit a URL to Save Page Now and check the result
265function capture(){
266 local tries="0"
267 local request
268 local job_id
269 local message
270 while ((tries < 3)); do
271 # Submit
272 local lock_wait=0
273 local start_time=`date +%s`
274 while :; do
275 if (( $(date +%s) - start_time > 300 )); then
276 break 2
277 fi
278 if [[ -n "$auth" ]]; then
279 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/")
280 job_id=$(echo "$request" | grep -Eo '"job_id":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"job_id":"(.*)"/\1/g')
281 if [[ -n "$job_id" ]]; then
282 break
283 fi
284 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
285 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
286 else
287 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" "https://web.archive.org/save/")
288 job_id=$(echo "$request" | grep -E 'spn\.watchJob\(' | sed -Ee 's/^.*spn\.watchJob\("([^"]*).*$/\1/g' | head -1)
289 if [[ -n "$job_id" ]]; then
290 break
291 fi
292 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
293 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
294 fi
295 if [[ -z "$message" ]]; then
296 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
297 echo "$request"
298 if [[ ! -f lock$f.txt ]]; then
299 touch lock$f.txt
300 sleep 20
301 rm lock$f.txt
302 else
303 break 2
304 fi
305 elif [[ "$request" =~ "400 Bad Request" ]]; then
306 echo "$request"
307 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
308 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
309 echo "$request" >> invalid.log
310 return 1
311 else
312 sleep 5
313 fi
314 else
315 echo " $message"
316 if ! [[ "$message" =~ "You have already reached the limit" || "$message" =~ "Cannot start capture" || "$message" =~ "The server encountered an internal error and was unable to complete your request" || "$message" =~ "Crawling this host is paused" ]]; then
317 if [[ "$message" =~ "You have reached your daily not-logged-in captures limit of" || "$message" =~ "You cannot make more than "[1-9][0-9,]*" captures per day" ]]; then
318 touch daily_limit$f.txt
319 break 2
320 else
321 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
322 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
323 echo "$message" >> invalid.log
324 return 1
325 fi
326 fi
327 if [[ ! -f lock$f.txt ]]; then
328 touch lock$f.txt
329 while [[ -f lock$f.txt ]]; do
330 # Retry the request until either the job is submitted or a different error is received
331 sleep 2
332 if [[ -n "$auth" ]]; then
333 # If logged in, then check if the server-side limit for captures has been reached
334 while :; do
335 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
336 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
337 if [[ "$available" != '"available":0' ]]; then
338 break
339 else
340 sleep 5
341 fi
342 done
343 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/")
344 job_id=$(echo "$request" | grep -Eo '"job_id":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"job_id":"(.*)"/\1/g')
345 if [[ -n "$job_id" ]]; then
346 rm lock$f.txt
347 break 2
348 fi
349 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
350 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
351 else
352 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" "https://web.archive.org/save/")
353 job_id=$(echo "$request" | grep -E 'spn\.watchJob\(' | sed -Ee 's/^.*spn\.watchJob\("([^"]*).*$/\1/g' | head -1)
354 if [[ -n "$job_id" ]]; then
355 rm lock$f.txt
356 break 2
357 fi
358 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
359 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
360 fi
361 if [[ -z "$message" ]]; then
362 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
363 echo "$request"
364 sleep 20
365 else
366 sleep 5
367 rm lock$f.txt
368 break
369 fi
370 else
371 echo " $message"
372 if [[ "$message" =~ "You have already reached the limit" || "$message" =~ "Cannot start capture" || "$message" =~ "The server encountered an internal error and was unable to complete your request" || "$message" =~ "Crawling this host is paused" ]]; then
373 :
374 elif [[ "$message" =~ "You have reached your daily not-logged-in captures limit of" || "$message" =~ "You cannot make more than "[1-9][0-9,]*" captures per day" ]]; then
375 rm lock$f.txt
376 touch daily_limit$f.txt
377 break 3
378 else
379 rm lock$f.txt
380 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
381 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
382 echo "$message" >> invalid.log
383 return 1
384 fi
385 fi
386 done
387 else
388 # If another process has already created lock.txt, wait for the other process to remove it
389 while [[ -f lock$f.txt ]]; do
390 sleep 5
391 ((lock_wait+=5))
392 if ((lock_wait > 120)); then
393 break 3
394 fi
395 done
396 fi
397 fi
398 done
399 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job submitted] $1"
400
401 # Check if there's a message
402 if [[ -n "$auth" ]]; then
403 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
404 else
405 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
406 fi
407 if [[ -n "$message" ]]; then
408 echo " $message"
409
410 # Extract the delay, if any, from the message
411 delay=$(echo "$message" | grep -Eo 'capture will start in')
412 if [[ -n "$delay" ]]; then
413 delay_hours=$(echo "$message" | grep -Eo "[0-9]+ hour" | grep -Eo "[0-9]*")
414 delay_minutes=$(echo "$message" | grep -Eo "[0-9]+ minute" | grep -Eo "[0-9]*")
415 delay_seconds=$(echo "$message" | grep -Eo "[0-9]+ second" | grep -Eo "[0-9]*")
416
417 # If the values are not integers, set them to 0
418 [[ $delay_hours =~ ^[0-9]+$ ]] || delay_hours="0"
419 [[ $delay_minutes =~ ^[0-9]+$ ]] || delay_minutes="0"
420 [[ $delay_seconds =~ ^[0-9]+$ ]] || delay_seconds="0"
421
422 delay_seconds=$((delay_hours * 3600 + delay_minutes * 60 + delay_seconds))
423 sleep $delay_seconds
424 fi
425 fi
426 local start_time=`date +%s`
427 local status
428 local status_ext
429 while :; do
430 sleep "$(<status_rate$f.txt)"
431 request=$(curl "${curl_args[@]}" -s -m 60 "https://web.archive.org/save/status/$job_id")
432 status=$(echo "$request" | grep -Eo '"status":"([^"\\]|\\["\\])*"' | head -1)
433 if [[ -z "$status" ]]; then
434 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Status request failed] $1"
435 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
436 echo "$request"
437 sleep 20
438 fi
439 sleep "$(<status_rate$f.txt)"
440 request=$(curl "${curl_args[@]}" -s -m 60 "https://web.archive.org/save/status/$job_id")
441 status=$(echo "$request" | grep -Eo '"status":"([^"\\]|\\["\\])*"' | head -1)
442 if [[ -z "$status" ]]; then
443 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Status request failed] $1"
444 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
445 echo "$request"
446 sleep 20
447 status='"status":"pending"'
448 # Fake status response to allow while loop to continue
449 else
450 echo "$request" >> unknown-json.log
451 break 2
452 fi
453 fi
454 fi
455 if [[ -z "$status" ]]; then
456 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
457 echo "$request" >> unknown-json.log
458 break 2
459 fi
460 if [[ "$status" == '"status":"success"' ]]; then
461 if [[ "$request" =~ '"first_archive":true' ]]; then
462 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job completed] [First archive] $1"
463 else
464 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job completed] $1"
465 fi
466 echo "$1" >> success.log
467 timestamp=$(echo "$request" | grep -Eo '"timestamp":"[0-9]*"' | sed -Ee 's/^"timestamp":"(.*)"/\1/g')
468 url=$(echo "$request" | grep -Eo '"original_url":"([^"\\]|\\["\\])*"' | sed -Ee 's/^"original_url":"(.*)"/\1/g;s/\\(["\\])/\1/g')
469 echo "/web/$timestamp/$url" >> captures.log
470 if [[ -z "$quiet" ]]; then
471 echo "$request" >> success-json.log
472 fi
473 if [[ -n "$outlinks" ]]; then
474 if [[ "$url" != "$1" ]]; then
475 # Prevent the URL from being submitted twice
476 echo "$url" >> index.txt
477 fi
478 # grep matches array of strings (most special characters are converted server-side, but not square brackets)
479 # sed transforms the array into just the URLs separated by line breaks
480 echo "$request" | grep -Eo '"outlinks":\["([^"\\]|\\["\\])*"(,"([^"\\]|\\["\\])*")*\]' | sed -Ee 's/"outlinks":\["(.*)"\]/\1/g;s/(([^"\\]|\\["\\])*)","/\1\
481/g;s/\\(["\\])/\1/g' | { [[ -n "$(<exclude_pattern.txt)" ]] && { [[ -n "$(<include_pattern.txt)" ]] && grep -E "$(<include_pattern.txt)" | grep -Ev "$(<exclude_pattern.txt)" || grep -Ev "$(<exclude_pattern.txt)"; } || grep -E "$(<include_pattern.txt)"; } >> outlinks.txt
482 fi
483 return 0
484 elif [[ "$status" == '"status":"pending"' ]]; then
485 new_download_size=$(echo "$request" | grep -Eo '"download_size":[0-9]*' | head -1)
486 if [[ -n "$new_download_size" ]]; then
487 if [[ "$new_download_size" == "$download_size" ]]; then
488 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [File download stalled] $1"
489 break 2
490 else
491 download_size="$new_download_size"
492 fi
493 fi
494 if (( $(date +%s) - start_time > 1200 )); then
495 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job timed out] $1"
496 break 2
497 fi
498 elif [[ "$status" == '"status":"error"' ]]; then
499 echo "$request" >> error-json.log
500 status_ext=$(echo "$request" | grep -Eo '"status_ext":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"status_ext":"(.*)"/\1/g')
501 if [[ -z "$status_ext" ]]; then
502 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
503 break 2
504 fi
505 if [[ "$status_ext" == 'error:filesize-limit' ]]; then
506 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [File size limit of 2 GB exceeded] $1"
507 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
508 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$status_ext] $1" >> failed.log
509 return 1
510 elif [[ "$status_ext" == 'error:proxy-error' ]]; then
511 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [SPN proxy error] $1"
512 else
513 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
514 if [[ -z "$message" ]]; then
515 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error: $status_ext] $1"
516 break 2
517 fi
518 if [[ "$message" == "Live page is not available: chrome-error://chromewebdata/" ]]; then
519 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [SPN internal error] $1"
520 elif [[ "$message" =~ ' (HTTP status='(40[89]|429|50[023478])').'$ ]] || [[ "$message" =~ "The server didn't respond in time" ]]; then
521 # HTTP status 408, 409, 429, 500, 502, 503, 504, 507 or 508, or didn't respond in time
522 # URL may become available later
523 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
524 break 2
525 elif [[ "$message" =~ ' (HTTP status='[45][0-9]*').'$ ]]; then
526 # HTTP error; assume the URL cannot be archived
527 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
528 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
529 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$status_ext] $1" >> failed.log
530 return 1
531 else
532 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
533 break 2
534 fi
535 fi
536 break
537 else
538 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
539 break 2
540 fi
541 done
542 ((tries++))
543 done
544 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
545 echo "$1" >> failed.txt
546 return 1
547}
548
549function get_list(){
550 local failed_file=failed-$(date +%s).txt
551 mv failed.txt $failed_file
552 touch failed.txt
553 local failed_list=$(<$failed_file)
554
555 if [[ -n "$outlinks" ]]; then
556 local outlinks_file=outlinks-$(date +%s).txt
557 mv outlinks.txt $outlinks_file
558 touch outlinks.txt
559 # Remove duplicate lines; reading into string prevents awk from emptying the file
560 awk '!seen [$0]++' <<< "$(<$outlinks_file)" > $outlinks_file
561 # Convert links to HTTPS
562 if [[ -n "$ssl_only" ]]; then
563 sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g' <<< "$(<$outlinks_file)" > $outlinks_file
564 fi
565 # Remove lines that are already in index.txt
566 local outlinks_list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 index.txt f=2 $outlinks_file)
567
568 if [[ -n "$outlinks_list" ]]; then
569 echo "$outlinks_list" >> index.txt
570
571 if [[ -n "$failed_list" ]]; then
572 echo "$failed_list
573$outlinks_list"
574 else
575 echo "$outlinks_list"
576 fi
577 fi
578 if [[ -z "$(<$outlinks_file)" ]]; then
579 rm $outlinks_file
580 fi
581 else
582 echo "$failed_list"
583 fi
584 if [[ -z "$failed_list" ]]; then
585 rm $failed_file
586 fi
587}
588
589# Track the number of loops in which no URLs from the list are archived
590repeats=0
591
592# Use the linear loop instead when fewer than 2 parallel jobs are requested
593if ((parallel < 2)); then
594	unset parallel
595fi
596
597# Parallel loop
598if [[ -n "$parallel" ]]; then
599 if ((parallel > 60)); then
600 parallel=60
601 echo "Setting maximum parallel jobs to 60"
602 fi
603 echo "$parallel" > max_parallel_jobs$f.txt
604 # Overall request rate stays at around 60 per minute
605 echo "$parallel" > status_rate$f.txt
606 while [[ ! -f quit$f.txt ]]; do
607 (
608 time_since_start="$SECONDS"
609 while IFS='' read -r line || [[ -n "$line" ]]; do
610 capture "$line" & sleep $(<capture_job_rate$f.txt)
611 children_wait=0
612 children=`jobs -p | wc -l`
613 while ! (( children < $(<max_parallel_jobs$f.txt) )); do
614 sleep 1
615 ((children_wait++))
616 if ((children_wait < 600)); then
617 children=`jobs -p | wc -l`
618 else
619 # Wait is longer than 600 seconds; something might be wrong
620 # Increase limit and ignore the problem for now
621 children=0
622 echo $(( $(<max_parallel_jobs$f.txt) + 1 )) > max_parallel_jobs$f.txt
623 fi
624 done
625 lock_wait=0
626 while [[ -f lock$f.txt ]]; do
627 sleep 2
628 ((lock_wait+=2))
629 if ((lock_wait > 300)); then
630 rm lock$f.txt
631 fi
632 done
633 if [[ -f daily_limit$f.txt ]]; then
634 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for $(( (3600 - $(date +%s) % 3600) / 60 )) minutes"
635 sleep $(( 3600 - $(date +%s) % 3600 ))
636 rm daily_limit$f.txt
637 fi
638 # If logged in, then check if the server-side limit for captures has been reached
639 if [[ -n "$auth" ]] && (( children > 4 )); then
640 while :; do
641 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
642 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
643 if [[ "$available" != '"available":0' ]]; then
644 break
645 else
646 sleep 5
647 fi
648 done
649 fi
650 # Check failures and outlinks regularly
651 if (( SECONDS - time_since_start > $(<list_update_rate$f.txt) )) && [[ ! -f quit$f.txt ]] ; then
652 time_since_start="$SECONDS"
653 new_list=$(get_list)
654 if [[ -n "$new_list" ]]; then
655 while IFS='' read -r line2 || [[ -n "$line2" ]]; do
656 capture "$line2" & sleep $(<capture_job_rate$f.txt)
657 children_wait=0
658 children=`jobs -p | wc -l`
659 while ! ((children < $(<max_parallel_jobs$f.txt) )); do
660 sleep 1
661 ((children_wait++))
662 if ((children_wait < 600)); then
663 children=`jobs -p | wc -l`
664 else
665 # Wait is longer than 600 seconds; something might be wrong
666 # Increase limit and ignore the problem for now
667 children=0
668 echo $(( $(<max_parallel_jobs$f.txt) + 1 )) > max_parallel_jobs$f.txt
669 fi
670 done
671 lock_wait=0
672 while [[ -f lock$f.txt ]]; do
673 sleep 2
674 ((lock_wait+=2))
675 if ((lock_wait > 300)); then
676 rm lock$f.txt
677 fi
678 done
679 if [[ -f daily_limit$f.txt ]]; then
680 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for $(( (3600 - $(date +%s) % 3600) / 60 )) minutes"
681 sleep $(( 3600 - $(date +%s) % 3600 ))
682 rm daily_limit$f.txt
683 fi
684 # If logged in, then check if the server-side limit for captures has been reached
685 if [[ -n "$auth" ]] && (( children > 4 )); then
686 while :; do
687 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
688 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
689 if [[ "$available" != '"available":0' ]]; then
690 break
691 else
692 sleep 5
693 fi
694 done
695 fi
696 done <<< "$new_list"
697 unset new_list
698 fi
699 fi
700 done <<< "$list"
701
702 for job in `jobs -p`; do wait $job; done
703 )
704
705 new_list=$(get_list)
706 if [[ "$new_list" == "$list" ]]; then
707 ((repeats++))
708 if ((repeats > 1)); then
709 if ((repeats > 3)); then
710 break
711 else
712 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for 30 minutes"
713 sleep 1800
714 fi
715 fi
716 fi
717 list="$new_list"
718 unset new_list
719 if [[ -z "$list" && -z "$(<failed.txt)" ]]; then
720 # No more URLs
721 touch quit$f.txt
722 rm failed.txt
723 fi
724 done
725fi
726
727if [[ ! -f quit$f.txt ]]; then
728 echo "2" > status_rate$f.txt
729fi
730
731# Linear loop
732while [[ ! -f quit$f.txt ]]; do
733 time_since_start="$SECONDS"
734 while IFS='' read -r line || [[ -n "$line" ]]; do
735 time_since_capture_start="$SECONDS"
736 capture "$line"
737 if [[ $(bc <<< "$SECONDS - $time_since_capture_start < $(<capture_job_rate$f.txt)") == "1" ]]; then
738 sleep $(bc <<< "$(<capture_job_rate$f.txt) - ($SECONDS - $time_since_capture_start)")
739 fi
740 # Check failures and outlinks regularly
741 if (( SECONDS - time_since_start > $(<list_update_rate$f.txt) )) && [[ ! -f quit$f.txt ]] ; then
742 time_since_start="$SECONDS"
743 new_list=$(get_list)
744 if [[ -n "$new_list" ]]; then
745 while IFS='' read -r line2 || [[ -n "$line2" ]]; do
746 time_since_capture_start="$SECONDS"
747 capture "$line2"
748 if [[ $(bc <<< "$SECONDS - $time_since_capture_start < $(<capture_job_rate$f.txt)") == "1" ]]; then
749 sleep $(bc <<< "$(<capture_job_rate$f.txt) - ($SECONDS - $time_since_capture_start)")
750 fi
751 done <<< "$new_list"
752 fi
753 unset new_list
754 fi
755 done <<< "$list"
756 new_list=$(get_list)
757 if [[ "$new_list" == "$list" ]]; then
758 ((repeats++))
759 if ((repeats > 1)); then
760 if ((repeats > 4)); then
761 # Give up
762 touch quit$f.txt
763 else
764 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for 30 minutes"
765 sleep 1800
766 fi
767 fi
768 fi
769 list="$new_list"
770 unset new_list
771 if [[ -z "$list" && -z "$(<failed.txt)" ]]; then
772 # No more URLs
773 touch quit$f.txt
774 rm failed.txt
775 fi
776done
777
778if [[ -n "$custom_dir" ]]; then
779 for i in max_parallel_jobs$f.txt status_rate$f.txt list_update_rate$f.txt lock$f.txt daily_limit$f.txt quit$f.txt; do
780 if [[ -f "$i" ]]; then
781 rm "$i"
782 fi
783 done
784
785 echo "
786
787== Ending $(basename "$0") ==
788Data folder: $dir
789"
790else
791 echo "
792
793== Ending $(basename "$0") ==
794Data folder: $dir
795"