I'll explain the spn.sh script line by line. This is a bash script for automating submissions to the Internet Archive's Save Page Now (SPN) service. spn.sh:1
The script starts with a shebang to run with bash. spn.sh:1
It sets up a trap to catch interrupt signals (Ctrl+C) and termination signals, calling the abort function when triggered. spn.sh:3
The abort function prints the data folder location and exits with status 1. spn.sh:5-12
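This is the usual bash cleanup-on-signal pattern; a minimal standalone sketch of the same idea (the data folder path is a placeholder):

```bash
#!/bin/bash
# Minimal sketch of the trap/abort pattern spn.sh uses: on Ctrl+C or SIGTERM,
# report where the session data lives, then exit with a non-zero status.
dir="/tmp/example-spn-data"     # placeholder for the real data folder

abort() {
    echo "== Aborting $(basename "$0") =="
    echo "Data folder: $dir"
    exit 1
}

trap "abort" SIGINT SIGTERM

sleep 60    # stand-in for the script's real work
```

Because trap stores the string "abort" and only evaluates it when a signal arrives, it is fine that spn.sh registers the trap before the function is defined.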
Lines 14-28 initialize all script variables with default values: spn.sh:14-28
- auth: S3 API authentication credentials
- curl_args: Additional curl arguments
- post_data: Capture request options
- custom_dir: Custom data folder location
- dir_suffix: Suffix for data folder name
- no_errors: Flag to exclude errors from archiving
- outlinks: Flag to save detected outlinks
- parallel: Maximum parallel jobs (default 20)
- quiet: Flag to discard JSON logs
- resume: Folder path for resuming sessions
- ssl_only: Flag to force HTTPS
- list_update_rate: Seconds between list updates (default 3600)
- capture_job_rate: Seconds between job starts (default 2.5)
- include_pattern and exclude_pattern: Regex patterns for outlinks

The print_usage function displays help text explaining all command-line options. spn.sh:30-70
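Putting the options together, typical invocations look like this (the keys, URLs, and paths below are placeholders):

```bash
# Archive every URL listed in urls.txt, authenticated, 10 capture jobs at a time
./spn.sh -a myaccesskey:mysecret -p 10 urls.txt

# Archive two URLs given directly, forcing HTTPS and not saving error pages
./spn.sh -s -n https://example.com/a http://example.com/b

# Also save outlinks matching a regex, and pass extra arguments through to curl
./spn.sh -c '--proxy socks5h://127.0.0.1:9050' -o '^https?://example\.com/' urls.txt

# Resume an aborted session from its data folder
./spn.sh -r ~/.local/share/spn-data/2024-05/1714000000
```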
Lines 72-92 use getopts to parse command-line flags and set corresponding variables. spn.sh:72-92
Lines 94-158 handle resuming an aborted session: spn.sh:94-158
- Checks that the required session files exist (index.txt, success.log)
- If outlinks.txt exists, merges it with index.txt and removes already-captured URLs
- Converts URLs to HTTPS if ssl_only is set

For new sessions, the script determines if the argument is a file or direct URLs. spn.sh:146-158
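One detail of the resume merge: captures.log stores Wayback paths of the form /web/&lt;timestamp&gt;/&lt;url&gt;, so the script strips that prefix before comparing those entries against plain URLs (spn.sh:114). In isolation:

```bash
# captures.log lines look like /web/20240101000000/https://example.com/page;
# stripping the /web/<timestamp>/ prefix makes them comparable to plain URLs.
printf '%s\n' \
    '/web/20240101000000/https://example.com/page' \
    'https://example.com/other' |
    sed -Ee 's|^/web/[0-9]+/||g'
# Output:
#   https://example.com/page
#   https://example.com/other
```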
Lines 160-226 create the data folder: spn.sh:160-226
- If a custom location is specified with -f, uses that location
- Otherwise, creates a timestamped folder in a platform-specific default location (macOS: ~/Library/spn-data, Linux: ~/.local/share/spn-data or ~/spn-data)

Converts URLs to HTTPS if the ssl_only flag is set. spn.sh:228-231
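The HTTPS conversion is a single sed expression: it prepends https:// to each URL (dropping any existing http:// or https:// scheme and leading whitespace), then undoes the change for ftp:// URLs so they pass through untouched. A quick demonstration with sample input:

```bash
# Same sed expression as spn.sh:230, applied to three sample lines
printf '%s\n' \
    'http://example.com/a' \
    '  example.com/b' \
    'ftp://example.com/c' |
    sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g'
# Output:
#   https://example.com/a
#   https://example.com/b
#   ftp://example.com/c
```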
Sets POST data options, adding capture_all=on unless no_errors is specified. spn.sh:233-241
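In the default case post_data ends up as just capture_all=on; with a hypothetical -d 'foo=bar' it would become foo=bar&capture_all=on. That string is then sent as ordinary form data when a capture is submitted. A sketch of the authenticated request the script builds (spn.sh:279), with placeholder keys and URL:

```bash
# Placeholder credentials and URL; the real script fills these from -a and its URL list
auth="myaccesskey:mysecret"
post_data="capture_all=on"

curl -s -m 60 -X POST \
    --data-urlencode "url=https://example.com/page" \
    -d "${post_data}" \
    -H "Accept: application/json" \
    -H "Authorization: LOW ${auth}" \
    "https://web.archive.org/save/"
```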
Creates tracking files and deduplicates the URL list. spn.sh:243-262
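Deduplication relies on the classic awk idiom !seen[$0]++, which prints each line only the first time it appears while preserving order:

```bash
# Prints a, b, c: later duplicates are dropped, original order is kept
printf '%s\n' a b a c b | awk '!seen[$0]++'
```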
The capture function (lines 265-547) is the core logic: spn.sh:265-547
It submits a URL to SPN and monitors the job status through multiple retry attempts. spn.sh:270-543
The function handles various error conditions including rate limits, timeouts, and server errors. spn.sh:296-397
On success, it logs the capture and extracts outlinks if enabled. spn.sh:460-483
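A reduced sketch of the polling half of that cycle, with the retries, stall detection, and most error branches omitted (the job ID is a placeholder; the endpoint and JSON fields are the ones the script itself parses):

```bash
job_id='spn2-0123456789abcdef'    # placeholder; really returned by the submission request

while :; do
    sleep 5
    request=$(curl -s -m 60 "https://web.archive.org/save/status/$job_id")
    status=$(echo "$request" | grep -Eo '"status":"[^"]*"' | head -1)
    if [[ "$status" == '"status":"success"' ]]; then
        # Build the same /web/<timestamp>/<url> line the script appends to captures.log
        timestamp=$(echo "$request" | grep -Eo '"timestamp":"[0-9]*"' | sed -Ee 's/^"timestamp":"(.*)"/\1/')
        url=$(echo "$request" | grep -Eo '"original_url":"[^"]*"' | sed -Ee 's/^"original_url":"(.*)"/\1/')
        echo "/web/$timestamp/$url"
        break
    elif [[ "$status" == '"status":"error"' ]]; then
        echo "capture failed" >&2
        break
    fi
    # any other status (notably "pending") means keep polling
done
```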
The get_list function processes failed captures and new outlinks, returning URLs to retry. spn.sh:549-587
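Both here and in the resume logic, "remove lines that are already in index.txt" is an awk anti-join: the first pass loads one input into a lookup table, and the second pass prints only lines absent from it. The same construct with a hypothetical candidates.txt:

```bash
# Print the lines of candidates.txt that do not appear in index.txt
# (f marks which input awk is currently reading; same pattern as spn.sh:566)
awk '{ if (f==1) { r[$0] } else if (!($0 in r)) { print $0 } }' \
    f=1 index.txt f=2 candidates.txt
```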
Lines 589-725 implement parallel processing mode, used when parallel is at least 2: capture jobs run as background processes (capped at 60), and the loop waits whenever the job limit, the lock file, or the daily capture limit is hit. spn.sh:597-725
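The throttle is built from plain background jobs: each capture is launched with &, and the loop then polls jobs -p until the number of running children drops below the limit read from max_parallel_jobs.txt. A stripped-down sketch of the pattern with a stand-in work function:

```bash
max_jobs=5        # spn.sh reads this from max_parallel_jobs.txt
job_rate=2.5      # spn.sh reads this from capture_job_rate.txt

work() { sleep 10; }    # stand-in for the capture function

while IFS='' read -r url || [[ -n "$url" ]]; do
    work "$url" & sleep "$job_rate"
    # Block while the number of running background jobs is at or above the limit
    while ! (( $(jobs -p | wc -l) < max_jobs )); do
        sleep 1
    done
done < urls.txt    # placeholder input file

wait    # let the remaining background jobs finish before moving on
```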
Lines 727-776 implement linear (sequential) processing mode: captures run one at a time, with bc used to enforce the minimum delay between job starts. spn.sh:732-776
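Because capture_job_rate may be fractional (default 2.5) and bash arithmetic is integer-only, the linear loop delegates the comparison and subtraction to bc. The same pacing logic in isolation:

```bash
rate=2.5            # minimum seconds between consecutive job starts
start="$SECONDS"

sleep 1             # stand-in for a capture that finished quickly

elapsed=$(( SECONDS - start ))
if [[ $(bc <<< "$elapsed < $rate") == "1" ]]; then
    # Sleep for the remainder so the next job starts no sooner than $rate seconds later
    sleep "$(bc <<< "$rate - $elapsed")"
fi
```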
Finally, the script removes its temporary control files (when a custom data folder was used) and prints a completion message. spn.sh:778-796
This script is designed for bulk archiving of URLs to the Internet Archive's Wayback Machine. It handles rate limiting, retries, parallel processing, and can resume interrupted sessions. The script creates various log files to track successes, failures, and captured outlinks.
1#!/bin/bash2 3trap "abort" SIGINT SIGTERM4 5function abort(){
6 echo "
7
8== Aborting $(basename "$0") ==
9Data folder: $dir
10"
11 exit 1
12}
13
14auth=''
15curl_args=()
16post_data=''
17custom_dir=''
18dir_suffix=''
19no_errors=''
20outlinks=''
21parallel='20'
22quiet=''
23resume=''
24ssl_only=''
25list_update_rate='3600'
26capture_job_rate='2.5'
27include_pattern=''
28exclude_pattern=''
29
30print_usage() {
31 echo "Usage: $(basename "$0") [options] file
32 $(basename "$0") [options] url [url]...
33 $(basename "$0") [options] -r folder
34
35Options:
36 -a auth S3 API keys, in the form accesskey:secret
37 (get account keys at https://archive.org/account/s3.php)
38
39 -c args pass additional arguments to curl
40
41 -d data capture request options, or other arbitrary POST data
42
43 -f folder use a custom location for the data folder
44 (some files will be overwritten or deleted during the session)
45
46 -i suffix add a suffix to the name of the data folder
47 (if -f is used, -i is ignored)
48
49 -n tell Save Page Now not to save errors into the Wayback Machine
50
51 -o pattern save detected capture outlinks matching regex (ERE) pattern
52
53 -p N run at most N capture jobs in parallel (default: 20)
54
55 -q discard JSON for completed jobs instead of writing to log file
56
57 -r folder resume with the remaining URLs of an aborted session
58 (settings are not carried over, except for outlinks options)
59
60 -s use HTTPS for all captures and change HTTP input URLs to HTTPS
61
62 -t N wait at least N seconds before updating the main list of URLs
63 with outlinks and failed capture jobs (default: 3600)
64
65 -w N wait at least N seconds after starting a capture job before
66 starting another capture job (default: 2.5)
67
68 -x pattern save detected capture outlinks not matching regex (ERE) pattern
69 (if -o is also used, outlinks are filtered using both regexes)"
70}
71
72while getopts 'a:c:d:f:i:no:p:qr:st:w:x:' flag; do
73 case "${flag}" in
74 a) auth="$OPTARG" ;;
75 c) declare -a "curl_args=($OPTARG)" ;;
76 d) post_data="$OPTARG" ;;
77 f) custom_dir="$OPTARG" ;;
78 i) dir_suffix="-$OPTARG" ;;
79 n) no_errors='true' ;;
80 o) outlinks='true'; include_pattern="$OPTARG" ;;
81 p) parallel="$OPTARG" ;;
82 q) quiet='true' ;;
83 r) resume="$OPTARG" ;;
84 s) ssl_only='true' ;;
85 t) list_update_rate="$OPTARG" ;;
86 w) capture_job_rate="$OPTARG" ;;
87 x) outlinks='true'; exclude_pattern="$OPTARG" ;;
88 *) print_usage
89 exit 1 ;;
90 esac
91done
92shift "$((OPTIND-1))"
93
94if [[ -n "$resume" ]]; then
95 # There should not be any arguments
96 if [[ -n "$1" ]]; then
97 print_usage
98 exit 1
99 fi
100 # Get list
101 # List will be constructed from the specified folder
102 if [[ ! -d "$resume" ]]; then
103 echo "The folder $resume could not be found"
104 exit 1
105 fi
106 cd "$resume"
107 if ! [[ -f "index.txt" && -f "success.log" ]]; then
108 echo "Could not resume session; required files not found"
109 exit 1
110 fi
111 if [[ -f "outlinks.txt" ]]; then
112 # Index will also include successful redirects, which should be logged in captures.log
113 if [[ -f "captures.log" ]]; then
114 success=$(cat success.log captures.log | sed -Ee 's|^/web/[0-9]+/||g')
115 else
116 success=$(<success.log)
117 fi
118 index=$(cat index.txt outlinks.txt)
119 # Convert links to HTTPS
120 if [[ -n "$ssl_only" ]]; then
121 index=$(echo "$index" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
122 success=$(echo "$success" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
123 fi
124
125 # Remove duplicate lines from new index
126 index=$(awk '!seen [$0]++' <<< "$index")
127 # Remove links that are in success.log and captures.log from new index
128 list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 <(echo "$success") f=2 <(echo "$index"))
129
130 # If -o and -x are not specified, then retain original values
131 if [[ -z "$outlinks" ]]; then
132 outlinks='true'
133 include_pattern=$(<include_pattern.txt)
134 exclude_pattern=$(<exclude_pattern.txt)
135 fi
136 else
137 # Remove links that are in success.log from index.txt
138 list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 success.log f=2 index.txt)
139 fi
140 if [[ -z "$list" ]]; then
141 echo "Session already complete; not resuming"
142 exit 1
143 fi
144 cd
145else
146 # File or at least one URL must be provided
147 if [[ -z "$1" ]]; then
148 print_usage
149 exit 1
150 fi
151 # Get list
152 # Treat as filename if only one argument and file exists, and as URLs otherwise
153 if [[ -n "$2" || ! -f "$1" ]]; then
154 list=$(for i in "$@"; do echo "$i"; done)
155 else
156 list=$(<"$1")
157 fi
158fi
159
160if [[ -n "$custom_dir" ]]; then
161 f="-$$"
162 dir="$custom_dir"
163 if [[ ! -d "$dir" ]]; then
164 mkdir "$dir" || { echo "The folder $dir could not be created"; exit 1; }
165 echo "
166
167== Starting $(basename "$0") ==
168Data folder: $dir
169"
170 else
171 echo "
172
173== Starting $(basename "$0") ==
174Using existing data folder: $dir
175"
176 fi
177 cd "$dir"
178
179 for i in max_parallel_jobs$f.txt status_rate$f.txt list_update_rate$f.txt capture_job_rate$f.txt lock$f.txt daily_limit$f.txt quit$f.txt; do
180 if [[ -f "$i" ]]; then
181 rm "$i"
182 fi
183 done
184else
185 f=''
186 # Setting base directory on parent variable allows discarding redundant '~/' expansions
187 if [ "$(uname)" == "Darwin" ]; then
188 # macOS platform
189 parent="${HOME}/Library/spn-data"
190 else
191 # Use XDG directory specification; if variable is not set then default to ~/.local/share/spn-data
192 parent="${XDG_DATA_HOME:-$HOME/.local/share}/spn-data"
193 # If the folder doesn't exist, use ~/spn-data instead
194 if [[ ! -d "${XDG_DATA_HOME:-$HOME/.local/share}" ]]; then
195 parent="${HOME}/spn-data"
196 fi
197 fi
198
199 month=$(date -u +%Y-%m)
200 now=$(date +%s)
201
202 for i in "$parent" "$parent/$month"; do
203 if [[ ! -d "$i" ]]; then
204 mkdir "$i" || { echo "The folder $i could not be created"; exit 1; }
205 fi
206 done
207
208 # Wait between 0 and 0.07 seconds to try to avoid a collision, in case another session is started at exactly the same time
209 sleep ".0$((RANDOM % 8))"
210
211 # Wait between 0.1 and 0.73 seconds if the folder already exists
212 while [[ -d "$parent/$month/$now$dir_suffix" ]]; do
213 sleep ".$((10 + RANDOM % 64))"
214 now=$(date +%s)
215 done
216 dir="$parent/$month/$now$dir_suffix"
217
218 # Try to create the folder
219 mkdir "$dir" || { echo "The folder $dir could not be created"; exit 1; }
220 echo "
221
222== Starting $(basename "$0") ==
223Data folder: $dir
224"
225 cd "$dir"
226fi
227
228# Convert links to HTTPS
229if [[ -n "$ssl_only" ]]; then
230 list=$(echo "$list" | sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g')
231fi
232
233# Set POST options
234# The web form sets capture_all=on by default; this replicates the default behavior
235if [[ -z "$no_errors" ]]; then
236 if [[ -n "$post_data" ]]; then
237 post_data="${post_data}&capture_all=on"
238 else
239 post_data="capture_all=on"
240 fi
241fi
242
243# Create data files
244# max_parallel_jobs.txt and status_rate.txt are created later
245touch failed.txt
246echo "$list_update_rate" > list_update_rate$f.txt
247echo "$capture_job_rate" > capture_job_rate$f.txt
248# Add successful capture URLs from previous session, if any, to the index and the list of captures
249# This is to prevent redundant captures in the current session and in future ones
250if [[ -n "$success" ]]; then
251 success=$(echo "$success" | awk '!seen [$0]++')
252 echo "$success" >> index.txt
253 echo "$success" >> success.log
254fi
255# Dedupe list, then send to index.txt
256list=$(awk '!seen [$0]++' <<< "$list") && echo "$list" >> index.txt
257if [[ -n "$outlinks" ]]; then
258 touch outlinks.txt
259 # Create both files even if one of them would be empty
260 echo "$include_pattern" > include_pattern.txt
261 echo "$exclude_pattern" > exclude_pattern.txt
262fi
263
264# Submit a URL to Save Page Now and check the result
265function capture(){
266 local tries="0"
267 local request
268 local job_id
269 local message
270 while ((tries < 3)); do
271 # Submit
272 local lock_wait=0
273 local start_time=`date +%s`
274 while :; do
275 if (( $(date +%s) - start_time > 300 )); then
276 break 2
277 fi
278 if [[ -n "$auth" ]]; then
279 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/")
280 job_id=$(echo "$request" | grep -Eo '"job_id":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"job_id":"(.*)"/\1/g')
281 if [[ -n "$job_id" ]]; then
282 break
283 fi
284 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
285 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
286 else
287 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" "https://web.archive.org/save/")
288 job_id=$(echo "$request" | grep -E 'spn\.watchJob\(' | sed -Ee 's/^.*spn\.watchJob\("([^"]*).*$/\1/g' | head -1)
289 if [[ -n "$job_id" ]]; then
290 break
291 fi
292 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
293 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
294 fi
295 if [[ -z "$message" ]]; then
296 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
297 echo "$request"
298 if [[ ! -f lock$f.txt ]]; then
299 touch lock$f.txt
300 sleep 20
301 rm lock$f.txt
302 else
303 break 2
304 fi
305 elif [[ "$request" =~ "400 Bad Request" ]]; then
306 echo "$request"
307 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
308 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
309 echo "$request" >> invalid.log
310 return 1
311 else
312 sleep 5
313 fi
314 else
315 echo " $message"
316 if ! [[ "$message" =~ "You have already reached the limit" || "$message" =~ "Cannot start capture" || "$message" =~ "The server encountered an internal error and was unable to complete your request" || "$message" =~ "Crawling this host is paused" ]]; then
317 if [[ "$message" =~ "You have reached your daily not-logged-in captures limit of" || "$message" =~ "You cannot make more than "[1-9][0-9,]*" captures per day" ]]; then
318 touch daily_limit$f.txt
319 break 2
320 else
321 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
322 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
323 echo "$message" >> invalid.log
324 return 1
325 fi
326 fi
327 if [[ ! -f lock$f.txt ]]; then
328 touch lock$f.txt
329 while [[ -f lock$f.txt ]]; do
330 # Retry the request until either the job is submitted or a different error is received
331 sleep 2
332 if [[ -n "$auth" ]]; then
333 # If logged in, then check if the server-side limit for captures has been reached
334 while :; do
335 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
336 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
337 if [[ "$available" != '"available":0' ]]; then
338 break
339 else
340 sleep 5
341 fi
342 done
343 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/")
344 job_id=$(echo "$request" | grep -Eo '"job_id":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"job_id":"(.*)"/\1/g')
345 if [[ -n "$job_id" ]]; then
346 rm lock$f.txt
347 break 2
348 fi
349 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
350 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
351 else
352 request=$(curl "${curl_args[@]}" -s -m 60 -X POST --data-urlencode "url=${1}" -d "${post_data}" "https://web.archive.org/save/")
353 job_id=$(echo "$request" | grep -E 'spn\.watchJob\(' | sed -Ee 's/^.*spn\.watchJob\("([^"]*).*$/\1/g' | head -1)
354 if [[ -n "$job_id" ]]; then
355 rm lock$f.txt
356 break 2
357 fi
358 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Request failed] $1"
359 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
360 fi
361 if [[ -z "$message" ]]; then
362 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
363 echo "$request"
364 sleep 20
365 else
366 sleep 5
367 rm lock$f.txt
368 break
369 fi
370 else
371 echo " $message"
372 if [[ "$message" =~ "You have already reached the limit" || "$message" =~ "Cannot start capture" || "$message" =~ "The server encountered an internal error and was unable to complete your request" || "$message" =~ "Crawling this host is paused" ]]; then
373 :
374 elif [[ "$message" =~ "You have reached your daily not-logged-in captures limit of" || "$message" =~ "You cannot make more than "[1-9][0-9,]*" captures per day" ]]; then
375 rm lock$f.txt
376 touch daily_limit$f.txt
377 break 3
378 else
379 rm lock$f.txt
380 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
381 echo "$(date -u '+%Y-%m-%d %H:%M:%S') $1" >> invalid.log
382 echo "$message" >> invalid.log
383 return 1
384 fi
385 fi
386 done
387 else
388 # If another process has already created lock.txt, wait for the other process to remove it
389 while [[ -f lock$f.txt ]]; do
390 sleep 5
391 ((lock_wait+=5))
392 if ((lock_wait > 120)); then
393 break 3
394 fi
395 done
396 fi
397 fi
398 done
399 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job submitted] $1"
400
401 # Check if there's a message
402 if [[ -n "$auth" ]]; then
403 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
404 else
405 message=$(echo "$request" | grep -E -A 2 '</?h2( [^>]*)?>' | grep -E '</?p( [^>]*)?>' | sed -Ee 's| *</?p> *||g')
406 fi
407 if [[ -n "$message" ]]; then
408 echo " $message"
409
410 # Extract the delay, if any, from the message
411 delay=$(echo "$message" | grep -Eo 'capture will start in')
412 if [[ -n "$delay" ]]; then
413 delay_hours=$(echo "$message" | grep -Eo "[0-9]+ hour" | grep -Eo "[0-9]*")
414 delay_minutes=$(echo "$message" | grep -Eo "[0-9]+ minute" | grep -Eo "[0-9]*")
415 delay_seconds=$(echo "$message" | grep -Eo "[0-9]+ second" | grep -Eo "[0-9]*")
416
417 # If the values are not integers, set them to 0
418 [[ $delay_hours =~ ^[0-9]+$ ]] || delay_hours="0"
419 [[ $delay_minutes =~ ^[0-9]+$ ]] || delay_minutes="0"
420 [[ $delay_seconds =~ ^[0-9]+$ ]] || delay_seconds="0"
421
422 delay_seconds=$((delay_hours * 3600 + delay_minutes * 60 + delay_seconds))
423 sleep $delay_seconds
424 fi
425 fi
426 local start_time=`date +%s`
427 local status
428 local status_ext
429 while :; do
430 sleep "$(<status_rate$f.txt)"
431 request=$(curl "${curl_args[@]}" -s -m 60 "https://web.archive.org/save/status/$job_id")
432 status=$(echo "$request" | grep -Eo '"status":"([^"\\]|\\["\\])*"' | head -1)
433 if [[ -z "$status" ]]; then
434 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Status request failed] $1"
435 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
436 echo "$request"
437 sleep 20
438 fi
439 sleep "$(<status_rate$f.txt)"
440 request=$(curl "${curl_args[@]}" -s -m 60 "https://web.archive.org/save/status/$job_id")
441 status=$(echo "$request" | grep -Eo '"status":"([^"\\]|\\["\\])*"' | head -1)
442 if [[ -z "$status" ]]; then
443 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Status request failed] $1"
444 if [[ "$request" =~ "429 Too Many Requests" ]] || [[ "$request" == "" ]]; then
445 echo "$request"
446 sleep 20
447 status='"status":"pending"'
448 # Fake status response to allow while loop to continue
449 else
450 echo "$request" >> unknown-json.log
451 break 2
452 fi
453 fi
454 fi
455 if [[ -z "$status" ]]; then
456 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
457 echo "$request" >> unknown-json.log
458 break 2
459 fi
460 if [[ "$status" == '"status":"success"' ]]; then
461 if [[ "$request" =~ '"first_archive":true' ]]; then
462 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job completed] [First archive] $1"
463 else
464 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job completed] $1"
465 fi
466 echo "$1" >> success.log
467 timestamp=$(echo "$request" | grep -Eo '"timestamp":"[0-9]*"' | sed -Ee 's/^"timestamp":"(.*)"/\1/g')
468 url=$(echo "$request" | grep -Eo '"original_url":"([^"\\]|\\["\\])*"' | sed -Ee 's/^"original_url":"(.*)"/\1/g;s/\\(["\\])/\1/g')
469 echo "/web/$timestamp/$url" >> captures.log
470 if [[ -z "$quiet" ]]; then
471 echo "$request" >> success-json.log
472 fi
473 if [[ -n "$outlinks" ]]; then
474 if [[ "$url" != "$1" ]]; then
475 # Prevent the URL from being submitted twice
476 echo "$url" >> index.txt
477 fi
478 # grep matches array of strings (most special characters are converted server-side, but not square brackets)
479 # sed transforms the array into just the URLs separated by line breaks
480 echo "$request" | grep -Eo '"outlinks":\["([^"\\]|\\["\\])*"(,"([^"\\]|\\["\\])*")*\]' | sed -Ee 's/"outlinks":\["(.*)"\]/\1/g;s/(([^"\\]|\\["\\])*)","/\1\
481/g;s/\\(["\\])/\1/g' | { [[ -n "$(<exclude_pattern.txt)" ]] && { [[ -n "$(<include_pattern.txt)" ]] && grep -E "$(<include_pattern.txt)" | grep -Ev "$(<exclude_pattern.txt)" || grep -Ev "$(<exclude_pattern.txt)"; } || grep -E "$(<include_pattern.txt)"; } >> outlinks.txt
482 fi
483 return 0
484 elif [[ "$status" == '"status":"pending"' ]]; then
485 new_download_size=$(echo "$request" | grep -Eo '"download_size":[0-9]*' | head -1)
486 if [[ -n "$new_download_size" ]]; then
487 if [[ "$new_download_size" == "$download_size" ]]; then
488 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [File download stalled] $1"
489 break 2
490 else
491 download_size="$new_download_size"
492 fi
493 fi
494 if (( $(date +%s) - start_time > 1200 )); then
495 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job timed out] $1"
496 break 2
497 fi
498 elif [[ "$status" == '"status":"error"' ]]; then
499 echo "$request" >> error-json.log
500 status_ext=$(echo "$request" | grep -Eo '"status_ext":"([^"\\]|\\["\\])*"' | head -1 | sed -Ee 's/"status_ext":"(.*)"/\1/g')
501 if [[ -z "$status_ext" ]]; then
502 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
503 break 2
504 fi
505 if [[ "$status_ext" == 'error:filesize-limit' ]]; then
506 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [File size limit of 2 GB exceeded] $1"
507 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
508 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$status_ext] $1" >> failed.log
509 return 1
510 elif [[ "$status_ext" == 'error:proxy-error' ]]; then
511 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [SPN proxy error] $1"
512 else
513 message=$(echo "$request" | grep -Eo '"message":"([^"\\]|\\["\\])*"' | sed -Ee 's/"message":"(.*)"/\1/g')
514 if [[ -z "$message" ]]; then
515 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error: $status_ext] $1"
516 break 2
517 fi
518 if [[ "$message" == "Live page is not available: chrome-error://chromewebdata/" ]]; then
519 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [SPN internal error] $1"
520 elif [[ "$message" =~ ' (HTTP status='(40[89]|429|50[023478])').'$ ]] || [[ "$message" =~ "The server didn't respond in time" ]]; then
521 # HTTP status 408, 409, 429, 500, 502, 503, 504, 507 or 508, or didn't respond in time
522 # URL may become available later
523 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
524 break 2
525 elif [[ "$message" =~ ' (HTTP status='[45][0-9]*').'$ ]]; then
526 # HTTP error; assume the URL cannot be archived
527 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
528 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
529 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$status_ext] $1" >> failed.log
530 return 1
531 else
532 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [$message] $1"
533 break 2
534 fi
535 fi
536 break
537 else
538 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Unknown error] $1"
539 break 2
540 fi
541 done
542 ((tries++))
543 done
544 echo "$(date -u '+%Y-%m-%d %H:%M:%S') [Job failed] $1"
545 echo "$1" >> failed.txt
546 return 1
547}
548
549function get_list(){
550 local failed_file=failed-$(date +%s).txt
551 mv failed.txt $failed_file
552 touch failed.txt
553 local failed_list=$(<$failed_file)
554
555 if [[ -n "$outlinks" ]]; then
556 local outlinks_file=outlinks-$(date +%s).txt
557 mv outlinks.txt $outlinks_file
558 touch outlinks.txt
559 # Remove duplicate lines; reading into string prevents awk from emptying the file
560 awk '!seen [$0]++' <<< "$(<$outlinks_file)" > $outlinks_file
561 # Convert links to HTTPS
562 if [[ -n "$ssl_only" ]]; then
563 sed -Ee 's|^[[:blank:]]*(https?://)?[[:blank:]]*([^[:blank:]]+)|https://\2|g;s|^https://ftp://|ftp://|g' <<< "$(<$outlinks_file)" > $outlinks_file
564 fi
565 # Remove lines that are already in index.txt
566 local outlinks_list=$(awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 index.txt f=2 $outlinks_file)
567
568 if [[ -n "$outlinks_list" ]]; then
569 echo "$outlinks_list" >> index.txt
570
571 if [[ -n "$failed_list" ]]; then
572 echo "$failed_list
573$outlinks_list"
574 else
575 echo "$outlinks_list"
576 fi
577 fi
578 if [[ -z "$(<$outlinks_file)" ]]; then
579 rm $outlinks_file
580 fi
581 else
582 echo "$failed_list"
583 fi
584 if [[ -z "$failed_list" ]]; then
585 rm $failed_file
586 fi
587}
588
589# Track the number of loops in which no URLs from the list are archived
590repeats=0
591
592# Use the linear loop instead when fewer than 2 parallel jobs are requested
593if ((parallel < 2)); then
594	unset parallel
595fi
596
597# Parallel loop
598if [[ -n "$parallel" ]]; then
599 if ((parallel > 60)); then
600 parallel=60
601 echo "Setting maximum parallel jobs to 60"
602 fi
603 echo "$parallel" > max_parallel_jobs$f.txt
604 # Overall request rate stays at around 60 per minute
605 echo "$parallel" > status_rate$f.txt
606 while [[ ! -f quit$f.txt ]]; do
607 (
608 time_since_start="$SECONDS"
609 while IFS='' read -r line || [[ -n "$line" ]]; do
610 capture "$line" & sleep $(<capture_job_rate$f.txt)
611 children_wait=0
612 children=`jobs -p | wc -l`
613 while ! (( children < $(<max_parallel_jobs$f.txt) )); do
614 sleep 1
615 ((children_wait++))
616 if ((children_wait < 600)); then
617 children=`jobs -p | wc -l`
618 else
619 # Wait is longer than 600 seconds; something might be wrong
620 # Increase limit and ignore the problem for now
621 children=0
622 echo $(( $(<max_parallel_jobs$f.txt) + 1 )) > max_parallel_jobs$f.txt
623 fi
624 done
625 lock_wait=0
626 while [[ -f lock$f.txt ]]; do
627 sleep 2
628 ((lock_wait+=2))
629 if ((lock_wait > 300)); then
630 rm lock$f.txt
631 fi
632 done
633 if [[ -f daily_limit$f.txt ]]; then
634 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for $(( (3600 - $(date +%s) % 3600) / 60 )) minutes"
635 sleep $(( 3600 - $(date +%s) % 3600 ))
636 rm daily_limit$f.txt
637 fi
638 # If logged in, then check if the server-side limit for captures has been reached
639 if [[ -n "$auth" ]] && (( children > 4 )); then
640 while :; do
641 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
642 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
643 if [[ "$available" != '"available":0' ]]; then
644 break
645 else
646 sleep 5
647 fi
648 done
649 fi
650 # Check failures and outlinks regularly
651 if (( SECONDS - time_since_start > $(<list_update_rate$f.txt) )) && [[ ! -f quit$f.txt ]] ; then
652 time_since_start="$SECONDS"
653 new_list=$(get_list)
654 if [[ -n "$new_list" ]]; then
655 while IFS='' read -r line2 || [[ -n "$line2" ]]; do
656 capture "$line2" & sleep $(<capture_job_rate$f.txt)
657 children_wait=0
658 children=`jobs -p | wc -l`
659 while ! ((children < $(<max_parallel_jobs$f.txt) )); do
660 sleep 1
661 ((children_wait++))
662 if ((children_wait < 600)); then
663 children=`jobs -p | wc -l`
664 else
665 # Wait is longer than 600 seconds; something might be wrong
666 # Increase limit and ignore the problem for now
667 children=0
668 echo $(( $(<max_parallel_jobs$f.txt) + 1 )) > max_parallel_jobs$f.txt
669 fi
670 done
671 lock_wait=0
672 while [[ -f lock$f.txt ]]; do
673 sleep 2
674 ((lock_wait+=2))
675 if ((lock_wait > 300)); then
676 rm lock$f.txt
677 fi
678 done
679 if [[ -f daily_limit$f.txt ]]; then
680 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for $(( (3600 - $(date +%s) % 3600) / 60 )) minutes"
681 sleep $(( 3600 - $(date +%s) % 3600 ))
682 rm daily_limit$f.txt
683 fi
684 # If logged in, then check if the server-side limit for captures has been reached
685 if [[ -n "$auth" ]] && (( children > 4 )); then
686 while :; do
687 request=$(curl "${curl_args[@]}" -s -m 60 -H "Accept: application/json" -H "Authorization: LOW ${auth}" "https://web.archive.org/save/status/user")
688 available=$(echo "$request" | grep -Eo '"available":[0-9]*' | head -1)
689 if [[ "$available" != '"available":0' ]]; then
690 break
691 else
692 sleep 5
693 fi
694 done
695 fi
696 done <<< "$new_list"
697 unset new_list
698 fi
699 fi
700 done <<< "$list"
701
702 for job in `jobs -p`; do wait $job; done
703 )
704
705 new_list=$(get_list)
706 if [[ "$new_list" == "$list" ]]; then
707 ((repeats++))
708 if ((repeats > 1)); then
709 if ((repeats > 3)); then
710 break
711 else
712 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for 30 minutes"
713 sleep 1800
714 fi
715 fi
716 fi
717 list="$new_list"
718 unset new_list
719 if [[ -z "$list" && -z "$(<failed.txt)" ]]; then
720 # No more URLs
721 touch quit$f.txt
722 rm failed.txt
723 fi
724 done
725fi
726
727if [[ ! -f quit$f.txt ]]; then
728 echo "2" > status_rate$f.txt
729fi
730
731# Linear loop
732while [[ ! -f quit$f.txt ]]; do
733 time_since_start="$SECONDS"
734 while IFS='' read -r line || [[ -n "$line" ]]; do
735 time_since_capture_start="$SECONDS"
736 capture "$line"
737 if [[ $(bc <<< "$SECONDS - $time_since_capture_start < $(<capture_job_rate$f.txt)") == "1" ]]; then
738 sleep $(bc <<< "$(<capture_job_rate$f.txt) - ($SECONDS - $time_since_capture_start)")
739 fi
740 # Check failures and outlinks regularly
741 if (( SECONDS - time_since_start > $(<list_update_rate$f.txt) )) && [[ ! -f quit$f.txt ]] ; then
742 time_since_start="$SECONDS"
743 new_list=$(get_list)
744 if [[ -n "$new_list" ]]; then
745 while IFS='' read -r line2 || [[ -n "$line2" ]]; do
746 time_since_capture_start="$SECONDS"
747 capture "$line2"
748 if [[ $(bc <<< "$SECONDS - $time_since_capture_start < $(<capture_job_rate$f.txt)") == "1" ]]; then
749 sleep $(bc <<< "$(<capture_job_rate$f.txt) - ($SECONDS - $time_since_capture_start)")
750 fi
751 done <<< "$new_list"
752 fi
753 unset new_list
754 fi
755 done <<< "$list"
756 new_list=$(get_list)
757 if [[ "$new_list" == "$list" ]]; then
758 ((repeats++))
759 if ((repeats > 1)); then
760 if ((repeats > 4)); then
761 # Give up
762 touch quit$f.txt
763 else
764 echo "$(date -u '+%Y-%m-%d %H:%M:%S') Pausing for 30 minutes"
765 sleep 1800
766 fi
767 fi
768 fi
769 list="$new_list"
770 unset new_list
771 if [[ -z "$list" && -z "$(<failed.txt)" ]]; then
772 # No more URLs
773 touch quit$f.txt
774 rm failed.txt
775 fi
776done
777
778if [[ -n "$custom_dir" ]]; then
779 for i in max_parallel_jobs$f.txt status_rate$f.txt list_update_rate$f.txt lock$f.txt daily_limit$f.txt quit$f.txt; do
780 if [[ -f "$i" ]]; then
781 rm "$i"
782 fi
783 done
784
785 echo "
786
787== Ending $(basename "$0") ==
788Data folder: $dir
789"
790else
791 echo "
792
793== Ending $(basename "$0") ==
794Data folder: $dir
795"