Internet Archive Tasks API
On this page
Internet Archive Tasks API#
What is a Task?#
Items are updated via tasks. Uploading new content, changing metadata, renaming and deleting files, transforming original content into derivatives—all are performed with tasks.
Tasks are queued in a system known as catalog
. Only one task is permitted to execute on an item at a time. catalog
schedules tasks and ensures they run in a timely manner. When a task completes, it is stored as item history. When a task runs, its output is stored in a log text file. The Changes API allows for tracking completed task across all items as they complete.
The Tasks API exposes running, pending, and historical tasks to callers with optional criteria to filter results. It also offers an interface for submitting a subset of the available tasks to items the user owns or has permission to edit. It also offers an access method for each task’s textual log.
Tasks API is a REST-like interface using JSON to transmit and receive payloads.
The API requires IA-S3 keys for user authentication. It currently does not accept Internet Archive cookies.
Service Overview#
The Tasks API is available at https://archive.org/services/tasks.php Task logs are available at https://catalogd.archive.org/services/tasks.php
These URLs accept only two HTTP methods: GET
and POST
. GET
is used for task listing and tasks logs, and POST
for submitting new tasks.
For more information on how to interact with the Tasks API, see Internet Archive RESTful microservices.
JSON Lines#
In one case (explained in “Streaming the Task List” below) the server may respond with JSON-L (JSON Lines). The Content-Type
will be application/json-l
for this type of response.
The HTTP status code should be used to determine success or failure. On success (200 OK
) the response will be separate JSON objects, one per line. Each JSON object is delimited with a newline (\n
) character. Otherwise the usual JSON response format is returned with success
, error
, and value
.
See jsonlines.org for more information.
Task Listings#
Pending and historical tasks may be read via an HTTP GET
. In addition to the optional version
parameter, Tasks API accepts any combination of category and criteria parameters.
Categories#
Three categories may be requested:
summary
- Total counts of catalog tasks meeting all criteria (explained below) organized by run status (queued, running, error, and paused). Historical totals are currently unavailable
catalog
- A list of all active tasks (queued, running, error, or paused) matching the supplied criteria
history
- A list of all completed tasks matching the supplied criteria
Any combination of the above may be specified:
summary
is the default category. It’s always included unless explicitly excluded (with summary=0
).
Criteria#
Multiple criteria may be requested:
identifier
(string) - item identifier (may be wildcarded, see below)
task_id
(integer) - task identifier
server
(string) - IA node task will or was executed upon (may be wildcarded)
cmd
(string) - Task command (e.g.,archive.php
,modify_xml.php
, etc.; may be wildcarded)
args
(string) - Argument list (see below; may be wildcarded)
submitter
(string) - User submitting task (may be wildcarded)
priority
(integer) - Generally from 10 to -10
wait_admin
(integer) - Task run state (see “wait_admin and Run States”, below)
or
status
(string) orcolor
(string)
And four criteria for searching submittime
by range:
submittime>
submittime<
submittime>=
submittime<=
The submittime
parameters may be any recognizable date/time as a string; the Tasks API will attempt to parse it into the internal stored format:
This will search for all tasks submitted on or after 1 January, 2018 00:00:00 UTC. (Note that the “>=” of the submittime parameter has been encoded as %3E%3D.)
Any combination of criteria may be specified along with categories:
Currently only one (1) criteria of each type is respected in the search. Specifying the same criteria more than once (such as two identifiers) will lead to unspecified results.
Logically, all criteria are AND-ed when searching. There is currently no support for other logical operators (OR, NOT, etc.)
Limits and the Cursor#
Task listings, especially history, may be voluminous. The Tasks API will only return a limited number of tasks in a single request. The current default is 50 tasks per request, but the caller may request more with the limit
parameter:
The current maximum limit is 500 tasks. Values outside this range are clamped to the server maximum.
If Tasks API reaches the limit maximum and can continue, it will return a cursor
value in the JSON response. The cursor may be included in a subsequent request to the Tasks API to continue listing.
For example, consider this request:
This will return the most recent ten tasks from the history of the opensource
collection. The result will also include a cursor
field with an opaque string (for example, “c:123456”). In order for the client to continue iterating the list of historical tasks, it should call:
(Note that the colon in “c:123456” has been encoded as %3A.) This will continue listing and return the next ten tasks from the item’s history. Tasks API will again return a “cursor” which should be included in the next request. The caller should continue until no cursor
is returned, indicating the list is completed.
The caller must include all criteria and categories from the original request in subsequent calls. (In other words, it should simply send the same URL but with the cursor
parameter included.) Sending different criteria or categories will result in undefined behavior.
Streaming the Task List#
If the caller wishes to receive all tasks in a single round-trip, they may set limit=0
in the request query. If successful, the response will be in JSON-L format (explained in “JSON Lines”) with a single JSON object per line. Otherwise a normal JSON response is returned with error information.
In addition to the usual per-task information returned, another field is included: category
. It holds one of three values: summary
(only one per request), catalog
, or history
. This indicates which type of JSON object is being returned for that line.
Wildcards#
Both *
(asterisk) and %
(percentage sign) may be used as wildcard characters within criteria that accept them:
(Note the wildcard *
is encoded as %2A
here.) This will return a summary of all pending tasks for identifiers starting with podcast-
.
wait_admin and Run States#
Four official run states are supported. Each run state has a number (wait_admin
), a color (for descriptive purposes), and a human-readable label:
0: Queued (green)
1: Running (blue)
2: Error (red)
9: Paused (brown)
The Tasks API accepts wait_admin
, color
, or status
as request criteria:
Searching Task History#
Due to internal limitations, the history
category may only be searched if identifier
or task_id
is specified. Other criteria may be included, but if neither identifier
nor task_id
are present, the request will fail.
When searching category history
, identifier
may not include wildcards.
Response Value#
The task list is a complex tree structure with up to four top-level keys. Each key corresponds to one of the three categories requested (summary
, catalog
, and history
) as well as a cursor
value if the listing is truncated (explained above).
Task Logs#
A task’s raw log file may be accessed by the Tasks API. Unlike the Task Listing and Task Submission interfaces, no JSON is returned. The response is merely the log file as plain text.
Request#
The task log is accessed by GET
with the task_log
query parameter set to the task identifier the client wishes to read. All other query parameters are ignored.
Response#
On success the task log is returned as a text/plain
payload with an HTTP 200 OK
response status.
Redirects#
Task logs are only available if the Tasks API is accessed from the catalog server (catalogd.archive.org
). Any attempt to access a task log from a web server (archive.org
) will result in an HTTP 301 Moved Permanently
directing the client to the catalog server.
Clients are recommended to direct their request to catalogd.archive.org
for all task log requests (but not the other requests).
Caching#
The Tasks API will include a Last-Modified
HTTP header in its response. The client can use this timestamp for caching the task log locally and detecting changes.
See the HTTP specification for more information, especially the sections on Last-Modified and If-Modified-Since.
Task Submission#
Tasks may be submitted to items the user either uploaded to Internet Archive or has permissions to edit. Tasks are submitted by HTTP POST
to the service endpoint.
Request Entity#
Unlike the GET
interface, POST
expects a JSON entity (payload) describing which task to submit, item identifier, and command-specific arguments. The supplied JSON object should have the following fields:
identifier
(string) - Item identifier
cmd
(string) - Task command to submit (see below)
args
(array) - Map of key-value pairs (see below)
priority
(integer, optional) - Task priority from 10 to -10 (default: 0)
See “Custom headers” for special request and response headers Tasks API supports when submitting tasks.
Supported Tasks#
Currently the following tasks are supported for submission:
book_op.php
bup.php
delete.php
derive.php
fixer.php
make_dark.php
make_undark.php
rename.php
book_op.php#
Various book operations are available. Each are activated by including an argument named op#
(where the # is replaced by a numeral) with a value specifying which book operation should be performed. Please contact IA for more information.
bup.php#
Schedule a task to backup the PRIMARY copy of the item to its SECONDARY server. Generally, this is not required, as all tasks will perform this backup when finished.
delete.php#
Delete the item. This removes all files from the IA servers.
WARNING: This is not reversible. Once data has been deleted, it cannot be restored.
derive.php#
Deriving performs content transformation(s) on the files in the item. The nature and scope of those transformations is too broad to list here.
The caller may specify already-derived files be removed prior to running the derive. Use the remove_derived
argument to specify a filename or a file specification, e.g.,`remove_derived=*.jpg`:code: (Original files uploaded to the item are not deleted, even if they match the file specification.)
fixer.php#
A fixer op is a miscellaneous operation being performed on the item, usually to correct an issue with it. Various fixer ops are available, each activated by including its name as an argument name. Please contact IA for more information.
make_dark.php/make_undark.php#
Darking an item is making it unavailable to any user, including the item owner. The item’s contents are unavailable to IA’s internal subsystems as well (include the Metadata API and the search engine).
A darked item may be undarked later.
Both tasks require a single argument: comment
. The caller should provide some reasonable explanation for why the item is being darked or undarked.
rename.php#
Rename the item’s identifier. A new_identifier
argument must be specified as the destination identifier.
If the new_identifier
is already present, a 409 Conflict
is returned.
WARNING: Many IA features (especially for derive.php
) rely on the files in an item to have a matching name as the item identifier itself. (For example, an item named foobar
may have an image stack named foobar_images.zip
.) rename.php
will attempt to rename item files as well, but it’s possible for certain cases to be missed.
Often a better solution is to create a new item with the desired identifier and, once ready, delete the old item.
Authorization#
The user must either have uploaded the item or have permission to edit it.
Response Value#
If success
is true
, the returned value
will include two fields:
task_id
(integer) - The task identifier of the pending task
log
(string) - A URL to the task log (written to when the task starts running)
Rate Limits#
Users are limited to the number of tasks they may submit over a period of time. Clients should be prepared to receive a 429 Too Many Requests
response, indicating the user’s threshold has been reached. The client should either report this error or sleep for a period of time before retrying.
The server may return a Retry-After
header with the response. The client may use this value as a suggestion for the amount of time to pause.
Rate limit reporting#
Tasks API may be queried for the client’s current threshold and active task counts. Limits are determined per-task-type.
The client should use GET
with two query values: rate_limits=1
and cmd=<cmd>.php
. The server will respond with the user’s current task limit, the number of inflight tasks for that command (tasks queued plus tasks running), and the number of tasks for that command currently blocked by OFFLINE nodes (indicating the server is unavailable or in service).
Note that IA may adjust rate limiting thresholds and policies at any time.
Rerunning a task#
If a task fails (wait_admin
of 2 or “red”), the user may rerun the task. This is useful if the task failed due to a transient error (network failure, etc.)
Request Entity#
As with task submission, PUT
expects a JSON entity describing which task to rerun. The supplied JSON object should have the following fields:
op
(string) - ‘rerun’
task_id
(int) - Task identifier
Response Value#
If success
is true
, the returned value
will be a JSON object with the task identifier (as a string, not an integer) and the item identifier. (This format is used in case support for multiple task reruns is included in the future.)
Examples#
These examples show HTTP request and responses with extraneous headers trimmed.
In these examples although the entity may be compressed, the uncompressed data is shown.
Listing a User’s Tasks#
To view all pending tasks for user@example.com
:
GET /services/tasks.php?submitter=user%40example.com&catalog=1 HTTP/1.1
Host: archive.org
Authorization: LOW <s3-access>:<s3-secret>
Accept-Encoding: deflate, gzip
The response may look something like this:
HTTP/1.1 200 OK
Host: archive.org
Content-Type: application/json
Transfer-Encoding: chunked
Content-Encoding: gzip
{"success":true,"value":{"summary":{"queued":0,"running":0,"error":0,"paused":0},"catalog":[]}}
Accessing a Log File#
To view the log file for completed task #1230802773 from catalog (catalogd.archive.org
):
GET /services/tasks.php?task_log=1230802773 HTTP/1.1
Host: catalogd.archive.org
Authorization: LOW <s3-access>:<s3-secret>
Accept-Encoding: deflate, gzip
Results in this response:
HTTP/1.1 200 OK
Content-Type: text/plain;charset=UTF-8
Last-Modified: Fri, 14 Jun 2019 18:06:40 GMT
-------------------------------------------------------
Task started at: UTC: 2019-06-14 18:06:36 ( PDT: 2019-06-14 11:06:36)
Task pid: 19879
**... log file continues...**
Darking an Item#
To dark item did_not_mean_to_upload
, the caller could submit a request as so:
POST /services/tasks.php HTTP/1.1
Host: archive.org
Authorization: LOW <s3-access>:<s3-secret>
Content-Type: application/json
Content-Length: 93
Accept-Encoding: deflate, gzip
{"identifier":"did_not_mean_to_upload","cmd":"make_dark.php","args":{"comment":"my mistake"}}
Note that args
provides an object within the enclosing object.
The response may look something like this:
HTTP/1.1 200 OK
Host: archive.org
Content-Type: application/json
Transfer-Encoding: chunked
Content-Encoding: gzip
{"success":true,"value":{"task_id":1234567,"log":"https://catalogd.archive.org/log/1234567"}}
Rerunning a task#
To rerun a stopped task:
PUT /services/tasks.php HTTP/1.1
Host: archive.org
Authorization: LOW <s3-access>:<s3-secret>
Content-Type: application/json
{"op":"rerun","task_id":1234}
A successful response would look like this:
HTTP/1.1 200 OK
Host: archive.org
Content-Type: application/json
{"success":true,"value":{"1234":"my_ia_item"}}
Accessing the user’s task submission limits#
To get a report of the user’s current rate limit situation:
GET /services/tasks.php?rate_limits=1&cmd=modify_xml.php HTTP/1.1
Host: archive.org
Authorization: LOW <s3-access>:<s3-secret>
A successful response may be:
HTTP/1.1 200 OK
Host: archive.org
Content-Type: application/json
{"success":true,"value":{"cmd":"modify_xml.php","task_limits":500,"tasks_inflight":120,"tasks_blocked_by_offline":0}}
This indicates the user has 120 modify_xml.php
tasks (queued + running) outstanding, zero of them are blocked by an OFFLINE node, and is operating under a limit of 500 modify_xml.php
tasks total.