API for discovering changes to archive.org items
On this page
API for discovering changes to archive.org items#
At the core of the archive.org content management system there is a task manager which changes the items in the archive.
This API provides access to the list of identifiers which have changed since a particular time.
This API also supports a cold-start mode, which will enumerate all the items in the archive, in addition to providing the change stream.
Changes API#
URL
https://be-api.us.archive.org/changes/v1Method
POSTArguments:
accessS3 access key from https://archive.org/account/s3.phpsecretS3 secret key from https://archive.org/account/s3.phpstart_dateif present sets the beginning point for the list of changed identifiers.a date stamp of the form
YYYYMMDD0to indicate a cold start, which will cause all identifiers to be returned, followed by any changes.
tokenoptional, a value from thenext_tokenfield from a previous call to this API. When present, the API returns the next set of results in the sequence of changes to the petabox.
In order to access the API, the account holding the secret and access keys requires / privs, or the see_all_catalog_changes privilege. Please contact a member of the collections team to arrange this access. Access control is required because the changes API reveals the presence of noindex items, which are otherwise unknown to the general public.
Starting from now, see the changes in the archive#
ACCESS=<ias3 access key>
SECRET=<ias3 secret key>
curl --data-urlencode access="$ACCESS" --data-urlencode secret="$SECRET" \
https://be-api.us.archive.org/changes/v1 | jq .
Returns:
{
"estimated_distance_from_head": 0,
"do_sleep_before_returning": true,
"changes": [],
"next_token": "eyJmaW5pc2hlZF9tYXJrZXIiOjUzMjI3NjI3MH0="
}
Then to see the next set of changes, make the request again with the next_token:
curl --data-urlencode access="$ACCESS" --data-urlencode secret="$SECRET" \
--data-urlencode token="eyJmaW5pc2hlZF9tYXJrZXIiOjUzMjI3NjI3MH0=" \
https://archive.org/services/changes.php | jq .
Returns:
{
"estimated_distance_from_head": 0,
"do_sleep_before_returning": true,
"changes": [
{
"identifier": "TELECONGO_20171026_223000"
},
...
{
"identifier": "ARCHIVEIT-6334-CDL-20140408-00002"
}
],
"next_token": "eyJmaW5pc2hlZF9tYXJrZXIiOjUzMjI3Nzg2M30="
}
Each time, the call is made with the next_token, the next batch of results is returned. The fields returned are:
estimated_distance_from_headThis value does not always appear, when it does, it is the number of tasks on the done list, beyond the end of the current data set.changesA list of records of items which have changed. This may contain more data about the item and change in the future.next_tokenThe token which, when used on a subsequent call will return the batch of changes following the ones given in thechangeslist.do_sleep_before_returningA boolean which advises the caller to either take a break before the next call to the endpoint (because this data is very near the most recent possible), or if there is still much more change data to be consumed after this batch.
The api supports two other starting modes.
Starting from a date, see changes made after midnight on that UTC day#
curl --data-urlencode access="$ACCESS" --data-urlencode secret="$SECRET" \
--data-urlencode start_date=20170401
https://be-api.us.archive.org/changes/v1 | jq .
Returns:
{
"estimated_distance_from_head": 90828632,
"do_sleep_before_returning": false,
"changes": [
{
"identifier": "in.ernet.dli.2015.550031"
},
...
{
"identifier": "jamaica-gleaner.com-20160320-230158"
}
],
"next_token": "eyJmaW5pc2hlZF9tYXJrZXIiOjQ0MTQ1Mjc2Nn0="
}
To fetch more results, make a request using the next_token, like so:
curl --data-urlencode access="$ACCESS" --data-urlencode secret="$SECRET" \
--data-urlencode token="eyJmaW5pc2hlZF9tYXJrZXIiOjQ0MTQ1Mjc2Nn0=" \
https://be-api.us.archive.org/changes/v1 | jq .
Returns:
{
"estimated_distance_from_head": 90779995,
"do_sleep_before_returning": false,
"changes": [
{
"identifier": "china.org.cn-20170101-092922"
},
...
{
"identifier": "GurudevQuotes"
}
],
"next_token": "eyJmaW5pc2hlZF9tYXJrZXIiOjQ0MTUwMjc2Nn0="
}
Starting from nothing, load all the identifiers, then pick up changes#
This API supports a cold start mode where all items are dumped out, then changes since the full item dump time are sent. This is done by setting start_date to 0.
For example:
curl --data-urlencode access="$ACCESS" --data-urlencode secret="$SECRET" \
--data-urlencode start_date=0
https://be-api.us.archive.org/changes/v1 | jq .
Returns:
{
"do_sleep_before_returning": false,
"changes": [
{
"identifier": "0----------"
},
...
{
"identifier": "008MRAnonymous"
}
],
"next_token": "eyJmaW5pc2hlZF9tYXJrZXIiOiI1MzIyODQzOTAiLCJzY2FuX3N0YXJ0IjoiMDA4TVJBbm9ueW1vdXMifQ=="
}
Continuing calls with token set to next_token will eventually return all changes up until the present.