Internet Archive Census #20150304
This item contains scripts and output from a census done on 2015-03-04 to determine the size of all of the Internet Archive's non-derivative files from public items.
metamgr-norm-ids-20150304205357.txt.gz is an itemlist containing all public Archive.org items as of 2015-03-04T20:53:57. This list was used as input for the following command:
./ia-mine-0.5-py3.3.pex metamgr-norm-ids-20150304205357.txt --workers 600 2>/dev/null | pv -lacbrN 'mine' | ./parallel-chunks.sh jq -c -r -f get_file_size_md.jq | pv -lacbrN 'parse' | gzip > public-file-size-md_20150304205357.json.gz
public-file-size-md_20150304205357.json.gz is a gzip-compressed, line-delimited JSON file. Each record describes a single item and contains its "id", its "collection" (either a string or an array of collections), its "total_size", and selected "files" metadata. "total_size" is the combined size in bytes of all non-derivative files found in the item, not the total size of the item. Each entry in "files" includes the "size", "name", "md5", and "format" of one file. Note: the output has been filtered to exclude items that contain access-restricted files.
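As a rough illustration of the record layout described above, the following Python sketch builds such a record from one item's metadata. It is not the get_file_size_md.jq filter from the gist. The input shape (a "metadata" object plus a "files" list whose entries carry a "source" field, with derivatives marked "derivative" and sizes given as strings) and the function name build_record are assumptions made for this example.

```python
def build_record(item):
    """Build a census-style record from one item's metadata.

    Assumed (hypothetical) input shape:
      {"metadata": {"identifier": ..., "collection": ...},
       "files": [{"name", "source", "size", "md5", "format"}, ...]}
    """
    files = []
    for f in item.get("files", []):
        # Assumption: derivative files are marked with source == "derivative".
        if f.get("source") == "derivative":
            continue
        files.append({
            # Sizes may arrive as strings; normalise them to integers.
            "size": int(f.get("size") or 0),
            "format": f.get("format"),
            "name": f.get("name"),
            "md5": f.get("md5"),
        })
    meta = item.get("metadata", {})
    return {
        "files": files,
        # Sum of the non-derivative file sizes only, not the whole item.
        "total_size": sum(f["size"] for f in files),
        "id": meta.get("identifier"),
        # May be a single string or a list of collection names.
        "collection": meta.get("collection"),
    }
```

The sketch does not attempt the access-restriction filtering mentioned above, since the page does not say how restricted files are detected.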
An example record:
{
  "files": [
    {
      "size": 3227238,
      "format": "VBR MP3",
      "name": "Sabeeluna-al-jeehed.mp3",
      "md5": "3dba3ab6f2c077d2be399f51d0c96db2"
    },
    {
      "size": 3737995,
      "format": "VBR MP3",
      "name": "Dilon_kay_Hukmaran.mp3",
      "md5": "2e2c6df7123315c9cd22d375180ef637"
    },
    {
      "size": 3520052,
      "format": "VBR MP3",
      "name": "ShaheediJawaan.mp3",
      "md5": "dd460965fa4c49318ed7f6fd585cf973"
    },
    {
      "size": 2748544,
      "format": "VBR MP3",
      "name": "MeriAmmi.mp3",
      "md5": "ee64d30e80694015a910015a9546d925"
    },
    {
      "size": 1410576,
      "format": "VBR MP3",
      "name": "ek-sitara-tha-main.mp3",
      "md5": "f81d9aed3375f1249ee68fde5bf9be1d"
    },
    {
      "size": 1874048,
      "format": "VBR MP3",
      "name": "ab-ehd-e-ghulami.mp3",
      "md5": "153473b7747c9fd7a1b68aa1aba2d478"
    },
    {
      "size": 2435200,
      "format": "VBR MP3",
      "name": "al-madat-al-madat.mp3",
      "md5": "d6a9856207e118584740c04b18a18f71"
    },
    {
      "size": 3859017,
      "format": "VBR MP3",
      "name": "BaharoonSePehlay.mp3",
      "md5": "b32cb821ff8ae32de67990a77c665984"
    },
    {
      "size": 3883259,
      "format": "VBR MP3",
      "name": "Jeenay-Ka-haq.mp3",
      "md5": "dd29b48a92af20515bbe9885a5cc083a"
    },
    {
      "size": 3651709,
      "format": "VBR MP3",
      "name": "INTIFAZA.mp3",
      "md5": "c019efe3099d6b1b8652b74354f5ad38"
    },
    {
      "size": 781,
      "format": "Metadata",
      "name": "AansoonAurAhoon-MP3_meta.xml",
      "md5": "bc8438a5cd38a117d16e28c0d552ebb9"
    },
    {
      "size": 8831,
      "format": "Archive BitTorrent",
      "name": "AansoonAurAhoon-MP3_archive.torrent",
      "md5": "4ac3a250ef9bf94dcd3cb80616bfc31c"
    },
    {
      "size": 0,
      "format": "Metadata",
      "name": "AansoonAurAhoon-MP3_files.xml",
      "md5": "3b49eeba7c86a1c31734f4200fed8783"
    }
  ],
  "total_size": 30357250,
  "id": "Urdu-Trana-001",
  "collection": [
    "iraq_middleeast",
    "iraq_war",
    "newsandpublicaffairs"
  ]
}
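For readers who want to work with the census file directly, here is a minimal Python sketch for streaming records out of the compressed file one line at a time. The function names iter_records and collections_of are made up for this example. It assumes only what is described above: one JSON object per line, with "collection" given as either a string or an array.

```python
import gzip
import json


def iter_records(path):
    """Yield one parsed record per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def collections_of(record):
    """Return the record's collections as a list, whatever form they take."""
    value = record.get("collection")
    if value is None:
        return []
    if isinstance(value, str):
        # A single collection is stored as a bare string.
        return [value]
    return list(value)
```

Streaming line by line keeps memory use flat, which matters for a file with millions of records. A loop such as `for rec in iter_records("public-file-size-md_20150304205357.json.gz"): ...` would visit every item in turn.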
all-ids-got-sorted.txt.gz is a compressed list of identifiers of all of the items that were successfully retrieved from metamgr-norm-ids-20150304205357.txt.gz. unretrievable-items.txt is a list of identifiers of all of the items that were not retrieved successfully.
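The relationship between the three identifier lists can be sketched as a set difference. The Python below is only an illustration and not the script that produced these files. It assumes one identifier per line, and the helper names read_ids and split_retrieved are hypothetical.

```python
import gzip


def read_ids(path):
    """Read identifiers, one per line, from a plain or gzipped text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def split_retrieved(requested_ids, retrieved_ids):
    """Split the requested identifiers into (retrieved, unretrievable), both sorted."""
    requested = set(requested_ids)
    retrieved = set(retrieved_ids)
    got = sorted(requested & retrieved)
    # Anything requested but never seen in the output counts as unretrievable.
    missing = sorted(requested - retrieved)
    return got, missing
```

Under these assumptions, the second list returned by split_retrieved for the input itemlist and the identifiers found in the output would correspond to unretrievable-items.txt.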
The following command was then used to calculate the total size of all non-derivative files in the 14,926,080 items successfully retrieved:
gzip -dc public-file-size-md_20150304205357.json.gz | jq '.total_size' | pv -larcb | perl -nle '$sum += $_ } END { print $sum'
The total size is 14225047435566359 bytes, or 14.23 petabytes.
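The arithmetic behind that figure can be checked with a short Python sketch. The function names are illustrative only. The 14.23 figure uses decimal petabytes (10^15 bytes); in binary units the same byte count is roughly 12.63 pebibytes (2^50 bytes).

```python
def total_bytes(records):
    """Sum the "total_size" field over an iterable of census records."""
    return sum(int(r["total_size"]) for r in records)


def to_petabytes(n_bytes):
    """Convert bytes to decimal petabytes (1 PB = 10**15 bytes)."""
    return n_bytes / 10**15


def to_pebibytes(n_bytes):
    """Convert bytes to binary pebibytes (1 PiB = 2**50 bytes)."""
    return n_bytes / 2**50


census_total = 14225047435566359  # byte count reported by the census
print(f"{to_petabytes(census_total):.2f} PB")
print(f"{to_pebibytes(census_total):.2f} PiB")
```

Python integers are arbitrary precision, so summing the sizes this way cannot overflow or lose precision, whatever the number of records.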
The ia-mine binary used in the command above is available at archive.org/download/iamine-pex/ia-mine-0.5-py3.3.pex. Other ia-mine binaries can be found at archive.org/details/iamine-pex. The other source files are available at https://gist.github.com/jjjake/161b318d9d5114051cd6
- Addeddate: 2015-03-06 06:18:05
- Identifier: ia-bak-census_20150304
- Noindex: true
- Scanner: Internet Archive Python library 0.7.7
- Year: 2015
IN COLLECTIONS
Data Collection
Uploaded by jakej
SIMILAR ITEMS (based on metadata)
Topics: Internet, Census, Carna, Botnet, Port scan
Topic: 20150304 100000
Topic: NurayAydinogluElvanCantekinArgunYumGurhanErtur
Topic: PaylinYetvartTomasyan
Topics: tech, news
Topic: IBHK