Internet Archive Census #20150304
This item contains scripts and output from a census done on 2015-03-04 to determine the size of all of the Internet Archive's non-derivative files from public items.
metamgr-norm-ids-20150304205357.txt.gz is an itemlist containing all public Archive.org items as of 2015-03-04T20:53:57. This list was used as input for the following command:
./ia-mine-0.5-py3.3.pex metamgr-norm-ids-20150304205357.txt --workers 600 2>/dev/null | pv -lacbrN 'mine' | ./parallel-chunks.sh jq -c -r -f get_file_size_md.jq | pv -lacbrN 'parse' | gzip > public-file-size-md_20150304205357.json.gz
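The jq filter get_file_size_md.jq is not reproduced on this page, so the following Python sketch only illustrates the kind of reduction the pipeline performs on each item's metadata. It assumes each mined item is a dict with a "metadata" object and a "files" list, and that derivative files carry "source": "derivative"; both are assumptions, and the function name census_record is hypothetical. The access-restricted filtering described below is not modelled here.

```python
def census_record(item):
    """Reduce one item's metadata dict to a census-style record (illustrative only)."""
    files = []
    for f in item.get("files", []):
        # Assumption: derivative files are marked with source == "derivative".
        if f.get("source") == "derivative":
            continue
        files.append({
            # Sizes may arrive as strings or be missing; treat missing as 0.
            "size": int(f.get("size") or 0),
            "format": f.get("format"),
            "name": f.get("name"),
            "md5": f.get("md5"),
        })
    metadata = item.get("metadata", {})
    return {
        "files": files,
        "total_size": sum(f["size"] for f in files),
        "id": metadata.get("identifier"),
        "collection": metadata.get("collection"),
    }


# Hypothetical input, not taken from the census data.
example_item = {
    "metadata": {"identifier": "example-item", "collection": "example_collection"},
    "files": [
        {"name": "a.mp3", "source": "original", "size": "10", "format": "VBR MP3", "md5": "m1"},
        {"name": "a.ogg", "source": "derivative", "size": "5", "format": "Ogg Vorbis", "md5": "m2"},
    ],
}
print(census_record(example_item))
```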
public-file-size-md_20150304205357.json.gz is a compressed, line-delimited JSON file. Each record describes a single item and contains its "id", its "collection" (either a string or an array of collections), its "total_size", and selected "files" metadata. Note that "total_size" is the total size of all non-derivative files found in the item, not the total size of the item. The "files" metadata includes the "size", "name", "md5", and "format" of each file. Note: the output has been filtered to exclude items that contain access-restricted files.
An example record:
{
  "files": [
    { "size": 3227238, "format": "VBR MP3", "name": "Sabeeluna-al-jeehed.mp3", "md5": "3dba3ab6f2c077d2be399f51d0c96db2" },
    { "size": 3737995, "format": "VBR MP3", "name": "Dilon_kay_Hukmaran.mp3", "md5": "2e2c6df7123315c9cd22d375180ef637" },
    { "size": 3520052, "format": "VBR MP3", "name": "ShaheediJawaan.mp3", "md5": "dd460965fa4c49318ed7f6fd585cf973" },
    { "size": 2748544, "format": "VBR MP3", "name": "MeriAmmi.mp3", "md5": "ee64d30e80694015a910015a9546d925" },
    { "size": 1410576, "format": "VBR MP3", "name": "ek-sitara-tha-main.mp3", "md5": "f81d9aed3375f1249ee68fde5bf9be1d" },
    { "size": 1874048, "format": "VBR MP3", "name": "ab-ehd-e-ghulami.mp3", "md5": "153473b7747c9fd7a1b68aa1aba2d478" },
    { "size": 2435200, "format": "VBR MP3", "name": "al-madat-al-madat.mp3", "md5": "d6a9856207e118584740c04b18a18f71" },
    { "size": 3859017, "format": "VBR MP3", "name": "BaharoonSePehlay.mp3", "md5": "b32cb821ff8ae32de67990a77c665984" },
    { "size": 3883259, "format": "VBR MP3", "name": "Jeenay-Ka-haq.mp3", "md5": "dd29b48a92af20515bbe9885a5cc083a" },
    { "size": 3651709, "format": "VBR MP3", "name": "INTIFAZA.mp3", "md5": "c019efe3099d6b1b8652b74354f5ad38" },
    { "size": 781, "format": "Metadata", "name": "AansoonAurAhoon-MP3_meta.xml", "md5": "bc8438a5cd38a117d16e28c0d552ebb9" },
    { "size": 8831, "format": "Archive BitTorrent", "name": "AansoonAurAhoon-MP3_archive.torrent", "md5": "4ac3a250ef9bf94dcd3cb80616bfc31c" },
    { "size": 0, "format": "Metadata", "name": "AansoonAurAhoon-MP3_files.xml", "md5": "3b49eeba7c86a1c31734f4200fed8783" }
  ],
  "total_size": 30357250,
  "id": "Urdu-Trana-001",
  "collection": [ "iraq_middleeast", "iraq_war", "newsandpublicaffairs" ]
}
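Because "collection" may be either a string or an array, code that reads these records has to normalize it. The sketch below shows one way to do that in Python and also checks that "total_size" equals the sum of the listed file sizes, which holds for the example record above (its 13 file sizes add up to 30357250). The helper names and the short sample line are hypothetical and are not part of the census scripts.

```python
import json


def collections_of(record):
    """Return the record's collections as a list, whether stored as a string or an array."""
    value = record.get("collection", [])
    return [value] if isinstance(value, str) else list(value)


def size_matches(record):
    """Check that total_size equals the sum of the per-file sizes in the record."""
    return record["total_size"] == sum(f["size"] for f in record["files"])


# Hypothetical one-line record in the same shape as the census output.
line = (
    '{"files": [{"size": 781, "format": "Metadata", "name": "x_meta.xml", "md5": "0"},'
    ' {"size": 219, "format": "VBR MP3", "name": "x.mp3", "md5": "1"}],'
    ' "total_size": 1000, "id": "example-item", "collection": "example_collection"}'
)
record = json.loads(line)
print(collections_of(record))
print(size_matches(record))
```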
all-ids-got-sorted.txt.gz is a compressed list of identifiers of all of the items that were successfully retrieved from metamgr-norm-ids-20150304205357.txt.gz. unretrievable-items.txt is a list of identifiers of all of the items that were not retrieved successfully.
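This page does not say how unretrievable-items.txt was produced. One plausible way to derive such a list is a set difference between the input itemlist and the identifiers that were retrieved; the Python sketch below illustrates that idea under the assumption that both files hold one identifier per line. The function names are hypothetical.

```python
import gzip


def read_ids(path):
    """Read one identifier per line from a plain or gzip-compressed text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def unretrieved(requested, retrieved):
    """Return the sorted identifiers that were requested but never retrieved."""
    return sorted(requested - retrieved)


# Example usage (file names as given on this page):
# missing = unretrieved(
#     read_ids("metamgr-norm-ids-20150304205357.txt.gz"),
#     read_ids("all-ids-got-sorted.txt.gz"),
# )
```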
The following command was then used to calculate the total size of all non-derivative files in the 14,926,080 items successfully retrieved:
jq '.total_size' < public-file-size-md_20150304205357.json.gz | pv -larcb | perl -nle '$sum += $_ } END { print $sum'
The total size is 14225047435566359 bytes, or 14.23 petabytes.
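As a cross-check, the same sum can be computed in Python. This is a minimal sketch, not the command that was actually run: it decompresses the file itself, streams it one record at a time, and converts bytes to decimal petabytes (10^15 bytes), which is the unit that gives 14.23 for the total above. In binary units the same total is roughly 12.63 PiB. The function names are hypothetical.

```python
import gzip
import json


def total_bytes(path):
    """Sum the total_size field over a gzip-compressed, line-delimited JSON file."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                total += json.loads(line)["total_size"]
    return total


def petabytes(num_bytes):
    """Convert bytes to decimal petabytes (1 PB = 10**15 bytes)."""
    return num_bytes / 10**15


# The total reported on this page, converted to petabytes and rounded.
print(round(petabytes(14225047435566359), 2))  # 14.23
```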
The ia-mine binary used to generate this data is available at archive.org/download/iamine-pex/ia-mine-0.5-py3.3.pex. Other ia-mine binaries can be found at archive.org/details/iamine-pex. The other source files are available at https://gist.github.com/jjjake/161b318d9d5114051cd6
- Addeddate: 2015-03-06 06:18:05
- Identifier: ia-bak-census_20150304
- Noindex: true
- Scanner: Internet Archive Python library 0.7.7
- Year: 2015
IN COLLECTIONS
Data Collection
Uploaded by jakej
SIMILAR ITEMS (based on metadata)
Topics: Internet, Census, Carna, Botnet, Port scan
Topic: 20150304 100000
Topic: NurayAydinogluElvanCantekinArgunYumGurhanErtur
Topic: PaylinYetvartTomasyan
Topics: tech, news
Topic: IBHK