Internet Archive Census #20150304
This item contains the scripts and output from a census conducted on 2015-03-04 to determine the total size of all non-derivative files in the Internet Archive's public items.
metamgr-norm-ids-20150304205357.txt.gz is an item list containing the identifiers of all public Archive.org items as of 2015-03-04T20:53:57. The decompressed list was used as input for the following command:
./ia-mine-0.5-py3.3.pex metamgr-norm-ids-20150304205357.txt --workers 600 2>/dev/null | pv -lacbrN 'mine' | ./parallel-chunks.sh jq -c -r -f get_file_size_md.jq | pv -lacbrN 'parse' | gzip > public-file-size-md_20150304205357.json.gz
public-file-size-md_20150304205357.json.gz is a gzip-compressed, line-delimited JSON file with one record per item. Each record contains the item's "id", its "collection" (either a single string or an array of collection names), its "total_size", and selected "files" metadata. Note that "total_size" is the combined size in bytes of all non-derivative files found in the item, not the total size of the item. Each entry in "files" includes the file's "size", "name", "md5", and "format". Note: the output has been filtered to exclude items that contain access-restricted files.
An example record:
{
  "files": [
    {
      "size": 3227238,
      "format": "VBR MP3",
      "name": "Sabeeluna-al-jeehed.mp3",
      "md5": "3dba3ab6f2c077d2be399f51d0c96db2"
    },
    {
      "size": 3737995,
      "format": "VBR MP3",
      "name": "Dilon_kay_Hukmaran.mp3",
      "md5": "2e2c6df7123315c9cd22d375180ef637"
    },
    {
      "size": 3520052,
      "format": "VBR MP3",
      "name": "ShaheediJawaan.mp3",
      "md5": "dd460965fa4c49318ed7f6fd585cf973"
    },
    {
      "size": 2748544,
      "format": "VBR MP3",
      "name": "MeriAmmi.mp3",
      "md5": "ee64d30e80694015a910015a9546d925"
    },
    {
      "size": 1410576,
      "format": "VBR MP3",
      "name": "ek-sitara-tha-main.mp3",
      "md5": "f81d9aed3375f1249ee68fde5bf9be1d"
    },
    {
      "size": 1874048,
      "format": "VBR MP3",
      "name": "ab-ehd-e-ghulami.mp3",
      "md5": "153473b7747c9fd7a1b68aa1aba2d478"
    },
    {
      "size": 2435200,
      "format": "VBR MP3",
      "name": "al-madat-al-madat.mp3",
      "md5": "d6a9856207e118584740c04b18a18f71"
    },
    {
      "size": 3859017,
      "format": "VBR MP3",
      "name": "BaharoonSePehlay.mp3",
      "md5": "b32cb821ff8ae32de67990a77c665984"
    },
    {
      "size": 3883259,
      "format": "VBR MP3",
      "name": "Jeenay-Ka-haq.mp3",
      "md5": "dd29b48a92af20515bbe9885a5cc083a"
    },
    {
      "size": 3651709,
      "format": "VBR MP3",
      "name": "INTIFAZA.mp3",
      "md5": "c019efe3099d6b1b8652b74354f5ad38"
    },
    {
      "size": 781,
      "format": "Metadata",
      "name": "AansoonAurAhoon-MP3_meta.xml",
      "md5": "bc8438a5cd38a117d16e28c0d552ebb9"
    },
    {
      "size": 8831,
      "format": "Archive BitTorrent",
      "name": "AansoonAurAhoon-MP3_archive.torrent",
      "md5": "4ac3a250ef9bf94dcd3cb80616bfc31c"
    },
    {
      "size": 0,
      "format": "Metadata",
      "name": "AansoonAurAhoon-MP3_files.xml",
      "md5": "3b49eeba7c86a1c31734f4200fed8783"
    }
  ],
  "total_size": 30357250,
  "id": "Urdu-Trana-001",
  "collection": [
    "iraq_middleeast",
    "iraq_war",
    "newsandpublicaffairs"
  ]
}
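As a rough illustration of how records in this format might be consumed, the following Python sketch reads the gzip-compressed, line-delimited file one record at a time and normalizes the "collection" field, which may be either a string or an array. The function names (iter_records, collections_of, listed_file_bytes) are hypothetical helpers and are not part of ia-mine or the census scripts.

```python
import gzip
import json


def iter_records(source):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file.

    `source` may be a file path or a binary file object.
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def collections_of(record):
    """Return the record's collections as a list, whether stored as str or list."""
    value = record.get("collection")
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def listed_file_bytes(record):
    """Sum the "size" of every entry in the record's "files" list."""
    # int() tolerates sizes stored as strings (an assumption; the sample uses numbers).
    return sum(int(entry["size"]) for entry in record.get("files", []))
```

For the example record above, the thirteen listed file sizes add up to 30357250 bytes, which matches its "total_size". Streaming the file line by line keeps memory use flat even across millions of records.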
all-ids-got-sorted.txt.gz is a compressed list of identifiers of all of the items that were successfully retrieved from metamgr-norm-ids-20150304205357.txt.gz. unretrievable-items.txt is a list of identifiers of all of the items that were not retrieved successfully.
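How unretrievable-items.txt was produced is not stated here. One plausible approach, sketched below in Python, is a set difference between the requested identifiers and the retrieved ones. The helper names (clean_ids, missing_ids) are hypothetical and the method is an assumption, not necessarily the one used for this census.

```python
def clean_ids(lines):
    """Strip whitespace from each line and drop blank lines."""
    return [line.strip() for line in lines if line.strip()]


def missing_ids(requested, retrieved):
    """Return requested identifiers that are absent from retrieved.

    The order of `requested` is preserved; both arguments are iterables
    of text lines, one identifier per line.
    """
    got = set(clean_ids(retrieved))
    return [item for item in clean_ids(requested) if item not in got]
```

In practice both inputs would be read from the item list and from all-ids-got-sorted.txt.gz (for example with gzip.open in text mode), and the result compared against unretrievable-items.txt.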
The following command was then used to calculate the total size of all non-derivative files in the 14,926,080 items successfully retrieved:
gzip -dc public-file-size-md_20150304205357.json.gz | jq '.total_size' | pv -larcb | perl -nle '$sum += $_ } END { print $sum'
The total size is 14225047435566359 bytes, or 14.23 petabytes.
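The conversion from bytes to petabytes can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes the figure above uses decimal petabytes (10^15 bytes); the binary equivalent (pebibytes, 2^50 bytes) is shown for comparison.

```python
def sum_total_size(records):
    """Add up the "total_size" field across an iterable of census records."""
    return sum(int(record["total_size"]) for record in records)


def to_petabytes(num_bytes):
    """Convert bytes to decimal petabytes (1 PB = 10**15 bytes)."""
    return num_bytes / 10**15


def to_pebibytes(num_bytes):
    """Convert bytes to binary pebibytes (1 PiB = 2**50 bytes)."""
    return num_bytes / 2**50


census_total = 14225047435566359  # total reported for this census, in bytes
print(f"{to_petabytes(census_total):.2f} PB")    # 14.23 PB
print(f"{to_pebibytes(census_total):.2f} PiB")   # 12.63 PiB
```

Python integers have arbitrary precision, so the sum stays exact even though the total exceeds 2^53, the point past which double-precision floats can no longer represent every integer.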
The ia-mine binary used to generate this data is available at archive.org/download/iamine-pex/ia-mine-0.5-py3.3.pex. Other ia-mine binaries can be found at archive.org/details/iamine-pex. The other source files are available at https://gist.github.com/jjjake/161b318d9d5114051cd6
- Addeddate: 2015-03-06 06:18:05
- Identifier: ia-bak-census_20150304
- Noindex: true
- Scanner: Internet Archive Python library 0.7.7
- Year: 2015
 
In collections: Data Collection
Uploaded by jakej
SIMILAR ITEMS (based on metadata)
Topics: Internet, Census, Carna, Botnet, Port scan
Topic: 20150304 100000
Topic: NurayAydinogluElvanCantekinArgunYumGurhanErtur
Topic: PaylinYetvartTomasyan
Topics: tech, news
 Topic: IBHK