@sneakers-the-rat
Last active February 18, 2022 15:54
    Elsevier PDF "hashes"
    [
    "FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu",
    "D2ei2mgqJz9b-m.mGmPqRyLNNnwmOlt7.ywiGmt-Kndr9otqRywv8o9ePmtiNmd2Sn92Tma",
    "6U7vcmPuOn9uLnMaGyM7-nLNNntv9lt6RmtaGmweOyMmJnMmSmgmOo9eOnM6LnMaRmM-Tma",
    "lXLf8owyQztiMzwqGnMz7zcNNotb7lwf.m9qGzt6Km.qMngqLndqLo9eOotaNm96Mmt6Tma",
    "FCi27y9qOnd-Ny96GmPmOmcNNzwf-lwj-m9mGztz7ytaMnM78n9v-o9ePmM6Rm9-Qn9eTma",
    "XlEDumMz7nM7-m9iGogmRmLNNyt_8lwiKz9eGm9-Pm.v7ztiLztz_o9eOnMeQnd-Sodm",
    "lXLf8yt-JywmNmPeGm9n9n8NNzgn.lt_8zwqGogz7zgn7zt6SyPr-o9eOnM6Pot2Mn9qTma",
    "FCi27zgf8mdqMmMeGnMmMy8NNz9eQlweNy.eGmMiMm96Qmgr9nMb-o9ePmtuRmt6JotmTma",
    "FCi27nwmKnMeSodeGm.z.y8NNntz.lt-PywmGy9__ngqQmtiPmtb7o9ePmteJotyJoduTma",
    "HIoniz.qOnd-Nmt-GmteNn8NNot7.lt-QndaGnPv.mdaMmt6RnMqMo9ePmdmOmdiKod-Tma",
    "ZtV1wntuPyPn9z.qGyPv7msNNytz7lwiKyM6GntmJnt_-nteRm.mRo9eOnM6Pot2MnMyTma",
    "d2UUdywiJmtz7zt-Gm9eQmcNNzt2Qlwf7m9uGzd_7zdf7owr9yMqOo9ePmtaKnM2NmduTma",
    "tprDsnMeJn9iOnweGnPuQnsNNz.eMlt-Qm.mGotz.ytiNz.yRmd-Mo9eOnM6Pot2OmM6Tma",
    "tprDsyPiNn9iQn9-GmMiSy8NNn96Llwf9owiGowqQyMiRzwv_ngqPo9eOnM6Pot2OndyTma",
    "ZIFNOztmRotn9owiGzduNmsNNnd-Rlt_8otiGot-Oy92QnMeSyMqKo9eOnM6Pot2OntaTma",
    "D2ei2nMb_zwmSowyGzwv8mLNNotj8lt-My9yGmtaModaNm92RytySo9ePmtaKn92Qmt2Tma",
    "d2UUdot__owr-y9mGodqLocNNn.eOlwmPmtaGmgj7ndn_nMiMndiNo9ePmdiLnMmPotmTmq",
    "6U7vcmtuSndmSntqGmdiMy8NNnPz7lt_7ndeGmtv7n9eLndj_zduJo9ePmtiOntmNntmTma",
    "ZtV1wn9mMnd2MzwiGz9eRysNNmgySlt7_ot-Gy97.mgiKotqKnt_.o9eOnM6Pot2Mn96Tma",
    "XlEDuyweNmtz9ntqGm9aMocNNodr9lt__z9iGmdj_n9yNnt6Sm9-Lo9ePmd6KotmRnM2Tma",
    "HIonintn-z9uPogmGnMeSzsNNogf-lwj.z.qGmgqSn9yPndf7mdmLo9eOotuLm9aNodqTma",
    "ZlkjsyMj7mPr.ndiGowuMmcNNy.mNlwj9m.yGmtb7z.qRz.iKyt38o9eOnM6Pot2MnMeTma",
    "Dpairmdj9mPr8nwmGn.r7z8NNnMb7lwj8otiGyt-MzwuKzd__nt39o9ePmtaPotaJm9-Tma",
    "6mIUqngiNzduNn9iGmgeJnsNNot2Rlt-SzguGzt2Oodf_n.eNodz.o9eOn9mQnMqOm9e",
    "FCi27mwr_mPn-m.mGmPuKncNNmduOlweOytuGogj.yMv-z92Pyt6Mo9eOnM6Pot2Mn9yTma",
    "6U7vcngj-zt2Ln.uGodr8mcNNmdeSlweKmd2Gzdz9nM3_mgf7yt2Ro9ePmt6Sn9qLntyTma",
    "zjJBNmPn.mdiRntiGzgmPnLNNmM2Klt6JmMqGy9aNz9aMmdv_mwuNo9ePm96Qm9iRndiTma",
    "FCi27mPmRnPiKngeGngqJzcNNogj8lwj-zwiGnPiLmtb7y9qKzgeMo9eOnMeLn9aNm9m"
    ]
    import json
    import re
    from pathlib import Path
    # pyexiftool; written against a release whose execute() takes and returns bytes
    import exiftool
    paper_root = Path.home() / 'location/of/papers'
    hashes = []
    get_n = 100
    processed = 0
    # A "hash" is 40+ characters from this alphabet, wrapped as an empty XML tag
    rehash = re.compile(r'<([0-9A-Za-z_.-]{40,})/>')
    try:
        with exiftool.ExifTool() as et:
            for path in paper_root.glob('**/*.pdf'):
                # Dump the raw XMP packet of each PDF
                md = et.execute(b'-b', b'-xmp', str(path).encode('utf-8'))
                try:
                    md = md.decode('utf-8')
                except UnicodeDecodeError:
                    print(f"Couldn't decode {path}")
                    continue
                ahash = rehash.findall(md)
                hashes.extend(ahash)
                # Count only the files that contained at least one hash
                if len(ahash) > 0:
                    processed += 1
    finally:
        # Save whatever was collected, even if the scan is interrupted
        with open('elsev_hashes.json', 'w') as hashfile:
            json.dump(hashes, hashfile, indent=2)
    print(f'processed {processed} files')
    @sneakers-the-rat
    Author

    Updated after https://twitter.com/horsemankukka/status/1486268962119761924?s=20 let me know that the tags were being parsed incorrectly. Rescanned and found a few more. Also attaching the very simple code so you can check my work.

    @cbandy

    cbandy commented Jan 27, 2022

    In the few open-access PDFs I downloaded, the hashes were visible to grep, usually toward the end of the file in an XML stream:

    grep -Ena '<[^/]{50,}/>' *.pdf

    A variation on https://twitter.com/Jofkos/status/1486244612960366593.
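For anyone without grep or exiftool at hand, the same check can be sketched in plain Python. This is only a sketch: it reuses the character class and 40-character minimum from the gist's regex (not the looser `[^/]{50,}` used in the grep command), and it assumes the XML stream is stored uncompressed in the PDF, as it was in the files described above. The names `TAG`, `find_tags` and `scan_pdf` are made up for this example.

```python
import re
from pathlib import Path
from typing import List

# Byte-level version of the gist's pattern: 40+ characters from the
# alphabet seen in the hashes, wrapped as an empty XML element.
TAG = re.compile(rb'<([0-9A-Za-z_.-]{40,})/>')


def find_tags(data: bytes) -> List[str]:
    """Return every hash-like tag body found in raw bytes."""
    return [match.decode('ascii') for match in TAG.findall(data)]


def scan_pdf(path: Path) -> List[str]:
    """Read a PDF as raw bytes and search it directly, like grep -a does."""
    return find_tags(path.read_bytes())
```

Calling `scan_pdf(Path('some-paper.pdf'))` on a downloaded file returns a list of candidate hashes, or an empty list if nothing matches or the metadata stream is compressed.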

    @Aariq

    Some more examples here with associated DOIs: https://gist.github.com/Aariq/a23958e168e347f1bacf9dfa777b911f

    @rgrunbla

    rgrunbla commented Jan 30, 2022

    I managed to get hashes that are very close to each other from the same paper (https://doi.org/10.1016/j.ijhydene.2021.11.149):

    lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma
    FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma
    LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma
    w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma
    

    I put some spaces in the hashes because I think there are patterns at those positions.

    Hashes obtained later still seem very different.
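The comparison can be made mechanical with a small sketch that treats each space-separated chunk of the four hashes above as a field and reports which fields change. The field boundaries are the commenter's guess, not a known format, and the helper names (`fields`, `differing_fields`, `constant_fields`) are invented for this example.

```python
# The four spaced hashes from the comment above, copied verbatim.
SPACED = [
    "lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma",
    "FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma",
    "LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma",
    "w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma",
]


def fields(spaced):
    """Split one spaced hash into its guessed fields."""
    return spaced.split()


def differing_fields(a, b):
    """Indices of the fields that differ between two spaced hashes."""
    return [i for i, (x, y) in enumerate(zip(fields(a), fields(b))) if x != y]


def constant_fields(rows):
    """Indices of the fields that are identical in every row."""
    columns = zip(*(fields(row) for row in rows))
    return [i for i, column in enumerate(columns) if len(set(column)) == 1]


if __name__ == "__main__":
    print(differing_fields(SPACED[0], SPACED[1]))
    print(differing_fields(SPACED[2], SPACED[3]))
    print(constant_fields(SPACED))
```

For these four samples, each close pair (first/second and third/fourth) differs only in fields 0 and 11, the leading five characters and the last character of the long field before `Tma`, while fields 2, 4, 6, 8, 10 and 12 (`G`, `NN`, `l`, `G`, `o9e`, `Tma`) are identical in all four.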

    Here is some information about the files, in the same order as the hashes:

      File: 1-s2.0-S0360319921045377-main.pdf
      Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
    Device: 0,37	Inode: 1067528     Links: 1
    Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
    Access: 2022-01-29 14:19:53.072211357 +0100
    Modify: 2022-01-29 14:19:53.185217711 +0100
    Change: 2022-01-29 14:19:53.325225583 +0100
     Birth: 2022-01-29 14:19:53.072211357 +0100
      File: 1-s2.0-S0360319921045377-main(1).pdf
      Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
    Device: 0,37	Inode: 1067359     Links: 1
    Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
    Access: 2022-01-29 14:19:57.310442520 +0100
    Modify: 2022-01-29 14:19:57.493452096 +0100
    Change: 2022-01-29 14:19:57.539454503 +0100
     Birth: 2022-01-29 14:19:57.310442520 +0100
      File: 1-s2.0-S0360319921045377-main(2).pdf
      Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
    Device: 0,37	Inode: 1067360     Links: 1
    Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
    Access: 2022-01-29 14:20:04.484795768 +0100
    Modify: 2022-01-29 14:20:04.608801481 +0100
    Change: 2022-01-29 14:20:04.663804016 +0100
     Birth: 2022-01-29 14:20:04.484795768 +0100
      File: 1-s2.0-S0360319921045377-main(3).pdf
      Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
    Device: 0,37	Inode: 1067005     Links: 1
    Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
    Access: 2022-01-29 14:20:09.293007869 +0100
    Modify: 2022-01-29 14:20:09.448014381 +0100
    Change: 2022-01-29 14:20:09.492016229 +0100
     Birth: 2022-01-29 14:20:09.293007869 +0100
    

    @sneakers-the-rat
    Author

    WOW that looks like they might just be timestamps, that is LAZY on their part. I'll try to systematically sample across time and see if I can get repeating patterns or match subsections with times. I think you're right, those do seem to be independent and repeatable sections.
