[
  "FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu",
  "D2ei2mgqJz9b-m.mGmPqRyLNNnwmOlt7.ywiGmt-Kndr9otqRywv8o9ePmtiNmd2Sn92Tma",
  "6U7vcmPuOn9uLnMaGyM7-nLNNntv9lt6RmtaGmweOyMmJnMmSmgmOo9eOnM6LnMaRmM-Tma",
  "lXLf8owyQztiMzwqGnMz7zcNNotb7lwf.m9qGzt6Km.qMngqLndqLo9eOotaNm96Mmt6Tma",
  "FCi27y9qOnd-Ny96GmPmOmcNNzwf-lwj-m9mGztz7ytaMnM78n9v-o9ePmM6Rm9-Qn9eTma",
  "XlEDumMz7nM7-m9iGogmRmLNNyt_8lwiKz9eGm9-Pm.v7ztiLztz_o9eOnMeQnd-Sodm",
  "lXLf8yt-JywmNmPeGm9n9n8NNzgn.lt_8zwqGogz7zgn7zt6SyPr-o9eOnM6Pot2Mn9qTma",
  "FCi27zgf8mdqMmMeGnMmMy8NNz9eQlweNy.eGmMiMm96Qmgr9nMb-o9ePmtuRmt6JotmTma",
  "FCi27nwmKnMeSodeGm.z.y8NNntz.lt-PywmGy9__ngqQmtiPmtb7o9ePmteJotyJoduTma",
  "HIoniz.qOnd-Nmt-GmteNn8NNot7.lt-QndaGnPv.mdaMmt6RnMqMo9ePmdmOmdiKod-Tma",
  "ZtV1wntuPyPn9z.qGyPv7msNNytz7lwiKyM6GntmJnt_-nteRm.mRo9eOnM6Pot2MnMyTma",
  "d2UUdywiJmtz7zt-Gm9eQmcNNzt2Qlwf7m9uGzd_7zdf7owr9yMqOo9ePmtaKnM2NmduTma",
  "tprDsnMeJn9iOnweGnPuQnsNNz.eMlt-Qm.mGotz.ytiNz.yRmd-Mo9eOnM6Pot2OmM6Tma",
  "tprDsyPiNn9iQn9-GmMiSy8NNn96Llwf9owiGowqQyMiRzwv_ngqPo9eOnM6Pot2OndyTma",
  "ZIFNOztmRotn9owiGzduNmsNNnd-Rlt_8otiGot-Oy92QnMeSyMqKo9eOnM6Pot2OntaTma",
  "D2ei2nMb_zwmSowyGzwv8mLNNotj8lt-My9yGmtaModaNm92RytySo9ePmtaKn92Qmt2Tma",
  "d2UUdot__owr-y9mGodqLocNNn.eOlwmPmtaGmgj7ndn_nMiMndiNo9ePmdiLnMmPotmTmq",
  "6U7vcmtuSndmSntqGmdiMy8NNnPz7lt_7ndeGmtv7n9eLndj_zduJo9ePmtiOntmNntmTma",
  "ZtV1wn9mMnd2MzwiGz9eRysNNmgySlt7_ot-Gy97.mgiKotqKnt_.o9eOnM6Pot2Mn96Tma",
  "XlEDuyweNmtz9ntqGm9aMocNNodr9lt__z9iGmdj_n9yNnt6Sm9-Lo9ePmd6KotmRnM2Tma",
  "HIonintn-z9uPogmGnMeSzsNNogf-lwj.z.qGmgqSn9yPndf7mdmLo9eOotuLm9aNodqTma",
  "ZlkjsyMj7mPr.ndiGowuMmcNNy.mNlwj9m.yGmtb7z.qRz.iKyt38o9eOnM6Pot2MnMeTma",
  "Dpairmdj9mPr8nwmGn.r7z8NNnMb7lwj8otiGyt-MzwuKzd__nt39o9ePmtaPotaJm9-Tma",
  "6mIUqngiNzduNn9iGmgeJnsNNot2Rlt-SzguGzt2Oodf_n.eNodz.o9eOn9mQnMqOm9e",
  "FCi27mwr_mPn-m.mGmPuKncNNmduOlweOytuGogj.yMv-z92Pyt6Mo9eOnM6Pot2Mn9yTma",
  "6U7vcngj-zt2Ln.uGodr8mcNNmdeSlweKmd2Gzdz9nM3_mgf7yt2Ro9ePmt6Sn9qLntyTma",
  "zjJBNmPn.mdiRntiGzgmPnLNNmM2Klt6JmMqGy9aNz9aMmdv_mwuNo9ePm96Qm9iRndiTma",
  "FCi27mPmRnPiKngeGngqJzcNNogj8lwj-zwiGnPiLmtb7y9qKzgeMo9eOnMeLn9aNm9m"
]
import json
import re
from pathlib import Path

import exiftool

paper_root = Path.home() / 'location/of/papers'
hashes = []
processed = 0
# The hash shows up in the XMP packet as a self-closing tag of 40+ characters.
rehash = re.compile(r'<([0-9A-Za-z_.-]{40,})/>')

try:
    with exiftool.ExifTool() as et:
        for path in paper_root.glob('**/*.pdf'):
            # Dump the raw XMP packet; arguments and output are bytes here.
            md = et.execute(b'-b', b'-xmp', str(path).encode('utf-8'))
            try:
                md = md.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Couldn't decode {path}")
                continue
            ahash = rehash.findall(md)
            hashes.extend(ahash)
            if len(ahash) > 0:
                processed += 1
finally:
    # Save whatever was collected, even if the scan is interrupted.
    with open('elsev_hashes.json', 'w') as hashfile:
        json.dump(hashes, hashfile, indent=2)

print(f'processed {processed} files')
In the few open-access PDFs I downloaded, the hashes were visible to grep, usually toward the end of the file in an XML stream:
grep -Ena '<[^/]{50,}/>' *.pdf
A variation on https://twitter.com/Jofkos/status/1486244612960366593.
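For anyone without grep at hand, here is a rough Python sketch of the same search. It assumes the XML stream is stored uncompressed in the PDF (which is what makes it visible to grep in the first place); `find_candidate_tags` and `scan_folder` are names I made up for illustration, not part of any tool mentioned in this thread.

```python
import re
from pathlib import Path

# Same idea as the grep pattern: '<', then 50+ characters that are not '/',
# then '/>'. Newlines are excluded because grep matches line by line.
TAG = re.compile(rb'<([^/\n]{50,})/>')


def find_candidate_tags(data: bytes) -> list:
    """Return the contents of long self-closing tags found in raw PDF bytes."""
    return [m.decode('latin-1') for m in TAG.findall(data)]


def scan_folder(folder: Path) -> dict:
    """Map each PDF file name in `folder` to the candidate tags it contains."""
    results = {}
    for path in sorted(folder.glob('*.pdf')):
        tags = find_candidate_tags(path.read_bytes())
        if tags:
            results[path.name] = tags
    return results


if __name__ == '__main__':
    for name, tags in scan_folder(Path('.')).items():
        for tag in tags:
            print(f'{name}: {tag}')
```

Like the grep call, this will also report any other long self-closing tag, so the matches still need a look by eye.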
Aariq commented Jan 28, 2022
Some more examples here with associated DOIs: https://gist.github.com/Aariq/a23958e168e347f1bacf9dfa777b911f
I managed to get hashes that are very close to each other from the same paper (https://doi.org/10.1016/j.ijhydene.2021.11.149):
lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma
FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma
LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma
w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma
I inserted spaces into the hashes because I think there are patterns at those positions.
Hashes obtained later still look very different, though.
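To make the comparison less dependent on eyeballing, here is a small Python sketch that splits two hashes at the spaces I inserted and reports which fields differ. The field boundaries are only my guess at the structure, and `differing_fields` is a helper name invented for this example.

```python
def differing_fields(a: str, b: str) -> list:
    """Return the indices of whitespace-separated fields that differ."""
    fields_a, fields_b = a.split(), b.split()
    if len(fields_a) != len(fields_b):
        raise ValueError('hashes have a different number of fields')
    return [i for i, (x, y) in enumerate(zip(fields_a, fields_b)) if x != y]


# The first two hashes listed above, with the guessed field boundaries.
h1 = 'lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma'
h2 = 'FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma'

print(differing_fields(h1, h2))  # [0, 11]
```

For this pair only the first field and the next-to-last field change; everything in between is identical.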
Here is some information about the files, in the same order as the hashes:
File: 1-s2.0-S0360319921045377-main.pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067528 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:19:53.072211357 +0100
Modify: 2022-01-29 14:19:53.185217711 +0100
Change: 2022-01-29 14:19:53.325225583 +0100
Birth: 2022-01-29 14:19:53.072211357 +0100
File: 1-s2.0-S0360319921045377-main(1).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067359 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:19:57.310442520 +0100
Modify: 2022-01-29 14:19:57.493452096 +0100
Change: 2022-01-29 14:19:57.539454503 +0100
Birth: 2022-01-29 14:19:57.310442520 +0100
File: 1-s2.0-S0360319921045377-main(2).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067360 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:20:04.484795768 +0100
Modify: 2022-01-29 14:20:04.608801481 +0100
Change: 2022-01-29 14:20:04.663804016 +0100
Birth: 2022-01-29 14:20:04.484795768 +0100
File: 1-s2.0-S0360319921045377-main(3).pdf
Size: 5225391 Blocks: 7833 IO Block: 131072 regular file
Device: 0,37 Inode: 1067005 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ remy) Gid: ( 100/ users)
Access: 2022-01-29 14:20:09.293007869 +0100
Modify: 2022-01-29 14:20:09.448014381 +0100
Change: 2022-01-29 14:20:09.492016229 +0100
Birth: 2022-01-29 14:20:09.293007869 +0100
WOW, that looks like they might just be timestamps; that is LAZY on their part. I'll try to sample systematically across time and see if I can get repeating patterns or match subsections with times. I think you're right, those do seem to be independent and repeatable sections.
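As a starting point for that kind of comparison, here is a rough Python sketch that finds the character positions where every hash in a sample agrees, then collapses them into runs. It does not decode anything and makes no claim about what the fields mean; `constant_columns` and `runs` are made-up helper names, and the sample is just the four hashes quoted above with the spaces removed.

```python
def constant_columns(hashes):
    """Return the character positions at which every hash has the same character."""
    if not hashes:
        return []
    width = min(len(h) for h in hashes)  # ignore trailing characters of longer hashes
    return [i for i in range(width) if len({h[i] for h in hashes}) == 1]


def runs(indices):
    """Collapse sorted indices into inclusive (start, end) runs."""
    out = []
    for i in indices:
        if out and i == out[-1][1] + 1:
            out[-1] = (out[-1][0], i)
        else:
            out.append((i, i))
    return out


spaced = [
    'lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma',
    'FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma',
    'LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma',
    'w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma',
]
sample = [h.replace(' ', '') for h in spaced]

for start, end in runs(constant_columns(sample)):
    print(start, end, sample[0][start:end + 1])
```

With a larger sample spread over time, the runs that stay constant should be the separators, and the ones that drift slowly would be the candidates for time-derived fields.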
Updated after https://twitter.com/horsemankukka/status/1486268962119761924?s=20 let me know that the tags were being parsed incorrectly. Rescanned and found a few more. Also attaching the very simple code so you can check my work.