Social.coop @SocialCoop

More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. This is a diff between the metadata of two downloads of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.

[A list of metadata for a PDF, the important fields being two "Unknown:<long random character string>" fields that are color coded to indicate that they changed between versions.]
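
A rough sketch of the comparison in the screenshot, assuming the metadata of each download has already been read into a dict (for example from `exiftool -j`, which prints JSON). The field names and values below are made up for illustration and are not taken from a real Elsevier PDF.

```python
def diff_metadata(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    changed = {}
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            changed[key] = (a.get(key), b.get(key))
    return changed


# Hypothetical metadata from two downloads of the same paper.
download_1 = {"Title": "Some paper", "Producer": "ExampleProducer", "Unknown": "k3J9xQ"}
download_2 = {"Title": "Some paper", "Producer": "ExampleProducer", "Unknown": "Zp41mB"}

print(diff_metadata(download_1, download_2))
```

Any field that survives this diff is a candidate per-download identifier; fields that are equal across downloads are not.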

Jan 26, 2022, 08:36 AM · Mastodon Twitter Crossposter

917 boosts · 704 favorites

**jonny** @jonny · Jan 26, 2022

You can see for yourself using exiftool.
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>

To remove *all* metadata, you can use dangerzone or mat2
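
For anyone who would rather drive the two commands from Python, here is a minimal sketch. It only wraps the exact exiftool and qpdf invocations quoted above; the function names and the `_tmp.pdf` intermediate file name are assumptions for illustration, and both tools must already be installed and on PATH.

```python
import subprocess
from pathlib import Path


def strip_commands(src: Path, dst: Path) -> list[list[str]]:
    """Build the two commands from the post: exiftool first, then qpdf."""
    # Hypothetical intermediate file name, next to the final output.
    tmp = dst.with_name(dst.stem + "_tmp.pdf")
    return [
        ["exiftool", "-all:all=", str(src), "-o", str(tmp)],
        ["qpdf", "--linearize", str(tmp), str(dst)],
    ]


def strip_metadata(src: Path, dst: Path) -> None:
    """Run both commands in order; raises CalledProcessError if either fails."""
    for cmd in strip_commands(src, dst):
        subprocess.run(cmd, check=True)


# Show the commands without running them.
for cmd in strip_commands(Path("paper.pdf"), Path("paper_clean.pdf")):
    print(" ".join(cmd))
```

The intermediate file is left on disk after `strip_metadata` returns, so delete it yourself if you don't want a second copy lying around.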

**jonny** @jonny · Jan 26, 2022

Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on what domains it should be present on. Elsevier scans for PDFs with this metadata, so good idea to strip it any time you're sharing a copy.

Internet takedown programs

Elsevier partners with a technology vendor to continuously search the Internet for unauthorized posting of our book and journal content. In accordance with the Digital Millennium Copyright Act (DMCA), we issue “takedown” notices to the operators of websites hosting such unauthorized content. To complement this automated searching, Elsevier maintains online tools for staff to report an infringed work. Our partner then works to expedite reporting, investigation, and removal of specific infringing content. If you discover, or learn about pirated content online, don’t hesitate to let your contact at Elsevier know about it; he or she can use our internal systems to make sure the problem is quickly addressed.

**jonny** @jonny · Jan 26, 2022

Links:
exiftool: https://www.exiftool.org/
qpdf: https://qpdf.sourceforge.io/
dangerzone (GUI, render PDF as images, then re-OCR everything): https://dangerzone.rocks/
mat2 (render PDF as images, don't OCR): https://0xacab.org/jvoisin/mat2

**jonny** @jonny · Jan 26, 2022

here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory as described above. For mac/*nix-like computers, and you need to have qpdf and exiftool installed:
https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6

[Screenshot of code at URL in tweet: the script first uses "find" to locate the files and passes them to a while loop. It creates a clean PDF at a temporary file, removing it if one exists already. Code follows]

    # Color codes so that warnings/errors stick out
    GREEN="\e[32m"
    RED="\e[31m"
    CLEAR="\e[0m"

    # loop through all PDFs in first argument ($1),
    # or use '.' (this directory) if not given
    DIR="${1:-.}"

    echo "Cleaning PDFs in directory $DIR"

    # use find to locate files, pipe to while read to get the
    # whole line instead of space delimited
    # Note -- this will find pdfs recursively!!
    find "$DIR" -type f -name "*.pdf" | while read -r i
    do
      # output file as original filename with suffix _clean.pdf
      TMP=${i%.*}_clean.pdf

      # remove the temporary file if it already exists
      if [ -f "$TMP" ]; then
        rm "$TMP"
      fi

      exiftool -q -q -all:all= "$i" -o "$TMP"
      qpdf --linearize --replace-input "$TMP"
      echo -e $(printf "${GREEN}Processed ${RED}${i} ${CLEAR}as ${GREEN}${TMP}${CLEAR}")
    done
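
The file-finding and naming part of the script can be sketched in Python as well. This is not the gist's code: `find_pdfs` and `clean_name` are hypothetical helper names, and unlike the shell version this sketch skips files that already end in `_clean.pdf`, so a second run does not re-process its own output.

```python
import tempfile
from pathlib import Path


def clean_name(pdf: Path) -> Path:
    """Mirror the script's TMP=${i%.*}_clean.pdf naming."""
    return pdf.with_name(pdf.stem + "_clean.pdf")


def find_pdfs(root: Path) -> list[Path]:
    """Recursively list *.pdf files, skipping earlier *_clean.pdf outputs."""
    return sorted(
        p for p in root.rglob("*.pdf")
        if p.is_file() and not p.stem.endswith("_clean")
    )


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    (root / "sub").mkdir()
    for name in ("a.pdf", "sub/b.pdf", "sub/b_clean.pdf", "notes.txt"):
        (root / name).touch()
    for pdf in find_pdfs(root):
        print(pdf.relative_to(root), "->", clean_name(pdf).name)
```

Each path returned by `find_pdfs` would then be handed to exiftool and qpdf exactly as in the script.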

**jonny** @jonny · Jan 26, 2022

The metadata appears to be preserved on papers from sci-hub. since it works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured
https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19

**jonny** @jonny · Jan 26, 2022

for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
https://gist.github.com/sneakers-the-rat/6d158eb4c8836880cf03191cb5419c8f

**jonny** @jonny · Jan 26, 2022

https://twitter.com/kmagnacca/status/1486209676979032064?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

**jonny** @jonny · Jan 26, 2022

https://twitter.com/SchmiegSophie/status/1486206774159970305?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

**jonny** @jonny · Jan 26, 2022

this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
https://twitter.com/horsemankukka/status/1486268962119761924?s=20

**jonny** @jonny · Jan 26, 2022

of course there's smarter watermarking; the metadata is notable because you could scan billions of PDFs fast. this comment on HN got me thinking about a PDF /OpenAction I couldn't make sense of earlier: on open, it accesses the metadata, then does something with sizes and layout...

[top comment on HN thread]

So just take pics of the pages and convert the pics back to a PDF

[first sub-comment]

A motivated publisher could embed codes by altering in subtle ways the differences in distances or color between adjacent characters, so that they would survive most color or grey scale conversions; a seemingly innocuous frame drawn around a photo could be either larger or smaller by say one millimeter, representing de facto a bit, therefore using enough pages they could identify a book among billions. Unfortunately there's no way to be 100% sure that a complex document doesn't contain some form of embedded code.

[second sub-comment]

Easier to just strip out the metadata

I don't really know what I'm looking at so I can't really describe it. There's a top part that says "Suspicious elements: /OpenAction" and then when I list its properties there is an access to the metadata, some changes to a crop box, etc.

**jonny** @jonny · Jan 26, 2022

updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet not adding other metadata. definitely patterned, not a hash, but idk yet.
https://twitter.com/json_dirs/status/1486289288115359747?t=QwmBvbOgh2fCkjSOZSh3Fw&s=19

**jonny** @jonny · Jan 26, 2022

you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

**jonny** @jonny · Jan 27, 2022

follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that are identical under diff, i.e. no differentiating watermark (but plz check my work)

**jonny** @jonny · Jan 27, 2022

which is surprising to me, so I'm a little hesitant to make that as a general claim
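
One way to check that work, sketched with the standard library only: hash both stripped files and compare the digests. The file names and contents in the demo are stand-ins, not real PDFs, and `sha256_of` / `same_bytes` are names made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True if both files have the same SHA-256 digest."""
    return sha256_of(a) == sha256_of(b)


# Demo with placeholder bytes standing in for two stripped downloads.
with tempfile.TemporaryDirectory() as d:
    first = Path(d) / "download1_clean.pdf"
    second = Path(d) / "download2_clean.pdf"
    first.write_bytes(b"%PDF-1.7 placeholder body")
    second.write_bytes(b"%PDF-1.7 placeholder body")
    print("identical:", same_bytes(first, second))
```

Matching digests only show that these two files are byte-identical; they say nothing about other papers or about what a publisher might change later.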

**серафими многоꙮчитїи** @derwinmcgeary@octodon.social · Jan 26, 2022

@jonny they look kind of meaningful. Not base64. Any ideas what could be in there?

**jonny** @jonny · Jan 26, 2022

@derwinmcgeary
yeah, I thought so too but don't know where to start reverse engineering it :/

**jonny** @jonny · Jan 26, 2022

@derwinmcgeary
it decodes with base85, but it's not Unicode. not sure if that's meaningful
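
A small sketch of that kind of check, for anyone who wants to poke at their own tags. Which base85 variant was used in the post is not stated, so this tries both variants in Python's `base64` module; the sample tag is generated from made-up bytes and is not a real Elsevier value.

```python
import base64


def try_base85(tag: str) -> list[tuple[str, bytes, bool]]:
    """Try both stdlib base85 variants.

    Returns (variant, raw_bytes, is_utf8) for every variant that decodes.
    """
    results = []
    for name, decode in (("ascii85", base64.a85decode), ("b85", base64.b85decode)):
        try:
            raw = decode(tag)
        except ValueError:
            continue  # not valid in this alphabet
        try:
            raw.decode("utf-8")
            is_utf8 = True
        except UnicodeDecodeError:
            is_utf8 = False
        results.append((name, raw, is_utf8))
    return results


# Made-up bytes, encoded so there is something to decode in the demo.
sample_raw = bytes([0xFF, 0xFE, 0x00, 0x9C, 0x80, 0x81, 0x82, 0x83])
sample_tag = base64.a85encode(sample_raw).decode("ascii")

for variant, raw, is_utf8 in try_base85(sample_tag):
    print(variant, raw.hex(), "utf-8" if is_utf8 else "not utf-8")
```

Decoding without an error is weak evidence on its own, since many strings happen to be valid base85; the decoded bytes still need a structure that makes sense.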

**Old Tom** @beckett · Jan 26, 2022

@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.

**Seachaint** @seachaint@hackers.town · Jan 26, 2022

@beckett @jonny "PDFparanoia" was a project for exactly this - to strip identifying watermarks and metadata from shared academic PDFs. But it fell victim to the Python 2 to 3 transition and the mess of the PDF libraries in particular, and then fell to bitrot. Would be nice to see it brought back to health.

**jonny** @jonny · Jan 26, 2022

@seachaint
@beckett
yes, lives on in mat2 and I think pdfparanoia specifically redirects to dangerzone

**Seachaint** @seachaint@hackers.town · Jan 26, 2022

@jonny @beckett Oh, I missed that, thanks. :)

**prplecake** @matthew@jrgnsn.social · Jan 26, 2022

@jonny how easy is that to strip?

**jonny** @jonny · Jan 26, 2022

@matthew
pretty straightforward to get at least the top level metadata
https://social.coop/@jonny/107686442819944047

**Emma H** @emma@magicalgirl.party · Jan 27, 2022

@jonny

The cognitive load of constantly dealing with greedy corporate rentiers is exhausting.

**Stanley Black-Decker** @pleaseclap@urbanists.social · Jun 10, 2024

@jonny

It's a couple things:

a) Elsevier's vendor's tool only has to be good enough to impress Elsevier

b) Deterrence being more efficient than prevention

**Dr. Eric J. Fielding, PhD** @EricFielding@mastodon.social · Jun 11, 2024

@jonny Some other publishers render a unique PDF that has the download date, user, and institution. It is not in the metadata but a visible watermark on the edge of the pages.

**Leonardo Ferreira Fontenelle** @lffontenelle@mastodon.social · Jun 12, 2024

@EricFielding @jonny I have that in a stats book I bought from Springer

**shusha** @shusha@post.lurk.org · Jan 27, 2022

@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!

**jonny** @jonny · Jan 27, 2022

@shusha
yes definitely, love it and spend basically all my time reading it nowadays

**robryk** @robryk@qoto.org · Jan 27, 2022

@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)

**jonny** @jonny · Jan 27, 2022

@robryk
I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical

**robryk** @robryk@qoto.org · Jan 27, 2022

@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).

**jonny** @jonny · Jan 27, 2022

@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)

**Advanced Persistent Teapot** @http_error_418@hachyderm.io · Jun 10, 2024

@jonny hmmm, alternatively start inserting copies of that metadata into blank or template PDFs. Send 'em chasing wild geese and make them look at Lorem Ipsum 50 times a day

**Thai Thien** @ThienThaiMeow@masto.ai · Jun 10, 2024

@http_error_418 @jonny Very well.
Do you have code to insert those codes? I would like to help.
Btw, you can use mathgen https://thatsmathematics.com/mathgen/ to generate meaningless papers to upload to scribd

**Skyflare** @skyflare@fosstodon.org · Jun 11, 2024

@jonny Aren't these the same guys who charge you (or your institution) a sh*tload of money to publish these works in the first place?

**‍fuchsiaaaaaaaaaaaaaaaaa** @f0x@pixie.town · Jan 26, 2022

@jonny word of caution is that while removing exif is good, knowing publishers there's a bunch of other ways they'd directly include such trackers into the file, in a less human/machine readable spot than EXIF. so be careful

**marius851000** @marius851000@framapiaf.org · Jan 26, 2022

@f0x
I suspect it would be a good idea to compare two PDFs from two different sources. If the hashes match, it's all good. If they don't, strip the EXIF. If they still don't match... find the difference somehow.
@jonny
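
The procedure in this reply can be sketched as a small triage function. The `scrub` argument stands for whatever stripping step you use (for example the exiftool + qpdf pipeline from earlier in the thread); `fake_scrub` and the byte strings below are invented placeholders so the sketch runs on its own.

```python
import re
from typing import Callable


def triage(a: bytes, b: bytes, scrub: Callable[[bytes], bytes]) -> str:
    """Compare two copies, then compare again after scrubbing."""
    if a == b:
        return "identical as downloaded"
    a_clean, b_clean = scrub(a), scrub(b)
    if a_clean == b_clean:
        return "identical after scrubbing"
    # Still different: report where the first mismatch is, as a starting point.
    limit = min(len(a_clean), len(b_clean))
    offset = next((i for i in range(limit) if a_clean[i] != b_clean[i]), limit)
    return f"still differ, first difference at byte {offset}"


def fake_scrub(data: bytes) -> bytes:
    """Placeholder scrubber: drops a made-up 'ID=...;' field."""
    return re.sub(rb"ID=[^;]*;", b"", data)


copy_1 = b"HEADER;ID=abc123;BODY"
copy_2 = b"HEADER;ID=zzz999;BODY"
copy_3 = b"HEADER;ID=qqq000;B0DY"

print(triage(copy_1, copy_1, fake_scrub))
print(triage(copy_1, copy_2, fake_scrub))
print(triage(copy_1, copy_3, fake_scrub))
```

The third case is the interesting one: a difference that survives scrubbing means something other than the stripped fields varies between copies.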

**yetzt** @yetzt@vis.social · Jun 10, 2024

@jonny

meme, bad idea: "removing hashes", good idea: "replacing them with longer hashes that cause buffer overflows in their scrapers"

**Wojtek Sychut** @wojtek@fedi.sysartist.com · Jun 11, 2024

@jonny (1/2) Why remove metadata if you could overwrite crucial parts with some similarly looking crap?

We use computationally expensive hashes and ciphers not to punish users for mistyped passwords with delaying next attempt, but because it transfers the brute forcing costs to attackers.

Don't make it immediately obvious to bad actors that their methods have been figured out

**Wojtek Sychut** @wojtek@fedi.sysartist.com · Jun 11, 2024 *

@jonny (2/2) Plant mindfucks - garbage hashes meeting original format, randomized dates (reasonable timeframe - not earlier than potential earliest records, not earlier timestamp than given document version etc), and any kind of poisoning that will make the (meta)data unobviously worthless.
Use ethical methods to fuck with unlawful actors. You won't beat the bigger guy using the same weapon.

**jonny** @jonny · Jun 11, 2024

@wojtek
Be my guest!

**Wojtek Sychut** @wojtek@fedi.sysartist.com · Jun 11, 2024

@jonny Well, this is the first time I read the name "Elsevier" and I can't ask you to throw any copyright-protected files my way. Treat it as a hint to improve your script :-)

**Zudlig Ravel Annon** @zudlig@expired.mentality.rip · Jan 26, 2022

@jonny I often worry momentarily about that sort of thing happening whenever I download an image or a pdf from the net, even if I'm not planning to share it. I thought I was being too paranoid.

**Stewart Russell** @scruss@mastodon.social · Jan 26, 2022

@jonny they're almost getting to the level of ISO standards for metadata f'wittery.

For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up any screen reading, though

**Nice Micro** @nicemicro@distrotoot.com · Jan 26, 2022

@jonny Adding unique identifiers on stuff you distribute to be able to trace where it gets copied is hardly a new thing, and I don't think it is good terminology to call it "surveillance". As the hash is a passive part of the document, it is not used (and possibly can't be used) to spy on you.
I don't think it is productive to call this practice "surveillance", as it just makes it more difficult for the readers to differentiate between levels of threats to their privacy.

**Orca | | |** @Orca@nya.one · Jan 26, 2022

@jonny@social.coop seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free

**KawaiiPunk** @kawaiipunk@sunbeam.city · Jan 26, 2022

@jonny this is the same technique that was being used in the OS designed in North Korea called Red Star OS. It was in the Chaos Congress talk about it.

**jonny** @jonny · Jan 27, 2022

@kawaiipunk
interesting ... will take a look.

**neo** @neo@pl.comfysnug.space · Jan 31, 2022

@jonny is there any easy way to modify this?

**neo** @neo@pl.comfysnug.space · Jan 31, 2022

@jonny ignore me im retarded and only saw one post.

**Gord** @grs@infosec.exchange · Jun 10, 2024

@jonny they really are a right bunch of bastards, aren’t they?

**Lyle Solla-Yates** @Lyle@cville.online · Jun 10, 2024

@jonny this makes me think some horrible things are going to happen to students because of this but I can’t quickly think of an example

**person** @cmkobel@genomic.social · Jun 10, 2024

@jonny But what meaningful data can they attach to that unique ID? The IP address? Assume a user is not logged in, has cleared tracking cookies and is using a VPN.
Wait a sec. That is why we need open access.

**jonny** @jonny · Jun 10, 2024

@cmkobel
Browser fingerprinting is p robust.
https://www.amiunique.org/fingerprint
And even those mitigations won't be taken by 99.999% of visitors.

**James Hawley, PhD** @jrhawley@scholar.social · Jun 10, 2024

@jonny You know how AAAS and T&F have some "Preparing your PDF" loading screen when you click "Download article"? I'm pretty sure that's what's happening, on the fly.

You don't need those loading pages for static assets, like PDFs. You would want loading screens like that, though, to smooth over the creation of a unique PDF with identifiable metadata in it every time someone clicks that button.

Good advice on how to view and/or remove the hash
