jonny

More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*, this is a diff between metadata from two of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.

You can see for yourself using exiftool.
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>

To remove *all* metadata, you can use dangerzone or mat2.
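If you don't have exiftool installed, a rough pure-Python sketch can peek at the same data; this assumes the identifiers live in a plain-text XMP packet (which is how these tags show up), and the function name is just illustrative:

```python
import re

def extract_xmp(pdf_bytes):
    """Return the XMP metadata packet embedded in a PDF, or None.

    XMP metadata is stored as a plain-text stream delimited by
    <?xpacket begin ...?> / <?xpacket end ...?> markers, so a byte-level
    regex is enough to peek at it without a full PDF parser.
    """
    m = re.search(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
                  pdf_bytes, re.DOTALL)
    return m.group(1).decode("utf-8", errors="replace") if m else None

# with open("paper.pdf", "rb") as f:   # "paper.pdf" is a placeholder
#     print(extract_xmp(f.read()))
```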

Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on which domains it should be present on. Elsevier scans for PDFs with this metadata, so it's a good idea to strip it any time you're sharing a copy.

here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory as described above. For mac/*nix-like computers, and you need to have qpdf and exiftool installed:
gist.github.com/sneakers-the-r

The metadata appears to be preserved on papers from sci-hub. Since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured.
twitter.com/json_dirs/status/1


for any security researchers out there, here are a few more "hashes" that some have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace, so there is a bit more structure to them than in the OP:
gist.github.com/sneakers-the-r


this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep`)
will follow up with dataset tomorrow.
twitter.com/horsemankukka/stat


of course there's smarter watermarking; the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about a PDF /OpenAction I couldn't make sense of earlier: on open, it accesses metadata, so it may be doing something with sizes and layout...

updated the above gist with correctly extracted tags, and included python code to extract your own; feel free to add them in the comments. since we don't know what they contain yet, I'm not adding other metadata. definitely patterned, not a hash, but idk yet.
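one quick way to back up the "patterned, not a hash" hunch: measure the byte entropy of a suspect string. output of a cryptographic hash should look near-uniform (close to 8 bits/byte); structured identifiers score much lower. a sketch, with made-up sample strings rather than real Elsevier tags:

```python
import math
from collections import Counter

def entropy_per_byte(data):
    """Shannon entropy in bits per byte: ~8 for uniform random bytes,
    much lower for repetitive or patterned text."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

patterned = b"ABABABABABABABABABAB"  # stand-in for a structured identifier
uniform = bytes(range(256))          # every byte value once: maximal spread
print(entropy_per_byte(patterned))   # 1.0 — only two symbols
print(entropy_per_byte(uniform))     # 8.0 — what hash output looks like
```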
twitter.com/json_dirs/status/1


you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that diff as identical, i.e. no differentiating watermark (but plz check my work)

which is surprising to me, so I'm a little hesitant to make that as a general claim
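for anyone wanting to repeat the check, a minimal sketch of the comparison step in Python, assuming you've already produced the two stripped, relinearized files (the filenames are placeholders):

```python
import hashlib

def sha256_of(path):
    """Hash a file in chunks so large PDFs don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# After exiftool stripping + `qpdf --linearize --deterministic-id`,
# the two downloads should hash identically if no other watermark remains:
# sha256_of("copy1.stripped.pdf") == sha256_of("copy2.stripped.pdf")
```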

@jonny they look kind of meaningful. Not base64. Any ideas what could be in there?

@derwinmcgeary
yeah, I thought so too but don't know where to start reverse engineering it :/

@derwinmcgeary
it decodes with base85, but it's not Unicode. not sure if that's meaningful
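for anyone else poking at this: Python's stdlib has both common base85 variants (Ascii85, which PDF streams themselves use, and the RFC 1924 flavor), so a small helper can try both; which alphabet these strings actually use is still an open question:

```python
import base64

def try_base85(s):
    """Attempt both common base85 alphabets; return raw bytes from
    whichever ones parse. The result may well not be valid Unicode."""
    results = {}
    for name, decode in (("ascii85", base64.a85decode),
                         ("b85", base64.b85decode)):
        try:
            results[name] = decode(s)
        except ValueError:
            pass  # string isn't valid in this alphabet
    return results
```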

@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.

@beckett @jonny "PDFparanoia" was a project for exactly this - to strip identifying watermarks and metadata from shared academic PDFs. But it fell victim to the Python 2 to 3 transition and the mess of the PDF libraries in particular, and then fell to bitrot. Would be nice to see it brought back to health.

@seachaint
@beckett
yes, lives on in mat2 and I think pdfparanoia specifically redirects to dangerzone

@matthew
pretty straightforward to get at least the top level metadata
social.coop/@jonny/10768644281

@jonny

The cognitive load of constantly dealing with greedy corporate rentiers is exhausting.

@jonny

It's a couple things:

a) Elsevier's vendor's tool only has to be good enough to impress Elsevier

b) Deterrence being more efficient than prevention

@jonny Some other publishers render a unique PDF that has the download date, user, and institution. It is not in the metadata but a visible watermark on the edge of the pages.

@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!

@shusha
yes definitely, love it and spend basically all my time reading it nowadays ❤️

@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)

@robryk
I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical

@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).

@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)

@jonny hmmm, alternatively start inserting copies of that metadata into blank or template PDFs. Send 'em chasing wild geese and make them look at Lorem Ipsum 50 times a day 😈

@http_error_418 @jonny Very well.
Do you have the code to insert that metadata? I would like to help.
Btw, you can use mathgen thatsmathematics.com/mathgen/ to make meaningless papers to upload to scribd


@jonny Aren't these the same guys who charges you (or your institution) a sh*tload of money to publish these works in the first place?

@jonny a word of caution: while removing EXIF is good, knowing publishers, there are a bunch of other ways they could directly embed such trackers in the file, in a less human/machine-readable spot than EXIF. so be careful

@f0x
I suspect it would be a good idea to compare two PDFs from two different sources. If the hashes match, it's all good. If they don't, strip the EXIF. If they still don't match... find the difference somehow.
@jonny

@jonny (1/2) Why remove metadata if you could overwrite crucial parts with some similarly looking crap?

We use computationally expensive hashes and ciphers not to punish users for mistyped passwords with delaying next attempt, but because it transfers the brute forcing costs to attackers.

Don't make it immediately obvious to bad actors that their methods have been figured out

@jonny (2/2) Plant mindfucks - garbage hashes matching the original format, randomized dates (within a reasonable timeframe: not earlier than the earliest potential records, no timestamp earlier than the given document version, etc.), and any kind of poisoning that will make the (meta)data unobviously worthless.
Use ethical methods to fuck with unlawful actors. You won't beat the bigger guy using the same weapon.
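a sketch of what that kind of poisoning could look like, purely illustrative (no claim this defeats any particular watermark, and the sample identifier below is made up): generate a decoy with the same character classes as the original, plus a random timestamp inside a believable window:

```python
import random
import string
from datetime import timedelta

def decoy_like(original):
    """Swap every character for a random one of the same class,
    keeping the overall shape so the format still looks plausible."""
    out = []
    for ch in original:
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            pick = random.choice(string.ascii_lowercase)
            out.append(pick.upper() if ch.isupper() else pick)
        else:
            out.append(ch)  # keep punctuation/structure intact
    return "".join(out)

def plausible_date(not_before, not_after):
    """Random timestamp inside a believable window (e.g. after the
    document's version-of-record date, before today)."""
    span = (not_after - not_before).total_seconds()
    return not_before + timedelta(seconds=random.uniform(0, span))

print(decoy_like("S0001-2345(21)00678-9"))  # same shape, different garbage
```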

@jonny Well, this is the first time I read the name "Elsevier" and I can't ask you to throw any copyright-protected files my way. Treat it as a hint to improve your script :-)

@jonny I often worry momentarily about that sort of thing happening whenever I download an image or a pdf from the net, even if I'm not planning to share it. I thought I was being too paranoid.

@jonny they're almost getting to the level of ISO standards for metadata f'wittery.

For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up any screen reading, though

@jonny Adding unique identifiers on stuff you distribute to be able to trace where it gets copied is hardly a new thing, and I don't think it is good terminology to call it "surveillance". As the hash is a passive part of the document, it is not used (and possibly can't be used) to spy on you.
I don't think it is productive to call this practice "surveillance", as it just makes it more difficult for readers to differentiate between levels of threats to their privacy.

@jonny@social.coop seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free 🤨

@jonny this is the same technique that was used in Red Star OS, the operating system designed in North Korea. It was covered in the Chaos Communication Congress talk about it.

@kawaiipunk
interesting ... will take a look.

@jonny is there any easy way to modify this?
@jonny ignore me im retarded and only saw one post.

@jonny they really are a right bunch of bastards, aren’t they?

@jonny this makes me think some horrible things are going to happen to students because of this but I can’t quickly think of an example

@jonny But what meaningful data can they attach to that unique ID? The IP address? Assume a user is not logged in, has cleared tracking cookies and is using a VPN.
Wait a sec. That is why we need open access.

@jonny You know how AAAS and T&F have some "Preparing your PDF" loading screen when you click "Download article"? I'm pretty sure that's what's happening, on the fly.

You don't need those loading pages for static assets, like PDFs. You would want loading screens like that, though, to smooth over the creation of a unique PDF with identifiable metadata in it every time someone clicks that button.

Good advice on how to view and/or remove the hash