PDF Forgeries Are Surprisingly Rare

Gwern

Gwern, 2022 (computer security, scientific bias)

One kind of fraud is striking in its absence online: tampered or forged PDFs. People create malicious videos, photos, chat logs, and Microsoft Word documents all the time to scam and propagandize, and they publish entire new PDFs full of garbage science, but they don’t edit existing PDFs.

Once in a while I see someone object to a paper I host on Gwern.net by saying “but that’s not on a real journal website! it’s not peer-reviewed! it’s just some random asshole’s personal website! You can’t believe that!” Aside from bemusement over people not believing PDFs can exist elsewhere (do they not understand the idea of “files”? Apparently many young people struggle with it. Or are they just extremely eager to obey copyright law & can’t even imagine just copying a PDF?), it’s interesting how well trusting such copies actually works.

There is plenty of incompetence, fraud, and malice online, often in PDFs… but only in new PDFs. I can’t think of a single fraud accomplished by editing a real PDF & just uploading it for Google Scholar etc. to index, nor of a case where I’ve been burned by even a mislabeled one.

You can just search for a paper title, download it, and trust that ~100% of the time you are getting what you thought you were getting. The main caveat is that you may be downloading the author’s draft or a preprint rather than the finalized version (particularly in economics, where papers might go through many preprints, sometimes changing the results substantially along the way, and take up to a decade to reach final publication). And when you do find a PDF pushing something malicious, like one claiming to use statistics to show that ‘Trump won the 2020 US presidential election’, it’s always a ‘new’ PDF, which is forthright about being a new unpublished ‘white paper’ or somesuch, and doesn’t purport to be a published paper. Or if it was a forged or edited document, it was usually clearly exported from Microsoft Word or another word processor (eg. all the forgeries exposed by anachronistic use of the Calibri font).

Whereas, if you were so epistemically careless with images on, say, Facebook, you would wind up with a folder stuffed full of lying images which have been Photoshopped, claimed to be things other than what they are, ‘deep faked’, etc.

PDF forgery is striking because it’d be so easy to do: find a useful research paper, edit it in any of the many PDF utilities, upload it anywhere, wait for people to copy it (as they do), then take down yours; now you have an authoritative peer-reviewed research paper floating around the Internet with no links to you: the perfect crime. (Or better yet: upload it directly to Libgen/Sci-Hub and let everyone else redistribute it.)

Given how rarely people check the original papers, and how retracted studies like the Wakefield autism/vaccine study or blatant propaganda like Operation Denver documents will circulate among the epistemically-lazy indefinitely, a not-too-blatant forgery could stay in widespread circulation for a long time before anyone notices. (In the rare cases anyone tampers with PDFs, it quickly turns into a technical morass and he-said-she-said, and requires several orders of magnitude more effort to prove than to do; consider the hoops experts had to jump through in the Craig Wright cases, or even just a landlord editing a contract, where the contract was done through a digital document timestamping service!)
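To make the asymmetry concrete, here is a minimal sketch of the one check that is cheap, assuming (and it is a big assumption) that someone recorded a trusted SHA-256 digest of the original file, as a timestamping service might. What the service in the landlord case actually did is not stated here; the function names and the toy "contract" bytes are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def matches_recorded_digest(data: bytes, recorded_hex: str) -> bool:
    """True only if the bytes are identical to what was originally recorded."""
    return sha256_hex(data) == recorded_hex.lower()

# Toy stand-ins for two copies of a contract (not real PDF files).
original = b"%PDF-1.7 ... Monthly rent: 1000 ..."
tampered = b"%PDF-1.7 ... Monthly rent: 1900 ..."

# Digest assumed to have been recorded at signing time.
recorded = sha256_hex(original)

print(matches_recorded_digest(original, recorded))
print(matches_recorded_digest(tampered, recorded))
```

A mismatch only says that the bytes differ. It does not say who changed them, what changed, or whether the file was merely re-saved by another program, which is roughly where the he-said-she-said begins.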

And it’s not as if there are no zealots or fanatics or malefactors willing to do so: historically, scribes tampered with documents all the time! (“Written by Confucius” or “apropos of nothing, now I, Josephus the Jew, will tell you how wonderful Jesus Christ was”…)

Why can you just download PDFs off any random asshole’s website (like mine)?

Because there’s no Photoshop for PDFs, maybe? Places like arXiv provide TeX sources, but that’s still a dark art for most would-be forgers and fanatics and con artists. PDFs are not necessarily hard to edit, but editing PDFs is not part of a normal workflow for most people: even little kids will edit photos for social media, but outside big corporations & designers, PDFs are strictly “write-only” formats. Your document system compiles source documents to PDFs, and you never edit the PDF, only the source documents. (It is somewhat analogous to the compiled binary of a program: a PDF is focused on laying out, pixel by pixel, how a printed document should look; it may not even contain the original text, much less any of the structure. Just as there are hackers who specialize in understanding and changing raw binary computer code, there are people who specialize in editing PDFs… but not many.)
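A rough illustration of the "compiled binary" point, as a sketch rather than a real PDF parser: page content in a PDF is typically stored as drawing operators inside a compressed stream (the FlateDecode filter is zlib), so the words on the page do not appear as plain bytes in the file. The fragment below is a hand-built, hypothetical stream object, not a complete PDF, and the helper names are mine.

```python
import re
import zlib

# Hypothetical page content: positioning and drawing operators, not prose.
PAGE_OPS = b"\n".join(
    b"BT /F1 12 Tf 72 %d Td (The treatment had no effect.) Tj ET" % y
    for y in (720, 700, 680)
)

def make_stream_object(ops: bytes) -> bytes:
    """Wrap drawing operators in a Flate-compressed PDF stream object."""
    body = zlib.compress(ops)
    head = b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(body)
    return head + body + b"\nendstream\nendobj\n"

# Matches a stream dictionary's /Length entry and the 'stream' keyword after it.
STREAM_HEADER = re.compile(rb"/Length (\d+)[^>]*>>\s*stream\r?\n")

def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed bodies of every zlib-compressed stream found."""
    bodies = []
    for match in STREAM_HEADER.finditer(pdf_bytes):
        length = int(match.group(1))
        raw = pdf_bytes[match.end():match.end() + length]
        try:
            bodies.append(zlib.decompress(raw))
        except zlib.error:
            continue  # not Flate data (or uses a filter this sketch ignores)
    return bodies

fragment = make_stream_object(PAGE_OPS)
needle = b"no effect"

print(needle in fragment)                      # search the raw file bytes
print(needle in inflate_streams(fragment)[0])  # search the inflated stream
```

Searching the raw bytes misses the sentence; it only turns up after inflating the stream. Even then, an editor faces positioned text fragments rather than paragraphs, and real files add further layers this sketch ignores (embedded font encodings, object streams, scanned page images), which is part of why casual editing is unappealing.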

This is enough to push malefactors into other approaches. After all, if editing photos can work so well, why bother with the much harder editing of PDFs?

As the joke goes, PDFs don’t need to outrun the (Russian?) bear, they just need to outrun the other format.

