PDF Forgeries Are Surprisingly Rare

One kind of fraud is striking in its absence online: tampered or forged PDFs. People create malicious videos, photos, chat logs, and Microsoft Word documents all the time to scam and propagandize people, or publish entire PDFs full of garbage science, but they don’t edit existing PDFs.


Once in a while I see someone object to a paper I host on Gwern.net by saying “but that’s not on a real journal website! it’s not peer-reviewed! it’s just some random asshole’s personal website! You can’t believe that!” Aside from bemusement over people not believing PDFs can exist elsewhere (do they not understand the idea of “files”? Apparently many young people struggle with it. Or are they just extremely eager to obey copyright law & can’t even imagine just copying a PDF?), it’s interesting how well trusting such copies actually works.

There is plenty of incompetence, fraud, and malice online, often in PDFs… but only new PDFs. I can’t think of a single fraud accomplished by editing a real PDF & just uploading it for Google Scholar etc. to index, nor of a case where I’ve been burned by even a mislabeled PDF.

You can just search for a paper title, download it, and trust that ~100% of the time, you are getting what you thought you were getting, with the main caveat being that you may be downloading the author’s draft or a preprint and not the finalized version (particularly in economics, where papers might go through many preprints, sometimes changing the results substantially along the way, and take anywhere up to a decade to reach final publication). And when you do find a PDF claiming something malicious, like claiming to use statistics to show that ‘Trump won the 2020 US presidential election’, it’s always a ‘new’ PDF, which is forthright about it being a new unpublished ‘white paper’ or somesuch, and doesn’t purport to be a published paper. Or if it was a forged or edited document, it was usually clearly exported from Microsoft Word or another word processor (eg. all the forgeries exposed by anachronistic use of the Calibri font).

Whereas, if you were so epistemically careless with images on, say, Facebook, you would wind up with a folder stuffed full of lying images which have been Photoshopped, claimed to be things other than what they are, ‘deep faked’, etc.


PDF forgery is striking because it’d be so easy to do: find a useful research paper, edit it in any of the many PDF utilities, upload anywhere, wait for people to copy it (as they do), then take down yours; now you have an authoritative peer-reviewed research paper floating around the Internet with no links to you, in the perfect crime. (Or better yet: upload it directly to Libgen/Sci-Hub and let everyone else redistribute it.)

Given how rarely people check the original papers, and how retracted studies like the Wakefield autism/vaccine study or blatant propaganda like Operation Denver documents will circulate among the epistemically-lazy indefinitely, a not-too-blatant forgery could circulate widely for a long time before anyone noticed. (In the rare cases anyone tampers with PDFs, it quickly turns into a technical morass and he-said-she-said, and requires several orders of magnitude more effort to prove than to do; consider the hoops experts had to jump through in the Craig Wright cases, or even just a landlord editing a contract, where the contract was done through a digital document timestamping service!)

And it’s not as if there are no zealots or fanatics or malefactors willing to do so: historically, scribes tampered with documents all the time! (“Written by Confucius” or “apropos of nothing, now I, Josephus the Jew, will tell you how wonderful Jesus Christ was”…)

Why can you just download PDFs off any random asshole’s website (like mine)?


Because there’s no Photoshop for PDFs, maybe? Places like arXiv provide TeX sources, but that’s still a dark art for most would-be forgers and fanatics and con artists. PDFs are not necessarily hard to edit, but editing PDFs is not part of a normal workflow for most people: even little kids will edit photos for social media, but outside big corporations & designers, PDFs are a strictly “write-only” format. Your document system compiles source documents to PDFs, and you never edit the PDF, only the source documents. (It is somewhat analogous to the compiled binary of a program: a PDF is focused on laying out, pixel by pixel, how a printed document should look; it may not even contain the original text, much less any of the structure. Just as there are hackers who specialize in understanding and changing raw binary computer code, there are people who specialize in editing PDFs… but not many.)

This is enough to push malefactors into other approaches. After all, if editing photos can work so well, why bother with the much harder editing of PDFs?
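To make the "compiled binary" analogy a bit more concrete, here is a minimal Python sketch of why a casual find-and-replace tends to fail on a PDF. It assumes a hypothetical, hand-written page content stream (the `content` bytes and the helper names `naive_edit` and `stream_edit` are mine, for illustration only) and uses `zlib` to stand in for the FlateDecode compression that PDF writers commonly apply to such streams. It is not a PDF parser, just a toy showing that the words on the page are usually not sitting in the file as plain text.

```python
import zlib

# Hypothetical page content stream: PDF drawing operators, not prose.
# BT/ET bracket a text object, Tf picks a font and size, Td moves the
# text position, and Tj paints the string in parentheses.
content = (
    b"BT /F1 12 Tf 72 720 Td (The treatment group improved.) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (The control group did not improve.) Tj ET\n"
)

# Roughly what lands in the file when the stream uses the FlateDecode filter.
stored = zlib.compress(content)


def naive_edit(raw: bytes, old: bytes, new: bytes) -> bytes:
    """Search-and-replace directly on the bytes as stored in the file."""
    return raw.replace(old, new)


def stream_edit(raw: bytes, old: bytes, new: bytes) -> bytes:
    """Decompress the stream, edit the operators, and recompress."""
    return zlib.compress(zlib.decompress(raw).replace(old, new))


# The word is not visible in the stored bytes, so the naive edit does nothing.
naive = naive_edit(stored, b"improved", b"worsened")
print("naive edit changed anything:", naive != stored)

# Editing the decompressed stream works, but the stored length may change.
forged = stream_edit(stored, b"improved", b"worsened")
print("forged text present:", b"worsened" in zlib.decompress(forged))
print("stored lengths:", len(stored), "->", len(forged))
```

Even this toy understates the work: in a real file the stream's `/Length` entry and the byte offsets in the cross-reference table have to stay consistent after the edit, and the painted strings are often encoded per-font rather than as readable ASCII. None of that is insurmountable, but it is a different kind of chore from opening a photo in an image editor.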

As the joke goes, PDFs don’t need to outrun the (Russian?) bear, they just need to outrun the other format.