
Internet Search Tips

A de­scrip­tion of ad­vanced tips and tricks for ef­fec­tive In­ter­net re­search of pa­pers/books, with real-world ex­am­ples.

Over time, I de­vel­oped a cer­tain google-fu and ex­per­tise in find­ing ref­er­ences, pa­pers, and books on­line. I start with the stan­dard tricks like Boolean queries and key­board short­cuts, and go through the flow­chart for how to search, mod­ify searches for hard tar­gets, pen­e­trate pay­walls, re­quest jail­breaks, scan books, mon­i­tor top­ics, and host doc­u­ments. Some of these tricks are not well-known, like check­ing the In­ter­net Archive (IA) for books.

I try to write down my search work­flow, and give gen­eral ad­vice about find­ing and host­ing doc­u­ments, with demon­stra­tion case stud­ies.

Screenshot of Google Scholar search results, with an arrow pointing to the desirable fulltext link in these results, which many users are unaware of.

Google-fu search skill is something I have prided myself on ever since elementary school, when the librarian challenged the class to find things in the almanac; not infrequently, I'd win. And I can still remember the exact moment it dawned on me in high school that much of the rest of my life would be spent dealing with searches, paywalls, and broken links. The Internet is the greatest almanac of all, and to the curious, a never-ending cornucopia, so I am sad to see many fail to find things after a cursory search—or not look at all. For most people, if it's not the first hit in Google/Google Scholar, it doesn't exist. Below, I reveal my best Internet search tricks and try to provide a rough flowchart of how to go about an online search, explaining the subtle tricks and tacit knowledge of search-fu.

Roughly, we need to have proper tools to cre­ate an oc­ca­sion for a search: we can­not search well if we avoid search­ing at all. Then each search will dif­fer by which search en­gine & type of medium we are search­ing—they all have their own quirks, blind spots, and ways to mod­ify a failed search. Often, we will run into walls, each of which has its own cir­cum­ven­tion meth­ods. But once we have found some­thing, we are not done: we would often be fool­ish & short-sighted if we did not then make sure it stayed found. Fi­nally, we might be in­ter­ested in ad­vanced top­ics like en­sur­ing in ad­vance re­sources can be found in the fu­ture if need be, or learn­ing about new things we might want to then go find. To il­lus­trate the over­all work­flow & pro­vide ex­am­ples of tacit knowl­edge, I in­clude many In­ter­net case stud­ies of find­ing hard-to-find things.

Papers

Request

Human flesh search en­gine. Last re­sort: if none of this works, there are a few places on­line you can re­quest a copy (how­ever, they will usu­ally fail if you have ex­hausted all pre­vi­ous av­enues):

Finally, you can always try to contact the author. This only occasionally works for the papers I have the hardest time with, since they tend to be old ones where the author is dead or unreachable—any author publishing a paper since 1990 will usually have been digitized somewhere—but it's easy to try.

Post-Finding

After find­ing a full­text copy, you should find a re­li­able long-term link/place to store it and make it more find­able (re­mem­ber—if it’s not in Google/Google Scholar, it doesn’t exist!):

  • Never Link Un­re­li­able Hosts:

    • LG/SH: Al­ways op­er­ate under the as­sump­tion they could be gone to­mor­row. (As my uncle found out with Li­brary.nu shortly after pay­ing for a life­time mem­ber­ship!) There are no guar­an­tees ei­ther one will be around for long under their legal as­saults or the behind-the-scenes dra­mas, and no guar­an­tee that they are being prop­erly mir­rored or will be re­stored else­where.

      When in doubt, make a copy. Disk space is cheaper every day. Down­load any­thing you need and keep a copy of it your­self and, ide­ally, host it pub­licly.

    • NBER: never rely on a papers.nber.org/tmp/ or psycnet.apa.org URL, as they are tem­po­rary. (SSRN is also un­de­sir­able due to mak­ing it in­creas­ingly dif­fi­cult to down­load, but it is at least re­li­able.)

    • Scribd: never link Scribd—they are a scummy web­site which im­pede down­loads, and any­thing on Scribd usu­ally first ap­peared else­where any­way. (In fact, if you run into any­thing vaguely useful-looking which ex­ists only on Scribd, you’ll do hu­man­ity a ser­vice if you copy it else­where just in case.)

    • RG: avoid link­ing to Re­search­Gate (com­pro­mised by new own­er­ship & PDFs get deleted rou­tinely, ap­par­ently often by au­thors) or Academia.edu (the URLs are one-time and break)

    • high-impact jour­nals: be care­ful link­ing to Na­ture.com or Cell (if a paper is not ex­plic­itly marked as Open Ac­cess, even if it’s avail­able, it may dis­ap­pear in a few months!⁠⁠14⁠); sim­i­larly, watch out for wiley.com, tandfonline.com, jstor.org, springer.com, springerlink.com, & mendeley.com, who pull sim­i­lar shenani­gans.

    • ~/: be care­ful link­ing to aca­d­e­mic per­sonal di­rec­to­ries on uni­ver­sity web­sites (often no­tice­able by the Unix con­ven­tion .edu/~user/ or by di­rec­to­ries sug­ges­tive of ephemeral host­ing, like .edu/cs/course112/readings/foo.pdf); they have short half-lives.

    • ?token=: be­ware any PDF URL with a lot of trail­ing garbage in the URL such as query strings like ?casa_token or ?cookie or ?X (or hosted on S3/AWS); such links may or may not work for other peo­ple but will surely stop work­ing soon. (Acad­e­mia.edu, Na­ture, and El­se­vier are par­tic­u­larly egre­gious of­fend­ers here.)

  • PDF Edit­ing: if a scan, it may be worth edit­ing the PDF to crop the edges, thresh­old to bi­na­rize it (which, for a bad grayscale or color scan, can dras­ti­cally re­duce file­size while in­creas­ing read­abil­ity), and OCR it.

    I use gscan2pdf but there are al­ter­na­tives worth check­ing out.
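
    The same cleanup can also be scripted; a minimal command-line sketch (hypothetical filenames, assuming ImageMagick & ocrmypdf are installed) of the crop/binarize/OCR steps:

     for f in page-*.png; do
        convert "$f" -shave 50x50 -threshold 50% "bw-$f"   # trim a border & binarize each page scan
     done
     convert bw-page-*.png book.pdf    # assemble the cleaned pages into a single PDF
     ocrmypdf book.pdf book-ocr.pdf    # add an OCR text layer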

  • Check & Im­prove Meta­data.

    Adding meta­data to pa­pers/books is a good idea be­cause it makes the file find­able in G/GS (if it’s not on­line, does it re­ally exist?) and helps you if you de­cide to use bib­li­o­graphic soft­ware like Zotero in the fu­ture. Many aca­d­e­mic pub­lish­ers & LG are ter­ri­ble about meta­data, and will not in­clude even title/au­thor/DOI/year.

    PDFs can be easily annotated with metadata using ExifTool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.

    For pa­pers hid­den in­side vol­umes or other files, you should ex­tract the rel­e­vant page range to cre­ate a sin­gle rel­e­vant file. (For ex­trac­tion of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf. Many pub­lish­ers in­sert a spam page as the first page. You can drop that eas­ily with pdftk INPUT.pdf cat 2-end output OUTPUT.pdf, but note that PDFtk may drop all meta­data, so do that be­fore adding any meta­data. To delete pseudo-encryption or ‘pass­worded’ PDFs, do pdftk INPUT.pdf input_pw output OUTPUT.pdf; PDFs using ac­tual en­cryp­tion are trick­ier but ⁠can often be beaten by off-the-shelf password-cracking util­i­ties.) For con­vert­ing JPG/PNGs to PDF, one can use Im­ageMag­ick for <64 pages (convert *.png foo.pdf) but be­yond that one may need to con­vert them in­di­vid­u­ally & then join the re­sult­ing PDFs (eg. for f in *.png; do convert "$f" "${f%.png}.pdf"; done && pdftk *.pdf cat output foo.pdf or join with pdfunite *.pdf foo.pdf.)

    I try to set at least title/au­thor/DOI/year/sub­ject, and stuff any ad­di­tional top­ics & bib­li­o­graphic in­for­ma­tion into the “Key­words” field. Ex­am­ple of set­ting meta­data:

    exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
        -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
        first-order logic,  Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
        Society_, Volume s2-30, Issue 1, 1930-01-01, pg264–286" 1930-ramsey.pdf

    “PDF Plus” is bet­ter than “PDF”.

    If two versions are provided, the "PDF" one may be intended (if there is any real difference) for printing and exclude features like hyperlinks.

  • Public Hosting: if possible, host a public copy; especially if it was very difficult to find, it should be hosted even if it seems useless. The life you save may be your own.

  • Link On WP/So­cial Media: for bonus points, link it in ap­pro­pri­ate places on Wikipedia or Red­dit or Twit­ter; this makes peo­ple aware of the copy being avail­able, and also su­per­charges vis­i­bil­ity in search en­gines.

  • Link Specific Pages: as noted before, you can link a specific page by adding #page=N to the URL. Linking the relevant page is helpful to readers. (I recommend against doing this to link an entire article inside a book, because that article will still have bad SEO and it will be hard to find; in such cases, it's better to crop out the relevant page range as a standalone article, eg. using pdftk again for pdftk 1900-BOOK.pdf cat 123-456 output 1900-PAPER.pdf.)

Advanced

Aside from the (highly-recommended) use of hotkeys and Booleans for searches, there are a few useful tools for the researcher which, while expensive initially, can pay off in the long term:

  • ⁠archiver-bot: au­to­mat­i­cally archive your web brows­ing and/or links from ar­bi­trary web­sites to fore­stall linkrot; par­tic­u­larly use­ful for de­tect­ing & re­cov­er­ing from dead PDF links

  • Sub­scrip­tions like PubMed & GS search alerts: set up alerts for a spe­cific search query, or for new ci­ta­tions of a spe­cific paper. (Google Alerts is not as use­ful as it seems.)

    1. PubMed has straight­for­ward con­ver­sion of search queries into alerts: “Cre­ate alert” below the search bar. (Given the vol­ume of PubMed in­dex­ing, I rec­om­mend care­fully tai­lor­ing your search to be as nar­row as pos­si­ble, or else your alerts may over­whelm you.)

    2. To cre­ate generic GS search query alert, sim­ply use the “Cre­ate alert” on the side­bar for any search. To fol­low ci­ta­tions of a key paper, you must: 1. bring up the paper in GS; 2. click on “Cited by X”; 3. then use “Cre­ate alert” on the side­bar.

  • GCSE: a Google Custom Search Engine is a specialized search restricted to whitelisted pages/domains etc. (eg. my Wikipedia-focused anime/manga CSE).

    A GCSE can be thought of as a saved search query on steroids. If you find yourself regularly including scores of the same domains in multiple searches, or constantly blacklisting domains with -site: or using many negations to filter out common false positives, it may be time to set up a GCSE which does all that by default.

  • Clip­pings: note-taking ser­vices like Ever­note/Mi­crosoft OneNote: reg­u­larly mak­ing and keep­ing ex­cerpts cre­ates a per­son­al­ized search en­gine, in ef­fect.

    This can be vital for re­find­ing old things you read where the search terms are hope­lessly generic or you can’t re­mem­ber an exact quote or ref­er­ence; it is one thing to search a key­word like “autism” in a few score thou­sand clip­pings, and an­other thing to search that in the en­tire In­ter­net! (One can also re­or­ga­nize or edit the notes to add in the key­words one is think­ing of, to help with re­find­ing.) I make heavy use of Ever­note clip­ping and it is key to re­find­ing my ref­er­ences.

  • Crawling Websites: sometimes having copies of whole websites might be useful, either for more flexible searching or for ensuring you have anything you might need in the future. (example: “Darknet Market Archives (2013–2015)”).

    Useful tools to know about: wget, cURL, HTTrack; Firefox plugins: NoScript, uBlock Origin, Live HTTP Headers, Bypass Paywalls, cookie exporting.
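
    For whole-site mirroring, a basic wget invocation (a sketch only; adjust the politeness & scope options for the site in question) looks like:

     wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
          --wait=1 'https://example.com/blog/'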

    Short of down­load­ing a web­site, it might also be use­ful to pre-emptively archive it by using linkchecker to crawl it, com­pile a list of all ex­ter­nal & in­ter­nal links, and store them for pro­cess­ing by an­other archival pro­gram (see Archiv­ing URLs for ex­am­ples). In cer­tain rare cir­cum­stances, se­cu­rity tools like nmap can be use­ful to ex­am­ine a mys­te­ri­ous server in more de­tail: what web server and ser­vices does it run, what else might be on it (some­times in­ter­est­ing things like old anony­mous FTP servers turn up), has a web­site moved be­tween IPs or servers, etc.
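
    A rough sketch of that linkchecker step (example.com is a placeholder; flags may vary slightly by version):

     linkchecker --verbose --output=csv 'https://example.com/' > example.com-links.csv
     # --verbose logs every URL checked, not just the broken ones, so the CSV doubles as a link inventory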

Web Pages

With proper use of pre-emptive archiv­ing tools like archiver-bot, fix­ing linkrot in one’s own pages is much eas­ier, but that leaves other ref­er­ences. Search­ing for lost web pages is sim­i­lar to search­ing for pa­pers:

  • Just Search The Title: if the page title is given, search for the title.

    It is a good idea to in­clude page ti­tles in one’s own pages, as well as the URL, to help with fu­ture searches, since the URL may be mean­ing­less gib­ber­ish on its own, and pre-emptive archiv­ing can fail. HTML sup­ports both alt and title pa­ra­me­ters in link tags, and, in cases where dis­play­ing a title is not de­sir­able (be­cause the link is being used in­line as part of nor­mal hy­per­tex­tual writ­ing), ti­tles can be in­cluded cleanly in Mark­down doc­u­ments like this: [inline text description](URL "Title").

  • Clean URLs: check the URL for weird­ness or trail­ing garbage like ?rss=1 or ?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+blogspot%2FgJZg+%28Google+AI+Blog%29? Or a vari­ant do­main, like a mobile.foo.com/m.foo.com/foo.com/amp/ URL? Those are all less likely to be find­able or archived than the canon­i­cal ver­sion.
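
    A crude sed heuristic (not exhaustive; the URL is a made-up example) for stripping common tracking parameters before searching or archiving a link:

     echo 'https://example.com/post?utm_source=feedburner&utm_medium=feed' | \
         sed -E 's/[?&](utm_[a-z]+|rss|ref)=[^&]*//g'
     # → https://example.com/post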

  • Do­main Site Search: re­strict G search to the orig­i­nal do­main with site:, or to re­lated do­mains

  • Time-Limited Search: re­strict G search to the orig­i­nal date-range/years.

    You can use this to tame overly-general searches. An al­ter­na­tive to the date-range wid­get is the ad­vanced search syn­tax, which works (for now): spec­ify nu­meric range queries using double-dots like foo 2020..2023 (which is use­ful be­yond just years). If this is still too broad, it can al­ways be nar­rowed down to in­di­vid­ual years.

  • Switch En­gines: try a dif­fer­ent search en­gine: cor­puses can vary, and in some cases G tries to be too smart for its own good when you need a lit­eral search; Duck­DuckGo (es­pe­cially for ‘bang’ spe­cial searches), Bing, and Yan­dex are us­able al­ter­na­tives

  • Check Archives: if nowhere on the clear­net, try the In­ter­net Archive (IA) or the ⁠Me­mento meta-archive search en­gine:

    IA is the de­fault backup for a dead URL. If IA doesn’t Just Work, there may be other ver­sions in it:

    • mis­lead­ing redi­rects: did the IA ‘help­fully’ redi­rect you to a much-later-in-time error page? Kill the redi­rect and check the ear­li­est stored ver­sion for the exact URL rather than the redi­rect. Did the page ini­tially load but then error out/redi­rect? Dis­able JS with No­Script and re­load.

    • Within-Domain Archives: IA lets you list all URLs with any archived ver­sions, by search­ing for URL/*; the list of avail­able URLs may re­veal an al­ter­nate newer/older URL. It can also be use­ful to fil­ter by file­type or sub­string.

      For example, one might list all URLs in a domain, and if the list is too long and filled with garbage URLs, then use the “Filter results” incremental-search widget to search for “uploads/” on a WordPress blog.15

      Gwern 2019

      Screen­shot of an oft-overlooked fea­ture of the In­ter­net Archive: dis­play­ing all avail­able/archived URLs for a spe­cific do­main, fil­tered down to a sub­set match­ing a string like *uploads/*.

      • wayback_machine_downloader (not to be con­fused with the internetarchive Python pack­age which pro­vides a CLI in­ter­face to up­load­ing files) is a Ruby tool which lets you down­load whole do­mains from IA, which can be use­ful for run­ning a local full­text search using reg­exps (a good grep query is often enough), in cases where just look­ing at the URLs via URL/* is not help­ful. (An al­ter­na­tive which might work is websitedownloader.io.)

      Ex­am­ple:

      gem install --user-install wayback_machine_downloader
      ~/.gem/ruby/2.7.0/bin/wayback_machine_downloader --all-timestamps 'https://blog.okcupid.com'
    • did the do­main change, eg. from www.foo.com to foo.com or www.foo.org? En­tirely dif­fer­ent as far as IA is con­cerned.

    • does the in­ter­nal ev­i­dence of the URL pro­vide any hints? You can learn a lot from URLs just by pay­ing at­ten­tion and think­ing about what each di­rec­tory and ar­gu­ment means.

    • is this a Blogspot blog? Blogspot is uniquely hor­ri­ble in that it has ver­sions of each blog for every coun­try do­main: a foo.blogspot.com blog could be under any of foo.blogspot.de, foo.blogspot.au, foo.blogspot.hk, foo.blogspot.jp…⁠⁠16⁠
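
      One way to check all the country-domain variants at once is the Wayback Machine 'availability' API (a sketch; 'foo' is a placeholder blog name):

       for tld in com de au hk jp; do
           curl -s "https://archive.org/wayback/available?url=foo.blogspot.$tld" | \
               grep -q '"available": *true' && echo "snapshots exist for foo.blogspot.$tld"
       done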

    • did the web­site pro­vide RSS feeds?

      A little-known fact is that Google Reader (GR; October 2005–July 2013) stored all RSS items it crawled, so if a website's RSS feed was configured to include full items, the RSS feed history was an alternate mirror of the whole website, and since GR never removed RSS items, it was possible to retrieve pages or whole websites from it. GR has since closed down, sadly, but before it closed, Archive Team downloaded a large fraction of GR's historical RSS feeds, and those archives are now hosted on IA. The catch is that they are stored in mega-WARCs, which, for all their archival virtues, are not the most user-friendly format. The raw GR mega-WARCs are difficult enough to work with that I defer an example to the appendix.

    • archive.today: an IA-like mir­ror. (Some­times by­passes pay­walls or has snap­shots other ser­vices do not; I strongly rec­om­mend against treat­ing archive.today/archive.is/etc as any­thing but a tem­po­rary mir­ror to grab snap­shots from, as it has no long-term plans.)

    • any local archives, such as those made with my ⁠archiver-bot

    • Google Cache (GC): no longer ex­tant as of 2024-09-24.

Books

Digital

E-books are rarer and harder to get than pa­pers, al­though the sit­u­a­tion has im­proved vastly since the early 2000s. To search for books on­line:

  • More Straight­for­ward: book searches tend to be faster and sim­pler than paper searches, and to re­quire less clev­er­ness in search query for­mu­la­tion, per­haps be­cause they are rarer on­line, much larger, and have sim­pler ti­tles, mak­ing it eas­ier for search en­gines.

    Search G, not GS, for books:

    No Books in Google Scholar

    Book full­texts usu­ally don’t show up in GS (for un­known rea­sons). You need to check G when search­ing for books.

    To double-check, you can try a filetype:pdf search; then check LG. Typically, if the main title + author doesn't turn it up, it's not online. (In some cases, the author order is reversed, or the title:subtitle are reversed, and you can find a copy by tweaking your search, but these are rare.)

  • IA: the In­ter­net Archive has many books scanned which do not ap­pear eas­ily in search re­sults (poor SEO?).

    • If an IA hit pops up in a search, always check it; the OCR may offer hints as to where to find it. If you don't find anything there, try doing an IA site search in G (not the IA built-in search engine), eg. book title site:archive.org.

    • DRM workarounds: if it is on IA but the IA ver­sion is DRMed and is only avail­able for “check­out”, you can jail­break it.

      Check the book out for the full period, 14 days. Download the PDF (not EPUB) version to Adobe Digital Editions version ≤4.0 (which can be run in Wine on Linux), and then import it to Calibre with the De-DRM plugin, which will produce a DRM-free PDF inside Calibre's library. (Getting De-DRM running can be tricky, especially under Linux. I wound up having to edit some of the paths in the Python files to make them work with Wine. It also appears to fail on the most recent Google Play ebooks, ~2021.) You can then add metadata to the PDF & upload it to LG17. (LG's versions of books are usually better than the IA scans, but if they don't exist, IA's is better than nothing.)

  • Google Play: uses the same PDF DRM as IA, and can be broken the same way

  • HathiTrust also hosts many book scans, which can be searched for clues or hints or jail­bro­ken.

    HathiTrust blocks whole-book down­loads but it’s easy to down­load each page in a loop and stitch them to­gether, for ex­am­ple:

    for i in {1..151}
    do if [[ ! -s "$i.pdf" ]]; then
        wget "https://babel.hathitrust.org/cgi/imgsrv/download/pdf?id=mdp.39015050609067;orient=0;size=100;seq=$i;attachment=0" \
              -O "$i.pdf"
        sleep 20s
     fi
    done
     
    pdftk *.pdf cat output 1957-super-scientificcareersandvocationaldevelopmenttheory.pdf
     
    exiftool -Title="Scientific Careers and Vocational Development Theory: A review, a critique and some recommendations" \
        -Date=1957 -Author="Donald E. Super, Paul B. Bachrach" -Subject="psychology" \
        -Keywords="Bureau Of Publications (Teachers College Columbia University), LCCCN: 57-12336, National Science Foundation, public domain, \
        https://babel.hathitrust.org/cgi/pt?id=mdp.39015050609067;view=1up;seq=1 https://psycnet.apa.org/record/1959-04098-000" \
        1957-super-scientificcareersandvocationaldevelopmenttheory.pdf

    An­other ex­am­ple of this would be the Well­come Li­brary; while look­ing for An In­ves­ti­ga­tion Into The Re­la­tion Be­tween In­tel­li­gence And In­her­i­tance, Lawrence1931, I came up dry until I checked one of the last search re­sults, a “Well­come Dig­i­tal Li­brary” hit, on the slim off-chance that, like the oc­ca­sional Chi­nese/In­dian li­brary web­site, it just might have full­text. As it hap­pens, it did—good news? Yes, but with a caveat: it pro­vides no way to down­load the book! It pro­vides OCR, meta­data, and in­di­vid­ual page-image down­loads all under CC-BY-NC-SA (so no legal prob­lems), but… not the book. (The OCR is also un­nec­es­sar­ily zipped, so that is why Google ranked the page so low and did not show any re­veal­ing ex­cerpts from the OCR tran­script: be­cause it’s hid­den in an opaque archive to save a few kilo­bytes while de­stroy­ing SEO.) Ex­am­in­ing the down­load URLs for the highest-resolution im­ages, they fol­low an un­for­tu­nate schema:

    1. https://dlcs.io/iiif-img/wellcome/1/5c27d7de-6d55-473c-b3b2-6c74ac7a04c6/full/2212,/0/default.jpg

    2. https://dlcs.io/iiif-img/wellcome/1/d514271c-b290-4ae8-bed7-fd30fb14d59e/full/2212,/0/default.jpg

    3. etc

    Instead of being sequentially numbered 1–90 or whatever, they all live under a unique hash or ID. Fortunately, one of the metadata files, the 'manifest' file, provides all of the hashes/IDs (but not the high-quality download URLs). Extracting the IDs from the manifest can be done with some quick sed & tr string processing, and fed into another short wget loop for download:

    grep -F '@id' manifest\?manifest\=https\:%2F%2Fwellcomelibrary.org%2Fiiif%2Fb18032217%2Fmanifest | \
       sed -e 's/.*imageanno\/\(.*\)/\1/' | grep -E -v '^ .*' | tr -d ',' | tr -d '"' # "
    # bf23642e-e89b-43a0-8736-f5c6c77c03c3
    # 334faf27-3ee1-4a63-92d9-b40d55ab72ad
    # 5c27d7de-6d55-473c-b3b2-6c74ac7a04c6
    # d514271c-b290-4ae8-bed7-fd30fb14d59e
    # f85ef645-ec96-4d5a-be4e-0a781f87b5e2
    # a2e1af25-5576-4101-abee-96bd7c237a4d
    # 6580e767-0d03-40a1-ab8b-e6a37abe849c
    # ca178578-81c9-4829-b912-97c957b668a3
    # 2bd8959d-5540-4f36-82d9-49658f67cff6
    # ...etc
    I=1
    for HASH in $HASHES; do
        wget "https://dlcs.io/iiif-img/wellcome/1/$HASH/full/2212,/0/default.jpg" -O $I.jpg
        I=$((I+1))
    done

    And then the 59MB of JPGs can be cleaned up as usual with gscan2pdf (empty pages deleted, ta­bles ro­tated, cover page cropped, all other pages bi­na­rized), com­pressed/OCRed with ocrmypdf, and meta­data set with exiftool, pro­duc­ing a read­able, down­load­able, highly-search-engine-friendly 1.8MB PDF.

  • re­mem­ber the Ana­log Hole works for pa­pers/books too:

    if you can find a copy to read, but can­not fig­ure out how to down­load it di­rectly be­cause the site uses JS or com­pli­cated cookie au­then­ti­ca­tion or other tricks, you can al­ways ex­ploit the ‘ana­logue hole’—fullscreen the book in high res­o­lu­tion & take screen­shots of every page; then crop, OCR etc. This is te­dious but it works. And if you take screen­shots at suf­fi­ciently high res­o­lu­tion, there will be rel­a­tively lit­tle qual­ity loss. (This works bet­ter for books that are scans than ones born-digital.)
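
    Once you have per-page screenshots (hypothetical filenames; assumes img2pdf & ocrmypdf are installed), assembling them is mechanical:

     img2pdf page-*.png -o book.pdf    # pack the screenshots into a PDF without re-encoding
     ocrmypdf book.pdf book-ocr.pdf    # add an OCR text layer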

Physical

Expensive but feasible. Books are something of a double-edged sword compared to papers/theses. On the one hand, books are much more often unavailable online, and must be bought offline, but at least you almost always can buy used books offline without much trouble (and often for <$13 total, ~$10 in 2019 dollars); on the other hand, while papers/theses are often available online, when one is not, it's usually very unavailable, and you're stuck (unless you have a university ILL department backing you up or are willing to travel to the few or only universities with paper or microfilm copies).

Pur­chas­ing from used book sell­ers:

  • Sell­ers:

    • used book search en­gines: Google Books/⁠find-more-books.com: a good start­ing point for seller links; if buy­ing from a mar­ket­place like Abe­Books/Ama­zon/Barnes & Noble, it’s worth search­ing the seller to see if they have their own web­site, which is po­ten­tially much cheaper. They may also have mul­ti­ple edi­tions in stock.

    • bad: eBay & Ama­zon are often bad, due to high-minimum-order+S&H and sell­ers on Ama­zon seem to as­sume Ama­zon buy­ers are eas­ily rooked; but can be use­ful in pro­vid­ing meta­data like page count or ISBN or vari­a­tions on the title

    • good: Abe­Books, ⁠Thrift Books, Bet­ter World Books, B&N, Dis­cover Books.

      Note: on Abe­Books, in­ter­na­tional or­ders can be use­ful (es­pe­cially for be­hav­ioral ge­net­ics or psy­chol­ogy books) but be care­ful of in­ter­na­tional or­ders with your credit card—many debit/credit cards will fail on in­ter­na­tional or­ders and trig­ger a fraud alert, and Pay­Pal is not ac­cepted.

  • Price Alerts: if a book is not avail­able or too ex­pen­sive, set price watches: Abe­Books sup­ports email alerts on stored searches, and Ama­zon can be mon­i­tored via Camel­Camel­Camel (re­mem­ber the CCC price alert you want is on the used third-party cat­e­gory, as new books are more ex­pen­sive, less avail­able, and un­nec­es­sary).

Scan­ning:

  • De­struc­tive Vs Non-Destructive: the fun­da­men­tal dilemma of book scan­ning—de­struc­tively de­bind­ing books with a razor or guil­lo­tine cut­ter works much bet­ter & is much less time-consuming than spread­ing them on a flatbed scan­ner to scan one-by-one⁠⁠18⁠, be­cause it al­lows use of a sheet-fed scan­ner in­stead, which is eas­ily 5x faster and will give higher-quality scans (be­cause the sheets will be flat, scanned edge-to-edge, and much more closely aligned), but does, of course, re­quire ef­fec­tively de­stroy­ing the book.

  • Tools:

    • cut­ting: For sim­ple de­bind­ing of a few books a year, an X-acto knife/razor is good (avoid the ‘tri­an­gle’ blades, get curved blades in­tended for large cuts in­stead of de­tail work).

      Once you start doing more than one a month, it’s time to up­grade to a guil­lo­tine blade paper cut­ter (a fancier swinging-arm paper cut­ter, which uses a two-joint sys­tem to clamp down and cut uni­formly).

      A guil­lo­tine blade can cut chunks of 200 pages eas­ily with­out much slip­page, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into sev­eral 200-page chunks for the guil­lo­tine cut­ter.

    • scan­ning: at some point, it may make sense to switch to a scan­ning ser­vice like ⁠1Dol­larScan (1DS has ac­cept­able qual­ity for the black-white scans I have used them for thus far, but watch out for their nickel-and-diming fees for OCR or “set­ting the PDF title”; these can be done in no time your­self using gscan2pdf/exiftool/ocrmypdf and will save a lot of money as they, amaz­ingly, bill by 100-page units). Books can be sent di­rectly to 1DS, re­duc­ing lo­gis­ti­cal has­sles.

  • Clean Up: after scan­ning, crop/thresh­old/OCR/add meta­data

    • Adding meta­data: same prin­ci­ples as pa­pers. While more elab­o­rate meta­data can be added, like book­marks, I have not ex­per­i­mented with those yet.

  • File for­mat: PDF, ⁠not DjVu

    De­spite being a worse for­mat in many re­spects, I now rec­om­mend PDF and have stopped using DjVu for new scans⁠⁠19⁠ and have con­verted my old DjVu files to PDF.
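
    Conversion of old DjVu scans can be scripted (following the workaround in the footnote: the conversion strips OCR, so re-OCR afterwards):

     for f in *.djvu; do
         ddjvu -format=pdf "$f" "${f%.djvu}.pdf"
         ocrmypdf "${f%.djvu}.pdf" "${f%.djvu}-ocr.pdf"
     done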

  • Uploading: to LibGen, usually, and Gwern.net sometimes. For backups, filelockers like Dropbox, Mega, MediaFire, or Google Drive are good if you have no website of your own. I usually upload 3 copies including LG. I rotate accounts once a year, to avoid putting too many files into a single account. (I discourage reliance on IA links.)

    • Do Not Use Google Docs/Scribd/Drop­box/IA/etc for Long-Term Doc­u­ments

      ‘Doc­u­ment’ web­sites like Google Docs (GD) should be strictly avoided as pri­mary host­ing. GD does not ap­pear in G/GS, doom­ing a doc­u­ment to ob­scu­rity, and Scribd is lu­di­crously user-hostile with chang­ing dark pat­terns. Such sites can­not be searched, scraped, down­loaded re­li­ably, clipped, used on many de­vices, archived⁠⁠20⁠, or counted on for the long haul. (For ex­am­ple, Google Docs has made many doc­u­ments ‘pri­vate’, break­ing pub­lic links, to the sur­prise of even the au­thors when I con­tact them about it, for un­clear rea­sons.)

      Such sites may be use­ful for col­lab­o­ra­tion or sur­veys, but should be re­garded as strictly tem­po­rary work­ing files, and moved to clean sta­tic HTML/PDF/XLSX hosted else­where as soon as pos­si­ble.

  • Host­ing: host­ing pa­pers is easy but books come with risk:

    Books can be dangerous; in deciding whether to host a book, my rule of thumb is to host only books which are pre-2000, show no Kindle edition or other signs of active exploitation, and are effectively 'orphan works'.

    As of 2019-10-23, hosting 4,090 files over 9 years (very roughly, assuming linear growth, <6.7 million document-days of hosting: 4,090 × 0.5 × 9 × 365.25 ≈ 6,722,426), I've received 4 takedown orders: a behavioral genetics textbook (2013), The Handbook of Psychopathy (2005), a recent meta-analysis paper (Roberts et al 2016), and a CUP DMCA takedown order for 27 files. I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.

Case Studies

See Also

Appendix

Searching the Google Reader Archives

A 2015 tutorial on how to do manual searches of the 2013 Google Reader archives on the Internet Archive. Google Reader provides fulltext mirrors of many websites which are long gone and not otherwise available even in the IA; however, the Archive Team archives are extremely user-unfriendly and challenging to use even for programmers.

I ex­plain how to find & ex­tract spe­cific web­sites.

Note: now largely ob­so­leted by query­ing IA’s Way­back Ma­chine for the GR RSS URL.


Unusual archive: Google Reader. A little-known way to ‘undelete’ a pre-2013 blog or website is to use Google Reader (GR). GR crawled regularly almost all blogs' RSS feeds, and RSS feeds often contain the fulltext of articles. If a blog author wrote an article, the fulltext was included in the RSS feed and downloaded by GR; if the author then changed their mind and edited or deleted it, GR would redownload the new version but continue to show the old version as well (you would see both versions, chronologically). If the author blogged regularly and so GR had learned to check regularly, it could hypothetically grab several different edited versions, not just ones with weeks or months in between. This assumes that GR did not, as it sometimes did for inscrutable reasons, stop displaying the historical archives and show only the last 90 days or so to readers; I was never able to figure out why this happened or whether it really did happen and was not some sort of UI problem. Regardless, if all went well, this let you undelete an article, albeit perhaps with messed-up formatting or something. Sadly, GR was closed back on 2013-07-01, and you cannot simply log in and look for blogs.

Archive Team mir­rored Google Reader. How­ever, be­fore it was closed, ⁠Archive Team launched a major ef­fort to down­load as much of GR as pos­si­ble. So in that dump, there may be archives of all of a ran­dom blog’s posts. Specif­i­cally: if a GR user sub­scribed to it; if Archive Team knew about it; if they re­quested it in time be­fore clo­sure; and if GR did keep full archives stretch­ing back to the first post­ing.

AT mirror is raw binary data. Downside: the Archive Team dump is not in an easily browsed format, and merely figuring out what it might have is difficult. In fact, it's so difficult that before researching Craig Wright in November–December 2015, I never had an urgent enough reason to figure out how to get anything out of it, and I'm not sure I've ever seen anyone actually use it before; Archive Team takes the attitude that it's better to preserve the data somehow and let posterity worry about using it. (There is a site which claimed to be a frontend to the dump, but when I tried to use it, it was broken & still is as of April 2024.)

Extracting

Find the right archive. The 9TB of data is stored in ~69 opaque com­pressed WARC archives. 9TB is a bit much to down­load and un­com­press to look for one or two files, so to find out which WARC you need, you have to down­load the ~69 CDX in­dexes which record the con­tents of their re­spec­tive WARC, and search them for the URLs you are in­ter­ested in. (They are plain text so you can just grep them.)

Locations

In this example, we will look at the main blog of Craig Wright, gse-compliance.blogspot.com. (Another blog, security-doctor.blogspot.com, appears to have been too obscure to be crawled by GR.) To locate the WARC with the Wright RSS feeds, download the master index. To search:

for file in *.gz; do echo $file; zcat $file | grep -F -e 'gse-compliance' -e 'security-doctor'; done
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml?client=\
# archiveteam&comments=true&likes=true&n=1000&r=n 20130602001238 https://www.google.com/reader/\
# api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n&n=1000&\
# likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM - - 1316181\
# 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# alt=rss?client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001249 https://www.google.\
# com/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault\
# %3Falt%3Drss?r=n&n=1000&likes=true&comments=true&client=ArchiveTeam unk - HOYKQ63N2D6UJ4TOIXMOTUD4IY7MP5HM\
# - - 1326824 19810951910 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/feeds/posts/default?\
# client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001244 https://www.google.com/\
# reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Ffeeds%2Fposts%2Fdefault?\
# r=n&n=1000&likes=true&comments=true&client=ArchiveTeam unk - XXISZYMRUZWD3L6WEEEQQ7KY7KA5BD2X - - \
# 1404934 19809546472 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz
# com,google/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/rss.xml?client=archiveteam\
# &comments=true&likes=true&n=1000&r=n 20130602001253 https://www.google.com/reader/api/0/stream/contents\
# /feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Frss.xml?r=n&n=1000&likes=true&comments=true\
# &client=ArchiveTeam text/html 404 AJSJWHNSRBYIASRYY544HJMKLDBBKRMO - - 9467 19812279226 \
# archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Un­der­stand­ing the out­put: the for­mat is de­fined by the first line, which then can be ⁠looked up:

  • the for­mat string is: CDX N b a m s k r M S V g; which means here:

    • N: mas­saged url

    • b: date

    • a: orig­i­nal url

    • m: MIME type of orig­i­nal doc­u­ment

    • s: re­sponse code

    • k: new style check­sum

    • r: redi­rect

    • M: meta tags (AIF)

    • S: ?

    • V: com­pressed arc file off­set

    • g: file name

Ex­am­ple:

(com,google)/reader/api/0/stream/contents/feed/http:/gse-compliance.blogspot.com/atom.xml\
?client=archiveteam&comments=true&likes=true&n=1000&r=n 20130602001238 https://www.google.com\
/reader/api/0/stream/contents/feed/http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml?r=n\
&n=1000&likes=true&comments=true&client=ArchiveTeam unk - 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM\
- - 1316181 19808229791 archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

Con­verts to:

  • mas­saged URL: (com,google)/reader/api/0/stream/contents/feed/ http:/gse-compliance.blogspot.com/atom.xml? client=archiveteam&comments=true&likes=true&n=1000&r=n

  • date: 20130602001238

  • orig­i­nal URL: https://www.google.com/reader/api/0/stream/contents/feed/ http%3A%2F%2Fgse-compliance.blogspot.com%2Fatom.xml? r=n&n=1000&likes=true&comments=true&client=ArchiveTeam

  • MIME type: unk [un­known?]

  • response code: - [none?]

  • new-style check­sum: 4GZ4KXJISATWOFEZXMNB4Q5L3JVVPJPM

  • redirect: - [none?]

  • meta tags: - [none?]

  • S [? maybe length?]: 1316181

  • com­pressed arc file off­set: 19808229791 (19,808,229,791; so some­where around 19.8GB into the mega-WARC)

  • file­name: archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz

As of 2024, the WARCs have been processed into the Wayback Machine and the original google.com/reader/api/0/ RSS URLs are now searchable, so one can look the old GR RSS feed up like a normal URL and do the normal broader searches like searching for all versions.
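
For example, one way to enumerate archived GR captures for a particular feed is the Wayback CDX API (a sketch; the blog name is a placeholder, and the embedded feed URL may need percent-encoding):

curl -s 'https://web.archive.org/cdx/search/cdx?url=www.google.com/reader/api/0/stream/contents/feed/http://foo.blogspot.com/atom.xml&matchType=prefix&limit=20'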

However, in 2015, we had to do it the hard way: extracting directly from the WARC. Knowing the offset theoretically makes it possible to extract directly from the IA copy without having to download and decompress the entire thing… The S & offsets for gse-compliance are:

  1. 1316181/19808229791

  2. 1326824/19810951910

  3. 1404934/19809546472

  4. 9467/19812279226

So we found hits point­ing to­wards archiveteam_greader_20130604001315 & archiveteam_greader_20130614211457 which we then need to down­load (25GB each):

wget 'https://archive.org/download/archiveteam_greader_20130604001315/greader_20130604001315.megawarc.warc.gz'
wget 'https://archive.org/download/archiveteam_greader_20130614211457/greader_20130614211457.megawarc.warc.gz'

Once down­loaded, how do we get the feeds? There are a num­ber of ⁠hard-to-use and in­com­plete tools for work­ing with giant WARCs; I con­tacted the orig­i­nal GR archiver, ivan, but that wasn’t too help­ful.

warcat

I tried using warcat to un­pack the en­tire WARC archive into in­di­vid­ual files, and then delete every­thing which was not rel­e­vant:

python3 -m warcat extract /home/gwern/googlereader/...
find ./www.google.com/ -type f -not \( -name "*gse-compliance*" -or -name "*security-doctor*" \) -delete
find ./www.google.com/

But this was too slow, and crashed part­way through be­fore fin­ish­ing.


A more re­cent al­ter­na­tive li­brary, which I haven’t tried, is warcio, which may be able to find the byte ranges & ex­tract them.

dd

If we are feel­ing brave, we can use the off­set and pre­sumed length to have dd di­rectly ex­tract byte ranges:

dd skip=19808229791 count=1316181 if=greader_20130604001315.megawarc.warc.gz of=1.gz bs=1
# 1316181+0 records in
# 1316181+0 records out
# 1316181 bytes (1.3 MB) copied, 14.6209 s, 90.0 kB/s
dd skip=19810951910 count=1326824 if=greader_20130604001315.megawarc.warc.gz of=2.gz bs=1
# 1326824+0 records in
# 1326824+0 records out
# 1326824 bytes (1.3 MB) copied, 14.6218 s, 90.7 kB/s
dd skip=19809546472 count=1404934 if=greader_20130604001315.megawarc.warc.gz of=3.gz bs=1
# 1404934+0 records in
# 1404934+0 records out
# 1404934 bytes (1.4 MB) copied, 15.4225 s, 91.1 kB/s
dd skip=19812279226 count=9467 if=greader_20130604001315.megawarc.warc.gz of=4.gz bs=1
# 9467+0 records in
# 9467+0 records out
# 9467 bytes (9.5 kB) copied, 0.125689 s, 75.3 kB/s
gunzip *.gz

Results

Success: raw HTML. My dd extraction was successful, and the resulting HTML/RSS could then be browsed with a command like cat *.warc | fold --spaces --width=200 | less. They can probably also be converted to a local form and browsed, although they won't include any of the site assets like images or CSS/JS, since the original RSS feed assumes you can load any references from the original website and didn't do any kind of data-URI or mirroring (not, after all, having been intended for archive purposes in the first place…)


  1.  

    For ex­am­ple, the info: op­er­a­tor is en­tirely use­less. The link: op­er­a­tor, in al­most a decade of me try­ing it once in a great while, has never re­turned re­motely as many links to my web­site as Google Web­mas­ter Tools re­turns for in­bound links, and seems to have been dis­abled en­tirely at some point.

  2.  

    WP is in­creas­ingly out of date & un­rep­re­sen­ta­tive due to in­creas­ingly nar­row poli­cies about sourc­ing & preprints, part of its over­all dele­tion­ist decay, so it’s not a good place to look for ref­er­ences. It is a good place to look for key ter­mi­nol­ogy, though.

  3.  

    When I was a kid, I knew I could just ask my ref­er­ence li­brar­ian to re­quest any book I wanted by pro­vid­ing the unique ID, the ISBN, and there was a phys­i­cal copy of the book in­side the Li­brary of Con­gress; made sense. I never un­der­stood how I was sup­posed to get these “paper” things my pop­u­lar sci­ence books or news­pa­per ar­ti­cles would some­times cite—where was a paper, ex­actly? If it was pub­lished in The Jour­nal of Pa­pers, where did I get this jour­nal? My li­brary only had a few score mag­a­zine sub­scrip­tions, cer­tainly not all of these Sci­ence and Na­ture and be­yond. The bit­ter an­swer turns out to be: ‘nowhere’. There is no unique iden­ti­fier (the ma­jor­ity of pa­pers lack any DOI still), and there is no cen­tral repos­i­tory nor any­one in charge—only a chaotic patch­work of in­di­vid­ual li­braries and de­funct web­sites. Thus, books tend to be easy to get, but a paper can be a multi-decade odyssey tak­ing one to the depths of the In­ter­net Archive or pur­chas­ing from sketchy Chi­nese web­sites who hire pi­rates to in­fil­trate pri­vate data­bases.

  4.  

    Most search en­gines will treat any space or sep­a­ra­tion as an im­plicit AND, but I find it help­ful to be ex­plicit about it to make sure I’m search­ing what I think I’m search­ing.

  5.  

    It also ex­poses OCR of them all, which can help Google find them—al­beit at the cost of you need­ing to learn ‘OCRese’ in the snip­pets, so you can rec­og­nize when rel­e­vant text has been found, but man­gled by OCR/lay­out.

  6.  

    This prob­a­bly ex­plains part of why no one cites that paper, and those who cite it clearly have not ac­tu­ally read it, even though it in­vented racial ad­mix­ture analy­sis, which, since rein­vented by oth­ers, has be­come a major method in med­ical ge­net­ics.

  7.  

    University ILL privileges are one of the most underrated fringe benefits of being a student, if you do any kind of research or hobbyist reading—you can request almost anything you can find in WorldCat, whether it's an ultra-obscure book or a master's thesis from 1950! Why wouldn't you make regular use of it‽ Of things I miss from being a student, ILL is near the top.

  8.  

    The com­plaint and in­dict­ment are not nec­es­sar­ily the same thing. An in­dict­ment fre­quently will leave out many de­tails and con­fine it­self to list­ing what the de­fen­dant is ac­cused of. Com­plaints tend to be much richer in de­tail. How­ever, some­times there will be only one and not the other, per­haps be­cause the more de­tailed com­plaint has been sealed (pos­si­bly pre­cisely be­cause it is more de­tailed).

  9.  

    Trial testimony can run to hundreds of pages and blow through your remaining PACER budget, so one must be careful. In particular, testimony operates under an interesting & controversial price discrimination system related to how court stenographers report—who are not necessarily paid employees but may be contractors or freelancers—intended to ensure covering transcription costs: the transcript initially may cost hundreds of dollars, intended to extract full value from those who need the trial transcript immediately, such as lawyers or journalists, but then a while later, PACER drops the price to something more reasonable. That is, the first "original" fee costs a fortune, but then "copy" fees are cheaper. So for the US federal court system, the "original", when ordered within hours of the testimony, will cost <$7.25/page (2019 dollars), but then the second person ordering the same transcript pays only <$1.20/page & everyone subsequently <$0.90/page, and as further time passes, that drops to <$0.60 (and I believe after a few months, PACER will then charge only the standard $0.10). So, when it comes to trial transcripts on PACER, patience pays off.

  10.  

    I’ve heard that Lex­is­Nexis ter­mi­nals are some­times avail­able for pub­lic use in places like fed­eral li­braries or cour­t­houses, but I have never tried this my­self.

  11.  

    Cu­ri­ously, in his­tor­i­cal tex­tual crit­i­cism of copied man­u­scripts, it’s the op­po­site: shorter = truer. But with mem­o­ries or para­phrases, longer = truer, be­cause those tend to elide de­tails and mu­tate into catch­ier ver­sions when the trans­mit­ter is not os­ten­si­bly ex­actly copy­ing a text.

  12.  

    The quick sum­mary of DOIs is that they are “ISBNs but for re­search pa­pers”; they are those odd slash-separated al­phanu­meric strings you see around, typ­i­cally of a form like 10.000/abc.1234. (Un­like ISBNs, the DOI stan­dard is very loose, with about the only hard re­quire­ment being that there must be one / char­ac­ter in it, so al­most any string is a DOI, even hate­ful ones like this gen­uine DOI: 10.1890/0012-9658(2001)082[1655:SVITDB]2.0.CO;2.) Many pa­pers have no DOI, or the DOI was as­signed retroac­tively, but if they have a DOI, it can be the most re­li­able way to query any data­base for them.

  13.  

    I ad­vise prepend­ing, like https://sci-hub.st/https://journal.com in­stead of ap­pend­ing, like https://journal.com.sci-hub.st/ be­cause the for­mer is slightly eas­ier to type but more im­por­tantly, Sci-Hub does not have SSL cer­tifi­cates set up prop­erly (I as­sume they’re miss­ing a wild­card) and so ap­pend­ing the Sci-Hub do­main will fail to work in many web browsers due to HTTPS er­rors! How­ever, if prepended, it’ll al­ways work cor­rectly.

  14.  

    Aca­d­e­mic pub­lish­ers like to use the dark pat­tern of putting a lit­tle icon, la­beled “full ac­cess” or “ac­cess” etc., where an Open Ac­cess in­di­ca­tor would go, know­ing that if you are not in­ti­mately fa­mil­iar with that pub­lisher’s site de­sign & ex­am­in­ing it care­fully, you’ll be fooled. An­other dark pat­tern is the unan­nounced tem­po­rary paper: in par­tic­u­lar, the APA, NBER, & Cell are fond of tem­porar­ily un-paywalling PDFs to ex­ploit media cov­er­age, and then un­pre­dictably, silently, re­vok­ing ac­cess later and break­ing links.

  15.  

    To further illustrate this IA feature: if one was looking for Alex St. John's entertaining memoir “Judgment Day Continued…”, a 2013 account of organizing the wild 1996 Doom tournament thrown by Microsoft, but one didn't have the URL handy, one could search the entire domain by going to https://web.archive.org/web/*/http://www.alexstjohn.com/* and using the filter with “judgment”, or if one at least remembered it was in 2013, one could narrow it down further to https://web.archive.org/web/*/http://www.alexstjohn.com/WP/2013/* and then filter or search by hand.

  16.  

    If any Blogspot em­ployee is read­ing this, for god’s sake stop this in­san­ity!

  17.  

    Up­load­ing is not as hard as it may seem. There is a web in­ter­face (user/pass­word: “gen­e­sis”/“up­load”). Up­load­ing large files can fail, so I usu­ally use the FTP server: curl -T "$FILE" ftp://anonymous@ftp.libgen.is/upload/.

  18.  

    Al­though flatbed scan­ning is some­times de­struc­tive too—I’ve cracked the spine of books while press­ing them flat into a flatbed scan­ner.

  19.  

    My workaround is to ex­port from gscan2pdf as DjVu, which avoids the bug, then con­vert the DjVu files with ddjvu -format=pdf; this strips any OCR, so I add OCR with ocrmypdf and meta­data with exiftool.

  20.  

    One ex­cep­tion is Google Docs: one can ap­pend /mobilebasic to (as of 2023-01-04) get a sim­pli­fied HTML view which can be archived. For ex­am­ple, ⁠“A Com­pre­hen­sive Guide to Daki­makuras as a Hobby” is avail­able only as a Google Docs page but the URL https://docs.google.com/document/d/1oIlLt1uqutTP8725wezfZ2mjc-IPfOFCdc6hlRIb-KM/mobilebasic will work with the In­ter­net Archive.
