Secretly Public Domain: Update

Fri Aug 09 2019 09:45

My "Secretly Public Domain" project got a lot of attention, which is great, but it also gave me a lot more work to do and pointed to some things that hadn't been explained very well. I've done that work, and here's an update:

Topline number is 73%

My original estimate was that 80% of pre-1963 books were not renewed. This was based on a couple of inaccurate assumptions, the big one being that I was counting works originally published in a foreign country. Those works might have lapsed into the public domain at some point, but the US copyright has since been restored by treaty. So their renewal status isn't really relevant.

Of the books where renewal status is relevant, here are the most recent statistics:

73% have no renewal record at all.
19% have a renewal record that's an excellent match.
8% are in a grey area. They have one or more renewal records, but none of them are an excellent match. One of them might be legit, or they might all be renewals for totally different books. They need to be checked manually.
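The three buckets above can be sketched as a small classification function. This is only an illustration, not the project's actual logic: the 0-to-1 match score, the `EXCELLENT` cutoff, and the function names are all assumptions.

```python
from collections import Counter

# Hypothetical cutoff: a score at or above this counts as an "excellent match".
EXCELLENT = 0.9

def bucket(match_scores):
    """Classify one registration by the scores of its candidate renewals.

    match_scores is a list of floats in [0, 1], one per renewal record
    that might belong to this registration (an assumed representation).
    """
    if not match_scores:
        return "no renewal"
    if max(match_scores) >= EXCELLENT:
        return "renewed"
    return "grey area"

def summarize(registrations):
    """Return the percentage of registrations that fall in each bucket."""
    counts = Counter(bucket(scores) for scores in registrations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100 * n / total for name, n in counts.items()}

# Four made-up registrations: two with no candidate renewals, one with a
# clear match, and one with only weak candidates.
sample = [[], [], [0.95, 0.4], [0.5, 0.3]]
result = summarize(sample)
```

Moving the cutoff up or down shifts books between "renewed" and "grey area" but never touches the "no renewal" bucket, which depends only on whether any candidate exists.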

Credits

The "Secretly Public Domain" bot was a publicity stunt to draw attention to the machine-readable registration records. It worked great, but it also drew attention to me, the person doing the publicity stunt, even though I had basically nothing to do with the original work. For the record, here are the people who actually did the work. The project inside NYPL was run by Sean Redmond, Greg Cram, and Josh Hadro (now of IIIF). The work of making the copyright records machine-readable was done by Data Conversion Laboratory.

Buried treasure

Most of the books whose copyright wasn't renewed are really obscure titles, but without looking very hard I found a very well-known science fiction novel that has no renewal record. I'm not mentioning the name as an incentive to get people to look at the data themselves. It's probably not the only well-known work whose copyright wasn't renewed.

How to make your own list

My original estimate of 80% was based on the quick and dirty script I used to write the Mastodon bot. To fix the "foreign works" problem and to produce a dataset that would stand up to scrutiny, I published a Python library specifically for handling this data. It's got business logic for making determinations like "was this book published in a foreign country" and "how well does this renewal record match this registration record". You run the scripts and at the end you have a bunch of JSON files with consolidated data. If you think there are bad assumptions, you can change the business logic and run the scripts again.

How to see the data

There were a number of requests for this data in a tabular form. I totally understand where this is coming from, and it's certainly the easiest way to get into the data, but it's tricky, because converting the JSON to tabular data destroys information that would be useful for taking the next step (see below).

So, I've done the best I can. I added a script to the end of my Python workflow which generates three huge tab-separated files, and I put those files in the cce-spreadsheets project. This should be good for getting an overview of which books were renewed, which weren't, and which are foreign publications.
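If you'd rather poke at the tab-separated files with a script than a spreadsheet, Python's standard `csv` module will read them. This is a minimal sketch: the column names (`title`, `year`) and the sample rows are made up for illustration, so check the header row of the real files before relying on them.

```python
import csv
import io

# Stand-in for one of the tab-separated files. The column names here
# ("title", "year") are hypothetical; check the real header row.
SAMPLE_TSV = "title\tyear\nAn Example Book\t1950\nAnother Example\t1931\n"

def titles_from_year(tsv_file, year):
    """Return the titles of rows whose 'year' column equals `year`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [row["title"] for row in reader if row["year"] == str(year)]

# io.StringIO plays the role of an open file in this example.
books_1950 = titles_from_year(io.StringIO(SAMPLE_TSV), 1950)
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.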

What's next?

Discovering that a book published in 1950 is in the public domain doesn't make a free digitized version of that book automatically appear. Somebody has to do the work. At this point we go from fast data processing to really slow research and digitization work. You or I can now make a near-complete list of unrenewed books in a few minutes, but that list just represents an enormous to-do list for someone.

There are basically three "someones" who might step up here: Project Gutenberg, Hathi Trust, and Internet Archive.

Project Gutenberg

As I mentioned earlier, Project Gutenberg digitized the copyright renewal records some time ago, and they use them all the time. They have a section of their Copyright How-To explaining how to check whether a particular title was renewed, and whether the renewal matters. There are other steps to clear a pre-1963 work: you have to verify that the author lived in the US at the time, stuff like that. The newly digitized registration records can help with some of this, and my data processing script that combines registration and renewal can help with more of it, but there's still some manual work you have to do for each book.

Once that work is done, Project Gutenberg volunteers will locate a copy of the book, scan it, and OCR it (assuming there's no existing scan). Then they'll proofread it and put out HTML and plain-text editions. As you can imagine, this process takes a really long time, but the result is a clean, accurate copy of the book that can be read on its own or reused in other projects. The catch is that somebody has to care enough about a specific book to go through all this trouble.

Hathi Trust

Hathi Trust already has scans of a lot of these 1924-1963 books. They just don't make these scans available to the public, because as far as they know, all these books are still under copyright. If they were convinced otherwise, they'd open up the scans—they opened up almost all of their 1923 stuff this January when the 95-year copyright term finally expired. So we have to make a case for opening up these books.

Earlier, NYPL took the highest-circulating 1924-1963 books in our research collection and checked to see which ones lacked a renewal record. We sent the list to Hathi Trust, and they did their own verification and opened up some of the books: The Americans in Santo Domingo from 1928 is an example. Once Hathi opens up a scan, it's available to the public. It also becomes possible for Gutenberg et al. to turn the raw scan into something more readable.

In the near future, people at NYPL (not me) will be talking to people at Hathi Trust about what kind of evidence is necessary, in general, to convince them that the copyright on a 1924-1963 book has lapsed. Then we'll be able to give them a list of all the books where we can find that kind of evidence. There'll still be a verification process on the Hathi Trust side -- at the very least, they have to go through the book and make sure it doesn't contain unauthorized reprints from other books -- but it should streamline things quite a bit.

Internet Archive

Internet Archive is a wild card here. They scan a lot of books, and I could see them treating the "unrenewed" list as a big list of additional books to scan, but it would be a new undertaking. Making unrenewed works available is something Project Gutenberg volunteers do already, and it's something that Hathi Trust could do relatively easily, but with Internet Archive it's more that this is the sort of thing they'd do.

Data problems

That 8% of grey area, where it's not clear whether or not a book was renewed, points to the general difficulty of meshing together two sets of public records published across half a century and digitized by different people. The grey area represents a lot of manual work that has to be done, and of course there's always the fear that a book that seems to be free and clear actually isn't: the title page says "printed in Canada", or the smoking-gun copyright renewal didn't show up because its ID number was typed wrong.

There's going to be a lot of manual work in the process of clearing these books, but there's no reason to wait until everything's perfect to get started. My preference is to cast a very wide net, try to find any renewal that might possibly be related to a registration, and make the grey area as big as possible. We know that a majority of 1924-1963 books will always come up "no renewal", because there are way more registrations than renewals. We can deal with those and then take a closer look at the grey area.
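Here's roughly what "casting a wide net" could look like in code. This is a sketch under assumptions, not the matching logic my library uses: it compares titles only, uses the standard library's `difflib.SequenceMatcher` as the similarity measure, and picks a deliberately low, arbitrary threshold of 0.5.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough 0-to-1 similarity between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_renewals(registration_title, renewal_titles, threshold=0.5):
    """Return (title, score) pairs at or above the threshold, best first.

    The threshold is deliberately low so that doubtful matches land in
    the grey area instead of being silently dropped.
    """
    scored = [(t, similarity(registration_title, t)) for t in renewal_titles]
    hits = [(t, s) for t, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Made-up titles for illustration.
renewals = ["The Example Book", "A Completely Different Work"]
hits = candidate_renewals("Example Book", renewals)
```

A registration with an empty `hits` list goes in the "no renewal" pile. Anything else gets a closer look, either from stricter business logic or from a human.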

Other media

A couple of people asked whether it was possible to do this for other media. The good news is that there are volumes of the Catalog of Copyright Entries for:

"Books, Pamphlets, Serials, and Contributions to Periodicals"
"Periodicals"
"Drama and Works Prepared for Oral Delivery"
"Music"
"Maps and Atlases"
"Works of Art; Reproductions of Works of Art; Scientific and Technical Drawings; Photographic Works; Prints and Pictorial Illustrations"
"Commercial Prints and Labels"
"Motion Pictures and Filmstrips"

All of these books have scans hosted at the Internet Archive. You can get an overview by looking at Penn's index of the CCE from a specific year, let's say 1960.

As far as I know--and I do know about one big exception--the rules here are the same as for books. If something wasn't registered, or the registration wasn't renewed, then the copyright on a work first published in the US between 1924 and 1963 has lapsed.

Now, the bad news. We have scans of the Catalog of Copyright Entries, but the only part where both the registrations and the renewals are machine-readable is "Part 1 Class A". That's the "Books" part of "Books, Pamphlets, Serials, and Contributions to Periodicals", and it represents only about 30% of the total.

If you want to see whether there's a renewal record for a fishing map of Kansas, or a magazine article, or a cool retro ad, or a classic film noir, or a vintage restaurant placemat, it is quite possible, but it's a huge pain. And you can forget about running the numbers on all the movies or all the restaurant placemats. We don't have a good picture of what's in there.

The situation is this way because the Catalog of Copyright Entries is huge, and digitizing it is boring/expensive. Up to this point, book nerds are the only nerds who've put in the time and money to make "their" part of the CCE machine-readable. NYPL has plans to give this same treatment to the entire CCE, but the crucial part of the plan where we have money to pay someone to do this is currently missing; it's a matter for fundraising.

The second piece of bad news regards music. When we in 2019 think about "music", we think of sound recordings. When the CCE thinks about "music", it's thinking about the underlying composition—basically the stuff that would go on the sheet music. Until 1972 there was no federal-level copyright on sound recordings, and the result is that music copyrights are a bigger mess than other types of copyright. I do not want to get into territory I don't understand, but suffice to say that for a vinyl record to be in the public domain, it's necessary but not sufficient that the copyright on the underlying composition have expired. So the CCE can only help so much.

News You Can Bruise for 2019 August 9 (entry 0)
