
Secretly Public Domain: "Fun facts" are, sadly, often less than fun. But here's a genuinely fun fact: most books published in the US before 1964 are in the public domain! Back then, you had to send in a form to get a second 28-year copyright term, and most people didn't bother.

This is how Project Gutenberg is able to publish all these science fiction stories from the 50s and 60s. Those stories were published in issues of magazines that didn't send in the renewal form. But up til now this hasn't been a big factor, because 1) the big publishers generally made sure to send in their renewals, and 2) it's been impossible to check renewal status in bulk.

Up through the 1970s, the Library of Congress published a huge series of books listing all the registrations and the renewals. All these tomes have been scanned -- Internet Archive has the registration books -- but only the renewal information was machine-readable. Checking renewal status for a given book was a tedious job, involving flipping back and forth between a bunch of books in a federal depository library or, more recently, a bunch of browser tabs. Checking the status for all books was impossible, because the list of registrations was not machine-readable.

But! A recent NYPL project has paid for the already-digitized registration records to be marked up as XML. (I was not involved, BTW, apart from saying "yes, this would work" four years ago.) Now for anything that's unambiguously a "book", we have a parseable record of its pre-1964 interactions with the Copyright Office: the initial registration and any potential renewal.

The two datasets are in different formats, but a little elbow grease will mesh them up. It turns out that eighty percent of 1924-1963 books never had their copyright renewed. More importantly, with a couple caveats about foreign publication and such, we now know which 80%.

This was announced back in May, but I don't think it got the attention it deserved. This is a really big deal, so I had no choice but to create a bot. Here's Secretly Public Domain, which highlights unrenewed works that have already been scanned for Hathi Trust. This only represents 10% of the 80%, but it's the ten percent most likely to be interesting, and these books have the easiest path towards being available online.

August 9 update: topline number is closer to 73%, next steps for the public domain books, and how to get the data on your own computer.


Comments:

Posted by Irina at Mon Jul 22 2019 14:23

Argh, I'm currently in Germany so I can't access Gutenberg at all! But it's a temporary state so I'll bookmark the page in order to check when I'm back home.

Posted by Hartwig Thomas at Mon Jul 29 2019 02:16

What would it take to apply the same method to music recordings (shellac records)?

Posted by Leonard at Thu Aug 01 2019 10:56

The Catalog of Copyright Entries has volumes for things other than "books proper". Here are four volumes for music from 1955, two of which might be duplicates: https://archive.org/details/catalogofcopyrig395li/ https://archive.org/details/catalogofcopyrig395libr/ https://archive.org/details/catalogofcopyrig395lib/ https://archive.org/details/CatalogOfCopyrightEntriesSeries3Vol.9Part5cNos.1-2jan.-dec.1955/

Up to this point almost all the attention has been on "books proper". I don't think any of the other data has been made machine-readable, except as a side effect of work on "books proper". The other complication is that -- I'm sure you know about this more than I do -- there's no federal copyright protection for sound recordings until 1972, so sound recordings generally don't have CCE entries.

I asked around and this is what I heard (but I can't confirm): NYPL has money for making the entire CCE machine-readable -- registrations and renewals for everything. However, there's no timeframe for making this happen. On top of that, I don't know if this information would be useful in clearing sound recordings, because I don't know what that process looks like. If it's important to establish that the underlying composition is in the public domain, then machine-readable CCE data would help a lot.

Posted by Beatnik at Thu Aug 01 2019 14:20

Very cool... but when I checked several of those 'public domain gems' from Secretly Public Domain at Hathi Trust, they were still listed there as under copyright, and thus locked down against any downloads except for individual pages.

Will Hathi be updating their rights status/accessibility any time soon? Thx!

Posted by Greg Cram at Thu Aug 01 2019 14:37

One clarification--NYPL does not yet have the resources in hand to convert the non-books records. Outside of our project, some of the other CCE data has been transcribed. We'd like to do the entire CCE, so we're actively fundraising for those efforts.

At this point, about 70% of the 450,000 pages that make up the CCE are left to review and convert. We have a lot of work to do to extract the rest of the data, but Leonard's work demonstrates the value of our transcription and parsing.

As we get more transcription done, we'll keep adding it to the GitHub repo. We'd appreciate knowing how the data is being used, but there's no limitation or restriction on your use of the data.

On the sound recordings, Leonard has it exactly right. Until this year, sound recordings fixed before 1972 would be protected by state law. Many states had indefinite terms for copyright protection, so those would have remained in copyright until 2067. With the new music bill, that changes. Sound recordings published before 1923 will now enter the public domain in 2022. Recordings published after 1923 will follow a series of tiers that will result in more works entering the public domain on an accelerated timetable.

Posted by John Mark Ockerbloom at Thu Aug 01 2019 14:37

In the US, nearly all music recordings aren't public domain yet. Until recently, they were effectively under perpetual copyright, but thanks to recent changes in copyright law, they'll start entering the public domain in 2022. See this Library of Congress post for more details:

https://blogs.loc.gov/now-see-hear/2019/02/copyright-breakdown-the-music-modernization-act/

As noted above, to be fully in the public domain, the copyright to the musical composition will need to have expired, as well as the copyright to the recording. It gets a little complicated.

Likewise, there are complications and various traps for the unwary with book copyrights as well. It isn't enough to simply fail to find a renewal; you also need to check things like whether material in the book was previously published somewhere that might still have an active copyright, whether the book is a foreign work that's exempt from renewal requirements, and various other things. The book _Finding the Public Domain_ by Levine et al. discusses the clearance process in more detail (and it's free to read online).

Posted by Leonard at Thu Aug 01 2019 15:21

Thanks, Greg.

Posted by Leonard at Thu Aug 01 2019 17:58

Based on this data, NYPL has convinced Hathi to remove the restrictions on a number of titles. An example is The Americans in Santo Domingo from 1928. But Hathi have their own verification process, and that's where we go back from fast computer work ("these items in dataset A have no matching item in dataset B") to one-at-a-time lawyer work.

As John points out, the absence of a renewal record isn't enough on its own to know that a work is public domain. In particular, Hathi declined to open up some of the titles we pointed out because they included third-party works. Hathi would have to know whether those third-party works were included with authorization, which is almost impossible to determine 80 years later. (This was one of the arguments for keeping "Happy Birthday" under copyright -- it ended up not working in that case.) And you can't tell whether this might be a problem without going through a copy of the book.

Posted by Patrick Durusau at Thu Aug 01 2019 19:05

What am I missing?

If a book was published 1923 to 1964 and is NOT in the renewed list, then it is not in copyright. Yes?

I'm missing the need to consult the original registration, unless you suspect renewals don't accurately represent the original publication. Yes?

Sorry. It sounds like an excellent data set and project, and I would like to help. Thanks! Patrick

Posted by Leonard at Thu Aug 01 2019 22:42

Patrick, you're right. If a book was published in the US between 1924 and 1964, but not renewed, it is (with the aforementioned caveats) not in copyright. If you're thinking of one particular book, it's very easy to find the answer. All you need is the renewal records. No need to check the original registration. Project Gutenberg has been doing this for a while.

So, why weren't we able, up to this point, to make a complete list of unrenewed books?

Basically, there was no starting point. Nobody had a complete list of all the books published in the US between 1924 and 1964. The renewal records only list the books that aren't public domain. It's like having a list of the people who moved out of a city. You can easily check if any particular person has left, but you can't make a list of everyone who still lives there. You have to start with a list of everyone who used to live there, and subtract the people who left.

We probably never will have a complete list of books published in the US between 1924 and 1964, but there's a pretty good substitute: the list of books registered with the Copyright Office for that time period. That's the list that NYPL just converted into machine-readable format. Now we can get something like a complete list of unrenewed books by subtracting the renewal records from the registration records.

As a bonus, each renewal record has a semi-unique ID linking it to the original registration -- something you wouldn't get if you built your own list of "all of the books" from a library catalog or something. This makes it easy to understand cases where a book was published multiple times by different publishers. It's not 100% reliable -- some IDs were reused for reasons we don't understand -- but it's more reliable than going for a title or author match. And data from the Copyright Office is pretty definitive. It's hard to claim that a book wasn't "really" published in the US if there's a copyright registration for it.
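
The subtraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`regnum`, `renewal_id`, `title`), the sample records, and the `normalize` and `split_by_renewal` helpers are all hypothetical, and are not taken from the NYPL data or from the scripts in the Git repository. It matches on the registration ID rather than on title or author, as suggested above.

```python
# Hypothetical sample data: field names and values are invented for
# illustration and do not come from the real registration/renewal datasets.
registrations = [
    {"regnum": "A100001", "title": "First Example Book", "year": 1931},
    {"regnum": "A100002", "title": "Second Example Book", "year": 1940},
    {"regnum": "a 100003", "title": "Third Example Book", "year": 1955},
]
renewals = [
    {"renewal_id": "R200001", "regnum": "A100002"},
]


def normalize(regnum):
    """Strip spaces and uppercase an ID so formatting differences don't block a match."""
    return regnum.replace(" ", "").upper()


def split_by_renewal(registrations, renewals):
    """Return (renewed, unrenewed) lists of registration records.

    A registration counts as renewed if any renewal record points at
    its registration ID. Everything else is "no renewal found".
    """
    renewed_ids = {normalize(r["regnum"]) for r in renewals}
    renewed, unrenewed = [], []
    for reg in registrations:
        if normalize(reg["regnum"]) in renewed_ids:
            renewed.append(reg)
        else:
            unrenewed.append(reg)
    return renewed, unrenewed


renewed, unrenewed = split_by_renewal(registrations, renewals)
unrenewed_titles = [reg["title"] for reg in unrenewed]
print(unrenewed_titles)
```

Since some IDs were reused, a real version would presumably need a fallback check (title, author, or date) before trusting a match. And "no renewal found" is only a starting point for the caveats already mentioned in this thread, not proof that a book is public domain.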

Posted by Blu Ryder at Sat Aug 03 2019 07:06

I'd love to review a spreadsheet with the 80% of books you suggest are likely in the public domain. Is such a document currently available, or do you plan to make one available?

Posted by Leonard at Sat Aug 03 2019 17:25

I'll write more about this later, but I've put up a Git repository containing Python scripts which automate the process of generating the lists of books that do or don't have renewal records.

Posted by Gary North at Mon Aug 05 2019 15:54

I do not understand how I can check the title/author of a book published in this period in order to see if it is PD. Is there a guide, either printed or on YouTube?

Posted by Matthew Cockerill at Tue Aug 06 2019 12:40

Leonard
The GitHub page mentions
"About 74% definitely have no renewal record."

As Blu Ryder notes, it would be *great* if you could share a snapshot of that 74% as a CSV...

Matt

Posted by Beatnik at Tue Aug 06 2019 19:36

@Gary:
If you want to search a specific title for a renewal (for books published in the US between 1923 and 1963), the previously mentioned Stanford Renewal database is found here: https://exhibits.stanford.edu/copyrightrenewals?forward=home

And a great chart reference on how to understand copyright terms and what might be in the public domain is this website: https://copyright.cornell.edu/publicdomain
Good luck!



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.