The Missing 11th of the Month

2015-12-29 | Culture

In months other than September, the 11th is mentioned substantially less often than any other date. It's been that way since long before 9/11 and I have no idea why.

Source xkcd. Image licensed under CC-BY-NC.

On November 28th, 2012, Randall Munroe published an xkcd comic that was a calendar in which the size of each date was proportional to how often each date is referenced by its ordinal name (e.g. “October 14th”) in the Google Ngrams database since 2000. Most of the large days are pretty much what you would expect: July 4th, December 25th, the 1st of every month, the last day of most months, and of course a September 11th that shoves its neighbors into the margins. There are not many days that seem to be smaller than the typical size. February 29th is a tiny speck, for instance. But if you stare at the comic long enough, you may get the impression that the 11th of most months is unusually small. The title text of the comic concurs, reading “In months other than September, the 11th is mentioned substantially less often than any other date. It’s been that way since long before 9/11 and I have no idea why.” After digging into the raw data, I believe I have figured out why.

First I confirmed that the 11th is actually interesting. There are 31 days and one of them has to be smallest. Maybe the 11th isn’t an outlier; it’s just on the smaller end and our eyes are picking up on a pattern that doesn’t exist. To confirm this is real, I compared actual numbers, not text size. The Ngrams database returns the total number of times a phrase is mentioned in a given year normalized by the total number of words published that year. The database only goes up to the year 2008, so it is presumably unchanged from when Randall queried it in 2012.

I retrieved the count for each day of the year (January 1st, January 2nd, etc.) and took the median over the months for each day (median of January 1st, February 1st, etc.) for each year. This summarizes how often the 11th and the other 30 days of the month appear in a given year. Using the median prevents outlier days like July 4th from dragging up the average for its corresponding ordinal (the 4th). Only if an ordinal is unusual for at least 6 of the 12 months will its median appear unusual.
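The median-over-months step can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the `counts` layout, the `median_by_ordinal` name, and all of the numbers are made up for the example.

```python
from statistics import median

def median_by_ordinal(counts):
    """Return {day_of_month: median frequency across the months that have that day}.

    `counts` maps (month, day) to a frequency for one year (hypothetical layout).
    """
    by_day = {}
    for (month, day), freq in counts.items():
        by_day.setdefault(day, []).append(freq)
    return {day: median(freqs) for day, freqs in by_day.items()}

# Made-up frequencies: every month mentions the 4th equally often, except July.
counts = {(month, 4): 1.0 for month in range(1, 13)}
counts[(7, 4)] = 50.0  # a July 4th-style outlier

medians = median_by_ordinal(counts)
mean_4th = sum(counts[(m, 4)] for m in range(1, 13)) / 12
```

With these made-up numbers the single outlier pulls the mean for the 4th well above 1.0, while the median stays at the value shared by the other eleven months.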

I took the median for each ordinal over the years 2000-2008. The graph below is a histogram of the 31 medians. The 1st of the month stands out far above them all and the 15th just barely distinguishes itself from the remainder. Being the first day and the middle day of the month, these two make sense. However, the 11th stands out as the lowest by a significant margin (p-value < 0.05), with no immediate explanation.

histogram_2000-2008

This deficit has been around for a long time. Below are all the ordinals for every year in the data set, 1800-2008. The data is smoothed over eleven years to flatten out the noise. Even at the beginning, the 11th is significantly lower than the main group. This mild deficit continues for a few decades and then something weird happens in the 1860s; the 11th suddenly diverges from its place just below the pack. The gap between the 11th and the ordinary ordinals expands rapidly until the 11th is about half of what one would expect it to be throughout the first half of the twentieth century. The gap shrinks in the second half of the twentieth century, but still persists at a smaller level until the end.
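For readers who want to try the smoothing themselves, here is one way an eleven-year smoothing could be done. The post does not say which kind of window was used, so the centered moving average and the truncation at the ends of the series are assumptions, and the `smooth` function and the sample series are made up for illustration.

```python
def smooth(values, window=11):
    """Centered moving average; the window is truncated near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# A flat made-up series with a single noisy spike in the middle.
series = [1.0] * 21
series[10] = 12.0
smoothed = smooth(series)
```

The spike of 12.0 is spread across the eleven points whose windows contain it, so no single year dominates the plotted line.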

ordinals_1800-2008

Astute graph readers will notice that something else weird is going on. There are four other lines that are much lower than they should be. From highest to lowest, they are the 2nd, the 3rd, the 22nd, and the 23rd. They were even lower than the 11th from 1800 until the 1890s. However, starting around 1900, their gaps started shrinking even as the 11th diverged until the gap disappeared completely in the 1930s. There is an interesting story there, but because their effect doesn’t persist to the present, I’ll continue to focus on the 11th and leave the others for a future post.

Typographical hijinks

When I began this study, I was hoping to find a hidden taboo of holding events on the 11th or typographical bias against the shorthand ordinal. Alas, the reason is far more mundane: a numeral 1 looks a lot like a capital I or a lowercase l or a lowercase i in most of the fonts used for printing books. An 11 also looks like an n, apparently. Google’s algorithms made mistakes when reading the 11th from a page, interpreting the ordinal as some other word.

We can find some of these mistakes by directly searching for nonsense phrases like March llth or July IIth or May iith. There are nine possible combinations of I, l, and i that an 11 could be mistaken for. Five of them can actually be found in the database for at least one month: IIth, Ilth, iith, lith, and llth. Also found were 1lth, 1ith, and l1th, in which only one digit was misread. I collectively refer to these errors as xxth. Google Books queries a newer database than the one on which Ngrams was built, but bona fide examples of the misreads can still be found. Here is something that Google Books thinks says January IIth: . And here is one for February llth: . And finally one for March lith: . There are hordes of these in the database. You can find other ordinals that were misread as well, but the 11th with its slender and ambiguous 1s was misread far more often than the others.
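The count of nine comes from two positions with three look-alike letters each. A short Python sketch can enumerate the candidates; the variable names are made up for illustration, and the lists are only the search candidates, not the results of any query.

```python
from itertools import product

LOOKALIKES = "Ili"  # capital I, lowercase l, lowercase i

# Every way to replace both digits of "11" with a look-alike letter.
both_misread = ["".join(pair) + "th" for pair in product(LOOKALIKES, repeat=2)]

# Every way to replace exactly one of the two digits.
one_misread = [a + b + "th"
               for a, b in product("1" + LOOKALIKES, repeat=2)
               if (a == "1") != (b == "1")]
```

Each candidate can then be paired with a month name (March llth, July IIth, and so on) and looked up in the database.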

I added back in every instance of January IIth, January llth, etc. to January 11th and did the same to the other months. The graph below shows that the 11th gets a big boost by adding back the nonsense phrases. Before the 1860s, the difference between the 11th and the main group is erased. After the 1860s, about a quarter to a third of the difference is erased.
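The add-back step amounts to summing the frequencies of the misread phrases into the correct one. The sketch below assumes a simple phrase-to-frequency mapping for a single year; that layout, the `corrected_11th` name, and the numbers are hypothetical and not taken from the analysis code.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Misread spellings of "11th" that the post reports finding in the database.
XXTH = ["IIth", "Ilth", "iith", "lith", "llth", "1lth", "1ith", "l1th"]

def corrected_11th(freq, month, variants=XXTH):
    """Frequency of '<month> 11th' plus the frequency of each misread variant.

    `freq` maps a phrase to its frequency in one year (hypothetical layout);
    phrases missing from the mapping count as zero.
    """
    phrases = [f"{month} 11th"] + [f"{month} {v}" for v in variants]
    return sum(freq.get(p, 0.0) for p in phrases)

# Made-up frequencies for a single year.
freq = {"January 11th": 4.0, "January llth": 1.5, "January IIth": 0.5}
corrected = {month: corrected_11th(freq, month) for month in MONTHS}
```

Repeating this for every year gives a corrected series for each month's 11th, which can then go through the same median step as the other ordinals.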

xxth_1800-2008

To the nth degree

So where did the rest of the missing 11th go? Well, starting in the 1860s, the Google algorithm starts to make a rather peculiar error—it misreads 11th as nth. Here is one example from a page full of January nths: . In some years, the number of incorrect reads actually exceeds the number of correct reads. I added January nth to January 11th and did the same for all the months. The graph below shows both the nth and its sum with the 11th. There was little impact before the 1860s, but then this error alone accounts for nearly all of the missing 11th.

nth_1800-2008

Combined graph

When the xxth misreads and nth misreads are both added back into the 11th, the gap disappears across the entire timeline and the 11th looks like an ordinary day of the year. This suggests that the misreading of the 11th as nth, IIth, llth, etc. is sufficient to explain the unusually low incidence of the 11th as a day of the month.

total_1800-2008

Typographical machines

While it makes sense that the 11th was misread more than others, why is the misread rate not uniform? What happened in the 1860s that caused the dramatic rise in the error rate? I suspect that it has something to do with a special device invented in the 1860s: the typewriter. The earliest typewriters did not have a separate key for the numeral 1. Typists were expected to use the lowercase l to represent a 1. When the algorithm read October llth, it was far more correct than we have been giving it credit for. There are not that many documents in Google Books that are typewritten, but this popular new contraption had a powerful effect on the evolution of fonts. The 1 and l were identical on the increasingly familiar typewriters, and the fonts even of printed materials began to reflect this expectation. Compare the ls and 1s in this font from 1850: . There is a clear difference between an l with no serifs on the top and the 1 with a pronounced serif. Compare that to a font from 1920: . The characters are identical except for the kerning. Even to this day, most fonts represent both the 1 and the l as tall characters with two serifs on the bottom and one left-facing serif at the top. The only difference is that the serif on the 1 is slightly more angled than on the l. (In this post, I used a special monospace font to make it easier to tell the difference.) The print quality of more recent books (post-1970s) has reduced the rate of failure, but it has not gone away entirely, and the remaining failures were noticeable in the xkcd comic.

The largest open question is why nth was chosen so often. It seems like such a strange error to make. The word nth is a legal word in mathematical and scientific publications, so that should help its chances of getting picked. In most fonts the top of the n is really thin, and is likely invisible in many texts on which they trained the algorithm. But there is a big difference in height between 1 and n, especially in the typewriter era, which is where the errors occur. And the phrase January nth is nonsense, so that should have hurt its chances of being selected. Is it possible there was an error in one of the modern training texts that had an 11th labeled as nth, thereby confusing the algorithm? The only way to know for sure would be to crack open the source code of Google’s text-reading algorithm. This is left as an exercise for the reader.

The code used for the analysis in this post is available on GitHub.

Wurfenking

A very random yet interesting read. I don’t really know what else to comment as I don’t know how to react to this.
CC

A shameful, flimsy cover story for the sacred meeting day of the Illuminati.
Alec Kohl

Are you sure the font reading is confusing 11 for ‘n’ and not for ‘N’? There are several serif fonts where the diagonal bar in ‘N’ is veeery thin, to where the little serif bits on the top and bottom of the number 1 would probably lead the text reading algorithm to think that the two parallel lines are creating an ‘N’ and the serif bits it sees are just hints to there being a diagonal bar between the lines. This would also explain your confusion about the height of the characters, since the height of ‘N’ is the same as the 1s.
- David Hagen
  
  The Google Ngrams database is case sensitive, so it should be picking up only on the lowercase version. Here is a search for January 11th, January nth, and January Nth. There are actually no hits at all for the Nth version.
  
  It is baffling. The only thing that nth has going for it is that it is an actual word, unlike most of the ways for the algorithm to misread it.
- yehosef
  
  You can go to google books and search for “January nth” and see the results. It’s highlighting “11th”
habeas corpus

JET FUEL CAN NOT MELT STEEL BEAMS
- L’esprit de l’escalier
  
  HEAT DOES NOT WEAKEN METAL
  
  oh wait yes it does ooops
- David Hagen
  
  Related fact: one of my high school science projects was testing different methods of protecting steel from fire. At around 800 C, steel turns to rubber.
Urgh2012

I love this type of person.
Arne Babenhauserheide

great finding!
Joshua Dance

Awesome. That was a lot of work. And makes sense. Thanks!
Rikaishi Rikashi

I knew it! Google really is the source of everything wrong in the world.
Peter Hassett

great work!
Shreevatsa R

I’m curious about the 2nd / 3rd / 22nd / 23rd. I suspect it might have something to do with the -nd/-rd suffixes (which is why 12th and 13th are not affected), but don’t have a complete theory. Can you give hints? :)
hacksoncode

While I don’t know how common this might actually be… as a programmer I can see an actual use for the phrase “the nth of the month”, or just “January nth”.
- David Hagen
  
  I agree that it is conceivable. I worried about it enough to look through a lot of books for a legitimate use and came up empty. Maybe there’s one or two in there, but definitely not enough to affect the trend.
NedH

Very interesting and well written, kudos. I’ve run into this many times researching through old scanned newspapers; it makes total sense now. I suspect the low usage of ordinals ending in -nd or -rd (e.g. 2nd, 3rd) between 1800 and 1930 is due to stylistic preferences. It was more common early on to just use -d (e.g. 2d, 3d).
- NedH
  
  Took a quick look with the N-Gram viewer and the use of -d vs -nd /-rd seems to account for the differences relative to -th usage. Here’s a representative example overlay using January 2nd, 2d, 4th.
  - NedH
    
    Lol, damn I see you’ve already been discussing this on reddit. :)
    - David Hagen
      
      For those who don’t have a Disqus account or just want to read the Reddit comments, here is the thread on /r/xkcd.
  - David Hagen
    
    And we have another winner!
  - NedH
    
    Cool, Apparently you can also use operators on ngrams. Summed January 2nd + January 2d and compared with January 4th.
Alex Reynolds

I was wondering if you have done the same analysis with “1st” and “lst”, to see if the same count-correction on “1”-labeled days would have an effect with that typo?
- David Hagen
  
  I did only a little bit of digging into this. Because the 1st is such an outlier, you would never expect to see a deficit (any misreads would just take a little off the top of the mountain). That being said, you can find misreads. Strangely, it almost never misreads it as a lowercase l or capital I, but it does fairly often misread it as a lowercase i.
  - Robert Kosara
    
    This makes a lot of sense to me. “ist” is a possible fragment when a word is hyphenated, “lst” or “Ist” aren’t. They don’t just look at individual letters to recognize words, but also at matches with some sort of dictionary. That would also explain why “nth” shows up, since it’s a likely dictionary entry (for a tech company/background, anyway).
gypsydoctor

Touching characters in proportional fonts are very difficult in OCR, for example rn and m (r n and m). 11 with serifs on the bottom could be confused with U.
- David Hagen
  
  Uth is actually a fairly common error.
  - Arcadia Berger
    
    Uth is wasted on the Ung.
Navneeth

Nice analysis. Is the reason for the (double) peak around 1900 known?
- David Hagen
  
  That is more difficult to analyze because it requires looking at the database as a whole and not just at individual phrases. The overall trends in the use of dates should be largely driven by the types of publications that go into the database and preferences for the style “January 14th” over “1/14” or “Jan 14”. Remember, it is normalized by the total number of words in the database for that year. So you don’t even need a change in the usage of dates to get a change in the fraction; maybe there was an explosion of vampire fiction or scientific publications that drove down the total fraction of date usages.
Peter

Thank you. This would have made a nice quiz for a job interview at Google before they stopped doing these.
Arcadia Berger

I want to know what caused references to all days of the month to decline after the 1930s and trough so neatly in 2000.
- David Hagen
  
  Because the whole database is normalized to total number of words published each year, I can think of three possibilities:
  
  (1) use of dates in general declined, possibly due to faster news, which makes writers more likely to say “last Tuesday” or “yesterday” than “On January 12th, the legislature passed a bill…”
  
  (2) use of that style declined in favor of something else like “12th of January” or “Jan 12th” or “January twelfth” or “1/12” or “12/1”
  
  (3) more stuff was published that doesn’t require dates, like scientific or entertaining publications.
  
  I would guess that all three contributed to some degree, with (2) probably having the largest impact. But that is pure speculation.
Alora

In your conclusion you specifically use a lower case n in your nths. Was that intentional? I wonder if the confusion is more likely of an upper case N. In a sanserif monotype font with 1s (one that doesn’t include the ‘angled serif’) it seems more likely that 11th is misread as Nth than nth. Did you look at the case of the Ns during your research?
- David Hagen
  
  See my answer to Alec Kohl. In short, the database is case sensitive, so it should be nth and not Nth, as strange as that may seem.
obryan

But wait — did you check for other error types? Like, February B (13 combined) or December I7 (17 where I is replacing the 1), etc?

I feel like if 11 was messed up by Google’s text-reading algorithms then so must others
Tim Rosenblatt

David, you are a champion. This is really well thought out.
Ahmad Lashgar

How about the other four lines?!
Jerzy Wieczorek

Nice detective work! Reminds me of the (even more scary) similar mistakes made by scanners inside some photocopiers — so you think you’re getting an exact copy but the numbers/letters might actually be changing:
https://twitter.com/flamsmark/status/638832642897526789
Anonymouse

Haha. Awesome.
Christian Carl Schley

I wonder if a greater prevalence of text figures could partially account for the error rate. “11th” written with its numerals at x-height looks quite similar to “nth.” Were it so, I suppose one would also find errors reading a 1 as an i, 2 as z, and 0 as o.
Karl Siegemund

There are fonts where the numerals are designed to show a similar pattern of ascenders and descenders to the letters, so they blend better with the normal text. Our normal numerals all look like upper case letters in the usual fonts, which can lead to awkward-looking pages if they appear within text.

In those fonts with the distributed ascenders and descenders, the 1 looks like an i without the dot, like the ı for instance in the Turkish language. If you have an 11 in a font like this, it looks like ıı, and if your OCR program doesn’t know about that peculiarity, it would interpret it as an n rather than 11.
Brass Lion

A fantastic post. It’s always good to remember, before you go looking for an explanation of a nonsensical-seeming phenomenon, to make sure it’s not just phantasms in the data.

David R Hagen