I sometimes work on Aegisub, an open-source subtitle editor. A while back, some users reported a bug where if they attempted to time subtitles to video ripped from Crunchyroll (henceforth CR), the timing and preview in Aegisub did not reflect what would occur in playback—it was off by two frames. As it turns out, this seemingly simple bug was quite the rabbit hole. I’ve narrated below the actual process I took to figure this out. In addition, I’ve tried to include brief explanations of relevant video encoding topics. Hopefully between the two, it’ll prove both informative and interesting.

The first step I took was to download a sample video, in this case a random episode of 僕のヒーローアカデミア. Sure enough, the bug was easy to reproduce locally. The reason why was quickly made obvious upon looking at the file: there appeared to be 83ms of video delay set. Typically, you set the delay on the audio track instead as it’s better supported, makes more intuitive sense, and the value is relative anyway. Some brief debugging revealed that the relevant Aegisub dependency (ffms2) wasn’t properly handling a delayed video track. Easy enough to deal with, right? File an issue with the relevant repo and call it a day.

However, every file ripped from CR had the same characteristic: a flat 83ms video delay. There’s no obvious reason for that, and it means that anyone watching CR video rips is seeing incorrect subtitle timing and has been for quite some time. The files were otherwise normal, with MediaInfo reporting a constant video framerate of 24000/1001, or roughly 23.976[1]. My curiosity got the best of me, and I decided to see what was going on.

First, I wanted to make sure I could reproduce this with a video I ripped myself. Doing so is pretty easy with youtube-dl, so I grabbed the same episode off the CR site directly, muxed into MKV with mkvmerge, and sure enough: 83ms of video delay. Maybe it’s the FFmpeg HLS downloader? Nope, same behavior with the native one. Using MediaInfo on the MP4 files wasn’t even showing any delay! The next step was to start looking directly at the MP4 metadata in the ripped file, and suddenly things started to make a bit more sense.

As mentioned earlier, the videos are 23.976 fps, which is standard for anime. HLS with MPEG-TS, however, can’t match this frame rate exactly. The ripped MP4 lists a sequence of paired frame counts and corresponding frame time deltas; dividing the time scale specified elsewhere in the file by a delta gives the frame rate for those frames. The time scale used for all of CR’s video is 90000, which is mandated for MPEG-TS. An exact 24000/1001 fps would need a delta of 3753.75, so no integer delta value can produce 23.976; the closest two are 3754 (23.9744) and 3753 (23.9808). When encoding to HLS the encoder simply alternates between these two deltas, in this case using a 1-3-1-3 pattern (one frame at 3753, then three at 3754), which averages out to the original framerate. Upon checking the track metadata, most of it seemed to be just that. An example from the file I tested is below, which will hopefully make what I’m talking about a little more clear:

sampleCount[1] = 1 (0x00000001)
sampleDelta[1] = 3753 (0x00000ea9)
sampleCount[2] = 3 (0x00000003)
sampleDelta[2] = 3754 (0x00000eaa)
sampleCount[3] = 1 (0x00000001)
sampleDelta[3] = 3753 (0x00000ea9)
sampleCount[4] = 3 (0x00000003)
sampleDelta[4] = 3754 (0x00000eaa)
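
As a quick sanity check on that arithmetic, here is a small Python sketch. The function names are my own for illustration; only the time scale and the count/delta pairs come from the dump above.

```python
from fractions import Fraction

TIME_SCALE = 90000  # ticks per second, as mandated for MPEG-TS

# One repetition of the alternating pattern: (sampleCount, sampleDelta)
PATTERN = [(1, 3753), (3, 3754)]


def fps_for_delta(delta):
    """Frame rate implied by a single per-frame delta."""
    return Fraction(TIME_SCALE, delta)


def average_fps(entries):
    """Average frame rate over a list of (count, delta) entries."""
    frames = sum(count for count, _ in entries)
    ticks = sum(count * delta for count, delta in entries)
    return Fraction(TIME_SCALE * frames, ticks)


if __name__ == "__main__":
    print(float(fps_for_delta(3754)))  # slightly below 23.976
    print(float(fps_for_delta(3753)))  # slightly above 23.976
    print(average_fps(PATTERN))        # exact fraction for the 1-3 pattern
```

Using Fraction rather than floats makes it easy to see that the pattern lands on 24000/1001 exactly rather than merely close to it.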

The very first entry, however, was odd. Instead of following the pattern, it listed 2 frames with a delta of 3754, as shown below:

sampleCount = 2 (0x00000002)
sampleDelta = 3754 (0x00000eaa)

MediaInfo and mkv2vfr were both reporting that the MKV file used a constant FPS of 23.976, as opposed to the alternating pattern in the MP4. My initial guess was that the muxer noticed the pattern and attempted to fix the FPS, but didn’t know how to handle those initial two samples. However, after dumping the full MKV info, it appeared the timecodes were just copied exactly from the MP4. This prompted me to check the MP4 metadata more thoroughly[2], and sure enough, those first two samples were actually just the specified 83ms of delay.

This means mkvmerge’s behavior is correct: the MP4 delays the video by 83ms, so the MKV does the same, and various tools just aren’t reporting all the info on both files correctly[3]. Unfortunately, this still leaves the question of why the rip includes the delay in the first place. As it turns out, it’s present in a 2012 RTMP stream someone had archived, so it’s not even new behavior![4]

Before we go further, I need to explain B-frame delay in MP4s. In H.264 there are three major frame types. I-frames, or intra-coded frames, contain a full frame’s data and are independent of any other frame. P-frames, or predictive frames, reference prior frames to improve compression. B-frames, or bidirectional frames, reference frames in both directions to further improve compression. However, the use of B-frames causes an issue in decoding: you have to read the later frame before the B-frame itself to be able to decode it, but don’t want to be jumping all around the file while reading. As a result, frames are stored in decode order, which differs from the order in which they are presented.

In MP4, each sample has a specified DTS (decoding timestamp) and CTS offset (composition timestamp offset). To get the CTS/PTS (presentation timestamp), you just add the two together. However, there’s one issue: the offset can’t be negative since it’s stored as an unsigned integer. This is handled by shifting every CTS offset up by a fixed amount, so that even a frame that has to be presented before its position in decode order never needs an offset below 0. This shift is the aforementioned B-frame delay. However, to ensure the tracks are still in sync, it has to be corrected elsewhere in the file. This is done in an atom, or section, known as the elst (edit list).

TL;DR: The video track is offset in one portion of the file for technical reasons and then later corrected in another one called the elst.
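
To make that a bit more concrete, here is a minimal Python sketch of the idea. Everything in it is hypothetical and chosen for illustration: a four-frame I/P/B/B group, a frame duration of 1001 ticks (as if the time scale were 24000), and a one-frame shift. Real encoders and muxers may pick different delays.

```python
FRAME_TICKS = 1001  # hypothetical frame duration (time scale 24000)

# Frames listed in decode order as (type, display index).
# Display order is I0 B1 B2 P3, but P3 must be decoded before the B-frames.
DECODE_ORDER = [("I", 0), ("P", 3), ("B", 1), ("B", 2)]


def build_timestamps(decode_order, frame_ticks):
    """Return (dts, cts_offsets, pts, shift) with no negative offsets."""
    dts = [i * frame_ticks for i in range(len(decode_order))]
    wanted_pts = [display * frame_ticks for _, display in decode_order]
    raw_offsets = [p - d for p, d in zip(wanted_pts, dts)]
    # Shift every offset up just enough that the smallest one is 0.
    shift = -min(min(raw_offsets), 0)
    cts_offsets = [o + shift for o in raw_offsets]
    pts = [d + o for d, o in zip(dts, cts_offsets)]
    return dts, cts_offsets, pts, shift


def apply_edit_list(pts, media_time):
    """Undo the shift, as an elst entry with this media time would."""
    return [p - media_time for p in pts]


if __name__ == "__main__":
    dts, offsets, pts, shift = build_timestamps(DECODE_ORDER, FRAME_TICKS)
    print("stored offsets:", offsets)
    print("pts before elst:", pts)
    print("pts after elst:", apply_edit_list(pts, shift))
```

Skipping the apply_edit_list step is the kind of omission that leaves the whole video track late by the size of the shift.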

To figure out why the rip has the delay, I had to check out the actual MPEG-TS information before it was remuxed to MP4. And here, finally, the pieces all came together. Judging by the naming scheme (/assets/cbe8b975c5f632d7b05b70c61a018a3d_3382512.mp4/seg-1-v1-a1.ts for example), CR has their original MP4s stored and then their CDN generates the HLS on the fly. This is pretty common and makes sense for their setup. The CDN appears to be Akamai based on the domain, so presumably CR is just paying them to handle this step entirely. Unfortunately, Akamai appears to be doing it wrong. While they take into account the specified CTS offsets (that is, they’re calculating the base presentation time correctly), they’re not including the elst information. As such, you get something like the following (from ffprobe):

"packets": [
    {
        "codec_type": "video",
        "stream_index": 0,
        "pts": 16598,
        "pts_time": "0.184422",
        "dts": 9090,
        "dts_time": "0.101000",
        "size": "1220",
        "pos": "376",
        "flags": "K_",
        "side_data_list": [
            {
                "side_data_type": "MPEGTS Stream ID"
            }
        ]
    },

...

    {
        "codec_type": "audio",
        "stream_index": 1,
        "pts": 9090,
        "pts_time": "0.101000",
        "dts": 9090,
        "dts_time": "0.101000",
        "duration": 2089,
        "duration_time": "0.023211",
        "size": "378",
        "pos": "2256",
        "flags": "K_",
        "side_data_list": [
            {
                "side_data_type": "MPEGTS Stream ID"
            }
        ]
    },

The first video packet’s presentation time is 16598 - 9090 = 7508 ticks later than the first audio packet’s, and with a specified time scale of 90000, 7508 / 90000 = 0.08342222 seconds, or 83ms. This mistake was causing every file derived from these streams to have the bizarre video delay, which then produced inconsistent results across players and editing applications, resulting in this whole mess. Since the conversion is occurring automatically for every video in CR’s catalog served via the CDN, this affects every video on the site[5]. Akamai: please fix your shit.
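
The same numbers can be checked in a few lines of Python. The timestamps and deltas are the ones quoted in this post; the variable and function names are just mine.

```python
from fractions import Fraction

TIME_SCALE = 90000          # MPEG-TS ticks per second
FIRST_VIDEO_PTS = 16598     # from the ffprobe output above
FIRST_AUDIO_PTS = 9090      # from the ffprobe output above
FPS = Fraction(24000, 1001)


def delay_ticks(video_pts, audio_pts):
    """How many ticks later the video starts compared to the audio."""
    return video_pts - audio_pts


def delay_seconds(video_pts, audio_pts, time_scale=TIME_SCALE):
    return delay_ticks(video_pts, audio_pts) / time_scale


def delay_frames(video_pts, audio_pts, time_scale=TIME_SCALE, fps=FPS):
    """Delay expressed in frames at the nominal frame rate."""
    return float(Fraction(delay_ticks(video_pts, audio_pts), time_scale) * fps)


if __name__ == "__main__":
    print(delay_ticks(FIRST_VIDEO_PTS, FIRST_AUDIO_PTS))
    print(delay_seconds(FIRST_VIDEO_PTS, FIRST_AUDIO_PTS))
    print(delay_frames(FIRST_VIDEO_PTS, FIRST_AUDIO_PTS))
```

The tick difference is also exactly 2 x 3754, which lines up with the odd first entry of 2 samples at a delta of 3754 in the MP4 and with the two-frame offset users originally reported.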

On the bright side, the fix for users is simple! Just remux the video, manually specifying the fps in the process (--default-duration 0:24000/1001fps with mkvmerge) and the problem goes away.
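
If you want to script that, something like the following Python sketch would assemble the command. The file names are placeholders, and it assumes the video is track 0 as in the option above; check the track IDs of your own file before relying on it.

```python
import shlex
import subprocess


def build_remux_command(source, output, track_id=0, fps="24000/1001"):
    """Build an mkvmerge invocation that forces a track's default duration."""
    return [
        "mkvmerge",
        "-o", output,
        "--default-duration", f"{track_id}:{fps}fps",
        source,
    ]


def remux(source, output):
    """Run mkvmerge; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_remux_command(source, output), check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command.
    print(shlex.join(build_remux_command("episode.mp4", "episode.fixed.mkv")))
```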

Oh, and this took me ages to actually figure out. Multimedia is hell.


Footnotes:

  1. This seemingly strange value exists for historical reasons related to the transmission of electronic signals. For more information, see https://en.wikipedia.org/wiki/NTSC#Color_encoding.

  2. Most importantly, I checked the elst atom. It actually had two entries, the first of which was 83ms duration with a specified media time of -1, which means to just delay the track.

  3. In at least some cases this is probably by design to simplify things for users, but it sure makes it annoying when this behavior is inconsistent across tools and even with the same tool on different file formats.

  4. Looking through that file as part of this actually uncovered some FFmpeg bugs in both writing and reading for FLV files with B-frames, as the delay there was a separate issue.

  5. Please don’t cite this post as an example of some other streaming service’s supposed superiority over Crunchyroll. Every pirate site I’ve seen has far larger issues, and to the best of my knowledge every other legal option has significant problems too.