Unraveling the JPEG

May 1, 2019

JPEG images are everywhere in our digital lives, but behind the veil of familiarity lie algorithms that remove details that are imperceptible to the human eye. This produces the highest visual quality with the smallest file size—but what does that look like? Let's see what our eyes can't see!

It’s easy to take for granted that you can send a picture to a friend without worrying about what device, browser, or operating system they’re using, but things weren’t always this way. By the early 1980s, computers could store and display digital images, but there were many competing ideas about how best to do that. You couldn’t just send an image from one computer to another and expect it to work.

To solve this problem, the Joint Photographic Experts Group (JPEG), a committee of experts from all over the world, was established in 1986 as a joint effort by the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission)—two international standards organizations headquartered in Geneva, Switzerland.

JPEG, the group of people, created JPEG, a standard for digital image compression, in 1992. Anyone who’s ever used the internet has probably seen a JPEG-encoded image. It is by far the most ubiquitous way of encoding, sending and storing images. From web pages to email to social media, JPEG is used billions of times a day—almost every time we view or send images online. Without JPEG, the web would be a little less colorful, a lot slower, and probably have far fewer cat pictures!

This article is about how to decode a JPEG image. In other words, it’s about what it takes to convert the compressed data stored on your computer to the image that appears on the screen. It’s worth learning about not just because it’s important to understand the technology we all use every day, but also because, as we unravel the layers of compression, we learn a bit about perception and vision, and about what details our eyes are most sensitive to.

It’s also just a lot of fun to play with images this way.

Peering Inside a JPEG

Everything on a computer is stored as a series of binary numbers. Typically, these bits, the zeros and ones, are arranged in groups of eight, known as bytes. When you open a JPEG image on your computer, something (the browser, your operating system, or something else) has to decode the bytes to recover the original image as a list of colors that can then be displayed.

If you download that picture of the cat and open it using any text editor, you’ll see a bunch of garbled characters.

Here I’m using Notepad++ to look at the image file, since common text editors like Windows’ Notepad will change the file’s binary contents when you save it so it’s no longer a valid JPEG.

By opening an image in a text editor, you’ve confused the computer, in the same way you confuse your brain when you rub your eyes too hard and start to see blotches of dimness and color!

These blotches you see—known as phosphenes—don’t come from any light stimulus, nor are they hallucinations made up in your mind. They arise because your brain assumes that any electrical signal arriving through the nerves in your eye is conveying light information. The brain needs to make this assumption because there’s no way to know whether a given signal is sound, sight, or something else. All the nerves in your body carry exactly the same type of electrical pulse. When you apply pressure by rubbing your eyes, you’re sending non-visual signals that trigger the receptors in your eye, which your brain interprets—incorrectly, in this case—as vision. You can literally see the pressure!

It’s fun to think about how similar computers are to our brains, but this is also a useful analogy because it illustrates how much the meaning of data—whether carried through a body by nerves or stored in a computer—relies on how you interpret it. All binary data is made up of ones and zeros, basic components that could be conveying any kind of information. Your computer often guesses how to interpret it using clues, like the file extension. Here we’ve forced it to interpret it as text, because that’s what a text editor expects.
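To make that concrete, here is a minimal Python sketch (not part of the original page) that reads the same handful of bytes in two ways. The byte values are the first eleven bytes of this article's cat image, as they appear in the header listing near the end; the variable names are only illustrative.

```python
# First bytes of the article's JPEG file, copied from its header listing.
data = bytes([255, 216, 255, 224, 0, 16, 74, 70, 73, 70, 0])

# Interpretation 1: treat every byte as a character. Latin-1 assigns one
# character to each of the 256 possible byte values, so this never fails.
as_text = data.decode("latin-1")

# Interpretation 2: treat every byte as a plain number from 0 to 255.
as_numbers = list(data)

print(repr(as_text))  # mostly unreadable, apart from the letters "JFIF"
print(as_numbers)     # the same data, shown as decimal numbers
```

Latin-1 is chosen here only because it can decode any byte; a real text editor may guess a different encoding and show different garbage for exactly the same data.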

To understand how a JPEG image is decoded we need to see the original signals themselves: the binary data. This can be done with a hex editor, or it can be done right here in this webpage! Below is the image next to all of its bytes, represented as decimal numbers. You can make changes to the bytes, and it will re-decode and display the new, edited image as you type.

JPEG Editor

Size: 79.21 kb. Dimensions: 700 x 437

255 218 0 12 3 1 0 2 17 3 17 0 63 0 249 148 
245 244 175 140 62 216 97 52 208 131 190 104 24 103 32 250 
208 33 51 84 12 9 164 3 41 128 229 60 208 2 154 0 
107 80 38 200 95 245 170 1 153 166 49 172 120 161 8 76 
241 77 130 26 78 71 248 80 75 35 99 84 50 187 158 212 
34 25 92 183 94 57 173 17 44 96 63 133 81 40 27 214 
144 198 49 31 90 6 153 11 114 105 12 117 179 109 153 104 
98 61 23 194 151 98 61 156 210 41 234 123 95 131 181 61 
225 59 86 177 48 108 245 205 14 231 204 137 115 90 163 38 
116 80 174 224 13 89 29 75 107 23 203 76 4 146 46 58 
82 2 133 204 124 30 40 3 18 250 62 181 44 70 29 202 
114 120 169 46 229 70 143 39 252 104 37 187 130 198 104 17 
110 24 137 233 76 108 188 176 100 123 85 48 26 240 117 226 
164 104 207 185 131 57 56 160 104 231 181 75 110 13 0 209 
199 106 208 240 220 80 51 138 213 163 198 234 0 228 53 21 
229 170 128 231 175 87 154 123 146 97 93 245 63 206 173 16 
204 217 133 81 5 9 135 90 162 89 74 65 207 74 98 66 
168 224 87 60 206 234 68 192 29 181 193 45 207 78 43 66 
165 200 60 213 196 137 148 36 25 53 212 142 9 11 0 203 
125 107 67 156 214 181 29 42 36 84 81 173 2 252 149 204 
206 148 180 30 203 245 164 85 136 177 64 135 45 12 99 240 
113 214 149 199 97 173 199 227 76 77 12 69 203 123 10 119 
18 52 173 23 24 61 235 158 76 232 137 183 104 57 21 206 
205 145 179 105 242 227 214 145 47 67 98 217 134 57 164 65 
126 23 226 129 22 16 156 131 210 128 36 3 34 129 6 78 
122 80 49 84 224 30 122 210 11 14 67 215 214 129 15 39 
241 166 49 81 136 53 86 2 81 32 199 25 169 176 134 23 
28 247 166 65 28 141 249 84 150 66 199 34 139 143 113 65 
61 13 49 9 35 149 92 243 211 173 33 153 242 92 48 106 
46 49 99 156 145 72 157 201 214 114 121 254 116 211 42 196 

Hint: try scrolling down and removing a few chunks. Don’t worry, you can always reset the image back to the original!

There’s a lot you can learn just from playing around with this editor. For example, can you figure out the order the pixels are stored in?

Something strange in the example above is that changing some numbers doesn’t seem to impact the image at all, while setting the 17 on line one to 0 completely ruins the image! Other actions, like setting the 7 on line 1988 to 254, change the color, but only for subsequent pixels.

The three layers of JPEG compression

Chrominance Subsampling
Discrete Cosine Transform & Quantization
Run-Length, Delta & Huffman Encoding

To give you an idea of the scale of this compression, notice that the image above is represented using exactly 79,819 numbers, which makes it about 79 kilobytes. If it were stored with no compression, three numbers would be needed for each pixel—one for each of the red, green and blue components. That would mean a total of 917,700 numbers, or about 917 kilobytes. With JPEG compression, the resulting file is over ten times smaller!

In fact, this image can be squeezed into far fewer bytes. Below is the image next to a version of it that was compressed down to just 16 kilobytes, which makes it fifty-seven times smaller than the uncompressed version would be!
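The arithmetic behind those ratios is easy to check. Here is a quick sketch in Python, assuming one byte per number and treating "16 kilobytes" as a round 16,000 bytes (the exact size of the smaller file isn't given):

```python
width, height = 700, 437
channels = 3  # red, green and blue, one byte each

uncompressed = width * height * channels  # bytes with no compression
jpeg_size = 79_819                        # bytes in the JPEG above
small_jpeg_size = 16_000                  # assumed: "16 kilobytes", rounded

ratio = uncompressed / jpeg_size
small_ratio = uncompressed / small_jpeg_size

print(uncompressed)        # 917700
print(round(ratio, 1))     # 11.5
print(round(small_ratio))  # 57
```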

If you look closely you’ll see that these images are not identical. Both are JPEG-encoded images, but the one on the right is much smaller in terms of file size. It also doesn’t look as nice (notice how blocky the colors look in the background). This is why JPEG is known as a lossy compression technique; the image changes and loses some detail as a result of the compression.

1. Chrominance Subsampling

Here’s the image with just the first layer of compression applied.

Chrominance Subsampling

Size: 153.60 kb. Dimensions: 400 x 250

71 71 70 70 108 148 70 70 69 69 108 148 69 69 69 69 108 148 69 69 70 70 
    108 148 70 70 69 69 109 148 69 69 69 69 109 148 69 69 69 68 109 148 68 
    68 68 68 109 148 68 68 69 69 110 148 69 69 69 69 110 148 68 68 68 68 
    110 148 67 67 66 66 110 148 66 65 65 65 110 148 65 65 65 64 110 148 65 
    64 63 63 110 148 64 64 64 64 110 148 63 63 63 63 110 148 63 62 62 61 
    110 148 61 60 59 58 110 148 58 57 56 55 110 148 58 56 55 54 110 148 54 
    55 55 55 110 148 53 54 54 54 110 148 54 53 53 52 110 148 53 52 51 50 
    110 148 49 50 51 51 111 148 57 64 72 81 111 148 91 99 110 122 111 148 
    128 136 143 149 111 148 157 161 161 163 112 148 164 164 165 166 112 
    148 167 167 168 168 112 148 168 168 167 166 112 148 165 163 162 161 
    112 147 159 157 154 152 112 147 149 146 144 142 112 147 142 140 139 
    138 112 147 137 137 137 136 113 146 136 137 138 139 113 146 140 140 
    141 142 113 146 144 145 145 146 114 144 146 145 145 144 114 144 144 
    142 140 139 114 145 137 134 130 128 114 145 123 120 116 113 114 145 
    109 105 99 95 114 145 91 88 84 80 114 145 76 73 68 65 114 146 65 62 59 
    56 114 146 54 52 50 48 114 146 45 44 43 42 114 146 41 39 38 37 114 146 
    36 35 34 32 114 146 31 31 31 31 114 145 32 32 32 33 114 145 34 35 34 
    34 114 145 34 32 30 28 117 143 27 26 25 24 117 142 24 24 24 25 117 142 
    26 29 31 32 118 141 35 36 38 40 118 141 42 43 43 43 119 140 43 42 42 
    41 119 140 41 41 41 41 119 140 41 41 42 42 118 140 42 43 43 43 118 140 
    43 43 42 42 118 140 41 41 40 40 118 141 40 41 43 45 117 141 47 50 53 
    55 117 141 56 58 60 62 117 141 64 66 69 72 117 141 74 76 78 79 117 140 
    80 81 83 84 117 140 85 85 84 85 117 140 87 87 87 86 117 140 86 86 87 
    87 117 140 87 86 86 85 117 140 87 88 88 87 117 140 87 87 88 89 117 140 
    90 89 89 88 118 140 88 89 89 90 117 141 91 91 92 93 117 141 92 92 91 
    90 117 141 91 91 91 90 117 141 89 88 88 87 116 141 87 86 86 86 116 142 
    86 87 87 87 116 142 88 88 88 88 116 142 86 84 83 82 117 142 82 82 83 
    83 117 142 84 85 85 86 117 142 86 86 86 86 117 142 86 87 88 89 118 142 
    90 90 91 91 118 142 91 91 92 93 118 142 93 93 93 92 119 141 92 91 91 
    91 119 140 91 92 92 93 119 140 92 91 90 90 119 140 71 70 70 70 120 139 
    70 69 69 69 120 139 69 69 69 69 120 138 69 69 69 69 120 138 70 69 69 
    69 121 138 69 69 69 69 121 138 69 69 69 68 121 138 68 68 68 68 121 139 
    68 68 69 69 119 141 69 69 69 69 117 145 68 68 68 68 114 149 68 67 66 
    66 113 152 66 65 65 65 113 150 65 65 65 64 114 148 65 64 63 63 116 146 
    64 64 64 64 118 143 63 63 63 63 119 141 63 62 62 61 119 140 61 60 59 
    58 119 140 58 57 56 55 118 140 57 56 55 54 116 142 54 55 55 55 116 143 
    54 54 54 54 115 144 54 53 53 53 114 145 52 52 51 50 114 146 50 50 50 
    51 114 145 56 64 71 80 115 144 90 97 107 118 115 143 126 134 141 147 
    115 145 156 160 161 163 114 145 164 165 166 167 113 145 168 168 169 
    169 113 145 169 169 168 167 113 144 166 164 

Hint: Notice that removing one number ruins all the colors. But removing exactly 6, or any multiple of 6, has a minimal effect on the image.

It’s a little more straightforward to decipher now. It’s almost a simple list of colors, where each byte changes exactly one pixel, and yet it’s already almost twice as small as the uncompressed image (which would be around 300 kb for this smaller size). Can you guess why?

You can tell that these numbers don’t represent the standard red, green and blue components because replacing all the numbers with 0 turns the image green (as opposed to black).

This is because these bytes represent the Y (brightness), Cb (relative blueness), and Cr (relative redness) of the image.
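For the curious, here is a rough Python sketch of the conversion back to RGB. It assumes the full-range YCbCr formulas from the JFIF file format that most JPEG files follow. Whether the editor above uses exactly these constants is an assumption on my part, and the function names are made up for illustration.

```python
def clamp(value):
    """Round to the nearest integer and limit it to the 0..255 byte range."""
    return max(0, min(255, round(value)))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one full-range YCbCr color (JFIF-style) to an RGB triple."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


print(ycbcr_to_rgb(0, 0, 0))        # (0, 135, 0): green, not black
print(ycbcr_to_rgb(128, 128, 128))  # (128, 128, 128): neutral gray
```

Cb and Cr are centered on 128, so a value of 0 means "much less blue and much less red than neutral". With no brightness either, what is left over is a dark green.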

Why not just use RGB? After all, that’s how most modern screens work. Your monitor can display any color by turning on red, green and blue lights at various intensities for each pixel. White is displayed by turning on all three colors at full brightness, while black is displayed by turning them all off.

That’s also very similar to how human eyes work. The color receptors in our eyes known as “cones” are split into three types, each of which is mostly sensitive either to red, green, or blue. Rods, the other type of receptor we have in our eyes, can only detect changes in brightness, but they’re far more sensitive. We have about one hundred and twenty million rods in our eyes, compared to a measly six million cones.

This means that our eyes are much better at detecting changes in brightness than they are at detecting changes in color. If we can separate the color from the brightness, we can remove a bit of the color without anyone noticing. Chrominance subsampling is the process of representing an image’s color components at a lower resolution than its luminance components. In the example above, each pixel has exactly one Y component, while each discrete group of four pixels has exactly one Cb and one Cr component. So the image contains only a quarter as much color information as it originally did.
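Here is a small sketch of what "one Cb value per group of four pixels" could look like in Python. It averages each 2x2 group, which is one common choice; an encoder could just as well pick one of the four values, so treat the averaging as an assumption. The sample values are invented.

```python
def subsample_2x2(plane):
    """Replace each 2x2 group of a chroma plane with its rounded average.

    `plane` is a list of rows with an even width and an even height.
    """
    out = []
    for row in range(0, len(plane), 2):
        out_row = []
        for col in range(0, len(plane[0]), 2):
            total = (plane[row][col] + plane[row][col + 1]
                     + plane[row + 1][col] + plane[row + 1][col + 1])
            out_row.append(round(total / 4))
        out.append(out_row)
    return out


# A made-up 4x4 patch of Cb values.
cb = [
    [108, 108, 110, 110],
    [108, 109, 110, 111],
    [112, 112, 114, 114],
    [112, 113, 114, 115],
]
small = subsample_2x2(cb)
print(small)  # [[108, 110], [112, 114]]

# Number of values for a 400 x 250 image, before and after subsampling.
full = 400 * 250 * 3                        # Y, Cb and Cr for every pixel
subsampled = 400 * 250 + 2 * (200 * 125)    # full Y, quarter-size Cb and Cr
print(full, subsampled)  # 300000 150000
```

The Y plane is left alone, so the total drops from three numbers per pixel to one and a half.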

Using the YCbCr colorspace is not unique to JPEG. In fact, it was originally developed in 1938 for TV broadcasts. Not everyone had color TVs, so separating out the color from the luminance allowed everyone to receive the same transmission, and TVs that didn’t support color would just use the luminance component.

This is why removing one number from the editor above completely ruins the color. Here, the components are stored as Y Y Y Y Cb Cr. Removing the first number causes the Cb value to be interpreted as Y, the Cr as Cb, and creates a ripple effect that flips all the colors across the image.
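The ripple effect is easy to reproduce. The sketch below splits a flat list into groups of six, using the first twelve numbers from the editor above; `split_groups` is a hypothetical helper for illustration, not the page's actual decoder.

```python
def split_groups(values):
    """Split a flat list into (four Y values, Cb, Cr) groups of six."""
    groups = []
    for start in range(0, len(values) - 5, 6):
        y0, y1, y2, y3, cb, cr = values[start:start + 6]
        groups.append(((y0, y1, y2, y3), cb, cr))
    return groups


stream = [71, 71, 70, 70, 108, 148, 70, 70, 69, 69, 108, 148]

groups = split_groups(stream)
print(groups[0])   # ((71, 71, 70, 70), 108, 148)

# Remove one number: every later value slides into the wrong role.
shifted = split_groups(stream[1:])
print(shifted[0])  # ((71, 70, 70, 108), 148, 70)

# Remove six numbers: the remaining groups still line up.
print(split_groups(stream[6:]))  # [((70, 70, 69, 69), 108, 148)]
```

After removing a single number, the old Cb (108) is read as a brightness, the old Cr (148) as Cb, and a brightness (70) as Cr. Removing a whole group of six only drops a few pixels.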

There’s nothing in the JPEG specification that says you must use YCbCr. Most JPEG images you’ll find use it because it tends to produce higher quality images after subsampling compared to RGB. You don’t have to take this for granted though. You can see for yourself in the grid below what it looks like to subsample each component individually across RGB as well as YCbCr.

Subsample percent: 15%
Move slider to adjust the amount of subsampling applied.

Removing a bit of blue isn’t as noticeable as removing red or green. This is because, of the six million cones you have in your eyes, about 64% are most sensitive to red, 32% to green, and only 2% to blue.

Subsampling the Y component (bottom left) has the greatest effect on the image quality. Even a tiny bit is already noticeable. You can move the slider to see how removing a greater percentage of each component affects the image.

Converting an image from RGB to YCbCr doesn’t make the file size any smaller, but it does make it easier to find less noticeable details to remove. It’s in the second step, the subsampling, that the actual lossy compression happens. This idea of finding new ways to represent data to make it more compressible is at the heart of what the next layer does.

2. Discrete Cosine Transform & Quantization

This layer of compression is largely the defining feature of JPEG. After the colors are converted to YCbCr, the components are compressed individually, so we can focus on just the Y component for the rest of the article. Here’s what the bytes for the Y component look like with this layer applied.

Discrete Cosine Transform

Size: 102.40 kb. Dimensions: 400 x 250

-156 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-158 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-158 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-160 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-158 -2 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-161 3 -1 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-169 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-172 1 1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-176 3 -1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-188 8 0 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-197 1 1 1 0 0 0 0 3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-200 3 0 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-208 3 2 0 0 0 0 0 2 -1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
-132 -70 7 -6 1 -1 0 0 23 -7 -3 0 0 0 0 0 0 -2 2 0 0 0 0 0 1 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
44 -58 -11 -3 -1 -1 0 0 12 6 0 1 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Hint: Try clicking on any pixel in the image to see the line in the editor that represents it. Try removing numbers from the end, or adding a few zeros to any individual number to make its effect more obvious.

At first glance, this seems like very poor compression. There are 100,000 pixels in this image, and yet it takes 102,400 numbers to represent the luminance of each pixel—that’s worse than not compressing it at all!

But notice that most of these values are 0. In fact, all of these trailing zeros can be removed with no change to the image. This leaves only about 26,000 numbers, which makes it about four times smaller!
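Dropping the trailing zeros, and putting them back, takes only a few lines. This Python sketch uses the first line of the editor above (three non-zero values, then zeros); the helper names are my own.

```python
def trim_trailing_zeros(block):
    """Return the block without the run of zeros at its end."""
    end = len(block)
    while end > 0 and block[end - 1] == 0:
        end -= 1
    return block[:end]


def restore(trimmed, size=64):
    """Pad a trimmed block with zeros back to its full length."""
    return trimmed + [0] * (size - len(trimmed))


# First line of the editor: 64 coefficients, most of them zero.
block = [-156, 2, 0, 0, 0, 0, 0, 0, 1] + [0] * 55

trimmed = trim_trailing_zeros(block)
print(trimmed)                    # [-156, 2, 0, 0, 0, 0, 0, 0, 1]
print(restore(trimmed) == block)  # True
```

Only 9 of the 64 numbers need to be kept for this block, and nothing is lost because the decoder knows every block has 64 values.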

In this layer lies the secret to the checkerboard patterns. Unlike the other effects we’ve seen, the appearance of these patterns is not a glitch. They are in fact the building blocks of the entire image. Every line in the editor above contains exactly 64 numbers, known as the Discrete Cosine Transform (DCT) coefficients, which correspond to intensities of 64 unique patterns.

These patterns are formed out of cosine waves. Here’s what a few of them look like:

These are 8 of the 64 discrete cosine transform patterns. Credit: Jez Swanson.

Below is an image that shows all 64 of them individually.

Discrete Cosine Transform

Size: 4.10 kb. Dimensions: 64 x 64

254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 254 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

These patterns are special because they form a basis for 8x8 images. If you’re not familiar with linear algebra, what that means is that any 8x8 image, anything at all that you can imagine, can be made out of these specific 64 patterns. The Discrete Cosine Transform is the process of breaking up the image into 8x8 blocks and converting each block into a combination of these 64 patterns, with one coefficient per pattern. A circle, or the cat’s face, can be formed by combining these patterns in just this way.
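If you'd like to check the "any 8x8 image" claim yourself, here is a slow but readable Python sketch of the transform and its inverse. It uses the textbook orthonormal DCT-II scaling and stores coefficients in plain row-and-column order, which may differ from the ordering used in the editor above; all names are illustrative.

```python
import math

N = 8


def alpha(k):
    """Scaling factor that makes the 64 patterns orthonormal."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


def basis(u, v):
    """The 8x8 pattern with horizontal frequency u and vertical frequency v."""
    return [[alpha(u) * alpha(v)
             * math.cos((2 * x + 1) * u * math.pi / (2 * N))
             * math.cos((2 * y + 1) * v * math.pi / (2 * N))
             for x in range(N)] for y in range(N)]


def dct(block):
    """Forward transform: how strongly each pattern appears in the block."""
    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            pattern = basis(u, v)
            coeffs[v][u] = sum(block[y][x] * pattern[y][x]
                               for y in range(N) for x in range(N))
    return coeffs


def idct(coeffs):
    """Inverse transform: add up the patterns, each scaled by its coefficient."""
    block = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            pattern = basis(u, v)
            for y in range(N):
                for x in range(N):
                    block[y][x] += coeffs[v][u] * pattern[y][x]
    return block


# A flat gray block needs only the first (constant) pattern.
flat = [[100] * N for _ in range(N)]
print(round(dct(flat)[0][0]))  # 800

# Any block survives the round trip, up to floating point error.
block = [[(x * y * 7 + x * 3 + y) % 256 for x in range(N)] for y in range(N)]
back = idct(dct(block))
print(max(abs(back[y][x] - block[y][x]) for y in range(N) for x in range(N)) < 1e-6)
```

Real decoders use much faster factorizations of the same math; this version just makes the "sum of 64 patterns" idea visible.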

It seems like magic to say that any image can be represented using 64 specific patterns. But this is the same thing as saying any location on the Earth can be represented using only two numbers: longitude and latitude. We often treat the surface of the Earth as two-dimensional, so only two numbers are needed. An 8x8 image is sixty-four-dimensional, so we need sixty-four numbers.

In terms of compression, it’s not obvious how this helps us. If we need sixty-four numbers to represent an 8x8 image, why is this better than storing the sixty-four luminance components? We do it for the same reason we converted from the three numbers of RGB to the three numbers of YCbCr: it allows us to remove detail that’s less noticeable.

It’s hard to see exactly what the details that are removed in this compression step look like because JPEG only applies the Discrete Cosine Transform to blocks of 8x8 pixels at a time. However, there’s no reason we can’t apply it to the whole image. Here’s what it looks like to apply the DCT to the Y component of the entire image:

Full Discrete Cosine Transform

Size: 131.07 kb. Dimensions: 400 x 250

13428465 834183 -3326940 4177541 -1815723 -1015133 1539816 -2045280 17638 
    -196060 -702055 594376 -646990 492606 381611 -289537 661968 -208696 
    293165 221314 -496249 438519 -196935 -107794 216488 -343306 277679 
    -141153 -158864 260895 -292771 142687 -61140 -173951 255516 -257615 
    105261 84737 -206589 229959 -132835 57910 145999 -169560 208622 -78922 
    -8288 138901 -209279 155276 -24849 -83174 143339 -185906 82696 -8839 
    -146174 132159 -137955 58468 66248 -125461 138714 -98659 14857 98747 
    -122005 140658 -42520 -24497 119361 -128686 76320 -28933 -55363 136831 
    -107767 52233 10822 -98878 100025 -109556 8220 46470 -90327 116706 
    -57903 -10193 74503 -95712 85283 -37154 -27026 103526 -94856 53618 
    1008 -47012 94101 -86519 33747 29852 -63322 84308 -73502 6628 33796 
    -95609 78313 -34263 -12836 57073 -79165 72397 -12660 -33833 67008 
    -80921 44453 8012 -43142 83960 -59756 27616 29384 -65015 64525 -50256 
    6092 43843 -70482 59983 -30468 -21587 43938 -74065 51378 -6477 -29826 
    69549 -55097 29535 1746 -46461 65068 -51640 19743 34435 -46872 62326 
    -42751 -6038 39858 -60964 50554 -21106 -20940 46262 -61288 39932 -3107 
    -35249 57157 -47828 21306 3827 -47720 60468 -37008 11053 34421 -43993 
    51251 -29917 -9263 35899 -53778 43318 -11072 -20937 44698 -48356 34042 
    1629 -36412 46413 -43155 16595 9354 -47610 50503 -27292 6808 31540 
    -42733 43214 -22198 -11047 37342 -46150 34027 -6867 -20767 42615 
    -43297 23401 5645 -29958 45753 -37104 8957 14283 -40682 41594 -26174 
    -244 26747 -40976 39401 -15526 -11951 39396 -37808 29661 -4052 -24669 
    36561 -40451 18353 11535 -27350 42160 -28975 9166 16626 -41959 34476 
    -23382 -5907 29070 -34816 34798 -10233 -16225 33595 -35171 23952 -1536 
    -24366 37958 -30960 15075 12161 -28979 36310 -25672 4390 19512 -35175 
    32721 -18004 -7717 26001 -36261 28417 -6035 -15805 32510 -33037 20531 
    3072 -24406 36004 -28150 9074 13719 -27966 34881 -19932 -29 20405 
    -32847 27288 -13269 -7307 26529 -32811 24936 -4269 -19691 29504 -30021 
    16374 6440 -23022 33459 -23610 7999 14477 -28801 31676 -17184 -2420 
    22540 -30531 25746 -9151 -11292 27373 -28474 19350 -1413 -19291 27108 
    -27735 12638 8371 -22700 31254 -19685 5282 15235 -27253 29264 -14244 
    -4932 23078 -27323 22923 -7183 -12047 26778 -27413 16367 1389 -18942 
    26866 -26217 9079 10583 -23323 28835 -16947 3273 17672 -27311 25679 
    -10694 -7160 22962 -25887 21146 -2182 -13982 24960 -24595 13079 3585 
    -19048 25595 -22925 5885 12362 -21690 26537 -14192 397 17164 -26338 
    21370 -8077 -4897 23825 -23915 20436 -1699 -16935 24646 -23049 11204 
    7381 -19001 25116 -19105 2771 12002 -22203 23462 -12935 -1635 18603 
    -24432 19703 -5364 -6844 22478 -23061 18314 3296 -13460 

We can remove over 60,000 numbers from the end with almost no noticeable change. But notice that if we set just the first five numbers to zero (ignoring the first because it just makes the image darker) there’s already an obvious difference.

The numbers at the beginning represent the lower frequency changes in the image, which our eyes are better at detecting. The numbers towards the end represent the higher frequency changes, which are harder for us to see, so we don’t notice when they’re gone. To see “what our eyes can’t see”, we can isolate these high frequency details by setting the first 5,000 numbers to zero.

What you’re looking at here is all the areas of the image that have the greatest change from one pixel to the next. The cat’s eyes, whiskers, fuzzy blanket and shadows in the bottom left corner all stand out. This can be taken even further by setting the first 10,000, 20,000, 40,000 or 60,000 numbers to zero.

These high frequency details are what JPEG removes during this compression step. Converting the colors to the DCT coefficients is not a lossy operation. It’s the quantization step that’s lossy, where values that are high frequency, close to zero, or both, are removed. When you select a lower quality setting when creating a JPEG image, it increases the threshold for how many of these values are removed, which leads to a smaller file size but a blockier image. This is why the version of the image in the first section that was 57 times smaller looked blocky. Each 8x8 block was represented by far fewer DCT coefficients compared to the higher quality version.
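In baseline JPEG, quantization works by dividing each coefficient by a step size and rounding the result. The Python sketch below shows that idea on eight invented coefficients with invented step sizes that grow with frequency. Neither list comes from the cat image, and a real encoder uses a full table of 64 steps.

```python
def quantize(coeffs, steps):
    """Divide each coefficient by its step size and round to an integer."""
    return [round(c / q) for c, q in zip(coeffs, steps)]


def dequantize(levels, steps):
    """Undo the division. The rounding error cannot be recovered."""
    return [level * q for level, q in zip(levels, steps)]


# Hypothetical values: low frequencies first, high frequencies last.
coeffs = [-415.4, -30.2, -61.2, 27.2, 56.1, -12.0, -2.4, 0.5]
steps = [16, 11, 10, 16, 24, 40, 51, 61]

levels = quantize(coeffs, steps)
print(levels)                     # [-26, -3, -6, 2, 2, 0, 0, 0]
print(dequantize(levels, steps))  # [-416, -33, -60, 32, 48, 0, 0, 0]
```

Small values divided by large steps round to zero, which is how the high frequency, close-to-zero coefficients disappear. A lower quality setting corresponds to larger steps, so more values round to zero and the survivors are less precise.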

One really cool thing you can do with this technique is progressively stream pictures. Imagine seeing a blurry version of the whole image and slowly seeing it become more and more detailed as the download progresses and more DCT coefficients are available. This is actually possible to do with JPEG, but it’s not as commonly used.

Just for fun, here’s what it looks like using just 24,000 numbers, or just 5,000 numbers. Pretty blurry, but almost recognizable!

3. Run-Length, Delta & Huffman Encoding

All the compression steps so far have been lossy. This last layer, by contrast, is lossless. It doesn’t remove any information, but it does make the file size significantly smaller.

How do you compress something without throwing away any information? Think about how you would represent a simple solid black image.

JPEG Editor

Size: 4.95 kb. Dimensions: 700 x 437

255 218 0 12 3 1 0 2 17 3 17 0 63 0 252 170 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 
162 138 40 0 162 138 40 0 162 138 40 0 162 138 40 0 

JPEG uses about 5,000 numbers to represent this, but we can do much better. Can you think of an encoding scheme to represent this image using as few bytes as possible?

The smallest I could think of was four bytes: three to specify the color and one to specify how many pixels have this color. The idea of expressing all repeated values concisely this way is called run-length encoding. It’s lossless because we can recover the encoded data exactly as it was before.

The file size of the solid black JPEG image is much bigger than four bytes because, remember, the compression in the DCT layer is applied to 8x8 blocks at a time. So at minimum we’ll need one DCT coefficient for each 64-pixel block. We only need one because, instead of storing one DCT coefficient followed by 63 zeros for this image, run-length encoding allows us to just store one number and say “the rest are zero”.
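Here is a generic run-length encoder and decoder in Python, as a sketch of the idea rather than of JPEG's exact scheme. JPEG counts runs of zeros between non-zero coefficients and has a special "end of block" code, whereas this version simply stores (value, count) pairs. The sample block is invented.

```python
def rle_encode(values):
    """Collapse repeated neighbors into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# One block of a flat image: a single coefficient, then 63 zeros.
block = [-156] + [0] * 63

encoded = rle_encode(block)
print(encoded)                       # [(-156, 1), (0, 63)]
print(rle_decode(encoded) == block)  # True
```

Decoding gives back exactly the 64 numbers we started with, which is what makes this step lossless.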

Delta-encoding is the technique of storing each byte as a relative value compared to something before it instead of storing its absolute value. This is the reason editing certain bytes will change the color for all subsequent pixels. For example, instead of storing:

12 13 14 14 14 13 13 14

You would start with 12, and from there, just store how much you need to add or subtract to get the next number. So once delta-encoded the sequence above becomes:

12 1 1 0 0 -1 0 1

Once again, the transformed data is not any smaller than the original, but it is more compressible. Applying delta encoding before run-length can help a lot, while still remaining a completely lossless compression step.
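The same example in Python, as a minimal sketch with helper names of my own choosing:

```python
def delta_encode(values):
    """Keep the first value, then store each change from the previous one."""
    if not values:
        return []
    out = [values[0]]
    for previous, current in zip(values, values[1:]):
        out.append(current - previous)
    return out


def delta_decode(deltas):
    """Rebuild the original values by keeping a running total."""
    out = []
    total = 0
    for delta in deltas:
        total += delta
        out.append(total)
    return out


values = [12, 13, 14, 14, 14, 13, 13, 14]
deltas = delta_encode(values)
print(deltas)                          # [12, 1, 1, 0, 0, -1, 0, 1]
print(delta_decode(deltas) == values)  # True

# Editing only the first number shifts every value after it.
edited = [20] + deltas[1:]
print(delta_decode(edited))  # [20, 21, 22, 22, 22, 21, 21, 22]
```

The last line shows why one edited byte can change the color of every pixel that follows it: each value is rebuilt from the ones before.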

Delta encoding is one of the few techniques that is applied outside the 8x8 blocks. Out of the 64 DCT coefficients, the first one is just a constant wave function (you see it as a solid color). It represents the average brightness of each block for the luminance components, or the average blueness for the Cb components etc. This first value in each DCT block is called the DC value, and each DC value is delta-encoded relative to the ones before it. So changing the brightness of the very first block will affect all blocks in the image.

This all leaves just one final mystery: how can changing just a single number completely wreck the image? This was not a property of any of the compression layers so far. The answer lies in the JPEG header. The header is the first 500 or so bytes of the file. It contains metadata about the image, like its width and height, and has been omitted from all the byte editors so far.

Below is the original image with the header included.

JPEG Editor. Size: 79.82 KB. Dimensions: 700 x 437. The first 30 lines of bytes are shown below, 16 bytes per line; line numbers mentioned in the text count these lines starting from 1.
255 216 255 224 0 16 74 70 73 70 0 1 1 1 0 72 
0 72 0 0 255 219 0 67 0 3 2 2 3 2 2 3 
3 3 3 4 3 3 4 5 8 5 5 4 4 5 10 7 
7 6 8 12 10 12 12 11 10 11 11 13 14 18 16 13 
14 17 14 11 11 16 22 16 17 19 20 21 21 21 12 15 
23 24 22 20 24 18 20 21 20 255 219 0 67 1 3 4 
4 5 4 5 9 5 5 9 20 13 11 13 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 255 192 
0 17 8 1 181 2 188 3 1 17 0 2 17 1 3 17 
1 255 196 0 31 0 0 1 5 1 1 1 1 1 1 0 
0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 
10 11 255 196 0 181 16 0 2 1 3 3 2 4 3 5 
5 4 4 0 0 1 125 1 2 3 0 4 17 5 18 33 
49 65 6 19 81 97 7 34 113 20 50 129 145 161 8 35 
66 177 193 21 82 209 240 36 51 98 114 130 9 10 22 23 
24 25 26 37 38 39 40 41 42 52 53 54 55 56 57 58 
67 68 69 70 71 72 73 74 83 84 85 86 87 88 89 90 
99 100 101 102 103 104 105 106 115 116 117 118 119 120 121 122 
131 132 133 134 135 136 137 138 146 147 148 149 150 151 152 153 
154 162 163 164 165 166 167 168 169 170 178 179 180 181 182 183 
184 185 186 194 195 196 197 198 199 200 201 202 210 211 212 213 
214 215 216 217 218 225 226 227 228 229 230 231 232 233 234 241 
242 243 244 245 246 247 248 249 250 255 196 0 31 1 0 3 
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 
2 3 4 5 6 7 8 9 10 11 255 196 0 181 17 0 
2 1 2 4 4 3 4 7 5 4 4 0 1 2 119 0 
1 2 3 17 4 5 33 49 6 18 65 81 7 97 113 19 
34 50 129 8 20 66 145 161 177 193 9 35 51 82 240 21 
(remaining bytes not shown)

Without the header, it’s practically impossible (or at least very difficult) to decode the JPEG image. It would be as if I were trying to describe a painting to you and started inventing words to communicate what I saw. The description would probably be very concise, since I could define the words to mean exactly what I wanted to communicate, but it would be meaningless to anyone other than me.

This may sound ridiculous, but this is exactly what’s going on here. Every JPEG image is compressed with a code that can be tailored to that particular image. The code is defined in a dictionary stored in the header. This technique is called Huffman encoding, and the dictionary is called a Huffman table. Each table is marked in the header by two bytes: 255 followed by 196. Each color component may have its own Huffman table.
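For the curious, here is a minimal Python sketch that reads the first Huffman table from the header bytes shown in the editor above. The segment layout it assumes (a two-byte length, one byte of table information, sixteen counts of codes per bit length, then the symbols) and the rule for assigning codes come from the JPEG standard rather than from this article, and the function names are made up for the example.

```python
# The first segment that starts with 255 196 in the header shown in the editor.
segment = [255, 196, 0, 31, 0,
           0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
           0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

def parse_huffman_segment(data):
    """Split one Huffman table segment into its parts."""
    assert data[0] == 255 and data[1] == 196, "not a Huffman table marker"
    length = data[2] * 256 + data[3]   # counts itself, but not the two marker bytes
    table_info = data[4]               # table class and table id packed in one byte
    counts = data[5:21]                # how many codes are 1, 2, ..., 16 bits long
    symbols = data[21:21 + sum(counts)]
    return length, table_info, counts, symbols

def build_codes(counts, symbols):
    """Assign bit strings to symbols: shorter codes first, counting upward."""
    codes, code, index = {}, 0, 0
    for bit_length, count in enumerate(counts, start=1):
        for _ in range(count):
            codes[symbols[index]] = format(code, "0{}b".format(bit_length))
            code += 1
            index += 1
        code <<= 1                     # move on to codes that are one bit longer
    return codes

length, table_info, counts, symbols = parse_huffman_segment(segment)
codes = build_codes(counts, symbols)
for symbol, bits in codes.items():
    print(symbol, bits)
```

In this table the symbol 0 gets the two-bit code 00, while the symbol 11 gets the nine-bit code 111111110, so the decoder cannot know where one value ends and the next begins without the table.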

Changes to these Huffman tables will have the most dramatic effects on any image. Changing the second 1 to 12 on line 15 is a good example. Changing anything after the 125 on that line works too.

The Huffman tables have such a dramatic effect on the image because they tell us how to read the individual bits. So far we’ve just been dealing with the binary numbers in decimal. This hides the fact that if you want to store the number 1 in a byte, it would look like 00000001, because each byte must have exactly eight bits even if it only needs one bit.

This is potentially a huge waste of storage if you have a lot of small numbers. Huffman encoding is a technique that allows us to relax this requirement that each number must occupy eight bits. That means if you see the two bytes:

234 115

Based on the Huffman table, these could actually be three values. To extract them, you’ll need to first break them into their individual bits:

11101010 01110011

Then follow the table to figure out how to group them. For example, it could be the first six bits (111010), which is 58 in decimal, followed by another five bits (10011), which is 19, and finally the last five bits (10011), which is also 19.

This is why it’s very difficult to make sense of the bytes at this layer of compression. The bytes don’t actually represent what they seem to represent. I won’t go into the details of how to extract the Huffman table and translate the bits in this article, but there are many good resources on this if you’re curious.

One neat trick you can do with this knowledge is to strip out the header from a JPEG image and save it separately. You’re effectively making it so only you can read it. Facebook actually does this to make JPEG images even smaller.

Another thing you can do is change the Huffman table just slightly. To anyone else, it looks like a corrupted image. But only you would know the magic edit needed to fix it.
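As a small illustration of the bit regrouping described for the bytes 234 and 115, here is a minimal Python sketch. The group widths of 6, 5 and 5 bits are made up for the example so that the three groups cover all 16 bits; a real decoder derives the boundaries from the Huffman table.

```python
def bytes_to_bits(values):
    """Concatenate bytes into one string of bits, eight bits per byte."""
    return "".join(format(v, "08b") for v in values)

def split_bits(bits, widths):
    """Cut a bit string into consecutive groups of the given widths."""
    groups, pos = [], 0
    for width in widths:
        groups.append(bits[pos:pos + width])
        pos += width
    return groups

bits = bytes_to_bits([234, 115])
groups = split_bits(bits, [6, 5, 5])    # hypothetical widths
numbers = [int(group, 2) for group in groups]

print(bits)      # 1110101001110011
print(groups)    # ['111010', '10011', '10011']
print(numbers)   # [58, 19, 19]
```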

So to summarize, what all does it take to decode a JPEG image? You need to:

1. Extract the Huffman table(s) from the header and decode the bits.
2. Extract the Discrete Cosine Transform coefficients for each color/luminance component, for each 8x8 block, by undoing the run-length and delta encodings and multiplying each coefficient by its entry in the quantization table (also stored in the header).
3. Combine the cosine waves based on the coefficients to get back the pixel values for each 8x8 block (this is known as the inverse Discrete Cosine Transform).
4. Scale up the chrominance components if they were subsampled (the header has this information).
5. Convert the resulting YCbCr of each pixel to RGB.
6. Display the image!
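The color conversion near the end of that list is the simplest step to show in code. This is a sketch only: the constants below are the ones commonly used for JPEG files in the JFIF convention, not values given in this article, and the function name is made up.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Convert one 8-bit YCbCr pixel to 8-bit RGB (JFIF-style constants, assumed)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    # Round to whole numbers and clamp into the displayable 0..255 range.
    return tuple(max(0, min(255, round(channel))) for channel in (r, g, b))

print(ycbcr_to_rgb(255, 128, 128))  # (255, 255, 255): full brightness, no color
print(ycbcr_to_rgb(0, 128, 128))    # (0, 0, 0): black
print(ycbcr_to_rgb(76, 85, 255))    # (254, 0, 0): close to pure red
```

When Cb and Cr both sit at 128, the color terms vanish and the pixel is a shade of gray determined by Y alone.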

That’s a lot of work to view a simple cat picture! But what I love about this is that you can see how JPEG is a very human-centric technology. It relies on the quirks of our perception to achieve compression rates far greater than is possible with general purpose techniques. And now that you understand how JPEG works, you can imagine how many of these techniques can be extended to other domains. For example, applying delta-encoding in video can produce a huge file size reduction since there are often areas that don’t change at all between frames (such as the background).

All the code for this article is open source and includes instructions on replacing the images in these byte editors with your own.

Omar Shehata is a graphics programmer at Cesium working on open source, web-based 3D maps. He grew up in Alexandria, Egypt and currently lives in Philadelphia, PA.

Edited by Matthew Conlen and Victoria Uren.
