The unreasonable effectiveness of run-length encoding
I’m writing a game that uses an extremely limited color palette. Most of my art assets are just two or three colors. The obvious image format to use in this case is PNG, and it does a good job here, compressing 10MB of raw pixel data down to 100kB. So I have all these PNG files sitting in a folder, and at runtime I load and decompress them.
This is a fine approach during development, but there are two things I want to improve. First, I don’t want to ship a folder of PNGs to all my users. Most games compile groups of assets into bigger files, for example Unity’s .assets/.resS files, and I could do that here. But since my game is really small, and has a retro aesthetic, I really want to compile the entire thing into a standalone executable.
Secondly, I had a hunch I could beat PNG’s compression in this case, by exploiting the weirdness of my assets. As a side benefit, cutting the dependency on the PNG decompression library is nice. It’s not a huge library, at 74kB, but it’s still about 15% of my binary size.
Embedding assets in C++ code
Clang doesn’t support #embed yet, so the easiest way to embed data in a C++ app is to generate a header file that contains the data as a literal. For a byte array, your options are a string literal or an array literal. We want to encode this data compactly, to avoid bloating our source files too much.
The most compact way of declaring arbitrary bytes in a string literal is octal escape characters. For example, the byte 135 would be "\207". You might think that hexadecimal escape characters would be shorter, but they have to be prefixed with an x that makes them equal or longer (135 becomes "\x87"). With their slash prefix, octal escape characters are up to 4 bytes per byte.
For array literals, your best option is just ordinary decimal literals, separated by commas (but no space after the comma). Hexadecimals are once again undermined by their prefix: 0x87. The decimal encoded byte plus comma is also up to 4 bytes. But the decimal will sometimes be shorter than the octal, so on average the array literal should be about 5.5% more compact. If you use an auto-formatter that insists on inserting spaces after the commas, then string literals win.
A more complicated solution is to encode the data as an array of uint64s, rather than uint8s. Then you can use a 16 character hexadecimal number to represent each uint64, which is significantly more compact. With the comma and the 0x, it averages out to about 2.4 bytes per uint8. You can then cast the uint64* to a uint8* and use it as a byte array. There are two complications with this approach. First, you have to worry about endianness when converting to and from uint64s (make sure the CPU architecture that generates the header is the same that reads it; don’t git commit the generated header). Second, if your uint8 array isn’t a multiple of 8 bytes long, your last uint64 won’t be filled up, so you’ll need to explicitly store the size of the uint8 array.
Whatever approach you use, most text editors have trouble dealing with extremely long lines. So it’s a good idea to wrap your lines, and keep them limited to ~1000 chars at most.
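As a sketch of what such a generator might look like (the function and variable names here are my own, not from any real tool), here's a minimal emitter for the uint64 approach. Note the explicit little-endian packing: this only round-trips on a little-endian target, which is the endianness caveat above.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Emit a C++ header that stores `data` as an array of uint64 hex
// literals. The packing below is little-endian, so casting the array
// to const uint8_t* only recovers the bytes on a little-endian CPU.
std::string emitHeader(const std::vector<uint8_t>& data, const std::string& name) {
    std::string out = "constexpr uint64_t " + name + "[] = {\n";
    for (size_t i = 0; i < data.size(); i += 8) {
        uint64_t word = 0;
        for (size_t j = 0; j < 8 && i + j < data.size(); ++j)
            word |= uint64_t(data[i + j]) << (8 * j);  // byte j -> bits 8j..8j+7
        char buf[32];
        std::snprintf(buf, sizeof(buf), "0x%016llx,", (unsigned long long)word);
        out += buf;
        if ((i / 8) % 50 == 49) out += '\n';  // 50 words * 19 chars keeps lines well under 1000
    }
    // The last word may be partially filled, so store the true byte count.
    out += "\n};\nconstexpr size_t " + name + "_size = " + std::to_string(data.size()) + ";\n";
    return out;
}
```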
My art assets are about 10MB uncompressed. So if I just decode the PNG files and stick them in headers, I bloat my few hundred kB binary a lot. Time for some compression.
Exploiting the limited palette
As I mentioned, my assets use a very limited palette. About half the assets are pure black and white, and the other half also include transparent pixels. So the most obvious first step is to reduce the uncompressed 4 bytes per pixel (RGBA, one byte per channel) down to a bit or two per pixel.
I have a couple of assets that have a few more colors. So I’m going to encode the palette and store it as a prefix to the rest of the image, rather than defining a fixed rule like white is 0, black is 1, and transparent is 2.
This already reduces the file size by 94%, down to around 564kB. But this is still more than 5x the PNG files. We can do a lot better.
Run-length encoding
In the opening sequence of every episode of the Simpsons, Bart is in detention and writing something on the blackboard. When the teacher gives him this task and tells him to write, for example, “I will not waste bytes” on the board 100 times, that instruction is run-length encoding (RLE). Without RLE, the teacher would have to say “Write the following on the blackboard: I will not waste bytes. I will not waste bytes. I will not waste bytes. I will not waste bytes…”. This would obviously be extremely inefficient.
RLE works pretty well on my images, partly due to the limited palette, and partly because I’m using Bayer dithering, as a stylistic choice, which generates very regular dithering patterns. If you scan across the pixels of the EduQuest image above, line by line, left to right, top to bottom, you can see there are a whole lot of long runs of the same colored pixels (mostly black in this case). We can more efficiently write a string of 47 black pixels as “47, 1” (where black is entry 1 in our palette), rather than “1, 1, 1, 1, …”.
We pay a steep cost for this approach though. Since we’re no longer encoding simple indexes into a tiny palette, we can’t store those numbers in one or two bits. Instead we’ll encode them as varints, with the reduced redundancy trick I mentioned in my last post.
We can combine the length and the palette index into a single number, since we know the palette index must be less than the palette size. We just multiply the length by the palette size and add the palette index. So if we have a three color palette we encode “47 black pixels” as 3 * 47 + 1 = 142, which is encoded in the two byte varint [128, 14]. But we can do better.
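Here's my reading of the redundancy-reduced varint, reconstructed from the byte sequences in this post (so treat the exact algorithm as my assumption): each continuation byte "uses up" one value before recursing, so two byte varints start at 128 rather than re-encoding 0–127.

```cpp
#include <cstdint>
#include <vector>

// Redundancy-reduced varint: after emitting the low 7 bits, subtract 1
// before continuing, so no value has two encodings and 2-byte varints
// start at 128 instead of 0. Continuation bytes have the high bit set,
// and bytes are stored most-significant first.
std::vector<uint8_t> encodeVarint(uint64_t v) {
    std::vector<uint8_t> out;
    out.push_back(v & 0x7f);
    v >>= 7;
    while (v > 0) {
        v -= 1;                            // the redundancy-reduction step
        out.push_back((v & 0x7f) | 0x80);  // continuation byte
        v >>= 7;
    }
    // Bytes were built low-to-high; reverse into stored order.
    return std::vector<uint8_t>(out.rbegin(), out.rend());
}

uint64_t decodeVarint(const std::vector<uint8_t>& bytes) {
    uint64_t v = 0;
    for (uint8_t b : bytes) {
        if (b & 0x80) v = (v + (b & 0x7f) + 1) << 7;  // undo the subtraction
        else v += b;
    }
    return v;
}
```

This reproduces the examples later in the post: 142 encodes to [128, 14], and decoding [142, 254, 254, 254, 127] yields 0xFFFFFFFF.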
Just like when we were reducing the redundancy of the ordinary varint encoding, we can exploit redundancy in this encoding to shrink it. If there are two compressed sequences that decompress to the same image, or if there are certain compressed sequences that the compressor will never generate, that’s usually an opportunity for further compression.
When we’re implementing RLE, we walk through our data, and emit a {length, palette index} pair whenever the color changes. This means that it’s impossible for adjacent pairs to have the same palette index. How can we exploit this? Well, for a 3 color palette, we delta encode the indices modulo 3, and subtract 1. It’s a bit confusing when I say it like that, but it’s easy to see how it works with an example:
Take just the palette indices from adjacent {length, index} pairs:
1 2 1 0 1 2 1 2 0 1 0 2
For each index, subtract the previous index from it:
1 1 -1 -1 1 1 -1 1 -2 1 -1 2
Modulo 3:
1 1 2 2 1 1 2 1 1 1 2 2
^ Note that there are no 0s, because each adjacent index is different.
Subtract 1, and we have our compressed sequence
0 0 1 1 0 0 1 0 0 0 1 1
To decompress this sequence, start by adding 1
1 1 2 2 1 1 2 1 1 1 2 2
Then calculate the cumulative sum of the sequence
1 2 4 6 7 8 10 11 12 13 15 17
Then wrap the numbers around modulo 3
1 2 1 0 1 2 1 2 0 1 0 2
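The whole index transform round-trips like this sketch (names are mine). One wrinkle the example glosses over: the very first index has no predecessor, so here I delta it against an implicit 0, which assumes a run sequence never starts with palette index 0.

```cpp
#include <vector>

// Delta encode RLE palette indices modulo the palette size, minus 1.
// Valid because adjacent runs never repeat an index, so the delta
// modulo paletteSize is never 0.
std::vector<int> deltaEncode(const std::vector<int>& indices, int paletteSize) {
    std::vector<int> out;
    int prev = 0;  // implicit predecessor; assumes the first index isn't 0
    for (int idx : indices) {
        int d = ((idx - prev) % paletteSize + paletteSize) % paletteSize;  // true modulo
        out.push_back(d - 1);
        prev = idx;
    }
    return out;
}

std::vector<int> deltaDecode(const std::vector<int>& deltas, int paletteSize) {
    std::vector<int> out;
    int sum = 0;
    for (int d : deltas) {
        sum = (sum + d + 1) % paletteSize;  // add 1 back, cumulative sum, wrap
        out.push_back(sum);
    }
    return out;
}
```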
Now for a 3 color palette we only have 2 possible delta encoded indices, 0 and 1. So we only need to multiply the length by 2. Now the result, 2 * 47 + 1 = 95, fits in a one byte varint.
For two color palette images, we don’t even need to encode the index. Every {length, index} pair alternates between the two colors, so we just have to encode the length.
Finally, one more trick along the same lines is to notice that we can’t have a run with length 0. The minimum length is 1, so we can subtract 1 from the length before we encode it. This is much less likely to save a byte than the delta encoded index, but it doesn’t hurt.
With all these tricks, the total size of all the images is reduced to 365kB, a 35% reduction. That’s a lot less impressive than I was hoping for, but the move from encoding the numbers using a few bits to whole bytes is really expensive. Even assuming all the varints fit in one byte, that’s a 4x increase per number for the three color images, and an 8x increase for the two color images. So it’s actually pretty impressive that we get any saving at all.
Recursion
Let’s take a look at the encoding of the EduQuest image. So far we’ve got it down to 46kB, which is better than the uncompressed 506kB, but still a lot bigger than the PNG’s 14kB.
The first few bytes are the header, which contains the palette and some other stuff that I’ll talk about later. The rest is the run-length encoded pixel data. Looking at the pixels, what jumps out to me is all the zeros. This is a two color image, so we don’t encode the palette index, just the run length (minus 1). So those 0s are all runs of length 1 of alternating colors: a black pixel, followed by a white pixel, then a black pixel…
Our trick of reducing/omitting the palette index has had the interesting side effect that our RLE encoded data is itself full of runs. So what happens if we try to RLE encode the RLE pixels?
When we were encoding pixel data, each pixel was 4 bytes, so that was the natural choice for our word size. That is, our palette is a list of uint32s that the delta encoded indices point into. Now we’re encoding a byte array, so what word length should we use?
The simple choice would be to treat each byte as a separate word. Our palette would then be a list of all the different byte values that appear in the data. But that’s suboptimal.
Taking another look at the RLE pixel data above, you might notice there’s also a whole lot of runs of alternating 0s and 2s. This encodes 1 pixel of one color, followed by 3 pixels of the other color. Due to the ordered dithering I’m doing, these sorts of runs are very common, as different shades of gray are converted to different sequences of black and white pixels. A word size of one byte will run-length encode this as “{1, 0}, {1, 2}, {1, 0}, {1, 2}, …”, which won’t save any space. A word size of two would work much better, because those adjacent bytes would be grouped together into a single symbol, and a run of N pairs of 0 and 2 could be encoded as “{N, [0, 2]}”. The palette would then be a list of uint16s, listing all the pairs of bytes that appear in the data.
More generally, I’ll let the compressor pick the word size. I’ll have it try encoding the image using 1, 2, 3, and 4 byte words, and choose whichever encoding shrinks the image the most. The trade-off with using longer words is that it makes the palette larger (exponentially so, for high entropy data). So when deciding which encoding is the smallest, it’s important to include the header size.
The headers themselves are basically never compressible via RLE. Just take a look at the header in the example above. There are more repeated values than you’d expect from random data, but nowhere near enough to be compressible. It’s rare for such short sequences to be compressible anyway. So when I apply RLE a second time, I’m only applying it to the body.
For the EduQuest image, applying it a second time brings the size down to 10.7kB, which beats PNG’s 14.1kB. The total size of all the images is now just 44.4kB.
I find this really surprising. If you had asked me a week ago if it ever made sense to apply RLE more than once, I would have said no. I assumed it was similar to trying to zip a zip file. But not only does it work, it works really well. The first round of RLE only reduced the file size by 35%, but this second round of RLE reduces the total image size by a further 88%!
The EduQuest image now looks like noise, so it won’t benefit from any more rounds of RLE. But I have other assets that are less complex and more patterned. One of them is just a large checkerboard of black and transparent pixels that I use to apply a crude shadow to the entire screen.
This image can be effectively compressed by six rounds of RLE! Uncompressed, this image is 506kB, and PNG gets it down to 1659 bytes. But recursive RLE gets it down to just 31 bytes! I was considering generating this image algorithmically, but the code size of a checkerboard generating function would definitely be more than 31 bytes.
The bulk of the remaining size is taken up by the few assets that don’t benefit from more than 2 rounds of RLE, so the overall savings of the later rounds are less impressive. Our total size is now 42.9kB. But benchmarking tells me that these later rounds have essentially no impact on decompression time (probably because they’re inflating tiny amounts of data, like the 31 byte checkerboard), so there’s no reason not to do this.
More tricks
The final format for the encoded image, after N rounds of RLE, looks like this:
[Header 1][Header 2]...[Header N-1][Header N][Body]
To decompress it, we take the body, decompress it using header N to get the N-1 compressed body. Then we decompress that using header N-1 to get the N-2 compressed body. We continue like this until we’ve decompressed it using each header.
All the headers have a varint length prefix that lets us split up the raw bytes into these chunks again. The body is identified by a length prefix of 0, and it takes up all the remaining bytes (it’s impossible to have a 0 byte header, so this uniquely identifies the body).
One detail I haven’t mentioned yet is what we do when the body doesn’t fit neatly into the compressor’s chosen word size. Say the compressor has decided to use a word size of 4, but our body is 11 bytes. The first 8 bytes of the body will become 2 words, but what about the last 3 bytes? This wasn’t a problem for the initial round of RLE, because our pixel data was guaranteed to be a multiple of 4 bytes long. But now that we’re doing recursive RLE, we have to worry about this. The solution is that we drop those 3 bytes off the body and store them in the header. Note that 3 is the maximum possible number of bytes that need to be dropped, as our largest word size is 4.
The header is a list of varints. The last number in the list defines the format, and the rest define the palette. The format number is a uint32 made up of 4 bytes. The first byte contains the word size (1, 2, 3, or 4; so 4 possible values means we need 2 bits), and the number of bytes that were dropped from the body (0, 1, 2, or 3; so again we need 2 bits). The other 3 bytes of the format contain the dropped bytes.
So far the ordering of the palette has gone undefined, and I just added new colors/words to the end as I found them. If we change the order of the palette entries, and update all the palette indices to match, we’ll get the same decompressed output. This is more redundancy that we can exploit for data savings.
If we sort the palette entries numerically from smallest to largest, then we can delta encode them. In fact, we can delta minus 1 encode them, like we did for the length, since it’s impossible to have duplicate palette entries.
Unsorted palette: [497, 872, 615, 848, 611, 334]
Varint bytes: [130, 113, 133, 104, 131, 103, 133, 80, 131, 99, 129, 78]
Resulting header is 12 bytes long ^
Sorted palette: [334, 497, 611, 615, 848, 872]
Delta - 1 encoded: [334, 162, 113, 3, 232, 23]
Varint bytes: [129, 78, 128, 34, 113, 3, 128, 104, 23]
Now down to 9 bytes ^
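The palette encoding above can be sketched as follows (names are mine; the varint helper is my reconstruction of the scheme from the earlier post, which reproduces the byte sequences in this example):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Redundancy-reduced varint (subtract 1 before each continuation byte).
void appendVarint(std::vector<uint8_t>& out, uint64_t v) {
    std::vector<uint8_t> tmp;
    tmp.push_back(v & 0x7f);
    for (v >>= 7; v > 0; v >>= 7) {
        v -= 1;
        tmp.push_back((v & 0x7f) | 0x80);
    }
    out.insert(out.end(), tmp.rbegin(), tmp.rend());
}

// Sort the palette, then delta-minus-1 encode it: entries are unique,
// so consecutive differences are always at least 1.
std::vector<uint8_t> encodePalette(std::vector<uint64_t> palette) {
    std::sort(palette.begin(), palette.end());
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    bool first = true;
    for (uint64_t p : palette) {
        appendVarint(out, first ? p : p - prev - 1);
        prev = p;
        first = false;
    }
    return out;
}
```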
Delta encoding significantly shrinks the size of the palette values, giving us shorter varints. Since every round of RLE generates its own header, this has a surprisingly large effect on the total size. We’re now down to 39.5kB for all the images (the checkerboard is now just 27 bytes!).
Huffman coding
The final compression technique I’m going to apply is Huffman coding. This is the simplest form of entropy coding. There are better entropy coding algorithms, but from my calculations Huffman coding is less than 0.5% away from optimal in my case (though I haven’t looked at the bigram probabilities yet).
Given a set of symbols, and the frequency or probability of each symbol, Huffman coding finds the optimal set of code words to refer to those symbols. More common symbols get shorter code words, and less common symbols get longer codes. The more non-uniform your symbol distribution is, the better the compression.
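The construction in a nutshell, as a minimal sketch: repeatedly merge the two least frequent subtrees, and each symbol's code length is its final depth in the tree. (A real implementation would go on to turn these lengths into canonical bit patterns; names here are mine.)

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <vector>

struct Node {
    uint64_t freq;
    int symbol;       // -1 for internal nodes
    int left, right;  // indices into the node pool
};

// Build Huffman code lengths from symbol frequencies.
std::map<int, int> huffmanCodeLengths(const std::map<int, uint64_t>& freqs) {
    std::vector<Node> pool;
    // Min-heap of {frequency, node index}.
    std::priority_queue<std::pair<uint64_t, int>,
                        std::vector<std::pair<uint64_t, int>>,
                        std::greater<>> heap;
    for (const auto& [sym, f] : freqs) {
        pool.push_back({f, sym, -1, -1});
        heap.push({f, (int)pool.size() - 1});
    }
    while (heap.size() > 1) {
        auto [fa, a] = heap.top(); heap.pop();
        auto [fb, b] = heap.top(); heap.pop();
        pool.push_back({fa + fb, -1, a, b});  // merge the two rarest subtrees
        heap.push({fa + fb, (int)pool.size() - 1});
    }
    // Walk the tree; a symbol's code length is its depth.
    std::map<int, int> lengths;
    std::vector<std::pair<int, int>> stack = {{heap.top().second, 0}};
    while (!stack.empty()) {
        auto [n, depth] = stack.back(); stack.pop_back();
        if (pool[n].symbol >= 0) lengths[pool[n].symbol] = depth;
        else {
            stack.push_back({pool[n].left, depth + 1});
            stack.push_back({pool[n].right, depth + 1});
        }
    }
    return lengths;
}
```

For frequencies {A: 2, B: 1, C: 1}, A gets a 1-bit code and B and C get 2-bit codes, exactly the "more common symbols get shorter code words" behavior described above.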
Let’s look at the distribution of all the byte values across all our images.
This distribution is highly non-uniform, so it benefits a lot from entropy coding. The peak towards the left is to be expected, as small numbers are typically more common than large ones in almost any data set (this was the main thing I was hoping to exploit with entropy coding).
I wasn’t expecting the larger peak in the middle, but it makes sense. The single byte varint limit is 127. Numbers larger than that need at least two bytes to encode: 128 = [128, 0], 129 = [128, 1], 130 = [128, 2]. This means 128 is by far the most common byte in all the compressed images, and 129 is the second most common. The floor of the left half of the plot is also higher than the floor of the right, as values above 160 are extremely uncommon.
There’s one more odd peak at 254. As I’ve mentioned, the 2 main colors in my art assets are black and white. It turns out, when you encode white, 0xFFFFFFFF, as a varint, you get [142, 254, 254, 254, 127], and 254 also appears in the encoding for black. So just about every image begins with several 254s.
All this unevenness means that Huffman coding performs well. 128 is assigned a code word of just 3 bits, 0b001. The average number of bits per symbol is 6.833, which means the compressed size goes from 39.5kB to 33.8kB. Including the code word table, which we need for decompression, the total is 34.3kB.
Summary
Reducing the palette took the total image size from an uncompressed 10MB down to 563kB, a 94% reduction. Run-length encoding gave us a further 35% reduction, down to 365kB. Surprisingly, we were able to apply RLE a second time, giving us an extra 88% reduction, down to 44.4kB, which was enough to beat PNG’s total of 100kB.
Applying RLE recursively until it stopped saving data worked well for some images, but most of the remaining size was taken up by images that stopped at 2 rounds of RLE. So overall this only reduced the size by another 3%, to 42.9kB. But these extra rounds of RLE had a negligible effect on decompression time, so there’s no harm in keeping them.
Delta encoding the RLE palette worked really well, saving an additional 8%, bringing our total down to 39.5kB. Finally, Huffman coding gave us another 13%, reducing the total size of all the images to 34.3kB. I probably won’t actually use Huffman coding in the final product though, because it comes at the cost of increasing the decompression time by 23%, which doesn’t seem worth it since the images are already so small.