Serialization in C++
I’ve written a bunch of C++ serialization formats in the past. I’ve recently written another.
Serialization is an interesting problem because there are so many possible requirements you can have, depending on your use case. And different mixes of these requirements will necessitate wildly different solutions. So it’s one of those problems that you can productively revisit over and over again.
Some of the design decisions you’ll need to make are:
- Do you want it to be a human readable and editable format? Or is it just a binary format?
- Does it need to be backwards and forwards compatible? If fields are added or removed from your objects, do existing serializations from different versions of your program still need to deserialize to something valid? Do you need to support both adding and removing fields, or just adding? I’m going to refer to this requirement as just “compatibility” for the rest of the post.
- How fast does it need to be? Is encoding speed important, or just decoding speed?
- Does the encoded format need to be as small as possible? What sort of speed vs encoded-size trade-off do you want to make?
- What sort of objects do you need to support? Just primitives and simple structs? Optional fields? Arrays? Maps? Pointers?
- What about inheritance and polymorphism? Is the runtime type of the deserialized object always known at compile time of your program? If not, does the encoded format need some concept of type ID?
- If you need to serialize pointers, do you know the structure of the object graph? Is it a tree? If it’s a DAG you’ll have to worry about deduping the objects being pointed to. If it’s a general graph, you’ll also have to worry about cycles.
The most advanced serializer I’ve implemented before was a compact binary encoding of a polymorphic flow graph (with cycles). But in that case I didn’t have to worry about compatibility.
My current use case is save files and other data files for a game. I don’t need polymorphism anymore, and I don’t have to worry about graph cycles, but I do need compatibility. This post is about the design decisions and implementation details that followed from those requirements.
The encoded format I’ve ended up with is similar to the protocol buffer wire format, with some minor improvements. This wasn’t intentional. It’s been a while since I learned the protobuf format, and I thought I’d made a substantial improvement to the way substructs are encoded. But rereading the format today, they’ve made the same improvement. I’ll talk about it more below, but it’s nice to have that design decision validated by such a battle-hardened format.
Raw Binary Formats
Our first major decision is whether we want the format to be human readable or not. If you want human readability you’re going to end up with something like JSON. These formats take up more space when encoded, and are slower to serialize and deserialize.
In my case I’m not interested in human readability. I want a compact binary format. For now, let’s assume this is the only requirement, ignoring compatibility. What would our serializer look like in this case?
If that was the only requirement, then the serializer would be pretty simple. Just iterate over the fields of your objects, recursively calling your serializer on every field. Add some base cases for the primitive types, and you’re done.
We’re working in C++, so we’re going to make heavy use of templates. Our top level serialize and deserialize functions will defer to a SerializeImpl<T> class that is templated on the type being serialized. We specialize SerializeImpl for each type T we want to serialize. The default implementation of this class (the fallback for unknown types) assumes that T is a struct with a report method that reports all the struct’s fields, like this:
struct Foo {
uint64_t x;
uint32_t y;
template <typename Reporter> void report(Reporter& reporter) {
reporter(x);
reporter(y);
}
};
This is known as the reporter pattern. The reporter is an object with a templated operator() method, like the StructSerializeReporter and StructDeserializeReporter below.
The serializer implementation would look something like this:
// Push a serialized T to the end of out.
template <typename T> void serialize(const T& t, Array<uint8_t>& out) {
SerializeImpl<T>::push(t, out);
}
// Pull a T from the start of bytes, and store it in out. Advance the head
// of bytes by however many bytes were read. Return whether successful.
template <typename T>
bool deserialize(ArrayView<uint8_t>& bytes, T* out) {
return SerializeImpl<T>::pull(bytes, out);
}
// Default implementation of SerializeImpl assumes T is a struct.
template <typename T> struct SerializeImpl {
static void push(const T& t, Array<uint8_t>& o) {
StructSerializeReporter r(o);
    // report is non-const (the deserializer mutates fields through it),
    // so we need a const_cast here.
    const_cast<T&>(t).report(r);
}
static bool pull(ArrayView<uint8_t>& b, T* t) {
StructDeserializeReporter r(b);
t->report(r);
return r.ok;
}
};
// StructSerializeReporter serializes each reported field.
struct StructSerializeReporter {
Array<uint8_t>& o;
StructSerializeReporter(Array<uint8_t>& o) : o(o) {}
template <typename T> void operator()(const T& t) {
serialize(t, o);
}
};
// StructDeserializeReporter deserializes each reported field.
struct StructDeserializeReporter {
ArrayView<uint8_t>& b;
bool ok = true;
  StructDeserializeReporter(ArrayView<uint8_t>& b) : b(b) {}
template <typename T> void operator()(T& t) {
if (!deserialize(b, &t)) ok = false;
}
};
// Specializations for each primitive type handle the actual serialization.
template <> struct SerializeImpl<int64_t> {
static void push(int64_t t, Array<uint8_t>& o) {
// Serialize t into o.
// I'll talk about how to actually do this later.
}
static bool pull(ArrayView<uint8_t>& b, int64_t* t) {
// Deserialize an int from b, and store it in t.
// Advance the head of b by however many bytes were read.
    // Return whether we succeeded.
}
};
// Partial specialization for Arrays.
template <typename T> struct SerializeImpl<Array<T>> {
static void push(const Array<T>& a, Array<uint8_t>& o) {
// Write a size prefix.
serialize(a.size(), o);
// Then write all the elements.
for (const T& t : a) serialize(t, o);
}
static bool pull(ArrayView<uint8_t>& b, Array<T>* a) {
size_t size;
if (!deserialize(b, &size)) return false;
a->clear();
    for (size_t i = 0; i < size; ++i) {
      T t;
      if (!deserialize(b, &t)) return false;
      a->add(t);
    }
    return true;
  }
};
The SerializeImpl class is necessary so we can use partial template specialization to handle arrays generically. I use this pattern a lot in C++. Function templates have the advantage that the T can be inferred (ie you can call serialize(foo, out), rather than serialize<Foo>(foo, out)). Class templates have the advantage that you can define partial specializations. Function templates are more ergonomic, but class templates are more powerful. So I often end up writing a function template that is a thin wrapper around a class template which does all the heavy lifting.
Anyway, the point is that if you just need a compact binary encoding, and don’t have any other requirements, the serializer is very simple.
Varints
I omitted the details of how ints are encoded from the above code. There’s actually a bit of a design decision to make there.
One approach is to just take the literal bytes of the int and dump them in the serialization (modulo endianness). So a uint32 would become 4 bytes in the encoding. If you use this approach for every type then deserialization can essentially be reduced to a reinterpret_cast from the byte buffer to the target type. This is the core idea behind FlatBuffers.
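As a sketch (not the approach I’m taking), raw-byte helpers in the style of this post’s containers might look like the code below. serializeRaw and deserializeRaw are hypothetical names, I’m assuming ArrayView has a pull() that reads one byte (it’s used that way later), and real FlatBuffers is far more sophisticated than this (offsets, vtables, zero-copy access):
// Hypothetical raw-byte helpers. memcpy avoids strict-aliasing issues.
// Assumes T is trivially copyable and that endianness matches across
// platforms.
template <typename T>
void serializeRaw(const T& t, Array<uint8_t>& o) {
  uint8_t raw[sizeof(T)];
  memcpy(raw, &t, sizeof(T));
  for (uint8_t byte : raw) o.add(byte);
}
template <typename T>
bool deserializeRaw(ArrayView<uint8_t>& b, T* t) {
  if (b.size() < sizeof(T)) return false;
  uint8_t raw[sizeof(T)];
  for (uint8_t& byte : raw) byte = b.pull();
  memcpy(t, raw, sizeof(T));
  return true;
}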
The approach that protobufs use, and that I’ve used in my case, is to compress your ints under the assumption that small ints are much more likely than large ints. Varints are a very simple and common technique for this. I’ll be using big-endian varints.
The idea is that you split your int up into chunks of 7 bits, drop any leading chunks that are all 0s, then prepend a 1 to all but the last chunk. In other words, the high bit of each byte of the varint indicates whether the varint should continue.
Big-endian varints:
0: 0x00
1: 0x01
127: 0x7F
128: 0x81 00
129: 0x81 01
16383: 0xFF 7F
16384: 0x81 80 00
16385: 0x81 80 01
This is the format protobufs use (except they use little-endian). There’s a small flaw in this format though. Any varint can be validly encoded with extra 0x80 bytes at the start, since this is equivalent to adding leading 0s to your int. For example, 129 can be encoded as 0x8101, or 0x808101, or 0x80808080808101. Of course, the encoder always uses the shortest form, but that means the decoder will simply never see the bit pattern 0x808101. These unused bit patterns mean that the encoding is slightly inefficient.
To fix this, we just have to change the encoding to make use of those patterns:
Big-endian varints with reduced redundancy:
0: 0x00
1: 0x01
127: 0x7F
128: 0x80 00
129: 0x80 01
16511: 0xFF 7F
16512: 0x80 80 00
16513: 0x80 80 01
Note that the first byte after the varint is extended is now 0x80 rather than 0x81. Also note that the largest 2 byte int is 16511 instead of 16383 (this is the efficiency gain).
At first glance this optimization seems like it would be a pain to implement, but it turns out to be super easy. A single extra decrement in the encoder and increment in the decoder are enough to turn ordinary varints into reduced-redundancy varints (see the commented lines):
// Big-endian varints with reduced redundancy.
void serializeVarint(uint64_t x, Array<uint8_t>& o) {
size_t start = o.size();
while (true) {
uint8_t b = x % 0x80;
if (o.size() > start) b |= 0x80;
o.add(b);
x >>= 7;
if (x == 0) break;
--x; // <== This line reduces redundancy.
}
// Reverse the sub-array we just added, to switch to big-endian.
o.reverseSubArray(start, o.size());
}
bool deserializeVarint(ArrayView<uint8_t>& b, uint64_t* x) {
uint64_t _x = 0;
while (true) {
if (b.empty()) return false;
uint8_t y = b.pull();
_x |= y & ~0x80;
if ((y & 0x80) == 0) break;
++_x; // <== This line reduces redundancy.
_x <<= 7;
}
*x = _x;
return true;
}
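Later on, the field-skipping code will rely on a skipVarint helper that skips past a varint without decoding it. A minimal version just scans for the first byte without the high bit set:
// Skip past one varint without decoding it. Return whether successful.
bool skipVarint(ArrayView<uint8_t>& b) {
  while (!b.empty()) {
    // The last byte of a varint is the one without the high bit set.
    if ((b.pull() & 0x80) == 0) return true;
  }
  // Ran out of bytes mid-varint.
  return false;
}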
Signed ints are handled in protobufs using zig-zag encoding, so that small negatives translate to small unsigned ints. I’ve done the same. Bit hacks FTW:
// int64 | Zig-zag encoded uint64
// 0 | 0
// -1 | 1
// 1 | 2
// -2 | 3
void serializeSignedVarint(int64_t x, Array<uint8_t>& o) {
  // Cast to unsigned before shifting; left-shifting a negative signed
  // int is undefined behavior.
  serializeVarint(((uint64_t)x << 1) ^ (uint64_t)(x >> 63), o);
}
bool deserializeSignedVarint(ArrayView<uint8_t>& b, int64_t* x) {
uint64_t y = 0;
bool status = deserializeVarint(b, &y);
*x = (y >> 1) ^ -(y & 1);
return status;
}
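As a sanity check, a round-trip test over a few values might look like this (a sketch; I’m assuming an ArrayView can be constructed from an Array):
void testSignedVarints() {
  for (int64_t x : {int64_t(0), int64_t(-1), int64_t(1), int64_t(-2),
                    int64_t(123456789), int64_t(-987654321)}) {
    Array<uint8_t> buf;
    serializeSignedVarint(x, buf);
    ArrayView<uint8_t> view(buf);  // Assumed constructor.
    int64_t y = 0;
    // Decode it back, and check every byte was consumed.
    ASSERT(deserializeSignedVarint(view, &y) && y == x && view.empty());
  }
}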
Other primitive types
There’s no easy way of compressing floats and doubles, so we serialize them as the raw bytes (4 bytes and 8 bytes respectively).
Similarly, for 8-bit ints, it’s better to encode their single byte directly, rather than using a varint (though protobufs still use varints for these).
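Concretely, those specializations might look something like this sketch:
// Sketch: doubles are copied as their 8 raw bytes.
template <> struct SerializeImpl<double> {
  static void push(double t, Array<uint8_t>& o) {
    uint8_t raw[8];
    memcpy(raw, &t, 8);  // Assumes a fixed endianness.
    for (uint8_t byte : raw) o.add(byte);
  }
  static bool pull(ArrayView<uint8_t>& b, double* t) {
    if (b.size() < 8) return false;
    uint8_t raw[8];
    for (uint8_t& byte : raw) byte = b.pull();
    memcpy(t, raw, 8);
    return true;
  }
};
// Sketch: 8-bit ints are a single byte, no varint.
template <> struct SerializeImpl<uint8_t> {
  static void push(uint8_t t, Array<uint8_t>& o) { o.add(t); }
  static bool pull(ArrayView<uint8_t>& b, uint8_t* t) {
    if (b.empty()) return false;
    *t = b.pull();
    return true;
  }
};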
Polymorphism
I’m not going to go into too much detail about how to handle polymorphism because I didn’t need it in this case. But the basic idea is that when you’re serializing a pointer to the base object, simply prefix it with an int ID indicating its runtime type.
These type IDs don’t need to be globally unique, just unique among the child types of each base type. The deserializer knows that it’s deserializing a child of the base type, so it just needs to be able to identify which child type to instantiate.
Unfortunately this doesn’t play nicely with the reporter pattern I used earlier. You can’t make the report method virtual because it’s templated. The way I solved this was to have a method on the base class that directly calls the child class’s report method:
class Animal {
// Child classes must implement this and return a unique ID.
virtual uint64_t runtimeType() = 0;
template <typename Reporter> void polyReport(Reporter& reporter) {
switch (runtimeType()) {
case kCat:
return ((Cat*)this)->report(reporter);
case kDog:
return ((Dog*)this)->report(reporter);
...
}
}
};
Boilerplate like this should be auto generated using a macro list:
// Calls macro F for each child class of Animal.
#define ALL_ANIMALS(F) \
F(Cat) \
F(Dog) \
...
class Animal {
template <typename Reporter> void polyReport(Reporter& reporter) {
switch (runtimeType()) {
#define CASE(T) \
case k##T: \
return ((T*)this)->report(reporter);
ALL_ANIMALS(CASE)
#undef CASE
}
}
};
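With that in place, a pointer specialization might look something like the following sketch. makeAnimal is a hypothetical factory (generated from the same macro list), and ownership and error handling are elided:
// Sketch: the factory can be generated from the same macro list.
Animal* makeAnimal(uint64_t type) {
  switch (type) {
#define CASE(T) case k##T: return new T();
    ALL_ANIMALS(CASE)
#undef CASE
  }
  return nullptr;  // Unknown type ID.
}
// Sketch: pointer fields are prefixed with the runtime type ID.
template <> struct SerializeImpl<Animal*> {
  static void push(Animal* t, Array<uint8_t>& o) {
    serializeVarint(t->runtimeType(), o);
    StructSerializeReporter r(o);
    t->polyReport(r);
  }
  static bool pull(ArrayView<uint8_t>& b, Animal** t) {
    uint64_t type = 0;
    if (!deserializeVarint(b, &type)) return false;
    *t = makeAnimal(type);
    if (*t == nullptr) return false;
    StructDeserializeReporter r(b);
    (*t)->polyReport(r);
    return r.ok;
  }
};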
Backwards and Forwards Compatibility
Compatibility is the main complexity of this format. The data I’m storing is the save files for a game. If I push out an update to the game that changes these data structures, the user’s existing save files need to continue working.
So I need to be able to add or remove fields from a struct, and still deserialize older serializations.
The way protobufs do this is to give each field an ID (unique within one struct, across all versions of the struct). During deserialization, these IDs are matched up to the IDs of the declared fields of the struct. Any extra fields in the serialization are ignored, and any fields that don’t have a serialized value are set to their default value.
In my case the struct reporter pattern must be updated to include a field ID:
struct Foo {
uint64_t x;
uint32_t y;
template <typename Reporter> void report(Reporter& reporter) {
reporter(0, x);
reporter(1, y);
}
};
These IDs are manually assigned by the developer, starting from 0 and increasing by 1. Every field needs a different ID, and if a field is deleted its ID must never be reused. So it’s good practice to leave a comment when a field is deleted, to avoid reusing the ID.
I also mandate that the fields must be reported to the reporter in order of increasing ID (though the order that the fields are reported doesn’t necessarily need to match the order that they’re declared). This makes it easy to verify that the IDs are unique, without the need to maintain a set of seen IDs in the serializer. It also has a bunch of other benefits I’ll talk about later.
Now when we serialize the struct, we just prefix each field with its ID, right?
Foo{ x: 7, y: 9} is serialized to:
0x00 07 01 09
| x | | y |
Well unfortunately it’s not quite that simple. If fields were only ever added to a struct, this would work fine.
struct Foo {
uint64_t x;
uint32_t y;
uint32_t z; // New field.
template <typename Reporter> void report(Reporter& reporter) {
reporter(0, x);
reporter(1, y);
reporter(2, z);
}
};
Deserializing the 0x00070109 example above, we find the reported x and y fields, but we’re missing the z field so we set it to 0. So far so good (actually if we only ever added fields we wouldn’t need the field IDs at all).
The problem arises when deleting a field. Let’s take a different example struct that includes 8-bit ints:
struct Bar {
uint32_t a;
uint8_t b;
uint8_t c;
template <typename Reporter> void report(Reporter& reporter) {
reporter(0, a);
reporter(1, b);
reporter(2, c);
}
};
I mentioned above that I’m encoding 8-bit ints like b and c directly, rather than using varints, so the serialization looks like this:
Bar{ a: 129, b: 255, c: 6} is serialized to:
0x00 80 01 01 FF 02 06
| a | | b | | c |
Now let’s say we delete the b field:
struct Bar {
uint32_t a;
uint8_t c;
template <typename Reporter> void report(Reporter& reporter) {
// Deleted field IDs (DO NOT REUSE): 1
reporter(0, a);
reporter(2, c);
}
};
So how do we deserialize 0x00800101FF0206 now? We see field 0, which matches a, so we deserialize the varint, which leaves 0x01FF0206. Now we see field 1, which doesn’t match any known field. This is a problem because we don’t know how to deserialize the following bytes. Is 0xFF a one-byte 255, or is 0xFF02 a two-byte varint 16386? Without knowing the type of the missing field, it’s impossible to know.
I considered adding a method to the reporter to tell it the type of any deleted fields:
struct Bar {
uint32_t a;
uint8_t c;
template <typename Reporter> void report(Reporter& reporter) {
reporter(0, a);
reporter.skip<uint8_t>(1);
reporter(2, c);
}
};
But this is pretty awkward, and it doesn’t handle forward compatibility: if an older version of the program loads a newer serialized file, the older program won’t have a reporter call (or a skip) for the new field at all.
Protobufs solve this by adding an additional few bits to the field ID, to encode the type of the field. I’ve used the same solution. The field type doesn’t describe the full C++ type. It just provides enough information to know how to skip over the serialized field if we don’t recognize it. For example, we don’t need to be able to distinguish between a uint64 and an int32, since they’re both varints.
Specifically, I add 2 bits to the field ID, which represent the encoding type:
kByte = 0; // 1 byte, fixed width. Eg uint8_t, bool.
kOctet = 1; // 8 bytes, fixed width. Eg double, float.
kVarint = 2; // Varint. Eg uint64_t, int16_t.
kSized = 3; // Varint size, then that many bytes. Eg struct, Array.
I made a few small sacrifices to keep these encoding types within 2 bits:
- I’m encoding floats as kOctet by first casting them to doubles. This is inefficient, but I don’t think I’ll really need to use many floats.
- If I add fixed width uint64s, they can be encoded as kOctet. I don’t think I’ll ever need fixed width uint32s or uint16s.
- The size prefix of kSized is always in bytes. For arrays, this is a departure from how I encoded them earlier. If we just encoded the number of elements, we’d need to know the type of the elements to be able to skip over them. Perhaps we could also store the encoding type of the elements, but this would drastically increase the computational cost of skipping over the array, because we’d need to skip each element of the array individually (unless the elements were kByte or kOctet).
Protobufs store arrays as repeated fields. That is, if you have an array field with ID 3, each element will be serialized directly in the parent struct as if it were field number 3. The struct serialization will contain multiple items labeled field 3, and the deserializer will append each to the array. I’m not sure why it’s designed this way. The length prefix approach is simpler, faster, more compact, and more general purpose (for example, you can’t represent an array of arrays using the repeated field approach). Newer protobuf versions support “packed” repeated fields that use the length prefix approach, but only for arrays of primitive types. If anyone knows why protobufs work this way, let me know in the comments.
Nested structs (eg if Foo has a field of type Bar) also get this varint size prefix, so that we can skip over the field without knowing the type of the struct.
Another option I considered for nested structs is to use an end tag. The idea is to add another encoding type, kEnd, and once all the nested struct’s fields are done, add a sentinel field with the kEnd type. This has the advantage that the kEnd tag is always 1 byte, whereas the size prefix will be more than 1 byte if the struct is larger than 127 bytes.
The downside of end tags is that we have to parse every field of the struct in order to skip over it. It’s not enough to look for the kEnd byte, because that may appear inside the struct. Worse, if there are multiple nested structs, we have to keep track of how many levels deep we are. This would really complicate deserialization. This is how protobufs used to work, but now they also use a size prefix for nested structs.
Implementation Details and Optimizations
The size prefixes used by arrays and structs are a bit tricky to handle efficiently. I want to do serialization in a single pass, and I don’t want to allocate a temporary buffer. I don’t know the size of the serialized object until after it’s serialized, and the size prefix is a varint, so I can’t just insert 4 or 8 placeholder bytes for the size and fill in the value afterwards.
Instead, I serialize the object, then I serialize its size (the difference between the buffer’s initial size and its current size). Then I rotate the end of the buffer so that the size comes before the object. For example:
First, serialize like this:
[existing serialization][object, N bytes][size, M bytes]
Then rotate the last N+M bytes to the right by M bytes:
[existing serialization][size, M bytes][object, N bytes]
This sort of array rotation can be implemented efficiently, with no need to allocate a secondary buffer.
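The classic trick is the three-reversal rotation: reverse each half, then reverse the whole range. Here’s a sketch of rotateSubArray in those terms, using the reverseSubArray helper from the varint code (the real one below is a member function on Array; this free function is just to show the idea):
// Rotate o[begin, end) right by k bytes, in place, using three reversals.
void rotateSubArray(Array<uint8_t>& o, size_t begin, size_t end, size_t k) {
  o.reverseSubArray(begin, end - k);  // Reverse the object bytes.
  o.reverseSubArray(end - k, end);    // Reverse the size bytes.
  o.reverseSubArray(begin, end);      // Reverse everything: [size][object].
}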
Instead of adding fixed width uint64s as a core type, like protobuf’s fixed64 type, we can just have the serializer decide at runtime whether to use kOctet or kVarint, based on whichever will be more compact. In fact, it could also select kByte if the value is small enough. This is not an optimization I’ve actually implemented yet though, as I’d need to restructure my code a bit.
Another small optimization I can do is to omit any default valued fields from the serialization. They’ll be set to the default during deserialization anyway.
Also, since I mandate that the field IDs must be reported in increasing order, I can delta encode them. For example, once I’ve seen field number 7, I know the next field must be at least 8, so I just encode how many IDs beyond 8 it is (so if it’s 10 I encode it as 2). This means that it’s very unlikely that a field header will ever need more than 1 byte (we’d have to be skipping 32 fields at once).
So now our encoded Bar object looks like this:
Bar{ a: 129, b: 255, c: 6} is serialized to:
0x02 80 01 00 FF 00 06
| a | | b | | c |
The main reason I require that the IDs are reported in increasing order is that it makes deserialization much simpler. It means that no matter how many fields are missing from the serialization, or from the struct, we can deserialize in a single pass. We also avoid the need to store auxiliary data, such as a map from field ID to byte offset.
Instead we simultaneously iterate through the serialization and the struct, in a manner reminiscent of the merge step of merge sort. We keep track of the IDs of the current serialization field and the current struct field. If the serialization field ID is smaller, we advance through the serialization (using the encoding type to skip the field). If the struct field ID is smaller, we advance through the struct (just setting the skipped field to its default value). And if they match, then we deserialize the field.
With all that in mind, the struct serialization code now looks like this:
// All SerializeImpl<T> implementations now have a kEncoding constant
// that contains the encoding type for that T.
namespace encoding {
static constexpr uint64_t kByte = 0;
static constexpr uint64_t kOctet = 1;
static constexpr uint64_t kVarint = 2;
static constexpr uint64_t kSized = 3;
// Number of header bits needed to store all the encoding types.
static constexpr uint64_t kBits = 2;
static constexpr uint64_t kMask = (1 << kBits) - 1;
} // namespace encoding
// StructSerializeReporter serializes each reported field, prefixed with
// the field header. Default valued fields are omitted.
struct StructSerializeReporter {
static constexpr uint64_t kNoId = -1;
Array<uint8_t>& o;
uint64_t prevId = kNoId;
StructSerializeReporter(Array<uint8_t>& o) : o(o) {}
template <typename T> void operator()(uint64_t id, const T& t) {
ASSERT(prevId == kNoId || id > prevId);
// Only serialize non-default valued fields.
if (t != T()) {
// Delta encode all field IDs (except the first one).
uint64_t deltaId = (prevId == kNoId) ? id : id - prevId - 1;
// Put the field encoding in the last 2 bits of the ID.
uint64_t head =
SerializeImpl<T>::kEncoding | (deltaId << encoding::kBits);
// Serialize the header then the T.
serializeVarint(head, o);
SerializeImpl<T>::push(t, o);
prevId = id;
}
}
};
// StructDeserializeReporter walks through the serialized fields and the
// struct fields, deserializing any fields with matching IDs.
struct StructDeserializeReporter {
static constexpr uint64_t kNoId = -1;
ArrayView<uint8_t> b;
uint64_t prevId = kNoId;
// nextId always contains the id of the field sitting at the start of b,
// and nextEncoding contains its encoding type.
uint64_t nextId;
uint64_t nextEncoding;
bool ok = true;
StructDeserializeReporter(ArrayView<uint8_t> b) : b(b) { loadNextId(); }
template <typename T> void operator()(uint64_t id, T& t) {
while (nextId < id) {
// Skip fields until we match or exceed id.
skipField();
loadNextId();
}
if (nextId > id) {
// We never found a matching ID. Set t to the default value.
t = T();
return;
}
// Is the field encoding correct?
if (nextEncoding != SerializeImpl<T>::kEncoding) {
ok = false;
return;
}
// Found the field. Deserialize it.
ok = deserialize(b, &t);
loadNextId();
}
void loadNextId() {
if (!ok || b.empty()) {
// Nothing left to deserialize.
nextId = kNoId;
return;
}
uint64_t head = 0;
ok = deserializeVarint(b, &head);
if (!ok) return;
nextEncoding = head & encoding::kMask;
uint64_t deltaId = head >> encoding::kBits;
nextId = (prevId == kNoId) ? deltaId : deltaId + prevId + 1;
prevId = nextId;
}
void skipField() {
if (!ok) return;
// nextEncoding tells us how to skip the field.
if (nextEncoding == encoding::kByte) {
// Skip one byte.
if ((ok = b.size() >= 1)) b += 1;
} else if (nextEncoding == encoding::kOctet) {
// Skip 8 bytes.
if ((ok = b.size() >= 8)) b += 8;
} else if (nextEncoding == encoding::kVarint) {
// Skip bytes until we find one that doesn't have the high bit set.
ok = skipVarint(b);
} else {
ASSERT(nextEncoding == encoding::kSized);
// Read the size prefix, then skip that many bytes.
uint64_t size;
if ((ok = deserializeVarint(b, &size))) {
if ((ok = b.size() >= size)) b += size;
}
}
}
};
template <typename T> struct SerializeImpl {
static constexpr uint64_t kEncoding = encoding::kSized;
static void push(const T& t, Array<uint8_t>& o) {
// The kSized format is the size followed by the object bytes. But we
// don't know the size yet, and the size itself is a varint.
size_t sizeBefore = o.size();
// So first we serialize the object.
StructSerializeReporter r(o);
    // As before, report is non-const, so cast away the const.
    const_cast<T&>(t).report(r);
size_t sizeBetween = o.size();
// Then we serialize the size.
serializeVarint(sizeBetween - sizeBefore, o);
size_t sizeAfter = o.size();
// Then we rotate the buffer so that the size is now a prefix.
o.rotateSubArray(sizeBefore, sizeAfter, sizeAfter - sizeBetween);
}
static bool pull(ArrayView<uint8_t>& b, T* t) {
size_t size = 0;
// Deserialize the size.
if (!deserializeVarint(b, &size)) return false;
if (b.size() < size) return false;
// Remove the first size bytes from b. c now contains every byte
// of the serialized struct, and b has been advanced just past it.
ArrayView<uint8_t> c = b.pull(size);
StructDeserializeReporter d(c);
t->report(d);
return d.ok;
}
};
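For completeness, a round trip through the final format looks something like this (a sketch, reusing the Foo struct with field IDs from earlier, and again assuming ArrayView is constructible from an Array):
void example() {
  Foo in;
  in.x = 7;
  in.y = 9;
  Array<uint8_t> buf;
  serialize(in, buf);  // Size prefix, then delta-encoded field headers.
  Foo out;
  ArrayView<uint8_t> view(buf);  // Assumed constructor.
  ASSERT(deserialize(view, &out));
  ASSERT(out.x == 7 && out.y == 9);
}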
The real implementation is pretty well tested, but there may be mistakes in the above code, as I had to make a lot of edits to remove all my idiosyncrasies 😅
That’s all for today.