Indeed, I feel like much of this is self-inflicted and could be avoided completely if only more developers would learn to use far more compact and efficient binary formats instead, where parsing (if it could even be called that) is just a natural consequence of reading the data itself.
One of my favourite examples of this is numerical values, which is notably called out in the performance results for this library as being one of the slowest paths.
In a textual format you need to read digits and accumulate them, and JSON allows floating-point values, which are even more complex to parse correctly.
In a binary format, you read n bytes and interpret them as an integer or a floating-point value directly. One machine instruction instead of dozens if not hundreds or more.
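To make that concrete, here's a minimal sketch in Rust (assuming a little-endian fixed-width encoding; the function names are just illustrative):

```rust
// Decoding a 64-bit integer from a binary buffer vs. from a decimal string.
// Sketch only: assumes little-endian encoding and valid input.
fn read_u64_binary(buf: &[u8]) -> u64 {
    // Fixed-width binary: copy 8 bytes and reinterpret.
    // Compiles down to (roughly) a single load on most targets.
    u64::from_le_bytes(buf[..8].try_into().unwrap())
}

fn read_u64_text(s: &str) -> u64 {
    // Textual: walk each digit and multiply-accumulate (no validation here;
    // real parsers also check for overflow and invalid characters).
    s.bytes().fold(0u64, |acc, b| acc * 10 + (b - b'0') as u64)
}

fn main() {
    let buf = 1234567890u64.to_le_bytes();
    assert_eq!(read_u64_binary(&buf), 1234567890);
    assert_eq!(read_u64_text("1234567890"), 1234567890);
}
```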
I see such inefficiencies so often that I wonder if most developers even know how to read/write binary formats anymore...
"The fastest way to do something is to not do it at all."
Fun fact, JSON's number type isn't floating point. It may map to it, but the limits on the numbers that are representable are defined by the encoder and decoder.
Which creates this lovely reality where you effectively have numbers that can be accurately represented in IEEE float but can't be represented accurately in decimal notation, and numbers that can be represented accurately in decimal notation but not in IEEE float. All of those cases are dicey with JSON.
Well, those are all error conditions and not a number. NaNs explicitly say they aren't, and +/-Inf may or may not be infinity; all we know is that the value is outside the representable range of IEEE floating point.
IMHO, code should check for floating-point overflow just like it checks for integer overflow. If your inputs are finite and your output is infinite, that's similar to your inputs being positive and your output being negative: you should check for it and emit an error instead. Then infinity (like negative numbers) can have an actual semantic meaning, rather than just being the detritus of an error condition.
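A rough sketch of what that check could look like, in Rust (the function is hypothetical, not from any particular library):

```rust
// Treat an infinite result from finite inputs as an overflow error,
// analogous to checked integer arithmetic. Illustrative sketch only.
fn checked_mul_f64(a: f64, b: f64) -> Result<f64, &'static str> {
    let r = a * b;
    if a.is_finite() && b.is_finite() && r.is_infinite() {
        return Err("floating-point overflow");
    }
    Ok(r)
}

fn main() {
    assert!(checked_mul_f64(1e308, 10.0).is_err()); // overflows to +Inf
    assert_eq!(checked_mul_f64(2.0, 3.0), Ok(6.0));
}
```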
NaNs have an odd ontology, regardless of what they claim to be. For example, hypot(Inf, NaN) == Inf (C99), which suggests that NaN is "some number but we cannot know what it is".
For certain inputs, compression doesn't necessarily work out so well, because of the overhead introduced by the compression format itself. For example, "compressing" a small datagram may actually make it larger.
I also believe (but am not prepared to prove) that packed representations can, under the right circumstances, evade some information theoretic limits that constrain lossless compression, by engaging in a little honest cheating. Since you're using a purpose-specific format instead of creating a general-purpose compression scheme, you don't have to record every single bit of information in the blob itself. You can use a side channel - in the form of the format specification itself - to record some information.
Which is why you see a packed representation (perhaps with optional compression) in things like protocol buffers, whereas a format like Parquet that's meant to store large volumes of self-describing data will just go straight for compression.
Does lz4 really have low enough overhead that it can compress a single integer better than having a predetermined compression scheme for small numbers? Sure, most protos are not a single integer, but I doubt the average proto message is more than 1 KB.
Compression algos come with their own disadvantages. Varint encoding exploits the normal-distribution/Benford's-law-type reality that the vast majority of fixed-point numbers tend to be closer to zero than their upper bounds. As fast as lz4 is, encoding & decoding varint is way, way faster (a minimal varint sketch follows this comment). There are other advantages too... lz4 works at the block level, rather than the field level, so you have effects from surrounding fields and the need to decode all the fields/records at once. You also need a certain size block for it to really be useful...
There are cases where what you are saying makes a lot of sense, and for that you have things like FlatBuffers & CapnProto.
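For anyone who hasn't seen one up close, here's a minimal Rust sketch of a protobuf-style (LEB128) varint; names are illustrative:

```rust
// LEB128-style varint encode/decode: 7 data bits per byte,
// high bit set on all but the last byte.
fn encode_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut v = 0u64;
    for (i, &b) in buf.iter().enumerate().take(10) {
        v |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some((v, i + 1)); // value and bytes consumed
        }
    }
    None // truncated or overlong input
}

fn main() {
    let mut buf = Vec::new();
    encode_varint(300, &mut buf);
    assert_eq!(buf, [0xacu8, 0x02]); // small numbers stay small
    assert_eq!(decode_varint(&buf), Some((300, 2)));
}
```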
Varint is not compression. Proper compression looks for redundancy in data and eliminates it.
All a compression algorithm has to do to get similar advantages to varint is notice that 0-valued bytes are common and compress them. A simple Huffman coding will do that nicely.
Is varint actually "way, way faster" than Huffman? I don't think it's entirely clear -- it probably depends on the implementation. Protobuf-generated code inlines copies of varint encoding all over the place, which is better than not inlining, but still not very nice to the instruction cache. Huffman coding would process the entire buffer in one go and can be a much tighter loop. A lot of work has gone into optimizing Huffman coding, including with things like SIMD. You can't really leverage SIMD for Varints in Protobuf because the input is not a homogeneous array. Varint is known to be a very branchy format which is not so great for performance. Branchless Huffman is a thing.
You can probably do even better with an algorithm tailor-made to look for zeros. I took a crack at this with "packed" format in Cap'n Proto, which I implemented in a branchless way, but not with SIMD (I'm no good at assembly). It's been like 7 years since I did the benchmarks but IIRC capnp+packing turned out to be pretty similar to protobuf in both size and speed... but there are a lot of confounding factors there. Could be interesting to replace Varint with fixed-width encoding in Protobuf itself and then run packing on the output and see how that performs...
But it really doesn't seem obvious to me at all that Varint would be "much faster" than a matching compression algorithm. Do you have some data behind that or is it just a hunch?
Disclaimer: I'm not a compression expert. I am the author of Protobuf v2 though. My recollection is that the original designers of Protobuf were never very happy with varint encoding. It was a casual decision made early on that became impossible to change later.
Hmm, I guess, ironically, lz4 does not do a Huffman pass, unlike most other compression algorithms. It only does dictionary-matching, which I imagine has a tougher time squeezing the zeros out of fixed-width-encoded integers.
I wonder if this implies lz4 pairs well with Protobuf (since it does Varint) but that Cap'n Proto users should look at different algorithms (or maybe apply packing followed by lz4).
My experience is indeed that lz4 tends to work reasonably well on protobuf encoded data, which generally does not compress terribly well. I hadn't much thought about why, but your reasoning makes sense.
Protobuf varints are particularly annoying because they require searching for a terminating byte. A naive implementation needs to make 20 comparisons to decode an 8-byte int (checking the value of each byte, and for overflow in the input data).
UTF-8-style varints, where the encoded length can be determined entirely from the first byte, are much nicer.
Again, for small integers, this isn't really much of a disadvantage. You can find the terminating byte very efficiently. The bigger issue with varint tends to be that when you pack data into bytes you have a lot more unaligned memory access. You can work around this, but it increases the complexity of your decoder.
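One way to find the terminator without a per-byte branch; a Rust sketch that assumes at least 8 readable bytes and a varint no longer than 8 encoded bytes (full 10-byte varints need an extra step):

```rust
// Branchless-ish scan for the end of a protobuf-style varint: load 8 bytes
// and find the first byte whose continuation (high) bit is clear.
fn varint_len(buf: &[u8; 8]) -> usize {
    let word = u64::from_le_bytes(*buf);
    // Bit 7 of each byte is set where the continuation bit is *clear*.
    let terminators = !word & 0x8080_8080_8080_8080;
    (terminators.trailing_zeros() as usize) / 8 + 1
}

fn main() {
    // 300 encodes as [0xAC, 0x02]; remaining bytes belong to later fields.
    let buf = [0xacu8, 0x02, 0x01, 0, 0, 0, 0, 0];
    assert_eq!(varint_len(&buf), 2);
}
```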
Flatbuffers, Cap'nproto, and similar formats can pad everything else for alignment, at the cost of easily compressible nulls. For a lot of use cases, that's a good trade off.
> Varint is not compression. Proper compression looks for redundancy in data and eliminates it.
I'm not sure why you are choosing to make a semantic argument about an assertion I didn't make, but very well. Compression is any mechanism that allows you to encode information using fewer bits than the original representation. There are compression mechanisms that don't require there to be any redundancy within the data they are compressing (for example, with a fixed dictionary).
> Huffman coding would process the entire buffer in one go and can be a much tighter loop.
That's a very good point, except part of how lz4 pulls off the wonders that it pulls off is by not having an entropy encoding stage... so no Huffman coding going on there. You can optimize Huffman coding all you want, but it's still going to be slower than lz4's "not doing it" (hence why lz4 is so fast), which is in turn slower than varint.
You're right, you can't really leverage x86's various SIMD extensions for varints, because varints are usually less than four bytes long. On the other hand, you can use normal processor instructions to operate in parallel on a sequence of bytes packed into a 32-bit or 64-bit register. I agree that protobuf has some surprisingly untuned logic for this (in fairness, the three bits reserved for field identifiers do hamper taking advantage of it, but then those also screw up the use of SIMD instructions in general).
I agree that I found capnp+packing to be very competitive with protobuf for certain applications, as you had promised. However, as you said, there are confounding factors there.
The notion that varint is much faster is based on intuition rather than benchmarks. I've looked at the assembly for a varint decoder and compared it to lz4. While lz4 might be able to win out if you are trying to decode a long sequence of bytes, it's all over but the crying within a few instructions for simply decoding a small integer.
> My recollection is that the original designers of Protobuf were never very happy with varint encoding. It was a casual decision made early on that became impossible to change later.
That is my recollection as well (although a lot of the pain was around the ugly workaround for signed integers). Varint isn't the best thing, but it is a thing, and if used properly, it does yield advantages.
I'd love to hear more on that last part, if you care to share more details. I've always been a little curious if varints are worth the effort, at least as a default format. Seems like there's a decent chance that what you gain by shaving bytes, you lose in branch instructions.
This is why I love using serde bincode serialization. I can even deal with serde bincode binary blobs inside Rust-based wasm on the front end. It can save a ton of bytes on the wire, used appropriately.
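For reference, a minimal sketch of what that looks like, assuming serde derive and the bincode 1.x API (the struct and field names are made up):

```rust
// Cargo.toml (assumed): serde = { version = "1", features = ["derive"] }, bincode = "1"
use serde::{Deserialize, Serialize};

// Hypothetical payload type, just for illustration.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Point {
    x: f32,
    y: f32,
    label: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let p = Point { x: 1.0, y: 2.5, label: "origin-ish".into() };

    // Encode to a compact binary blob (no field names on the wire).
    let bytes = bincode::serialize(&p)?;

    // Decode it back; the same code can run in Rust-based wasm on the front end.
    let back: Point = bincode::deserialize(&bytes)?;
    assert_eq!(p, back);
    println!("{} bytes on the wire", bytes.len());
    Ok(())
}
```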
Having grown up with 8-bit home computers, everything about the web seems insanely inefficient to me. But text formats do have undeniable upsides. Best would be more portable, efficient formats like protobuf.
It might be worth noting that, in the words of Douglas Crockford, JSON wasn't even specified at first; it was "discovered."[0] As in, a small group of people at a startup wanted an object interchange format they could use in the browser in the early 2000s, found that JavaScript object literals worked well enough, and went with that. It only became a standard after the fact.
If the data you have requires more thought as to how to store it than "JavaScript's object notation is good enough," then perhaps a specialized binary format would work better.
Then again, nearly all APIs only accept and return JSON. JSON is the one object notation that gets all the attention, so most maintainers will focus their time making optimizations for reading or writing JSON. And if you don't like XML or YAML for some reason, then JSON is tempting to use as a config file format, although it critically lacks comments and heredocs.
Unfortunately, binary formats still have plenty of issues too, like usually requiring schemas (msgpack doesn't, but it's MUCH slower than something with a schema like flatbuffers). They also are fairly clunky for handling nested structured data (e.g., arrays of maps).