Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I think arithmetic coding is much simpler than the way most resources describe it (the hard part is making it efficient).

Consider this example: you want to transmit two completely random variables, both of which can have 5 states. The obvious way is to concatenate two bit fields of size ceil(log2(5)), so 3+3 = 6 bits.

But alternatively, you can count the total number of states possible for both variables together, 5*5 = 25 and encode it as a single integer of size ceil(log2(25)) = 5, so both variables can be stored with just 5 bits.

So we arrive at the idea that there can be a fractional number of bits, which we often round up to the nearest integer for simplicity (or, in practice, the nearest multiple of 8 since most protocols are based on bytes).

The other part is just assigning shorter sequences of bits to more common symbols, except, of course unlike in Huffman coding, our symbols can have a fractional number of bits. This allows them to match the actual symbol probabilities more closely. If your data is highly repetitive, you can fit dozens of (common) symbols per bit.

The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.



> The coolest part IMO is how easy it is to plug in custom models for symbol probabilities. Usually a simple counter for each symbol is enough, but you can go crazy and start predicting the next symbol based on previous ones.

PPM-based [0] compression can work quite well for text but on its own it's not enough to unseat the Lempel-Ziv family as the top general purpose compressor.

[0] https://en.wikipedia.org/wiki/Prediction_by_partial_matching


Yeah arithmetic coding on its own is not enough, typically you would have it as the last step, with dictionary and other compressors before it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: