
For certain inputs, compression doesn't necessarily work out so well, because of the overhead introduced by the compression format itself. For example, "compressing" a small datagram may actually make it larger.
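A quick way to see this is to run a tiny payload through Python's zlib (exact sizes vary with library and compression level, but the direction doesn't):

    import zlib

    payload = b'{"id":7,"ok":true}'       # 18-byte datagram
    compressed = zlib.compress(payload)

    # The 2-byte zlib header, the Adler-32 checksum, and the DEFLATE block
    # framing cost more than entropy coding can save on input this small.
    print(len(payload), len(compressed))  # 18 vs. ~26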

I also believe (but am not prepared to prove) that packed representations can, under the right circumstances, evade some information-theoretic limits that constrain lossless compression, by engaging in a little honest cheating. Since you're using a purpose-specific format instead of a general-purpose compression scheme, you don't have to record every single bit of information in the blob itself. You can use a side channel - the format specification itself - to carry some of that information.
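As a toy sketch of that idea (the record and layout here are made up for illustration): a self-describing encoding has to spell out field names and types in every message, while a packed layout can push all of that into the spec both sides already share:

    import json
    import struct

    record = {"temp_c": 21.5, "sensor_id": 4321, "ok": True}

    # Self-describing: names and types ride along with every message.
    as_json = json.dumps(record).encode()                  # ~47 bytes

    # Packed: the layout "<fH?" (float32, uint16, bool) lives only in the
    # format spec, so those bits never have to appear in the blob itself.
    as_packed = struct.pack("<fH?", record["temp_c"],
                            record["sensor_id"], record["ok"])  # 7 bytes

    print(len(as_json), len(as_packed))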

Which is why you see a packed representation (perhaps with optional compression layered on top) in things like Protocol Buffers, whereas a format like Parquet, which is meant to store large volumes of self-describing data, goes straight for compression.


