I need some ELI5 explaining to: colstores allow you to pick out just the cols of...

lmwnshn · on May 27, 2022

You might be interested in [0], one of the course readings for [1], which has video available ("Storage Models, Data Layout, & System Catalogs"). Specifically, the paper asks whether you can turn a row-store into a column-store just by vertically partitioning the schema, building more indexes, etc. The answer is no, and they go into various reasons why (late materialization 3x, compression 2x to 10x depending on whether the query is accessing sorted data, etc.).

[0] https://15721.courses.cs.cmu.edu/spring2020/papers/08-storag...

[1] https://15721.courses.cs.cmu.edu/spring2020/schedule.html

zasdffaa · on May 27, 2022

Now that is a bloody great answer! (too little karma to upvote, so my thanks instead).

eatonphil · on May 27, 2022

Try it out yourself experimentally. You'll see the difference quickly.

https://datastation.multiprocess.io/blog/2021-10-18-experime...

zasdffaa · on May 27, 2022

Might be missing something, but a test that uses an interpreted language that splashes its objects around in memory with pointers everywhere, uses JSON, and gets a 50% speedup (depending) doesn't look like a convincing test of anything.

eatonphil · on May 27, 2022

Do it in whatever language you want to prove the concepts! I'm not trying to convince you of the facts. I'm mentioning an approach to learning the facts yourself experimentally. :)
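
For anyone who wants a starting point, here is a minimal sketch of the two layouts, not a benchmark. The table, the column names (`ids`, `prices`, `names`), and the helper functions are all made up for illustration. It only shows that the same aggregate can be computed from either layout, and that the column version touches a single contiguous array instead of walking every record.

```python
import array

# Hypothetical table with three columns and n rows.
n = 1000

# Row layout: one tuple per record, all fields of a record kept together.
rows = [(i, i * 2.0, f"name{i}") for i in range(n)]

# Column layout: one sequence per column. The numeric columns are packed
# into contiguous typed arrays; the string column stays a plain list.
ids = array.array("q", (r[0] for r in rows))
prices = array.array("d", (r[1] for r in rows))
names = [r[2] for r in rows]

def sum_price_rows(rows):
    # Must visit every record and pick the price field out of each one.
    return sum(r[1] for r in rows)

def sum_price_cols(prices):
    # Scans only the price column; ids and names are never touched.
    return sum(prices)

row_total = sum_price_rows(rows)
col_total = sum_price_cols(prices)
```

To turn this into an experiment, wrap the two functions in `timeit` and grow `n` and the number of columns. How big the gap is will depend heavily on the language and the data, so treat any numbers you get as specific to your setup.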

nicoburns · on May 27, 2022

I doubt it's that alone (and a 10% factor is much more reasonable than a 1% factor). But I think another factor is that column stores usually compress the data in each column, which can be particularly effective for columns with a lot of repeated values.
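
To make the repeated-values point concrete, here is a rough sketch of run-length encoding, one simple scheme that benefits from runs of identical values. It is only an illustration: the `country` column and the `rle_encode` / `rle_decode` helpers are hypothetical, and real column stores may use other encodings entirely.

```python
from itertools import groupby

def rle_encode(column):
    # Collapse each run of identical adjacent values into (value, run_length).
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

def rle_decode(runs):
    # Expand (value, run_length) pairs back into the original column.
    return [value for value, count in runs for _ in range(count)]

# Hypothetical sorted column with many repeated values.
country = ["DE"] * 4 + ["FR"] * 3 + ["US"] * 5

runs = rle_encode(country)
```

Here 12 values collapse into 3 pairs. If the same values were interleaved instead of sorted, there would be far fewer long runs, which is one way to see why the benefit depends on whether the data is sorted.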

zasdffaa · on May 27, 2022

Missed compression, thanks!