
I need an ELI5 here: column stores let you read just the columns of data you need, rather than fetching the full row and discarding what you don't need.

If that alone explained the 100X speedup, the row would have to be storing 99% data that's of no interest to the query. That would be unlike pretty much every table structure and query I've ever seen. And in the cases where it does happen, such as when a fat blob is stored in the row, you hive the blobs off to a separate table and only join to it when the blob is wanted.

I'm missing something big here.
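
To make the column-projection idea in the question concrete, here is a minimal sketch in Python. The table, its column names and its values are all made up for illustration, and plain lists stand in for on-disk storage, so this only shows the shape of the two layouts, not real performance.

```python
# Hypothetical three-row table; names and values are invented for illustration.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.0},
    {"id": 3, "country": "DE", "amount": 5.5},
]

# Column layout of the same data: one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def sum_amount_row_store(rows):
    # Row layout: every full record is visited even though one field is used.
    return sum(r["amount"] for r in rows)


def sum_amount_column_store(columns):
    # Column layout: only the "amount" list is touched.
    return sum(columns["amount"])


def fraction_of_values_read(columns, needed):
    # Share of stored values a query touches if it reads only `needed` columns.
    total = sum(len(v) for v in columns.values())
    read = sum(len(columns[c]) for c in needed)
    return read / total
```

With three equally sized columns, a query on one column reads a third of the values, which is the kind of ratio the question is reasoning about: projection alone only gives 100X if the query needs about 1% of each row.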



You might be interested in [0], which is one of the course readings for [1], which has video available ("Storage Models, Data Layout, & System Catalogs"). Specifically, the paper asks whether you can turn a row store into a column store just by vertically partitioning the schema, building more indexes, etc. The answer is no, and they go into various reasons why (late materialization 3x, compression 2x to 10x depending on whether the query is accessing sorted data, etc.).

[0] https://15721.courses.cs.cmu.edu/spring2020/papers/08-storag...

[1] https://15721.courses.cs.cmu.edu/spring2020/schedule.html
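
For anyone unfamiliar with the term, late materialization can be sketched roughly as below. This is a toy illustration and not the paper's implementation: the `columns` table, the function names and the query (amounts where country is "DE") are all assumptions made up for the example.

```python
# Hypothetical column-oriented table; values are invented for illustration.
columns = {
    "id": [1, 2, 3, 4],
    "country": ["DE", "FR", "DE", "US"],
    "amount": [10.0, 20.0, 5.5, 7.0],
}


def early_materialization(columns, predicate_col, predicate_val, out_col):
    # Stitch every column back into full rows first, then filter the rows.
    names = list(columns)
    rows = [dict(zip(names, vals)) for vals in zip(*(columns[n] for n in names))]
    return [r[out_col] for r in rows if r[predicate_col] == predicate_val]


def late_materialization(columns, predicate_col, predicate_val, out_col):
    # Scan only the predicate column and remember the matching positions.
    positions = [
        i for i, v in enumerate(columns[predicate_col]) if v == predicate_val
    ]
    # Fetch the output column at those positions only; no full rows are built.
    out = columns[out_col]
    return [out[i] for i in positions]
```

Both functions return the same answer. The difference is that the late version never touches the `id` column and never builds row objects for rows that fail the filter.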


Now that is a bloody great answer! (too little karma to upvote, so my thanks instead).


Try it out yourself experimentally. You'll see the difference quickly.

https://datastation.multiprocess.io/blog/2021-10-18-experime...


Might be missing something, but using an interpreted language that splashes its objects around in memory with pointers everywhere, using json, and gets a 50% speedup, depending, doesn't look like a convincing test of anything.


Do it in whatever language you want to prove the concepts! I'm not trying to convince you of the facts. I'm suggesting an approach to learning the facts yourself, experimentally. :)


I doubt it's that alone (and a 10% factor is much more reasonable than a 1% factor). But I think another factor is that column stores usually compress the data in each column, which can be particularly effective for columns with a lot of repeated values.
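
As a rough illustration of why repeated values compress well, here is a minimal sketch that assumes run-length encoding as the scheme (real column stores use a mix of encodings, and I'm not claiming any particular one does exactly this). The function names and the sample column are made up.

```python
from itertools import groupby


def rle_encode(values):
    # Collapse each run of equal adjacent values into a (value, run_length) pair.
    return [(v, len(list(g))) for v, g in groupby(values)]


def rle_decode(runs):
    # Expand (value, run_length) pairs back into the original sequence.
    return [v for v, n in runs for _ in range(n)]


def count_equal(runs, target):
    # Count matches directly on the compressed form, without decoding.
    return sum(n for v, n in runs if v == target)


# Hypothetical column with long runs, as you'd get from sorted data.
country = ["DE"] * 4 + ["FR"] * 2 + ["DE"]
runs = rle_encode(country)
```

Seven stored values shrink to three runs here, and `count_equal` answers a simple filter by looking at three pairs instead of seven values. The longer the runs, the bigger the saving, which fits the point that sorted columns benefit most.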


Missed compression, thanks!



