ClickHouse and Scuba address this. The core idea is that the on-disk layout lets a scan open files (or otherwise access data) only for the columns the query actually references.
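As a toy illustration of that idea (all file names and helpers here are invented, not how either system actually stores data): each column gets its own file, so a scan never touches data for columns the query doesn't mention.

```python
import os
import tempfile

def write_table(dirpath, rows):
    # Collect per-column (row_index, value) pairs; sparse rows
    # simply contribute nothing to columns they lack.
    cols = {}
    for i, row in enumerate(rows):
        for name, value in row.items():
            cols.setdefault(name, []).append((i, value))
    # One file per column: the unit of I/O is a column, not a row.
    for name, values in cols.items():
        with open(os.path.join(dirpath, name + ".col"), "w") as f:
            for i, v in values:
                f.write(f"{i}\t{v}\n")

def scan(dirpath, columns):
    # Open only the requested columns' files; the cost of the scan
    # depends on the projection, not on how wide the table is.
    out = {}
    for name in columns:
        with open(os.path.join(dirpath, name + ".col")) as f:
            out[name] = [line.rstrip("\n").split("\t") for line in f]
    return out

with tempfile.TemporaryDirectory() as d:
    write_table(d, [{"ts": 1, "latency": 30}, {"ts": 2, "err": "timeout"}])
    result = scan(d, ["latency"])  # opens only latency.col
```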
ClickHouse and Scuba are extremely good at what they’re designed for: fast OLAP over relatively narrow schemas (dozens to hundreds of columns) with heavy aggregation.
The issue I kept running into was extreme width: tens or hundreds of thousands of columns per row, where metadata handling, query planning, and even column enumeration start to dominate.
In those cases, I found that pushing width this far forces very different tradeoffs (e.g. giving up joins and transactions, distributing columns instead of rows, and making SELECT projection part of the contract).
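One of those tradeoffs, distributing columns instead of rows, can be sketched roughly like this (shard count and function names are hypothetical): each shard owns a slice of the column space, so a query plan only involves the shards that own referenced columns.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count

def shard_for_column(name):
    # Hash the column name, not a row key: with extreme width,
    # each shard owns a subset of the columns rather than of the rows.
    digest = hashlib.sha256(name.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def plan_scan(columns):
    # Group the projected columns by owning shard; shards owning
    # none of the referenced columns are never contacted, which is
    # why SELECT projection becomes part of the contract.
    plan = {}
    for c in columns:
        plan.setdefault(shard_for_column(c), []).append(c)
    return plan

plan = plan_scan(["latency", "err"])
```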
If you’ve seen ClickHouse or Scuba used successfully at that kind of width, I’d genuinely be interested in the details.
Scuba could handle 100,000 columns, probably more. But yes, the model is that you have one table, you can only do self-joins, it's more or less append-only, and you're only accessing maybe dozens of columns in a single query.