
I can't find any evidence showing that OLAP means it is okay to lose data from unexpected shutdowns. How can you have correct analytics without a complete set of data?

> For good reason. It's not a simple matter of choosing one of two options. The choice has consequences: performance.

It is a simple matter, though. They can choose to sacrifice performance for data durability, which I suspect would not cost much since ClickHouse acts like an append log. It just seems that Yandex doesn't care much about durability since they are just using the database to store people's web traffic. They wouldn't care if some of that data is lost so they don't use fsync.
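To make the trade-off concrete, here is a minimal sketch of an append-only log writer where durability is a single switch. It is not how ClickHouse writes data; the function name `append_record` and the `durable` flag are hypothetical, and only `os.fsync` is a real call.

```python
import os


def append_record(path: str, record: bytes, durable: bool = True) -> None:
    """Append one record to a log file (hypothetical helper for illustration).

    With durable=True the call waits for os.fsync, trading write latency
    for the guarantee that the record survives an unexpected shutdown.
    """
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()  # move Python's userspace buffer into the OS page cache
        if durable:
            os.fsync(f.fileno())  # ask the OS to push the page cache to disk
```

With `durable=False` the write returns as soon as the OS has the bytes in its cache, which is faster but can lose the most recent records on a power failure.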



> I can't find any evidence showing that OLAP means it is okay to ...

OLAP also doesn't mean "be the source of truth of the data". You can have a separate source of truth of the "complete set of data" outside of your OLAP engine and load (and reload) data into your OLAP engine any time you're not sure if you have the "complete set of data" in it.

The important difference lies in how often one finds themselves in that situation. In OLAP, far more time is spent querying (i.e., reading) data than loading (i.e., writing) it and waiting for it to be durably saved (i.e., fsync-ed). Because of this imbalance, it makes sense to optimise for the common case and handle the other sub-optimally.

> They wouldn't care if some of that data is lost so they don't use fsync.

Or they can still care about data correctness and simply reload any data they suspect is or may be inconsistent in the rare case of an improper shutdown. It's not like they use ClickHouse as their primary data store.
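As a rough sketch of that reload idea, assuming the source of truth and the OLAP copy can both be compared partition by partition: the dictionaries below stand in for real stores, and `reload_suspect_partitions` is a hypothetical name, not a ClickHouse feature.

```python
from typing import Dict, List, Tuple

Rows = List[Tuple]


def reload_suspect_partitions(
    source: Dict[str, Rows], olap: Dict[str, Rows]
) -> List[str]:
    """Overwrite every OLAP partition that differs from the source of truth.

    Returns the names of the partitions that were reloaded.
    """
    reloaded = []
    for partition, rows in source.items():
        if olap.get(partition) != rows:
            olap[partition] = list(rows)  # re-ingest from the source of truth
            reloaded.append(partition)
    return reloaded
```

In practice the comparison would more likely be a row count or checksum per partition rather than a full equality check, but the shape is the same: detect, then reload.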



