As a software engineer who has built distributed systems, I can attest that building reliable software is really difficult, bordering on impossible.
The hardest part is probably handling and recovering from all possible failure scenarios. You have to assume the system can crash in the middle of any line of logic and ensure it can still recover gracefully: without skipping any work, and without re-processing work that has already completed (which would create duplicate records).
The specific challenge with distributed/partitioned systems is that atomicity is much harder to achieve, and the strategies for approximating it are complex and error-prone (e.g. two-phase commits, or using idempotency keys to avoid double-insertion). For complex database transactions involving several tables with a custom two-phase commit mechanism, you have to be careful to process records of different types in a specific order. You also need to set up your database indexes carefully to keep lookups and sorting fast.
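To make the idempotency point concrete, here is a minimal sketch of idempotent insertion using a caller-supplied idempotency key backed by a unique constraint, so that replaying an operation after a crash cannot create a duplicate row. The table name, column names, and `op_id` key are hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        op_id  TEXT PRIMARY KEY,   -- idempotency key chosen by the caller
        amount INTEGER NOT NULL
    )
""")

def record_payment(op_id: str, amount: int) -> None:
    # INSERT OR IGNORE makes retries safe: a second attempt with the
    # same op_id hits the primary-key constraint and becomes a no-op
    # instead of inserting a duplicate row.
    conn.execute(
        "INSERT OR IGNORE INTO payments (op_id, amount) VALUES (?, ?)",
        (op_id, amount),
    )
    conn.commit()

# Simulate a crash-and-retry: the same operation is applied twice.
record_payment("op-123", 500)
record_payment("op-123", 500)

count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # 1 — the retry did not create a duplicate
```

The same idea carries over to other databases (e.g. `INSERT ... ON CONFLICT DO NOTHING` in PostgreSQL); the key design choice is that the client, not the database, generates the operation ID, so a retry after a lost acknowledgement reuses the same key.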