
Lots of concerns and scepticism in the discussions here. Any suggestions about good, achievable data strategies and data architecture that work at enterprise level?


Require domain teams' code to communicate (with other domain teams and with the outside world) using the same pathways, schemas, and contracts that are used when extracting a domain team's data into a data lake.

Whether that data lake is semi-operated by the team (as proposed in the article) or operated centrally, requiring the lake's ETL process to use at least some of the APIs and tools used for transactional interaction goes a long way towards keeping the data architecture sane.
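A minimal sketch of that "one contract for both paths" idea. All names here (OrderRecord, to_api_json, to_lake_row) are hypothetical, not from the article; the point is that the transactional API and the lake extractor serialize through the same published schema, so they can't silently drift apart:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class OrderRecord:
    """The single schema the domain team publishes. Both the
    transactional API and the lake ETL serialize through it."""
    order_id: str
    customer_id: str
    total_cents: int

def to_api_json(record: OrderRecord) -> str:
    # What an internet-facing endpoint would return to a caller.
    return json.dumps(asdict(record))

def to_lake_row(record: OrderRecord) -> dict:
    # What the ETL job writes to the lake: same fields, same names,
    # derived from the same contract -- not from raw table columns.
    return asdict(record)
```

If the team renames or retypes a field, both consumers break in the same place at the same time, which is exactly the property you want.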

Resist the temptation to use things like RDBMS-level CDC/log stream capture or database snapshots for populating data lakes (RDS Aurora's snapshot export/restore is like methamphetamine in this area: incredibly fast and powerful, but with a very severe long-term cost to data lake uniformity and usability).

I'm not saying "every row in the data lake must be extracted by making the exact same API hit that an internet user would make, with all of the overhead that incurs". You can tap into the stack at a lower level than that (e.g. use the same DAOs that the user-facing APIs use, but skip the whole web layer). Just don't tap into the lowest possible layer of the stack for data lake ETL--even though that lowest layer is probably the quickest to get working and the most performant, it results in poor data hygiene over the medium and long term.
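A sketch of that DAO-level tap, under the same caveat: OrderDao, web_handler, and lake_etl are invented names for illustration. The ETL job calls the same data-access object as the user-facing handler, bypassing only the web layer (HTTP, auth, human-oriented pagination), so the lake sees exactly the rows the transactional path would have served:

```python
from typing import Iterator, Optional

class OrderDao:
    """Data-access object used by BOTH the web API and the lake ETL.
    The domain's read logic (filters, joins, soft-delete handling)
    lives here, in one place."""

    def __init__(self, rows: list[dict]):
        self._rows = rows  # stand-in for a real database connection

    def fetch_orders(self, since_id: Optional[str] = None) -> Iterator[dict]:
        for row in self._rows:
            if since_id is None or row["order_id"] > since_id:
                yield row

def web_handler(dao: OrderDao) -> list[dict]:
    # User-facing path: DAO plus web-layer concerns (elided here).
    return list(dao.fetch_orders())

def lake_etl(dao: OrderDao) -> list[dict]:
    # ETL path: same DAO, no web layer. Rows are guaranteed to match
    # what the transactional API would have returned.
    return list(dao.fetch_orders())
```

Contrast this with a CDC tap on the underlying tables: that path sees raw columns before the DAO's filtering and interpretation, so the lake quietly accumulates rows the application itself would never have exposed.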



