Nice! I like the focus on Postgres. Most ETL tools end up trying to build for a larger matrix of sources and targets, which limits the use of database-specific features and optimizations. Is the CDC built primarily on top of the logical replication / logical decoding infrastructure in Postgres? If so, what are the limitations in that infrastructure which you'd like to see addressed in future Postgres versions?
That is a really good question! A few of them that come to my mind:
1/ logical replication support for schema (DDL) changes
2/ a native logical replication plugin (other than wal2json) that is easier to read from the client side. pgoutput is fast, but reading/parsing it from the client side is not as straightforward.
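To illustrate the point about pgoutput: it is a binary wire protocol, so the client has to hand-decode every message, whereas wal2json hands you ready-to-use JSON. Here is a minimal sketch (not PeerDB's actual code) that parses a pgoutput `Begin` message per the documented format: a `'B'` byte, then a big-endian Int64 final LSN, Int64 commit timestamp (microseconds since 2000-01-01), and Int32 transaction ID:

```python
import struct
from datetime import datetime, timedelta, timezone

# Postgres timestamps are microseconds since 2000-01-01 00:00:00 UTC.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)

def parse_begin(msg: bytes) -> dict:
    """Parse a pgoutput Begin message:
    Byte1('B'), Int64 final LSN, Int64 commit timestamp, Int32 xid."""
    assert msg[0:1] == b"B", "not a Begin message"
    lsn, ts_us, xid = struct.unpack(">QQI", msg[1:21])
    return {
        # LSNs are conventionally printed as two hex halves, e.g. '1/2AF0AA8'
        "final_lsn": f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}",
        "commit_ts": PG_EPOCH + timedelta(microseconds=ts_us),
        "xid": xid,
    }

# Synthetic example message (hand-built, not from a real server):
example = b"B" + struct.pack(">QQI", (0x1 << 32) | 0x2AF0AA8, 0, 12345)
print(parse_begin(example))
# → {'final_lsn': '1/2AF0AA8', 'commit_ts': ..., 'xid': 12345}
```

And that is just one of roughly a dozen message types (Relation, Insert, Update, Delete, Commit, ...), each with its own layout, which is why a friendlier native plugin would lower the bar for clients.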
3/ improve decoding perf - I've observed pgoutput cap at 10-15k changes per sec for an average use case, and that is after a good amount of tuning (ex: logical_decoding_work_mem etc.). Enabling larger throughput - 50k+ changes per sec - would be great. This matters for Postgres, considering the diverse variety of workloads users are running. For example at Citus, I saw customers doing 500k rps (with COPY), and I am not sure logical replication can handle those cases.
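For context, the kind of tuning referred to above looks roughly like this (starting-point values for illustration, not PeerDB's exact recommendations - the right numbers depend on workload and available memory):

```ini
# postgresql.conf
wal_level = logical                # required for logical decoding
logical_decoding_work_mem = 256MB  # per-decoding-session memory before
                                   # spilling changes to disk (default 64MB)
max_wal_senders = 10               # walsender processes available for streaming
max_replication_slots = 10         # ceiling on replication slots
```

Raising `logical_decoding_work_mem` mainly helps large transactions, which otherwise spill to disk during decoding and slow everything down.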
4/ logical replication slots in remote storage. One big risk with slots is that they can grow in size (if not read promptly) and use up storage on the source. Allowing slots to be shipped to remote storage would really help. I think Oracle allows something like this, but I'm not 100% sure.
5/ logical decoding on standby. It is coming in Postgres 16! We will aim to support it in PeerDB right after it is available.
I can think of many more, but sharing a few top ones that came to my mind!