I just started using ksqlDB at work to build a data pipeline from Postgres to Elasticsearch, replacing our error-prone batch processes. So far it has been a really nice experience, though I don't have it running in production yet.
I'm a little concerned about how we will update the schema of the streams in production. At some point we will need to drop a stream and recreate it with a different schema, and I'm afraid we will miss some incoming data during the downtime. Confluent just released a migration tool, but I'm pretty sure it only handles adding/dropping columns, not changing the datatype of a column.
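To make the concern concrete, here's roughly the case I'm worried about (stream and column names are made up). Adding a column looks like it can be done in place, but changing a type seems to force a drop/recreate, which is where the downtime window comes from:

```sql
-- Adding a column: as far as I can tell this is an in-place change.
CREATE OR REPLACE STREAM orders (
  id BIGINT KEY,
  amount DOUBLE,
  currency VARCHAR   -- newly added column
) WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='AVRO');

-- Changing a column's type (e.g. DOUBLE -> DECIMAL) doesn't seem to be
-- supported in place; my understanding is it means drop + recreate,
-- and any events arriving in between are at risk:
DROP STREAM orders;
CREATE STREAM orders (
  id BIGINT KEY,
  amount DECIMAL(10, 2),
  currency VARCHAR
) WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='AVRO');
```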
There's a whole class of interesting problems related to query evolution, and it varies greatly depending on the "environment" you're interested in (see mjdrogalis' docs on updating a running query). Generally, the strategy ksqlDB takes at the moment is to validate which upgrades can be done in place and which cannot. For the former, ksqlDB "just does it"; for the latter, we are designing a mechanism to deploy topologies side by side and then atomically cut over once the new topology has caught up to the old one.
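As a sketch of the in-place path (hypothetical stream and query names), `CREATE OR REPLACE` on a persistent query is the mechanism for upgrades that validation deems safe, such as widening the projection:

```sql
-- Original persistent query:
CREATE STREAM enriched AS
  SELECT id, amount
  FROM orders
  EMIT CHANGES;

-- Later, an in-place upgrade adding a column to the output.
-- ksqlDB validates the change is compatible and swaps the query in place:
CREATE OR REPLACE STREAM enriched AS
  SELECT id, amount, currency
  FROM orders
  EMIT CHANGES;
```

Incompatible changes (e.g. changing a column's type or the grouping key) fail that validation today, which is what the side-by-side deployment mechanism is meant to address.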
There's an in-progress blog post that describes exactly this class of problems - keep an eye out for it!
I'm just going to throw out another problem since you guys are here :)
The current batch process joins some tables before sending data to Elasticsearch. This means the Debezium connector doesn't write all the data I need into Kafka. I was thinking I could create a materialized table in ksqlDB with infinite retention for the other Postgres tables I need to join on. Then, when an update streams in for the data I want in Elasticsearch, I can join against these tables.
The issue is that a stream-table join is only triggered when the stream side changes. This means that when the data in the tables changes, those updates will never make it to Elasticsearch.
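A sketch of the join I have in mind (names are made up) to illustrate the problem:

```sql
-- Stream of order events from Debezium, joined against a materialized
-- table of customers:
CREATE STREAM orders_enriched AS
  SELECT o.id, o.amount, c.name
  FROM orders_stream o
  JOIN customers_table c ON o.customer_id = c.id
  EMIT CHANGES;
-- A row is emitted only when orders_stream receives an event;
-- an update to customers_table alone produces nothing downstream,
-- so Elasticsearch keeps the stale customer name.
```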
I guess my only option is to join everything in our app and then produce the full message to the Elasticsearch sink topic?