Replicating a FlinkSQL ETL Enrichment Pattern with Timeplus Proton #957
Replies: 1 comment
-
Hi @fsiddiqui-mdsol, thank you for your interest in Timeplus Proton. CDC processing and joins are a typical use case in Timeplus Proton as well. Here are some brief answers to your questions above (restated below; a rough Proton SQL sketch follows them). Happy to discuss more via Slack chat or hop on a quick Zoom call.
Q#1: Does Timeplus Proton offer a built-in mechanism for handling CDC and managing state, similar to Flink's RocksDB-backed state and ROW_NUMBER() over a DISTINCT or upsert view?
Q#2: What is the idiomatic Timeplus Proton approach to performing stream-to-stream joins for enrichment on multiple Kafka topics?
Q#3: In Flink, even adding a new column breaks the state, which requires restarting the job from the beginning of the Kafka topic offsets with a blackhole sink; after the state is established, the savepoint is moved from the original job to another job. How do we handle this in Timeplus?
Q#4: We have savepoints configured to run a few times a day; in case of infra upgrades and patching, these savepoints are used. We maintain only a few days of savepoints on S3. What is the equivalent pattern in Proton?
Q#5: We have 300+ clients' data streaming into Kafka topics, going through a series of joins and business logic sourced from 15 tables/topics, with 59 vertices in the Flink DAG. For any change in the master tables, the entire transactional dataset is recomputed, causing backpressure. As a result, checkpoints/savepoints time out and the job eventually restarts. How is this handled in Proton?
Q#6: Intermediate state is a nightmare to crack, and upstream data in a Kafka topic is not guaranteed to be in sequence. For example, client1's data changes at T1 and client2's data changes at T2, but client2's data may be produced to Kafka earlier than client1's.
Q#7: Because of the large state and joins, we have to dedicate at least 30 GB of managed memory (RocksDB) out of a 40 GB TaskManager total, with 15 CPU cores per pod and local SSDs. How are memory allocation and tuning handled in Proton?
Q#8: How is observability done on Proton-based materialized views, e.g. backpressure, state size, checkpoint/savepoint failures, and throughput per operator?
Q#9: How do we write UDFs in Proton? Can we register an existing JAR file and expose its functions in Proton? Does it impact performance?
Q#10: Does Proton support a kafka-upsert connector as a sink to represent insert/update/delete?
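For illustration, a minimal sketch of what the latest-state-per-key plus enrichment-join pattern can look like in Proton SQL, using external streams and a versioned_kv stream as described in the Proton docs. All stream, topic, broker, and column names here are hypothetical placeholders, not a verified drop-in answer:

```sql
-- Read a CDC topic from Kafka as an external stream (placeholder names).
CREATE EXTERNAL STREAM table1_raw (
  id int64,
  name string
)
SETTINGS type = 'kafka',
         brokers = 'kafka:9092',
         topic = 'table1_cdc',
         data_format = 'JSONEachRow';

-- Keep only the latest row per primary key (upsert semantics),
-- playing the role of Flink's ROW_NUMBER() dedup view.
CREATE STREAM table1_latest (
  id int64,
  name string
)
PRIMARY KEY (id)
SETTINGS mode = 'versioned_kv';

-- Continuously load the raw CDC events into the keyed stream.
CREATE MATERIALIZED VIEW mv_table1_latest INTO table1_latest AS
SELECT id, name FROM table1_raw;

-- Enrichment join: joining an append stream against a versioned_kv
-- stream picks up the latest version of each key.
-- (table2_raw: another external stream, defined like table1_raw.)
SELECT t2.id, t2.t1_id, t1.name
FROM table2_raw AS t2
JOIN table1_latest AS t1 ON t2.t1_id = t1.id;
```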
-
We have an established ETL enrichment pattern in our FlinkSQL environment that we're looking to replicate using Timeplus Proton. Our current Flink pipeline processes change data capture (CDC) events from multiple Kafka topics, manages state via RocksDB, and performs stream-to-stream joins to produce a unified output.
The FlinkSQL pattern involves the following steps (a condensed sketch follows the list):
Ingesting CDC Data: Independent Kafka topics are consumed as Flink tables (table1, table2, table3, …, tablen).
State Management & Deduplication: We use ROW_NUMBER() partitioned by the primary key to handle late and out-of-order events, effectively creating a deduplicated, up-to-date view of each table's state.
Stream Enrichment: We perform stream-to-stream joins on these deduplicated views (table1_dist, table2_dist, table3_dist, …, tablen_dist) based on their logical relationships (t1.id = t2.t1_id, t2.t3_id = t3.id).
Sinking to a Target Topic: The final enriched dataset is inserted into a target Kafka topic. Some of the jobs produce data to an upsert-kafka connector with tombstone records.
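For concreteness, a condensed FlinkSQL sketch of the pattern above; table names, columns, and connector options are placeholders:

```sql
-- Source: one CDC topic consumed as a Flink table (options are placeholders).
CREATE TABLE table1 (
  id BIGINT,
  payload STRING,
  cdc_timestamp TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'table1_cdc',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);

-- Deduplicated, latest-state view per primary key.
CREATE VIEW table1_dist AS
SELECT id, payload, cdc_timestamp
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY id ORDER BY cdc_timestamp DESC
         ) AS rn
  FROM table1
)
WHERE rn = 1;

-- Enrichment join across the deduplicated views
-- (table2_dist/table3_dist defined analogously; enriched_sink elsewhere).
INSERT INTO enriched_sink
SELECT t1.id,
       t1.payload AS t1_payload,
       t2.payload AS t2_payload,
       t3.payload AS t3_payload
FROM table1_dist AS t1
JOIN table2_dist AS t2 ON t1.id = t2.t1_id
JOIN table3_dist AS t3 ON t2.t3_id = t3.id;
```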
Our goal is to understand how to implement this pattern efficiently in Timeplus Proton. Specifically, we'd like to know:
Q#1: Does Timeplus Proton offer a built-in mechanism for handling CDC and managing state, similar to Flink's RocksDB-backed state and ROW_NUMBER() over a DISTINCT or upsert view?
Q#2: What is the idiomatic Timeplus Proton approach to performing stream-to-stream joins for enrichment on multiple Kafka topics?
Q#3: In Flink, even adding a new column breaks the state, which requires restarting the job from the beginning of the Kafka topic offsets with a blackhole sink; after the state is established, the savepoint is moved from the original job to another job. How do we handle this in Timeplus?
Q#4: We have savepoints configured to run a few times a day; in case of infra upgrades and patching, these savepoints are used. We maintain only a few days of savepoints on S3. What is the equivalent pattern in Proton?
Q#5: We have 300+ clients' data streaming into Kafka topics, going through a series of joins and business logic sourced from 15 tables/topics, with 59 vertices in the Flink DAG. For any change in the master tables, the entire transactional dataset is recomputed, causing backpressure. As a result, checkpoints/savepoints time out and the job eventually restarts. How is this handled in Proton?
Q#6: Intermediate state is a nightmare to crack, and upstream data in a Kafka topic is not guaranteed to be in sequence. For example, client1's data changes at T1 and client2's data changes at T2, but client2's data may be produced to Kafka earlier than client1's; hence cdc_timestamp is not always largest at the largest offset. This forces us to disable watermarks in our Flink jobs. What is the pattern in Proton for such data behavior? (Flink 2.1 released some SQL to view data from state, but figuring out the operator name from FlinkSQL is still challenging.)
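For reference, FlinkSQL can express a "last arrived wins" dedup by ordering on a processing-time attribute instead of cdc_timestamp, which sidesteps the watermark dependence described above; a sketch with placeholder columns:

```sql
-- Keep the most recently *arrived* row per key, regardless of
-- cdc_timestamp. Assumes table1 declares: proc_time AS PROCTIME().
CREATE VIEW table1_arrival_dist AS
SELECT id, payload, cdc_timestamp
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY id ORDER BY proc_time DESC
         ) AS rn
  FROM table1
)
WHERE rn = 1;
```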
Q#7: Because of the large state and joins, we have to dedicate at least 30 GB of managed memory (RocksDB) out of a 40 GB TaskManager total, with 15 CPU cores per pod and local SSDs. How are memory allocation and tuning handled in Proton?
Q#8: How is observability done on Proton-based materialized views, e.g. backpressure, state size, checkpoint/savepoint failures, and throughput per operator?
Q#9: How do we write UDFs in Proton? Can we register an existing JAR file and expose its functions in Proton? Does it impact performance?
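For context on the Proton side: the Proton docs describe JavaScript-based local UDFs and remote UDFs over HTTP rather than JAR registration. A minimal sketch, with an arbitrary function name and body:

```sql
-- Minimal JavaScript UDF sketch per the Proton docs; the function
-- receives a batch of values as an array and returns an array.
CREATE OR REPLACE FUNCTION add_five(value float32)
RETURNS float32
LANGUAGE JAVASCRIPT AS $$
  function add_five(values) {
    // values: one entry per row in the batch
    return values.map(v => v + 5);
  }
$$;

-- Usage: SELECT add_five(amount) FROM some_stream;
```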
Q#10: Does Proton support a kafka-upsert connector as a sink to represent insert/update/delete?
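For reference, this is the Flink-side sink being asked about: an upsert-kafka table keyed on the primary key, where deletes are emitted as tombstones (the key with a null record value). Names and options are placeholders:

```sql
CREATE TABLE enriched_upsert (
  id BIGINT,
  t1_payload STRING,
  t2_payload STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'enriched_out',
  'properties.bootstrap.servers' = 'kafka:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```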
Any guidance on how Proton handles these complexities would be greatly appreciated.