Best practice for dynamic 'De-serialize From File' in a Producer/Consumer pattern #6545
I am implementing a Producer/Consumer pattern to optimize a fast-REST-to-slow-Oracle data movement. Producer pipeline: fetches paginated data from a REST API and uses the Serialize to File transform to dump pages to local .ser files. The problem / questions for the community: I have tested this pattern by writing out the raw REST API data via text output, but that uses a lot of disk space, so I pivoted to serialization since it offers much better performance and compression. I am just stuck on the last mile of deserializing the files 🤷♂️ Thank you for the inputs in advance.
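Outside Hop, the producer/consumer split described above can be sketched in plain Python. This is a minimal sketch, not the Hop implementation: the page source, file naming, and insert callback are all stand-ins, and Python's `pickle` plays the role of the Serialize to File / De-serialize From File transforms.

```python
import pathlib
import pickle


def produce_pages(fetch_page, spool_dir, source="customers"):
    """Producer: fetch paginated data and dump each page to a .ser file."""
    spool = pathlib.Path(spool_dir)
    files = []
    for page_no, rows in enumerate(fetch_page()):
        path = spool / f"{source}_page{page_no:05d}.ser"
        with open(path, "wb") as f:
            pickle.dump(rows, f)  # compact binary dump of the page
        files.append(path)
    return files


def consume_pages(files, insert_rows):
    """Consumer: de-serialize each file and push rows to the slow target."""
    total = 0
    for path in files:
        with open(path, "rb") as f:
            rows = pickle.load(f)
        insert_rows(rows)  # e.g. a batched Oracle INSERT in the real pipeline
        total += len(rows)
        path.unlink()  # reclaim transient disk space once the page is loaded
    return total
```

Because the producer only touches the spool directory and the consumer only reads completed files, the two sides can run at different speeds without blocking each other.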
Replies: 4 comments 3 replies
Hey, variables are fine, but I think you haven't got this quite right. Split it into two processes (see the Process 1 / Process 2 sketch below): super simple and scalable.
@dmainou - Yup, we have that working without serialization, but the raw flat files are 6-8 MB per API page, while the serialized binary file is just 350 KB! We need to scale this to hundreds of similar pipelines, so we are trying to be conservative about transient disk space needs. The performance with serialized files vs. raw text is also considerably better.
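For what it's worth, a size gap of this order is easy to reproduce outside Hop. A quick sketch comparing a JSON text dump against a compressed binary dump (the sample data is made up, and actual ratios will vary with the payload):

```python
import gzip
import json
import pickle

# Synthetic rows standing in for one REST API page.
rows = [
    {"id": i, "name": f"customer_{i}", "region": "EMEA"}
    for i in range(10_000)
]

# Raw flat-file style: JSON text on disk.
text_bytes = json.dumps(rows).encode("utf-8")

# Serialized + compressed binary, analogous to a compact .ser file.
binary_bytes = gzip.compress(pickle.dumps(rows))

print(len(text_bytes), len(binary_bytes))
```

On repetitive tabular payloads the compressed binary form is typically a small fraction of the text size, which is what makes the spool-to-disk step cheap enough to run at scale.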
@dmainou - Thanks for the tip on passing the variable as a parameter, which helped resolve it via the Pipeline Executor. We are now managing everything in a single workflow, performing both the source fetch and the parallel push to the database. This decouples the source fetch loop, which is faster than the target Oracle DB push, giving us a healthy, high-performance pipeline!
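The decoupling described here (a fast fetch loop feeding a slower Oracle push) maps onto a classic bounded-queue producer/consumer. A minimal Python sketch with threads; the page source and push callback are stand-ins for the REST fetch and the database insert:

```python
import queue
import threading


def run_decoupled(pages, push, queue_size=4):
    """Fast producer fills a bounded queue; a slow consumer drains it."""
    q = queue.Queue(maxsize=queue_size)  # bounded: applies backpressure
    SENTINEL = object()  # marks end of the producer's work

    def producer():
        for page in pages:
            q.put(page)  # blocks if the consumer has fallen behind
        q.put(SENTINEL)

    counts = []
    t = threading.Thread(target=producer)
    t.start()

    # Consumer loop runs until the sentinel arrives.
    while True:
        page = q.get()
        if page is SENTINEL:
            break
        push(page)  # the slow side, e.g. the Oracle insert
        counts.append(len(page))

    t.join()
    return sum(counts)
```

The bounded queue is the key design choice: it caps how far the fetch loop can run ahead of the database push, which bounds transient memory (or, in the file-based variant, disk) usage.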
@dmainou - Great suggestion, I will explore it further! The current job metrics just capture and print total rows processed vs. inserted vs. updated vs. deleted in the log for our stakeholder KPIs. Since we run all pipelines via Airflow + hop-run, we push all of it through the Airflow OpenTelemetry export and aggregate the data for KPIs. We did try direct auto-instrumentation of hop-run using the OpenTelemetry Java Agent, but since Hop uses custom logging it did not capture anything. I will definitely explore dispatching pipeline metrics to OTel.
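Until pipeline metrics flow to OTel directly, one stop-gap is scraping the printed KPI line out of the run log. A sketch assuming a hypothetical log line format; the pattern would need to be adapted to whatever your pipeline actually prints:

```python
import re

# Hypothetical KPI line format (an assumption, not Hop's actual log output):
#   rows processed=100 inserted=80 updated=15 deleted=5
KPI_RE = re.compile(
    r"rows processed=(?P<processed>\d+) inserted=(?P<inserted>\d+) "
    r"updated=(?P<updated>\d+) deleted=(?P<deleted>\d+)"
)


def extract_kpis(log_text):
    """Return the last KPI line in a run log as a dict of ints, or None."""
    last = None
    for m in KPI_RE.finditer(log_text):
        last = {k: int(v) for k, v in m.groupdict().items()}
    return last
```

Taking the last match means a log with per-transform intermediate counts still yields the final totals; the extracted dict can then be emitted as gauges or counters from the Airflow side.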

ok then
Process 1
  Pipeline 1
    P_DATA_SOURCE = customers (parameter)
    get current time
    create filename (e.g. P_DATA_SOURCE + "_" + timestamp + ".ser")
    pipeline executor passing filename as P_FILENAME (parameter)
  Pipeline 2
    P_FILENAME
    REST call -> landing/P_FILENAME
Process 2
  Pipeline 1
    get current time as batch_id
    /landing/filename_..json -> /archive/batch_id/filename_..json
    get list of files in /archive/batch_id/filename_.*.json
    pipeline executor passing filename as P_FILENAME (parameter)
  Pipeline 2
    P_FILENAME -> oracle
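The naming and batching steps in the sketch above can be expressed as two small helpers. A minimal sketch, assuming a `YYYYMMDDHHMMSS` timestamp format and `landing`/`archive` directory names, all of which are illustrative choices rather than fixed conventions:

```python
import datetime
import pathlib
import shutil


def make_filename(data_source, now=None):
    """Process 1: build P_DATA_SOURCE + "_" + timestamp + ".ser"."""
    now = now or datetime.datetime.now()
    return f"{data_source}_{now:%Y%m%d%H%M%S}.ser"


def archive_batch(landing, archive, batch_id):
    """Process 2: move landed files into /archive/<batch_id>/ and list them."""
    dest = pathlib.Path(archive) / batch_id
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(pathlib.Path(landing).glob("*.ser")):
        target = dest / path.name
        shutil.move(str(path), target)  # atomic claim of the file for this batch
        moved.append(target)
    return moved
```

Moving files into a per-batch archive directory before processing is what makes the two processes safe to run on independent schedules: once a file leaves the landing folder, the consumer owns it, and the producer can keep landing new files without a race.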