This is the source code that accompanies the solution: Deduplication of messages with Cloud PubSub and Cloud Dataflow. This sample code demonstrates three approaches for deduplication:
- PubSubIO: com.google.examples.dfdedup.DedupWithPubSubIO
- Distinct transform: com.google.examples.dfdedup.DedupWithDistinct
- Custom state-based deduplication: com.google.examples.dfdedup.DedupWithStateAndGC
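
To give a feel for the third approach, here is a minimal plain-Java sketch of state-based deduplication with garbage collection. It is not the code in DedupWithStateAndGC and does not use any Dataflow or Beam API. It assumes that "state" means remembering each message ID when it is first seen, and that "GC" means dropping remembered IDs once they are older than a retention window. The class name `SeenIdCache`, its methods, and the retention parameter are all hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical helper for illustration only; not a class from this repository. */
final class SeenIdCache {
    // Assumption: IDs older than this window can no longer arrive as duplicates.
    private final long retentionMillis;
    // Maps each message ID to the time it was first seen.
    private final Map<String, Long> firstSeenAt = new HashMap<>();

    SeenIdCache(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Returns true if the message should be emitted, false if it is a duplicate. */
    boolean firstSeen(String messageId, long nowMillis) {
        Long previous = firstSeenAt.get(messageId);
        if (previous != null && nowMillis - previous < retentionMillis) {
            return false; // same ID seen within the retention window
        }
        firstSeenAt.put(messageId, nowMillis);
        return true;
    }

    /** Drops IDs whose retention window has passed; returns how many were removed. */
    int collectGarbage(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = firstSeenAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= retentionMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    /** Number of IDs currently remembered. */
    int size() {
        return firstSeenAt.size();
    }
}
```

A caller would invoke `firstSeen` for every incoming message and `collectGarbage` periodically. Without the collection step the map would grow for as long as new IDs keep arriving, which is the problem the garbage collection is meant to address in this sketch.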
You can run the following end-to-end pipeline to explore deduplication behavior across all three approaches:
NOTE: If you're new to GCP, please see the quickstarts for Cloud PubSub, BigQuery, and Cloud Dataflow.
Use the schema files under bqschemas/ to create