Attach xrepl_origin_id to every DML CDC event#30493

Open
kgalieva wants to merge 4 commits into yugabyte:master from Shopify:origin_id

Conversation

@kgalieva kgalieva commented Feb 26, 2026

Summary

Attach xrepl_origin_id to every DML CDC event (INSERT/UPDATE/DELETE), not just COMMIT records.

Motivation

pg_replication_origin sets an integer origin_id per session. Today, this value only appears on COMMIT RowMessage records in CDC output. Individual DML events carry xrepl_origin_id = 0. This forces CDC consumers to buffer all DML events, wait for the COMMIT to learn the origin, then retroactively attribute it — impractical when consuming from independent per-tablet streams.

With this change, each DML record carries the origin_id immediately, enabling CDC consumers to route, filter, or tag events on the fly without buffering.

Changes

The xrepl_origin_id is already extracted from the WAL at every point where DML records are created — it just wasn't being set on the RowMessage. This PR adds that.

src/yb/cdc/cdcsdk_producer.cc

Single-shard path (PopulateCDCSDKWriteRecord): After extracting xrepl_origin_id from msg->write(), set it on the DML row_message immediately.

Multi-shard path (PopulateCDCSDKIntentRecord): Added xrepl_origin_id parameter to the function signature. The value was already available in the caller (ProcessIntents) but wasn't passed through. Now each DML row_message created from intents carries the origin_id.
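
The threading described above can be sketched with simplified stand-in types. This is not the code from cdcsdk_producer.cc: `RowMessageSketch`, `MakeRecord`, and `ProcessIntentsSketch` are hypothetical names, and leaving the field unset for origin 0 is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real types live in the CDC proto and producer code.
enum class Op { kInsert, kUpdate, kDelete, kCommit };

struct RowMessageSketch {
  Op op;
  bool has_xrepl_origin_id;
  uint32_t xrepl_origin_id;
};

// Builds one record and stamps the origin on it.
// Assumption: origin 0 (a local write) leaves the field unset.
RowMessageSketch MakeRecord(Op op, uint32_t xrepl_origin_id) {
  RowMessageSketch msg = {op, false, 0};
  if (xrepl_origin_id != 0) {
    msg.has_xrepl_origin_id = true;
    msg.xrepl_origin_id = xrepl_origin_id;
  }
  return msg;
}

// The caller already knows the origin. Passing it down lets every DML record
// built from an intent carry it, instead of only the trailing COMMIT record.
std::vector<RowMessageSketch> ProcessIntentsSketch(const std::vector<Op>& dml_ops,
                                                   uint32_t xrepl_origin_id) {
  std::vector<RowMessageSketch> records;
  for (Op op : dml_ops) {
    assert(op != Op::kCommit);  // Intents in this sketch are DML only.
    records.push_back(MakeRecord(op, xrepl_origin_id));
  }
  records.push_back(MakeRecord(Op::kCommit, xrepl_origin_id));
  return records;
}
```

Calling `ProcessIntentsSketch({Op::kInsert, Op::kUpdate}, 5)` yields two DML records and one COMMIT record, all with `xrepl_origin_id` set to 5.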

src/yb/cdc/cdc_service.proto

Updated comment on xrepl_origin_id field from "Only set for COMMIT Ops" to "Set on DML and COMMIT Ops". No wire format change.

Test plan

  • Existing test TestOriginId passes unchanged (the get_xrepl_origin_id lambda finds the first non-zero origin_id, which is now a DML record instead of COMMIT — same value)
  • New test TestOriginIdOnDMLRecords verifies:
    • Single-shard INSERT/UPDATE/DELETE records carry xrepl_origin_id
    • Multi-shard (explicit transaction) DML records carry xrepl_origin_id
    • COMMIT records still carry xrepl_origin_id (backwards compat)
    • Local writes (no origin) have origin_id 0/absent
  • End-to-end verified with CDC consumer app against a local YugabyteDB build — DML records show varying xrepl_origin_id per transaction
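
The checks listed above can be approximated by a small predicate. This is a sketch over stand-in types, not the test's actual lambda: `RowMessageSketch` and `OriginIdMatches` are hypothetical names, and the has/value pair only mirrors the shape of a proto2 optional field.

```cpp
#include <cassert>  // Used by callers that assert on the predicate.
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the CDC record types.
enum class Op { kDdl, kBegin, kInsert, kUpdate, kDelete, kCommit };

struct RowMessageSketch {
  Op op;
  bool has_xrepl_origin_id;
  uint32_t xrepl_origin_id;
};

// DML and COMMIT records are the ones expected to carry the origin here.
inline bool CarriesOrigin(Op op) {
  return op == Op::kInsert || op == Op::kUpdate || op == Op::kDelete ||
         op == Op::kCommit;
}

// Returns true when every DML and COMMIT record agrees with expected_origin_id.
// An expected value of 0 means a local write: the field may be absent or zero.
bool OriginIdMatches(const std::vector<RowMessageSketch>& records,
                     uint32_t expected_origin_id) {
  for (const RowMessageSketch& r : records) {
    if (!CarriesOrigin(r.op)) continue;
    if (expected_origin_id != 0) {
      if (!r.has_xrepl_origin_id || r.xrepl_origin_id != expected_origin_id) {
        return false;
      }
    } else if (r.has_xrepl_origin_id && r.xrepl_origin_id != 0) {
      return false;
    }
  }
  return true;
}
```

A test would call it once per batch, for example `assert(OriginIdMatches(records, 1))` after writes made under origin 1 and `assert(OriginIdMatches(records, 0))` after local writes.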

Note

Medium Risk
Changes CDC output semantics by adding an additional field on high-volume DML records, which could affect downstream consumers that assume it is only present on COMMIT, but the change is additive and tested.

Overview
CDCSDK RowMessage records for INSERT/UPDATE/DELETE now populate xrepl_origin_id immediately, for both the single-shard write path and the multi-shard intent path (by threading xrepl_origin_id into PopulateCDCSDKIntentRecord).

Adds an integration test TestOriginIdOnDMLRecords covering single-shard and multi-shard transactions and verifying local writes omit/zero the field; updates the proto comment to reflect the expanded semantics (no wire change).

Written by Cursor Bugbot for commit c63e4c2. This will update automatically on new commits.

@hari90
Contributor

hari90 commented Feb 26, 2026

@kgalieva thanks for putting this PR out.
What's the GitHub issue this is solving? The description mentions buffering of DMLs but that's the pg logical behavior. A full transaction is buffered and sent and only the COMMIT has the origin_id. Adding the id to the dmls does not change the buffering logic.

@hari90 hari90 self-requested a review February 26, 2026 14:59
@kgalieva
Author

> What's the GitHub issue this is solving? The description mentions buffering of DMLs but that's the pg logical behavior. A full transaction is buffered and sent and only the COMMIT has the origin_id. Adding the id to the dmls does not change the buffering logic.

You're right that in PG logical replication, the full transaction is buffered and decoded as a unit — the output plugin sees all changes together with the COMMIT metadata, so COMMIT-level origin_id is sufficient.

But CDCSDK consumers receive events as a stream of protobuf records via GetChanges RPCs. The consumer processes RowMessage records one at a time as they arrive. When the consumer's goal is to filter or route events by origin (e.g., discard events from a specific replication origin, or route events to different downstream systems based on origin), it has to hold every DML record in memory until the COMMIT record arrives to learn the origin_id — then go back and process/discard them.

With origin_id on every DML record, the consumer can make routing/filtering decisions immediately per-record. For large transactions this avoids buffering entirely.

This is a small change — the origin_id is already extracted from the WAL at every point where DML records are created, it just wasn't being set on the RowMessage.
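
To make the difference concrete, here is a rough consumer-side sketch. It assumes a single ordered stream and uses hypothetical names throughout (`Record`, `FilterByBuffering`, `FilterPerRecord`); it is not the CDCSDK client API. The first function holds DML records until COMMIT reveals the origin; the second decides per record.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a CDCSDK RowMessage.
enum class Op { kBegin, kInsert, kUpdate, kDelete, kCommit };

struct Record {
  Op op;
  uint32_t xrepl_origin_id;  // 0 means a local write (no replication origin).
  int payload;
};

inline bool IsDml(Op op) {
  return op == Op::kInsert || op == Op::kUpdate || op == Op::kDelete;
}

// Origin only on COMMIT: DML records wait in memory until COMMIT arrives.
// Returns the forwarded payloads; *max_buffered reports the peak buffer size.
std::vector<int> FilterByBuffering(const std::vector<Record>& stream,
                                   uint32_t skipped_origin,
                                   std::size_t* max_buffered) {
  assert(max_buffered != nullptr);
  std::vector<int> out;
  std::vector<int> pending;
  *max_buffered = 0;
  for (const Record& r : stream) {
    if (IsDml(r.op)) {
      pending.push_back(r.payload);
      *max_buffered = std::max(*max_buffered, pending.size());
    } else if (r.op == Op::kCommit) {
      if (r.xrepl_origin_id != skipped_origin) {
        out.insert(out.end(), pending.begin(), pending.end());
      }
      pending.clear();
    }
  }
  return out;
}

// Origin on every DML record: forward or drop each record as it arrives.
std::vector<int> FilterPerRecord(const std::vector<Record>& stream,
                                 uint32_t skipped_origin) {
  std::vector<int> out;
  for (const Record& r : stream) {
    if (IsDml(r.op) && r.xrepl_origin_id != skipped_origin) {
      out.push_back(r.payload);
    }
  }
  return out;
}
```

Both functions forward the same payloads for equivalent input, but the buffering version's memory use grows with transaction size, while the per-record version keeps no state between records.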

Member

@shishir2001-yb shishir2001-yb left a comment

Looks good to me. cc: @Sumukh-Phalgaonkar, can you take a look too?

@hari90 hari90 removed their request for review March 18, 2026 18:13
@austenLacy
Contributor

@Sumukh-Phalgaonkar have you had a chance to review this yet?


TEST_F(CDCSDKYsqlTest, TestOriginIdOnDMLRecords) {
ANNOTATE_UNPROTECTED_WRITE(FLAGS_yb_enable_cdc_consistent_snapshot_streams) = true;
ANNOTATE_UNPROTECTED_WRITE(FLAGS_cdc_populate_end_markers_transactions) = true;
Contributor

This is not needed since the flag is true by default.

ASSERT_EQ(tablets.size(), 1);
auto stream_id = ASSERT_RESULT(CreateConsistentSnapshotStream());

// Consume the initial schema record.
Contributor

What do you mean by "initial schema record"? Shouldn't the GetChanges response below contain the change records for the insert below?

Maybe you can assert that these records don't contain any origin id.

ASSERT_OK(conn.ExecuteFormat("INSERT INTO $0 VALUES (1, 100)", kTableName));
ASSERT_OK(conn.Fetch("SELECT pg_replication_origin_session_reset()"));
change_resp = ASSERT_RESULT(GetChangesFromCDC(stream_id, tablets, &cdc_sdk_checkpoint));
LOG(INFO) << "Single-shard INSERT: " << change_resp.ShortDebugString();
Contributor

We generally do not keep any INFO-level logs in the tests.

@Sumukh-Phalgaonkar
Contributor

With this change, the origin_id would be populated in the DMLs as well as the COMMIT messages. For the sake of completeness, I think we should add it in the BEGIN message as well.

@netlify

netlify bot commented Mar 23, 2026

Deploy Preview for infallible-bardeen-164bc9 ready!

Built without sensitive environment variables

Name Link
🔨 Latest commit 70e9b87
🔍 Latest deploy log https://app.netlify.com/projects/infallible-bardeen-164bc9/deploys/69c183ddf7b70300080bf71c
😎 Deploy Preview https://deploy-preview-30493--infallible-bardeen-164bc9.netlify.app

@kgalieva
Author

> With this change, the origin_id would be populated in the DMLs as well as the COMMIT messages. For the sake of completeness, I think we should add it in the BEGIN message as well.

Good suggestion. For multi-shard transactions this was straightforward — xrepl_origin_id is already available in ProcessIntents() before FillBeginRecord() is called. For single-shard transactions, I moved the xrepl_origin_id extraction from msg->write() earlier in PopulateCDCSDKWriteRecord(), before the FillBeginRecordForSingleShardTransaction() call. Both BEGIN functions now accept and set the origin_id. The test lambda and proto comment have been updated to reflect BEGIN as well. Done in the latest push.

ASSERT_EQ(record.row_message().xrepl_origin_id(), expected_origin_id)
<< "Wrong xrepl_origin_id on op=" << RowMessage::Op_Name(op);
} else {
// origin_id 0 means local — field should be absent or zero.
Member

Please replace the "—" with "-"

   Error  (TXT5) Bad Charset
    Source code should contain only ASCII bytes with ordinal decimal values
    between 32 and 126 inclusive, plus linefeed. Do not use UTF-8 or other
    multibyte charsets.

           13015               ASSERT_EQ(record.row_message().xrepl_origin_id(), expected_origin_id)
           13016                   << "Wrong xrepl_origin_id on op=" << RowMessage::Op_Name(op);
           13017             } else {
    >>>    13018               // origin_id 0 means local — field should be absent or zero.
           13019               ASSERT_TRUE(!record.row_message().has_xrepl_origin_id() ||
           13020                           record.row_message().xrepl_origin_id() == 0)
           13021                   << "Expected no xrepl_origin_id on op=" << RowMessage::Op_Name(op);

ASSERT_NO_FATAL_FAILURE(verify_origin_id_on_all_records(change_resp, 1));
cdc_sdk_checkpoint = change_resp.cdc_sdk_checkpoint();

// --- Local (no origin) path — verify origin_id is 0/absent ---
Member

Same as above, replace it with "-"

@shishir2001-yb
Member

I have triggered the unit test suite for this and will update with the results.

5 participants