@rasifr
Member

@rasifr rasifr commented Jan 5, 2026

When adding a third node in a Z0DAN sync scenario, the origin advancement
logic in spock.check_commit_timestamp_and_advance_slot was using
spock.lag_tracker to retrieve commit timestamps and convert them back to LSNs.
This approach no longer works because spock.progress is now an in-memory HTAB
instead of a catalog table, so lag_tracker doesn't retain historical
data after the SYNC process COPY operation.

Root Cause:
The procedure spock.create_disable_subscriptions_and_slots creates logical
slots on existing nodes (e.g., n2) when adding a new node (n3). In v5,
the commit LSN/timestamp from the source node (n1) was copied to n3 via
lag_tracker during SYNC, and spock.check_commit_timestamp_and_advance_slot
would use this to advance the origin. With the HTAB-based progress table, this
data is no longer available after COPY.

The Fix:

  1. Capture the LSN returned by pg_create_logical_replication_slot in create_disable_subscriptions_and_slots and store it in temp_sync_lsns
  2. Use this LSN directly in check_commit_timestamp_and_advance_slot to advance the origin, eliminating the dependency on lag_tracker and the timestamp→LSN conversion

This approach is more reliable because it uses the authoritative LSN from
slot creation - the exact point where the apply/sync process will begin
decoding when the subscription is enabled.
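A minimal Python sketch of the two steps above, not the actual PL/pgSQL from this PR. The dictionary stands in for temp_sync_lsns, the origin name "n2_to_n3" and the helper names are hypothetical, and the use of pg_replication_origin_advance is an assumption based on the description's wording ("advance the origin"); the procedure itself may call a different function.

```python
import re

# Textual pg_lsn form: two hexadecimal halves separated by a slash.
_LSN_RE = re.compile(r"^[0-9A-Fa-f]{1,8}/[0-9A-Fa-f]{1,8}$")


def record_slot_lsn(store: dict, origin_name: str, lsn: str) -> None:
    """Step 1: remember the LSN returned by slot creation for one origin."""
    lsn = lsn.strip()
    if not _LSN_RE.match(lsn):
        raise ValueError(f"not a valid LSN: {lsn!r}")
    store[origin_name] = lsn


def advance_origin_sql(store: dict, origin_name: str) -> str:
    """Step 2: build SQL that advances the origin to the stored LSN.

    No lag_tracker lookup and no timestamp-to-LSN conversion is involved.
    String interpolation keeps the sketch short; real code should bind
    parameters instead.
    """
    lsn = store[origin_name]  # KeyError if no slot LSN was recorded
    return (
        "SELECT pg_replication_origin_advance("
        f"'{origin_name}', '{lsn}'::pg_lsn)"
    )


sync_lsns = {}  # hypothetical stand-in for temp_sync_lsns
record_slot_lsn(sync_lsns, "n2_to_n3", "0/16B3748")
print(advance_origin_sql(sync_lsns, "n2_to_n3"))
```

Validating the LSN at capture time means a malformed value fails when the slot is created rather than later, when the origin is advanced.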

@ibrarahmad ibrarahmad self-requested a review January 5, 2026 16:05
@mason-sharp
Member

Can you please also include zodan.py?

@rasifr
Member Author

rasifr commented Jan 7, 2026

> Can you please also include zodan.py?

Done.

@rasifr rasifr force-pushed the task/SPOC-394/zodan-lsn-fix branch from 65e12fc to 1396792 Compare January 8, 2026 10:24
rasifr added 2 commits January 8, 2026 19:58
Apply the same fix from commit 86acd7b to zodan.py:
- Use LSN from pg_create_logical_replication_slot()
- Advance slot to stored commit LSN instead of querying lag_tracker
- Store both sync_lsn and commit_lsn for later use
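One way to keep both values together on the Python side, as a sketch only: the class and helper names below are hypothetical and are not taken from zodan.py. The conversion helper relies on the standard textual pg_lsn format, where the part before the slash is the high 32 bits and the part after is the low 32 bits, so stored LSNs can be compared numerically.

```python
from dataclasses import dataclass


def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '0/16B3748' to a 64-bit integer."""
    hi, lo = lsn.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


@dataclass(frozen=True)
class StoredLsns:
    """Hypothetical record holding both LSNs captured for one node."""
    sync_lsn: str
    commit_lsn: str


# Keyed by node name; the values here are made up for illustration.
stored = {"n2": StoredLsns(sync_lsn="0/16B3748", commit_lsn="0/16B3700")}

entry = stored["n2"]
print(lsn_to_int(entry.sync_lsn) - lsn_to_int(entry.commit_lsn))
```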
@rasifr rasifr force-pushed the task/SPOC-394/zodan-lsn-fix branch from 1396792 to 1db124d Compare January 8, 2026 15:00
Contributor

@ibrarahmad ibrarahmad left a comment

LGTM

else:
    # Parse the result to extract LSN (format: "slot_name|lsn")
    parts = result.split('|')
    lsn = parts[1].strip() if len(parts) > 1 else None
Member

@mason-sharp mason-sharp Jan 9, 2026

Can we just select lsn in the query on line 413 and simplify?
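A rough sketch of what that simplification could look like. The query on line 413 is not shown in this thread, so the SQL below is an assumption; the helper names are hypothetical, and 'spock_output' is assumed to be the output plugin name. pg_create_logical_replication_slot returns the columns slot_name and lsn, so selecting lsn alone removes the need to split on '|'.

```python
from typing import Optional


def parse_slot_and_lsn(result: str) -> Optional[str]:
    """Current approach: split 'slot_name|lsn' and keep the second field."""
    parts = result.split("|")
    return parts[1].strip() if len(parts) > 1 else None


def create_slot_lsn_sql(slot_name: str, plugin: str = "spock_output") -> str:
    """Suggested approach: have the query return only the lsn column.

    String interpolation keeps the sketch short; real code should quote
    or bind the slot name properly.
    """
    return (
        "SELECT lsn FROM pg_create_logical_replication_slot("
        f"'{slot_name}', '{plugin}')"
    )


def parse_lsn_only(result: str) -> Optional[str]:
    """With a single-column result, the trimmed output is already the LSN."""
    lsn = result.strip()
    return lsn or None


print(create_slot_lsn_sql("spk_example_slot"))
```

Both parsers yield the same LSN for equivalent input, but the single-column form has no delimiter to get wrong and no index to bounds-check.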
