Source database: PostgreSQL 11.4. Target database: PostgreSQL 17.7.
FATAL Failed to parse KEEPALIVE message: -- KEEPALIVE {"lsn":"0/1C772DA0","timestamp":""}
This happens because streamKeepalive() in ld_stream.c uses context->sendTime, which is only set by real Postgres replication keepalive packets. For synthetic keepalives, sendTime is 0, so stream_write_internal_message() writes JSON without a timestamp field, which propagates as an empty string through the transform and crashes the apply.
Once the broken .sql file is on disk, restarts keep crashing on the same line, and replay_lsn never advances to endpos, blocking cutover.
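The failure mode can be illustrated with a minimal, self-contained C sketch. This is not pgcopydb's code: parse_keepalive_timestamp() is a hypothetical stand-in for the apply-side parser, and the "YYYY-MM-DD HH:MM:SS" layout is an assumption. It only shows that a strict parser rejects an empty timestamp deterministically, so retrying the same line fails the same way every time.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a strict apply-side timestamp parser (not
 * pgcopydb's actual code). It accepts "YYYY-MM-DD HH:MM:SS" followed by
 * anything (fractional seconds, time zone); the layout is an assumption.
 */
bool
parse_keepalive_timestamp(const char *text)
{
	int year, month, day, hour, minute, second;

	if (text == NULL || text[0] == '\0')
	{
		/* the empty string produced for a synthetic keepalive lands here */
		return false;
	}

	/* all six numeric fields must be present for the parse to succeed */
	return sscanf(text, "%d-%d-%d %d:%d:%d",
				  &year, &month, &day, &hour, &minute, &second) == 6;
}
```

Because the input comes from a file already on disk, nothing changes between restarts: the same empty string is read and rejected again.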
Steps to Reproduce:
1) Run pgcopydb follow (or pgcopydb clone --follow) with wal2json or test_decoding
2) Perform DML on source, then stop DML activity
3) Wait for synthetic keepalive generation (flush timer or empty transaction skipping)
4) Apply process crashes; subsequent restarts fail on the same stale .sql file
5) replay_lsn stays behind endpos, cutover is blocked
Version:
v0.17 (also observed on earlier versions)
Proposed Fix:
Three-layer defense, falling back to feGetCurrentTimestamp() whenever the timestamp is missing:
- ld_stream.c:streamKeepalive(): prevent writing JSON without a timestamp
- ld_transform.c:stream_write_keepalive(): catch broken .json files during transform
- ld_apply.c:stream_apply_sql(): catch broken .sql files during apply (warning instead of fatal)
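The first layer can be sketched roughly as follows. This is an illustration and not the attached patch: format_keepalive() is a hypothetical helper, sendTime is modeled as a time_t where 0 means "never set", and the standard time() call stands in for feGetCurrentTimestamp(). The JSON layout follows the message shown in the error above; the timestamp format is an assumption.

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not pgcopydb's API): format a keepalive JSON
 * message, substituting the current time when sendTime was never set.
 * Returns false if the timestamp or the message does not fit.
 */
bool
format_keepalive(char *buf, size_t size, const char *lsn, time_t sendTime)
{
	char timestamp[32];
	struct tm *utc;
	int written;

	if (sendTime == 0)
	{
		/* synthetic keepalive: fall back to the current time */
		sendTime = time(NULL);
	}

	utc = gmtime(&sendTime);

	if (utc == NULL ||
		strftime(timestamp, sizeof(timestamp),
				 "%Y-%m-%d %H:%M:%S+00", utc) == 0)
	{
		return false;
	}

	written = snprintf(buf, size,
					   "{\"lsn\":\"%s\",\"timestamp\":\"%s\"}",
					   lsn, timestamp);

	/* snprintf reports truncation by returning at least the buffer size */
	return written > 0 && (size_t) written < size;
}
```

With this shape, a message with an empty timestamp can never be written. The transform and apply layers would apply the same substitution when they read an empty value from a file that already exists on disk.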
I have also attached the patch below:
pgcopydb_keepalive_bugfix.patch