-
-
Notifications
You must be signed in to change notification settings - Fork 25
DDD-16: Support On-demand LOB Retrieval for Large Object Columns in SQL-Server #22
Description
Motivation
Currently, Debezium connectors load entire LOB (Large Object) column values into JVM heap memory before emitting change events. This approach becomes problematic when tables contain large LOB values (hundreds of megabytes to multiple gigabytes). The challenges are:
- OutOfMemoryError. Large LOB columns (e.g., 5GB
varbinary(max)) are fully loaded into memory, causing JVM heap exhaustion and connector crashes. - Unpredictable Memory Usage. Memory consumption scales with LOB size, making capacity planning difficult. A single large LOB change can spike memory usage dramatically.
- Connector Instability. OOM errors cause task failures, requiring restarts and potentially causing duplicate events or data loss during long-running snapshots.
- Blocked Adoption. Organizations with tables containing large binary data (documents, media, scientific data) cannot use Debezium CDC without application-level workarounds.
Current behavior:
Source Table Row (5GB LOB)
│
▼
Debezium reads: columnValues[i] = rs.getObject("content")
│
▼
5GB loaded into JVM heap
│
▼
OutOfMemoryError
Desired behavior:
Source Table Row (5GB LOB)
│
▼
Debezium detects large LOB (> threshold)
│
▼
Debezium emits lobReference (metadata only, ~100 bytes)
│
▼
Consumer retrieves LOB on-demand using lob reference
│
▼
Memory bounded
Goals
- Bounded Memory. LOB processing should use constant memory regardless of LOB size.
- Consumer Flexibility. Storage agnostic; consumers decide how to handle large LOBs.
- No External Dependencies. Debezium should not depend on specific storage systems (S3, GCS, etc.).
Proposed Changes
Overview
Instead of loading large LOB content into memory, Debezium will:
- Detect LOB columns based on schema metadata.
- Check LOB size against a configurable threshold.
- For LOBs exceeding the threshold, emit a LOB reference containing retrieval metadata instead of the actual content.
- Consumer uses the LOB reference metadata to retrieve the LOB content on-demand by querying SQL-Server's CDC table.
New configuration parameters
| Name | Type | Default | Importance | Doc |
|---|---|---|---|---|
lob.detection.enabled |
BOOLEAN |
false |
MEDIUM |
When enabled, Debezium will emit a LOB reference instead of loading the data inline for columns exceeding the configured threshold. |
lob.detection.threshold.bytes |
LONG |
None |
MEDIUM |
When lob.detection.enabled is true, columns larger than this threshold (in bytes) will emit a LOB reference in the SourceRecord. |
SQL Server specifics
CDC table
SQL Server CDC stores complete row snapshots in CDC tables (cdc.<schema>_<table>_CT). Each CDC row is uniquely identified by:
__$start_lsn— Log Sequence Number (commit_lsnin Debezium)__$seqval— Sequence value within LSN (change_lsnin Debezium)__$operation— Operation type (1=delete, 2=insert, 3=update-before, 4=update-after)
These identifiers enable consumers to query the exact CDC row and retrieve the LOB content using JDBC streaming APIs (getBinaryStream()), avoiding memory issues.
CDC retention dependency
The source-side LOB retrieval depends on the CDC row still existing in the CDC table at the time of the lookup. SQL Server CDC has a configurable cleanup job that periodically purges old CDC rows based on a retention setting.
If the consumer work falls behind (due to backlog, downtime, slow processing, or large LOB uploads creating additional lag), the CDC cleanup job may purge rows before the consumer has a chance to look them up. When this happens, the lookup query returns zero rows and the LOB content is permanently lost for that event.
Retention Configuration. CDC retention must be set to exceed the maximum expected consumer processing lag. The default SQL Server CDC retention is 3 days (4320 minutes).
LOB reference schema
When a LOB exceeds the threshold, instead of the binary content, the connector emits a $lobRef structure:
{
"version": 1,
"connector": "sqlserver",
"database": "mydb",
"cdcTable": "cdc.dbo_documents_CT",
"column": "content",
"lsn": "0x0000002A000001E80003",
"seqval": "0x0000002A000001E80001",
"operation": 2
}Field descriptions
| Field | Type | Description |
|---|---|---|
version |
int |
Schema version for forward compatibility. |
connector |
string |
Connector type identifier. sqlserver for SQL Server. |
database |
string |
Source database name. |
cdcTable |
string |
Fully-qualified CDC table name (e.g., cdc.dbo_documents_CT). |
column |
string |
Column name containing the LOB data. |
lsn |
string |
Log Sequence Number (__$start_lsn) as hexadecimal string. Identifies the transaction. |
seqval |
string |
Sequence value (__$seqval) as hexadecimal string. Identifies the specific change within the transaction. |
operation |
int |
CDC operation code: 1=delete, 2=insert, 3=update (before), 4=update (after). |
Maintaining the LOB reference in the Debezium record
Option 1: Plaintext $lobReference in column value (Proposed)
This approach replaces the LOB column value with a serialized string containing lookup metadata. For example, the column value for content would be a byte array or string like:
__debezium_lob_reference:
{
"cdcTable": "cdc.dbo_documents_CT",
"lsn": "0x...",
"seqval": "0x...",
"operation": 2,
"column": "content"
}
instead of the actual LOB bytes.
The column schema remains unchanged — it's still BYTES or STRING. Debezium already does something similar with __debezium_unavailable_value when LOB content is not captured. This would be a natural extension of that pattern.
The consumer detects a LOB reference by checking if the column value starts with the __debezium_lob_reference: prefix, then parses the JSON payload to extract lookup metadata.
Pros:
- No schema changes anywhere, no envelope modification.
- Simple to implement.
- Follows existing Debezium placeholder pattern.
- Backward compatible — Debezium consumers that don't understand it just see a string value.
- No changes needed in
debezium-core; can be accomplished by only modifying the SQL Server connector.
Cons:
- Metadata is embedded in the data field, mixing concerns.
- The Debezium consumer must parse a serialized string; this is also different for each database type.
- There is a theoretical risk that actual data starts with the same prefix.
Alternative Option 2: New lobReferences field in envelope
This approach adds a new optional field called lobReferences to the Debezium envelope Struct. The LOB column value gets a placeholder (__debezium_lob_reference), and the actual lookup metadata lives in the envelope-level array. Each entry in the array identifies which field it belongs to (e.g., after.content) and contains the lookup properties.
The envelope schema gains a new optional field. The column schema remains unchanged.
Pros:
- Clean separation of data and metadata.
- Strongly typed metadata (
Structfields, not serialized JSON). Easier for the Debezium consumer to process. - More aligned with how Debezium handles other metadata (
sourceblock,transactionblock).
Cons:
- Envelope schema change — the
Envelopeclass indebezium-coremust be modified to include the new optional field. This is a more invasive change than Option 1. - Schema registry (if used in Kafka Connect) will see a new schema version. More code to write — need to define the
lobReferenceStruct schema, collect references during record building, and attach to envelope.
Proposal
For a simple and fast implementation, proceed with Option 1 first. Regardless of whether we serialize the LOB reference as a plaintext column value (Option 1) or as an envelope field (Option 2), the internal representation is the same. We will need to create a LobReference object in the Object[] array at the column reading stage. The only difference is what happens when that LobReference gets written into the SourceRecord.
Option 1 has the following advantages:
- Minimal code changes — just the column reading logic and a simple serialization.
- No envelope schema changes — nothing in
debezium-core'sEnvelopeclass needs to change. - Works in kafka connect mode and embedded engine mode identically.
- Follows the existing
__debezium_unavailable_valuepattern that Debezium already uses.
LOB size detection strategy
Option 1: Read-until-threshold (Proposed)
Read up to threshold bytes from the InputStream. If we read more than threshold bytes, the LOB is large and we emit $lobRef. Otherwise, we emit the content inline.
Pseudocode:
Open a stream to the LOB column from the result set.
If the column is NULL, return null.
Read the stream in small chunks, keeping track of total bytes read.
If at any point the total bytes read exceeds the threshold,
stop reading, close the stream, and return a LOB reference.
If the stream ends and total bytes read is zero, return empty bytes.
Otherwise, return the full content read so far as inline bytes.
Alternative Option 2: Add DATALENGTH() to Debezium query
Modify the Debezium SQL Server connector and change the existing query for fetching CDC to include DATALENGTH(). We can then utilize this metadata to decide whether a column is inline or an external LOB that will be retrieved later.
SELECT
[__$start_lsn],
[__$seqval],
[__$operation],
[__$update_mask],
[id],
[name],
[document],
TODATETIMEOFFSET(sys.fn_cdc_map_lsn_to_time([__$start_lsn]),
DATEPART(TZOFFSET, SYSDATETIMEOFFSET())),
DATALENGTH([document]) AS [document_len] -- Newly added
FROM [mydb].cdc.[fn_cdc_get_all_changes_dbo_mytable](?, ?, N'all update old')
ORDER BY [__$start_lsn] ASC, [__$seqval] ASC, [__$operation] ASCPros:
- Does not waste any data read into memory like in the read-until-threshold approach.
Cons:
- More code changes.
Handling NULL and empty LOBs
| LOB Value | Action |
|---|---|
NULL |
Emit null (no $lobRef) |
| Empty (0 bytes) | Emit empty byte array (no $lobRef) |
Small (<= threshold) |
Emit inline content |
Large (> threshold) |
Emit $lobRef |
Handling multiple LOB columns
Tables may have multiple LOB columns. Each is evaluated independently. For example, for a table with mixed LOB and non-LOB columns, this is the sample Debezium SourceRecord emitted:
{
"op": "c",
"source": {
"connector": "sqlserver",
"change_lsn": "0000002A:000001E8:0003",
"commit_lsn": "0000002A:000001E8:0004",
"..."
},
"after": {
"id": 1,
"name": "report",
"thumbnail": "base64...",
"content": "__debezium_lob_reference:{\"cdcTable\":\"cdc.dbo_documents_CT\",\"lsn\":\"0x...\",\"seqval\":\"0x...\",\"operation\":2,\"column\":\"content\"}",
"metadata": "__debezium_lob_reference:{\"cdcTable\":\"cdc.dbo_documents_CT\",\"lsn\":\"0x...\",\"seqval\":\"0x...\",\"operation\":2,\"column\":\"metadata\"}"
}
}Backward compatibility considerations
Source record schema
When lob.detection.enabled is set to false (the default), the connector behavior is unchanged. LOB columns are read fully into memory and emitted inline, preserving full backward compatibility.
When lob.detection.enabled is set to true, the column schema (BYTES or STRING) remains the same, but the column value may contain a serialized __debezium_lob_reference: string instead of the actual LOB content. Consumers that are unaware of this convention will see the reference string as an opaque value. Consumers that understand the convention will detect the prefix and resolve the LOB on demand.
CDC retention
Enabling LOB retrieval introduces a dependency on CDC row availability at fetch time. If the CDC cleanup job purges rows before the consumer retrieves them, the LOB content is permanently lost for that event. Operators must ensure that the CDC retention period exceeds the maximum expected consumer processing lag.