DDD-16: Support On-demand LOB Retrieval for Large Object Columns in SQL-Server

## Motivation

Currently, Debezium connectors load entire LOB (Large Object) column values into JVM heap memory before emitting change events. This approach becomes problematic when tables contain large LOB values (hundreds of megabytes to multiple gigabytes). The challenges are:

1. **OutOfMemoryError.** Large LOB columns (e.g., 5GB `varbinary(max)`) are fully loaded into memory, causing JVM heap exhaustion and connector crashes.
2. **Unpredictable Memory Usage.** Memory consumption scales with LOB size, making capacity planning difficult. A single large LOB change can spike memory usage dramatically.
3. **Connector Instability.** OOM errors cause task failures, requiring restarts and potentially causing duplicate events or data loss during long-running snapshots.
4. **Blocked Adoption.** Organizations with tables containing large binary data (documents, media, scientific data) cannot use Debezium CDC without application-level workarounds.

**Current behavior:**

```
Source Table Row (5GB LOB)
         │
         ▼
Debezium reads: columnValues[i] = rs.getObject("content")
         │
         ▼
5GB loaded into JVM heap
         │
         ▼
OutOfMemoryError
```

**Desired behavior:**

```
Source Table Row (5GB LOB)
         │
         ▼
Debezium detects large LOB (> threshold)
         │
         ▼
Debezium emits lobReference (metadata only, ~100 bytes)
         │
         ▼
Consumer retrieves LOB on-demand using lob reference
         │
         ▼
Memory bounded
```

### Goals

1. **Bounded Memory.** LOB processing should use constant memory regardless of LOB size.
2. **Consumer Flexibility.** Storage agnostic; consumers decide how to handle large LOBs.
3. **No External Dependencies.** Debezium should not depend on specific storage systems (S3, GCS, etc.).

## Proposed Changes

### Overview

Instead of loading large LOB content into memory, Debezium will:

1. Detect LOB columns based on schema metadata.
2. Check LOB size against a configurable threshold.
3. For LOBs exceeding the threshold, emit a LOB reference containing retrieval metadata instead of the actual content.
4. Consumer uses the LOB reference metadata to retrieve the LOB content on-demand by querying SQL-Server's CDC table.

### New configuration parameters

| Name | Type | Default | Importance | Doc |
| :--- | :--- | :------ | :--------- | :-- |
| `lob.detection.enabled` | `BOOLEAN` | `false` | `MEDIUM` | When enabled, Debezium will emit a LOB reference instead of loading the data inline for columns exceeding the configured threshold. |
| `lob.detection.threshold.bytes` | `LONG` | `None` | `MEDIUM` | When `lob.detection.enabled` is `true`, columns larger than this threshold (in bytes) will emit a LOB reference in the `SourceRecord`. |

### SQL Server specifics

#### CDC table

SQL Server CDC stores complete row snapshots in CDC tables (`cdc.<schema>_<table>_CT`). Each CDC row is uniquely identified by:

- `__$start_lsn` — Log Sequence Number (`commit_lsn` in Debezium)
- `__$seqval` — Sequence value within LSN (`change_lsn` in Debezium)
- `__$operation` — Operation type (1=delete, 2=insert, 3=update-before, 4=update-after)

These identifiers enable consumers to query the exact CDC row and retrieve the LOB content using JDBC streaming APIs (`getBinaryStream()`), avoiding memory issues.

#### CDC retention dependency

The source-side LOB retrieval depends on the CDC row still existing in the CDC table at the time of the lookup. SQL Server CDC has a configurable cleanup job that periodically purges old CDC rows based on a retention setting.

If the consumer work falls behind (due to backlog, downtime, slow processing, or large LOB uploads creating additional lag), the CDC cleanup job may purge rows before the consumer has a chance to look them up. When this happens, the lookup query returns zero rows and the LOB content is permanently lost for that event.

**Retention Configuration.** CDC retention must be set to exceed the maximum expected consumer processing lag. The default SQL Server CDC retention is 3 days (4320 minutes).

### LOB reference schema

When a LOB exceeds the threshold, instead of the binary content, the connector emits a `$lobRef` structure:

```json
{
    "version": 1,
    "connector": "sqlserver",
    "database": "mydb",
    "cdcTable": "cdc.dbo_documents_CT",
    "column": "content",
    "lsn": "0x0000002A000001E80003",
    "seqval": "0x0000002A000001E80001",
    "operation": 2
}
```

#### Field descriptions

| Field | Type | Description |
| :---- | :--- | :---------- |
| `version` | `int` | Schema version for forward compatibility. |
| `connector` | `string` | Connector type identifier. `sqlserver` for SQL Server. |
| `database` | `string` | Source database name. |
| `cdcTable` | `string` | Fully-qualified CDC table name (e.g., `cdc.dbo_documents_CT`). |
| `column` | `string` | Column name containing the LOB data. |
| `lsn` | `string` | Log Sequence Number (`__$start_lsn`) as hexadecimal string. Identifies the transaction. |
| `seqval` | `string` | Sequence value (`__$seqval`) as hexadecimal string. Identifies the specific change within the transaction. |
| `operation` | `int` | CDC operation code: 1=delete, 2=insert, 3=update (before), 4=update (after). |

### Maintaining the LOB reference in the Debezium record

#### Option 1: Plaintext `$lobReference` in column value (Proposed)

This approach replaces the LOB column value with a serialized string containing lookup metadata. For example, the column value for `content` would be a byte array or string like:

```
__debezium_lob_reference:
{
  "cdcTable": "cdc.dbo_documents_CT",
  "lsn": "0x...",
  "seqval": "0x...",
  "operation": 2,
  "column": "content"
}
```

instead of the actual LOB bytes.

The column schema remains unchanged — it's still `BYTES` or `STRING`. Debezium already does something similar with `__debezium_unavailable_value` when LOB content is not captured. This would be a natural extension of that pattern.

The consumer detects a LOB reference by checking if the column value starts with the `__debezium_lob_reference:` prefix, then parses the JSON payload to extract lookup metadata.

**Pros:**

- No schema changes anywhere, no envelope modification.
- Simple to implement.
- Follows existing Debezium placeholder pattern.
- Backward compatible — Debezium consumers that don't understand it just see a string value.
- No changes needed in `debezium-core`; can be accomplished by only modifying the SQL Server connector.

**Cons:**

- Metadata is embedded in the data field, mixing concerns.
- The Debezium consumer must parse a serialized string; this is also different for each database type.
- There is a theoretical risk that actual data starts with the same prefix.

#### Alternative Option 2: New `lobReferences` field in envelope

This approach adds a new optional field called `lobReferences` to the Debezium envelope `Struct`. The LOB column value gets a placeholder (`__debezium_lob_reference`), and the actual lookup metadata lives in the envelope-level array. Each entry in the array identifies which field it belongs to (e.g., `after.content`) and contains the lookup properties.

The envelope schema gains a new optional field. The column schema remains unchanged.

**Pros:**

- Clean separation of data and metadata.
- Strongly typed metadata (`Struct` fields, not serialized JSON). Easier for the Debezium consumer to process.
- More aligned with how Debezium handles other metadata (`source` block, `transaction` block).

**Cons:**

- Envelope schema change — the `Envelope` class in `debezium-core` must be modified to include the new optional field. This is a more invasive change than Option 1.
- Schema registry (if used in Kafka Connect) will see a new schema version. More code to write — need to define the `lobReference` Struct schema, collect references during record building, and attach to envelope.

#### Proposal

For a simple and fast implementation, proceed with Option 1 first. Regardless of whether we serialize the LOB reference as a plaintext column value (Option 1) or as an envelope field (Option 2), the internal representation is the same. We will need to create a `LobReference` object in the `Object[]` array at the column reading stage. The only difference is what happens when that `LobReference` gets written into the `SourceRecord`.

Option 1 has the following advantages:

- **Minimal code changes** — just the column reading logic and a simple serialization.
- **No envelope schema changes** — nothing in `debezium-core`'s `Envelope` class needs to change.
- **Works in kafka connect mode and embedded engine mode identically.**
- **Follows the existing `__debezium_unavailable_value` pattern** that Debezium already uses.

### LOB size detection strategy

#### Option 1: Read-until-threshold (Proposed)

Read up to `threshold` bytes from the `InputStream`. If we read more than `threshold` bytes, the LOB is large and we emit `$lobRef`. Otherwise, we emit the content inline.

**Pseudocode:**

```
Open a stream to the LOB column from the result set.

If the column is NULL, return null.

Read the stream in small chunks, keeping track of total bytes read.

If at any point the total bytes read exceeds the threshold,
    stop reading, close the stream, and return a LOB reference.

If the stream ends and total bytes read is zero, return empty bytes.

Otherwise, return the full content read so far as inline bytes.
```

#### Alternative Option 2: Add `DATALENGTH()` to Debezium query

Modify the Debezium SQL Server connector and change the existing query for fetching CDC to include `DATALENGTH()`. We can then utilize this metadata to decide whether a column is inline or an external LOB that will be retrieved later.

```sql
SELECT
    [__$start_lsn],
    [__$seqval],
    [__$operation],
    [__$update_mask],
    [id],
    [name],
    [document],
    TODATETIMEOFFSET(sys.fn_cdc_map_lsn_to_time([__$start_lsn]),
        DATEPART(TZOFFSET, SYSDATETIMEOFFSET())),
    DATALENGTH([document]) AS [document_len]      -- Newly added
FROM [mydb].cdc.[fn_cdc_get_all_changes_dbo_mytable](?, ?, N'all update old')
ORDER BY [__$start_lsn] ASC, [__$seqval] ASC, [__$operation] ASC
```

**Pros:**

- Does not waste any data read into memory like in the read-until-threshold approach.

**Cons:**

- More code changes.

### Handling NULL and empty LOBs

| LOB Value | Action |
| :-------- | :----- |
| `NULL` | Emit `null` (no `$lobRef`) |
| Empty (0 bytes) | Emit empty byte array (no `$lobRef`) |
| Small (`<= threshold`) | Emit inline content |
| Large (`> threshold`) | Emit `$lobRef` |

### Handling multiple LOB columns

Tables may have multiple LOB columns. Each is evaluated independently. For example, for a table with mixed LOB and non-LOB columns, this is the sample Debezium `SourceRecord` emitted:

```json
{
  "op": "c",
  "source": {
    "connector": "sqlserver",
    "change_lsn": "0000002A:000001E8:0003",
    "commit_lsn": "0000002A:000001E8:0004",
    "..."
  },
  "after": {
    "id": 1,
    "name": "report",
    "thumbnail": "base64...",
    "content": "__debezium_lob_reference:{\"cdcTable\":\"cdc.dbo_documents_CT\",\"lsn\":\"0x...\",\"seqval\":\"0x...\",\"operation\":2,\"column\":\"content\"}",
    "metadata": "__debezium_lob_reference:{\"cdcTable\":\"cdc.dbo_documents_CT\",\"lsn\":\"0x...\",\"seqval\":\"0x...\",\"operation\":2,\"column\":\"metadata\"}"
  }
}
```

### Backward compatibility considerations

#### Source record schema

When `lob.detection.enabled` is set to `false` (the default), the connector behavior is unchanged. LOB columns are read fully into memory and emitted inline, preserving full backward compatibility.

When `lob.detection.enabled` is set to `true`, the column schema (`BYTES` or `STRING`) remains the same, but the column value may contain a serialized `__debezium_lob_reference:` string instead of the actual LOB content. Consumers that are unaware of this convention will see the reference string as an opaque value. Consumers that understand the convention will detect the prefix and resolve the LOB on demand.

#### CDC retention

Enabling LOB retrieval introduces a dependency on CDC row availability at fetch time. If the CDC cleanup job purges rows before the consumer retrieves them, the LOB content is permanently lost for that event. Operators must ensure that the CDC retention period exceeds the maximum expected consumer processing lag.

Name	Type	Default	Importance	Doc
`lob.detection.enabled`	`BOOLEAN`	`false`	`MEDIUM`	When enabled, Debezium will emit a LOB reference instead of loading the data inline for columns exceeding the configured threshold.
`lob.detection.threshold.bytes`	`LONG`	`None`	`MEDIUM`	When `lob.detection.enabled` is `true`, columns larger than this threshold (in bytes) will emit a LOB reference in the `SourceRecord`.

Field	Type	Description
`version`	`int`	Schema version for forward compatibility.
`connector`	`string`	Connector type identifier. `sqlserver` for SQL Server.
`database`	`string`	Source database name.
`cdcTable`	`string`	Fully-qualified CDC table name (e.g., `cdc.dbo_documents_CT`).
`column`	`string`	Column name containing the LOB data.
`lsn`	`string`	Log Sequence Number (`__$start_lsn`) as hexadecimal string. Identifies the transaction.
`seqval`	`string`	Sequence value (`__$seqval`) as hexadecimal string. Identifies the specific change within the transaction.
`operation`	`int`	CDC operation code: 1=delete, 2=insert, 3=update (before), 4=update (after).

LOB Value	Action
`NULL`	Emit `null` (no `$lobRef`)
Empty (0 bytes)	Emit empty byte array (no `$lobRef`)
Small (`<= threshold`)	Emit inline content
Large (`> threshold`)	Emit `$lobRef`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DDD-16: Support On-demand LOB Retrieval for Large Object Columns in SQL-Server #22

Motivation

Goals

Proposed Changes

Overview

New configuration parameters

SQL Server specifics

CDC table

CDC retention dependency

LOB reference schema

Field descriptions

Maintaining the LOB reference in the Debezium record

Option 1: Plaintext `$lobReference` in column value (Proposed)

Alternative Option 2: New `lobReferences` field in envelope

Proposal

LOB size detection strategy

Option 1: Read-until-threshold (Proposed)

Alternative Option 2: Add `DATALENGTH()` to Debezium query

Handling NULL and empty LOBs

Handling multiple LOB columns

Backward compatibility considerations

Source record schema

CDC retention

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

DDD-16: Support On-demand LOB Retrieval for Large Object Columns in SQL-Server #22

Description

Motivation

Goals

Proposed Changes

Overview

New configuration parameters

SQL Server specifics

CDC table

CDC retention dependency

LOB reference schema

Field descriptions

Maintaining the LOB reference in the Debezium record

Option 1: Plaintext $lobReference in column value (Proposed)

Alternative Option 2: New lobReferences field in envelope

Proposal

LOB size detection strategy

Option 1: Read-until-threshold (Proposed)

Alternative Option 2: Add DATALENGTH() to Debezium query

Handling NULL and empty LOBs

Handling multiple LOB columns

Backward compatibility considerations

Source record schema

CDC retention

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Option 1: Plaintext `$lobReference` in column value (Proposed)

Alternative Option 2: New `lobReferences` field in envelope

Alternative Option 2: Add `DATALENGTH()` to Debezium query