feat(csharp): implement EnableComplexDatatypeSupport for consistent complex type behavior (PECO-2938)#311
Conversation
…and SEA (PECO-2938) Adds 14 named complex-type tests to ComplexTypesValueTests that pass for both Thrift (JSON string) and SEA (native Arrow) protocols: - COMPLEX-001: QueryReturningArray - COMPLEX-002: QueryReturningMap - COMPLEX-003: QueryReturningStruct - COMPLEX-004: NestedStruct - COMPLEX-005: ArrayOfStruct - COMPLEX-006: StructWithArray - COMPLEX-007: NestedArray - COMPLEX-008: EmptyArray - COMPLEX-009: ArrayWithNullElements - COMPLEX-010: NullInStruct - COMPLEX-011: FullyNullStruct - COMPLEX-012: MapWithNullValue - COMPLEX-013: StructWithMap - COMPLEX-014: NullComplexColumn Adds DatabricksTestEnvironment.IsSeaMode to detect SEA vs Thrift at runtime. Thrift path asserts StringType + JSON string; SEA path asserts native Arrow type. Documents expected column types per protocol in csharp/doc/specs/complex-types.yaml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…omplex type behavior across Thrift and SEA (PECO-2938) Implement JDBC-equivalent EnableComplexDatatypeSupport parameter that unifies complex type (ARRAY, MAP, STRUCT) behavior across Thrift and SEA protocols. Default behavior (EnableComplexDatatypeSupport=false): - Thrift: ComplexTypesAsArrow=false (server returns JSON strings natively) - SEA: new ComplexTypeSerializingStream wraps result and serializes native Arrow types (ListType/MapType/StructType) to JSON strings When enabled (EnableComplexDatatypeSupport=true): - Thrift: ComplexTypesAsArrow=true (server returns native Arrow types) - SEA: native Arrow types returned as-is (no wrapping) New files: - ArrowComplexTypeJsonSerializer.cs: serializes Arrow LIST/MAP/STRUCT values to JSON strings using Utf8JsonWriter, handling nested complex types - ComplexTypeSerializingStream.cs: IArrowArrayStream wrapper that converts complex columns to StringArrays when EnableComplexDatatypeSupport=false Updated E2E tests now assert unified JSON string output for both protocols without IsSeaMode branching. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Required by Apache RAT pre-commit check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…shSet lookup - Parse EnableComplexDatatypeSupport once in StatementExecutionConnection alongside other connection properties, and expose it as a property. StatementExecutionStatement now reads connection.EnableComplexDatatypeSupport instead of re-parsing the raw properties dict — consistent with how DatabricksStatement reads connection.EnableComplexDatatypeSupport. - Replace IReadOnlyList<int> with HashSet<int> for _complexColumnIndices in ComplexTypeSerializingStream, eliminating the O(n) IsComplexIndex loop in favor of a single HashSet.Contains() call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g ValueAt Replace the 12-case primitive switch in ToObject() with a delegate to IArrowArrayExtensions.ValueAt() (already in the codebase), keeping only the 3 cases that need specific JSON formatting (Decimal128 as string, Date32 as yyyy-MM-dd, Timestamp as ISO 8601) and the 2 complex type traversals (ListArray, StructArray). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DateTimeOffset serialized by JsonSerializer ("2024-03-15T10:30:45.123456+00:00")
and via ToString("o") ("2024-03-15T10:30:45.1234560+00:00") differ only by
one trailing zero — both are valid ISO 8601. Let ValueAt() handle timestamps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Not needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Logic consolidated into ComplexTypeSerializingStream. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug confirmed: non-string MAP keys silently coerced to
|
| Value | |
|---|---|
| Actual (buggy) | {"null":"two"} — first entry erased |
| Expected | {"1":"one","2":"two"} |
Both Int32 and Int64 key types are affected. The same issue applies to any future non-string key type (DATE, BOOLEAN, DECIMAL, etc.).
Fix
// Replace:
string key = keyArray is StringArray sa ? sa.GetString(i) ?? "null" : "null";
// With:
string key = ToObject(keyArray, i)?.ToString() ?? "null";ToObject already handles all Arrow primitive types correctly (int, long, float, bool, date, timestamp, decimal) and is already defined in this class — no new dependencies needed.
The test file (csharp/test/Unit/ComplexTypeSerializingStreamTests.cs) is ready to be included in this PR to prevent regressions.
|
Can you cover Interval datatype as a part of this PR? |
There was a problem hiding this comment.
Pull request overview
Implements a new Databricks driver parameter (adbc.databricks.enable_complex_datatype_support, default false) to make complex type (ARRAY/MAP/STRUCT) results consistent between Thrift and Statement Execution (SEA) protocols, aligning with JDBC behavior.
Changes:
- Adds
EnableComplexDatatypeSupportconnection/statement parameter and wires it into Thrift (ComplexTypesAsArrow) and SEA (stream wrapper). - Introduces
ComplexTypeSerializingStreamto serialize SEA complex Arrow types into JSON strings when the flag is disabled. - Updates E2E complex-type tests to assert unified default JSON-string behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| csharp/src/DatabricksParameters.cs | Adds the new connection parameter constant and docs. |
| csharp/src/DatabricksConnection.cs | Parses/exposes EnableComplexDatatypeSupport from properties. |
| csharp/src/DatabricksStatement.cs | Drives Thrift ComplexTypesAsArrow from the new flag. |
| csharp/src/StatementExecution/StatementExecutionConnection.cs | Parses/exposes the flag for SEA connections. |
| csharp/src/StatementExecution/StatementExecutionStatement.cs | Wraps SEA result streams to serialize complex types when disabled. |
| csharp/src/ComplexTypeSerializingStream.cs | New stream wrapper converting complex columns to JSON strings. |
| csharp/test/E2E/ComplexTypesValueTests.cs | Reworks E2E assertions to expect JSON strings by default. |
| csharp/test/E2E/DatabricksTestEnvironment.cs | Adds SEA-protocol detection helper (IsSeaMode). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var field = original.FieldsList[i]; | ||
| if (IsComplexType(field.DataType)) | ||
| { | ||
| fields.Add(new Field(field.Name, StringType.Default, nullable: true, field.Metadata)); | ||
| indices.Add(i); | ||
| } | ||
| else | ||
| { | ||
| fields.Add(field); | ||
| } |
There was a problem hiding this comment.
When converting a complex field to StringType, the new field is always created with nullable: true, which can change schema nullability semantics compared to the original field. Consider preserving the original field’s nullability (e.g., use the original field’s nullable flag) while only changing the DataType.
There was a problem hiding this comment.
✅ Done — changed nullable: true to field.IsNullable to preserve the original field's nullability.
This comment was generated with GitHub MCP.
| builder.AppendNull(); | ||
| else | ||
| builder.Append(JsonSerializer.Serialize(ToObject(array, i))); |
There was a problem hiding this comment.
The current JSON serialization builds Dictionary<string, object?> objects (for STRUCT/MAP) and then calls JsonSerializer.Serialize(...). Dictionary enumeration order is not guaranteed (notably on net472, which this project targets), so property ordering in the resulting JSON string can be nondeterministic and can break string-equality comparisons and clients that expect stable output. Prefer writing JSON with Utf8JsonWriter (or another ordered representation) to emit fields in schema/order deterministically.
| builder.AppendNull(); | |
| else | |
| builder.Append(JsonSerializer.Serialize(ToObject(array, i))); | |
| { | |
| builder.AppendNull(); | |
| } | |
| else | |
| { | |
| var value = ToObject(array, i); | |
| if (value is IDictionary<string, object?> dict) | |
| { | |
| // Wrap in a SortedDictionary to ensure deterministic key ordering | |
| var ordered = new SortedDictionary<string, object?>(dict); | |
| builder.Append(JsonSerializer.Serialize(ordered)); | |
| } | |
| else | |
| { | |
| builder.Append(JsonSerializer.Serialize(value)); | |
| } | |
| } |
There was a problem hiding this comment.
✅ Done — changed ToMapDict to return SortedDictionary<string, object?> so MAP keys are always emitted in sorted order, giving deterministic JSON output across all target frameworks.
This comment was generated with GitHub MCP.
| public async ValueTask<RecordBatch?> ReadNextRecordBatchAsync(CancellationToken cancellationToken = default) | ||
| { | ||
| var batch = await _inner.ReadNextRecordBatchAsync(cancellationToken).ConfigureAwait(false); | ||
| if (batch == null) | ||
| return null; | ||
|
|
There was a problem hiding this comment.
This file uses several var locals (e.g., var batch, var fields, var indices) but the repo’s C# coding guidelines prefer explicit types over var (csharp/CODING_GUIDELINES.md). Consider switching these to explicit types for consistency across the codebase.
There was a problem hiding this comment.
✅ Done — replaced all var locals with explicit types throughout the file.
This comment was generated with GitHub MCP.
| /// Thrift: ComplexTypesAsArrow=false (server returns strings natively). | ||
| /// SEA: Native Arrow types are serialized to JSON strings by ComplexTypeSerializingStream. | ||
| /// | ||
| /// See docs/specs/complex-types.yaml for per-protocol type specifications. |
There was a problem hiding this comment.
The XML doc references docs/specs/complex-types.yaml, but that file/path doesn’t exist in this repo. Please update the reference to the correct spec/doc location (or add the missing doc if it’s supposed to be part of the repo).
| /// See docs/specs/complex-types.yaml for per-protocol type specifications. | |
| /// See the project documentation for per-protocol type specifications. |
There was a problem hiding this comment.
✅ Done — replaced the broken docs/specs/complex-types.yaml reference with a note about EnableComplexDatatypeSupport=true behavior.
This comment was generated with GitHub MCP.
| Assert.NotNull(batch); | ||
| Assert.Equal(1, batch.Length); | ||
|
|
||
| var arr = (StringArray)batch.Column(0); |
There was a problem hiding this comment.
Local variable arr is declared with var, but the repo’s C# coding guidelines prefer explicit types over var (see csharp/CODING_GUIDELINES.md). Consider changing this to an explicit StringArray declaration for consistency.
| var arr = (StringArray)batch.Column(0); | |
| StringArray arr = (StringArray)batch.Column(0); |
There was a problem hiding this comment.
| using System.Threading.Tasks; | ||
| using Apache.Arrow; | ||
| using Apache.Arrow.Adbc.Extensions; | ||
| using Apache.Arrow.Ipc; |
There was a problem hiding this comment.
using Apache.Arrow.Ipc; appears unused in this file and will trigger CS8019; the repo treats warnings as errors, so this can break the build. Remove the unused using (or reference an Ipc type if it’s intended).
| using Apache.Arrow.Ipc; |
There was a problem hiding this comment.
✅ Done — removed using Apache.Arrow.Ipc; from both ComplexTypeSerializingStream.cs and ComplexTypesValueTests.cs.
This comment was generated with GitHub MCP.
There was a problem hiding this comment.
Reverting this — IArrowArrayStream is defined in Apache.Arrow.Ipc, so the using is required. Removing it caused a build failure (CS0246). Restored.
This comment was generated with GitHub MCP.
| var result = new Dictionary<string, object?>(); | ||
| for (int i = start; i < end; i++) | ||
| { | ||
| string key = keyArray is StringArray sa ? sa.GetString(i) ?? "null" : "null"; | ||
| result[key] = ToObject(valueArray, i); | ||
| } |
There was a problem hiding this comment.
MAP key handling currently falls back to the literal string "null" whenever the key column isn’t a StringArray (or contains null), which silently corrupts maps with non-string keys (e.g., MAP<INT, ...>) and can also collide with a real "null" key. Consider converting key values to a string representation (e.g., via ValueAt(...).ToString() with invariant culture) and decide explicitly how to handle null keys (reject/throw vs. serialize).
There was a problem hiding this comment.
✅ Done — fixed to ToObject(keyArray, i)?.ToString() ?? "null". This correctly handles all Arrow primitive key types (INT, BIGINT, DATE, BOOLEAN, DECIMAL, etc.) using the existing ToObject helper, eliminating the data loss bug with non-string keys.
This comment was generated with GitHub MCP.
| /// In SEA mode, complex types (ARRAY, MAP, STRUCT) are returned as native Arrow types. | ||
| /// In Thrift mode, they are returned as JSON-encoded strings. |
There was a problem hiding this comment.
This XML doc for IsSeaMode describes SEA always returning complex types as native Arrow and Thrift returning JSON strings, but this PR introduces EnableComplexDatatypeSupport that changes the behavior (SEA can also return JSON strings when disabled). Update the doc to reflect the flag-dependent behavior to avoid misleading future test authors.
| /// In SEA mode, complex types (ARRAY, MAP, STRUCT) are returned as native Arrow types. | |
| /// In Thrift mode, they are returned as JSON-encoded strings. | |
| /// When complex datatype support is enabled (EnableComplexDatatypeSupport = true), SEA returns | |
| /// complex types (ARRAY, MAP, STRUCT) as native Arrow types. When it is disabled, SEA returns | |
| /// complex types as JSON-encoded strings, matching Thrift behavior. Thrift mode always returns | |
| /// complex types as JSON-encoded strings regardless of the flag. |
There was a problem hiding this comment.
✅ Done — updated the doc to describe the flag-dependent behavior: SEA returns native Arrow types when EnableComplexDatatypeSupport=true, JSON strings when false (the default), and Thrift always returns JSON strings regardless.
This comment was generated with GitHub MCP.
| using Apache.Arrow; | ||
| using Apache.Arrow.Adbc; | ||
| using Apache.Arrow.Adbc.Tests; | ||
| using Apache.Arrow.Ipc; |
There was a problem hiding this comment.
using Apache.Arrow.Ipc; appears unused in this test file and will trigger CS8019; the repo treats warnings as errors, so this can break the build. Remove the unused using (or reference an Ipc type if it’s intended).
| using Apache.Arrow.Ipc; |
There was a problem hiding this comment.
| /// <summary> | ||
| /// Executes a SELECT returning a single complex-type column and validates it returns | ||
| /// a JSON string (the default behavior for both Thrift and SEA protocols). | ||
| /// </summary> | ||
| private async Task ValidateComplexColumnAsync(string sql, string expectedJson) | ||
| { | ||
| Statement.SqlQuery = sql; | ||
| QueryResult result = await Statement.ExecuteQueryAsync(); | ||
|
|
||
| using IArrowArrayStream stream = result.Stream ?? throw new InvalidOperationException("stream is null"); | ||
| Field field = stream.Schema.GetFieldByIndex(0); | ||
|
|
||
| Assert.IsType<StringType>(field.DataType); | ||
|
|
||
| RecordBatch? batch = await stream.ReadNextRecordBatchAsync(); | ||
| Assert.NotNull(batch); | ||
| Assert.Equal(1, batch.Length); | ||
|
|
||
| var arr = (StringArray)batch.Column(0); | ||
| Assert.Equal(expectedJson, arr.GetString(0)); | ||
| } |
There was a problem hiding this comment.
These tests only validate the default behavior (EnableComplexDatatypeSupport=false). Since this PR introduces a new flag with a non-default behavior (native Arrow complex types), consider adding coverage for EnableComplexDatatypeSupport=true as well (asserting ListType/MapType/StructType) to prevent regressions in the enabled code path.
There was a problem hiding this comment.
Good point. Adding EnableComplexDatatypeSupport=true coverage (asserting native ListType/MapType/StructType) would only be meaningful for SEA mode — Thrift always returns JSON strings regardless of the flag. We could add a SEA-only test that re-opens the connection with the flag enabled and asserts the native Arrow types. Deferring to a follow-up PR to keep this one focused, but open to adding it here if preferred.
This comment was generated with GitHub MCP.
- Remove unused 'using Apache.Arrow.Ipc' from both files (fixes CS8019 build error) - Preserve original field nullability when converting complex fields to StringType - Use explicit types instead of var per coding guidelines - Fix non-string MAP keys: use ToObject(...).ToString() instead of hardcoded "null" - Use SortedDictionary for MAP serialization to ensure deterministic key ordering - Update IsSeaMode doc to reflect EnableComplexDatatypeSupport flag behavior - Fix XML doc: remove broken docs/specs/complex-types.yaml reference Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Interval is a primitive scalar type, not a complex type, so
Tracking the nested-interval edge case in PECO-2961 for a follow-up. |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rayStream Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| DecimalAsArrow = true, | ||
|
|
||
| // set to false so they return as string | ||
| // otherwise, they return as ARRAY_TYPE but you can't determine |
There was a problem hiding this comment.
Somehow keep this comment
There was a problem hiding this comment.
✅ Done — restored an expanded comment on ComplexTypesAsArrow explaining:
- When
false(default): Thrift server returns complex types as JSON-encoded strings - When
true: server returns native Arrow types - Notes the ARRAY_TYPE limitation: element type info isn't embedded in Thrift responses, so string serialization is the safe default
This comment was generated with GitHub MCP.
| /// serializes them to JSON strings to match Thrift behavior. | ||
| /// In Thrift mode, complex types are always returned as JSON-encoded strings regardless of the flag. | ||
| /// </summary> | ||
| public bool IsSeaMode => Connection is StatementExecutionConnection; |
There was a problem hiding this comment.
Where is this used?
There was a problem hiding this comment.
✅ Removed — after searching the codebase, IsSeaMode was never called anywhere. Removed the property and the now-unused using AdbcDrivers.Databricks.StatementExecution directive.
This comment was generated with GitHub MCP.
… IsSeaMode - Restore explanatory comment on ComplexTypesAsArrow setting explaining why complex types return as JSON strings when false and the ARRAY_TYPE limitation - Remove unused IsSeaMode property and its StatementExecution using directive from DatabricksTestEnvironment.cs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Implements JDBC-equivalent
EnableComplexDatatypeSupportparameter that unifies complex type (ARRAY, MAP, STRUCT) behavior across Thrift and SEA protocols.Problem
Previously, Thrift and SEA returned complex types differently:
[1,2,3],{"a":1}) viaComplexTypesAsArrow=falseListType,MapType,StructType)This caused test divergence and required
IsSeaModebranching.Solution
Adds
adbc.databricks.enable_complex_datatype_supportparameter (defaultfalse) that makes both protocols behave consistently — matching JDBC'sEnableComplexDatatypeSupportflag.Default (
false) — JSON strings from both protocols:ComplexTypesAsArrow=false(server returns JSON strings natively)ComplexTypeSerializingStreamwraps result and serializes native Arrow types to JSON strings viaArrowComplexTypeJsonSerializerWhen enabled (
true) — native Arrow types from both protocols:ComplexTypesAsArrow=true(server returns native Arrow types)New Files
ArrowComplexTypeJsonSerializer.cs— serializes Arrow LIST/MAP/STRUCT values to JSON usingUtf8JsonWriter, handles nested complex typesComplexTypeSerializingStream.cs—IArrowArrayStreamwrapper that converts complex columns toStringArraywhenEnableComplexDatatypeSupport=falseChanges
DatabricksParameters.cs— addsEnableComplexDatatypeSupportconstantDatabricksConnection.cs— reads and exposes the parameterDatabricksStatement.cs— drivesComplexTypesAsArrowfrom the parameterStatementExecutionStatement.cs— wraps result stream withComplexTypeSerializingStreamwhen disabledComplexTypesValueTests.cs— simplified to assert unified JSON string output withoutIsSeaModebranchingTest Plan
[1,2,3]JSON string output for ARRAY in both protocols{"a":1,"b":2}JSON string output for MAP in both protocols{"id":1,"name":"Alice"}JSON string output for STRUCT in both protocols🤖 Generated with Claude Code