
Commit 9be1721

msrathore-dbclaude committed
feat(csharp): implement SEA metadata operations
Implement metadata operations for the Statement Execution API (REST) protocol, achieving parity with the existing Thrift implementation.

## Core Implementation
- IGetObjectsDataProvider async implementation for SEA in StatementExecutionConnection (GetCatalogs, GetSchemas, GetTables, GetColumns via SHOW commands)
- Statement-level metadata routing in StatementExecutionStatement (GetColumns, GetColumnsExtended, GetPrimaryKeys, GetCrossReference, GetTableSchema, GetTableTypes, GetInfo)
- SQL command builders (ShowCatalogs/Schemas/Tables/Columns/Keys/ForeignKeys) with pattern conversion (ADBC % → Databricks *)
- ColumnMetadataHelper for computing column metadata from type name strings (SEA-only; Thrift gets these from the server)
- FlatColumnsResultBuilder for building flat GetColumns results
- MetadataUtilities for shared catalog/pattern helpers

## Connection Parameters
- EnablePKFK: gate PK/FK queries (default: true)
- EnableMultipleCatalogSupport: single- vs multi-catalog mode (default: true), matching Thrift DatabricksConnection behavior
- DatabricksStatement-specific options silently accepted in SEA

## Design
- Shared MetadataSchemaFactory (hiveserver2) for Arrow schemas
- Shared GetObjectsResultBuilder for nested GetObjects structure
- Async internally; blocks once at the ADBC sync boundary
- x-databricks-sea-can-run-fully-sync header for metadata queries
- Design doc: csharp/doc/sea-metadata-design.md

## Testing
- Unit tests for ShowCommands and MetadataUtilities (146 tests)
- E2E tests comparing Thrift and SEA parity (17 tests)
- Metadata comparator tool with schema type comparison

## Reviewed in
- hiveserver2: #21, #22 (merged)
- databricks: #257, #258, #259, #260, #261, #282 (reviewed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4115e5f commit 9be1721

25 files changed (+3725 −357 lines)

csharp/doc/sea-metadata-design.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
<!--
Copyright (c) 2025 ADBC Drivers Contributors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# SEA Metadata Architecture

## Overview

The Databricks ADBC driver supports two protocols for metadata retrieval:

- **Thrift** (HiveServer2): Uses Thrift RPC calls to fetch metadata from the server
- **SEA** (Statement Execution API): Uses SQL commands (`SHOW CATALOGS`, `SHOW SCHEMAS`, etc.) via the REST API

Both protocols share the same Arrow result structure for `GetObjects`, `GetColumns`, `GetPrimaryKeys`, and other metadata operations. This document describes how shared code is structured to avoid duplication while allowing each protocol to fetch data from its respective backend.
## Shared Interface: IGetObjectsDataProvider

```
AdbcConnection.GetObjects()                          [sync ADBC API]
        │
        ▼
GetObjectsResultBuilder.BuildGetObjectsResultAsync() [async orchestrator]
        ├─ provider.GetCatalogsAsync()
        ├─ provider.GetSchemasAsync()
        ├─ provider.GetTablesAsync()
        └─ provider.PopulateColumnInfoAsync()
        │
        ▼
BuildResult() → HiveInfoArrowStream                  [Arrow structure construction]
```

`IGetObjectsDataProvider` is the abstraction between "how to fetch metadata" and "how to build Arrow results":

- **GetObjectsResultBuilder** knows how to construct the nested Arrow structures (catalog → schema → table → column) required by the ADBC GetObjects spec.
- **IGetObjectsDataProvider** implementations know how to retrieve the raw data from their protocol.
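As a hedged illustration of that split (the real interface has more methods and parameters and lives alongside the builder in hiveserver2), a trimmed provider can be faked in memory, which is also how the builder can be unit-tested without a Thrift or SEA backend:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch only: the real IGetObjectsDataProvider exposes more methods
// (schemas, tables, column population) and takes filter parameters.
interface IGetObjectsDataProvider
{
    Task<IReadOnlyList<string>> GetCatalogsAsync();
}

// Hypothetical in-memory provider: returns canned catalog names so the
// Arrow-building side can be exercised without a server.
class FakeProvider : IGetObjectsDataProvider
{
    public Task<IReadOnlyList<string>> GetCatalogsAsync() =>
        Task.FromResult<IReadOnlyList<string>>(new[] { "main", "samples" });
}
```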
### Thrift Implementation (HiveServer2Connection)

- Calls Thrift RPCs: `GetCatalogsAsync()`, `GetSchemasAsync()`, `GetTablesAsync()`, `GetColumnsAsync()`
- Server returns typed result sets with precision, scale, and column size
- `SetPrecisionScaleAndTypeName` override handles per-connection type mapping

### SEA Implementation (StatementExecutionConnection)

- Executes SQL: `SHOW CATALOGS`, `SHOW SCHEMAS IN ...`, `SHOW TABLES IN ...`, `SHOW COLUMNS IN ...`
- Server returns type name strings only — metadata is computed locally via `ColumnMetadataHelper`
- `ColumnMetadataHelper.PopulateTableInfoFromTypeName` derives data type codes, column sizes, and decimal digits from type names
## Async Design

The ADBC base class defines `GetObjects()` as synchronous:

```csharp
public abstract IArrowArrayStream GetObjects(GetObjectsDepth depth, ...);
```

Internally, the interface and builder are async:

```csharp
interface IGetObjectsDataProvider {
    Task<IReadOnlyList<string>> GetCatalogsAsync(...);
    // ...
}

static async Task<HiveInfoArrowStream> BuildGetObjectsResultAsync(
    IGetObjectsDataProvider provider, ...) { ... }
```

The sync ADBC boundary blocks once at the top level:

```csharp
public override IArrowArrayStream GetObjects(...) {
    return BuildGetObjectsResultAsync(this, ...).GetAwaiter().GetResult();
}
```

This avoids nested `.Result` blocking calls on every Thrift RPC while maintaining the sync ADBC API contract.
## Shared Schema Factories

`MetadataSchemaFactory` (in hiveserver2) provides schema definitions used by both protocols:

| Factory Method | Used By |
|---|---|
| `CreateCatalogsSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateSchemasSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateTablesSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateColumnMetadataSchema()` | DatabricksStatement, FlatColumnsResultBuilder |
| `CreatePrimaryKeysSchema()` | MetadataSchemaFactory builders |
| `CreateCrossReferenceSchema()` | MetadataSchemaFactory builders |
| `BuildGetInfoResult()` | HiveServer2Connection, StatementExecutionConnection |
## Type Mapping

### Thrift Path

```
Server result → SetPrecisionScaleAndTypeName (per-connection override)
    ├─ SparkConnection: parses DECIMAL/CHAR precision from type name
    └─ HiveServer2ExtendedConnection: uses server-provided values
```

For flat GetColumns, `EnhanceGetColumnsResult` (on HiveServer2Statement) adds a BASE_TYPE_NAME column and optionally overrides precision/scale by calling `SetPrecisionScaleAndTypeName` per row. This is Thrift-only — SEA builds the complete result from scratch via FlatColumnsResultBuilder.

### SEA Path

```
SHOW COLUMNS response → ColumnMetadataHelper.PopulateTableInfoFromTypeName
    └─ Computes: data type code, column size, decimal digits, base type name
```
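To make the SEA path concrete, here is a hedged sketch of the kind of type-name parsing involved. The real `ColumnMetadataHelper` handles many more type names and edge cases; the `Parse` helper below is illustrative only:

```csharp
using System;

static class TypeNameParsing
{
    // Sketch: split a type name such as "DECIMAL(10,2)" into a base name
    // plus optional precision (column size) and scale (decimal digits).
    // Names without parameters, e.g. "STRING", yield nulls for both.
    public static (string BaseName, int? Size, int? Digits) Parse(string typeName)
    {
        int open = typeName.IndexOf('(');
        if (open < 0)
            return (typeName.Trim().ToUpperInvariant(), null, null);

        string baseName = typeName.Substring(0, open).Trim().ToUpperInvariant();
        int close = typeName.IndexOf(')', open);
        string[] args = typeName.Substring(open + 1, close - open - 1).Split(',');
        int size = int.Parse(args[0].Trim());
        int? digits = args.Length > 1 ? int.Parse(args[1].Trim()) : (int?)null;
        return (baseName, size, digits);
    }
}
```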
### Shared GetArrowType

`HiveServer2Connection.GetArrowType()` (internal static) converts a column type ID to an Apache Arrow type. Both Thrift and SEA use this — SEA derives the type ID via `ColumnMetadataHelper.GetDataTypeCode()` first.
## SQL Command Builders

SEA metadata uses `MetadataCommandBase` with command subclasses:

| Command | SQL Generated |
|---|---|
| `ShowCatalogsCommand` | `SHOW CATALOGS [LIKE 'pattern']` |
| `ShowSchemasCommand` | `SHOW SCHEMAS IN \`catalog\` [LIKE 'pattern']` |
| `ShowTablesCommand` | `SHOW TABLES IN CATALOG \`catalog\` [SCHEMA LIKE ...] [LIKE ...]` |
| `ShowColumnsCommand` | `SHOW COLUMNS IN CATALOG \`catalog\` [SCHEMA LIKE ...] [TABLE LIKE ...] [LIKE ...]` |
| `ShowKeysCommand` | `SHOW KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` |
| `ShowForeignKeysCommand` | `SHOW FOREIGN KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` |
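The shape of these builders can be sketched for the simplest case, `ShowSchemasCommand`. The escaping rules below (doubling embedded backticks and single quotes) are assumptions for illustration, not the driver's exact implementation:

```csharp
using System;

static class ShowSchemasSql
{
    // Sketch: backtick-quote the catalog identifier and append an
    // optional LIKE pattern. Doubling embedded backticks/quotes is an
    // assumed escaping rule.
    public static string Build(string catalog, string? pattern)
    {
        string quoted = "`" + catalog.Replace("`", "``") + "`";
        string sql = "SHOW SCHEMAS IN " + quoted;
        if (!string.IsNullOrEmpty(pattern))
            sql += " LIKE '" + pattern.Replace("'", "''") + "'";
        return sql;
    }
}
```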
Pattern conversion: ADBC `%` → Databricks `*`, ADBC `_` → Databricks `.`
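The conversion above can be sketched as a single pass over the pattern. The backslash-escape handling is an assumption about behavior, not the exact driver code:

```csharp
using System.Text;

static class PatternConversion
{
    // Sketch: map ADBC SQL wildcards to Databricks SHOW-command wildcards.
    // '%' (any sequence) -> '*', '_' (any single character) -> '.'.
    // Backslash-escaped characters pass through literally (assumed).
    public static string ToDatabricks(string adbcPattern)
    {
        var sb = new StringBuilder(adbcPattern.Length);
        for (int i = 0; i < adbcPattern.Length; i++)
        {
            char c = adbcPattern[i];
            if (c == '\\' && i + 1 < adbcPattern.Length)
                sb.Append(adbcPattern[++i]); // keep escaped char as-is
            else if (c == '%')
                sb.Append('*');
            else if (c == '_')
                sb.Append('.');
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```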
## GetObjects RPC Count

Each `IGetObjectsDataProvider` method makes one server call. Total RPCs by depth:

| Depth | Methods Called | RPCs |
|---|---|---|
| Catalogs | GetCatalogsAsync | 1 |
| DbSchemas | + GetSchemasAsync | 2 |
| Tables | + GetTablesAsync | 3 |
| All | + PopulateColumnInfoAsync | 4 |

csharp/src/DatabricksConnection.cs

Lines changed: 2 additions & 1 deletion
@@ -47,6 +47,7 @@ namespace AdbcDrivers.Databricks
 {
     internal class DatabricksConnection : SparkHttpConnection
     {
+        internal const string DatabricksDriverName = "ADBC Databricks Driver";
         internal static new readonly string s_assemblyName = ApacheUtility.GetAssemblyName(typeof(DatabricksConnection));
         internal static new readonly string s_assemblyVersion = ApacheUtility.GetAssemblyVersion(typeof(DatabricksConnection));

@@ -408,7 +409,7 @@ protected override HttpMessageHandler CreateHttpHandler()

         protected override bool GetObjectsPatternsRequireLowerCase => true;

-        protected override string DriverName => "ADBC Databricks Driver";
+        protected override string DriverName => DatabricksDriverName;

         internal override IArrowArrayStream NewReader<T>(T statement, Schema schema, IResponse response, TGetResultSetMetadataResp? metadataResp = null)
         {