131 changes: 131 additions & 0 deletions csharp/doc/sea-metadata-design.md
@@ -0,0 +1,131 @@
# SEA Metadata Architecture

## Overview

The Databricks ADBC driver supports two protocols for metadata retrieval:
- **Thrift** (HiveServer2): Uses Thrift RPC calls to fetch metadata from the server
- **SEA** (Statement Execution API): Uses SQL commands (`SHOW CATALOGS`, `SHOW SCHEMAS`, etc.) via the REST API

Both protocols share the same Arrow result structure for `GetObjects`, `GetColumns`, `GetPrimaryKeys`, and other metadata operations. This document describes how shared code is structured to avoid duplication while allowing each protocol to fetch data from its respective backend.

## Shared Interface: IGetObjectsDataProvider

```
AdbcConnection.GetObjects()                             [sync ADBC API]
    │
    ▼
GetObjectsResultBuilder.BuildGetObjectsResultAsync()    [async orchestrator]
    │
    ├─ provider.GetCatalogsAsync()
    ├─ provider.GetSchemasAsync()
    ├─ provider.GetTablesAsync()
    └─ provider.PopulateColumnInfoAsync()
    │
    ▼
BuildResult() → HiveInfoArrowStream                     [Arrow structure construction]
```

`IGetObjectsDataProvider` is the abstraction between "how to fetch metadata" and "how to build Arrow results":

- **GetObjectsResultBuilder** knows how to construct the nested Arrow structures (catalog → schema → table → column) required by the ADBC GetObjects spec.
- **IGetObjectsDataProvider** implementations know how to retrieve the raw data from their protocol.
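
The split between fetching and building can be sketched as follows. This is a minimal illustration, not the driver's code: `ISimpleMetadataProvider`, `InMemoryProvider`, and `SimpleResultBuilder` are hypothetical names, the method signatures are simplified, and a plain dictionary stands in for the nested Arrow structures.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical, trimmed-down provider; the real IGetObjectsDataProvider
// has more methods and parameters than shown here.
public interface ISimpleMetadataProvider
{
    Task<IReadOnlyList<string>> GetCatalogsAsync();
    Task<IReadOnlyList<(string Catalog, string Schema)>> GetSchemasAsync();
}

// Stand-in for a protocol backend (Thrift RPCs or SEA SQL); serves fixed data.
public sealed class InMemoryProvider : ISimpleMetadataProvider
{
    public Task<IReadOnlyList<string>> GetCatalogsAsync() =>
        Task.FromResult<IReadOnlyList<string>>(new[] { "main", "samples" });

    public Task<IReadOnlyList<(string Catalog, string Schema)>> GetSchemasAsync() =>
        Task.FromResult<IReadOnlyList<(string Catalog, string Schema)>>(
            new (string Catalog, string Schema)[]
            {
                ("main", "default"),
                ("samples", "nyctaxi"),
                ("samples", "tpch"),
            });
}

public static class SimpleResultBuilder
{
    // Protocol-agnostic step: nests flat schema rows under their catalog.
    // It only sees the interface, never the backend that produced the rows.
    public static async Task<Dictionary<string, List<string>>> BuildAsync(
        ISimpleMetadataProvider provider)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (var catalog in await provider.GetCatalogsAsync())
        {
            result[catalog] = new List<string>();
        }
        foreach (var (catalog, schema) in await provider.GetSchemasAsync())
        {
            if (result.TryGetValue(catalog, out var schemas))
            {
                schemas.Add(schema);
            }
        }
        return result;
    }
}
```

Swapping `InMemoryProvider` for a different implementation leaves `BuildAsync` untouched, which is the property the shared builder relies on.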

### Thrift Implementation (HiveServer2Connection)
- Calls Thrift RPCs: `GetCatalogsAsync()`, `GetSchemasAsync()`, `GetTablesAsync()`, `GetColumnsAsync()`
- Server returns typed result sets with precision, scale, column size
- `SetPrecisionScaleAndTypeName` override handles per-connection type mapping

### SEA Implementation (StatementExecutionConnection)
- Executes SQL: `SHOW CATALOGS`, `SHOW SCHEMAS IN ...`, `SHOW TABLES IN ...`, `SHOW COLUMNS IN ...`
- Server returns type name strings only; the remaining column metadata is computed locally via `ColumnMetadataHelper`
- `ColumnMetadataHelper.PopulateTableInfoFromTypeName` derives data type codes, column sizes, decimal digits from type names

## Async Design

The ADBC base class defines `GetObjects()` as synchronous:
```csharp
public abstract IArrowArrayStream GetObjects(GetObjectsDepth depth, ...);
```

Internally, the interface and builder are async:
```csharp
interface IGetObjectsDataProvider {
    Task<IReadOnlyList<string>> GetCatalogsAsync(...);
    // ...
}

static async Task<HiveInfoArrowStream> BuildGetObjectsResultAsync(
    IGetObjectsDataProvider provider, ...) { ... }
```

The sync ADBC boundary blocks once at the top level:
```csharp
public override IArrowArrayStream GetObjects(...) {
    // Single blocking point: everything below this call stays async.
    return BuildGetObjectsResultAsync(this, ...).GetAwaiter().GetResult();
}
```

This avoids nested `.Result` blocking calls on every Thrift RPC while maintaining the sync ADBC API contract.

## Shared Schema Factories

`MetadataSchemaFactory` (in hiveserver2) provides schema definitions used by both protocols:

| Factory Method | Used By |
|---|---|
| `CreateCatalogsSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateSchemasSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateTablesSchema()` | DatabricksStatement, StatementExecutionStatement |
| `CreateColumnMetadataSchema()` | DatabricksStatement, FlatColumnsResultBuilder |
| `CreatePrimaryKeysSchema()` | MetadataSchemaFactory builders |
| `CreateCrossReferenceSchema()` | MetadataSchemaFactory builders |
| `BuildGetInfoResult()` | HiveServer2Connection, StatementExecutionConnection |

## Type Mapping

### Thrift Path
```
Server result → SetPrecisionScaleAndTypeName (per-connection override)
    ├─ SparkConnection: parses DECIMAL/CHAR precision from type name
    └─ HiveServer2ExtendedConnection: uses server-provided values
```

For flat `GetColumns`, `EnhanceGetColumnsResult` (on `HiveServer2Statement`) adds a `BASE_TYPE_NAME` column and optionally overrides precision/scale by calling `SetPrecisionScaleAndTypeName` per row. This is Thrift-only; SEA builds the complete result from scratch via `FlatColumnsResultBuilder`.

### SEA Path
```
SHOW COLUMNS response → ColumnMetadataHelper.PopulateTableInfoFromTypeName
    └─ Computes: data type code, column size, decimal digits, base type name
```
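
To illustrate what deriving metadata from a type name string involves, here is a minimal sketch. It is not the logic of `ColumnMetadataHelper`: `TypeNameSketch.Parse` is a hypothetical helper that only handles simple names such as `INT`, `VARCHAR(255)`, and `DECIMAL(10,2)`, and it assumes that `DECIMAL(p)` without a scale means scale 0. Complex types such as `ARRAY<INT>` fall through unparsed.

```csharp
using System.Text.RegularExpressions;

public static class TypeNameSketch
{
    // Matches "NAME", "NAME(p)" or "NAME(p,s)" with optional whitespace.
    private static readonly Regex Pattern = new Regex(
        @"^\s*([A-Za-z_]+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*$");

    // Hypothetical helper: splits a type name into base name, size, and scale.
    public static (string BaseTypeName, int? ColumnSize, int? DecimalDigits) Parse(
        string typeName)
    {
        var match = Pattern.Match(typeName);
        if (!match.Success)
        {
            // Unrecognized shape (for example a complex type): return the name as-is.
            return (typeName.Trim().ToUpperInvariant(), null, null);
        }

        string baseName = match.Groups[1].Value.ToUpperInvariant();
        int? size = match.Groups[2].Success
            ? int.Parse(match.Groups[2].Value)
            : (int?)null;
        int? digits = match.Groups[3].Success
            ? int.Parse(match.Groups[3].Value)
            : (int?)null;

        // Assumption: DECIMAL(p) with no explicit scale has scale 0.
        if (baseName == "DECIMAL" && size.HasValue && !digits.HasValue)
        {
            digits = 0;
        }
        return (baseName, size, digits);
    }
}
```

A real implementation would also map the base name to a data type code and supply default sizes for fixed-width types, which this sketch leaves out.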

### Shared GetArrowType
`HiveServer2Connection.GetArrowType()` (internal static) converts a column type ID to an Apache Arrow type. Both Thrift and SEA use this; SEA first derives the type ID via `ColumnMetadataHelper.GetDataTypeCode()`.

## SQL Command Builders

SEA metadata uses `MetadataCommandBase` with command subclasses:

| Command | SQL Generated |
|---|---|
| `ShowCatalogsCommand` | `SHOW CATALOGS [LIKE 'pattern']` |
| `ShowSchemasCommand` | ``SHOW SCHEMAS IN `catalog` [LIKE 'pattern']`` |
| `ShowTablesCommand` | ``SHOW TABLES IN CATALOG `catalog` [SCHEMA LIKE ...] [LIKE ...]`` |
| `ShowColumnsCommand` | ``SHOW COLUMNS IN CATALOG `catalog` [SCHEMA LIKE ...] [TABLE LIKE ...] [LIKE ...]`` |
| `ShowKeysCommand` | `SHOW KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` |
| `ShowForeignKeysCommand` | `SHOW FOREIGN KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` |

Pattern conversion: ADBC `%` → Databricks `*`, ADBC `_` → Databricks `.`
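
The wildcard mapping above can be sketched as a character-by-character translation. `PatternSketch.ToDatabricksPattern` is a hypothetical name, not the driver's API, and the sketch deliberately ignores escaped wildcards and SQL quoting, which real command builders would need to handle.

```csharp
using System.Text;

public static class PatternSketch
{
    // Hypothetical helper: translates ADBC wildcards into the Databricks
    // equivalents listed above. Escaped wildcards are not handled.
    public static string ToDatabricksPattern(string adbcPattern)
    {
        var converted = new StringBuilder(adbcPattern.Length);
        foreach (char c in adbcPattern)
        {
            switch (c)
            {
                case '%': converted.Append('*'); break; // any sequence of characters
                case '_': converted.Append('.'); break; // exactly one character
                default: converted.Append(c); break;
            }
        }
        return converted.ToString();
    }
}
```

Under this mapping, the ADBC pattern `sales_%` becomes `sales.*` before it is placed in a `LIKE` clause.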

## GetObjects RPC Count

Each `IGetObjectsDataProvider` method makes one server call, and each depth adds one method to the previous level. Total RPCs by depth:

| Depth | Methods Called | RPCs |
|---|---|---|
| Catalogs | GetCatalogsAsync | 1 |
| DbSchemas | + GetSchemasAsync | 2 |
| Tables | + GetTablesAsync | 3 |
| All | + PopulateColumnInfoAsync | 4 |
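
The cumulative counts in the table follow from an orchestrator that stops as soon as the requested depth is reached. The sketch below is illustrative only: `Depth`, `CountingProvider`, and `DepthOrchestrator` are hypothetical names, and the provider counts calls instead of contacting a server.

```csharp
using System.Threading.Tasks;

// Hypothetical depth levels mirroring the rows of the table above.
public enum Depth { Catalogs, DbSchemas, Tables, All }

// Counts calls in place of real server round trips.
public sealed class CountingProvider
{
    public int Calls { get; private set; }

    private Task Call()
    {
        Calls++;
        return Task.CompletedTask;
    }

    public Task GetCatalogsAsync() => Call();
    public Task GetSchemasAsync() => Call();
    public Task GetTablesAsync() => Call();
    public Task PopulateColumnInfoAsync() => Call();
}

public static class DepthOrchestrator
{
    // Each level adds exactly one provider call to the previous level.
    public static async Task RunAsync(CountingProvider provider, Depth depth)
    {
        await provider.GetCatalogsAsync();
        if (depth == Depth.Catalogs) return;

        await provider.GetSchemasAsync();
        if (depth == Depth.DbSchemas) return;

        await provider.GetTablesAsync();
        if (depth == Depth.Tables) return;

        await provider.PopulateColumnInfoAsync();
    }

    // Sync wrapper that blocks once at the top level, as the ADBC boundary does.
    public static int CountCalls(Depth depth)
    {
        var provider = new CountingProvider();
        RunAsync(provider, depth).GetAwaiter().GetResult();
        return provider.Calls;
    }
}
```

Calling `DepthOrchestrator.CountCalls` for each depth reproduces the RPC column: 1, 2, 3, and 4.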