| 1 | +# SEA Metadata Architecture |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The Databricks ADBC driver supports two protocols for metadata retrieval: |
| 6 | +- **Thrift** (HiveServer2): Uses Thrift RPC calls to fetch metadata from the server |
| 7 | +- **SEA** (Statement Execution API): Uses SQL commands (`SHOW CATALOGS`, `SHOW SCHEMAS`, etc.) via the REST API |
| 8 | + |
| 9 | +Both protocols share the same Arrow result structure for `GetObjects`, `GetColumns`, `GetPrimaryKeys`, and other metadata operations. This document describes how shared code is structured to avoid duplication while allowing each protocol to fetch data from its respective backend. |
| 10 | + |
| 11 | +## Shared Interface: IGetObjectsDataProvider |
| 12 | + |
| 13 | +``` |
| 14 | +AdbcConnection.GetObjects() [sync ADBC API] |
| 15 | + │ |
| 16 | + ▼ |
| 17 | +GetObjectsResultBuilder.BuildGetObjectsResultAsync() [async orchestrator] |
| 18 | + │ |
| 19 | + ├─ provider.GetCatalogsAsync() |
| 20 | + ├─ provider.GetSchemasAsync() |
| 21 | + ├─ provider.GetTablesAsync() |
| 22 | + └─ provider.PopulateColumnInfoAsync() |
| 23 | + │ |
| 24 | + ▼ |
| 25 | + BuildResult() → HiveInfoArrowStream [Arrow structure construction] |
| 26 | +``` |
| 27 | + |
| 28 | +`IGetObjectsDataProvider` is the abstraction between "how to fetch metadata" and "how to build Arrow results": |
| 29 | + |
| 30 | +- **GetObjectsResultBuilder** knows how to construct the nested Arrow structures (catalog → schema → table → column) required by the ADBC GetObjects spec. |
| 31 | +- **IGetObjectsDataProvider** implementations know how to retrieve the raw data from their protocol. |
| 32 | + |
| 33 | +### Thrift Implementation (HiveServer2Connection) |
| 34 | +- Calls Thrift RPCs: `GetCatalogsAsync()`, `GetSchemasAsync()`, `GetTablesAsync()`, `GetColumnsAsync()` |
| 35 | +- Server returns typed result sets with precision, scale, column size |
| 36 | +- `SetPrecisionScaleAndTypeName` override handles per-connection type mapping |
| 37 | + |
| 38 | +### SEA Implementation (StatementExecutionConnection) |
| 39 | +- Executes SQL: `SHOW CATALOGS`, `SHOW SCHEMAS IN ...`, `SHOW TABLES IN ...`, `SHOW COLUMNS IN ...` |
| 40 | +- Server returns type name strings only — metadata is computed locally via `ColumnMetadataHelper` |
| 41 | +- `ColumnMetadataHelper.PopulateTableInfoFromTypeName` derives data type codes, column sizes, decimal digits from type names |
| 42 | + |
| 43 | +## Async Design |
| 44 | + |
| 45 | +The ADBC base class defines `GetObjects()` as synchronous: |
| 46 | +```csharp |
| 47 | +public abstract IArrowArrayStream GetObjects(GetObjectsDepth depth, ...); |
| 48 | +``` |
| 49 | + |
| 50 | +Internally, the interface and builder are async: |
| 51 | +```csharp |
| 52 | +interface IGetObjectsDataProvider { |
| 53 | + Task<IReadOnlyList<string>> GetCatalogsAsync(...); |
| 54 | + // ... |
| 55 | +} |
| 56 | + |
| 57 | +static async Task<HiveInfoArrowStream> BuildGetObjectsResultAsync( |
| 58 | + IGetObjectsDataProvider provider, ...) { ... } |
| 59 | +``` |
| 60 | + |
| 61 | +The sync ADBC boundary blocks once at the top level: |
| 62 | +```csharp |
| 63 | +public override IArrowArrayStream GetObjects(...) { |
| 64 | + return BuildGetObjectsResultAsync(this, ...).GetAwaiter().GetResult(); |
| 65 | +} |
| 66 | +``` |
| 67 | + |
| 68 | +This avoids nested `.Result` blocking calls on every Thrift RPC while maintaining the sync ADBC API contract. |
| 69 | + |
| 70 | +## Shared Schema Factories |
| 71 | + |
| 72 | +`MetadataSchemaFactory` (in hiveserver2) provides schema definitions used by both protocols: |
| 73 | + |
| 74 | +| Factory Method | Used By | |
| 75 | +|---|---| |
| 76 | +| `CreateCatalogsSchema()` | DatabricksStatement, StatementExecutionStatement | |
| 77 | +| `CreateSchemasSchema()` | DatabricksStatement, StatementExecutionStatement | |
| 78 | +| `CreateTablesSchema()` | DatabricksStatement, StatementExecutionStatement | |
| 79 | +| `CreateColumnMetadataSchema()` | DatabricksStatement, FlatColumnsResultBuilder | |
| 80 | +| `CreatePrimaryKeysSchema()` | MetadataSchemaFactory builders | |
| 81 | +| `CreateCrossReferenceSchema()` | MetadataSchemaFactory builders | |
| 82 | +| `BuildGetInfoResult()` | HiveServer2Connection, StatementExecutionConnection | |
| 83 | + |
| 84 | +## Type Mapping |
| 85 | + |
| 86 | +### Thrift Path |
| 87 | +``` |
| 88 | +Server result → SetPrecisionScaleAndTypeName (per-connection override) |
| 89 | + ├─ SparkConnection: parses DECIMAL/CHAR precision from type name |
| 90 | + └─ HiveServer2ExtendedConnection: uses server-provided values |
| 91 | +
|
| 92 | +For flat GetColumns, EnhanceGetColumnsResult (on HiveServer2Statement) adds |
| 93 | +a BASE_TYPE_NAME column and optionally overrides precision/scale by calling |
| 94 | +SetPrecisionScaleAndTypeName per row. This is Thrift-only — SEA builds the |
| 95 | +complete result from scratch via FlatColumnsResultBuilder. |
| 96 | +``` |
| 97 | + |
| 98 | +### SEA Path |
| 99 | +``` |
| 100 | +SHOW COLUMNS response → ColumnMetadataHelper.PopulateTableInfoFromTypeName |
| 101 | + └─ Computes: data type code, column size, decimal digits, base type name |
| 102 | +``` |
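
To make the SEA path concrete, here is a minimal sketch of deriving a base type name, column size, and decimal digits from a type name string. `TypeNameSketch` and `Parse` are hypothetical names used only for illustration; they are not the driver's `ColumnMetadataHelper`, whose actual rules (for example, default sizes for types without arguments, or data type codes) are not shown in this document and are not reproduced here.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not the driver's ColumnMetadataHelper.
public static class TypeNameSketch
{
    // Splits a type name such as "DECIMAL(10,2)" into ("DECIMAL", 10, 2).
    // Size and Scale are null when the type name carries no such argument.
    public static (string BaseTypeName, int? Size, int? Scale) Parse(string typeName)
    {
        string trimmed = typeName.Trim();

        // Assumption: complex types (ARRAY<...>, MAP<...>, STRUCT<...>) report
        // only their outer name and have no size or scale of their own.
        int angle = trimmed.IndexOf('<');
        if (angle >= 0)
        {
            return (trimmed.Substring(0, angle).Trim().ToUpperInvariant(), null, null);
        }

        int open = trimmed.IndexOf('(');
        int close = trimmed.LastIndexOf(')');
        if (open < 0 || close < open)
        {
            // No parenthesized arguments, e.g. "STRING" or "INT".
            return (trimmed.ToUpperInvariant(), null, null);
        }

        string baseName = trimmed.Substring(0, open).Trim().ToUpperInvariant();
        string[] args = trimmed.Substring(open + 1, close - open - 1).Split(',');

        int? size = ParseOrNull(args[0]);
        int? scale = args.Length > 1 ? ParseOrNull(args[1]) : null;
        return (baseName, size, scale);
    }

    // Parses an integer argument, returning null if it is not a valid number.
    private static int? ParseOrNull(string text)
    {
        return int.TryParse(text.Trim(), NumberStyles.Integer, CultureInfo.InvariantCulture, out int value)
            ? value
            : (int?)null;
    }
}
```

Under these assumptions, `TypeNameSketch.Parse("decimal(10,2)")` yields the base name `DECIMAL` with size 10 and scale 2, while `TypeNameSketch.Parse("STRING")` yields `STRING` with no size or scale.
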
| 103 | + |
| 104 | +### Shared GetArrowType |
| 105 | +`HiveServer2Connection.GetArrowType()` (internal static) converts a column type ID to an Apache Arrow type. Both Thrift and SEA use this — SEA derives the type ID via `ColumnMetadataHelper.GetDataTypeCode()` first. |
| 106 | + |
| 107 | +## SQL Command Builders |
| 108 | + |
| 109 | +SEA metadata uses `MetadataCommandBase` with command subclasses: |
| 110 | + |
| 111 | +| Command | SQL Generated | |
| 112 | +|---|---| |
| 113 | +| `ShowCatalogsCommand` | `SHOW CATALOGS [LIKE 'pattern']` | |
| 114 | +| `ShowSchemasCommand` | `SHOW SCHEMAS IN \`catalog\` [LIKE 'pattern']` | |
| 115 | +| `ShowTablesCommand` | `SHOW TABLES IN CATALOG \`catalog\` [SCHEMA LIKE ...] [LIKE ...]` | |
| 116 | +| `ShowColumnsCommand` | `SHOW COLUMNS IN CATALOG \`catalog\` [SCHEMA LIKE ...] [TABLE LIKE ...] [LIKE ...]` | |
| 117 | +| `ShowKeysCommand` | `SHOW KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` | |
| 118 | +| `ShowForeignKeysCommand` | `SHOW FOREIGN KEYS IN CATALOG ... IN SCHEMA ... IN TABLE ...` | |
| 119 | + |
| 120 | +Pattern conversion: ADBC `%` → Databricks `*`, ADBC `_` → Databricks `.` |
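
The pattern conversion above can be sketched as a character-by-character mapping. `PatternSketch.ToDatabricksPattern` is a hypothetical name, not the driver's actual method, and the sketch assumes no escape handling (for example a literal `_` written as `\_`), since this document does not describe any.

```csharp
using System.Text;

// Hypothetical helper for illustration only; the driver's real converter may differ.
public static class PatternSketch
{
    // Maps ADBC wildcards to their Databricks equivalents:
    //   '%' (any sequence of characters) becomes '*'
    //   '_' (any single character)       becomes '.'
    // All other characters are copied unchanged.
    public static string ToDatabricksPattern(string adbcPattern)
    {
        var sb = new StringBuilder(adbcPattern.Length);
        foreach (char c in adbcPattern)
        {
            switch (c)
            {
                case '%': sb.Append('*'); break;
                case '_': sb.Append('.'); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

For example, the ADBC pattern `sales_%` becomes `sales.*`, and a pattern without wildcards such as `main` passes through unchanged.
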
| 121 | + |
| 122 | +## GetObjects RPC Count |
| 123 | + |
| 124 | +Each `IGetObjectsDataProvider` method makes one server call (a Thrift RPC, or a SQL statement over the REST API for SEA), so the total grows by one per depth level: |
| 125 | + |
| 126 | +| Depth | Methods Called | RPCs | |
| 127 | +|---|---|---| |
| 128 | +| Catalogs | GetCatalogsAsync | 1 | |
| 129 | +| DbSchemas | + GetSchemasAsync | 2 | |
| 130 | +| Tables | + GetTablesAsync | 3 | |
| 131 | +| All | + PopulateColumnInfoAsync | 4 | |
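
The call counts in the table can be reproduced with a small sketch of the orchestration: an async builder that stops at the requested depth, a provider that counts its calls, and a single blocking call at the sync boundary. All type names here (`Depth`, `IProviderSketch`, `CountingProvider`, `BuilderSketch`) are hypothetical simplifications; the real `IGetObjectsDataProvider` and `GetObjectsResultBuilder` take additional parameters and build Arrow structures rather than returning a count.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the ADBC depth enum, ordered shallow to deep.
public enum Depth { Catalogs, DbSchemas, Tables, All }

// Hypothetical, parameterless simplification of the provider interface.
public interface IProviderSketch
{
    Task<IReadOnlyList<string>> GetCatalogsAsync();
    Task<IReadOnlyList<string>> GetSchemasAsync();
    Task<IReadOnlyList<string>> GetTablesAsync();
    Task PopulateColumnInfoAsync();
}

// Fake provider that records how many "server calls" were made.
public sealed class CountingProvider : IProviderSketch
{
    public int Calls { get; private set; }

    private Task<IReadOnlyList<string>> Call(params string[] rows)
    {
        Calls++; // one server call per provider method
        return Task.FromResult<IReadOnlyList<string>>(rows);
    }

    public Task<IReadOnlyList<string>> GetCatalogsAsync() => Call("main");
    public Task<IReadOnlyList<string>> GetSchemasAsync() => Call("default");
    public Task<IReadOnlyList<string>> GetTablesAsync() => Call("orders");

    public Task PopulateColumnInfoAsync()
    {
        Calls++;
        return Task.CompletedTask;
    }
}

public static class BuilderSketch
{
    // Async orchestrator: fetches only what the requested depth needs.
    public static async Task BuildAsync(IProviderSketch provider, Depth depth)
    {
        await provider.GetCatalogsAsync().ConfigureAwait(false);
        if (depth >= Depth.DbSchemas) await provider.GetSchemasAsync().ConfigureAwait(false);
        if (depth >= Depth.Tables) await provider.GetTablesAsync().ConfigureAwait(false);
        if (depth >= Depth.All) await provider.PopulateColumnInfoAsync().ConfigureAwait(false);
    }

    // Sync boundary: blocks exactly once, at the top level.
    public static void Build(IProviderSketch provider, Depth depth)
    {
        BuildAsync(provider, depth).GetAwaiter().GetResult();
    }
}
```

Calling `BuilderSketch.Build(provider, Depth.Tables)` on a fresh `CountingProvider` leaves `provider.Calls` at 3, matching the `Tables` row above.
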