Commit 5af266b

Merge pull request #11 from exasol-labs/issue_10_infer_schema_parquet
Infer schema on import of Parquet files
2 parents: f7e56e9 + 045b90b

17 files changed: +1523 −213 lines

CHANGELOG.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# Changelog
+
+## 0.5.0
+
+- Schema inference for Parquet imports
+
+## 0.4.0
+
+- Parallel CSV and Parquet file imports
+
+## <=0.3.2
+
+- ADBC driver implementation
+- Import/export capability via HTTP tunneling
+- Arrow type mapping for Exasol types

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "exarrow-rs"
-version = "0.4.0"
+version = "0.5.0"
 edition = "2021"
 license = "MIT"
 authors = ["Exasol Labs"]

benches/rust/benchmark.rs

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@
 
 use std::env;
 use std::fs;
-use std::path::PathBuf;
+use std::path::{Path, PathBuf};
 use std::time::Instant;
 
 use clap::{Parser, ValueEnum};
@@ -123,7 +123,7 @@ async fn truncate_table(conn: &mut Connection) -> Result<(), Box<dyn std::error:
 
 async fn import_csv(
     conn: &mut Connection,
-    file_path: &PathBuf,
+    file_path: &Path,
 ) -> Result<(i64, f64), Box<dyn std::error::Error>> {
     truncate_table(conn).await?;
 
@@ -140,7 +140,7 @@ async fn import_csv(
 
 async fn import_parquet(
     conn: &mut Connection,
-    file_path: &PathBuf,
+    file_path: &Path,
 ) -> Result<(i64, f64), Box<dyn std::error::Error>> {
     truncate_table(conn).await?;
 
@@ -217,7 +217,7 @@ async fn select_to_polars(
 async fn run_import_benchmark(
     conn: &mut Connection,
     operation: &Operation,
-    file_path: &PathBuf,
+    file_path: &Path,
     iterations: usize,
     warmup: usize,
     file_size_mb: f64,

benches/rust/generate_data.rs

Lines changed: 1 addition & 1 deletion
@@ -85,7 +85,7 @@ fn generate_batch(rng: &mut StdRng, start_id: i64, count: usize) -> RecordBatch
     let ages: Vec<i32> = (0..count).map(|_| rng.gen_range(18..80)).collect();
 
     let salaries: Vec<i128> = (0..count)
-        .map(|_| rng.gen_range(30_000_00i128..500_000_00i128)) // cents
+        .map(|_| rng.gen_range(3_000_000_i128..50_000_000_i128)) // cents
         .collect();
 
     let timestamps: Vec<i64> = (0..count)

specs/import-export/spec.md

Lines changed: 103 additions & 1 deletion
@@ -374,4 +374,106 @@ operations immediately upon first failure to prevent partial data imports.

The existing scenario is unchanged apart from a newly added trailing newline:

- **WHEN** any Parquet file fails to convert to CSV
- **THEN** system SHALL abort all other conversion tasks immediately
- **AND** system SHALL return error indicating which file failed conversion

The remainder of the hunk appends the following new requirements:
### Requirement: Arrow Schema Inference from Parquet Files

The system SHALL support inferring Arrow schemas from Parquet file metadata without reading the full data.

#### Scenario: Infer schema from single Parquet file

- **WHEN** user requests schema inference from a single Parquet file
- **THEN** system SHALL read only the Parquet metadata (not data)
- **AND** system SHALL return the Arrow schema with field names and types
- **AND** system SHALL include nullability information for each field
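As a minimal sketch (not the exarrow-rs API), metadata-only inference with the `parquet` crate can look like this; only the file footer is parsed, and each resulting Arrow field carries its name, type, and nullability flag:

```rust
use std::fs::File;
use std::path::Path;

use arrow_schema::Schema;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Hypothetical helper: derive an Arrow schema from Parquet metadata alone.
fn infer_schema(path: &Path) -> Result<Schema, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    // Constructing the reader parses only the footer, not the row groups.
    let reader = SerializedFileReader::new(file)?;
    let file_meta = reader.metadata().file_metadata();
    // Convert the Parquet schema descriptor (honoring any embedded Arrow
    // schema in the key-value metadata) into an Arrow schema.
    let schema = parquet_to_arrow_schema(
        file_meta.schema_descr(),
        file_meta.key_value_metadata(),
    )?;
    Ok(schema)
}
```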
#### Scenario: Infer union schema from multiple Parquet files

- **WHEN** user requests schema inference from multiple Parquet files
- **THEN** system SHALL read metadata from all files
- **AND** system SHALL compute a union schema that accommodates all files
- **AND** system SHALL widen types when fields have different types across files
- **AND** type widening SHALL follow these rules:
  - Identical types remain unchanged
  - DECIMAL types widen to max(precision), max(scale)
  - VARCHAR types widen to max(size)
  - DECIMAL + DOUBLE widens to DOUBLE
  - Incompatible types fall back to VARCHAR(2000000)
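A sketch of these widening rules over a hypothetical `ExasolType` enum (the crate's real representation may differ); note the VARCHAR(2000000) fallback for incompatible pairs:

```rust
/// Hypothetical Exasol-side type, sized per the rules above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ExasolType {
    Boolean,
    Varchar(u32),     // length in characters
    Decimal(u8, u8),  // (precision, scale)
    Double,
    Date,
    Timestamp,        // "TIMESTAMP"
    TimestampLocalTz, // "TIMESTAMP WITH LOCAL TIME ZONE"
}

/// Fallback size for incompatible type pairs.
const VARCHAR_MAX: u32 = 2_000_000;

fn widen(a: ExasolType, b: ExasolType) -> ExasolType {
    use ExasolType::*;
    match (a, b) {
        // Identical types remain unchanged.
        _ if a == b => a,
        // DECIMAL widens to max(precision), max(scale).
        (Decimal(p1, s1), Decimal(p2, s2)) => Decimal(p1.max(p2), s1.max(s2)),
        // VARCHAR widens to max(size).
        (Varchar(n1), Varchar(n2)) => Varchar(n1.max(n2)),
        // DECIMAL + DOUBLE widens to DOUBLE.
        (Decimal(..), Double) | (Double, Decimal(..)) => Double,
        // Incompatible types fall back to the widest VARCHAR.
        _ => Varchar(VARCHAR_MAX),
    }
}
```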
#### Scenario: Schema inference error handling

- **WHEN** schema inference encounters an error
- **THEN** system SHALL return SchemaInferenceError with file path context
- **AND** system SHALL indicate whether the error was in reading metadata or type conversion
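One plausible shape for such an error (variant and field names are illustrative): it carries the offending path in both variants and distinguishes the metadata-read stage from the type-conversion stage:

```rust
use std::path::PathBuf;

/// Hypothetical error type matching the scenario above.
#[derive(Debug)]
enum SchemaInferenceError {
    /// Failure while reading the Parquet footer/metadata.
    MetadataRead { path: PathBuf, detail: String },
    /// Metadata was readable, but a column type could not be converted.
    TypeConversion { path: PathBuf, column: String, detail: String },
}

impl std::fmt::Display for SchemaInferenceError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::MetadataRead { path, detail } => {
                write!(f, "reading metadata from {}: {detail}", path.display())
            }
            Self::TypeConversion { path, column, detail } => {
                write!(f, "converting column {column} in {}: {detail}", path.display())
            }
        }
    }
}

impl std::error::Error for SchemaInferenceError {}
```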
### Requirement: Arrow to Exasol DDL Generation

The system SHALL support generating Exasol CREATE TABLE DDL statements from inferred schemas.

#### Scenario: Column name handling with Quoted mode

- **WHEN** generating DDL with Quoted column name mode
- **THEN** column names SHALL be wrapped in double quotes
- **AND** internal double quotes in names SHALL be escaped by doubling
- **AND** original column names SHALL be preserved exactly
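Quoted mode reduces to a one-liner; a sketch:

```rust
/// Quote an identifier exactly as given, doubling any embedded quotes.
fn quote_identifier(name: &str) -> String {
    format!("\"{}\"", name.replace('"', "\"\""))
}

// quote_identifier(r#"my "odd" col"#) yields "my ""odd"" col" (outer quotes included).
```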
#### Scenario: Column name handling with Sanitize mode

- **WHEN** generating DDL with Sanitize column name mode
- **THEN** column names SHALL be converted to uppercase
- **AND** invalid identifier characters SHALL be replaced with underscore
- **AND** names starting with digits SHALL be prefixed with underscore
- **AND** Exasol reserved words SHALL be quoted
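A sketch of Sanitize mode under the rules above; the reserved-word list here is a tiny illustrative subset, not Exasol's full list:

```rust
/// Hypothetical sanitizer: uppercase, underscore invalid characters,
/// prefix leading digits, and quote reserved words.
fn sanitize_identifier(name: &str) -> String {
    // Illustrative subset only; Exasol's reserved-word list is much longer.
    const RESERVED: &[&str] = &["SELECT", "TABLE", "ORDER", "GROUP", "USER"];

    let mut out: String = name
        .chars()
        .map(|c| {
            let c = c.to_ascii_uppercase();
            // Keep valid identifier characters, replace everything else.
            if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' }
        })
        .collect();
    if out.chars().next().is_some_and(|c| c.is_ascii_digit()) {
        out.insert(0, '_'); // names starting with digits get a prefix
    }
    if RESERVED.contains(&out.as_str()) {
        out = format!("\"{out}\""); // reserved words are quoted
    }
    out
}
```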
#### Scenario: DDL type generation

- **WHEN** generating DDL column types
- **THEN** ExasolType SHALL be converted to valid DDL syntax
- **AND** BOOLEAN SHALL generate "BOOLEAN"
- **AND** VARCHAR(n) SHALL generate "VARCHAR(n)"
- **AND** DECIMAL(p,s) SHALL generate "DECIMAL(p,s)"
- **AND** DOUBLE SHALL generate "DOUBLE"
- **AND** DATE SHALL generate "DATE"
- **AND** TIMESTAMP SHALL generate "TIMESTAMP" or "TIMESTAMP WITH LOCAL TIME ZONE"
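Rendering follows directly from the list above; this sketch reuses the hypothetical `ExasolType` from the widening example:

```rust
fn type_to_ddl(ty: ExasolType) -> String {
    use ExasolType::*;
    match ty {
        Boolean => "BOOLEAN".to_string(),
        Varchar(n) => format!("VARCHAR({n})"),
        Decimal(p, s) => format!("DECIMAL({p},{s})"),
        Double => "DOUBLE".to_string(),
        Date => "DATE".to_string(),
        Timestamp => "TIMESTAMP".to_string(),
        TimestampLocalTz => "TIMESTAMP WITH LOCAL TIME ZONE".to_string(),
    }
}
```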
#### Scenario: Complete DDL statement generation

- **WHEN** generating CREATE TABLE DDL
- **THEN** output SHALL include "CREATE TABLE schema.table (" prefix
- **AND** output SHALL include column definitions separated by commas
- **AND** output SHALL include closing ");"
- **AND** schema prefix SHALL be optional (omit if not provided)
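Putting the pieces together (hypothetical helper; column names are assumed to be already quoted or sanitized):

```rust
/// Assemble a CREATE TABLE statement; the schema prefix is optional.
fn build_create_table(
    schema: Option<&str>,
    table: &str,
    columns: &[(String, ExasolType)],
) -> String {
    let target = match schema {
        Some(s) => format!("{s}.{table}"),
        None => table.to_string(),
    };
    let cols: Vec<String> = columns
        .iter()
        .map(|(name, ty)| format!("{name} {}", type_to_ddl(*ty)))
        .collect();
    format!("CREATE TABLE {target} ({});", cols.join(", "))
}
```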
### Requirement: Auto Table Creation for Parquet Import

The system SHALL support automatically creating target tables before Parquet import when enabled.

#### Scenario: Auto-create table option enabled

- **WHEN** importing Parquet with create_table_if_not_exists=true
- **AND** target table does not exist
- **THEN** system SHALL infer schema from Parquet file(s)
- **AND** system SHALL generate CREATE TABLE DDL
- **AND** system SHALL execute DDL before IMPORT statement
- **AND** import SHALL proceed normally after table creation
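The decision logic behind this and the two scenarios that follow fits in a few branches. In this sketch the existence check and DDL execution are passed in as closures so it stays independent of the real connection type; all names are illustrative:

```rust
/// Returns Ok(true) if a table was created, Ok(false) if creation was
/// skipped (option disabled, or table already present).
fn ensure_table<E, R>(
    create_table_if_not_exists: bool,
    table_exists: E,
    ddl: &str,
    mut run_ddl: R,
) -> Result<bool, Box<dyn std::error::Error>>
where
    E: FnOnce() -> Result<bool, Box<dyn std::error::Error>>,
    R: FnMut(&str) -> Result<(), Box<dyn std::error::Error>>,
{
    // Default path: no inference, no DDL; the table is assumed to exist.
    if !create_table_if_not_exists {
        return Ok(false);
    }
    // Table already present: skip DDL, import against the existing schema.
    if table_exists()? {
        return Ok(false);
    }
    // Create the table, then let the IMPORT statement proceed normally.
    run_ddl(ddl)?;
    Ok(true)
}
```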
#### Scenario: Auto-create with existing table

- **WHEN** importing Parquet with create_table_if_not_exists=true
- **AND** target table already exists
- **THEN** system SHALL skip DDL execution
- **AND** import SHALL proceed normally using existing table schema

#### Scenario: Auto-create option disabled (default)

- **WHEN** importing Parquet with create_table_if_not_exists=false (default)
- **THEN** system SHALL NOT attempt schema inference
- **AND** system SHALL NOT execute any CREATE TABLE DDL
- **AND** import SHALL assume table already exists

#### Scenario: Multi-file auto-create

- **WHEN** importing multiple Parquet files with create_table_if_not_exists=true
- **THEN** system SHALL compute union schema from all files
- **AND** system SHALL create table with widened types
- **AND** all files SHALL be importable into the created table
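For the multi-file case, the union schema is a fold of `widen` over per-file schemas; a sketch using the hypothetical types from the earlier examples (the handling of fields present in only some files is my assumption, not stated by the spec):

```rust
/// Merge per-file (name, type) schemas into one union schema, widening
/// types whenever the same field appears with a different type.
fn union_schema(per_file: &[Vec<(String, ExasolType)>]) -> Vec<(String, ExasolType)> {
    let mut merged: Vec<(String, ExasolType)> = Vec::new();
    for schema in per_file {
        for (name, ty) in schema {
            match merged.iter_mut().find(|(n, _)| n == name) {
                Some((_, existing)) => *existing = widen(*existing, *ty),
                None => merged.push((name.clone(), *ty)), // new field: take as-is
            }
        }
    }
    merged
}
```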

specs/type-mapping/spec.md

Lines changed: 38 additions & 18 deletions
@@ -155,30 +155,50 @@ The system SHALL preserve type metadata in Arrow schemas.
 - **THEN** it SHALL provide access to original Exasol type names
 - **AND** it SHALL expose type mapping used for each column
 
-### Requirement: Exasol Type Limit Documentation
+### Requirement: Exasol Data Type Boundaries
 
-The system SHALL document Exasol's actual data type limits as defined in official documentation.
+The system SHALL enforce Exasol's documented data type limits when generating DDL or validating type mappings.
 
-#### Scenario: DECIMAL limits
+#### Scenario: VARCHAR type boundaries
 
-- **WHEN** documenting DECIMAL type limits
-- **THEN** documentation SHALL state precision range is 1-36 digits
-- **AND** documentation SHALL note this differs from Arrow Decimal128's 38-digit limit
+- **WHEN** mapping to Exasol VARCHAR
+- **THEN** the maximum length SHALL be 2,000,000 characters
+- **AND** values exceeding this limit SHALL be truncated or rejected based on configuration
 
-#### Scenario: TIMESTAMP limits
+#### Scenario: CHAR type boundaries
 
-- **WHEN** documenting TIMESTAMP type limits
-- **THEN** documentation SHALL state fractional seconds precision range is 0-9
-- **AND** documentation SHALL explain the mapping to Arrow TimeUnit
+- **WHEN** mapping to Exasol CHAR
+- **THEN** the maximum length SHALL be 2,000 characters
+- **AND** CHAR is fixed-width with space padding
 
-#### Scenario: String type limits
+#### Scenario: DECIMAL type boundaries
 
-- **WHEN** documenting string type limits
-- **THEN** documentation SHALL note VARCHAR maximum practical size
-- **AND** documentation SHALL note CHAR fixed-size semantics
+- **WHEN** mapping to Exasol DECIMAL(p, s)
+- **THEN** precision SHALL be in range 1-36
+- **AND** scale SHALL be in range 0-36
+- **AND** scale SHALL NOT exceed precision
 
-#### Scenario: INTERVAL limits
+#### Scenario: TIMESTAMP type boundaries
 
-- **WHEN** documenting INTERVAL type limits
-- **THEN** documentation SHALL state INTERVAL DAY TO SECOND precision range is 0-9 for fractional seconds
-- **AND** documentation SHALL note fixed 8-byte storage for both interval types
+- **WHEN** mapping to Exasol TIMESTAMP
+- **THEN** fractional seconds precision SHALL be in range 0-9
+- **AND** TIMESTAMP WITH LOCAL TIME ZONE SHALL be used for timezone-aware timestamps
+
+#### Scenario: Integer type mappings for DDL generation
+
+- **WHEN** mapping Arrow integer types to Exasol DDL
+- **THEN** Int8, Int16, Int32 SHALL map to DECIMAL(18,0)
+- **AND** Int64 SHALL map to DECIMAL(36,0)
+- **AND** UInt8, UInt16, UInt32 SHALL map to DECIMAL(18,0)
+- **AND** UInt64 SHALL map to DECIMAL(36,0)
+
+#### Scenario: Floating point type mappings for DDL generation
+
+- **WHEN** mapping Arrow floating point types to Exasol DDL
+- **THEN** Float32 and Float64 SHALL map to DOUBLE
+
+#### Scenario: INTERVAL type boundaries
+
+- **WHEN** mapping to Exasol INTERVAL types
+- **THEN** INTERVAL DAY TO SECOND fractional precision SHALL be in range 0-9
+- **AND** both INTERVAL types use fixed 8-byte storage
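A sketch of the integer and floating-point rows added above, mapping `arrow_schema::DataType` onto the hypothetical `ExasolType` from the import-export examples (the catch-all arm is my assumption, mirroring the widening fallback):

```rust
use arrow_schema::DataType;

fn arrow_to_exasol(dt: &DataType) -> ExasolType {
    match dt {
        // Integers narrower than 64 bits fit in DECIMAL(18,0).
        DataType::Int8 | DataType::Int16 | DataType::Int32 => ExasolType::Decimal(18, 0),
        DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => ExasolType::Decimal(18, 0),
        // 64-bit integers get the full DECIMAL(36,0) range.
        DataType::Int64 | DataType::UInt64 => ExasolType::Decimal(36, 0),
        // Both float widths map to Exasol DOUBLE.
        DataType::Float32 | DataType::Float64 => ExasolType::Double,
        DataType::Boolean => ExasolType::Boolean,
        // Remaining Arrow types are elided in this sketch; fall back to
        // the widest VARCHAR.
        _ => ExasolType::Varchar(2_000_000),
    }
}
```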
