@@ -4,11 +4,26 @@ title: INFER_SCHEMA
 
 Automatically detects the file metadata schema and retrieves the column definitions.
 
+`infer_schema` currently supports the following file formats:
+- **Parquet** - Native support for schema inference
+- **CSV** - With customizable delimiters and header detection
+- **NDJSON** - Newline-delimited JSON files
 
-::: caution
+**Compression Support**: All formats also support compressed files with extensions `.zip`, `.xz`, `.zst`.
 
-`infer_schema` currently only supports parquet file format.
+:::info File Size Limit
+Each individual file has a maximum size limit of **100MB** for schema inference.
+:::
+
+:::info Schema Merging
+When processing multiple files, `infer_schema` automatically merges different schemas:
+
+- **Compatible types** are promoted (e.g., INT8 + INT16 → INT16)
+- **Incompatible types** fall back to **VARCHAR** (e.g., INT + FLOAT → VARCHAR)
+- **Missing columns** in some files are marked as **nullable**
+- **New columns** from later files are added to the final schema
 
+This ensures all files can be read using the unified schema.
 :::
 
 ## Syntax
@@ -17,81 +32,222 @@ Automatically detects the file metadata schema and retrieves the column definitions.
 INFER_SCHEMA(
   LOCATION => '{ internalStage | externalStage }'
   [ PATTERN => '<regex_pattern>' ]
+  [ FILE_FORMAT => '<format_name>' ]
+  [ MAX_RECORDS_PRE_FILE => <number> ]
+  [ MAX_FILE_COUNT => <number> ]
 )
 ```
 
-Where:
+## Parameters
 
-### internalStage
+| Parameter | Description | Default | Example |
+|-----------|-------------|---------|---------|
+| `LOCATION` | Stage location: `@<stage_name>[/<path>]` | Required | `'@my_stage/data/'` |
+| `PATTERN` | File name pattern to match | All files | `'*.csv'`, `'*.parquet'` |
+| `FILE_FORMAT` | File format name for parsing | Stage's format | `'csv_format'`, `'NDJSON'` |
+| `MAX_RECORDS_PRE_FILE` | Max records to sample per file | All records | `100`, `1000` |
+| `MAX_FILE_COUNT` | Max number of files to process | All files | `5`, `10` |
+
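+Only `LOCATION` is required. When the other parameters are omitted, `infer_schema` falls back to the defaults listed above: every matching file, every record, and the stage's file format. A minimal sketch, assuming a stage named `my_stage` already exists:
+
+```sql
+-- Rely on the defaults: all files in the stage, sampled in full,
+-- parsed with the stage's own file format
+SELECT * FROM INFER_SCHEMA(location => '@my_stage/');
+```
+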
+## Examples
+
+### Parquet Files
 
 ```sql
-internalStage ::= @<internal_stage_name>[/<path>]
+-- Create stage and export data
+CREATE STAGE test_parquet;
+COPY INTO @test_parquet FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'PARQUET');
+
+-- Infer schema from parquet files using pattern
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_parquet',
+    pattern => '*.parquet'
+);
+```
+
+Result:
+```
++-------------+-----------------+----------+-----------+----------+
+| column_name | type            | nullable | filenames | order_id |
++-------------+-----------------+----------+-----------+----------+
+| number      | BIGINT UNSIGNED | false    | data_...  | 0        |
++-------------+-----------------+----------+-----------+----------+
 ```
 
-### externalStage
+### CSV Files
 
 ```sql
-externalStage ::= @<external_stage_name>[/<path>]
+-- Create stage and export CSV data
+CREATE STAGE test_csv;
+COPY INTO @test_csv FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'CSV');
+
+-- Create a CSV file format
+CREATE FILE FORMAT csv_format TYPE = 'CSV';
+
+-- Infer schema using pattern and file format
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv',
+    pattern => '*.csv',
+    file_format => 'csv_format'
+);
 ```
 
-### PATTERN = 'regex_pattern'
+Result:
+```
++-------------+--------+----------+-----------+----------+
+| column_name | type   | nullable | filenames | order_id |
++-------------+--------+----------+-----------+----------+
+| column_1    | BIGINT | true     | data_...  | 0        |
++-------------+--------+----------+-----------+----------+
+```
 
-A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string, enclosed in single quotes, specifying the file names to match. Click [here](#loading-data-with-pattern-matching) to see an example. For PCRE2 syntax, see http://www.pcre.org/current/doc/html/pcre2syntax.html.
+For CSV files with headers:
 
-## Examples
+```sql
+-- Create CSV file format with header support
+CREATE FILE FORMAT csv_headers_format
+    TYPE = 'CSV'
+    field_delimiter = ','
+    skip_header = 1;
+
+-- Export data with headers
+CREATE STAGE test_csv_headers;
+COPY INTO @test_csv_headers FROM (
+    SELECT number as user_id, 'user_' || number::string as user_name
+    FROM numbers(5)
+) FILE_FORMAT = (TYPE = 'CSV', output_header = true);
+
+-- Infer schema with headers
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv_headers',
+    file_format => 'csv_headers_format'
+);
+```
 
-Generate a parquet file in a stage:
+Limit records for faster inference:
 
 ```sql
-CREATE STAGE infer_parquet FILE_FORMAT = (TYPE = PARQUET);
-COPY INTO @infer_parquet FROM (SELECT * FROM numbers(10)) FILE_FORMAT = (TYPE = PARQUET);
+-- Sample only first 5 records for schema inference
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_csv',
+    pattern => '*.csv',
+    file_format => 'csv_format',
+    max_records_pre_file => 5
+);
 ```
 
+### NDJSON Files
+
 ```sql
-LIST @infer_parquet;
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
-| name                                                  | size | md5                                | last_modified                 | creator |
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
-| data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet | 258  | "7DCC9FFE04EA1F6882AED2CF9640D3D4" | 2023-02-09 05:21:52.000 +0000 | NULL    |
-+-------------------------------------------------------+------+------------------------------------+-------------------------------+---------+
+-- Create stage and export NDJSON data
+CREATE STAGE test_ndjson;
+COPY INTO @test_ndjson FROM (SELECT number FROM numbers(10)) FILE_FORMAT = (TYPE = 'NDJSON');
+
+-- Infer schema using pattern and NDJSON format
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_ndjson',
+    pattern => '*.ndjson',
+    file_format => 'NDJSON'
+);
 ```
 
-### `infer_schema`
+Result:
+```
++-------------+--------+----------+-----------+----------+
+| column_name | type   | nullable | filenames | order_id |
++-------------+--------+----------+-----------+----------+
+| number      | BIGINT | true     | data_...  | 0        |
++-------------+--------+----------+-----------+----------+
+```
 
+Limit records for faster inference:
 
 ```sql
-SELECT * FROM INFER_SCHEMA(location => '@infer_parquet/data_e0fd9cba-f45c-4c43-aa07-d6d87d134378_0_0.parquet');
-+-------------+-----------------+----------+----------+
-| column_name | type            | nullable | order_id |
-+-------------+-----------------+----------+----------+
-| number      | BIGINT UNSIGNED |        0 |        0 |
-+-------------+-----------------+----------+----------+
+-- Sample only first 5 records for schema inference
+SELECT * FROM INFER_SCHEMA(
+    location => '@test_ndjson',
+    pattern => '*.ndjson',
+    file_format => 'NDJSON',
+    max_records_pre_file => 5
+);
 ```
 
-### `infer_schema` with Pattern Matching
+### Schema Merging with Multiple Files
+
+When files have different schemas, `infer_schema` merges them intelligently:
 
 ```sql
-SELECT * FROM infer_schema(location => '@infer_parquet/', pattern => '.*parquet');
-+-------------+-----------------+----------+----------+
-| column_name | type            | nullable | order_id |
-+-------------+-----------------+----------+----------+
-| number      | BIGINT UNSIGNED |        0 |        0 |
-+-------------+-----------------+----------+----------+
+-- Suppose you have multiple CSV files with different schemas:
+-- file1.csv: id(INT), name(VARCHAR)
+-- file2.csv: id(INT), name(VARCHAR), age(INT)
+-- file3.csv: id(FLOAT), name(VARCHAR), age(INT)
+
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv',
+    file_format => 'csv_format'
+);
 ```
 
-### Create a Table From Parquet File
+Result shows merged schema:
+```
++-------------+---------+----------+-----------+----------+
+| column_name | type    | nullable | filenames | order_id |
++-------------+---------+----------+-----------+----------+
+| id          | VARCHAR | true     | file1,... | 0        |  -- INT + FLOAT → VARCHAR
+| name        | VARCHAR | true     | file1,... | 1        |
+| age         | BIGINT  | true     | file1,... | 2        |  -- missing in file1 → nullable
++-------------+---------+----------+-----------+----------+
+```
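+Once merged, the unified schema lets a single stage read cover all the files. A hedged sketch, assuming the same hypothetical `file1.csv`-`file3.csv` layout and `csv_format` from above:
+
+```sql
+-- Read every file at once; values flow through the unified schema
+SELECT * FROM @my_stage/ (pattern => '*.csv', file_format => 'csv_format');
+```
+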
 
-The `infer_schema` can only display the schema of a parquet file and cannot create a table from it.
+### Pattern Matching and File Limits
 
-To create a table from a parquet file:
+Use pattern matching to infer schema from multiple files:
 
 ```sql
-CREATE TABLE mytable AS SELECT * FROM @infer_parquet/ (pattern=>'.*parquet') LIMIT 0;
-
-DESC mytable;
-+--------+-----------------+------+---------+-------+
-| Field  | Type            | Null | Default | Extra |
-+--------+-----------------+------+---------+-------+
-| number | BIGINT UNSIGNED | NO   | 0       |       |
-+--------+-----------------+------+---------+-------+
+-- Infer schema from all CSV files in the directory
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv'
+);
+```
+
+Limit the number of files processed to improve performance:
+
+```sql
+-- Process only the first 5 matching files
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/',
+    pattern => '*.csv',
+    max_file_count => 5
+);
+```
+
+### Compressed Files
+
+`infer_schema` automatically handles compressed files:
+
+```sql
+-- Works with compressed CSV files
+SELECT * FROM INFER_SCHEMA(location => '@my_stage/data.csv.zip');
+
+-- Works with compressed NDJSON files
+SELECT * FROM INFER_SCHEMA(
+    location => '@my_stage/data.ndjson.xz',
+    file_format => 'NDJSON',
+    max_records_pre_file => 50
+);
+```
+
+### Create Table from Inferred Schema
+
+The `infer_schema` function displays the schema but doesn't create tables. To create a table from the inferred schema:
+
+```sql
+-- Create table structure from file schema
+CREATE TABLE my_table AS
+SELECT * FROM @my_stage/ (pattern=>'*.parquet')
+LIMIT 0;
+
+-- Verify the table structure
+DESC my_table;
 ```