Commit c987d4c

docs: query metadata from csv/tsv/ndjson/parquet (#1956)
1 parent 01a33db commit c987d4c

2 files changed

+172
-72
lines changed

Lines changed: 59 additions & 72 deletions
@@ -1,106 +1,93 @@
11
---
2-
title: Query Metadata for Staged Files
2+
title: Working with File and Column Metadata
33
sidebar_label: Metadata
44
---
55

6-
## Why and What is Metadata?
6+
This guide explains how to query metadata from staged files. Metadata includes both file-level metadata (such as file name and row number) and column-level metadata (such as column names, types, and nullability).
77

8-
Databend allows you to retrieve metadata from your data files using the [INFER_SCHEMA](/sql/sql-functions/table-functions/infer-schema) function. This means you can extract column definitions from data files stored in internal or external stages. Retrieving metadata through the `INFER_SCHEMA` function provides a better understanding of the data structure, ensures data consistency, and enables automated data integration and analysis. The metadata for each column includes the following information:
8+
## Accessing File-Level Metadata
99

10-
- **column_name**: Indicates the name of the column.
11-
- **type**: Indicates the data type of the column.
12-
- **nullable**: Indicates whether the column allows null values.
13-
- **order_id**: Represents the column's position in the table.
10+
Databend exposes the following file-level metadata fields when reading staged files in CSV, TSV, Parquet, or NDJSON format:
1411

15-
:::note
16-
This feature is currently only available for the Parquet file format.
17-
:::
12+
| File Metadata | Type | Description |
13+
|----------------------------|---------|--------------------------------------------------|
14+
| `metadata$filename` | VARCHAR | The name of the file from which the row was read |
15+
| `metadata$file_row_number` | INT | The row number within the file (starting from 0) |
1816

19-
The syntax for `INFER_SCHEMA` is as follows. For more detailed information about this function, see [INFER_SCHEMA](/sql/sql-functions/table-functions/infer-schema).
17+
These metadata fields are available in:
2018

21-
```sql
22-
INFER_SCHEMA(
23-
LOCATION => '{ internalStage | externalStage }'
24-
[ PATTERN => '<regex_pattern>']
25-
)
26-
```
19+
- `SELECT` queries over stages (e.g., `SELECT ... FROM @stage`)
20+
- `COPY INTO <table>` statements
2721

28-
## Tutorial: Querying Column Definitions
22+
### Examples
2923

30-
In this tutorial, we will guide you through the process of uploading the sample file to an internal stage, querying the column definitions, and finally creating a table based on the staged file. Before you start, download and save the sample file [books.parquet](https://datafuse-1253727613.cos.ap-hongkong.myqcloud.com/data/books.parquet) to a local folder.
24+
1. Querying Metadata Fields
3125

32-
1. Create an internal stage named *my_internal_stage*:
26+
You can directly select metadata fields when reading from a stage:
3327

3428
```sql
35-
CREATE STAGE my_internal_stage;
29+
SELECT
30+
metadata$filename,
31+
metadata$file_row_number,
32+
*
33+
FROM @my_internal_stage/iris.parquet
34+
LIMIT 5;
3635
```
3736

38-
2. Stage the sample file using [BendSQL](../../30-sql-clients/00-bendsql/index.md):
39-
4037
```sql
41-
PUT fs:///Users/eric/Documents/books.parquet @my_internal_stage
38+
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
39+
│ metadata$filename │ metadata$file_row_number │ id              │ sepal_length      │ sepal_width       │ petal_length      │ petal_width       │ species          │ metadata$filename │ metadata$file_row_number │
40+
├───────────────────┼──────────────────────────┼─────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼──────────────────┼───────────────────┼──────────────────────────┤
41+
│ iris.parquet      │ 0                        │ 1               │ 5.1               │ 3.5               │ 1.4               │ 0.2               │ setosa           │ iris.parquet      │ 0                        │
42+
│ iris.parquet      │ 1                        │ 2               │ 4.9               │ 3                 │ 1.4               │ 0.2               │ setosa           │ iris.parquet      │ 1                        │
43+
│ iris.parquet      │ 2                        │ 3               │ 4.7               │ 3.2               │ 1.3               │ 0.2               │ setosa           │ iris.parquet      │ 2                        │
44+
│ iris.parquet      │ 3                        │ 4               │ 4.6               │ 3.1               │ 1.5               │ 0.2               │ setosa           │ iris.parquet      │ 3                        │
45+
│ iris.parquet      │ 4                        │ 5               │ 5                 │ 3.6               │ 1.4               │ 0.2               │ setosa           │ iris.parquet      │ 4                        │
46+
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
4247
```
4348

44-
Result:
45-
```
46-
┌───────────────────────────────────────────────┐
47-
│ file │ status │
48-
│ String │ String │
49-
├─────────────────────────────────────┼─────────┤
50-
│ /Users/eric/Documents/books.parquet │ SUCCESS │
51-
└───────────────────────────────────────────────┘
52-
```
49+
2. Using Metadata in COPY INTO
5350

54-
3. Query the column definitions from the staged sample file:
51+
You can pass metadata fields into target table columns using COPY INTO:
5552

5653
```sql
57-
SELECT * FROM INFER_SCHEMA(location => '@my_internal_stage/books.parquet');
54+
COPY INTO iris_with_meta
55+
FROM (SELECT metadata$filename, metadata$file_row_number, $1, $2, $3, $4, $5 FROM @my_internal_stage/iris.parquet)
56+
FILE_FORMAT=(TYPE=parquet);
5857
```
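
The `COPY INTO` statement above assumes that `iris_with_meta` already exists, but the guide does not define it. A minimal sketch of a compatible definition follows. The column names `file_name` and `row_num` are hypothetical, and the data column types assume that `$1` through `$5` map to the first five columns of `iris.parquet` as reported by `INFER_SCHEMA`:

```sql
-- Hypothetical target table for the COPY INTO example; names are illustrative only.
CREATE TABLE iris_with_meta (
    file_name    VARCHAR, -- receives metadata$filename
    row_num      INT,     -- receives metadata$file_row_number (starts from 0)
    id           BIGINT,  -- assumed to receive $1
    sepal_length DOUBLE,  -- assumed to receive $2
    sepal_width  DOUBLE,  -- assumed to receive $3
    petal_length DOUBLE,  -- assumed to receive $4
    petal_width  DOUBLE   -- assumed to receive $5
);
```

The column order matches the order of the `SELECT` list in the `COPY INTO` statement, so the two metadata fields land in the first two columns.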
5958

60-
Result:
61-
```
62-
┌─────────────┬─────────┬─────────┬─────────┐
63-
│ column_name │ type │ nullable│ order_id│
64-
├─────────────┼─────────┼─────────┼─────────┤
65-
│ title │ VARCHAR │ 0 │ 0 │
66-
│ author │ VARCHAR │ 0 │ 1 │
67-
│ date │ VARCHAR │ 0 │ 2 │
68-
└─────────────┴─────────┴─────────┴─────────┘
69-
```
59+
## Inferring Column Metadata from Files
7060

71-
4. Create a table named *mybooks* based on the staged sample file:
61+
Databend allows you to retrieve the following column-level metadata from your staged files in the Parquet format using the [INFER_SCHEMA](/sql/sql-functions/table-functions/infer-schema) function:
7262

73-
```sql
74-
CREATE TABLE mybooks AS SELECT * FROM @my_internal_stage/books.parquet;
75-
```
63+
| Column Metadata | Type | Description |
64+
|-----------------|---------|--------------------------------------------------|
65+
| `column_name` | String | Indicates the name of the column. |
66+
| `type` | String | Indicates the data type of the column. |
67+
| `nullable` | Boolean | Indicates whether the column allows null values. |
68+
| `order_id` | UInt64 | Represents the column's position in the table. |
7669

77-
Check the created table:
70+
### Examples
7871

79-
```sql
80-
DESC mybooks;
81-
```
72+
The following example retrieves column metadata from a Parquet file staged in `@my_internal_stage`:
8273

83-
Result:
84-
```
85-
┌─────────┬─────────┬──────┬─────────┬───────┐
86-
│ Field │ Type │ Null │ Default │ Extra │
87-
├─────────┼─────────┼──────┼─────────┼───────┤
88-
│ title │ VARCHAR │ NO │ '' │ │
89-
│ author │ VARCHAR │ NO │ '' │ │
90-
│ date │ VARCHAR │ NO │ '' │ │
91-
└─────────┴─────────┴──────┴─────────┴───────┘
74+
```sql
75+
SELECT * FROM INFER_SCHEMA(location => '@my_internal_stage/iris.parquet');
9276
```
9377

9478
```sql
95-
SELECT * FROM mybooks;
79+
┌──────────────────────────────────────────────┐
80+
│ column_name  │ type    │ nullable │ order_id │
81+
├──────────────┼─────────┼──────────┼──────────┤
82+
│ id           │ BIGINT  │ true     │ 0        │
83+
│ sepal_length │ DOUBLE  │ true     │ 1        │
84+
│ sepal_width  │ DOUBLE  │ true     │ 2        │
85+
│ petal_length │ DOUBLE  │ true     │ 3        │
86+
│ petal_width  │ DOUBLE  │ true     │ 4        │
87+
│ species      │ VARCHAR │ true     │ 5        │
88+
└──────────────────────────────────────────────┘
9689
```
9790

98-
Result:
99-
```
100-
┌───────────────────────────┬───────────────────┬──────┐
101-
│ title │ author │ date │
102-
├───────────────────────────┼───────────────────┼──────┤
103-
│ Transaction Processing │ Jim Gray │ 1992 │
104-
│ Readings in Database Systems│ Michael Stonebraker│ 2004│
105-
└───────────────────────────┴───────────────────┴──────┘
106-
```
91+
## Tutorials
92+
93+
- [Querying Metadata](/tutorials/load/query-metadata)
