Commit c29f4e9

add: query from avro (#2186)
* add: query from avro
* make the query orc result smaller
1 parent cd14c75 commit c29f4e9

4 files changed, with 152 additions and 175 deletions

docs/en/guides/40-load-data/04-transform/03-querying-orc.md

Lines changed: 0 additions & 42 deletions
@@ -64,50 +64,8 @@ FROM @orc_query_stage
 │ sepal_length │ sepal_width │ petal_length │ petal_width │ species │
 ├───────────────────┼───────────────────┼───────────────────┼───────────────────┼──────────────────┤
 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
-│ 4.9 │ 3 │ 1.4 │ 0.2 │ setosa │
-│ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
-│ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
-│ 5 │ 3.6 │ 1.4 │ 0.2 │ setosa │
-│ 5.4 │ 3.9 │ 1.7 │ 0.4 │ setosa │
-│ 4.6 │ 3.4 │ 1.4 │ 0.3 │ setosa │
-│ 5 │ 3.4 │ 1.5 │ 0.2 │ setosa │
-│ 4.4 │ 2.9 │ 1.4 │ 0.2 │ setosa │
-│ 4.9 │ 3.1 │ 1.5 │ 0.1 │ setosa │
-│ 5.4 │ 3.7 │ 1.5 │ 0.2 │ setosa │
-│ 4.8 │ 3.4 │ 1.6 │ 0.2 │ setosa │
-│ 4.8 │ 3 │ 1.4 │ 0.1 │ setosa │
-│ 4.3 │ 3 │ 1.1 │ 0.1 │ setosa │
-│ 5.8 │ 4 │ 1.2 │ 0.2 │ setosa │
-│ 5.7 │ 4.4 │ 1.5 │ 0.4 │ setosa │
-│ 5.4 │ 3.9 │ 1.3 │ 0.4 │ setosa │
-│ 5.1 │ 3.5 │ 1.4 │ 0.3 │ setosa │
-│ 5.7 │ 3.8 │ 1.7 │ 0.3 │ setosa │
-│ 5.1 │ 3.8 │ 1.5 │ 0.3 │ setosa │
 │ · │ · │ · │ · │ · │
-│ · │ · │ · │ · │ · │
-│ · │ · │ · │ · │ · │
-│ 7.4 │ 2.8 │ 6.1 │ 1.9 │ virginica │
-│ 7.9 │ 3.8 │ 6.4 │ 2 │ virginica │
-│ 6.4 │ 2.8 │ 5.6 │ 2.2 │ virginica │
-│ 6.3 │ 2.8 │ 5.1 │ 1.5 │ virginica │
-│ 6.1 │ 2.6 │ 5.6 │ 1.4 │ virginica │
-│ 7.7 │ 3 │ 6.1 │ 2.3 │ virginica │
-│ 6.3 │ 3.4 │ 5.6 │ 2.4 │ virginica │
-│ 6.4 │ 3.1 │ 5.5 │ 1.8 │ virginica │
-│ 6 │ 3 │ 4.8 │ 1.8 │ virginica │
-│ 6.9 │ 3.1 │ 5.4 │ 2.1 │ virginica │
-│ 6.7 │ 3.1 │ 5.6 │ 2.4 │ virginica │
-│ 6.9 │ 3.1 │ 5.1 │ 2.3 │ virginica │
-│ 5.8 │ 2.7 │ 5.1 │ 1.9 │ virginica │
-│ 6.8 │ 3.2 │ 5.9 │ 2.3 │ virginica │
-│ 6.7 │ 3.3 │ 5.7 │ 2.5 │ virginica │
-│ 6.7 │ 3 │ 5.2 │ 2.3 │ virginica │
-│ 6.3 │ 2.5 │ 5 │ 1.9 │ virginica │
-│ 6.5 │ 3 │ 5.2 │ 2 │ virginica │
-│ 6.2 │ 3.4 │ 5.4 │ 2.3 │ virginica │
 │ 5.9 │ 3 │ 5.1 │ 1.8 │ virginica │
-│ 150 rows │ │ │ │ │
-│ (40 shown) │ │ │ │ │
 └──────────────────────────────────────────────────────────────────────────────────────────────────┘
 ```
 

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
+---
+title: Querying Avro Files in Stage
+sidebar_label: Avro
+---
+
+## Query Avro Files in Stage
+
+Syntax:
+```sql
+SELECT [<alias>.]$1:<column> [, $1:<column> ...]
+FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
+[(
+  [<connection_parameters>],
+  [ PATTERN => '<regex_pattern>'],
+  [ FILE_FORMAT => 'AVRO'],
+  [ FILES => ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
+)]
+```
+
+:::info Tips
+Avro files can be queried directly as variants using `$1:<column>`.
+:::
+
+## Avro Querying Features Overview
+
+Databend provides comprehensive support for querying Avro files directly from stages. This allows for flexible data exploration and transformation without needing to load the data into a table first.
+
+* **Variant Representation**: Each row in an Avro file is treated as a variant, referenced by `$1`. This allows for flexible access to nested structures within the Avro data.
+* **Type Mapping**: Each Avro type is mapped to a corresponding variant type in Databend.
+* **Metadata Access**: You can access metadata columns like `metadata$filename` and `metadata$file_row_number` for additional context about the source file and row.
+
+## Tutorial
+
+This tutorial demonstrates how to query Avro files stored in a stage.
+
+### Step 1. Prepare an Avro File
+
+Consider an Avro file with the following schema named `user`:
+
+```json
+{
+  "type": "record",
+  "name": "user",
+  "fields": [
+    {
+      "name": "id",
+      "type": "long"
+    },
+    {
+      "name": "name",
+      "type": "string"
+    }
+  ]
+}
+```
+
+### Step 2. Create an External Stage
+
+Create an external stage with your own S3 bucket and credentials where your Avro files are stored.
+
+```sql
+CREATE STAGE avro_query_stage
+URL = 's3://load/avro/'
+CONNECTION = (
+    ACCESS_KEY_ID = '<your-access-key-id>'
+    SECRET_ACCESS_KEY = '<your-secret-access-key>'
+);
+```
+
+### Step 3. Query Avro Files
+
+#### Basic Query
+
+Query Avro files directly from a stage:
+
+```sql
+SELECT
+    CAST($1:id AS INT) AS id,
+    $1:name AS name
+FROM @avro_query_stage
+(
+    FILE_FORMAT => 'AVRO',
+    PATTERN => '.*[.]avro'
+);
+```
+
+#### Query with Metadata
+
+Query Avro files directly from a stage, including metadata columns like `metadata$filename` and `metadata$file_row_number`:
+
+```sql
+SELECT
+    metadata$filename AS file,
+    metadata$file_row_number AS row,
+    CAST($1:id AS INT) AS id,
+    $1:name AS name
+FROM @avro_query_stage
+(
+    FILE_FORMAT => 'AVRO',
+    PATTERN => '.*[.]avro'
+);
+```
+
+## Type Mapping to Variant
+
+Variants in Databend are stored as JSONB. While most Avro types map straightforwardly, some special considerations apply:
+
+* **Time Types**: `TimeMillis` and `TimeMicros` are mapped to `INT64` as JSONB does not have a native Time type. Users should be aware of the original type when processing these values.
+* **Decimal Types**: Decimals are loaded as `DECIMAL128` or `DECIMAL256`. An error may occur if the precision exceeds the supported limits.
+* **Enum Types**: Avro `ENUM` types are mapped to `STRING` values in Databend.
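
Reviewer note on the **Time Types** bullet in the added page: because `TimeMillis` and `TimeMicros` come back as plain integers, a client has to turn them into a time of day itself. The Avro specification defines these logical types as milliseconds and microseconds after midnight. Below is a minimal Python sketch of that conversion, assuming the value has already been read from the query result as an `int`. The function names are hypothetical and are not part of Databend or of this commit.

```python
from datetime import time

# Number of microseconds in one day; valid Avro time values fall below this.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000


def time_from_micros(micros: int) -> time:
    """Convert an Avro time-micros value (microseconds after midnight) to a time."""
    if not 0 <= micros < MICROS_PER_DAY:
        raise ValueError(f"value out of range for a time of day: {micros}")
    seconds, microsecond = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)


def time_from_millis(millis: int) -> time:
    """Convert an Avro time-millis value (milliseconds after midnight) to a time."""
    return time_from_micros(millis * 1_000)


if __name__ == "__main__":
    # 45,296,789 ms after midnight is 12 h 34 min 56.789 s.
    print(time_from_millis(45_296_789))
```

The caller still needs to know from the Avro schema whether a given column was `TimeMillis` or `TimeMicros`, since the integer alone does not carry the unit.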

docs/en/guides/40-load-data/04-transform/04-querying-metadata.md

Lines changed: 13 additions & 17 deletions
@@ -3,11 +3,14 @@ title: Working with File and Column Metadata
 sidebar_label: Metadata
 ---
 
-This guide explains how to query metadata from staged files. Metadata includes both file-level metadata (such as file name and row number) and column-level metadata (such as column names, types, and nullability).
+This guide explains how to query metadata from staged files. The supported file formats for metadata querying are summarized in the table below:
 
-## Accessing File-Level Metadata
+| Metadata Type | Supported File Formats |
+|---------------------|------------------------------------------------------|
+| File-level metadata | CSV, TSV, Parquet, NDJSON, Avro |
+| Column-level metadata (INFER_SCHEMA) | Parquet |
 
-Databend supports accessing the following file-level metadata fields when reading staged files in the formats CSV, TSV, Parquet, and NDJSON:
+The following file-level metadata fields are available for the supported file formats:
 
 | File Metadata | Type | Description |
 |----------------------------|---------|--------------------------------------------------|
@@ -28,22 +31,15 @@ You can directly select metadata fields when reading from a stage:
 ```sql
 SELECT
     metadata$filename,
-    metadata$file_row_number,
-    *
-FROM @my_internal_stage/iris.parquet
-LIMIT 5;
+    metadata$file_row_number
+FROM @my_internal_stage
+LIMIT 1;
 ```
 
 ```sql
-┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
-│ metadata$filename │ metadata$file_row_number │ id │ sepal_length │ sepal_width │ petal_length │ petal_width │ species │ metadata$filename │ metadata$file_row_number │
-├───────────────────┼──────────────────────────┼─────────────────┼───────────────────┼───────────────────┼───────────────────┼───────────────────┼──────────────────┼───────────────────┼──────────────────────────┤
-│ iris.parquet │ 0 │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │ iris.parquet │ 0 │
-│ iris.parquet │ 1 │ 2 │ 4.9 │ 3 │ 1.4 │ 0.2 │ setosa │ iris.parquet │ 1 │
-│ iris.parquet │ 2 │ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │ iris.parquet │ 2 │
-│ iris.parquet │ 3 │ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │ iris.parquet │ 3 │
-│ iris.parquet │ 4 │ 5 │ 5 │ 3.6 │ 1.4 │ 0.2 │ setosa │ iris.parquet │ 4 │
-└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
+│ metadata$filename │ metadata$file_row_number │
+├───────────────────┼───────────────────────────┤
+│ iris.parquet │ 10 │
 ```
 
 2. Using Metadata in COPY INTO
@@ -58,7 +54,7 @@ FILE_FORMAT=(TYPE=parquet);
 
 ## Inferring Column Metadata from Files
 
-Databend allows you to retrieve the following column-level metadata from your staged files in the Parquet format using the [INFER_SCHEMA](/sql/sql-functions/table-functions/infer-schema) function:
+Databend allows you to retrieve column-level metadata from your staged files using the [INFER_SCHEMA](/sql/sql-functions/table-functions/infer-schema) function. This is currently supported for **Parquet** files.
 
 | Column Metadata | Type | Description |
 |-----------------|---------|--------------------------------------------------|
