Commit ecaeeba

loading: refine the file format (#2213)

* loading: refine the file format
* remove parquet & orc position visit

1 parent 7e02654 commit ecaeeba

15 files changed (+116 −85 lines changed)


docs/en/guides/40-load-data/01-load/index.md

Lines changed: 18 additions & 41 deletions
@@ -2,50 +2,27 @@
 title: Loading from Files
 ---
 
-import DetailsWrap from '@site/src/components/DetailsWrap';
+Databend offers simple, powerful commands to load data files into tables. Most operations require just a single command. Your data must be in a [supported format](/sql/sql-reference/file-format-options).
 
-Databend provides a variety of tools and commands that can help you load your data files into a table. Most of them are straightforward, meaning you can load your data with just a single command. Please note that your data files must be in one of the formats supported by Databend. See [Input & Output File Formats](/sql/sql-reference/file-format-options) for a list of supported file formats. The following is an overview of the data loading and unloading flows and their respective methods. Please refer to the topics in this chapter for detailed instructions.
+![Data Loading and Unloading Overview](/img/load/load-unload.jpeg)
 
-![Alt text](/img/load/load-unload.jpeg)
+## Supported File Formats
 
-This topic does not cover all of the available data loading methods, but it provides recommendations based on the location where your data files are stored. To find the recommended method and a link to the corresponding details page, toggle the block below:
+| Format | Type | Description |
+|--------|------|-------------|
+| [**CSV**](/guides/load-data/load-semistructured/load-csv), [**TSV**](/guides/load-data/load-semistructured/load-tsv) | Delimited | Text files with customizable delimiters |
+| [**NDJSON**](/guides/load-data/load-semistructured/load-ndjson) | Semi-structured | JSON objects, one per line |
+| [**Parquet**](/guides/load-data/load-semistructured/load-parquet) | Semi-structured | Efficient columnar storage format |
+| [**ORC**](/guides/load-data/load-semistructured/load-orc) | Semi-structured | High-performance columnar format |
+| [**Avro**](/guides/load-data/load-semistructured/load-avro) | Semi-structured | Compact binary format with schema |
 
-<DetailsWrap>
+## Loading by File Location
 
-<details>
-<summary>I want to load staged data files ...</summary>
-<div>
-<div>If you have data files in an internal/external stage or the user stage, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a stage, check out the <a href="stage">Loading from Stage</a> page. This page includes detailed tutorials that show you how to use the command to load data from a sample file in an internal/external stage or the user stage.</div>
-</div>
-</details>
+Select the location of your files to find the recommended loading method:
 
-<details>
-<summary>I want to load data files in a bucket ...</summary>
-<div>
-<div>If you have data files in a bucket or container on your object storage, such as Amazon S3, Google Cloud Storage, and Microsoft Azure, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load data from a bucket or container, check out the <a href="s3">Loading from Bucket</a> page. This page includes a tutorial that shows you how to use the command to load data from a sample file in an Amazon S3 Bucket.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load local data files ...</summary>
-<div>
-<div>If you have data files in your local system, Databend recommends that you load them using <a href="https://github.com/databendlabs/BendSQL">BendSQL</a>, the Databend native CLI tool, allowing you to establish a connection with Databend and execute queries directly from a CLI window.</div>
-<br/>
-<div>To learn more about using BendSQL to load your local data files, check out the <a href="local">Loading from Local File</a> page. This page includes tutorials that show you how to use the tool to load data from a local sample file.</div>
-</div>
-</details>
-
-<details>
-<summary>I want to load remote data files ...</summary>
-<div>
-<div>If you have remote data files, Databend recommends that you load them using the COPY INTO command. The COPY INTO command is a powerful tool that can load large amounts of data quickly and efficiently.</div>
-<br/>
-<div>To learn more about using the COPY INTO command to load remote data files, check out the <a href="http">Loading from Remote File</a> page. This page includes a tutorial that shows you how to use the command to load data from a remote sample file.</div>
-</div>
-</details>
-
-</DetailsWrap>
+| Data Source | Recommended Tool | Description | Documentation |
+|-------------|-----------------|-------------|---------------|
+| **Staged Data Files** | **COPY INTO** | Fast, efficient loading from internal/external stages or user stage | [Loading from Stage](stage) |
+| **Cloud Storage** | **COPY INTO** | Load from Amazon S3, Google Cloud Storage, Microsoft Azure | [Loading from Bucket](s3) |
+| **Local Files** | [**BendSQL**](https://github.com/databendlabs/BendSQL) | Databend's native CLI tool for local file loading | [Loading from Local File](local) |
+| **Remote Files** | **COPY INTO** | Load data from remote HTTP/HTTPS locations | [Loading from Remote File](http) |

docs/en/guides/40-load-data/03-load-semistructured/00-load-parquet.md

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ COPY INTO [<database>.]<table_name>
 FILE_FORMAT = (TYPE = PARQUET)
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Parquet file format options, refer to [Parquet File Format Options](/sql/sql-reference/file-format-options#parquet-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Data from Parquet Files

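To make the documented syntax concrete, a minimal sketch of a Parquet load follows; the table name, stage name, and path are invented for illustration:

```sql
-- Hypothetical target table and internal stage
CREATE TABLE books (title VARCHAR, author VARCHAR, year INT);

-- Load every Parquet file under the stage path into the table;
-- Parquet's embedded schema is matched against the table columns
COPY INTO books
FROM @my_stage/data/
PATTERN = '.*[.]parquet'
FILE_FORMAT = (TYPE = PARQUET);
```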
docs/en/guides/40-load-data/03-load-semistructured/01-load-csv.md

Lines changed: 2 additions & 1 deletion
@@ -31,7 +31,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
 ) ]
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more CSV file format options, refer to [CSV File Format Options](/sql/sql-reference/file-format-options#csv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Data from CSV Files

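As a sketch of the CSV options the hunk above links to, a load with a header row and explicit delimiters might look like this (table and stage names are hypothetical):

```sql
-- Hypothetical example: load headered CSV files from an internal stage
COPY INTO books
FROM @my_stage/csv/
FILE_FORMAT = (
    TYPE = CSV,
    FIELD_DELIMITER = ',',
    RECORD_DELIMITER = '\n',
    SKIP_HEADER = 1  -- skip the first line of each file
);
```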
docs/en/guides/40-load-data/03-load-semistructured/02-load-tsv.md

Lines changed: 2 additions & 1 deletion
@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
 ) ]
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more TSV file format options, refer to [TSV File Format Options](/sql/sql-reference/file-format-options#tsv-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Data from TSV Files

docs/en/guides/40-load-data/03-load-semistructured/03-load-ndjson.md

Lines changed: 2 additions & 1 deletion
@@ -28,7 +28,8 @@ FROM { userStage | internalStage | externalStage | externalLocation }
 ) ]
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more NDJSON file format options, refer to [NDJSON File Format Options](/sql/sql-reference/file-format-options#ndjson-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Data from NDJSON Files

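The NDJSON case reduces to the same pattern; each line of a staged file is one JSON object, mapped onto the table's columns (names below are invented):

```sql
-- Hypothetical example: one JSON object per line in the staged files
COPY INTO books
FROM @my_stage/ndjson/
FILE_FORMAT = (TYPE = NDJSON);
```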
docs/en/guides/40-load-data/03-load-semistructured/04-load-orc.md

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
 FILE_FORMAT = (TYPE = ORC)
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more ORC file format options, refer to [ORC File Format Options](/sql/sql-reference/file-format-options#orc-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Data from ORC Files

docs/en/guides/40-load-data/03-load-semistructured/05-load-avro.md

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@ COPY INTO [<database>.]<table_name>
 FILE_FORMAT = (TYPE = AVRO)
 ```
 
-More details about the syntax can be found in [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
+- For more Avro file format options, refer to [Avro File Format Options](/sql/sql-reference/file-format-options#avro-options).
+- For more COPY INTO table options, refer to [COPY INTO table](/sql/sql-commands/dml/dml-copy-into-table).
 
 ## Tutorial: Loading Avro Data into Databend from Remote HTTP URL

Lines changed: 10 additions & 12 deletions
@@ -1,20 +1,18 @@
 ---
 title: Loading Semi-structured Formats
 ---
-import IndexOverviewList from '@site/src/components/IndexOverviewList';
 
 ## What is Semi-structured Data?
 
-Semi-structured data is a form of data that does not conform to a rigid structure like traditional databases but still contains tags or markers to separate semantic elements and enforce hierarchies of records and fields.
+Semi-structured data contains tags or markers to separate semantic elements while not conforming to rigid database structures. Databend efficiently loads these formats using the `COPY INTO` command, with optional on-the-fly data transformation.
 
-Databend facilitates the efficient and user-friendly loading of semi-structured data. It supports various formats such as **Parquet**, **CSV**, **TSV**, and **NDJSON**.
+## Supported File Formats
 
-Additionally, Databend allows for on-the-fly transformation of data during the loading process.
-Copy from semi-structured data format is the most common way to load data into Databend, it is very efficient and easy to use.
-
-## Supported Formats
-
-Databend supports several semi-structured data formats loaded using the `COPY INTO` command:
-
-<IndexOverviewList />
+| File Format | Description | Guide |
+| ----------- | ----------- | ----- |
+| **Parquet** | Efficient columnar storage format | [Loading Parquet](load-parquet) |
+| **CSV** | Comma-separated values | [Loading CSV](load-csv) |
+| **TSV** | Tab-separated values | [Loading TSV](load-tsv) |
+| **NDJSON** | Newline-delimited JSON | [Loading NDJSON](load-ndjson) |
+| **ORC** | Optimized Row Columnar format | [Loading ORC](load-orc) |
+| **Avro** | Row-based format with schema definition | [Loading Avro](load-avro) |

docs/en/guides/40-load-data/04-transform/00-querying-parquet.md

Lines changed: 10 additions & 2 deletions
@@ -7,7 +7,7 @@ sidebar_label: Parquet
 
 Syntax:
 ```sql
-SELECT [<alias>.]<column> [, <column> ...] | [<alias>.]$<col_position> [, $<col_position> ...]
+SELECT [<alias>.]<column> [, <column> ...]
 FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
 [(
 [<connection_parameters>],
@@ -19,7 +19,15 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
 ```
 
 :::info Tips
-Parquet has schema information, so we can query the columns `<column> [, <column> ...]` directly.
+**Query Return Content Explanation:**
+
+* **Return Format**: Column values in their native data types (not variants)
+* **Access Method**: Directly use column names `column_name`
+* **Example**: `SELECT id, name, age FROM @stage_name`
+* **Key Features**:
+  * No need for path expressions (like `$1:name`)
+  * No type casting required
+  * Parquet files contain embedded schema information
 :::
 
 ## Tutorial

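A brief sketch of what the added tips describe: because Parquet carries its own schema, staged files can be queried by column name in native types (stage path and column names here are hypothetical):

```sql
-- Hypothetical stage path; columns are selected by name, no casting needed
SELECT id, name, age
FROM @my_stage/users.parquet
WHERE age > 30;
```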
docs/en/guides/40-load-data/04-transform/01-querying-csv.md

Lines changed: 10 additions & 1 deletion
@@ -19,7 +19,16 @@ FROM {@<stage_name>[/<path>] [<table_alias>] | '<uri>' [<table_alias>]}
 
 
 :::info Tips
-CSV doesn't have schema information, so we can only query the columns `$<col_position> [, $<col_position> ...]` by position.
+**Query Return Content Explanation:**
+
+* **Return Format**: Individual column values as strings by default
+* **Access Method**: Use positional references `$<col_position>` (e.g., `$1`, `$2`, `$3`)
+* **Example**: `SELECT $1, $2, $3 FROM @stage_name`
+* **Key Features**:
+  * Columns accessed by position, not by name
+  * Each `$<col_position>` refers to a single column, not the whole row
+  * Type casting required for non-string operations (e.g., `CAST($1 AS INT)`)
+  * No embedded schema information in CSV files
 :::
 
 ## Tutorial

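The positional access and casting the tips describe can be sketched as follows, assuming a stage of CSV files whose first two columns are a name and an age (the stage and file format name are hypothetical):

```sql
-- Hypothetical stage: CSV columns are addressed by position and arrive
-- as strings, so cast before any numeric comparison
SELECT $1 AS name, CAST($2 AS INT) AS age
FROM @my_stage (FILE_FORMAT => 'CSV')
WHERE CAST($2 AS INT) > 30;
```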