
Commit b6a5dbd

refine: COPY INTO <table> (#1951)

* refine: copy into table
* refine the syntax

1 parent 252814f commit b6a5dbd

File tree: 1 file changed (+182, -133 lines)


docs/en/sql-reference/10-sql-commands/10-dml/dml-copy-into-table.md

Lines changed: 182 additions & 133 deletions
@@ -4,6 +4,8 @@ sidebar_label: "COPY INTO <table>"
---

import FunctionDescription from '@site/src/components/FunctionDescription';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

<FunctionDescription description="Introduced or updated: v1.2.704"/>

@@ -12,188 +14,239 @@ COPY INTO allows you to load data from files located in one of the following loc

- User / Internal / External stages: See [What is Stage?](/guides/load-data/stage/what-is-stage) to learn about stages in Databend.
- Buckets or containers created in a storage service.
- Remote servers from where you can access the files by their URL (starting with "https://...").
- [IPFS](https://ipfs.tech) and Hugging Face repositories.

See also: [`COPY INTO <location>`](dml-copy-into-location.md)

## Syntax

```sql
/* Standard data load */
COPY INTO [<database_name>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM { userStage | internalStage | externalStage | externalLocation }
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = (
    FORMAT_NAME = '<your-custom-format>'
    | TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ]
) ]
[ copyOptions ]

/* Data load with transformation */
COPY INTO [<database_name>.]<table_name> [ ( <col_name> [ , <col_name> ... ] ) ]
FROM ( SELECT [<alias>.]$<file_col_num>[.<element>] [ , [<alias>.]$<file_col_num>[.<element>] ... ]
       FROM { userStage | internalStage | externalStage } )
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = (
    FORMAT_NAME = '<your-custom-format>'
    | TYPE = { CSV | TSV | NDJSON | PARQUET | ORC | AVRO } [ formatTypeOptions ]
) ]
[ copyOptions ]
```
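As a quick illustration, the two forms above might be used as follows (table, stage, and column names are hypothetical):

```sql
-- Standard load: copy every staged CSV file into mytable
COPY INTO mytable
    FROM @my_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Load with transformation: pick positional columns $1 and $3 from each file
COPY INTO mytable (id, name)
    FROM ( SELECT t.$1, t.$3 FROM @my_stage t )
    FILE_FORMAT = (TYPE = CSV);
```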

Where:

```sql
userStage ::= @~[/<path>]

internalStage ::= @<internal_stage_name>[/<path>]

externalStage ::= @<external_stage_name>[/<path>]

externalLocation ::=
  /* Amazon S3-like Storage */
    's3://<bucket>[/<path>]'
    CONNECTION = (
        [ ENDPOINT_URL = '<endpoint-url>' ]
        [ ACCESS_KEY_ID = '<your-access-key-ID>' ]
        [ SECRET_ACCESS_KEY = '<your-secret-access-key>' ]
        [ ENABLE_VIRTUAL_HOST_STYLE = TRUE | FALSE ]
        [ MASTER_KEY = '<your-master-key>' ]
        [ REGION = '<region>' ]
        [ SECURITY_TOKEN = '<security-token>' ]
        [ ROLE_ARN = '<role-arn>' ]
        [ EXTERNAL_ID = '<external-id>' ]
    )

  /* Azure Blob Storage */
  | 'azblob://<container>[/<path>]'
    CONNECTION = (
        ENDPOINT_URL = '<endpoint-url>'
        ACCOUNT_NAME = '<account-name>'
        ACCOUNT_KEY = '<account-key>'
    )

  /* Google Cloud Storage */
  | 'gcs://<bucket>[/<path>]'
    CONNECTION = (
        CREDENTIAL = '<your-base64-encoded-credential>'
    )

  /* Alibaba Cloud OSS */
  | 'oss://<bucket>[/<path>]'
    CONNECTION = (
        ACCESS_KEY_ID = '<your-ak>'
        ACCESS_KEY_SECRET = '<your-sk>'
        ENDPOINT_URL = '<endpoint-url>'
        [ PRESIGN_ENDPOINT_URL = '<presign-endpoint-url>' ]
    )

  /* Tencent Cloud Object Storage */
  | 'cos://<bucket>[/<path>]'
    CONNECTION = (
        SECRET_ID = '<your-secret-id>'
        SECRET_KEY = '<your-secret-key>'
        ENDPOINT_URL = '<endpoint-url>'
    )

  /* Remote Files */
  | 'https://<url>'

  /* IPFS */
  | 'ipfs://<your-ipfs-hash>'
    CONNECTION = ( ENDPOINT_URL = 'https://<your-ipfs-gateway>' )

  /* Hugging Face */
  | 'hf://<repo-id>[/<path>]'
    CONNECTION = (
        [ REPO_TYPE = 'dataset' | 'model' ]
        [ REVISION = '<revision>' ]
        [ TOKEN = '<your-api-token>' ]
    )

formatTypeOptions ::=
  /* Common options for all formats */
  [ COMPRESSION = AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | XZ | NONE ]

  /* CSV-specific options */
  [ RECORD_DELIMITER = '<character>' ]
  [ FIELD_DELIMITER = '<character>' ]
  [ SKIP_HEADER = <integer> ]
  [ QUOTE = '<character>' ]
  [ ESCAPE = '<character>' ]
  [ NAN_DISPLAY = '<string>' ]
  [ NULL_DISPLAY = '<string>' ]
  [ ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE | FALSE ]
  [ EMPTY_FIELD_AS = null | string | field_default ]
  [ BINARY_FORMAT = HEX | BASE64 ]

  /* TSV-specific options */
  [ RECORD_DELIMITER = '<character>' ]
  [ FIELD_DELIMITER = '<character>' ]

  /* NDJSON-specific options */
  [ NULL_FIELD_AS = NULL | FIELD_DEFAULT ]
  [ MISSING_FIELD_AS = ERROR | NULL | FIELD_DEFAULT ]
  [ ALLOW_DUPLICATE_KEYS = TRUE | FALSE ]

  /* PARQUET-specific options */
  [ MISSING_FIELD_AS = ERROR | FIELD_DEFAULT ]

  /* ORC-specific options */
  [ MISSING_FIELD_AS = ERROR | FIELD_DEFAULT ]

  /* AVRO-specific options */
  [ MISSING_FIELD_AS = ERROR | FIELD_DEFAULT ]

copyOptions ::=
  [ SIZE_LIMIT = <num> ]
  [ PURGE = <bool> ]
  [ FORCE = <bool> ]
  [ DISABLE_VARIANT_CHECK = <bool> ]
  [ ON_ERROR = { continue | abort | abort_N } ]
  [ MAX_FILES = <num> ]
  [ RETURN_FAILED_ONLY = <bool> ]
  [ COLUMN_MATCH_MODE = { case-sensitive | case-insensitive } ]
```
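Putting an externalLocation together with copyOptions, a sketch of a load from an S3-compatible bucket (bucket name and credentials are placeholders) could look like:

```sql
COPY INTO mytable
    FROM 's3://mybucket/data/'
    CONNECTION = (
        ACCESS_KEY_ID = '<your-access-key-ID>'
        SECRET_ACCESS_KEY = '<your-secret-access-key>'
    )
    PATTERN = '.*[.]parquet'
    FILE_FORMAT = (TYPE = PARQUET)
    ON_ERROR = continue;
```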

:::note
For remote files, you can use glob patterns to specify multiple files. For example:

- `ontime_200{6,7,8}.csv` represents `ontime_2006.csv`, `ontime_2007.csv`, and `ontime_2008.csv`
- `ontime_200[6-8].csv` represents the same files
:::

## Key Parameters

- **FILES**: Specifies one or more file names (separated by commas) to be loaded.
- **PATTERN**: A [PCRE2](https://www.pcre.org/current/doc/html/)-based regular expression pattern string that specifies file names to match. See [Example 4: Filtering Files with Pattern](#example-4-filtering-files-with-pattern).
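For example (stage, table, and file names are illustrative), FILES enumerates exact files while PATTERN matches by regular expression:

```sql
-- Load two specific files
COPY INTO mytable FROM @my_stage
    FILES = ('books.csv', 'authors.csv')
    FILE_FORMAT = (TYPE = CSV);

-- Load every CSV file under the sales/ prefix
COPY INTO mytable FROM @my_stage
    PATTERN = 'sales/.*[.]csv'
    FILE_FORMAT = (TYPE = CSV);
```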

## Format Type Options

The `FILE_FORMAT` parameter supports different file types, each with specific formatting options. Below are the available options for each supported file format:

### Common Options for All Formats

| Option | Description | Values | Default |
|--------|-------------|--------|---------|
| COMPRESSION | Compression algorithm for data files | AUTO, GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, XZ, NONE | AUTO |

### TYPE = CSV

| Option | Description | Default |
|--------|-------------|---------|
| RECORD_DELIMITER | Character(s) separating records | newline |
| FIELD_DELIMITER | Character(s) separating fields | comma (`,`) |
| SKIP_HEADER | Number of header lines to skip | 0 |
| QUOTE | Character used to quote fields | double quote (`"`) |
| ESCAPE | Escape character for enclosed fields | NONE |
| NAN_DISPLAY | String representing NaN values | NaN |
| NULL_DISPLAY | String representing NULL values | `\N` |
| ERROR_ON_COLUMN_COUNT_MISMATCH | Raise an error if the column count doesn't match | TRUE |
| EMPTY_FIELD_AS | How to handle empty fields | null |
| BINARY_FORMAT | Encoding format for binary data | HEX |
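A sketch combining several of the CSV options above (table and stage names are placeholders):

```sql
COPY INTO mytable FROM @my_stage/data/
    FILE_FORMAT = (
        TYPE = CSV
        FIELD_DELIMITER = '|'
        SKIP_HEADER = 1
        NULL_DISPLAY = 'NULL'
    );
```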

### TYPE = TSV

| Option | Description | Default |
|--------|-------------|---------|
| RECORD_DELIMITER | Character(s) separating records | newline |
| FIELD_DELIMITER | Character(s) separating fields | tab (`\t`) |

### TYPE = NDJSON

| Option | Description | Default |
|--------|-------------|---------|
| NULL_FIELD_AS | How to handle null fields | NULL |
| MISSING_FIELD_AS | How to handle missing fields | ERROR |
| ALLOW_DUPLICATE_KEYS | Allow duplicate object keys | FALSE |
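For instance, an NDJSON load that tolerates missing keys by falling back to column defaults (table and stage names are hypothetical) might be written as:

```sql
COPY INTO events FROM @my_stage/logs/
    FILE_FORMAT = (
        TYPE = NDJSON
        MISSING_FIELD_AS = FIELD_DEFAULT
        ALLOW_DUPLICATE_KEYS = TRUE
    );
```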

### TYPE = PARQUET

| Option | Description | Default |
|--------|-------------|---------|
| MISSING_FIELD_AS | How to handle missing fields | ERROR |

### TYPE = ORC

| Option | Description | Default |
|--------|-------------|---------|
| MISSING_FIELD_AS | How to handle missing fields | ERROR |

### TYPE = AVRO

| Option | Description | Default |
|--------|-------------|---------|
| MISSING_FIELD_AS | How to handle missing fields | ERROR |

## Copy Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| SIZE_LIMIT | Maximum rows of data to load | `0` (no limit) |
| PURGE | Purges files after a successful load | `false` |
| FORCE | Allows reloading of duplicate files | `false` (skips duplicates) |
| DISABLE_VARIANT_CHECK | Replaces invalid JSON with null | `false` (fails on invalid JSON) |
| ON_ERROR | How to handle errors: `continue`, `abort`, or `abort_N` (`abort_N` is not available for Parquet files) | `abort` |
| MAX_FILES | Maximum number of files to load (up to 15,000) | - |
| RETURN_FAILED_ONLY | Only returns failed files in the output | `false` |
| COLUMN_MATCH_MODE | For Parquet: column name matching mode (`case-sensitive` or `case-insensitive`) | `case-insensitive` |

:::tip
When importing large volumes of data (like logs), set both `PURGE` and `FORCE` to `true` for efficient data import without Meta server interaction. Note this may lead to duplicate data imports.
:::
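As an illustration of the tip above (object names are hypothetical), forcing a reload and purging loaded files can be combined in one statement:

```sql
COPY INTO mytable FROM @my_stage
    FILE_FORMAT = (TYPE = PARQUET)
    FORCE = true
    PURGE = true;
```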
@@ -213,10 +266,6 @@ COPY INTO provides a summary of the data loading results with these columns:

If `RETURN_FAILED_ONLY` is set to `true`, the output will only contain the files that failed to load.

## Examples

### Example 1: Loading from Stages
