Commit b52c85f (parent abdec1b): create catalog and iceberg table details. 2 files changed, +429 −0 lines.
---
type: languages
title: "Beam SQL extension: CREATE CATALOG Statement"
aliases:
  - /documentation/dsls/sql/create-catalog/
  - /documentation/dsls/sql/statements/create-catalog/
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Beam SQL extensions: CREATE CATALOG

Beam SQL's `CREATE CATALOG` statement creates and registers a catalog that manages metadata for external data sources. Catalogs provide a unified interface for accessing different types of data stores and enable features like schema management, table discovery, and cross-catalog queries.

Currently, Beam SQL supports the **Apache Iceberg** catalog type, which provides access to Iceberg tables with full ACID transaction support, schema evolution, and time travel capabilities.

## Syntax

```
CREATE CATALOG [ IF NOT EXISTS ] catalogName
TYPE catalogType
[PROPERTIES (propertyKey = propertyValue [, propertyKey = propertyValue ]*)]
```

* `IF NOT EXISTS`: Optional. If the catalog is already registered, Beam SQL
  ignores the statement instead of returning an error.
* `catalogName`: The case-sensitive name of the catalog to create and register,
  specified as an [Identifier](/documentation/dsls/sql/calcite/lexical#identifiers).
* `catalogType`: The type of catalog to create. Currently supported values:
  * `iceberg`: Apache Iceberg catalog
* `PROPERTIES`: Optional. Key-value pairs for catalog-specific configuration.
  Each property is specified as `'key' = 'value'` with string literals.
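
Putting the pieces together, a minimal statement looks like the following sketch (the catalog name and property are illustrative, not required values):

```sql
CREATE CATALOG IF NOT EXISTS my_catalog       -- catalogName
TYPE iceberg                                  -- catalogType
PROPERTIES (
  'warehouse' = 'file:///tmp/warehouse'       -- one propertyKey = propertyValue pair
)
```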

## Apache Iceberg Catalog

The Iceberg catalog provides access to [Apache Iceberg](https://iceberg.apache.org/) tables. Iceberg is a high-performance, open table format for huge analytic datasets.

### Syntax

```
CREATE CATALOG [ IF NOT EXISTS ] catalogName
TYPE iceberg
PROPERTIES (
  'catalog-impl' = 'catalogImplementation',
  'warehouse' = 'warehouseLocation'
  [, additionalProperties...]
)
```

### Required Properties

* `catalog-impl`: The Iceberg catalog implementation class. Common values:
  * `org.apache.iceberg.hadoop.HadoopCatalog`: For Hadoop-compatible storage (HDFS, S3, GCS, etc.)
  * `org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog`: For BigQuery integration
  * `org.apache.iceberg.jdbc.JdbcCatalog`: For JDBC-based metadata storage
  * `org.apache.iceberg.rest.RESTCatalog`: For REST-based catalog access
* `warehouse`: The root location where Iceberg tables and metadata are stored.
  The format depends on the storage system:
  * **Local filesystem**: `file:///path/to/warehouse`
  * **HDFS**: `hdfs://namenode:port/path/to/warehouse`
  * **S3**: `s3://bucket-name/path/to/warehouse`
  * **Google Cloud Storage**: `gs://bucket-name/path/to/warehouse`

### Optional Properties

The available optional properties depend on the catalog implementation:

#### Hadoop Catalog Properties

* `io-impl`: The file I/O implementation class. Common values:
  * `org.apache.iceberg.hadoop.HadoopFileIO`: For Hadoop-compatible storage
  * `org.apache.iceberg.aws.s3.S3FileIO`: For S3 storage
  * `org.apache.iceberg.gcp.gcs.GCSFileIO`: For Google Cloud Storage
* `hadoop.*`: Any Hadoop configuration property (e.g., `hadoop.fs.s3a.access.key`)

#### BigQuery Metastore Catalog Properties

* `io-impl`: Must be `org.apache.iceberg.gcp.gcs.GCSFileIO` for GCS storage
* `gcp_project`: Google Cloud Project ID
* `gcp_region`: Google Cloud region (e.g., `us-central1`)
* `gcp_location`: Alternative to `gcp_region` for specifying location

#### JDBC Catalog Properties

* `uri`: JDBC connection URI
* `jdbc.user`: Database username
* `jdbc.password`: Database password
* `jdbc.driver`: JDBC driver class name

### Examples

#### Hadoop Catalog with Local Storage

```sql
CREATE CATALOG my_iceberg_catalog
TYPE iceberg
PROPERTIES (
  'catalog-impl' = 'org.apache.iceberg.hadoop.HadoopCatalog',
  'warehouse' = 'file:///tmp/iceberg-warehouse'
)
```

#### Hadoop Catalog with S3 Storage

```sql
CREATE CATALOG s3_iceberg_catalog
TYPE iceberg
PROPERTIES (
  'catalog-impl' = 'org.apache.iceberg.hadoop.HadoopCatalog',
  'warehouse' = 's3://my-bucket/iceberg-warehouse',
  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
  'hadoop.fs.s3a.access.key' = 'your-access-key',
  'hadoop.fs.s3a.secret.key' = 'your-secret-key'
)
```

#### BigQuery Metastore Catalog

```sql
CREATE CATALOG bigquery_iceberg_catalog
TYPE iceberg
PROPERTIES (
  'catalog-impl' = 'org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog',
  'io-impl' = 'org.apache.iceberg.gcp.gcs.GCSFileIO',
  'warehouse' = 'gs://my-bucket/iceberg-warehouse',
  'gcp_project' = 'my-gcp-project',
  'gcp_region' = 'us-central1'
)
```

#### JDBC Catalog

```sql
CREATE CATALOG jdbc_iceberg_catalog
TYPE iceberg
PROPERTIES (
  'catalog-impl' = 'org.apache.iceberg.jdbc.JdbcCatalog',
  'uri' = 'jdbc:postgresql://localhost:5432/iceberg_metadata',
  'jdbc.user' = 'iceberg_user',
  'jdbc.password' = 'iceberg_password',
  'jdbc.driver' = 'org.postgresql.Driver',
  'warehouse' = 's3://my-bucket/iceberg-warehouse'
)
```

## Using Catalogs

After creating a catalog, you can use it to manage databases and tables:

### Switch to a Catalog

```sql
USE CATALOG catalogName
```

### Create and Use a Database

```sql
-- Create a database (namespace)
CREATE DATABASE my_database

-- Use the database
USE DATABASE my_database
```

### Create Tables in the Catalog

Once you've switched to a catalog and database, you can create tables:

```sql
-- Switch to your catalog and database
USE CATALOG my_iceberg_catalog
USE DATABASE my_database

-- Create an Iceberg table
CREATE EXTERNAL TABLE users (
  id BIGINT,
  username VARCHAR,
  email VARCHAR,
  created_at TIMESTAMP
)
TYPE iceberg
```
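
A table created this way can then be queried with ordinary Beam SQL statements. As an illustrative sketch (the row values are hypothetical, and this assumes a `users` table like the one above has been created in the active catalog and database):

```sql
-- Write a row into the Iceberg table
INSERT INTO users VALUES (1, 'jdoe', 'jdoe@example.com', CURRENT_TIMESTAMP)

-- Read it back
SELECT id, username FROM users WHERE id = 1
```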

## Catalog Management

### List Available Catalogs

```sql
SHOW CATALOGS
```

### Drop a Catalog

```sql
DROP CATALOG [ IF EXISTS ] catalogName
```
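
For example, to unregister a previously created catalog without raising an error if it does not exist (the catalog name is illustrative):

```sql
DROP CATALOG IF EXISTS my_iceberg_catalog
```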

## Best Practices

### Security

* **Credentials**: Store sensitive credentials (access keys, passwords) in secure configuration systems rather than hardcoding them in SQL statements
* **IAM Roles**: Use IAM roles and service accounts when possible instead of access keys
* **Network Security**: Ensure proper network access controls for your storage systems

### Performance

* **Warehouse Location**: Choose a warehouse location that's geographically close to your compute resources
* **Partitioning**: Use appropriate partitioning strategies for your data access patterns
* **File Formats**: Iceberg automatically manages file formats, but consider compression settings for your use case

### Monitoring

* **Catalog Health**: Monitor catalog connectivity and performance
* **Storage Usage**: Track warehouse storage usage and implement lifecycle policies
* **Query Performance**: Monitor query performance and optimize table schemas as needed

## Troubleshooting

### Common Issues

#### Catalog Creation Fails

* **Check Dependencies**: Ensure all required Iceberg dependencies are available in your classpath
* **Verify Properties**: Double-check that all required properties are provided and correctly formatted
* **Storage Access**: Ensure your compute environment has access to the specified warehouse location

#### Table Operations Fail

* **Catalog Context**: Make sure you're using the correct catalog with `USE CATALOG`
* **Database Context**: Ensure you're in the correct database with `USE DATABASE`
* **Permissions**: Verify that your credentials have the necessary permissions for the storage system

#### Performance Issues

* **Partitioning**: Review your table partitioning strategy
* **File Size**: Check if files are too large or too small for your use case
* **Compression**: Consider adjusting compression settings for your data types

### Getting Help

For more information about Apache Iceberg:

* [Apache Iceberg Documentation](https://iceberg.apache.org/docs/)
* [Iceberg Catalog Implementations](https://iceberg.apache.org/docs/latest/configuration/)
* [Beam SQL Documentation](/documentation/dsls/sql/)

0 commit comments

Comments
 (0)