| 1 | +--- |
| 2 | +type: languages |
| 3 | +title: "Beam SQL extensions: CREATE CATALOG" |
| 4 | +aliases: |
| 5 | + - /documentation/dsls/sql/create-catalog/ |
| 6 | + - /documentation/dsls/sql/statements/create-catalog/ |
| 7 | +--- |
| 8 | +<!-- |
| 9 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 10 | +you may not use this file except in compliance with the License. |
| 11 | +You may obtain a copy of the License at |
| 12 | + |
| 13 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 14 | + |
| 15 | +Unless required by applicable law or agreed to in writing, software |
| 16 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 17 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 18 | +See the License for the specific language governing permissions and |
| 19 | +limitations under the License. |
| 20 | +--> |
| 21 | + |
| 22 | +# Beam SQL extensions: CREATE CATALOG |
| 23 | + |
| 24 | +Beam SQL's `CREATE CATALOG` statement creates and registers a catalog that manages metadata for external data sources. Catalogs provide a unified interface for accessing different types of data stores and enable features like schema management, table discovery, and cross-catalog queries. |
| 25 | + |
| 26 | +Currently, Beam SQL supports the **Apache Iceberg** catalog type, which provides access to Iceberg tables with full ACID transaction support, schema evolution, and time travel capabilities. |
| 27 | + |
| 28 | +## Syntax |
| 29 | + |
| 30 | +``` |
| 31 | +CREATE CATALOG [ IF NOT EXISTS ] catalogName |
| 32 | +TYPE catalogType |
| 33 | +[PROPERTIES (propertyKey = propertyValue [, propertyKey = propertyValue ]*)] |
| 34 | +``` |
| 35 | + |
| 36 | +* `IF NOT EXISTS`: Optional. If the catalog is already registered, Beam SQL |
| 37 | + ignores the statement instead of returning an error. |
| 38 | +* `catalogName`: The case-sensitive name of the catalog to create and register, |
| 39 | + specified as an [Identifier](/documentation/dsls/sql/calcite/lexical#identifiers). |
| 40 | +* `catalogType`: The type of catalog to create. Currently supported values: |
| 41 | + * `iceberg`: Apache Iceberg catalog |
| 42 | +* `PROPERTIES`: Optional. Key-value pairs for catalog-specific configuration. |
| 43 | + Each property is specified as `'key' = 'value'` with string literals. |
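| | + |
| | +As a minimal sketch of how `IF NOT EXISTS` might be used, the statement below reuses the local Hadoop catalog settings from the examples later on this page. The catalog name and warehouse path are placeholders; per the description above, running it a second time is expected to be ignored rather than fail. |
| | + |
| | +```sql |
| | +-- Safe to run repeatedly: ignored if my_iceberg_catalog is already registered |
| | +CREATE CATALOG IF NOT EXISTS my_iceberg_catalog |
| | +TYPE iceberg |
| | +PROPERTIES ( |
| | +  'catalog-impl' = 'org.apache.iceberg.hadoop.HadoopCatalog', |
| | +  'warehouse' = 'file:///tmp/iceberg-warehouse' |
| | +) |
| | +``` |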
| 44 | + |
| 45 | +## Apache Iceberg Catalog |
| 46 | + |
| 47 | +The Iceberg catalog provides access to tables stored in [Apache Iceberg](https://iceberg.apache.org/), a high-performance table format for huge analytic datasets. |
| 48 | + |
| 49 | +### Syntax |
| 50 | + |
| 51 | +``` |
| 52 | +CREATE CATALOG [ IF NOT EXISTS ] catalogName |
| 53 | +TYPE iceberg |
| 54 | +PROPERTIES ( |
| 55 | + 'catalog-impl' = 'catalogImplementation', |
| 56 | + 'warehouse' = 'warehouseLocation' |
| 57 | + [, additionalProperties...] |
| 58 | +) |
| 59 | +``` |
| 60 | + |
| 61 | +### Required Properties |
| 62 | + |
| 63 | +* `catalog-impl`: The Iceberg catalog implementation class. Common values: |
| 64 | + * `org.apache.iceberg.hadoop.HadoopCatalog`: For Hadoop-compatible storage (HDFS, S3, GCS, etc.) |
| 65 | + * `org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog`: For BigQuery integration |
| 66 | + * `org.apache.iceberg.jdbc.JdbcCatalog`: For JDBC-based metadata storage |
| 67 | + * `org.apache.iceberg.rest.RESTCatalog`: For REST-based catalog access |
| 68 | +* `warehouse`: The root location where Iceberg tables and metadata are stored. |
| 69 | + Format depends on the storage system: |
| 70 | + * **Local filesystem**: `file:///path/to/warehouse` |
| 71 | + * **HDFS**: `hdfs://namenode:port/path/to/warehouse` |
| 72 | + * **S3**: `s3://bucket-name/path/to/warehouse` |
| 73 | + * **Google Cloud Storage**: `gs://bucket-name/path/to/warehouse` |
| 74 | + |
| 75 | +### Optional Properties |
| 76 | + |
| 77 | +The available optional properties depend on the catalog implementation: |
| 78 | + |
| 79 | +#### Hadoop Catalog Properties |
| 80 | + |
| 81 | +* `io-impl`: The file I/O implementation class. Common values: |
| 82 | + * `org.apache.iceberg.hadoop.HadoopFileIO`: For Hadoop-compatible storage |
| 83 | + * `org.apache.iceberg.aws.s3.S3FileIO`: For S3 storage |
| 84 | + * `org.apache.iceberg.gcp.gcs.GCSFileIO`: For Google Cloud Storage |
| 85 | +* `hadoop.*`: Any Hadoop configuration property (e.g., `hadoop.fs.s3a.access.key`) |
| 86 | + |
| 87 | +#### BigQuery Metastore Catalog Properties |
| 88 | + |
| 89 | +* `io-impl`: Must be `org.apache.iceberg.gcp.gcs.GCSFileIO` for GCS storage |
| 90 | +* `gcp_project`: Google Cloud Project ID |
| 91 | +* `gcp_region`: Google Cloud region (e.g., `us-central1`) |
| 92 | +* `gcp_location`: Alternative to `gcp_region` for specifying location |
| 93 | + |
| 94 | +#### JDBC Catalog Properties |
| 95 | + |
| 96 | +* `uri`: JDBC connection URI |
| 97 | +* `jdbc.user`: Database username |
| 98 | +* `jdbc.password`: Database password |
| 99 | +* `jdbc.driver`: JDBC driver class name |
| 100 | + |
| 101 | +### Examples |
| 102 | + |
| 103 | +#### Hadoop Catalog with Local Storage |
| 104 | + |
| 105 | +```sql |
| 106 | +CREATE CATALOG my_iceberg_catalog |
| 107 | +TYPE iceberg |
| 108 | +PROPERTIES ( |
| 109 | + 'catalog-impl' = 'org.apache.iceberg.hadoop.HadoopCatalog', |
| 110 | + 'warehouse' = 'file:///tmp/iceberg-warehouse' |
| 111 | +) |
| 112 | +``` |
| 113 | + |
| 114 | +#### Hadoop Catalog with S3 Storage |
| 115 | + |
| 116 | +```sql |
| 117 | +CREATE CATALOG s3_iceberg_catalog |
| 118 | +TYPE iceberg |
| 119 | +PROPERTIES ( |
| 120 | + 'catalog-impl' = 'org.apache.iceberg.hadoop.HadoopCatalog', |
| 121 | + 'warehouse' = 's3://my-bucket/iceberg-warehouse', |
| 122 | + 'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO', |
| 123 | + 'hadoop.fs.s3a.access.key' = 'your-access-key', |
| 124 | + 'hadoop.fs.s3a.secret.key' = 'your-secret-key' |
| 125 | +) |
| 126 | +``` |
| 127 | + |
| 128 | +#### BigQuery Metastore Catalog |
| 129 | + |
| 130 | +```sql |
| 131 | +CREATE CATALOG bigquery_iceberg_catalog |
| 132 | +TYPE iceberg |
| 133 | +PROPERTIES ( |
| 134 | + 'catalog-impl' = 'org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog', |
| 135 | + 'io-impl' = 'org.apache.iceberg.gcp.gcs.GCSFileIO', |
| 136 | + 'warehouse' = 'gs://my-bucket/iceberg-warehouse', |
| 137 | + 'gcp_project' = 'my-gcp-project', |
| 138 | + 'gcp_region' = 'us-central1' |
| 139 | +) |
| 140 | +``` |
| 141 | + |
| 142 | +#### JDBC Catalog |
| 143 | + |
| 144 | +```sql |
| 145 | +CREATE CATALOG jdbc_iceberg_catalog |
| 146 | +TYPE iceberg |
| 147 | +PROPERTIES ( |
| 148 | + 'catalog-impl' = 'org.apache.iceberg.jdbc.JdbcCatalog', |
| 149 | + 'uri' = 'jdbc:postgresql://localhost:5432/iceberg_metadata', |
| 150 | + 'jdbc.user' = 'iceberg_user', |
| 151 | + 'jdbc.password' = 'iceberg_password', |
| 152 | + 'jdbc.driver' = 'org.postgresql.Driver', |
| 153 | + 'warehouse' = 's3://my-bucket/iceberg-warehouse' |
| 154 | +) |
| 155 | +``` |
| 156 | + |
| 157 | +## Using Catalogs |
| 158 | + |
| 159 | +After creating a catalog, you can use it to manage databases and tables: |
| 160 | + |
| 161 | +### Switch to a Catalog |
| 162 | + |
| 163 | +```sql |
| 164 | +USE CATALOG catalogName |
| 165 | +``` |
| 166 | + |
| 167 | +### Create and Use a Database |
| 168 | + |
| 169 | +```sql |
| 170 | +-- Create a database (namespace) |
| 171 | +CREATE DATABASE my_database |
| 172 | + |
| 173 | +-- Use the database |
| 174 | +USE DATABASE my_database |
| 175 | +``` |
| 176 | + |
| 177 | +### Create Tables in the Catalog |
| 178 | + |
| 179 | +Once you've switched to a catalog and database, you can create tables: |
| 180 | + |
| 181 | +```sql |
| 182 | +-- Switch to your catalog and database |
| 183 | +USE CATALOG my_iceberg_catalog |
| 184 | +USE DATABASE my_database |
| 185 | + |
| 186 | +-- Create an Iceberg table |
| 187 | +CREATE EXTERNAL TABLE users ( |
| 188 | + id BIGINT, |
| 189 | + username VARCHAR, |
| 190 | + email VARCHAR, |
| 191 | + created_at TIMESTAMP |
| 192 | +) |
| 193 | +TYPE iceberg |
| 194 | +``` |
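| | + |
| | +Once the table exists, it can presumably be queried like any other Beam SQL table. The following is only a sketch: it assumes the `users` table created above and that the same catalog and database are still selected. |
| | + |
| | +```sql |
| | +-- Assumes my_iceberg_catalog and my_database are the current catalog and database |
| | +SELECT id, username, email |
| | +FROM users |
| | +WHERE created_at IS NOT NULL |
| | +``` |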
| 195 | + |
| 196 | +## Catalog Management |
| 197 | + |
| 198 | +### List Available Catalogs |
| 199 | + |
| 200 | +```sql |
| 201 | +SHOW CATALOGS |
| 202 | +``` |
| 203 | + |
| 204 | +### Drop a Catalog |
| 205 | + |
| 206 | +```sql |
| 207 | +DROP CATALOG [ IF EXISTS ] catalogName |
| 208 | +``` |
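| | + |
| | +For example, the following sketch removes the catalog registered in the earlier examples. By analogy with `IF NOT EXISTS` on `CREATE CATALOG`, the `IF EXISTS` clause is assumed to turn a missing catalog into a no-op instead of an error. |
| | + |
| | +```sql |
| | +-- my_iceberg_catalog is the placeholder name used in the examples above |
| | +DROP CATALOG IF EXISTS my_iceberg_catalog |
| | +``` |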
| 209 | + |
| 210 | +## Best Practices |
| 211 | + |
| 212 | +### Security |
| 213 | + |
| 214 | +* **Credentials**: Store sensitive credentials (access keys, passwords) in secure configuration systems rather than hardcoding them in SQL statements |
| 215 | +* **IAM Roles**: Use IAM roles and service accounts when possible instead of access keys |
| 216 | +* **Network Security**: Ensure proper network access controls for your storage systems |
| 217 | + |
| 218 | +### Performance |
| 219 | + |
| 220 | +* **Warehouse Location**: Choose a warehouse location that's geographically close to your compute resources |
| 221 | +* **Partitioning**: Use appropriate partitioning strategies for your data access patterns |
| 222 | +* **File Formats**: Iceberg writes data files in Parquet by default; consider tuning compression settings for your use case |
| 223 | + |
| 224 | +### Monitoring |
| 225 | + |
| 226 | +* **Catalog Health**: Monitor catalog connectivity and performance |
| 227 | +* **Storage Usage**: Track warehouse storage usage and implement lifecycle policies |
| 228 | +* **Query Performance**: Monitor query performance and optimize table schemas as needed |
| 229 | + |
| 230 | +## Troubleshooting |
| 231 | + |
| 232 | +### Common Issues |
| 233 | + |
| 234 | +#### Catalog Creation Fails |
| 235 | + |
| 236 | +* **Check Dependencies**: Ensure all required Iceberg dependencies are on your classpath |
| 237 | +* **Verify Properties**: Double-check that all required properties are provided and correctly formatted |
| 238 | +* **Storage Access**: Ensure your compute environment has access to the specified warehouse location |
| 239 | + |
| 240 | +#### Table Operations Fail |
| 241 | + |
| 242 | +* **Catalog Context**: Make sure you're using the correct catalog with `USE CATALOG` |
| 243 | +* **Database Context**: Ensure you're in the correct database with `USE DATABASE` |
| 244 | +* **Permissions**: Verify that your credentials have the necessary permissions for the storage system |
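| | + |
| | +A quick way to rule out context problems is to reset the catalog and database explicitly before retrying the failing statement. This sketch uses only statements described on this page; the names are the placeholders from the earlier examples. |
| | + |
| | +```sql |
| | +-- Confirm the catalog is registered |
| | +SHOW CATALOGS |
| | + |
| | +-- Re-select the catalog and database, then retry the table operation |
| | +USE CATALOG my_iceberg_catalog |
| | +USE DATABASE my_database |
| | +``` |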
| 245 | + |
| 246 | +#### Performance Issues |
| 247 | + |
| 248 | +* **Partitioning**: Review your table partitioning strategy |
| 249 | +* **File Size**: Check if files are too large or too small for your use case |
| 250 | +* **Compression**: Consider adjusting compression settings for your data types |
| 251 | + |
| 252 | +### Getting Help |
| 253 | + |
| 254 | +For more information about Apache Iceberg and Beam SQL: |
| 255 | + |
| 256 | +* [Apache Iceberg Documentation](https://iceberg.apache.org/docs/) |
| 257 | +* [Iceberg Catalog Implementations](https://iceberg.apache.org/docs/latest/configuration/) |
| 258 | +* [Beam SQL Documentation](/documentation/dsls/sql/) |