
Commit 8626ef5

Docs: encryption (apache#14621)
* initial commit
* clean up
* brief how it works section
* clean up
* add refs
* discussion updates
* address review comments
* add ref to custom catalogs doc
* add line break
1 parent bfe06a5 commit 8626ef5


4 files changed

+166
-0
lines changed

docs/docs/configuration.md

Lines changed: 10 additions & 0 deletions
@@ -90,6 +90,15 @@ Iceberg tables support table properties to configure table behavior, like the de
 | write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot |
 | write.delete.granularity | partition | Controls the granularity of generated delete files: partition or file |

+### Encryption properties
+
+| Property                          | Default            | Description                                                                              |
+| --------------------------------- | ------------------ | ---------------------------------------------------------------------------------------- |
+| encryption.key-id                 | (not set)          | ID of the master key of the table                                                        |
+| encryption.data-key-length        | 16 (bytes)         | Length of keys used for encryption of table files. Valid values are 16, 24, or 32 bytes |
+
+See the [Encryption](encryption.md) document for additional details.
+
 ### Table behavior properties

 | Property | Default | Description |
@@ -138,6 +147,7 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors
 | cache-enabled | true | Whether to cache catalog entries |
 | cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration |
 | metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](metrics-reporting.md) section for additional details |
+| encryption.kms-impl | null | Custom `KeyManagementClient` implementation to use in a catalog for interactions with a KMS (key management service). See the [Encryption](encryption.md) document for additional details |

 `HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
 Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.

docs/docs/custom-catalog.md

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ It's possible to read an iceberg table either from an hdfs path or from a hive t
 - [Custom LocationProvider](#custom-location-provider-implementation)
 - [Custom IcebergSource](#custom-icebergsource)

+Note: To work with encrypted tables, custom catalogs must meet a number of [security requirements](encryption.md#catalog-security-requirements).
+
 ### Custom table operations implementation
 Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata

docs/docs/encryption.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
---
title: "Encryption"
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->

# Encryption

Iceberg table encryption protects the confidentiality and integrity of table data in untrusted storage. The `data`, `delete`, `manifest` and `manifest list` files are encrypted and tamper-proofed before being sent to the storage backend.

The `metadata.json` file does not contain data or stats, and is therefore not encrypted.

Currently, encryption is supported in the Hive and REST catalogs for tables with Parquet and Avro data formats.

Two parameters are required to activate encryption of a table:

1. Catalog property `encryption.kms-impl`, which specifies the fully qualified class name of a client for a KMS ("key management service").
2. Table property `encryption.key-id`, which specifies the ID of a master key used to encrypt and decrypt the table. Master keys are stored and managed in the KMS.

For more details on table encryption, see the [Appendix: Internals Overview](#appendix-internals-overview) subsection.
35+
## Example
36+
37+
```sh
38+
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-{{ sparkVersionMajor }}:{{ icebergVersion }}\
39+
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
40+
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
41+
--conf spark.sql.catalog.spark_catalog.type=hive \
42+
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
43+
--conf spark.sql.catalog.local.type=hive \
44+
--conf spark.sql.catalog.local.encryption.kms-impl=org.apache.iceberg.aws.AwsKeyManagementClient
45+
```
46+
47+
```sql
48+
CREATE TABLE local.db.table (id bigint, data string) USING iceberg
49+
TBLPROPERTIES ('encryption.key-id'='{{ master key id }}');
50+
```
51+
52+
Inserted data will be automatically encrypted,
53+
54+
```sql
55+
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
56+
```
57+
58+
To verify encryption, the contents of data, manifest and manifest list files can be dumped in the command line with
59+
60+
```sh
61+
hexdump -C {{ /path/to/file }} | more
62+
```
63+
64+
The Parquet files must start with the "PARE" magic string (PARquet Encrypted footer mode), and manifest/list files must start with "AGS1" magic string (Aes Gcm Stream version 1).
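
The same check can be scripted instead of reading a hex dump. The following is a minimal sketch, assuming Java 11 or later; `EncryptionMagicCheck` and its `classify` methods are hypothetical helper names and are not part of Iceberg. It only inspects the first four bytes of a file and compares them with the two magic strings named above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Iceberg): classifies a file by its first four bytes. */
public class EncryptionMagicCheck {

  /** Returns "PARE", "AGS1", or "UNKNOWN" depending on the leading magic string. */
  public static String classify(byte[] header) {
    if (header.length < 4) {
      return "UNKNOWN";
    }
    String magic = new String(header, 0, 4, StandardCharsets.US_ASCII);
    if (magic.equals("PARE") || magic.equals("AGS1")) {
      return magic;
    }
    return "UNKNOWN";
  }

  /** Reads at most four bytes from the file and classifies them. */
  public static String classify(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
      return classify(in.readNBytes(4));
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: java EncryptionMagicCheck /path/to/file [/path/to/other/file ...]
    for (String arg : args) {
      System.out.println(arg + ": " + classify(Path.of(arg)));
    }
  }
}
```

A result of `UNKNOWN` for a data, manifest or manifest list file of an encrypted table suggests the file was not written with the expected encryption format and deserves a closer look with `hexdump`.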

Queried data will be automatically decrypted:

```sql
SELECT * FROM local.db.table;
```

## Catalog security requirements

1. Catalogs must ensure the `encryption.key-id` property is not modified or removed during the table's lifetime.

2. To function properly, Iceberg table encryption requires that catalog implementations do not retrieve the metadata
directly from `metadata.json` files if these files are kept unprotected in storage vulnerable to tampering.

    * Catalogs may keep the metadata in a trusted independent object store.
    * Catalogs may work with `metadata.json` files in tamper-proof storage.
    * Catalogs may use checksum techniques to verify the integrity of `metadata.json` files in storage vulnerable to tampering
    (the checksums must be kept in separate trusted storage).
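
The checksum option can be illustrated with a minimal sketch, assuming SHA-256 as the checksum function (the text above does not prescribe one). `MetadataChecksum`, `sha256` and `isUntampered` are hypothetical names and are not part of Iceberg. The trusted checksum is assumed to have been recorded at commit time and fetched from separate trusted storage; a plain hash is sufficient only because an attacker who can modify `metadata.json` cannot also modify that checksum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper (not part of Iceberg): verifies metadata.json against a trusted checksum. */
public class MetadataChecksum {

  /** Computes the SHA-256 digest of the given content. */
  public static byte[] sha256(byte[] content) {
    try {
      return MessageDigest.getInstance("SHA-256").digest(content);
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a mandatory algorithm on every Java platform implementation.
      throw new IllegalStateException("SHA-256 is not available", e);
    }
  }

  /**
   * Returns true if the metadata content matches the checksum from trusted storage.
   * MessageDigest.isEqual compares the two digests in constant time.
   */
  public static boolean isUntampered(byte[] metadataJson, byte[] trustedChecksum) {
    return MessageDigest.isEqual(sha256(metadataJson), trustedChecksum);
  }

  public static void main(String[] args) {
    byte[] original = "{\"format-version\":2}".getBytes(StandardCharsets.UTF_8);
    byte[] trusted = sha256(original); // assumed to be stored in separate trusted storage
    byte[] tampered = "{\"format-version\":1}".getBytes(StandardCharsets.UTF_8);

    System.out.println(isUntampered(original, trusted)); // prints "true"
    System.out.println(isUntampered(tampered, trusted)); // prints "false"
  }
}
```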

## Key Management Clients

Currently, Iceberg has clients for the AWS, GCP and Azure KMS systems. A custom client can be built for other key management systems by implementing the `org.apache.iceberg.encryption.KeyManagementClient` interface.

This interface has the following main methods:

```java
/**
 * Initialize the KMS client with given properties.
 *
 * @param properties kms client properties (taken from catalog properties)
 */
void initialize(Map<String, String> properties);

/**
 * Wrap a secret key, using a wrapping/master key which is stored in KMS and referenced by an ID.
 * Wrapping means encryption of the secret key with the master key, and adding optional
 * KMS-specific metadata that allows the KMS to decrypt the secret key in an unwrapping call.
 *
 * @param key a secret key being wrapped
 * @param wrappingKeyId a key ID that represents a wrapping key stored in KMS
 * @return wrapped key material
 */
ByteBuffer wrapKey(ByteBuffer key, String wrappingKeyId);

/**
 * Unwrap a secret key, using a wrapping/master key which is stored in KMS and referenced by an
 * ID.
 *
 * @param wrappedKey wrapped key material (encrypted key and optional KMS metadata, returned by
 *     the wrapKey method)
 * @param wrappingKeyId a key ID that represents a wrapping key stored in KMS
 * @return raw key bytes
 */
ByteBuffer unwrapKey(ByteBuffer wrappedKey, String wrappingKeyId);
```
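
To make the wrap/unwrap contract concrete, here is a minimal sketch of a client that keeps master keys in memory. It is an illustration for local experiments only and must not be used to protect real data. `InMemoryKmsClient` and the `demo.kms.key.` property prefix are assumptions made for this example; the class mirrors the three method signatures above but deliberately does not declare `implements KeyManagementClient`, so that it compiles without Iceberg on the classpath. A real implementation would declare the interface, satisfy any additional members it defines, and delegate to an actual KMS. Wrapping is done here with AES GCM, and the random nonce is stored in front of the ciphertext as the "KMS-specific metadata".

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical in-memory client for local experiments; not part of Iceberg. */
public class InMemoryKmsClient {
  private static final int NONCE_LENGTH = 12; // bytes
  private static final int TAG_BITS = 128;
  private static final String KEY_PREFIX = "demo.kms.key."; // assumed property prefix

  private final Map<String, SecretKeySpec> masterKeys = new HashMap<>();
  private final SecureRandom random = new SecureRandom();

  /** Loads Base64-encoded AES master keys from properties named demo.kms.key.<key id>. */
  public void initialize(Map<String, String> properties) {
    for (Map.Entry<String, String> entry : properties.entrySet()) {
      if (entry.getKey().startsWith(KEY_PREFIX)) {
        byte[] raw = Base64.getDecoder().decode(entry.getValue());
        String keyId = entry.getKey().substring(KEY_PREFIX.length());
        masterKeys.put(keyId, new SecretKeySpec(raw, "AES"));
      }
    }
  }

  /** Encrypts the secret key with the master key; output layout is nonce || ciphertext || tag. */
  public ByteBuffer wrapKey(ByteBuffer key, String wrappingKeyId) {
    byte[] nonce = new byte[NONCE_LENGTH];
    random.nextBytes(nonce);
    byte[] plain = new byte[key.remaining()];
    key.duplicate().get(plain); // duplicate() leaves the caller's buffer position untouched
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(
          Cipher.ENCRYPT_MODE, masterKey(wrappingKeyId), new GCMParameterSpec(TAG_BITS, nonce));
      byte[] encrypted = cipher.doFinal(plain);
      ByteBuffer wrapped = ByteBuffer.allocate(nonce.length + encrypted.length);
      wrapped.put(nonce).put(encrypted).flip();
      return wrapped;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to wrap key with " + wrappingKeyId, e);
    }
  }

  /** Reverses wrapKey: reads the nonce, then decrypts and authenticates the rest. */
  public ByteBuffer unwrapKey(ByteBuffer wrappedKey, String wrappingKeyId) {
    byte[] all = new byte[wrappedKey.remaining()];
    wrappedKey.duplicate().get(all);
    try {
      Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
      cipher.init(
          Cipher.DECRYPT_MODE,
          masterKey(wrappingKeyId),
          new GCMParameterSpec(TAG_BITS, all, 0, NONCE_LENGTH));
      return ByteBuffer.wrap(cipher.doFinal(all, NONCE_LENGTH, all.length - NONCE_LENGTH));
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("Failed to unwrap key with " + wrappingKeyId, e);
    }
  }

  private SecretKeySpec masterKey(String keyId) {
    SecretKeySpec key = masterKeys.get(keyId);
    if (key == null) {
      throw new IllegalArgumentException("Unknown master key ID: " + keyId);
    }
    return key;
  }
}
```

With this layout, a wrapped 16-byte key occupies 44 bytes: 12 bytes of nonce, 16 bytes of ciphertext and a 16-byte GCM tag. Unwrapping with a different master key or a modified buffer fails with an exception instead of returning wrong key bytes.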

## Appendix: Internals Overview

The standard Iceberg encryption manager generates an encryption key and a unique file ID ("AAD prefix")
for each data and delete file. The generation is performed on the worker nodes, using a secure random
number generator. For Parquet data files, these parameters are passed to the native Parquet Modular
Encryption [mechanism](https://parquet.apache.org/docs/file-format/data-pages/encryption). For Avro data files,
these parameters are passed to the AES GCM Stream encryption [mechanism](../../format/gcm-stream-spec.md).

The parent manifest file stores the encryption key and AAD prefix for each data and delete file in the
`key_metadata` [field](../../format/spec.md#data-file-fields). For tables with Avro data files, the data file length
is also added to the `key_metadata`.
The manifest file is encrypted by the AES GCM Stream encryption mechanism, using an encryption key and an
AAD prefix generated by the standard encryption manager. The generation is performed on the driver node,
using a secure random number generator.

The parent manifest list file stores the encryption key, AAD prefix and file length for each manifest file
in the `key_metadata` [field](../../format/spec.md#manifest-lists). The manifest list file is encrypted by
the AES GCM Stream encryption mechanism,
using an encryption key and an AAD prefix generated by the standard encryption manager.

The manifest list encryption key, AAD prefix and file length are packed in a key metadata object. This object
is serialized and encrypted with a "key encryption key" (KEK), using the KEK creation timestamp as the AES
GCM AAD. A KEK and its unique KEK_ID are generated using a secure random number generator. For each
snapshot, the KEK_ID of the encryption key that encrypts the manifest list key metadata is kept in the
`key-id` field in the table metadata snapshot [structure](../../format/spec.md#snapshots). The encrypted
manifest list key metadata is kept in the `encryption-keys` list in the table metadata
[structure](../../format/spec.md#table-metadata-fields).

The KEK is encrypted by the table master key via the KMS client. The result is kept in the `encryption-keys`
list in the table metadata structure. The KEK is re-used for a period allowed by the NIST SP 800-57
specification. Then it is rotated: a new KEK and KEK_ID are generated for encryption of new manifest list
key metadata objects. The new KEK is encrypted by the table master key and stored in the `encryption-keys`
list in the table metadata structure. The previous KEKs are retained for the existing table snapshots.
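
The KEK step described above, encrypting the manifest list key metadata with AES GCM while binding it to the KEK creation timestamp as AAD, can be sketched as follows. This is an illustration of the technique and not Iceberg's actual code: the class name `KekSketch`, the 8-byte big-endian encoding of the timestamp and the nonce-first output layout are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical illustration of AES GCM with a timestamp as AAD; not Iceberg's implementation. */
public class KekSketch {
  private static final SecureRandom RANDOM = new SecureRandom();
  private static final int NONCE_LENGTH = 12; // bytes
  private static final int TAG_BITS = 128;

  /** Assumed AAD encoding: the timestamp as 8 big-endian bytes. */
  static byte[] timestampAad(long kekTimestampMillis) {
    return ByteBuffer.allocate(Long.BYTES).putLong(kekTimestampMillis).array();
  }

  /** Encrypts serialized key metadata with the KEK; output layout is nonce || ciphertext || tag. */
  public static byte[] encrypt(byte[] kek, long kekTimestampMillis, byte[] keyMetadata)
      throws GeneralSecurityException {
    byte[] nonce = new byte[NONCE_LENGTH];
    RANDOM.nextBytes(nonce);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(
        Cipher.ENCRYPT_MODE, new SecretKeySpec(kek, "AES"), new GCMParameterSpec(TAG_BITS, nonce));
    cipher.updateAAD(timestampAad(kekTimestampMillis)); // authenticated but not encrypted
    byte[] encrypted = cipher.doFinal(keyMetadata);
    byte[] out = Arrays.copyOf(nonce, nonce.length + encrypted.length);
    System.arraycopy(encrypted, 0, out, nonce.length, encrypted.length);
    return out;
  }

  /** Decrypts the blob; fails if the KEK, the blob or the timestamp differs from encryption. */
  public static byte[] decrypt(byte[] kek, long kekTimestampMillis, byte[] blob)
      throws GeneralSecurityException {
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(
        Cipher.DECRYPT_MODE,
        new SecretKeySpec(kek, "AES"),
        new GCMParameterSpec(TAG_BITS, blob, 0, NONCE_LENGTH));
    cipher.updateAAD(timestampAad(kekTimestampMillis));
    return cipher.doFinal(blob, NONCE_LENGTH, blob.length - NONCE_LENGTH);
  }

  public static void main(String[] args) throws GeneralSecurityException {
    byte[] kek = new byte[16];
    RANDOM.nextBytes(kek);
    long createdAt = 1_700_000_000_000L; // example KEK creation timestamp
    byte[] keyMetadata =
        "manifest list key, AAD prefix, file length".getBytes(StandardCharsets.UTF_8);

    byte[] blob = encrypt(kek, createdAt, keyMetadata);
    System.out.println(Arrays.equals(keyMetadata, decrypt(kek, createdAt, blob))); // prints "true"

    try {
      decrypt(kek, createdAt + 1, blob); // wrong timestamp, so the GCM tag check fails
      System.out.println("unexpected");
    } catch (AEADBadTagException e) {
      System.out.println("rejected"); // prints "rejected"
    }
  }
}
```

Using the timestamp as AAD means a stored KEK timestamp cannot be altered without detection, which matters because the timestamp decides when a KEK is due for rotation.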

docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ nav:
   - Tables:
     - branching.md
     - configuration.md
+    - encryption.md
     - evolution.md
     - maintenance.md
     - metrics-reporting.md
