110 changes: 92 additions & 18 deletions docs/lakehouse/meta-cache.md
@@ -212,48 +212,97 @@ This cache, each Hudi Catalog has one.

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.

### Iceberg Table Information
### Iceberg Table/View Cache

Used to cache Iceberg table objects. The object is loaded and constructed through the Iceberg API.
Used to cache Iceberg table and view objects. These objects are loaded and constructed through the Iceberg API.

This cache, each Iceberg Catalog has one.
This cache is maintained in `IcebergMetadataCache`. Each Iceberg Catalog has its own instance, with separate `tableCache` and `viewCache`.

The cached table object (`IcebergTableCacheValue`) also contains snapshot information, which is lazily loaded on demand (mainly for MTMV scenarios).

**Impact on Data Visibility:**

The Table Cache controls which version of the Iceberg table metadata is used. This affects:

- **Schema**: The `schemaId` is obtained from the cached table object. If the cache contains an older table object, you will see the old schema (column definitions).
- **Snapshot**: The current snapshot ID is obtained from the cached table object. If the cache contains an older table object, queries will use the old snapshot and may not see the latest data.
- **Partition**: Partition information is loaded using the cached table object's metadata (specs, snapshots). An older cached object means outdated partition information.

:::tip
To see real-time schema, snapshot, and partition information, disable the table cache by setting `iceberg.table.meta.cache.ttl-second=0`. The Schema cache does not affect which version is used; it only caches the parsed result for performance.
:::

- Maximum cache count

Controlled by the FE configuration item `max_external_table_cache_num`, default is 1000.

You can adjust this parameter appropriately according to the number of Iceberg tables.

- Eviction time
- Eviction time (TTL)

Fixed at 28800 seconds. After version 3.0.7, configured by the FE parameter `external_cache_expire_time_seconds_after_access`, default is 86400 seconds.
Configured via the Catalog property `iceberg.table.meta.cache.ttl-second`, in seconds. If not specified, it falls back to the FE parameter `external_cache_expire_time_seconds_after_access` (default: 86400 seconds).

Set to `0` to disable the cache, forcing metadata to be fetched on every access.

- Minimum refresh time

Controlled by the FE configuration item `external_cache_expire_time_minutes_after_access`, in minutes. Default is 10 minutes. Reducing this time allows you to see the latest Iceberg table properties in Doris more in real time, but increases the frequency of accessing external data sources.
Controlled by the FE configuration item `external_cache_refresh_time_minutes`, in minutes. Default is 10 minutes. This is an asynchronous refresh that does not block current operations.
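
As a minimal sketch of how these settings fit together, the statements below first inspect the FE-level values and then create a catalog with its own TTL. The catalog name `iceberg_ttl_demo`, the REST backend, the `uri` value, and the 600-second TTL are illustrative assumptions, not values taken from this document. `SHOW FRONTEND CONFIG` is assumed to be available in the release in use (older releases spell it `ADMIN SHOW FRONTEND CONFIG`).

```sql
-- Inspect the FE-level settings named above: the fallback TTL,
-- the asynchronous refresh interval, and the maximum entry count.
SHOW FRONTEND CONFIG LIKE '%external_cache%';
SHOW FRONTEND CONFIG LIKE 'max_external_table_cache_num';

-- Hypothetical catalog that overrides the FE-level TTL for its table/view cache.
CREATE CATALOG iceberg_ttl_demo PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 'rest',                 -- assumed catalog backend
    'uri' = 'http://rest-catalog.example.com:8181',  -- placeholder endpoint
    'iceberg.table.meta.cache.ttl-second' = '600'    -- catalog-level TTL, in seconds
);
```

Setting the property to `'0'` instead disables the table/view cache for this catalog, as described above.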

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.
### Iceberg Manifest Cache

### Iceberg Table Snapshot
Used to cache **parsed** Iceberg manifest file contents—specifically the `DataFile` and `DeleteFile` objects extracted from manifest files.

Used to cache the snapshot list of Iceberg tables. The object is loaded and constructed through the Iceberg API.
This cache, each Iceberg Catalog has one.
This cache is part of `IcebergMetadataCache`, with each Iceberg Catalog having its own instance.

- Maximum cache count
**What is Cached:**

Controlled by the FE configuration item `max_external_table_cache_num`, default is 1000.
The cache stores parsed manifest content (not raw file bytes):
- `DataFile` objects: File metadata including path, partition values, metrics, etc.
- `DeleteFile` objects: Delete metadata for equality deletes.

You can adjust this parameter appropriately according to the number of Iceberg tables.
:::tip Best Practice
For optimal performance, **combine Doris Manifest Cache with Iceberg native manifest cache** by setting:

- Eviction time
```sql
CREATE CATALOG iceberg_catalog PROPERTIES (
'type' = 'iceberg',
...
'iceberg.manifest.cache.enable' = 'true', -- Enable Doris Manifest Cache (default)
'io.manifest.cache-enabled' = 'true' -- Enable Iceberg native cache
);
```

Fixed at 28800 seconds. After version 3.0.7, configured by the FE parameter `external_cache_expire_time_seconds_after_access`, default is 86400 seconds.
This provides two-level caching:
1. **Iceberg native cache** (`io.manifest.cache-enabled`): Caches raw manifest file I/O
2. **Doris Manifest Cache**: Caches parsed `DataFile`/`DeleteFile` objects, avoiding repeated parsing
:::

- Minimum refresh time
**Important Note:**

Controlled by the FE configuration item `external_cache_expire_time_minutes_after_access`, in minutes. Default is 10 minutes. Reducing this time allows you to see the latest Iceberg table properties in Doris more in real time, but increases the frequency of accessing external data sources.
Iceberg manifest files are **immutable**—once created, they are never modified. New commits create new manifest files rather than modifying existing ones. Therefore:

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.
- The Manifest Cache **does not affect data correctness** or what users see.
- It only affects **query performance** (reducing I/O and parsing overhead).
- Even with cached (stale) manifest entries, queries will still see the correct data.
- Disabling this cache will not help you see "newer" data—it will only increase I/O and CPU overhead.
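
Since disabling the Manifest Cache does not surface newer data, freshness is governed by the table cache instead. The following is a minimal sketch of picking up a new snapshot on demand, assuming that `REFRESH TABLE` invalidates the cached table object of an Iceberg table. The catalog, database, and table names are hypothetical.

```sql
-- Hypothetical names: iceberg_catalog, sales_db, orders.
-- Drop the cached metadata for one table so the next query reloads it
-- (and with it the current snapshot) through the Iceberg API.
REFRESH TABLE iceberg_catalog.sales_db.orders;

-- This query should now plan against the latest snapshot. Manifests that
-- are still referenced by that snapshot can keep being served from the
-- Manifest Cache, because their contents never change.
SELECT COUNT(*) FROM iceberg_catalog.sales_db.orders;
```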

**Configuration:**

These properties are set when **creating an Iceberg Catalog**:

```sql
CREATE CATALOG iceberg_catalog PROPERTIES (
"iceberg.manifest.cache.enable" = "true",
"iceberg.manifest.cache.capacity-mb" = "1024",
"iceberg.manifest.cache.ttl-second" = "172800"
);
```

| Config | Default | Description |
|--------|---------|-------------|
| `iceberg.manifest.cache.enable` | `true` | Enable/disable manifest cache |
| `iceberg.manifest.cache.capacity-mb` | `1024` | Maximum cache capacity in MB |
| `iceberg.manifest.cache.ttl-second` | `48 * 60 * 60` (48 hours) | Cache entry expiration after access |

## Cache Refresh

@@ -335,6 +384,12 @@ For all types of External Catalogs, if you want to see the latest Table Schema i
"schema.cache.ttl-second" = "0" // For a specific Catalog, disable Schema cache (supported in 2.1.11, 3.0.6)
```

:::note
For an **Iceberg Catalog**, disabling the Schema Cache alone does **not** guarantee real-time schema visibility. The `schemaId` is obtained from the cached Table object (controlled by the Table Cache). To see the latest schema, you must also disable the Table Cache (`iceberg.table.meta.cache.ttl-second=0`).

Schema Cache only affects whether to re-parse the schema (performance optimization), not which schema version is used.
:::

After setting, Doris will see the latest Table Schema in real time. However, this setting may increase the pressure on the metadata service.
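
For an Iceberg Catalog, the two settings discussed above can be combined in a single statement. This is a minimal sketch under assumptions: the catalog name `iceberg_realtime_schema`, the REST backend, and the `uri` value are placeholders, not values from this document.

```sql
-- Hypothetical Iceberg catalog that always resolves the latest schema.
CREATE CATALOG iceberg_realtime_schema PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",                 -- assumed catalog backend
    "uri" = "http://rest-catalog.example.com:8181",  -- placeholder endpoint
    "schema.cache.ttl-second" = "0",                 -- do not reuse parsed schemas
    "iceberg.table.meta.cache.ttl-second" = "0"      -- reload the table object (and its schemaId) on every access
);
```

Disabling only the first of the two properties would still return the `schemaId` held by the cached table object.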

### Disable Hive Catalog Metadata Cache
@@ -364,3 +419,22 @@ After setting the above parameters:

But this will increase the access pressure on external data sources (such as Hive Metastore and HDFS), which may cause unstable metadata access latency and other phenomena.

### Disable Iceberg Catalog Metadata Cache

For an Iceberg Catalog, to disable the caches so that queries read up-to-date metadata, configure the following parameters:

- Disable at Catalog level

```text
-- Catalog property
"iceberg.table.meta.cache.ttl-second" = "0" // Disable table/view cache
"iceberg.manifest.cache.enable" = "false" // Disable manifest cache
```

After setting the above parameters:

- New table snapshots can be queried in real time.
- Changes to manifest files can be queried in real time.

However, this increases the load on external data sources (such as the Iceberg catalog service and object storage) and may cause unstable metadata access latency.

@@ -212,48 +212,97 @@

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.

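### Iceberg Table Information
### Iceberg Table/View Cache

Used to cache Iceberg table objects. The object is loaded and constructed through the Iceberg API.
Used to cache Iceberg table and view objects. These objects are loaded and constructed through the Iceberg API.

Each Iceberg Catalog has one such cache.
This cache is maintained by `IcebergMetadataCache`. Each Iceberg Catalog has its own independent instance, which contains two caches: `tableCache` and `viewCache`.

The cached table object (`IcebergTableCacheValue`) also contains Snapshot information, which is lazily loaded on demand (mainly for MTMV scenarios).

**Impact on Data Visibility:**

The Table Cache controls which version of the Iceberg table metadata is used. This affects:

- **Schema**: The `schemaId` is obtained from the cached table object. If the cache holds an older table object, you will see the old Schema (column definitions).
- **Snapshot**: The current snapshot ID is obtained from the cached table object. If the cache holds an older table object, queries will use the old snapshot and may not see the latest data.
- **Partition**: Partition information is loaded using the cached table object's metadata (partition specs, snapshots). The older the cache, the more outdated the partition information.

:::tip
To see the latest Schema, Snapshot, and Partition information in real time, disable the table cache by setting `iceberg.table.meta.cache.ttl-second=0`. The Schema cache does not affect which version is used; it only caches the parsed result for performance.
:::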

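- Maximum cache count

Controlled by the FE configuration item `max_external_table_cache_num`, default is 1000.

You can adjust this parameter according to the number of Iceberg tables.

- Eviction time
- Eviction time (TTL)

Fixed at 28800 seconds. After version 3.0.7, configured by the FE parameter `external_cache_expire_time_seconds_after_access`, default is 86400 seconds.
Configured via the Catalog property `iceberg.table.meta.cache.ttl-second` (unit: seconds). If not specified, the default value of the FE parameter `external_cache_expire_time_seconds_after_access` (86400 seconds) is used.

Set to `0` to disable the cache, forcing metadata to be fetched again on every access.

- Minimum refresh time

Controlled by the FE configuration item `external_cache_expire_time_minutes_after_access`, in minutes. Default is 10 minutes. Reducing this time lets Doris see the latest Iceberg table properties sooner, but increases the frequency of access to external data sources.
Controlled by the FE configuration item `external_cache_refresh_time_minutes`, in minutes. Default is 10 minutes. This is an asynchronous refresh that does not block current operations.

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.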
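### Iceberg Manifest Cache

### Iceberg Table Snapshot
Used to cache **parsed** Iceberg Manifest file contents, specifically the `DataFile` and `DeleteFile` objects extracted from Manifest files.

Used to cache the Snapshot list of Iceberg tables. The object is loaded and constructed through the Iceberg API.
Each Iceberg Catalog has one such cache.
This cache is part of `IcebergMetadataCache`, and each Iceberg Catalog has its own independent instance.

- Maximum cache count
**What is Cached:**

Controlled by the FE configuration item `max_external_table_cache_num`, default is 1000.
The cache stores parsed Manifest content (not raw file bytes):
- `DataFile` objects: File metadata, including path, partition values, statistics, etc.
- `DeleteFile` objects: Delete metadata for Equality Deletes.

You can adjust this parameter according to the number of Iceberg tables.
:::tip Best Practice
For the best performance, **combine the Doris Manifest Cache with the Iceberg native Manifest Cache**:

- Eviction time
```sql
CREATE CATALOG iceberg_catalog PROPERTIES (
'type' = 'iceberg',
...
'iceberg.manifest.cache.enable' = 'true', -- Enable Doris Manifest Cache (default)
'io.manifest.cache-enabled' = 'true' -- Enable Iceberg native cache
);
```

Fixed at 28800 seconds. After version 3.0.7, configured by the FE parameter `external_cache_expire_time_seconds_after_access`, default is 86400 seconds.
This provides two-level caching:
1. **Iceberg native cache** (`io.manifest.cache-enabled`): Caches the I/O of raw Manifest files
2. **Doris Manifest Cache**: Caches parsed `DataFile`/`DeleteFile` objects, avoiding repeated parsing
:::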

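- Minimum refresh time
**Important Note:**

Controlled by the FE configuration item `external_cache_expire_time_minutes_after_access`, in minutes. Default is 10 minutes. Reducing this time lets Doris see the latest Iceberg table properties sooner, but increases the frequency of access to external data sources.
Iceberg Manifest files are **immutable**: once created, they are never modified. New commits create new Manifest files rather than modifying existing ones. Therefore:

After version 3.0.7, the configuration item name is changed to `external_cache_refresh_time_minutes`. The default value remains unchanged.
- The Manifest Cache **does not affect data correctness**, nor does it affect the data users see.
- It only affects **query performance** (reducing I/O and parsing overhead).
- Even with cached (older) Manifest entries, queries still see the correct data.
- Disabling this cache will **not** help you see "newer" data. It only increases I/O and CPU overhead.

**Configuration:**

These properties are set **when creating an Iceberg Catalog**:

```sql
CREATE CATALOG iceberg_catalog PROPERTIES (
"iceberg.manifest.cache.enable" = "true",
"iceberg.manifest.cache.capacity-mb" = "1024",
"iceberg.manifest.cache.ttl-second" = "172800"
);
```

| Config | Default | Description |
|--------|---------|-------------|
| `iceberg.manifest.cache.enable` | `true` | Enable/disable the Manifest cache |
| `iceberg.manifest.cache.capacity-mb` | `1024` | Maximum cache capacity (MB) |
| `iceberg.manifest.cache.ttl-second` | `48 * 60 * 60` (48 hours) | Cache entry expiration time after access |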

## Cache Refresh

@@ -335,6 +384,12 @@ CREATE CATALOG hive PROPERTIES (
"schema.cache.ttl-second" = "0" // For a specific Catalog, disable Schema cache (supported in 2.1.11, 3.0.6)
```

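:::note
For an **Iceberg Catalog**, disabling the Schema Cache alone does **not** guarantee that the latest Schema is visible in real time. The `schemaId` is obtained from the cached Table object (controlled by the Table Cache). To see the latest Schema, you must disable the Table Cache (`iceberg.table.meta.cache.ttl-second=0`).

The Schema Cache only affects whether the Schema is re-parsed (a performance optimization), not which version of the Schema is used.
:::

After setting, Doris will see the latest Table Schema in real time. However, this setting may increase the pressure on the metadata service.

### Disable Hive Catalog Metadata Cache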
@@ -363,3 +418,22 @@ CREATE CATALOG hive PROPERTIES (
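- Changes to partition data files can be queried in real time.

However, this increases the access pressure on external data sources (such as Hive Metastore and HDFS), which may cause unstable metadata access latency and similar issues.

### Disable Iceberg Catalog Metadata Cache

For an Iceberg Catalog, if you want to disable the cache in order to query data that is updated in real time, you can configure the following parameters:

- Disable at Catalog level

```text
-- Catalog property
"iceberg.table.meta.cache.ttl-second" = "0" // Disable table/view cache
"iceberg.manifest.cache.enable" = "false" // Disable Manifest cache
```

After setting the above parameters:

- New table Snapshots can be queried in real time.
- Changes to Manifest files can be queried in real time.

However, this increases the access pressure on external data sources (such as the Iceberg Catalog service and object storage), which may cause unstable metadata access latency and similar issues.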