@scotwells (Contributor) commented Jan 9, 2026

Summary

Optimize the ClickHouse database schema for platform-wide and user-specific querying of multi-tenant audit log data.

Details

As we began running performance tests against the activity apiserver, we noticed that platform-wide queries were performing drastically worse than tenant-level queries. See datum-cloud/enhancements#536 (comment) for a comparison.

This was a result of our initial schema being designed to order data by tenant, which meant platform-wide queries had to scan the entire data set instead of being able to skip over irrelevant rows.

This change makes several adjustments to the schema to improve querying performance.

  • Moved to daily partitions so that data is TTL'd at a finer granularity and expired partitions are dropped sooner. Since all queries are time-bound, daily partitions also let queries scan fewer partitions.
  • Removes unnecessary skip indexes on fields already present in the table's sort order. Skip indexes provide little benefit when queries can already use the ordering.
  • Adds skip indexes for fields used for common querying patterns to help skip over irrelevant rows.
  • Creates new projections that are designed to efficiently query audit logs across all tenants.
    • The platform-wide query projection is designed to support platform administrators querying across all tenants. Queries are most performant when they filter by a specific API group and resource, which we expect to be the most common pattern for cross-tenant queries.
    • Also introduced a query projection for user-specific queries to help platform administrators query for all audit logs related to a specific user.
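To make these ideas concrete, here is a minimal ClickHouse DDL sketch. All table, column, projection names, and the TTL are assumptions for illustration only; they are not the actual contents of the `001_initial_schema.sql` migration:

```sql
-- Illustrative only: names, columns, and TTL are assumptions,
-- not the actual migration.
CREATE TABLE audit_logs
(
    tenant_id  String,
    audit_id   UUID,
    event_time DateTime,
    api_group  LowCardinality(String),
    resource   LowCardinality(String),
    user_name  String,
    verb       LowCardinality(String),

    -- Skip index on a commonly filtered field not in the sort order.
    INDEX idx_user user_name TYPE bloom_filter GRANULARITY 4,

    -- Projection ordered for platform-wide queries by API group/resource.
    PROJECTION platform_wide
    (
        SELECT * ORDER BY (api_group, resource, event_time)
    ),
    -- Projection ordered for user-specific cross-tenant queries.
    PROJECTION by_user
    (
        SELECT * ORDER BY (user_name, event_time)
    )
)
ENGINE = ReplacingMergeTree
PARTITION BY toDate(event_time)              -- daily partitions
ORDER BY (tenant_id, event_time, audit_id)   -- tenant-level queries use the base order
TTL event_time + INTERVAL 90 DAY             -- expired daily partitions drop quickly
SETTINGS deduplicate_merge_projection_mode = 'rebuild';
```

The `deduplicate_merge_projection_mode` setting matters here because ReplacingMergeTree deduplicates rows during background merges, which would otherwise leave projections inconsistent with the base table.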

I've modified the 001_initial_schema.sql migration instead of adding a new migration because this service has not been released yet.

This PR also contains a few other related changes:

  • Removed `stage` from the schema and the querying interface, since we only collect the `ResponseComplete` stage.
  • Adjusted the apiserver to intelligently adapt the ORDER BY clause used when querying ClickHouse, ensuring the appropriate projection is used for the query being performed by the end user.
  • Updated the performance tests to better reflect real-world querying behavior, where the API group / resource are present in the queries.
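As a sketch of that ORDER BY behavior, a platform-wide query emitted by the apiserver might look like the following. All table, column, and projection names here are hypothetical; the key point is that the WHERE and ORDER BY clauses line up with a cross-tenant projection's sort order rather than the base table's tenant-first order:

```sql
-- Hypothetical platform-wide query: filtering and ordering by
-- (api_group, resource, event_time) lets ClickHouse serve it from a
-- projection with that sort order instead of scanning tenant-ordered parts.
SELECT event_time, tenant_id, user_name, verb
FROM audit_logs
WHERE api_group = 'apps'
  AND resource  = 'deployments'
  AND event_time >= now() - INTERVAL 1 DAY
ORDER BY api_group, resource, event_time DESC
LIMIT 100;

-- EXPLAIN indexes = 1 <query> can be used to confirm which projection
-- (if any) ClickHouse chose for a given query.
```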

I also included a few unrelated changes:

  • Upgraded to v1.9.0 of our shared actions to resolve an issue with the wrong tag being injected into the kustomize builds.
  • Moved to the ReplacingMergeTree database engine to ensure that all audit logs are unique. Removing duplicates is a background operation, so users may see duplicates if a merge operation hasn't run yet. To help prevent duplicates at the source, I adjusted the NATS and Vector configurations to de-duplicate audit logs based on the audit ID, which is guaranteed to be unique since we only collect the `ResponseComplete` stage.
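As a sketch of the de-duplication guardrail on the Vector side (Vector's `dedupe` transform is real, but the source/transform names and field path here are assumptions, not this repo's actual config):

```yaml
# Illustrative Vector config: drop repeated audit events by audit ID
# before they reach ClickHouse. Names and field paths are assumptions.
transforms:
  dedupe_audit_logs:
    type: dedupe
    inputs: ["audit_log_source"]
    fields:
      match: ["auditID"]   # Kubernetes audit events carry a unique auditID
```

On the NATS side, publishing each event with its audit ID as the message ID (the `Nats-Msg-Id` header) lets JetStream's duplicate-tracking window drop redundant deliveries before they ever reach Vector.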

Performance test results

Previous ClickHouse schema

This shows a performance test that was run against the activity system, focused on tenant-level querying. The graphs show that the activity API would struggle with even a small number of platform-wide queries (~4 RPS), and queries would immediately begin timing out.

(screenshot: performance graphs for the previous schema)

New optimized ClickHouse schema

This performance test demonstrates the improvements after the new schema was applied. The graphs show that the performance test was able to reach significantly higher throughput (~40 RPS) before queries began to time out.

(screenshot: performance graphs for the new schema)

Resources


Relates to datum-cloud/enhancements#536


Commit notes

  • We need to configure the merge behavior of projections since we swapped to the ReplacingMergeTree engine. See: https://clickhouse.com/docs/data-modeling/projections
  • Adjusted the NATS stream configuration to support a 10-minute de-duplication window. The NATS message ID is set to the audit log ID, since the ID is unique across all audit logs. JetStream has to be enabled to take advantage of the message_id option.