Cluster by event_name when creating new table#402
Merged
Ian Streeter (istreeter) merged 1 commit into v2 on Feb 17, 2025
Conversation
Clustering is a BigQuery table feature in which storage blocks are sorted by user-defined columns. On the advice of our Analytics Engineers, `event_name` is a good default clustering column, because many typical queries filter on the event name. Snowplow users can change the clustering column at any time if they find a different column better suits their own query patterns.
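For context, a sketch of what this default amounts to, expressed with the `bq` CLI. The `snowplow.events` dataset/table name follows the `bq update` example later in this thread; the inline schema is illustrative, not the loader's real schema:

```shell
# Illustrative only: create a table clustered by event_name, as the loader
# now does by default. The dataset, table, and schema here are assumptions.
bq mk \
  --table \
  --clustering_fields=event_name \
  snowplow.events \
  event_name:STRING,app_id:STRING,collector_tstamp:TIMESTAMP
```

Queries that filter on `event_name` can then prune storage blocks instead of scanning the whole table.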
Benjamin BENOIST (benjben)
approved these changes
Feb 17, 2025
```scala
.setClustering {
  Clustering
    .newBuilder()
    .setFields(List("event_name").asJava)
```
Contributor
Shouldn't we put this in the config rather than hard-code the value? Can't we foresee a prospect with a specific need who will want to use another field?
Contributor
Author
We do have users already who will want a different clustering.
But it's really easy to change the clustering in BigQuery; it's documented in the BigQuery docs.

```shell
bq update --clustering_fields=app_id snowplow.events
```
This PR affects only the initial creation of the table, so there are two groups of affected people:
- New Snowplow users, who probably don't yet know the best clustering for their queries.
- Established Snowplow users who are migrating to a new table. I think it's ok for this group to alter their clustering manually, as part of the migration.

In both cases, I think it is overkill to make this configurable in the loader.
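A sketch of how to check which clustering a table currently has, assuming the same `snowplow.events` table as the `bq update` example above (requires the gcloud SDK and access to the dataset):

```shell
# Inspect the table's clustering spec from the table metadata;
# prints e.g. {'fields': ['event_name']} for a table clustered on event_name.
bq show --format=prettyjson snowplow.events |
  python3 -c 'import json, sys; print(json.load(sys.stdin).get("clustering"))'
```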
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request on Feb 17, 2025
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request on Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader