
Cluster by event_name when creating new table#402

Merged
Ian Streeter (istreeter) merged 1 commit into v2 from cluster-by-event-name on Feb 17, 2025

Conversation

@istreeter (Contributor)

Clustering is the BigQuery table feature in which storage blocks are sorted by a user-defined column.

Taking the advice of our Analytics Engineers, `event_name` is a good default clustering column, because many typical queries filter on the event name.

Snowplow users can change the clustering column at any time, if they find a different column is more appropriate for their own query patterns.
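To illustrate why clustering helps filtered queries, here is a minimal, self-contained sketch (plain Python rather than the loader's Scala, with made-up block size and event names) of the block-pruning idea: when storage blocks are sorted by the clustering column, a filter can skip every block whose min/max range cannot contain the wanted value.

```python
# Hypothetical simulation of BigQuery block pruning -- not loader code.
from itertools import islice

def to_blocks(rows, block_size):
    """Split rows into fixed-size storage blocks, recording min/max per block."""
    it = iter(rows)
    blocks = []
    while chunk := list(islice(it, block_size)):
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks

def blocks_scanned(blocks, wanted):
    """Count blocks whose [min, max] range could contain the wanted value."""
    return sum(1 for lo, hi, _ in blocks if lo <= wanted <= hi)

# 1000 events arriving in an interleaved order, as a load typically would.
events = ["page_view", "page_ping", "transaction", "link_click"] * 250

unclustered = to_blocks(events, 100)          # blocks in arrival order
clustered = to_blocks(sorted(events), 100)    # blocks sorted by event_name

print(blocks_scanned(unclustered, "transaction"))  # 10 -- every block matches
print(blocks_scanned(clustered, "transaction"))    # 3  -- most blocks pruned
```

With interleaved data every block's range covers all event names, so a filter on `event_name` must scan the whole table; with sorted data only the blocks actually holding (or bordering) the filtered value are touched.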

```scala
.setClustering {
  Clustering
    .newBuilder()
    .setFields(List("event_name").asJava)
    .build()
}
```


Shouldn't we put this in the config rather than hard-code the value? Can't we foresee a prospect who has a specific need and will want to use another field?

@istreeter (Contributor, Author)


We do have users already who will want a different clustering.

But it's really easy to change the clustering in BigQuery. It's documented here.

```
bq update --clustering_fields=app_id snowplow.events
```

This PR affects creating the table for the first time. So there are two groups of affected people:

  1. New Snowplow users, who probably don't yet know the best clustering column for their queries.
  2. Established Snowplow users, who are migrating to a new table. I think it's ok for this group to manually alter their clustering as part of the migration.

In both cases, I think it is overkill to make it configurable in the loader.

@istreeter Ian Streeter (istreeter) merged commit 39bf15a into v2 Feb 17, 2025
2 checks passed
@istreeter Ian Streeter (istreeter) deleted the cluster-by-event-name branch February 17, 2025 10:50
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request Feb 17, 2025
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader