Cluster by event_name when creating new table#402
Merged
Ian Streeter (istreeter) merged 1 commit into v2 on Feb 17, 2025
Conversation
Clustering is a BigQuery table feature in which storage blocks are sorted by user-defined columns. On the advice of our Analytics Engineers, `event_name` is a good default clustering column, because many typical queries filter on the event name. Snowplow users can change the clustering column at any time if they find a different column better suits their own query patterns.
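For context, a sketch of what this default amounts to, expressed with the `bq` CLI. The `snowplow.events` dataset/table name follows the `bq update` example later in this thread; the inline schema is illustrative, not the loader's real schema:

```shell
# Illustrative only: create a table clustered by event_name, as the loader
# now does by default. The dataset, table, and schema here are assumptions.
bq mk \
  --table \
  --clustering_fields=event_name \
  snowplow.events \
  event_name:STRING,app_id:STRING,collector_tstamp:TIMESTAMP
```

Queries that filter on `event_name` can then prune storage blocks instead of scanning the whole table.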
Benjamin BENOIST (benjben)
approved these changes
Feb 17, 2025
```scala
.setClustering {
  Clustering
    .newBuilder()
    .setFields(List("event_name").asJava)
```
Contributor
Shouldn't we put this in the config rather than hard-code the value? Can't we foresee a prospect with a specific need who will want to use another field?
Contributor
Author
We do have users already who will want a different clustering.
But it's really easy to change the clustering in BigQuery; it's documented in the BigQuery docs.

```shell
bq update --clustering_fields=app_id snowplow.events
```
This PR affects only the initial creation of the table, so there are two groups of affected people:
- New Snowplow users, who probably don't yet know the best clustering for their queries.
- Established Snowplow users who are migrating to a new table. I think it's ok for this group to alter their clustering manually, as part of the migration.

In both cases, I think it is overkill to make this configurable in the loader.
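A sketch of how to check which clustering a table currently has, assuming the same `snowplow.events` table as the `bq update` example above (requires the gcloud SDK and access to the dataset):

```shell
# Inspect the table's clustering spec from the table metadata;
# prints e.g. {'fields': ['event_name']} for a table clustered on event_name.
bq show --format=prettyjson snowplow.events |
  python3 -c 'import json, sys; print(json.load(sys.stdin).get("clustering"))'
```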
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request on Feb 17, 2025
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request on Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader