Cluster by event_name when creating a new table (#402)

istreeter · benjben · commit 4124f80b08f1 · 2025-02-17T12:07:13.000+01:00
Clustering is the BigQuery table feature in which storage blocks are
sorted by a user-defined column.

On the advice of our Analytics Engineers, we use `event_name` as the
default clustering column, because many typical queries filter on the
event name.
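
To illustrate why a filter on the clustering column can read less
data, here is a toy model in plain Scala. It is only a sketch and not
how BigQuery is implemented: `Block`, `toBlocks` and `blocksToScan` are
hypothetical names, and each block is reduced to the min/max
`event_name` it contains.

```scala
object ClusteringSketch {
  // One storage block, summarised by the smallest and largest event_name it holds.
  final case class Block(minName: String, maxName: String)

  // Split rows into fixed-size blocks and record each block's min/max event_name.
  def toBlocks(eventNames: Seq[String], blockSize: Int): Seq[Block] =
    eventNames.grouped(blockSize).map(rows => Block(rows.min, rows.max)).toSeq

  // A filter `event_name = target` can skip any block whose range excludes target.
  def blocksToScan(blocks: Seq[Block], target: String): Int =
    blocks.count(b => b.minName <= target && target <= b.maxName)

  def main(args: Array[String]): Unit = {
    // Made-up event names in arrival order (illustrative data only).
    val arrivalOrder = Seq(
      "page_view", "link_click", "page_ping", "page_view",
      "page_ping", "link_click", "page_view", "page_ping"
    )
    val unclustered = toBlocks(arrivalOrder, blockSize = 2)
    val clustered   = toBlocks(arrivalOrder.sorted, blockSize = 2)

    println(s"unclustered blocks scanned: ${blocksToScan(unclustered, "page_ping")}")
    println(s"clustered blocks scanned:   ${blocksToScan(clustered, "page_ping")}")
  }
}
```

In this model, sorting by `event_name` narrows each block's range, so
fewer blocks can match an equality filter on that column.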

Snowplow users can change the clustering column at any time, if they
find a different column is more appropriate for their own query
patterns.
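
As a rough sketch of how a user might switch the clustering column on
an existing table, the following uses the same
`com.google.cloud.bigquery` Java client as the loader. The object name,
the `withClustering` helper, the dataset/table names and the `app_id`
column are all hypothetical, and the `scala.jdk.CollectionConverters`
import assumes Scala 2.13 or later. It is not part of this commit.

```scala
import com.google.cloud.bigquery.{BigQueryOptions, Clustering, StandardTableDefinition, TableId}
import scala.jdk.CollectionConverters._

object ChangeClustering {
  // Pure helper: copy a table definition, replacing only its clustering spec.
  def withClustering(definition: StandardTableDefinition, columns: List[String]): StandardTableDefinition =
    definition
      .toBuilder()
      .setClustering(Clustering.newBuilder().setFields(columns.asJava).build())
      .build()

  def main(args: Array[String]): Unit = {
    // Assumes application default credentials are available.
    val bigquery = BigQueryOptions.getDefaultInstance().getService()

    // Hypothetical dataset and table names; substitute your own.
    val table = bigquery.getTable(TableId.of("my_dataset", "events"))

    // Hypothetical replacement column; pick one that matches your query filters.
    val updated = withClustering(table.getDefinition[StandardTableDefinition](), List("app_id"))

    // Send the modified definition back to BigQuery.
    table.toBuilder().setDefinition(updated).build().update()
  }
}
```

Only the clustering specification changes here; the schema and the
`load_tstamp` partitioning are carried over from the existing
definition.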
diff --git a/modules/core/src/main/scala/com.snowplowanalytics.snowplow.bigquery/processing/TableManager.scala b/modules/core/src/main/scala/com.snowplowanalytics.snowplow.bigquery/processing/TableManager.scala
@@ -17,6 +17,7 @@ import org.typelevel.log4cats.slf4j.Slf4jLogger
 import com.google.cloud.bigquery.{
   BigQuery,
   BigQueryOptions,
+  Clustering,
   FieldList,
   Schema,
   StandardTableDefinition,
@@ -201,6 +202,12 @@ object TableManager {
           .setField("load_tstamp")
           .build()
       }
+      .setClustering {
+        Clustering
+          .newBuilder()
+          .setFields(List("event_name").asJava)
+          .build()
+      }
       .build()
     TableInfo.of(BigQueryUtils.tableIdOf(config), tableDefinition)
   }