@@ -19,7 +19,7 @@ In Java and Scala applications, you can use different dependency management
 tools (e.g., Maven, sbt, or Gradle) to access the
 connector `com.google.cloud.spark.bigtable:spark-bigtable_2.13:<version>` or
 `com.google.cloud.spark.bigtable:spark-bigtable_2.12:<version>` (current
-`<version>` is `0.4.0`) and package it inside your application JAR
+`<version>` is `0.5.0`) and package it inside your application JAR
 using libraries such as Maven Shade Plugin. For PySpark applications, you can
 use the `--jars` flag to pass the GCS address of the connector when submitting
 it.
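
For example, a PySpark job might be submitted along these lines (a minimal
sketch; the GCS path of the connector JAR is a placeholder to replace with the
address of the actual released artifact):

```
spark-submit \
  --jars gs://<path-to-connector>/spark-bigtable_2.12-0.5.0.jar \
  your_pyspark_job.py
```
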
@@ -31,7 +31,7 @@ For Maven, you can add the following snippet to your `pom.xml` file:
 <dependency>
   <groupId>com.google.cloud.spark.bigtable</groupId>
   <artifactId>spark-bigtable_2.13</artifactId>
-  <version>0.4.0</version>
+  <version>0.5.0</version>
 </dependency>
 ```
 
@@ -40,20 +40,20 @@ For Maven, you can add the following snippet to your `pom.xml` file:
 <dependency>
   <groupId>com.google.cloud.spark.bigtable</groupId>
   <artifactId>spark-bigtable_2.12</artifactId>
-  <version>0.4.0</version>
+  <version>0.5.0</version>
 </dependency>
 ```
 
 For sbt, you can add the following to your `build.sbt` file:
 
 ```
 // for Scala 2.13
-libraryDependencies += "com.google.cloud.spark.bigtable" % "spark-bigtable_2.13" % "0.4.0"
+libraryDependencies += "com.google.cloud.spark.bigtable" % "spark-bigtable_2.13" % "0.5.0"
 ```
 
 ```
 // for Scala 2.12
-libraryDependencies += "com.google.cloud.spark.bigtable" % "spark-bigtable_2.12" % "0.4.0"
+libraryDependencies += "com.google.cloud.spark.bigtable" % "spark-bigtable_2.12" % "0.5.0"
 ```
 
 Finally, you can add the following to your `build.gradle` file when using
@@ -62,14 +62,14 @@ Gradle:
 ```
 // for Scala 2.13
 dependencies {
-    implementation group: 'com.google.cloud.spark.bigtable', name: 'spark-bigtable_2.13', version: '0.4.0'
+    implementation group: 'com.google.cloud.spark.bigtable', name: 'spark-bigtable_2.13', version: '0.5.0'
 }
 ```
 
 ```
 // for Scala 2.12
 dependencies {
-    implementation group: 'com.google.cloud.spark.bigtable', name: 'spark-bigtable_2.12', version: '0.4.0'
+    implementation group: 'com.google.cloud.spark.bigtable', name: 'spark-bigtable_2.12', version: '0.5.0'
 }
 ```
 
@@ -157,6 +157,44 @@ columns and the `id` column is used as the row key. Note that you could also
 specify *compound* row keys,
 which are created by concatenating multiple DataFrame columns together.
 
+#### Catalog with variable column definitions
+
+You can also use `regexColumns` to match multiple columns in the same column
+family to a single DataFrame column. This can be useful when you don't know
+the exact column qualifiers for your data ahead of time, for example when a
+column qualifier is partially composed of other pieces of data.
+
+For example, this catalog:
+```
+{
+  "table": {"name": "t1"},
+  "rowkey": "id_rowkey",
+  "columns": {
+    "id": {"cf": "rowkey", "col": "id_rowkey", "type": "string"}
+  },
+  "regexColumns": {
+    "metadata": {"cf": "info", "pattern": "\\C*", "type": "long"}
+  }
+}
+```
+
+would match all columns in the column family `info`. The result is a DataFrame
+column named `metadata` whose contents are a Map of String to Long, with the
+keys being the column qualifiers and the values being the cell values read
+from those columns in Bigtable.
+
+A few caveats:
+
+- The values of all matching columns must be deserializable to the type
+  defined in the catalog. If you expect to need more complex deserialization,
+  you can define the type as `bytes` and run custom deserialization logic.
+- A catalog with regex columns cannot be used for writes.
+- Bigtable uses [RE2](https://github.com/google/re2/wiki/Syntax) for its regex
+  implementation, which has slight differences from other implementations.
+- Because column qualifiers may contain arbitrary characters, including
+  newlines, it is advisable to use `\C` as the wildcard expression, since `.`
+  does not match newlines.
+
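+For illustration, reading with such a catalog from Scala might look like the
+following minimal sketch. It assumes the `bigtable` format together with the
+`spark.bigtable.project.id` and `spark.bigtable.instance.id` options, and the
+project and instance IDs are placeholders:
+
+```
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.functions.col
+
+val spark = SparkSession.builder().appName("regex-catalog-read").getOrCreate()
+
+// The catalog from above: qualifiers in the "info" family matching the
+// pattern are collected into a single Map column named "metadata".
+val catalog =
+  """{
+    |  "table": {"name": "t1"},
+    |  "rowkey": "id_rowkey",
+    |  "columns": {
+    |    "id": {"cf": "rowkey", "col": "id_rowkey", "type": "string"}
+    |  },
+    |  "regexColumns": {
+    |    "metadata": {"cf": "info", "pattern": "\\C*", "type": "long"}
+    |  }
+    |}""".stripMargin
+
+// Placeholder project and instance IDs -- substitute your own values.
+val df = spark.read
+  .format("bigtable")
+  .option("catalog", catalog)
+  .option("spark.bigtable.project.id", "<your-project-id>")
+  .option("spark.bigtable.instance.id", "<your-instance-id>")
+  .load()
+
+// "metadata" arrives as a Map of String to Long; look up one qualifier.
+df.select(col("id"), col("metadata").getItem("some_qualifier")).show()
+```
+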
 ### Writing to Bigtable
 
 You can use the `bigtable` format along with specifying the Bigtable