[SPARK-26121][STRUCTURED STREAMING] Allow users to define prefix of Kafka's consumer group (group.id)
## What changes were proposed in this pull request?
Allow the Spark Structured Streaming user to specify a prefix for the consumer group identifier (`group.id`), instead of forcing consumer group ids of the form `spark-kafka-source-*`.
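As a rough sketch of how the new option would be used (the option name `groupIdPrefix` and its default `spark-kafka-source` are taken from the documentation change below; the broker address, topic, and prefix values are placeholders):

```scala
// Minimal sketch: reading from Kafka with a custom consumer-group prefix.
// "groupIdPrefix" and its default come from the doc change in this PR;
// broker, topic, and prefix values here are illustrative placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("GroupIdPrefixExample")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "topic1")
  // Generated consumer groups become "my-team-prefix-<suffix>" instead of "spark-kafka-source-<suffix>"
  .option("groupIdPrefix", "my-team-prefix")
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```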
## How was this patch tested?
Unit tests provided by Spark (backwards-compatible change, i.e., users can optionally use the new functionality).
`mvn test -pl external/kafka-0-10`
Closes apache#23103 from zouzias/SPARK-26121.
Authored-by: Anastasios Zouzias <[email protected]>
Signed-off-by: cody koeninger <[email protected]>
**docs/structured-streaming-kafka-integration.md** (23 additions, 14 deletions)
```diff
@@ -123,7 +123,7 @@ df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 </div>
 </div>
 
-### Creating a Kafka Source for Batch Queries
+### Creating a Kafka Source for Batch Queries
 If you have a use case that is better suited to batch processing,
 you can create a Dataset/DataFrame for a defined range of offsets.
 
```
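A hedged illustration of the batch-query source this hunk refers to, reading a fixed range of offsets with the standard `startingOffsets`/`endingOffsets` options; the broker and topic names are placeholders:

```scala
// Sketch: batch read of a defined offset range from Kafka (broker/topic are placeholders).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("KafkaBatchRead")
  .getOrCreate()

val batchDf = spark.read // read, not readStream, for a batch query
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

batchDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```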
```diff
@@ -374,17 +374,24 @@ The following configurations are optional:
   <td>streaming and batch</td>
   <td>Rate limit on maximum number of offsets processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume.</td>
 </tr>
+<tr>
+  <td>groupIdPrefix</td>
+  <td>string</td>
+  <td>spark-kafka-source</td>
+  <td>streaming and batch</td>
+  <td>Prefix of consumer group identifiers (`group.id`) that are generated by structured streaming queries</td>
+</tr>
 </table>
 
 ## Writing Data to Kafka
 
-Here, we describe the support for writing Streaming Queries and Batch Queries to Apache Kafka. Take note that
+Here, we describe the support for writing Streaming Queries and Batch Queries to Apache Kafka. Take note that
 Apache Kafka only supports at least once write semantics. Consequently, when writing---either Streaming Queries
 or Batch Queries---to Kafka, some records may be duplicated; this can happen, for example, if Kafka needs
 to retry a message that was not acknowledged by a Broker, even though that Broker received and wrote the message record.
-Structured Streaming cannot prevent such duplicates from occurring due to these Kafka write semantics. However,
+Structured Streaming cannot prevent such duplicates from occurring due to these Kafka write semantics. However,
 if writing the query is successful, then you can assume that the query output was written at least once. A possible
-solution to remove duplicates when reading the written data could be to introduce a primary (unique) key
+solution to remove duplicates when reading the written data could be to introduce a primary (unique) key
 that can be used to perform de-duplication when reading.
 
 The Dataframe being written to Kafka should have the following columns in schema:
```
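To make the de-duplication idea in the hunk above concrete, here is a hedged sketch (not part of the PR): it assumes the producer put a unique identifier in the Kafka message key, and drops at-least-once duplicates by that key when reading back.

```scala
// Sketch: de-duplicating at-least-once writes on read, assuming the Kafka
// message key carries a unique identifier (broker/topic are placeholders).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DedupOnRead")
  .getOrCreate()

val fromKafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING) AS uniqueId", "CAST(value AS STRING) AS value")

// Keep a single row per unique key; Structured Streaming tracks seen keys as state.
val deduped = fromKafka.dropDuplicates("uniqueId")
```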
```diff
@@ -405,8 +412,8 @@ The Dataframe being written to Kafka should have the following columns in schema
 </table>
 \* The topic column is required if the "topic" configuration option is not specified.<br>
 
-The value column is the only required option. If a key column is not specified then
-a ```null``` valued key column will be automatically added (see Kafka semantics on
+The value column is the only required option. If a key column is not specified then
+a ```null``` valued key column will be automatically added (see Kafka semantics on
 how ```null``` valued key values are handled). If a topic column exists then its value
 is used as the topic when writing the given row to Kafka, unless the "topic" configuration
 option is set i.e., the "topic" configuration option overrides the topic column.
```
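A minimal sketch of the write path this hunk documents: `value` is required, `key` is optional, and the "topic" option overrides any topic column; broker and topic names are placeholders.

```scala
// Sketch: batch write to Kafka with key/value columns (broker/topic are placeholders).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WriteToKafka")
  .getOrCreate()
import spark.implicits._

val df = Seq(("k1", "v1"), ("k2", "v2")).toDF("key", "value")

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "topic1") // overrides any topic column in the DataFrame
  .save()
```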
```diff
@@ -568,31 +575,33 @@ df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)") \
```