
WIP: Adding Kafka datasink. #60307

Open
justinrmiller wants to merge 8 commits into ray-project:master from justinrmiller:58725-Kafka-Datasync

Conversation

@justinrmiller
Contributor

Description

This PR adds a Kafka Datasink to Ray, complementing the already existing Kafka Datasource.

Related issues

Closes #58725

Additional information

I will add additional information and tests later.

Signed-off-by: Justin Miller <justinrmiller@gmail.com>
@justinrmiller requested a review from a team as a code owner on January 20, 2026 00:59
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a KafkaDatasink for Ray Data, which is a valuable addition. The implementation is generally well-structured. I've identified a few areas for improvement to enhance robustness and maintainability. My feedback includes suggestions to refactor duplicated code, correct potentially buggy logic in object-to-dictionary conversion, add parameter validation, and fix an incorrect docstring example. Addressing these points will strengthen the new datasink implementation.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Justin Miller <justinrmiller@users.noreply.github.com>
@ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Jan 20, 2026
Member

@owenowenisme left a comment


Thanks for the contribution!

Signed-off-by: Justin Miller <justinrmiller@gmail.com>
Signed-off-by: Justin Miller <justinrmiller@gmail.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


if TYPE_CHECKING:
from kafka import KafkaProducer
from kafka.errors import KafkaError, KafkaTimeoutError

Missing runtime imports for Kafka classes

High Severity

KafkaProducer, KafkaError, and KafkaTimeoutError are imported only under TYPE_CHECKING, meaning they won't exist at runtime. When the write method executes, it will fail with a NameError because these names are undefined. The existing kafka_datasource.py correctly handles this by importing inside the function that uses them (e.g., from kafka import KafkaConsumer on line 336 of that file).
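A minimal sketch of the suggested fix, following the runtime-import pattern used by kafka_datasource.py (the write signature and attribute names here are assumptions, not the PR's exact code):

def write(self, blocks, ctx):
    # Import inside the method so the module stays importable even when
    # kafka-python is not installed; the names only need to exist when a
    # write actually runs.
    from kafka import KafkaProducer
    from kafka.errors import KafkaError, KafkaTimeoutError

    producer = KafkaProducer(
        bootstrap_servers=self.bootstrap_servers, **(self.producer_config or {})
    )
    # KafkaError / KafkaTimeoutError would then be caught in the send loop below.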

Additional Locations (2)


Member


This is valid

key = self._extract_key(row)

# Serialize value
value = self._serialize_value(row)

Redundant row-to-dict conversion per row

Low Severity

_row_to_dict is called twice for each row during processing—once in _extract_key and once in _serialize_value. The PR discussion explicitly noted this redundancy and the author marked it as "addressed", but the duplicate conversion remains. The row could be converted once and passed to both methods.
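A sketch of converting once and reusing the result (helper names come from the snippet above; the exact signatures are assumptions):

row_dict = self._row_to_dict(row)      # single conversion per row
key = self._extract_key(row_dict)      # both helpers now take the dict
value = self._serialize_value(row_dict)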

Additional Locations (2)


Member


This is valid

try:
future.get(timeout=0) # Non-blocking check since we already flushed
except Exception:
failed_messages += 1

Failed messages silently counted instead of raising exception

Medium Severity

When message delivery fails, the exception is caught and the failure is silently counted in failed_messages. Unlike other datasinks (e.g., BigQuery which raises RuntimeError on write failure), no exception is raised. Since write_kafka returns None, users have no way to know messages failed. This inconsistency with other datasinks could cause silent data loss.
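A sketch of surfacing the failures instead of only counting them (the loop structure mirrors the snippet above; the error message is illustrative):

failed_messages = 0
for future in futures:
    try:
        future.get(timeout=0)  # non-blocking check; producer.flush() already ran
    except Exception:
        failed_messages += 1

if failed_messages:
    # Match other datasinks (e.g. BigQuery) by raising instead of failing silently.
    raise RuntimeError(
        f"Failed to deliver {failed_messages} of {len(futures)} messages to Kafka"
    )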


producer_config: Additional Kafka producer configuration (kafka-python format)
delivery_callback: Optional callback for delivery reports (called with metadata or exception)
"""
VALID_SERIALIZERS = {"json", "string", "bytes"}
Member


Add _check_import here
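A sketch of what that could look like, assuming the _check_import helper from ray.data._internal.util that existing datasources use (the module/package arguments are assumptions):

from ray.data._internal.util import _check_import

def __init__(self, topic: str, bootstrap_servers: str) -> None:
    # Fail fast with a clear error if kafka-python is not installed,
    # rather than a NameError/ImportError deep inside the write path.
    _check_import(self, module="kafka", package="kafka-python")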

self,
topic: str,
bootstrap_servers: str,
key_field: str | None = None,
Member


Use Optional
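For example (parameter list trimmed from the snippet above; typing.Optional keeps the annotation valid on older Python versions that don't support the `str | None` syntax):

from typing import Optional

def __init__(
    self,
    topic: str,
    bootstrap_servers: str,
    key_field: Optional[str] = None,
) -> None:
    ...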

key_serializer: str = "string",
value_serializer: str = "json",
producer_config: dict[str, Any] | None = None,
delivery_callback: Callable | None = None,
Member


We should also accept and pass concurrency & ray_remote_args
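A sketch of threading those through the convenience wrapper (write_kafka and KafkaDatasink are the names this PR adds; the two extra parameters and their forwarding to write_datasink are the reviewer's suggestion, not existing code):

from typing import Any, Dict, Optional

def write_kafka(
    self,
    topic: str,
    bootstrap_servers: str,
    ray_remote_args: Optional[Dict[str, Any]] = None,
    concurrency: Optional[int] = None,
) -> None:
    # Pass execution-level options straight through to write_datasink.
    datasink = KafkaDatasink(topic=topic, bootstrap_servers=bootstrap_servers)
    self.write_datasink(
        datasink,
        ray_remote_args=ray_remote_args,
        concurrency=concurrency,
    )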

@@ -0,0 +1,537 @@
import json
Member


Let's move this test into test_kafka.py

key = self._extract_key(row)

# Serialize value
value = self._serialize_value(row)
Member


This is valid

self,
topic: str,
bootstrap_servers: str,
key_field: str | None = None,
Member


Use Optional

future.add_errback(
lambda e: self.delivery_callback(exception=e)
)
futures.append(future)
Member


Do you think we should flush the future buffer once there are N items in it? With millions of rows this could accumulate.
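A sketch of one way to bound it (MAX_PENDING_FUTURES is a hypothetical constant; the flush/clear placement mirrors the send loop above):

MAX_PENDING_FUTURES = 10_000  # hypothetical cap on in-flight sends

futures.append(future)
if len(futures) >= MAX_PENDING_FUTURES:
    # Wait for outstanding sends to be acknowledged, then drop the references so
    # millions of rows don't keep millions of future objects alive. In practice
    # each future should be checked for errors before being discarded.
    producer.flush()
    futures.clear()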

from kafka import KafkaProducer
from kafka.errors import KafkaError, KafkaTimeoutError

from ray.data import Datasink
Member


This will cause a circular import error.

Suggested change
from ray.data import Datasink
from ray.data.datasource.datasink import Datasink

"""
Convenience method to write Ray Dataset to Kafka.

Example:
Member


Suggested change
Example:
Examples:

ConsumptionAPI matches on the word "Examples" in order to insert the generated docs.

# Close the producer
producer.close(timeout=5.0)

return {"total_records": total_records, "failed_messages": failed_messages}
Member


We should log these
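For example (logger setup and messages are illustrative, not the PR's code):

import logging

logger = logging.getLogger(__name__)

# Close the producer
producer.close(timeout=5.0)

if failed_messages:
    logger.warning(
        "Kafka write completed with %d failed messages out of %d records",
        failed_messages,
        total_records,
    )
else:
    logger.info("Wrote %d records to Kafka topic %r", total_records, self.topic)

return {"total_records": total_records, "failed_messages": failed_messages}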


Labels

community-contribution (Contributed by the community), data (Ray Data-related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Add Kafka as a new Datasink

2 participants