
WIP: Adding Kafka datasink. #60307

Open
justinrmiller wants to merge 8 commits into ray-project:master from justinrmiller:58725-Kafka-Datasync

Conversation

@justinrmiller
Contributor

Description

This PR adds a Kafka Datasink to Ray, complementing the already existing Kafka Datasource.

Related issues

Closes #58725

Additional information

I will add additional information and tests later.

Signed-off-by: Justin Miller <justinrmiller@gmail.com>
@justinrmiller requested a review from a team as a code owner on January 20, 2026 00:59
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a KafkaDatasink for Ray Data, which is a valuable addition. The implementation is generally well-structured. I've identified a few areas for improvement to enhance robustness and maintainability. My feedback includes suggestions to refactor duplicated code, correct potentially buggy logic in object-to-dictionary conversion, add parameter validation, and fix an incorrect docstring example. Addressing these points will strengthen the new datasink implementation.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Justin Miller <justinrmiller@users.noreply.github.com>
@ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels on Jan 20, 2026
Member

@owenowenisme left a comment


Thanks for the contribution!

Signed-off-by: Justin Miller <justinrmiller@gmail.com>
Signed-off-by: Justin Miller <justinrmiller@gmail.com>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.


if TYPE_CHECKING:
from kafka import KafkaProducer
from kafka.errors import KafkaError, KafkaTimeoutError

Missing runtime imports for Kafka classes

High Severity

KafkaProducer, KafkaError, and KafkaTimeoutError are imported only under TYPE_CHECKING, meaning they won't exist at runtime. When the write method executes, it will fail with a NameError because these names are undefined. The existing kafka_datasource.py correctly handles this by importing inside the function that uses them (e.g., from kafka import KafkaConsumer on line 336 of that file).
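A minimal sketch of the suggested fix, following the runtime-import pattern used by kafka_datasource.py (the write signature and attribute names here are assumptions, not the PR's exact code):

def write(self, blocks, ctx):
    # Import inside the method so the module stays importable even when
    # kafka-python is not installed; the names only need to exist when a
    # write actually runs.
    from kafka import KafkaProducer
    from kafka.errors import KafkaError, KafkaTimeoutError

    producer = KafkaProducer(
        bootstrap_servers=self.bootstrap_servers, **(self.producer_config or {})
    )
    # KafkaError / KafkaTimeoutError would then be caught in the send loop below.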

Additional Locations (2)


Member


This is valid

key = self._extract_key(row)

# Serialize value
value = self._serialize_value(row)

Redundant row-to-dict conversion per row

Low Severity

_row_to_dict is called twice for each row during processing—once in _extract_key and once in _serialize_value. The PR discussion explicitly noted this redundancy and the author marked it as "addressed", but the duplicate conversion remains. The row could be converted once and passed to both methods.
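A sketch of converting once and reusing the result (helper names come from the snippet above; the exact signatures are assumptions):

row_dict = self._row_to_dict(row)      # single conversion per row
key = self._extract_key(row_dict)      # both helpers now take the dict
value = self._serialize_value(row_dict)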

Additional Locations (2)


Member


This is valid

try:
future.get(timeout=0) # Non-blocking check since we already flushed
except Exception:
failed_messages += 1

Failed messages silently counted instead of raising exception

Medium Severity

When message delivery fails, the exception is caught and the failure is silently counted in failed_messages. Unlike other datasinks (e.g., BigQuery which raises RuntimeError on write failure), no exception is raised. Since write_kafka returns None, users have no way to know messages failed. This inconsistency with other datasinks could cause silent data loss.
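A sketch of surfacing the failures instead of only counting them (the loop structure mirrors the snippet above; the error message is illustrative):

failed_messages = 0
for future in futures:
    try:
        future.get(timeout=0)  # non-blocking check; producer.flush() already ran
    except Exception:
        failed_messages += 1

if failed_messages:
    # Match other datasinks (e.g. BigQuery) by raising instead of failing silently.
    raise RuntimeError(
        f"Failed to deliver {failed_messages} of {len(futures)} messages to Kafka"
    )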


producer_config: Additional Kafka producer configuration (kafka-python format)
delivery_callback: Optional callback for delivery reports (called with metadata or exception)
"""
VALID_SERIALIZERS = {"json", "string", "bytes"}
Member


Add _check_import here
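A sketch of what that could look like, assuming the _check_import helper from ray.data._internal.util that existing datasources use (the module/package arguments are assumptions):

from ray.data._internal.util import _check_import

def __init__(self, topic: str, bootstrap_servers: str) -> None:
    # Fail fast with a clear error if kafka-python is not installed,
    # rather than a NameError/ImportError deep inside the write path.
    _check_import(self, module="kafka", package="kafka-python")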

self,
topic: str,
bootstrap_servers: str,
key_field: str | None = None,
Member


Use Optional
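For example (parameter list trimmed from the snippet above; typing.Optional keeps the annotation valid on older Python versions that don't support the `str | None` syntax):

from typing import Optional

def __init__(
    self,
    topic: str,
    bootstrap_servers: str,
    key_field: Optional[str] = None,
) -> None:
    ...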

key_serializer: str = "string",
value_serializer: str = "json",
producer_config: dict[str, Any] | None = None,
delivery_callback: Callable | None = None,
Member


We should also accept and pass concurrency & ray_remote_args
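A sketch of threading those through the convenience wrapper (write_kafka and KafkaDatasink are the names this PR adds; the two extra parameters and their forwarding to write_datasink are the reviewer's suggestion, not existing code):

from typing import Any, Dict, Optional

def write_kafka(
    self,
    topic: str,
    bootstrap_servers: str,
    ray_remote_args: Optional[Dict[str, Any]] = None,
    concurrency: Optional[int] = None,
) -> None:
    # Pass execution-level options straight through to write_datasink.
    datasink = KafkaDatasink(topic=topic, bootstrap_servers=bootstrap_servers)
    self.write_datasink(
        datasink,
        ray_remote_args=ray_remote_args,
        concurrency=concurrency,
    )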

@@ -0,0 +1,537 @@
import json
Member


Let's move this test into test_kafka.py

key = self._extract_key(row)

# Serialize value
value = self._serialize_value(row)
Member


This is valid

self,
topic: str,
bootstrap_servers: str,
key_field: str | None = None,
Member


Use Optional

future.add_errback(
lambda e: self.delivery_callback(exception=e)
)
futures.append(future)
Member


Do you think we should flush the future buffer once there are N items in it? With millions of rows this could accumulate.
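A sketch of one way to bound it (MAX_PENDING_FUTURES is a hypothetical constant; the flush/clear placement mirrors the send loop above):

MAX_PENDING_FUTURES = 10_000  # hypothetical cap on in-flight sends

futures.append(future)
if len(futures) >= MAX_PENDING_FUTURES:
    # Wait for outstanding sends to be acknowledged, then drop the references so
    # millions of rows don't keep millions of future objects alive. In practice
    # each future should be checked for errors before being discarded.
    producer.flush()
    futures.clear()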

from kafka import KafkaProducer
from kafka.errors import KafkaError, KafkaTimeoutError

from ray.data import Datasink
Member


This will cause a circular import error.

Suggested change
from ray.data import Datasink
from ray.data.datasource.datasink import Datasink

"""
Convenience method to write Ray Dataset to Kafka.

Example:
Member


Suggested change
Example:
Examples:

ConsumptionAPI matches on the word "Examples" in order to insert the generated docs.

# Close the producer
producer.close(timeout=5.0)

return {"total_records": total_records, "failed_messages": failed_messages}
Member


We should log these
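For example (logger setup and messages are illustrative, not the PR's code):

import logging

logger = logging.getLogger(__name__)

# Close the producer
producer.close(timeout=5.0)

if failed_messages:
    logger.warning(
        "Kafka write completed with %d failed messages out of %d records",
        failed_messages,
        total_records,
    )
else:
    logger.info("Wrote %d records to Kafka topic %r", total_records, self.topic)

return {"total_records": total_records, "failed_messages": failed_messages}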


Labels

community-contribution (Contributed by the community), data (Ray Data-related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Add Kafka as a new Datasink

2 participants