
Conversation

@cgpoh commented Apr 21, 2025

Summary by CodeRabbit

  • New Features

    • Added support for specifying the data platform for Spark Structured Streaming sources and sinks via new configuration options, improving lineage capture for platforms without recognizable prefixes.
    • Introduced configuration to handle multiple streaming environments and platform instances within a single Spark application for better dataset URN distinction.
  • Documentation

    • Updated documentation to describe the new configuration properties for platform specification in streaming sources and sinks.

@coderabbitai bot commented Apr 21, 2025

Walkthrough

This change introduces a configuration mechanism to explicitly specify the data platform for Spark Structured Streaming sources and sinks within DataHub lineage capture. It addresses cases where the existing regular expression-based platform detection fails, such as with Iceberg tables lacking recognizable prefixes. A new Spark configuration property spark.datahub.streaming.platform.instance is added as a fallback platform identifier. The core logic in the generateUrnFromStreamingDescription method is enhanced to use this property when regex parsing does not yield a platform, enabling consistent dataset URN generation for streaming sources and sinks. Additionally, the change supports grouping streams via streaming_alias and controlling platform instance injection in dataset names.

Changes

File(s) and change summary:
  • active/000-spark-streaming-sink-source-platform.md: Added a new RFC detailing the configuration mechanism for specifying streaming source/sink platforms, fallback logic, and the streaming alias concept.
  • .../SparkStreamingEventToDatahub.java: Enhanced the generateUrnFromStreamingDescription method to use the new fallback platform config; added overloaded methods supporting an explicit sink/source distinction.
  • (Configuration references): Added the new Spark config property spark.datahub.streaming.platform.instance to specify a fallback streaming platform; introduced streaming_alias and related configs.

Sequence Diagram(s)

sequenceDiagram
    participant SparkApp
    participant DataHubLineage

    SparkApp->>DataHubLineage: Trigger lineage capture with streaming description
    DataHubLineage->>DataHubLineage: Parse streaming description with regex
    alt Regex matches
        DataHubLineage->>DataHubLineage: Extract namespace, path, platform and generate URN
    else Regex fails
        DataHubLineage->>DataHubLineage: Check for fallback platform config (spark.datahub.streaming.platform.instance)
        alt Fallback config found
            DataHubLineage->>DataHubLineage: Use fallback platform to generate URN (source or sink)
        else No config found
            DataHubLineage->>DataHubLineage: Return empty optional
        end
    end
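
As a companion to the diagram, here is a minimal standalone Java sketch of the same control flow. It is an illustration only: plain strings stand in for DataHub's DatasetUrn and SparkLineageConf types, and the fallbackPlatform parameter stands in for the new configuration lookup.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingUrnSketch {
  // Anchored form of the description pattern discussed in the review comments below.
  private static final Pattern STREAM_DESC_PATTERN = Pattern.compile("^(.+?)\\[(.+)]$");

  static Optional<String> generateUrn(String description, String fallbackPlatform) {
    Matcher m = STREAM_DESC_PATTERN.matcher(description);
    if (m.matches()) {
      // Regex matched: the platform prefix and path come from the description itself.
      String platform = m.group(1).toLowerCase();
      String path = m.group(2);
      return Optional.of("urn:li:dataset:(urn:li:dataPlatform:" + platform + "," + path + ",PROD)");
    }
    if (fallbackPlatform != null && !fallbackPlatform.trim().isEmpty()) {
      // Regex failed (e.g. an Iceberg table name with no prefix): use the configured fallback.
      return Optional.of(
          "urn:li:dataset:(urn:li:dataPlatform:" + fallbackPlatform + "," + description + ",PROD)");
    }
    return Optional.empty(); // no match and no fallback configured
  }

  public static void main(String[] args) {
    System.out.println(generateUrn("KafkaV2[Subscribe[events]]", null)); // parsed from the prefix
    System.out.println(generateUrn("warehouse.db.events", "iceberg"));   // fallback platform used
    System.out.println(generateUrn("warehouse.db.events", null));        // Optional.empty
  }
}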

Poem

🐇 A hop, a skip, a streaming spark,
New configs guide us through the dark.
When regex fails to find its way,
Fallbacks now will save the day.
Platforms clear, no more confusion,
Lineage strong—no more illusion!



@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
active/000-spark-streaming-sink-source-platform.md (4)

1-5: Refine metadata formatting and link styling.

The header currently uses bare URLs and placeholders, e.g.:

  • RFC PR: https://github.com/datahub-project/rfcs/pull/11
  • Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
  • Implementation PR(s): (leave this empty)

Proposed changes:

  • Convert bare URLs to markdown links.
  • Remove or clearly mark unavailable fields (e.g., use “None” or “N/A”).

Example diff:

- `RFC PR: https://github.com/datahub-project/rfcs/pull/11`
+ `RFC PR: [DataHub RFC #11](https://github.com/datahub-project/rfcs/pull/11)`
- `Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)`
+ `Discussion Issue: None`
- `Implementation PR(s): (leave this empty)`
+ `Implementation PR(s): N/A`
🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 2-2: Bare URL used (MD034, no-bare-urls)


10-12: Enhance the Summary with explicit fallback configuration keys.

The summary currently states:

Allows configuration of Spark structured streaming sink and source platform.

Consider adding the new config properties by name to make the intent crystal clear:

Introduce spark.datahub.streaming.source.platform and spark.datahub.streaming.sink.platform as explicit fallback configurations when regex-based platform detection fails.


26-33: Optimize regex usage by precompiling the Pattern.

Currently the regex is compiled on each invocation:

String pattern = "(.*?)\\[(.*)]";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(description);

This can be expensive in high-throughput streaming contexts. Instead, move it to a static final field:

+ private static final Pattern STREAM_DESC_PATTERN =
+     Pattern.compile("(.*?)\\[(.*)]");
...
-   String pattern = "(.*?)\\[(.*)]";
-   Pattern r = Pattern.compile(pattern);
-   Matcher m = r.matcher(description);
+   Matcher m = STREAM_DESC_PATTERN.matcher(description);
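
As a standalone illustration of this suggestion, the sketch below compiles the pattern once and reuses it. Pattern instances are immutable and thread-safe, so a static final field is safe to share; only the per-call Matcher is stateful.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuseDemo {
  // Compiled once; Pattern is immutable and thread-safe, unlike Matcher.
  private static final Pattern STREAM_DESC_PATTERN = Pattern.compile("(.*?)\\[(.*)]");

  static void parse(String description) {
    Matcher m = STREAM_DESC_PATTERN.matcher(description); // no recompilation per call
    if (m.find()) {
      System.out.println("platform=" + m.group(1) + " path=" + m.group(2));
    }
  }

  public static void main(String[] args) {
    parse("KafkaV2[Subscribe[events]]"); // platform=KafkaV2 path=Subscribe[events]
    parse("FileSink[/tmp/out]");         // platform=FileSink path=/tmp/out
  }
}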

41-46: Use multi-catch to simplify exception handling.

The two catch blocks under file/delta lake path parsing both return Optional.empty(). You can collapse them:

-        } catch (InstantiationException e) {
-          return Optional.empty();
-        } catch (URISyntaxException e) {
-          log.error("Failed to parse path {}", path, e);
-          return Optional.empty();
-        }
+        } catch (InstantiationException | URISyntaxException e) {
+          log.error("Failed to parse path {}", path, e);
+          return Optional.empty();
+        }

This reduces duplication and makes the intent clearer.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8852b24 and eec5125.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

61-75: ⚠️ Potential issue

Fix infinite recursion and incorrect overload in fallback logic.

The fallback block passes a String where the method expects a boolean isSink: without a matching String overload this fails to compile, and with one the call recurses endlessly. Please replace the recursive calls with direct URN construction or introduce a proper overload. For example:

-if (sparkLineageConf.getOpenLineageConf().getStreamingSinkPlatform() != null && isSink) {
-  return generateUrnFromStreamingDescription(
-    description,
-    sparkLineageConf,
-    sparkLineageConf.getOpenLineageConf().getStreamingSinkPlatform()
-  );
-} else if (sparkLineageConf.getOpenLineageConf().getStreamingSourcePlatform() != null && !isSink) {
-  return generateUrnFromStreamingDescription(
-    description,
-    sparkLineageConf,
-    sparkLineageConf.getOpenLineageConf().getStreamingSourcePlatform()
-  );
-} else {
-  return Optional.empty();
-}
+// Fallback: directly construct URN with provided platform
+if (sparkLineageConf.getOpenLineageConf().getStreamingSinkPlatform() != null && isSink) {
+  String platform = sparkLineageConf.getOpenLineageConf().getStreamingSinkPlatform();
+  return Optional.of(new DatasetUrn(
+      new DataPlatformUrn(platform),
+      description,
+      sparkLineageConf.getOpenLineageConf().getFabricType()));
+} else if (sparkLineageConf.getOpenLineageConf().getStreamingSourcePlatform() != null && !isSink) {
+  String platform = sparkLineageConf.getOpenLineageConf().getStreamingSourcePlatform();
+  return Optional.of(new DatasetUrn(
+      new DataPlatformUrn(platform),
+      description,
+      sparkLineageConf.getOpenLineageConf().getFabricType()));
+} else {
+  return Optional.empty();
+}

This removes the erroneous overload call and prevents infinite recursion.
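
If the overload route is preferred instead, a disambiguating helper avoids the String/boolean clash entirely. The helper name below is illustrative; the URN construction mirrors the diff above:

private static Optional<DatasetUrn> generateUrnForPlatform(
    String description, SparkLineageConf sparkLineageConf, String platform) {
  // Terminal helper: builds the URN directly and never calls back into
  // generateUrnFromStreamingDescription, so recursion is impossible.
  return Optional.of(new DatasetUrn(
      new DataPlatformUrn(platform),
      description,
      sparkLineageConf.getOpenLineageConf().getFabricType()));
}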

🧹 Nitpick comments (4)
active/000-spark-streaming-sink-source-platform.md (4)

1-5: Populate metadata fields for Discussion Issue and Implementation PR(s).

The Discussion Issue and Implementation PR(s) entries are currently empty placeholders. Please either fill in the relevant GitHub issue/PR links or remove these lines to avoid confusion.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 2-2: Bare URL used (MD034, no-bare-urls)


2-2: Convert bare RFC PR URL to a Markdown link.

The RFC PR URL is a bare link. For consistency and readability, format it as a Markdown hyperlink, for example:

- RFC PR: [#11](https://github.com/datahub-project/rfcs/pull/11)
🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 2-2: Bare URL used (MD034, no-bare-urls)


14-15: Convert inline bare URL to a Markdown link.

The DataHub code reference is currently a bare URL in parentheses. Please convert it to a Markdown link, e.g.:

in the DataHub [code](https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/acryl-spark-lineage/src/main/java/datahub/spark/converter/SparkStreamingEventToDatahub.java#L145)

23-29: Add language identifier to the code fence.

The Java snippet for the new overload is missing a language specifier. Adding java will enable proper syntax highlighting:

-``` 
+```java
 public static Optional<DatasetUrn> generateUrnFromStreamingDescription(
       String description, SparkLineageConf sparkLineageConf) {
     return SparkStreamingEventToDatahub.generateUrnFromStreamingDescription(
             description, sparkLineageConf, false);
 }
-```
+```
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between eec5125 and 4423942.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
active/000-spark-streaming-sink-source-platform.md (6)

24-28: Add Default Overload for Streaming Description
The new two‐parameter overload cleanly delegates to the three‐parameter version with isSink=false, which is sensible for source contexts. Consider adding a brief Javadoc explaining that this entrypoint treats descriptions as non‐sinks by default.


37-44: Avoid Repeated Group Access and Improve Clarity
You extract m.group(2) twice when handling Kafka paths. Reuse the path variable to make the intent clearer:

- String path = m.group(2);
  log.debug("Streaming description Platform: {}, Path: {}", platform, path);
  if (platform.equals(KAFKA_PLATFORM)) {
-   path = getKafkaTopicFromPath(m.group(2));
+   path = getKafkaTopicFromPath(path);
  } else if (platform.equals(FILE_PLATFORM) || platform.equals(DELTA_LAKE_PLATFORM)) {

This reduces duplication and scope for error.


44-54: Consolidate Exception Handling for File/Delta Paths
Both InstantiationException and URISyntaxException lead to returning Optional.empty(). You could merge them into a single multi‐catch and log uniformly:

- } catch (InstantiationException e) {
-   return Optional.empty();
- } catch (URISyntaxException e) {
-   log.error("Failed to parse path {}", path, e);
-   return Optional.empty();
- }
+ } catch (InstantiationException | URISyntaxException e) {
+   log.error("Failed to instantiate or parse HDFS path {}", path, e);
+   return Optional.empty();
+ }

This simplifies the control flow and ensures both errors are logged consistently.


55-60: Clarify Default Platform URN Construction
After regex matching, all other platforms map directly to a DatasetUrn. Consider adding a comment or Javadoc here to note that this branch handles non‐Kafka, non‐file, non‐delta cases detected via regex.


68-73: Symmetric Fallback Logic for Source & Input Validation
Similarly, the source fallback uses the (String) overload. It may be worthwhile to check that the configured streamingSourcePlatform is non‐empty (not just non‐null) before recursing; an empty string could produce an invalid URN.


80-89: Validate Provided Streaming Platform Value
In the (String) overload, consider guarding against blank or malformed streamingPlatform values. For example:

if (streamingPlatform == null || streamingPlatform.trim().isEmpty()) {
  log.warn("Empty streamingPlatform config; cannot generate URN for description {}", description);
  return Optional.empty();
}

This avoids producing URNs with empty platform identifiers.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 4423942 and d1e49a4.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)
🔇 Additional comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

61-67: Verify Overload Resolution for Sink Fallback
The fallback invokes generateUrnFromStreamingDescription(description, sparkLineageConf, streamingSinkPlatform) which relies on the (String) overload. Please confirm the compiler picks the intended method and that there’s no ambiguity with the boolean overload.

Comment on lines 31 to 36
public static Optional<DatasetUrn> generateUrnFromStreamingDescription(
String description, SparkLineageConf sparkLineageConf, boolean isSink) {
String pattern = "(.*?)\\[(.*)]";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(description);
if (m.find()) {

🛠️ Refactor suggestion

Refine the Regex for Robust Matching
The pattern (.*?)\[(.*)] may over‐match when descriptions contain multiple or nested brackets. It would be safer to anchor the regex and ensure it only matches the entire string. For example:

String pattern = "^(.+?)\\[(.+)]$";

This prevents unintended partial matches.
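
To make the failure mode concrete, here is a small standalone check (the malformed description is hypothetical): find() with the unanchored pattern accepts trailing garbage, while matches() with the anchored pattern rejects it.

import java.util.regex.Pattern;

public class AnchorDemo {
  public static void main(String[] args) {
    String desc = "KafkaV2[Subscribe[events]] (stale)"; // hypothetical malformed description
    // Unanchored + find(): succeeds on a partial match despite the trailing text.
    System.out.println(Pattern.compile("(.*?)\\[(.*)]").matcher(desc).find());      // true
    // Anchored + matches(): the whole string must have the namespace[path] shape.
    System.out.println(Pattern.compile("^(.+?)\\[(.+)]$").matcher(desc).matches()); // false
  }
}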

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

33-34: Anchor and refine the regex for robust matching.
The current pattern "(.*?)\\[(.*)]" may over‐match or accept unintended inputs (e.g., nested brackets). It’s safer to anchor it and escape properly:

String pattern = "^(.*?)\\[(.*)\\]$";

This ensures the entire description matches exactly one namespace and one path.

🧹 Nitpick comments (8)
active/000-spark-streaming-sink-source-platform.md (8)

1-5: Remove or complete placeholder fields in the RFC header.
The Discussion Issue and Implementation PR(s) are currently empty placeholders. To avoid confusion, either fill these with the actual issue/PR links or remove these lines entirely.


10-11: Clarify summary with new configuration property.
The summary should explicitly mention the spark.datahub.streaming_platform fallback property so readers immediately see the core change. For example:

Adds spark.datahub.streaming_platform as a fallback platform identifier when regex parsing of streaming descriptions fails.


44-52: Consolidate exception handling in HDFS path parsing.
The two catch blocks for InstantiationException and URISyntaxException both result in Optional.empty(), but only one logs the error. Consider collapsing into a single multi‐catch to reduce duplication and ensure all failures get logged for diagnostics:

-} catch (InstantiationException e) {
-    return Optional.empty();
-} catch (URISyntaxException e) {
-    log.error("Failed to parse path {}", path, e);
-    return Optional.empty();
-}
+} catch (InstantiationException | URISyntaxException e) {
+    log.error("Failed to parse HDFS path '{}' in description '{}'", path, description, e);
+    return Optional.empty();
+}

50-51: Enhance error logging context.
When logging a bad URI, include both the description and whether it was a sink or source to help pinpoint issues in production:

log.error(
    "Failed to parse path '{}' for {} with description '{}'",
    path,
    isSink ? "sink" : "source",
    description,
    e
);

59-61: Handle nullable CatalogTableDataset more idiomatically.
Rather than checking for null, leverage Optional directly to keep the API consistent:

- CatalogTableDataset catalogTableDataset = CatalogTableDataset.create(...);
- if (catalogTableDataset == null) {
-   return Optional.empty();
- } else {
-   return Optional.of(catalogTableDataset.urn());
- }
+ return Optional.ofNullable(CatalogTableDataset.create(...))
+     .map(CatalogTableDataset::urn);

20-23: Hyphenate “Iceberg-based” for correct spelling.
LanguageTool flags that “Iceberg based” should be “Iceberg-based dataset URNs”.

🧰 Tools
🪛 LanguageTool: ~22-~22: This expression is usually spelled with a hyphen (BASED_HYPHEN)


77-77: Hyphenate heading: “Configuring Iceberg-based dataset URNs”.
For consistency and accuracy, update the heading to:

### Configuring Iceberg-based dataset URNs
🧰 Tools
🪛 LanguageTool: ~77-~77: This expression is usually spelled with a hyphen (BASED_HYPHEN)


112-112: Specify code block language.
The fenced block showing configuration examples would benefit from a language tag (e.g., bash or ini) for syntax highlighting.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 112-112: Fenced code blocks should have a language specified (MD040, fenced-code-language)

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d1e49a4 and 1452d62.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

🔇 Additional comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

57-61: Verify fallback configuration property mapping.
Make sure sparkLineageConf.getOpenLineageConf().getStreamingPlatform() actually reads from the newly introduced spark.datahub.streaming_platform Spark config key. It’s easy for naming mismatches to sneak in.

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (11)
active/000-spark-streaming-sink-source-platform.md (11)

1-5: Use a proper YAML front-matter block and remove placeholders
Rather than plain list items, consider using a fenced YAML front matter at the top of the RFC and either fill in or remove the “Discussion Issue” and “Implementation PR(s)” placeholders. This makes the metadata machine-readable and follows existing RFC conventions.

Proposed diff:

- - Start Date: 2025-04-21
- - RFC PR: [#11](https://github.com/datahub-project/rfcs/pull/11)
- - Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)
- - Implementation PR(s): (leave this empty)
+ ---
+ Start Date: 2025-04-21
+ RFC PR: 11
+ Discussion Issue: <fill-or-remove>
+ Implementation PR(s): <fill-or-remove>
+ ---

10-10: Refine the summary grammar
Current:

Allows configuration of Spark structured streaming sink and source platform.

Suggested:

Allows configuring the data platform for Spark Structured Streaming sources and sinks.


14-15: Clarify hyperlink context in Motivation
Rather than linking with generic text [code], add context to the link, e.g.:

...in the DataHub converter implementation (see SparkStreamingEventToDatahub.java#L145)...


22-22: Hyphenate “Iceberg-based”
The phrase “Configuring Iceberg based dataset URNs” should read “Configuring Iceberg-based dataset URNs“.

🧰 Tools
🪛 LanguageTool: ~22-~22: This expression is usually spelled with a hyphen (BASED_HYPHEN)


23-29: Add language identifier to code fence
The Java snippet is missing a language marker. Prefix the fence with java for proper syntax highlighting.

Example:

- ```  
+ ```java
  public static Optional<DatasetUrn> generateUrnFromStreamingDescription(
        String description, SparkLineageConf sparkLineageConf) {
    return SparkStreamingEventToDatahub.generateUrnFromStreamingDescription(
            description, sparkLineageConf, false);
}
- ```  
+ ```

31-75: Add java to the second code block fence
The longer snippet showing the overload needs a java fence as well to improve readability and tooling support.


32-36: Anchor the regex to prevent over-matching
The pattern "(.*?)\\[(.*)]" may match unwanted substrings if there are multiple brackets. Consider:

- String pattern = "(.*?)\\[(.*)]";
+ String pattern = "^(.+?)\\[(.+)]$";

This ensures the entire description is of the form namespace[path] rather than a partial match.


77-77: Hyphenate “Iceberg-based” in section header
Change

### Configuring Iceberg based dataset URNs

to

### Configuring Iceberg-based dataset URNs
🧰 Tools
🪛 LanguageTool: ~77-~77: This expression is usually spelled with a hyphen (BASED_HYPHEN)


112-112: Specify language for the properties code fence
The example block should be marked as properties (or ini) to aid in syntax highlighting:

- ```
+ ```properties
🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 112-112: Fenced code blocks should have a language specified (MD040, fenced-code-language)


100-104: Clarify streaming_alias usage
The prose refers to “spark.datahub.streaming.platform.<$platform>.<streaming_alias>.env and ….platformInstance” but omits streaming_io_platform_type. It would be helpful to list all four configurable properties (env, platformInstance, streaming_io_platform_type, plus the fallback spark.datahub.streaming_platform) in one place.


3-4: Remove or complete placeholder for Discussion Issue
Line reads:

Discussion Issue: (GitHub issue this was discussed in before the RFC, if any)

Either remove the parenthetical or fill in the actual issue link to avoid confusion.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1452d62 and 8e7b940.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

33-33: Duplicate: Anchor the regex to avoid partial matches
Your pattern "(.*?)\\[(.*)]" may over‐match when descriptions contain multiple or nested brackets. It’s safer to anchor it, for example:

String pattern = "^(.+?)\\[(.+?)\\]$";

This ensures you only match the entire string with one pair of brackets.

🧹 Nitpick comments (6)
active/000-spark-streaming-sink-source-platform.md (6)

1-4: Populate metadata placeholders
The “Discussion Issue” and “Implementation PR(s)” fields are currently empty. Please fill these in (or remove them if not applicable) to ensure full traceability of this RFC.


10-10: Clarify summary wording
Consider rephrasing to explicitly mention “data platforms” and mirror the RFC title. For example:

Allows configuration of the data platform for Spark Structured Streaming sources and sinks.


22-23: Clarify “streaming spec” reference
The phrase

and streaming spec which is defined in the Configuring Iceberg based dataset URNs section below
is a bit vague. Consider renaming “streaming spec” to something more descriptive (e.g., “alias‐based streaming configuration”) or add a brief definition here:

- …two new Spark configurations: `spark.datahub.streaming_platform` for specifying the streaming platform, and streaming spec which is defined in the *Configuring Iceberg based dataset URNs* section below.
+ …two new Spark configurations: `spark.datahub.streaming_platform` for specifying the streaming platform, and alias‐based streaming configuration (see *Configuring Iceberg-based dataset URNs* below).
🧰 Tools
🪛 LanguageTool: ~22-~22: This expression is usually spelled with a hyphen (BASED_HYPHEN)


77-77: Hyphenate for clarity
MarkdownLint suggests hyphenating “Iceberg based” in the section title. Apply this diff:

- ## Configuring Iceberg based dataset URNs
+ ## Configuring Iceberg-based dataset URNs
🧰 Tools
🪛 LanguageTool: ~77-~77: This expression is usually spelled with a hyphen (BASED_HYPHEN)


101-102: Explain the config-key naming convention
The RFC introduces both spark.datahub.streaming_platform (underscore) and spark.datahub.streaming.platform.<…> (dots). To avoid confusion, please either unify the naming scheme or explicitly call out why the patterns differ.


112-112: Specify fenced code block language
The example block at line 112 is missing a language hint. For better syntax highlighting, add the properties tag:

- ```
+ ```properties
🧰 Tools
🪛 markdownlint-cli2 (0.17.2): 112-112: Fenced code blocks should have a language specified (MD040, fenced-code-language)

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8e7b940 and 5a9a937.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

🔇 Additional comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

57-63: Verify fallback configuration mapping
The code checks getOpenLineageConf().getStreamingPlatform() for null before using it as a fallback. Please confirm that getStreamingPlatform() correctly reads the spark.datahub.streaming_platform Spark config as documented—and update the RFC to document this mapping explicitly.

@coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
active/000-spark-streaming-sink-source-platform.md (4)

1-4: Ensure RFC metadata completeness
The header fields for "Discussion Issue" and "Implementation PR(s)" are placeholders. Please either populate them with valid links/values or remove the entries to avoid confusion.


24-29: Add JavaDoc to explain the overload behavior
Consider documenting the new two-argument overload in Javadoc (or in code comments) to clarify that it delegates to the three-argument version with isSink=false. This will help future readers understand the intended fallback semantics.


33-36: Refine the regex to anchor bracket parsing and use full-string matching
The current "(.*?)\\[(.*)]" combined with m.find() may over-match on nested or partial bracketed strings. To ensure the entire description matches the expected format, consider:

- String pattern = "(.*?)\\[(.*)]";
- Pattern r = Pattern.compile(pattern);
+ private static final Pattern STREAM_DESC_PATTERN =
+     Pattern.compile("^(.+?)\\[(.+)]$");
...
- Matcher m = r.matcher(description);
+ Matcher m = STREAM_DESC_PATTERN.matcher(description);

And replace m.find() with m.matches().


33-35: Move regex compilation to a static constant
Currently, the pattern is recompiled on every invocation. For better performance and readability, define it once as a private static final Pattern at class scope.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 5a9a937 and 0c909d3.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)
🔇 Additional comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

57-65: Verify mapping of Spark config key to getStreamingPlatform()
The fallback looks up sparkLineageConf.getOpenLineageConf().getStreamingPlatform(). Please confirm that this method indeed reads from spark.datahub.streaming_platform (underscore) and not a different key (e.g., with dots). If there’s a mismatch, update either the code or the documentation accordingly.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

40-40: Anchor the regex for more robust matching

The pattern "(.*?)\\[(.*)]" is unanchored and may over‐match when descriptions include multiple or nested brackets. Consider anchoring it to match the full string:

- String pattern = "(.*?)\\[(.*)]";
+ String pattern = "^(.+?)\\[(.+)]$";
🧹 Nitpick comments (2)
active/000-spark-streaming-sink-source-platform.md (2)

14-14: Reduce wordiness in motivation section

Consider simplifying “in order to extract the data platform” to “to extract the data platform” for more concise phrasing.

🧰 Tools
🪛 LanguageTool: ~14-~14: Consider a shorter alternative to avoid wordiness (IN_ORDER_TO_PREMIUM)


117-125: Use standard properties syntax in examples

The example uses colons (key : value), but Java properties conventionally use the equals sign (key=value). For example:

- spark.datahub.streaming.platform.iceberg.stream1.env : PROD
+ spark.datahub.streaming.platform.iceberg.stream1.env=PROD
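
For reference, a corrected block in conventional .properties form might look like the following; the values are illustrative, and only the key names come from the RFC.

# Illustrative values only; key names are taken from the RFC.
spark.datahub.streaming_platform=iceberg
spark.datahub.streaming.platform.iceberg.stream1.env=PROD
spark.datahub.streaming.platform.iceberg.stream1.platformInstance=warehouse_eu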
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 0c909d3 and f65a133.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

🔇 Additional comments (1)
active/000-spark-streaming-sink-source-platform.md (1)

1-5: Add Implementation PR(s) when available

The header’s “Implementation PR(s)” field is empty. Please fill in this link once the corresponding code changes are merged to help readers track the implementation.

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
active/000-spark-streaming-sink-source-platform.md (4)

3-4: Clean up front-matter placeholders.

The fields “Discussion Issue: (…)” and “Implementation PR(s): (…)” are left as generic placeholders. If there is no prior GitHub issue or linked PR, consider removing these lines or explicitly marking them as “None” to avoid confusion.


14-14: Streamline wording.

Consider replacing “in order to extract the data platform” with “to extract the data platform” for improved conciseness.

🧰 Tools
🪛 LanguageTool: ~14-~14: Consider a shorter alternative to avoid wordiness (IN_ORDER_TO_PREMIUM)


64-77: Improve error visibility on fallback URN creation.

In the fallback branch you catch only InstantiationException and swallow it:

} catch (InstantiationException e) {
    return Optional.empty();
}

It’s helpful to log these failures so users can diagnose misconfigurations. For example:

- } catch (InstantiationException e) {
-     return Optional.empty();
- }
+ } catch (InstantiationException e) {
+     log.error("Failed to instantiate CatalogTableDataset for description {} (isSink={}): {}", description, isSink, e.getMessage(), e);
+     return Optional.empty();
+ }

This will aid debugging if the configured platform instance or alias is invalid.


109-113: Clarify how streaming_alias is selected.

The doc refers to a “streaming_alias” for grouping multiple streams, but it’s not clear how aliases are discovered or applied. Consider adding a brief section on how to configure or enumerate aliases (e.g., via a comma-separated spark.datahub.streaming.platform.<platform>.aliases property), and how the code picks the correct <alias> block when constructing URNs.
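
If the RFC adopts something like the comma-separated aliases property suggested above, the lookup could be a few lines of Java. Everything in this sketch is hypothetical (the aliases and prefix keys and the prefix-matching rule are illustrative assumptions, not part of the RFC):

import java.util.Arrays;
import java.util.Map;
import java.util.Optional;

public class AliasLookupSketch {
  // Hypothetical: pick the first alias whose configured name prefix matches the dataset.
  static Optional<String> resolveAlias(Map<String, String> conf, String platform, String datasetName) {
    String aliases = conf.getOrDefault("spark.datahub.streaming.platform." + platform + ".aliases", "");
    return Arrays.stream(aliases.split("\\s*,\\s*"))
        .filter(a -> !a.isEmpty())
        .filter(a -> {
          String prefix = conf.get("spark.datahub.streaming.platform." + platform + "." + a + ".prefix");
          return prefix != null && datasetName.startsWith(prefix);
        })
        .findFirst();
  }

  public static void main(String[] args) {
    Map<String, String> conf = Map.of(
        "spark.datahub.streaming.platform.iceberg.aliases", "stream1, stream2",
        "spark.datahub.streaming.platform.iceberg.stream1.prefix", "warehouse.db1",
        "spark.datahub.streaming.platform.iceberg.stream2.prefix", "warehouse.db2");
    System.out.println(resolveAlias(conf, "iceberg", "warehouse.db2.events")); // Optional[stream2]
  }
}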

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f65a133 and a274a54.

📒 Files selected for processing (1)
  • active/000-spark-streaming-sink-source-platform.md (1 hunks)

Comment on lines +40 to +43
String pattern = "(.*?)\\[(.*)]";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(description);
if (m.find()) {

🛠️ Refactor suggestion

Anchor the regex to prevent over-matching.

The pattern

String pattern = "(.*?)\\[(.*)]";

is unanchored and could over-match when there are multiple or nested brackets in the description. To ensure you only match the full string and capture minimal groups, anchor the expression, for example:

- String pattern = "(.*?)\\[(.*)]";
+ String pattern = "^(.+?)\\[(.+)]$";

This change guards against unintended partial matches.
