-
Notifications
You must be signed in to change notification settings - Fork 3
feat: Add CLP UDF docs #75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-0.293-clp-connector
Are you sure you want to change the base?
Conversation
WalkthroughAdds a CLP Functions documentation block to the CLP connector docs describing path‑based JSON helper functions, wildcard column functions, Changes
Sequence Diagram(s)Omitted — documentation-only change; no control-flow modifications. Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-docs/src/main/sphinx/connector/clp.rst(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
- GitHub Check: maven-checks (17.0.13)
- GitHub Check: maven-checks (8.0.442)
- GitHub Check: test (:presto-docs)
- GitHub Check: prestissimo-worker-images-build
quinntaylormitchell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coming along; a few things to address.
|
Thanks for the review! @quinntaylormitchell. I applied most of your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)
327-340: Consolidate duplicate introductory paragraphs.Lines 327–333 contain two separate paragraphs that convey the same message: the purpose of CLP functions to handle dynamic fields while keeping table metadata concise. Consolidate these into a single, cohesive introduction.
Consider consolidating to something like:
- Semi-structured logs can have many potential keys, which can lead to very wide Presto tables. To keep table metadata - concise and still preserve access to dynamic fields, the connector provides two sets of functions that are specific to - the CLP connector. These functions are not part of standard Presto SQL. - - To enable access to dynamic or less common fields not present in the exposed schema, CLP provides two set of functions - to help users query flexible log schemas while keeping the table metadata definition concise. These functions are only - available in the CLP connector and are not part of standard Presto SQL. + Semi-structured logs can have many potential keys, which can lead to very wide Presto tables. To keep table metadata + concise and still preserve access to dynamic fields, the connector provides two sets of functions that are specific to + the CLP connector. These functions are not part of standard Presto SQL.
378-380: Fix unquoted JSON paths and non-existent function name in examples.The examples at lines 378–384 contradict the documented requirement (line 369) that "JSON paths must be constant string literals":
- Line 378:
CLP_GET_STRING(msg.author)should beCLP_GET_STRING('msg.author')- Line 380: Uses non-existent
CLP_GET_INT; should beCLP_GET_BIGINT('msg.timestamp')- Line 382:
CLP_GET_STRING_ARRAY(msg.tags)should beCLP_GET_STRING_ARRAY('msg.tags').. code-block:: sql - SELECT CLP_GET_STRING(msg.author) AS author + SELECT CLP_GET_STRING('msg.author') AS author FROM clp.default.table_1 - WHERE CLP_GET_INT('msg.timestamp') > 1620000000; + WHERE CLP_GET_BIGINT('msg.timestamp') > 1620000000; - SELECT CLP_GET_STRING_ARRAY(msg.tags) AS tags + SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags FROM clp.default.table_2 WHERE CLP_GET_BOOL('msg.is_active') = true;
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-docs/src/main/sphinx/connector/clp.rst(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: prestissimo-worker-images-build
quinntaylormitchell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few more comments.
also I noticed you didn't apply the "sentence-case heading" suggestions; I feel that you should, but I understand that the rest of the document doesn't use them, so it's up to you.
|
@quinntaylormitchell Thanks for the follow-up! Regarding the “sentence-case heading,” I noticed that other docs capitalize the first letter of every word, so I followed the same style. |
quinntaylormitchell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)
375-381: Fix JSON path examples to match documented requirements and correct function name.Multiple issues in the examples:
- Line 377: Uses
CLP_GET_INT, which doesn't exist. Should useCLP_GET_BIGINT(as documented in line 347-349).- Lines 375, 379: JSON paths (
msg.authorandmsg.tags) are not quoted, but the constraint at line 366 explicitly requires paths to be constant string literals. The WHERE clause at line 381 correctly quotes the path ('msg.is_active'), so these should match.Apply this diff to fix the examples:
SELECT CLP_GET_STRING(msg.author) AS author FROM clp.default.table_1 - WHERE CLP_GET_INT('msg.timestamp') > 1620000000; + WHERE CLP_GET_BIGINT('msg.timestamp') > 1620000000; - SELECT CLP_GET_STRING_ARRAY(msg.tags) AS tags + SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags FROM clp.default.table_2 WHERE CLP_GET_BOOL('msg.is_active') = true;Also update line 375 to quote the path:
- SELECT CLP_GET_STRING(msg.author) AS author + SELECT CLP_GET_STRING('msg.author') AS author
407-426: Document the split filter configuration limitation for wildcard functions.The current note correctly restricts wildcard functions to WHERE clauses only, but it omits that they also cannot be used in split filter configurations (configured via
clp.split-filter-config). This is an important constraint since split filters require concrete field names for metadata validation, and wildcard functions do not correspond to specific columns. The supported split filter expressions (line 224) do not include wildcard functions.Add a bullet point clarifying this limitation:
- Wildcard functions must appear **only in filter conditions** (`WHERE` clause). They cannot be selected and cannot be passed as arguments to other functions. + - Wildcard functions cannot be used in split filter configurations (see :ref:`Split Filter Config File<split-filter-config-file>`) + because split filters require concrete column field names.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-docs/src/main/sphinx/connector/clp.rst(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (:presto-docs)
- GitHub Check: maven-checks (8.0.442)
- GitHub Check: maven-checks (17.0.13)
- GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)
323-337: LGTM!The introduction clearly explains the three function categories and accurately conveys the performance benefits of the connector's query optimization rewrites.
443-476: LGTM!The JSON string function section is well-documented with clear constraints and practical examples. The restriction to SELECT list only is clearly stated, and the examples effectively demonstrate common use cases.
gibber9809
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good! Just a few comments about things that might be worth clarifying.
|
|
||
| .. note:: | ||
|
|
||
| This function can only be used in the ``SELECT``` list to retrieve the complete JSON representation of each record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "SELECT list" common terminology? I think I'd expect something more like "list of projected columns in a SELECT statement", but we can keep the current working if "SELECT list" is widespread.
|
|
||
| - JSON paths must be **constant string literals**; variables are not supported. | ||
| - Wildcards (e.g., ``msg.*.ts``) are **not supported**. | ||
| - If a path is invalid or missing, the function returns ``NULL`` rather than raising an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think its worth elaborationg? i.e. "If a path is invalid or missing in a given record, "?
| records in their raw JSON form. | ||
|
|
||
| Similar to the path-based and wildcard functions, this function is rewritten during query optimization to a special | ||
| internal column. During query execution, this column is serialized into a JSON string that represents the full record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't include metadata columns from splits right? Do you think its worth highlighting the distinction between JSON record and metadata columns here? I think it could be confusing to a reader thinking in terms of normal SQL records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't. Currently we don't have support of metadata projection, probably can edit it later when we have the support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
presto-docs/src/main/sphinx/connector/clp.rst (1)
407-426: Consider documenting split filter limitation for wildcard functions.The documentation correctly restricts wildcard functions to
WHEREclauses only. However, per the learnings from prior reviews, wildcard functions also cannot be used in split filter configurations (configured viaclp.split-filter-config), since the split filter mechanism requires concrete field names that map to specific metadata columns, while wildcard functions don't correspond to concrete column names.Consider adding a note that clarifies this constraint to help users understand why split filter configurations fail with wildcard functions.
Consider adding a sentence such as:
- Wildcard functions must appear **only in filter conditions** (`WHERE` clause). They cannot be selected and cannot be passed as arguments to other functions. + - Wildcard functions cannot be used in ``clp.split-filter-config`` split filter configurations, as split filters + require concrete field names mapped to metadata columns; use explicit metadata field names in split filter + configurations instead.Based on learnings.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
presto-docs/src/main/sphinx/connector/clp.rst(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rst
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.
Applied to files:
presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (:presto-docs)
- GitHub Check: maven-checks (17.0.13)
- GitHub Check: maven-checks (8.0.442)
- GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (3)
presto-docs/src/main/sphinx/connector/clp.rst (3)
323-368: Path-based functions section looks good.The function signatures, return types, and usage constraints are well-documented and consistent. The clarification that wildcards are not supported is clear.
370-381: Examples correctly use quoted paths.The examples properly demonstrate the documented requirement that JSON paths must be constant string literals, making them consistent with the usage notes above.
462-475: JSON String Function examples correctly demonstrate constraints.The examples properly show that
CLP_GET_JSON_STRING()is only used in theSELECTlist, never in filter predicates, and demonstrate practical combinations with other CLP functions. The examples align with the documented usage constraints.
| JSON String Function | ||
| ==================== | ||
|
|
||
| The ``CLP_GET_JSON_STRING``` function provides a convenient way to retrieve the entire log record—including both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix RST backtick formatting errors.
Lines 446 and 458 have mismatched backticks that will render incorrectly in reStructuredText:
- Line 446:
\`CLP_GET_JSON_STRING```should be``CLP_GET_JSON_STRING()``` - Line 458:
\`SELECT```should be``SELECT```
- The ``CLP_GET_JSON_STRING``` function provides a convenient way to retrieve the entire log record—including both
+ The ``CLP_GET_JSON_STRING()`` function provides a convenient way to retrieve the entire log record—including both
- This function can only be used in the list of projected columns in a ``SELECT``` clause to retrieve the complete
+ This function can only be used in the list of projected columns in a ``SELECT`` clause to retrieve the completeAlso applies to: 458-458
🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst around lines 446 and 458, there
are mismatched backticks causing RST rendering errors; update line 446 to use
the function name with parentheses and proper double backticks
(``CLP_GET_JSON_STRING()``) and update line 458 to close the ``SELECT`` inline
literal with matching double backticks, ensuring both inline code spans use
balanced backticks.
Description
This PR adds docs for
CLP_GET_*andCLP_WILDCARD_*UDFs.Checklist
breaking change.
Validation performed
Summary by CodeRabbit