Skip to content

Conversation

@wraymo
Copy link

@wraymo wraymo commented Oct 3, 2025

Description

This PR adds docs for CLP_GET_* and CLP_WILDCARD_* UDFs.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Summary by CodeRabbit

  • Documentation
    • Added a CLP Functions guide describing path-based JSON helpers and wildcard column helpers, their return types, constraints (constant string literals only, no wildcards) and NULL behaviour.
    • Documented usage patterns, supported operators, filter-only restrictions, and examples for strings, numbers, booleans and arrays.
    • Added a "SQL support" section and noted duplicated content to be consolidated.

@coderabbitai
Copy link

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a CLP Functions documentation block to the CLP connector docs describing path‑based JSON helper functions, wildcard column functions, CLP_GET_JSON_STRING, usage rules (constant path args, no wildcards in paths), NULL/invalid‑path behavior, examples, supported operators, optimizer rewrite notes, and a duplicated block present in the file.

Changes

Cohort / File(s) Summary
Documentation — CLP functions
presto-docs/src/main/sphinx/connector/clp.rst
Adds a CLP Functions section documenting Path‑Based Functions (CLP_GET_STRING, CLP_GET_BIGINT, CLP_GET_DOUBLE, CLP_GET_BOOL, CLP_GET_STRING_ARRAY), Wildcard Column Functions (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, CLP_WILDCARD_FLOAT_COLUMN, CLP_WILDCARD_BOOL_COLUMN), and CLP_GET_JSON_STRING. Documents constraints (path args must be constant string literals; no wildcards inside path args), behavior on missing/invalid paths (returns NULL), supported operators and examples, optimizer rewrite behavior to schema/KQL references, includes a "SQL support" subsection, and contains a duplicated block in the file.

Sequence Diagram(s)

Omitted — documentation-only change; no control-flow modifications.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Attention points:
    • Confirm whether the duplicated CLP Functions block should be consolidated or removed.
    • Verify correctness of function names, example snippets, and the list of supported operators.
    • Validate the phrasing about optimizer rewrite behavior (schema-backed / KQL references).

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description provides a clear summary of changes but uses a different template structure than required. Key required sections (Motivation, Impact, Test Plan, Release Notes) from the repository template are missing or incomplete. Fill in all required template sections: Motivation and Context, Impact, Test Plan, and Release Notes. Align the description with the repository's standard template format.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the primary change: adding documentation for CLP UDFs. It follows Conventional Commits format and directly relates to the changeset.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 971443a and e750458.

📒 Files selected for processing (1)
  • presto-docs/src/main/sphinx/connector/clp.rst (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.

Applied to files:

  • presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test (8.0.442, :presto-tests -P ci-only-local-queries)
  • GitHub Check: maven-checks (17.0.13)
  • GitHub Check: maven-checks (8.0.442)
  • GitHub Check: test (:presto-docs)
  • GitHub Check: prestissimo-worker-images-build

Copy link

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coming along; a few things to address.

@wraymo
Copy link
Author

wraymo commented Oct 27, 2025

Thanks for the review! @quinntaylormitchell. I applied most of your suggestions.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)

327-340: Consolidate duplicate introductory paragraphs.

Lines 327–333 contain two separate paragraphs that convey the same message: the purpose of CLP functions to handle dynamic fields while keeping table metadata concise. Consolidate these into a single, cohesive introduction.

Consider consolidating to something like:

- Semi-structured logs can have many potential keys, which can lead to very wide Presto tables. To keep table metadata
- concise and still preserve access to dynamic fields, the connector provides two sets of functions that are specific to
- the CLP connector. These functions are not part of standard Presto SQL.
-
- To enable access to dynamic or less common fields not present in the exposed schema, CLP provides two set of functions
- to help users query flexible log schemas while keeping the table metadata definition concise. These functions are only
- available in the CLP connector and are not part of standard Presto SQL.
+ Semi-structured logs can have many potential keys, which can lead to very wide Presto tables. To keep table metadata
+ concise and still preserve access to dynamic fields, the connector provides two sets of functions that are specific to
+ the CLP connector. These functions are not part of standard Presto SQL.

378-380: Fix unquoted JSON paths and non-existent function name in examples.

The examples at lines 378–384 contradict the documented requirement (line 369) that "JSON paths must be constant string literals":

  • Line 378: CLP_GET_STRING(msg.author) should be CLP_GET_STRING('msg.author')
  • Line 380: Uses non-existent CLP_GET_INT; should be CLP_GET_BIGINT('msg.timestamp')
  • Line 382: CLP_GET_STRING_ARRAY(msg.tags) should be CLP_GET_STRING_ARRAY('msg.tags')
 .. code-block:: sql

-    SELECT CLP_GET_STRING(msg.author) AS author
+    SELECT CLP_GET_STRING('msg.author') AS author
     FROM clp.default.table_1
-    WHERE CLP_GET_INT('msg.timestamp') > 1620000000;
+    WHERE CLP_GET_BIGINT('msg.timestamp') > 1620000000;

-    SELECT CLP_GET_STRING_ARRAY(msg.tags) AS tags
+    SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags
     FROM clp.default.table_2
     WHERE CLP_GET_BOOL('msg.is_active') = true;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e750458 and d7d03cd.

📒 Files selected for processing (1)
  • presto-docs/src/main/sphinx/connector/clp.rst (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.

Applied to files:

  • presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: prestissimo-worker-images-build

Copy link

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few more comments.

also I noticed you didn't apply the "sentence-case heading" suggestions; I feel that you should, but I understand that the rest of the document doesn't use them, so it's up to you.

@wraymo
Copy link
Author

wraymo commented Oct 28, 2025

@quinntaylormitchell Thanks for the follow-up! Regarding the “sentence-case heading,” I noticed that other docs capitalize the first letter of every word, so I followed the same style.

Copy link

@quinntaylormitchell quinntaylormitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)

375-381: Fix JSON path examples to match documented requirements and correct function name.

Multiple issues in the examples:

  1. Line 377: Uses CLP_GET_INT, which doesn't exist. Should use CLP_GET_BIGINT (as documented in line 347-349).
  2. Lines 375, 379: JSON paths (msg.author and msg.tags) are not quoted, but the constraint at line 366 explicitly requires paths to be constant string literals. The WHERE clause at line 381 correctly quotes the path ('msg.is_active'), so these should match.

Apply this diff to fix the examples:

    SELECT CLP_GET_STRING(msg.author) AS author
    FROM clp.default.table_1
-   WHERE CLP_GET_INT('msg.timestamp') > 1620000000;
+   WHERE CLP_GET_BIGINT('msg.timestamp') > 1620000000;

-   SELECT CLP_GET_STRING_ARRAY(msg.tags) AS tags
+   SELECT CLP_GET_STRING_ARRAY('msg.tags') AS tags
    FROM clp.default.table_2
    WHERE CLP_GET_BOOL('msg.is_active') = true;

Also update line 375 to quote the path:

-   SELECT CLP_GET_STRING(msg.author) AS author
+   SELECT CLP_GET_STRING('msg.author') AS author

407-426: Document the split filter configuration limitation for wildcard functions.

The current note correctly restricts wildcard functions to WHERE clauses only, but it omits that they also cannot be used in split filter configurations (configured via clp.split-filter-config). This is an important constraint since split filters require concrete field names for metadata validation, and wildcard functions do not correspond to specific columns. The supported split filter expressions (line 224) do not include wildcard functions.

Add a bullet point clarifying this limitation:

   - Wildcard functions must appear **only in filter conditions** (`WHERE` clause). They cannot be selected and cannot
     be passed as arguments to other functions.
+  - Wildcard functions cannot be used in split filter configurations (see :ref:`Split Filter Config File<split-filter-config-file>`)
+    because split filters require concrete column field names.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ddb9d8 and 4442f9a.

📒 Files selected for processing (1)
  • presto-docs/src/main/sphinx/connector/clp.rst (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
PR: y-scope/presto#64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.

Applied to files:

  • presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (:presto-docs)
  • GitHub Check: maven-checks (8.0.442)
  • GitHub Check: maven-checks (17.0.13)
  • GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (2)
presto-docs/src/main/sphinx/connector/clp.rst (2)

323-337: LGTM!

The introduction clearly explains the three function categories and accurately conveys the performance benefits of the connector's query optimization rewrites.


443-476: LGTM!

The JSON string function section is well-documented with clear constraints and practical examples. The restriction to SELECT list only is clearly stated, and the examples effectively demonstrate common use cases.

Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! Just a few comments about things that might be worth clarifying.


.. note::

This function can only be used in the ``SELECT``` list to retrieve the complete JSON representation of each record.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "SELECT list" common terminology? I think I'd expect something more like "list of projected columns in a SELECT statement", but we can keep the current working if "SELECT list" is widespread.


- JSON paths must be **constant string literals**; variables are not supported.
- Wildcards (e.g., ``msg.*.ts``) are **not supported**.
- If a path is invalid or missing, the function returns ``NULL`` rather than raising an error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think its worth elaborationg? i.e. "If a path is invalid or missing in a given record, "?

records in their raw JSON form.

Similar to the path-based and wildcard functions, this function is rewritten during query optimization to a special
internal column. During query execution, this column is serialized into a JSON string that represents the full record.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't include metadata columns from splits right? Do you think its worth highlighting the distinction between JSON record and metadata columns here? I think it could be confusing to a reader thinking in terms of normal SQL records.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't. Currently we don't have support of metadata projection, probably can edit it later when we have the support.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
presto-docs/src/main/sphinx/connector/clp.rst (1)

407-426: Consider documenting split filter limitation for wildcard functions.

The documentation correctly restricts wildcard functions to WHERE clauses only. However, per the learnings from prior reviews, wildcard functions also cannot be used in split filter configurations (configured via clp.split-filter-config), since the split filter mechanism requires concrete field names that map to specific metadata columns, while wildcard functions don't correspond to concrete column names.

Consider adding a note that clarifies this constraint to help users understand why split filter configurations fail with wildcard functions.

Consider adding a sentence such as:

   - Wildcard functions must appear **only in filter conditions** (`WHERE` clause). They cannot be selected and cannot
     be passed as arguments to other functions.
+  - Wildcard functions cannot be used in ``clp.split-filter-config`` split filter configurations, as split filters
+    require concrete field names mapped to metadata columns; use explicit metadata field names in split filter
+    configurations instead.

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a00bac5 and 0796cf6.

📒 Files selected for processing (1)
  • presto-docs/src/main/sphinx/connector/clp.rst (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.
📚 Learning: 2025-09-16T17:39:45.102Z
Learnt from: anlowee
Repo: y-scope/presto PR: 64
File: presto-clp/src/test/java/com/facebook/presto/plugin/clp/TestClpFilterToKql.java:284-299
Timestamp: 2025-09-16T17:39:45.102Z
Learning: CLP wildcard UDFs (CLP_WILDCARD_STRING_COLUMN, CLP_WILDCARD_INT_COLUMN, etc.) cannot be used as split filters (formerly called metadata filters) because the split filter mechanism checks whether the filter contains field names that are defined in the split filter configuration file. Since wildcard functions don't correspond to specific column names, they don't pass this validation.

Applied to files:

  • presto-docs/src/main/sphinx/connector/clp.rst
📚 Learning: 2025-09-12T14:47:20.734Z
Learnt from: wraymo
Repo: y-scope/presto PR: 64
File: presto-clp/src/main/java/com/facebook/presto/plugin/clp/optimization/ClpFilterToKqlConverter.java:419-483
Timestamp: 2025-09-12T14:47:20.734Z
Learning: In Presto, SQL validation (including type compatibility checks) happens before reaching the connector optimizer, so invalid SQL with type mismatches will be rejected upstream rather than needing validation in the connector-specific optimization code.

Applied to files:

  • presto-docs/src/main/sphinx/connector/clp.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (:presto-docs)
  • GitHub Check: maven-checks (17.0.13)
  • GitHub Check: maven-checks (8.0.442)
  • GitHub Check: prestissimo-worker-images-build
🔇 Additional comments (3)
presto-docs/src/main/sphinx/connector/clp.rst (3)

323-368: Path-based functions section looks good.

The function signatures, return types, and usage constraints are well-documented and consistent. The clarification that wildcards are not supported is clear.


370-381: Examples correctly use quoted paths.

The examples properly demonstrate the documented requirement that JSON paths must be constant string literals, making them consistent with the usage notes above.


462-475: JSON String Function examples correctly demonstrate constraints.

The examples properly show that CLP_GET_JSON_STRING() is only used in the SELECT list, never in filter predicates, and demonstrate practical combinations with other CLP functions. The examples align with the documented usage constraints.

JSON String Function
====================

The ``CLP_GET_JSON_STRING``` function provides a convenient way to retrieve the entire log record—including both
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix RST backtick formatting errors.

Lines 446 and 458 have mismatched backticks that will render incorrectly in reStructuredText:

  • Line 446: \`CLP_GET_JSON_STRING```should be``CLP_GET_JSON_STRING()```
  • Line 458: \`SELECT```should be``SELECT```
- The ``CLP_GET_JSON_STRING``` function provides a convenient way to retrieve the entire log record—including both
+ The ``CLP_GET_JSON_STRING()`` function provides a convenient way to retrieve the entire log record—including both

- This function can only be used in the list of projected columns in a ``SELECT``` clause to retrieve the complete
+ This function can only be used in the list of projected columns in a ``SELECT`` clause to retrieve the complete

Also applies to: 458-458

🤖 Prompt for AI Agents
In presto-docs/src/main/sphinx/connector/clp.rst around lines 446 and 458, there
are mismatched backticks causing RST rendering errors; update line 446 to use
the function name with parentheses and proper double backticks
(``CLP_GET_JSON_STRING()``) and update line 458 to close the ``SELECT`` inline
literal with matching double backticks, ensuring both inline code spans use
balanced backticks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants