Skip to content

feat: redact SQL literals in query logging (WBC-139)#70

Open
ClayMav wants to merge 4 commits into
mainfrom
clay/wbc-139-redact-query-logging
Open

feat: redact SQL literals in query logging (WBC-139)#70
ClayMav wants to merge 4 commits into
mainfrom
clay/wbc-139-redact-query-logging

Conversation

@ClayMav

@ClayMav ClayMav commented Jun 11, 2026

Copy link
Copy Markdown
Member

Summary

This driver leaked raw SQL into the logs of every consumer that embeds it
(notably the shared mcp-server control plane, whose logs showed lines like
INFO root: Executing SQL query <id>: SHOW TBLPROPERTIES [...]).

Two spots logged statement text:

  • connection.py INFO query log used textwrap.shorten(sql, width=60) (raw SQL).
  • connection.py.__send logged the full request JSON at DEBUG, which includes
    the raw statement field.

This PR redacts literal values before logging at both spots. No raw SQL is now
logged at any level, and result rows/data are never logged.

Redaction approach: a new wherobots/db/redaction.py provides redact_sql()
(plus get_statement_type()). It is a single-pass character tokenizer (no
third-party dependency) that replaces single-quoted string literals and numeric
literals with ?, while preserving keywords, function names, and quoted
("..." / `...`) identifiers, and leaving comments intact. The INFO log
now emits the redacted statement plus its statement type alongside the
execution_id; the DEBUG request log redacts the statement field in the
serialized payload (the wire payload sent to the server is unchanged).

This logic is duplicated from sql-session PR #197 — these are separate repos
with no shared package, so the implementation is intentionally replicated rather
than imported.

Limitations: best-effort redaction for logging only, not a security
boundary. Dollar-quoted strings and other exotic literal syntaxes are not
recognized (Sedona/Spark SQL does not use them). A literal embedded inside a SQL
comment is preserved verbatim (comments are kept for structure).

Related Issues

Relates to WBC-139


Requester Checklist

Complete these before marking Ready for Review

  • I have self-reviewed my own code
  • I have added/updated tests that prove my fix/feature works
  • I have included visual proof (screenshot, video, or test output) if applicable
  • All CI checks are passing
  • PR size is S/M, OR I have justified the size and added a walkthrough
  • I have updated documentation if needed

Visual Proof

$ uv run pytest tests/test_redaction.py -v
18 passed
$ uv run pytest
90 passed
$ uv run pre-commit run --all-files
check yaml...Passed
fix end of files...Passed
trim trailing whitespace...Passed
ruff...Passed
ruff-format...Passed
codespell...Passed

Reviewer Checklist

  • Requester checklist above is complete
  • All CI checks are passing
  • Tests adequately cover the changes

ClayMav added 2 commits June 11, 2026 15:24
This driver's INFO-level query log emitted raw SQL via
textwrap.shorten(sql, ...), and __send logged the full request JSON
(including the raw statement) at DEBUG. Because the driver is embedded
by other services (notably the shared mcp-server control plane), that
text leaked into their log streams (WBC-139).

Add a redact_sql() helper that replaces string and numeric literals with
? while preserving keywords, identifiers, and quoted identifiers, and use
it everywhere a statement is logged. The logic is duplicated from
sql-session PR #197 since the two repos share no package.

- wherobots/db/redaction.py: single-pass redactor + get_statement_type
- connection.py: log redacted statement (+ type) and redacted request JSON
- tests/test_redaction.py: unit tests mirroring sql-session's cases
Replace the hand-rolled character scanner in redact_sql() with a ~15-line
implementation built on sqlparse, mirroring the sql-session change (PR #197)
exactly. sqlparse is a lenient, pure-Python tokenizer with zero transitive
dependencies; it never raises on malformed input and classifies literals by
token type, so we redact exactly String.Single and Number.* tokens and copy
everything else (keywords, identifiers, quoted/backtick identifiers, comments,
whitespace, operators) through verbatim. Public API and behavior unchanged;
the connection.py request-log wiring is untouched.

get_statement_type() is intentionally kept as the leading-keyword regex (not
sqlparse's Statement.get_type(), which returns UNKNOWN for SHOW/DESCRIBE/SET
and would regress observability tagging).

Adds sqlparse>=0.5.0 as a runtime dependency of this published client library
(pure-Python, zero transitive deps) and updates the lockfile.
@ClayMav

ClayMav commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

Switched `redact_sql()` to `sqlparse`, mirroring sql-session PR #197 exactly (the two `redact_sql` bodies are byte-identical; the repos share no package so the logic is duplicated by necessity). New impl is ~15 lines: parse, `flatten()`, redact `String.Single` + `Number.*` to `?`, copy everything else verbatim (keywords, identifiers, double-quoted/backtick identifiers, comments, whitespace, operators). An unterminated single quote (emitted by sqlparse as an `Error` token followed by plain text) is caught explicitly so the trailing value can never leak.

Dependency note: this adds `sqlparse>=0.5.0` as a runtime dependency of this published client library. sqlparse is pure-Python with zero transitive dependencies, so the install footprint impact is minimal.

`get_statement_type()` is unchanged — still the leading-keyword regex, NOT sqlparse's `Statement.get_type()` (which returns `UNKNOWN` for SHOW/DESCRIBE/SET). The `connection.py` `__send` DEBUG request-log wiring is untouched; it just calls the new `redact_sql`.

All 18 redaction tests pass unchanged; full suite (90 tests) green; ruff format/check + codespell clean; redaction.py is mypy-clean (pre-existing mypy errors in connection.py/driver.py are unrelated).

@ClayMav ClayMav marked this pull request as ready for review June 11, 2026 22:42
@ClayMav ClayMav requested a review from a team as a code owner June 11, 2026 22:42

@salty-hambot salty-hambot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed by Salty Hambot 🤖🧂

Solid redaction work, but the unterminated-quote guard breaks when it should return — in a multi-statement parse it'll happily leak the literals it's supposed to hide. Fail closed, add the multi-statement test, and ship it.

3 finding(s) posted.
💬 To request a re-review, comment @salty-hambot review

Comment thread wherobots/db/redaction.py
Comment thread tests/test_redaction.py
Comment thread wherobots/db/connection.py
Comment thread wherobots/db/connection.py Outdated
def __send(self, message: Dict[str, Any]) -> None:
request = json.dumps(message)
logging.debug("Request: %s", request)
logging.debug("Request: %s", self.__redacted_request(message))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to lazily call __redacted_request() only when debug logging is enabled?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6b62c0f. __send now guards the call behind logging.getLogger().isEnabledFor(logging.DEBUG) so the json.dumps + sqlparse parse only run when DEBUG logging is actually enabled (the log argument is evaluated eagerly otherwise). Matched the module's existing use of the root logging object.

# services, so raw SQL here would leak into their log streams (WBC-139).
logging.info(
"Executing SQL query %s: %s", execution_id, textwrap.shorten(sql, width=60)
"Executing SQL query %s (%s): %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we lost the truncation here, is that intentional?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, not intentional. Restored in 6b62c0f: the INFO log now wraps the redacted SQL in textwrap.shorten(redact_sql(sql), width=200). Truncation is applied to the redacted output (so we keep redaction and re-add the cap), at width 200 rather than the old 60 to leave more useful structure in the log.

Comment thread wherobots/db/redaction.py
# sqlparse ships no type stubs, so ``flatten`` is untyped under strict.
for token in stmt.flatten(): # type: ignore[no-untyped-call]
ttype = token.ttype
if ttype in T.String.Single or ttype in T.Number:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ai-generated comment) Does sqlparse ever classify string literals as something other than String.Single — e.g. String.Symbol, hex X'...', or unicode U&'...' literals? If so, those would pass through verbatim. Might be worth a quick check since this token classification is now load-bearing for the redaction.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — I ran sqlparse empirically against these forms (target dialect Sedona/Spark SQL):

==== X'deadbeef'  (hex/blob)
   Token.Name                  'X'
   Token.Literal.String.Single "'deadbeef'"   <- value body is String.Single
==== U&'\0041'    (unicode)
   Token.Name                  'U'
   Token.Operator              '&'
   Token.Literal.String.Single "'\\0041'"      <- String.Single
==== N'abc'       (symbol-prefixed)
   Token.Name                  'N'
   Token.Literal.String.Single "'abc'"         <- String.Single

So for every prefixed form, sqlparse classifies the value-bearing quoted body as String.Single, which the current guard already redacts. Verified end-to-end:

X'deadbeef'  -> X?
U&'\0041'    -> U&?
N'abc'       -> N?

No secret leaks through. So no code change needed here.

Importantly, broadening the guard to ttype in T.String would be harmful: sqlparse classifies double-quoted identifiers as String.Symbol (a subtype of String) —

'SELECT "user id" FROM "my table"'  ->  "user id" / "my table" are Token.Literal.String.Symbol

— so T.String would wrongly redact table/column names and regress test_double_quoted_identifier_left_intact, making logs useless for debugging. The only literal form outside String.Single that isn't an identifier is Postgres-only $$...$$ dollar-quoting (Token.Literal), which Sedona/Spark doesn't support, so it's intentionally out of scope.

Added a regression test (test_prefixed_string_literals_redacted) locking in that X'..', U&'..', and N'..' redact correctly. (6b62c0f)

ClayMav added 2 commits June 12, 2026 08:42
…literals (WBC-139)

- connection.py __send: guard redaction behind logging.isEnabledFor(DEBUG)
  so json.dumps + sqlparse parse only run when DEBUG logging is active.
- connection.py __execute_sql: re-add truncation lost in the redaction
  change by wrapping the redacted SQL in textwrap.shorten(width=200).
- redaction.py: no code change needed. Verified X'..' (hex), U&'..'
  (unicode) and N'..' (symbol-prefixed) string literals tokenize their
  value-bearing body as String.Single, which the existing guard already
  redacts. Broadening to T.String would wrongly redact double-quoted
  identifiers (String.Symbol). Added a regression test locking this in.
@ClayMav ClayMav requested a review from sfishel18 June 16, 2026 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants