feat: redact SQL literals in query logging (WBC-139) by ClayMav · Pull Request #70 · wherobots/wherobots-python-dbapi

ClayMav · 2026-06-11T22:25:05Z

Summary

This driver leaked raw SQL into the logs of every consumer that embeds it
(notably the shared mcp-server control plane, whose logs showed lines like
INFO root: Executing SQL query <id>: SHOW TBLPROPERTIES [...]).

Two spots logged statement text:

connection.py INFO query log used textwrap.shorten(sql, width=60) (raw SQL).
connection.py.__send logged the full request JSON at DEBUG, which includes
the raw statement field.

This PR redacts literal values before logging at both spots. No raw SQL is now
logged at any level, and result rows/data are never logged.

Redaction approach: a new wherobots/db/redaction.py provides redact_sql()
(plus get_statement_type()). It is a single-pass character tokenizer (no
third-party dependency) that replaces single-quoted string literals and numeric
literals with ?, while preserving keywords, function names, and quoted
("..." / `...`) identifiers, and leaving comments intact. The INFO log
now emits the redacted statement plus its statement type alongside the
execution_id; the DEBUG request log redacts the statement field in the
serialized payload (the wire payload sent to the server is unchanged).

This logic is duplicated from sql-session PR #197 — these are separate repos
with no shared package, so the implementation is intentionally replicated rather
than imported.

Limitations: best-effort redaction for logging only, not a security
boundary. Dollar-quoted strings and other exotic literal syntaxes are not
recognized (Sedona/Spark SQL does not use them). A literal embedded inside a SQL
comment is preserved verbatim (comments are kept for structure).

Related Issues

Relates to WBC-139

Requester Checklist

Complete these before marking Ready for Review

I have self-reviewed my own code
I have added/updated tests that prove my fix/feature works
I have included visual proof (screenshot, video, or test output) if applicable
All CI checks are passing
PR size is S/M, OR I have justified the size and added a walkthrough
I have updated documentation if needed

Visual Proof

$ uv run pytest tests/test_redaction.py -v
18 passed
$ uv run pytest
90 passed
$ uv run pre-commit run --all-files
check yaml...Passed
fix end of files...Passed
trim trailing whitespace...Passed
ruff...Passed
ruff-format...Passed
codespell...Passed

Reviewer Checklist

Requester checklist above is complete
All CI checks are passing
Tests adequately cover the changes

This driver's INFO-level query log emitted raw SQL via textwrap.shorten(sql, ...), and __send logged the full request JSON (including the raw statement) at DEBUG. Because the driver is embedded by other services (notably the shared mcp-server control plane), that text leaked into their log streams (WBC-139). Add a redact_sql() helper that replaces string and numeric literals with ? while preserving keywords, identifiers, and quoted identifiers, and use it everywhere a statement is logged. The logic is duplicated from sql-session PR #197 since the two repos share no package. - wherobots/db/redaction.py: single-pass redactor + get_statement_type - connection.py: log redacted statement (+ type) and redacted request JSON - tests/test_redaction.py: unit tests mirroring sql-session's cases

Replace the hand-rolled character scanner in redact_sql() with a ~15-line implementation built on sqlparse, mirroring the sql-session change (PR #197) exactly. sqlparse is a lenient, pure-Python tokenizer with zero transitive dependencies; it never raises on malformed input and classifies literals by token type, so we redact exactly String.Single and Number.* tokens and copy everything else (keywords, identifiers, quoted/backtick identifiers, comments, whitespace, operators) through verbatim. Public API and behavior unchanged; the connection.py request-log wiring is untouched. get_statement_type() is intentionally kept as the leading-keyword regex (not sqlparse's Statement.get_type(), which returns UNKNOWN for SHOW/DESCRIBE/SET and would regress observability tagging). Adds sqlparse>=0.5.0 as a runtime dependency of this published client library (pure-Python, zero transitive deps) and updates the lockfile.

ClayMav · 2026-06-11T22:41:41Z

Switched `redact_sql()` to `sqlparse`, mirroring sql-session PR #197 exactly (the two `redact_sql` bodies are byte-identical; the repos share no package so the logic is duplicated by necessity). New impl is ~15 lines: parse, `flatten()`, redact `String.Single` + `Number.*` to `?`, copy everything else verbatim (keywords, identifiers, double-quoted/backtick identifiers, comments, whitespace, operators). An unterminated single quote (emitted by sqlparse as an `Error` token followed by plain text) is caught explicitly so the trailing value can never leak.

Dependency note: this adds `sqlparse>=0.5.0` as a runtime dependency of this published client library. sqlparse is pure-Python with zero transitive dependencies, so the install footprint impact is minimal.

`get_statement_type()` is unchanged — still the leading-keyword regex, NOT sqlparse's `Statement.get_type()` (which returns `UNKNOWN` for SHOW/DESCRIBE/SET). The `connection.py` `__send` DEBUG request-log wiring is untouched; it just calls the new `redact_sql`.

All 18 redaction tests pass unchanged; full suite (90 tests) green; ruff format/check + codespell clean; redaction.py is mypy-clean (pre-existing mypy errors in connection.py/driver.py are unrelated).

salty-hambot

Reviewed by Salty Hambot 🤖🧂

Solid redaction work, but the unterminated-quote guard breaks when it should return — in a multi-statement parse it'll happily leak the literals it's supposed to hide. Fail closed, add the multi-statement test, and ship it.

_{3 finding(s) posted.}
_{💬 To request a re-review, comment @salty-hambot review}

sfishel18 · 2026-06-12T15:21:06Z

    def __send(self, message: Dict[str, Any]) -> None:
        request = json.dumps(message)
-        logging.debug("Request: %s", request)
+        logging.debug("Request: %s", self.__redacted_request(message))


is there a way to lazily call __redacted_request() only when debug logging is enabled?

Done in 6b62c0f. __send now guards the call behind logging.getLogger().isEnabledFor(logging.DEBUG) so the json.dumps + sqlparse parse only run when DEBUG logging is actually enabled (the log argument is evaluated eagerly otherwise). Matched the module's existing use of the root logging object.

sfishel18 · 2026-06-12T15:21:06Z

+        # services, so raw SQL here would leak into their log streams (WBC-139).
        logging.info(
-            "Executing SQL query %s: %s", execution_id, textwrap.shorten(sql, width=60)
+            "Executing SQL query %s (%s): %s",


we lost the truncation here, is that intentional?

Good catch, not intentional. Restored in 6b62c0f: the INFO log now wraps the redacted SQL in textwrap.shorten(redact_sql(sql), width=200). Truncation is applied to the redacted output (so we keep redaction and re-add the cap), at width 200 rather than the old 60 to leave more useful structure in the log.

sfishel18 · 2026-06-12T15:21:06Z

+        # sqlparse ships no type stubs, so ``flatten`` is untyped under strict.
+        for token in stmt.flatten():  # type: ignore[no-untyped-call]
+            ttype = token.ttype
+            if ttype in T.String.Single or ttype in T.Number:


(ai-generated comment) Does sqlparse ever classify string literals as something other than String.Single — e.g. String.Symbol, hex X'...', or unicode U&'...' literals? If so, those would pass through verbatim. Might be worth a quick check since this token classification is now load-bearing for the redaction.

Good question — I ran sqlparse empirically against these forms (target dialect Sedona/Spark SQL):

==== X'deadbeef' (hex/blob) Token.Name 'X' Token.Literal.String.Single "'deadbeef'" <- value body is String.Single ==== U&'\0041' (unicode) Token.Name 'U' Token.Operator '&' Token.Literal.String.Single "'\\0041'" <- String.Single ==== N'abc' (symbol-prefixed) Token.Name 'N' Token.Literal.String.Single "'abc'" <- String.Single

So for every prefixed form, sqlparse classifies the value-bearing quoted body as String.Single, which the current guard already redacts. Verified end-to-end:

X'deadbeef' -> X? U&'\0041' -> U&? N'abc' -> N?

No secret leaks through. So no code change needed here.

Importantly, broadening the guard to ttype in T.String would be harmful: sqlparse classifies double-quoted identifiers as String.Symbol (a subtype of String) —

'SELECT "user id" FROM "my table"' -> "user id" / "my table" are Token.Literal.String.Symbol

— so T.String would wrongly redact table/column names and regress test_double_quoted_identifier_left_intact, making logs useless for debugging. The only literal form outside String.Single that isn't an identifier is Postgres-only $$...$$ dollar-quoting (Token.Literal), which Sedona/Spark doesn't support, so it's intentionally out of scope.

Added a regression test (test_prefixed_string_literals_redacted) locking in that X'..', U&'..', and N'..' redact correctly. (6b62c0f)

…tement leak (WBC-139)

…literals (WBC-139) - connection.py __send: guard redaction behind logging.isEnabledFor(DEBUG) so json.dumps + sqlparse parse only run when DEBUG logging is active. - connection.py __execute_sql: re-add truncation lost in the redaction change by wrapping the redacted SQL in textwrap.shorten(width=200). - redaction.py: no code change needed. Verified X'..' (hex), U&'..' (unicode) and N'..' (symbol-prefixed) string literals tokenize their value-bearing body as String.Single, which the existing guard already redacts. Broadening to T.String would wrongly redact double-quoted identifiers (String.Symbol). Added a regression test locking this in.

ClayMav added 2 commits June 11, 2026 15:24

ClayMav marked this pull request as ready for review June 11, 2026 22:42

ClayMav requested a review from a team as a code owner June 11, 2026 22:42

salty-hambot Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread wherobots/db/redaction.py

Comment thread tests/test_redaction.py

Comment thread wherobots/db/connection.py

sfishel18 reviewed Jun 12, 2026

View reviewed changes

ClayMav added 2 commits June 12, 2026 08:42

fix: fail closed on unterminated-quote redaction to prevent multi-sta…

ea518f5

…tement leak (WBC-139)

ClayMav requested a review from sfishel18 June 16, 2026 22:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: redact SQL literals in query logging (WBC-139)#70

feat: redact SQL literals in query logging (WBC-139)#70
ClayMav wants to merge 4 commits into
mainfrom
clay/wbc-139-redact-query-logging

ClayMav commented Jun 11, 2026

Uh oh!

ClayMav commented Jun 11, 2026

Uh oh!

salty-hambot Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sfishel18 Jun 12, 2026

Uh oh!

ClayMav Jun 16, 2026

Uh oh!

sfishel18 Jun 12, 2026

Uh oh!

ClayMav Jun 16, 2026

Uh oh!

sfishel18 Jun 12, 2026

Uh oh!

ClayMav Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

ClayMav commented Jun 11, 2026

Summary

Related Issues

Requester Checklist

Visual Proof

Reviewer Checklist

Uh oh!

ClayMav commented Jun 11, 2026

Uh oh!

salty-hambot Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sfishel18 Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

ClayMav Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sfishel18 Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

ClayMav Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sfishel18 Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

ClayMav Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants