feat: redact SQL literals in query logging (WBC-139)#70
Conversation
This driver's INFO-level query log emitted raw SQL via textwrap.shorten(sql, ...), and __send logged the full request JSON (including the raw statement) at DEBUG. Because the driver is embedded by other services (notably the shared mcp-server control plane), that text leaked into their log streams (WBC-139). Add a redact_sql() helper that replaces string and numeric literals with ? while preserving keywords, identifiers, and quoted identifiers, and use it everywhere a statement is logged. The logic is duplicated from sql-session PR #197 since the two repos share no package. - wherobots/db/redaction.py: single-pass redactor + get_statement_type - connection.py: log redacted statement (+ type) and redacted request JSON - tests/test_redaction.py: unit tests mirroring sql-session's cases
Replace the hand-rolled character scanner in redact_sql() with a ~15-line implementation built on sqlparse, mirroring the sql-session change (PR #197) exactly. sqlparse is a lenient, pure-Python tokenizer with zero transitive dependencies; it never raises on malformed input and classifies literals by token type, so we redact exactly String.Single and Number.* tokens and copy everything else (keywords, identifiers, quoted/backtick identifiers, comments, whitespace, operators) through verbatim. Public API and behavior unchanged; the connection.py request-log wiring is untouched. get_statement_type() is intentionally kept as the leading-keyword regex (not sqlparse's Statement.get_type(), which returns UNKNOWN for SHOW/DESCRIBE/SET and would regress observability tagging). Adds sqlparse>=0.5.0 as a runtime dependency of this published client library (pure-Python, zero transitive deps) and updates the lockfile.
|
Switched `redact_sql()` to `sqlparse`, mirroring sql-session PR #197 exactly (the two `redact_sql` bodies are byte-identical; the repos share no package so the logic is duplicated by necessity). New impl is ~15 lines: parse, `flatten()`, redact `String.Single` + `Number.*` to `?`, copy everything else verbatim (keywords, identifiers, double-quoted/backtick identifiers, comments, whitespace, operators). An unterminated single quote (emitted by sqlparse as an `Error` token followed by plain text) is caught explicitly so the trailing value can never leak. Dependency note: this adds `sqlparse>=0.5.0` as a runtime dependency of this published client library. sqlparse is pure-Python with zero transitive dependencies, so the install footprint impact is minimal. `get_statement_type()` is unchanged — still the leading-keyword regex, NOT sqlparse's `Statement.get_type()` (which returns `UNKNOWN` for SHOW/DESCRIBE/SET). The `connection.py` `__send` DEBUG request-log wiring is untouched; it just calls the new `redact_sql`. All 18 redaction tests pass unchanged; full suite (90 tests) green; ruff format/check + codespell clean; redaction.py is mypy-clean (pre-existing mypy errors in connection.py/driver.py are unrelated). |
There was a problem hiding this comment.
Reviewed by Salty Hambot 🤖🧂
Solid redaction work, but the unterminated-quote guard breaks when it should return — in a multi-statement parse it'll happily leak the literals it's supposed to hide. Fail closed, add the multi-statement test, and ship it.
3 finding(s) posted.
💬 To request a re-review, comment @salty-hambot review
| def __send(self, message: Dict[str, Any]) -> None: | ||
| request = json.dumps(message) | ||
| logging.debug("Request: %s", request) | ||
| logging.debug("Request: %s", self.__redacted_request(message)) |
There was a problem hiding this comment.
is there a way to lazily call __redacted_request() only when debug logging is enabled?
There was a problem hiding this comment.
Done in 6b62c0f. __send now guards the call behind logging.getLogger().isEnabledFor(logging.DEBUG) so the json.dumps + sqlparse parse only run when DEBUG logging is actually enabled (the log argument is evaluated eagerly otherwise). Matched the module's existing use of the root logging object.
| # services, so raw SQL here would leak into their log streams (WBC-139). | ||
| logging.info( | ||
| "Executing SQL query %s: %s", execution_id, textwrap.shorten(sql, width=60) | ||
| "Executing SQL query %s (%s): %s", |
There was a problem hiding this comment.
we lost the truncation here, is that intentional?
There was a problem hiding this comment.
Good catch, not intentional. Restored in 6b62c0f: the INFO log now wraps the redacted SQL in textwrap.shorten(redact_sql(sql), width=200). Truncation is applied to the redacted output (so we keep redaction and re-add the cap), at width 200 rather than the old 60 to leave more useful structure in the log.
| # sqlparse ships no type stubs, so ``flatten`` is untyped under strict. | ||
| for token in stmt.flatten(): # type: ignore[no-untyped-call] | ||
| ttype = token.ttype | ||
| if ttype in T.String.Single or ttype in T.Number: |
There was a problem hiding this comment.
(ai-generated comment) Does sqlparse ever classify string literals as something other than String.Single — e.g. String.Symbol, hex X'...', or unicode U&'...' literals? If so, those would pass through verbatim. Might be worth a quick check since this token classification is now load-bearing for the redaction.
There was a problem hiding this comment.
Good question — I ran sqlparse empirically against these forms (target dialect Sedona/Spark SQL):
==== X'deadbeef' (hex/blob)
Token.Name 'X'
Token.Literal.String.Single "'deadbeef'" <- value body is String.Single
==== U&'\0041' (unicode)
Token.Name 'U'
Token.Operator '&'
Token.Literal.String.Single "'\\0041'" <- String.Single
==== N'abc' (symbol-prefixed)
Token.Name 'N'
Token.Literal.String.Single "'abc'" <- String.Single
So for every prefixed form, sqlparse classifies the value-bearing quoted body as String.Single, which the current guard already redacts. Verified end-to-end:
X'deadbeef' -> X?
U&'\0041' -> U&?
N'abc' -> N?
No secret leaks through. So no code change needed here.
Importantly, broadening the guard to ttype in T.String would be harmful: sqlparse classifies double-quoted identifiers as String.Symbol (a subtype of String) —
'SELECT "user id" FROM "my table"' -> "user id" / "my table" are Token.Literal.String.Symbol
— so T.String would wrongly redact table/column names and regress test_double_quoted_identifier_left_intact, making logs useless for debugging. The only literal form outside String.Single that isn't an identifier is Postgres-only $$...$$ dollar-quoting (Token.Literal), which Sedona/Spark doesn't support, so it's intentionally out of scope.
Added a regression test (test_prefixed_string_literals_redacted) locking in that X'..', U&'..', and N'..' redact correctly. (6b62c0f)
…tement leak (WBC-139)
…literals (WBC-139) - connection.py __send: guard redaction behind logging.isEnabledFor(DEBUG) so json.dumps + sqlparse parse only run when DEBUG logging is active. - connection.py __execute_sql: re-add truncation lost in the redaction change by wrapping the redacted SQL in textwrap.shorten(width=200). - redaction.py: no code change needed. Verified X'..' (hex), U&'..' (unicode) and N'..' (symbol-prefixed) string literals tokenize their value-bearing body as String.Single, which the existing guard already redacts. Broadening to T.String would wrongly redact double-quoted identifiers (String.Symbol). Added a regression test locking this in.
Summary
This driver leaked raw SQL into the logs of every consumer that embeds it
(notably the shared mcp-server control plane, whose logs showed lines like
INFO root: Executing SQL query <id>: SHOW TBLPROPERTIES [...]).Two spots logged statement text:
connection.pyINFO query log usedtextwrap.shorten(sql, width=60)(raw SQL).connection.py.__sendlogged the full request JSON at DEBUG, which includesthe raw
statementfield.This PR redacts literal values before logging at both spots. No raw SQL is now
logged at any level, and result rows/data are never logged.
Redaction approach: a new
wherobots/db/redaction.pyprovidesredact_sql()(plus
get_statement_type()). It is a single-pass character tokenizer (nothird-party dependency) that replaces single-quoted string literals and numeric
literals with
?, while preserving keywords, function names, and quoted(
"..."/`...`) identifiers, and leaving comments intact. The INFO lognow emits the redacted statement plus its statement type alongside the
execution_id; the DEBUG request log redacts thestatementfield in theserialized payload (the wire payload sent to the server is unchanged).
This logic is duplicated from sql-session PR #197 — these are separate repos
with no shared package, so the implementation is intentionally replicated rather
than imported.
Limitations: best-effort redaction for logging only, not a security
boundary. Dollar-quoted strings and other exotic literal syntaxes are not
recognized (Sedona/Spark SQL does not use them). A literal embedded inside a SQL
comment is preserved verbatim (comments are kept for structure).
Related Issues
Relates to WBC-139
Requester Checklist
Complete these before marking Ready for Review
Visual Proof
Reviewer Checklist