feat(googlesql/analysis): masking-grade relation-projection resolution #299
Merged
Resolve each FROM relation (base table / CTE / derived subquery / join /
UNNEST) to its ordered output projection with base-column lineage, so query
spans drive bytebase data masking faithfully:
- projection.go: the relation-projection resolver — CTE/derived projections
retained and reproduced by `*`/qualified refs; set-op position-merge with
star markers (baseStar / starGroup EXCEPT-REPLACE / StarMerge / deferred
SetOpMerge); BY NAME / CORRESPONDING name-merge (mergeProjectionsByName);
recursive-CTE lineage fixpoint; JOIN USING key coalescing (case-insensitive
— the legacy resolver's lowercase keys never coalesced, a masking leak);
NATURAL JOIN fails closed.
- query_span.go: spanWalker builds relations per FROM source; UNNEST yields a
relation whose element column carries the array argument's lineage; wire
types StarSegment{ExceptColumns}/StarMergeInfo/SetOpMergeInfo{ByName,
MatchColumns} let the metadata-aware consumer finish base-table star
expansion without re-deriving lineage; CTEReferences recorded separately
from AccessTables.
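
The case-insensitive JOIN USING key coalescing described above can be illustrated with a minimal sketch. The `column` type and the `coalesceUsing` and `find` helpers are hypothetical names invented for this example, not omni's actual types or API; the sketch only assumes the standard USING rule that each key appears once, ahead of the remaining left and right columns.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// column is a hypothetical projection entry: an output name plus the
// base columns ("table.column") it derives from.
type column struct {
	Name    string
	Lineage []string
}

// find returns the first column whose name matches key case-insensitively.
func find(cols []column, key string) (column, bool) {
	for _, c := range cols {
		if strings.EqualFold(c.Name, key) {
			return c, true
		}
	}
	return column{}, false
}

// coalesceUsing builds the projection of `left JOIN right USING (keys)`:
// each key appears once, up front, carrying lineage from BOTH sides,
// followed by the remaining left columns and then the remaining right ones.
func coalesceUsing(left, right []column, keys []string) ([]column, error) {
	var out []column
	for _, k := range keys {
		l, okL := find(left, k)
		r, okR := find(right, k)
		if !okL || !okR {
			// Fail closed rather than emit a misaligned projection.
			return nil, errors.New("USING key not found on both sides: " + k)
		}
		lineage := append(append([]string{}, l.Lineage...), r.Lineage...)
		out = append(out, column{Name: l.Name, Lineage: lineage})
	}
	isKey := func(name string) bool {
		for _, k := range keys {
			if strings.EqualFold(k, name) {
				return true
			}
		}
		return false
	}
	for _, side := range [][]column{left, right} {
		for _, c := range side {
			if !isKey(c.Name) {
				out = append(out, c)
			}
		}
	}
	return out, nil
}

func main() {
	left := []column{{"ID", []string{"a.ID"}}, {"name", []string{"a.name"}}}
	right := []column{{"id", []string{"b.id"}}, {"email", []string{"b.email"}}}
	out, err := coalesceUsing(left, right, []string{"id"})
	fmt.Println(out, err)
}
```

With a case-sensitive comparison, the lowercase key `id` would not match the left column `ID`, both columns would survive in the output, and every later position would shift against a positional masker.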
Validated against a 54-case legacy-resolver differential corpus (bytebase
bigquery query-span goldens recorded FROM the legacy ANTLR resolver) plus
projection unit tests; three independent-review gate rounds (8 -> 3 -> 1
findings, all fixed and pinned).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
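
To make the difference between the set-op position-merge and the BY NAME / CORRESPONDING name-merge concrete, here is a rough sketch. `mergeByPosition` and `mergeByNameSketch` are hypothetical helpers written for this example (the commit's real function is `mergeProjectionsByName`, whose signature is not shown in the source). The sketch assumes column names are unique within each arm and that every left column must have a right counterpart.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// column is a hypothetical projection entry with base-column lineage.
type column struct {
	Name    string
	Lineage []string
}

var errFailClosed = errors.New("set operation arms do not line up: fail closed")

// union concatenates two lineage lists without mutating either input.
func union(a, b []string) []string {
	return append(append([]string{}, a...), b...)
}

// mergeByPosition gives output column i the first arm's name and the
// lineage of column i from both arms.
func mergeByPosition(left, right []column) ([]column, error) {
	if len(left) != len(right) {
		return nil, errFailClosed
	}
	out := make([]column, len(left))
	for i := range left {
		out[i] = column{Name: left[i].Name, Lineage: union(left[i].Lineage, right[i].Lineage)}
	}
	return out, nil
}

// mergeByNameSketch pairs columns by case-insensitive name, keeping the
// left arm's order. Assumption: names are unique within each arm.
func mergeByNameSketch(left, right []column) ([]column, error) {
	if len(left) != len(right) {
		return nil, errFailClosed
	}
	out := make([]column, 0, len(left))
	for _, l := range left {
		matched := false
		for _, r := range right {
			if strings.EqualFold(l.Name, r.Name) {
				out = append(out, column{Name: l.Name, Lineage: union(l.Lineage, r.Lineage)})
				matched = true
				break
			}
		}
		if !matched {
			return nil, errFailClosed
		}
	}
	return out, nil
}

func main() {
	left := []column{{"a", []string{"t1.a"}}, {"b", []string{"t1.b"}}}
	right := []column{{"b", []string{"t2.b"}}, {"a", []string{"t2.a"}}}
	byPos, _ := mergeByPosition(left, right)
	byName, _ := mergeByNameSketch(left, right)
	fmt.Println(byPos)
	fmt.Println(byName)
}
```

When the arms list the same columns in a different order, the two merges attribute different base columns to each output position, which is why the merge mode has to travel with the lineage.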
…t-star fail-closed, base-field names
Three additions surfaced by the Spanner legacy-differential corpus (the bigquery corpus stays green; the resolver is shared):
- resolveColumn: an unqualified column with exactly ONE base relation in scope is attributed to that relation (the strict pass already ruled out concrete CTE/derived projections). Without the attribution the consumer matches the bare name across ALL expanded tables, over-including whenever another table shares the column name (a 3-way UNION whose arms' tables overlap) and diverging from the legacy resolver's exact single-table attribution.
- resolveDotStar: a multi-part qualified star (schema.table.*) matches the relation by its trailing part with the schema prefix verified; an UNRESOLVABLE wild path now FAILS CLOSED instead of silently yielding zero output columns (zero result maskers = every output column unmasked). The legacy resolver errored on schema-qualified stars; resolving them is an improvement, and the fail-closed branch preserves the structural rule.
- BaseFieldName (projColumn -> ColumnInfo/StarSegment): marks the JOIN ... USING coalesced key as a base-table FIELD passthrough, so a consumer reproducing legacy naming renders it in the field's metadata case (the legacy resolver named it after the left PhysicalTable's field).
Existing projection tests updated to the table-attributed lineage (strictly more precise; the corpus goldens, recorded from the legacy resolver, pin the end-to-end behavior).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
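
The single-base-relation attribution rule for unqualified columns can be sketched as follows. The `relation` type and `attributeUnqualified` are hypothetical names for this example, and the sketch assumes (as the commit states) that an earlier strict pass has already ruled out a match in any CTE or derived projection.

```go
package main

import "fmt"

// relation is a hypothetical in-scope FROM source.
type relation struct {
	Name   string
	IsBase bool // false for CTEs and derived subqueries
}

// attributeUnqualified returns the one base relation an unqualified column
// should be attributed to, or ok=false when zero or several base relations
// are in scope and the column must stay unattributed.
func attributeUnqualified(scope []relation) (table string, ok bool) {
	count := 0
	for _, r := range scope {
		if r.IsBase {
			count++
			table = r.Name
		}
	}
	if count != 1 {
		return "", false
	}
	return table, true
}

func main() {
	fmt.Println(attributeUnqualified([]relation{{"orders", true}, {"recent", false}}))
	fmt.Println(attributeUnqualified([]relation{{"orders", true}, {"users", true}}))
}
```

Attributing the column to its table keeps a consumer from matching the bare name against every expanded table that happens to share it.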
…losed
Gate-round P0: resolveDotStar's rel.* fast path fired on parts[0] for ANY path length, so a struct-field star through a relation (`d.s.*`) returned the relation's WHOLE projection, misaligning the positional masker (the first result's masker lands on the struct's first sub-column → a sensitive column returned unmasked).
The fast path now requires a SINGLE-part qualifier; a multi-part path with a leading relation is a struct star omni cannot enumerate metadata-free and FAILS CLOSED (legacy errored there too). The schema-qualified branch additionally requires the head NOT to name a relation.
Pinned by TestProjection_StructFieldStarFailsClosed (the gate's reproducer + the plain t.s.* variant + both still-resolving shapes).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…dings)
- A value-table (UNNEST) elem.* fails closed: the element's struct sub-fields expand to N engine output columns omni cannot enumerate, and returning the relation's single projection column would shift every later position against the positional masker.
- The schema-qualified star branch requires the written prefix to match the relation's non-empty Schema OR Database/dataset qualifier: an unqualified FROM accepts no written prefix (the engine rejects the range variable; the legacy resolver errored on every schema-qualified star), and a BigQuery dataset-qualified star (ds.t.* FROM ds.t) now resolves via the Database bucket instead of the removed empty-schema acceptance.
Pinned by TestProjection_DotStarQualifierTightening.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
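
The dot-star rules accumulated over the last three commits (single-part fast path, struct-star and value-table fail-closed, qualifier tightening) can be summarized in one sketch. `resolveDotStarSketch`, `relation`, and `lookup` are hypothetical stand-ins written for this example; they are not omni's `resolveDotStar`, whose signature is not shown in the source. Case-insensitive name matching is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// relation is a hypothetical in-scope FROM source.
type relation struct {
	Name       string // range variable (alias or trailing table name)
	Schema     string // written schema qualifier, if any
	Database   string // written database/dataset qualifier, if any
	ValueTable bool   // e.g. an UNNEST element relation
	Columns    []string
}

var errFailClosed = errors.New("star cannot be resolved: fail closed")

// lookup finds a relation by range-variable name (assumed case-insensitive).
func lookup(scope []relation, name string) *relation {
	for i := range scope {
		if strings.EqualFold(scope[i].Name, name) {
			return &scope[i]
		}
	}
	return nil
}

// resolveDotStarSketch resolves `qualifier.*` to a relation's projection,
// returning an error for every shape it cannot enumerate.
func resolveDotStarSketch(qualifier []string, scope []relation) ([]string, error) {
	switch len(qualifier) {
	case 1: // rel.* fast path: single-part qualifier only
		r := lookup(scope, qualifier[0])
		if r == nil || r.ValueTable {
			return nil, errFailClosed
		}
		return r.Columns, nil
	case 2:
		if lookup(scope, qualifier[0]) != nil {
			return nil, errFailClosed // d.s.*: struct-field star
		}
		prefix := qualifier[0]
		r := lookup(scope, qualifier[1])
		if r == nil || r.ValueTable {
			return nil, errFailClosed
		}
		// The written prefix must match a NON-EMPTY schema or database.
		if (r.Schema != "" && strings.EqualFold(r.Schema, prefix)) ||
			(r.Database != "" && strings.EqualFold(r.Database, prefix)) {
			return r.Columns, nil
		}
	}
	return nil, errFailClosed
}

func main() {
	scope := []relation{
		{Name: "d", Columns: []string{"id", "s"}},
		{Name: "t", Database: "ds", Columns: []string{"a", "b"}},
		{Name: "elem", ValueTable: true, Columns: []string{"elem"}},
	}
	fmt.Println(resolveDotStarSketch([]string{"d"}, scope))
	fmt.Println(resolveDotStarSketch([]string{"d", "s"}, scope))
	fmt.Println(resolveDotStarSketch([]string{"ds", "t"}, scope))
}
```

Every path that does not positively resolve ends in the same error, so a shape the resolver has never seen cannot silently produce zero output columns.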
h3n4l added a commit to bytebase/bytebase that referenced this pull request on Jun 10, 2026:
Drop the local-development replace directive and pin github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade relation-projection resolution for googlesql).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
h3n4l added a commit to bytebase/bytebase that referenced this pull request on Jun 11, 2026:
…0569)
* feat(bigquery): cut query-span/split/diagnose/query-type to omni googlesql
BigQuery plugin delegates to the omni hand-written parser + masking-grade analysis (replace directive pins the local omni-harden during validation): split/diagnose/classify via omni parser+diagnostics+analysis; the query-span extractor consumes resolved relation projections (StarSegments with ExceptColumns for USING-coalesced keys; StarMerge; deferred SetOpMerge with BY NAME name-merge honoring MatchColumns) and expands base-table stars via catalog metadata. Legacy ANTLR import removed from the plugin.
54-case legacy-recorded differential corpus green (goldens recorded from the legacy resolver, never hand-faked); leak-pin unit tests cover shapes the legacy resolver cannot record (lowercase-USING coalesce, UNNEST lineage, BY NAME merges) per the structural rule: correct lineage or fail closed, never silent under-attribution.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(spanner): cut query-span/split/diagnose/query-type to omni googlesql
Spanner plugin delegates to the omni hand-written parser + masking-grade analysis (DialectSpanner), mirroring the BigQuery cutover in this branch:
- split.go: omni block-aware splitter with the legacy parse-tree splitter's conventions (contiguous Text from the previous statement's end through the trailing ';', positions via the shared byte-offset mapper).
- dignose.go / query_type.go: omni diagnostics + classifier; the legacy spanner SET->Select special case re-applied.
- query_span_extractor.go: the Spanner name model (named schemas under one database; the db part of db.schema.table ignored like legacy), system-only queries early-return an EMPTY SelectInfoSchema span and mixed user+system rejects (exact legacy behavior; SPANNER_SYS included), predicate columns empty (legacy parity), star-derived USING keys in metadata case.
- spanner.go (legacy ANTLR wrapper) deleted; legacy import count is 0.
Validated against the 52-case legacy-resolver differential corpus (recorded FROM the legacy spanner resolver; two ON+USING cases reshaped so both resolvers agree, the lowercase-USING case moved to a leak pin, since legacy's case-fold non-coalesce is a positional masking leak omni fixes). Leak-pin unit tests cover legacy-unrecordable shapes: lowercase-USING coalesce, UNNEST lineage, BY NAME merge, schema-qualified dot-star (legacy errored; omni resolves with schema-qualified lineage), mixed/system-only handling.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* chore: pin omni to the merged googlesql masking-grade analysis
Drop the local-development replace directive and pin github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade relation-projection resolution for googlesql).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor(googlesql): extract the shared engine implementation
The bigquery and spanner plugins were structural copies (SonarCloud: 73% duplication on new code). Extract the omni-backed implementation into backend/plugin/parser/googlesql (the query-span extractor, splitter, diagnostics adapter, and query-type mapping), parameterized by a Config carrying the documented dialect deltas (name model, system-table handling, SET classification, naming/split conventions). Each engine package shrinks to a registration wrapper plus its dialect Config; the shared test harness moves to googlesql/googlesqltest.
Besides the duplication gate, one shared code path is the right shape for masking-critical lineage: the two copies could otherwise drift independently. Behavior is unchanged: both legacy-recorded differential corpora (54 + 52 cases) and all leak-pin tests pass identically through the shared implementation.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor(googlesql): quality-gate round: thread ctx, drop dead code, dedupe
- Thread context.Context as a parameter through the metadata-touching methods instead of storing it on the extractor (SonarCloud S8242).
- Remove the orderedColumns machinery that became dead once stars expand via omni StarSegments (golangci-lint unused-parameter), make applyStarModifiers a plain function, and move to slices.Sort/SortFunc/ContainsFunc.
- Collapse the per-engine wrappers onto googlesql.Register (registers split/diagnose/query-span and returns the handlers for the engine tests) and unify the table/view metadata lookup into a single lookup loop, removing the remaining duplication SonarCloud paired between the two plugin files and against the Trino extractor this implementation was templated from.
Behavior unchanged: both legacy-recorded differential corpora and all leak-pin tests pass identically.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(googlesql): address cloud review: fail closed on metadata outages, BigQuery SET
- expandTablesToColumns: a metadata error that is NOT ResourceNotFoundError (a store outage, a canceled context) now fails the span FATALLY instead of degrading to a table-level fallback with silently-empty result lineage and no NotFoundError. The fail-open positional masker would have returned sensitive data unmasked after a transient infrastructure failure. NotFound still degrades non-fatally (recorded as span.NotFoundError, which the masking layer rejects). Star-expansion lookups only touch tables this pass already validated (omni records every physical base table into AccessTables), so the single fatal gate covers the lineage paths.
- BigQuery config: SetStatementIsSelect. The legacy BigQuery listener also classified SET as Select ("treat SAFE SET as select"); without it omni's Unknown would be rejected by the new-ACL access check, regressing read-only SET statements.
- googlesql.Register returns a Handlers struct so each engine re-exports SplitSQL/GetQuerySpan with its own declaration (revive: exported).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
googlesql/analysis: masking-grade relation-projection resolution
Hardens the GoogleSQL query-span analysis from best-effort lineage to MASKING-GRADE — the precision the bytebase BigQuery+Spanner cutover needs (bytebase's masker applies per-output-column maskers positionally and fails OPEN, so under-attributed lineage returns sensitive columns unmasked).
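
A toy model of a positional, fail-open masker shows why under-attributed lineage is dangerous. This is not bytebase's masker; `masker`, `redact`, and `applyPositional` are hypothetical names, and the only behavior assumed is the one stated above: maskers are applied by output position, and a position with no masker passes through.

```go
package main

import "fmt"

// masker redacts a value; a nil masker means "no masking required".
type masker func(string) string

// redact is a trivial masker used for illustration.
func redact(string) string { return "******" }

// applyPositional applies maskers[i] to row[i]. A column with no masker at
// its position passes through unchanged: this is the fail-open behavior.
func applyPositional(row []string, maskers []masker) []string {
	out := make([]string, len(row))
	for i, v := range row {
		if i < len(maskers) && maskers[i] != nil {
			out[i] = maskers[i](v)
		} else {
			out[i] = v
		}
	}
	return out
}

func main() {
	row := []string{"1", "alice@example.com"}
	// Correct lineage: the second column is known to be sensitive.
	fmt.Println(applyPositional(row, []masker{nil, redact}))
	// Empty lineage: no maskers at all, so nothing is masked.
	fmt.Println(applyPositional(row, nil))
}
```

In this model an empty or shifted masker list is indistinguishable from "nothing is sensitive", so the analysis has to return an error whenever it cannot produce a complete, correctly ordered projection.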
What's in it
- Relation projections (projection.go): every FROM relation (base table / CTE / derived subquery / join / UNNEST) resolves to its ordered output projection with base-column lineage. A `*` / `rel.*` / column over a CTE or derived relation reproduces that relation's RESOLVED projection (a subset-projecting CTE's star surfaces only its projected columns, never the base table's unprojected sensitive ones).
- Star and set-op markers travel on the wire (`StarSegment{ExceptColumns}` / `StarMerge` / `SetOpMerge{ByName, MatchColumns}`); recursive CTEs resolve to a lineage fixpoint; `SELECT * EXCEPT/REPLACE` carries its modifiers; `JOIN USING` coalesces keys case-insensitively (the legacy resolver's lowercase keys never coalesced, a positional masking leak); UNNEST elements carry the array argument's lineage; BY NAME / CORRESPONDING merges by column name.
- Shapes omni cannot enumerate (`d.s.*`, unresolvable qualified stars, value-table `elem.*`) return an ERROR instead of silently-empty results: an error fail-closes masking; empty results fail open.
Validation
googlesql suite + vet + the live Spanner-emulator oracle differential green.
The bytebase consumer side (plugin/parser/{bigquery,spanner} cutover) follows as a separate bytebase PR pinned to this.
🤖 Generated with Claude Code