
feat(googlesql/analysis): masking-grade relation-projection resolution#299

Merged
h3n4l merged 4 commits into main from feat/googlesql-analysis-harden
Jun 10, 2026
@h3n4l h3n4l commented Jun 10, 2026

Member

googlesql/analysis: masking-grade relation-projection resolution

Hardens the GoogleSQL query-span analysis from best-effort lineage to MASKING-GRADE — the precision the bytebase BigQuery+Spanner cutover needs (bytebase's masker applies per-output-column maskers positionally and fails OPEN, so under-attributed lineage returns sensitive columns unmasked).

What's in it

  • Relation-projection resolution (projection.go): every FROM relation (base table / CTE / derived subquery / join / UNNEST) resolves to its ordered output projection with base-column lineage. A */rel.*/column over a CTE or derived relation reproduces that relation's RESOLVED projection (a subset-projecting CTE's star surfaces only its projected columns — never the base table's unprojected sensitive ones).
  • Set operations position-merge resolved projections — including star arms — via wire markers the catalog-aware consumer expands (StarSegment{ExceptColumns} / StarMerge / SetOpMerge{ByName, MatchColumns}); recursive CTEs resolve to a lineage fixpoint; SELECT * EXCEPT/REPLACE carries its modifiers; JOIN USING coalesces keys case-insensitively (the legacy resolver's lowercase keys never coalesced — a positional masking leak); UNNEST elements carry the array argument's lineage; BY NAME / CORRESPONDING merges by column name.
  • Fail-closed structural rule: shapes that cannot be resolved to correct lineage (NATURAL JOIN, struct-field stars d.s.*, unresolvable qualified stars, value-table elem.*) return an ERROR instead of silently-empty results — an error fail-closes masking; empty results fail open.
  • An unqualified column with exactly one base relation in scope is attributed to it (kills cross-table name-collision over-matching in multi-arm set-ops).
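The fail-open versus fail-closed distinction above can be illustrated with a minimal sketch. All names here (`Masker`, `applyPositional`, `maskRow`, `errUnresolvable`) are hypothetical and are not taken from omni or bytebase; the sketch only assumes that a positional masker leaves a column untouched when no masker is registered for its position.

```go
package main

import (
	"errors"
	"fmt"
)

// Masker is a hypothetical per-column masking function.
type Masker func(string) string

// redact is a stand-in masker that hides the whole value.
func redact(string) string { return "******" }

// applyPositional mimics a fail-open positional masker: output column i is
// masked only when maskers[i] exists and is non-nil.
func applyPositional(row []string, maskers []Masker) []string {
	out := make([]string, len(row))
	for i, v := range row {
		if i < len(maskers) && maskers[i] != nil {
			out[i] = maskers[i](v)
		} else {
			out[i] = v // no masker for this position: the raw value passes through
		}
	}
	return out
}

// errUnresolvable stands in for the error a resolver returns on a shape it
// cannot attribute correctly.
var errUnresolvable = errors.New("projection cannot be resolved")

// maskRow fails closed: a resolution error blocks the row entirely instead of
// letting an empty masker list reach applyPositional.
func maskRow(row []string, maskers []Masker, resolveErr error) ([]string, error) {
	if resolveErr != nil {
		return nil, fmt.Errorf("masking refused: %w", resolveErr)
	}
	return applyPositional(row, maskers), nil
}

func main() {
	row := []string{"alice", "123-45-6789"}
	fmt.Println(applyPositional(row, []Masker{nil, redact})) // aligned lineage
	fmt.Println(applyPositional(row, nil))                   // empty lineage leaks the row
	_, err := maskRow(row, nil, errUnresolvable)
	fmt.Println(err) // an error stops the query before anything is returned
}
```

With an empty masker list the second call returns the row unchanged, which is the leak an empty result would cause; the error path returns no row at all.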

Validation

  • Legacy-resolver differential corpora in bytebase (goldens RECORDED from the legacy ANTLR resolvers, never hand-authored): bigquery 54/54, spanner 52/52 — plus leak-pin unit tests for shapes the legacy resolver cannot record (it errors on them).
  • 5 independent Codex review rounds (find → fix → re-verify loop): 8 → 3 → 1 → 0 findings on the bigquery half; 1 P0 + 2 refinements on the spanner half — every finding fixed + pinned or defended with ground truth.
  • Full googlesql suite + vet + the live Spanner-emulator oracle differential green.

The bytebase consumer side (plugin/parser/{bigquery,spanner} cutover) follows as a separate bytebase PR pinned to this.

🤖 Generated with Claude Code

h3n4l and others added 4 commits June 10, 2026 12:32
Resolve each FROM relation (base table / CTE / derived subquery / join /
UNNEST) to its ordered output projection with base-column lineage, so query
spans drive bytebase data masking faithfully:

- projection.go: the relation-projection resolver — CTE/derived projections
  retained and reproduced by `*`/qualified refs; set-op position-merge with
  star markers (baseStar / starGroup EXCEPT-REPLACE / StarMerge / deferred
  SetOpMerge); BY NAME / CORRESPONDING name-merge (mergeProjectionsByName);
  recursive-CTE lineage fixpoint; JOIN USING key coalescing (case-insensitive
  — the legacy resolver's lowercase keys never coalesced, a masking leak);
  NATURAL JOIN fails closed.
- query_span.go: spanWalker builds relations per FROM source; UNNEST yields a
  relation whose element column carries the array argument's lineage; wire
  types StarSegment{ExceptColumns}/StarMergeInfo/SetOpMergeInfo{ByName,
  MatchColumns} let the metadata-aware consumer finish base-table star
  expansion without re-deriving lineage; CTEReferences recorded separately
  from AccessTables.
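As a rough illustration of the set-op position-merge described above, the following sketch unions the lineage of each arm position by position. The `column` type and `mergeByPosition` function are hypothetical and do not reproduce the code in projection.go; taking the output name from the left arm is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// column is a hypothetical resolved output column with its base-column lineage.
type column struct {
	Name    string
	Sources []string // base columns such as "users.email"
}

// mergeByPosition merges two set-operation arms: output column i carries the
// union of both arms' lineage at position i, named after the left arm.
func mergeByPosition(left, right []column) ([]column, error) {
	if len(left) != len(right) {
		return nil, fmt.Errorf("set operation arms differ in width: %d vs %d", len(left), len(right))
	}
	out := make([]column, len(left))
	for i := range left {
		seen := map[string]bool{}
		var merged []string
		all := append(append([]string{}, left[i].Sources...), right[i].Sources...)
		for _, s := range all {
			if !seen[s] {
				seen[s] = true
				merged = append(merged, s)
			}
		}
		sort.Strings(merged) // deterministic order for comparison in tests
		out[i] = column{Name: left[i].Name, Sources: merged}
	}
	return out, nil
}

func main() {
	left := []column{
		{Name: "id", Sources: []string{"users.id"}},
		{Name: "contact", Sources: []string{"users.email"}},
	}
	right := []column{
		{Name: "id", Sources: []string{"leads.id"}},
		{Name: "phone", Sources: []string{"leads.phone"}},
	}
	merged, err := mergeByPosition(left, right)
	fmt.Println(merged, err)
}
```

The second output column keeps the name `contact` but carries lineage from both `users.email` and `leads.phone`, so a masker on either base column applies to that position.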

Validated against a 54-case legacy-resolver differential corpus (bytebase
bigquery query-span goldens recorded FROM the legacy ANTLR resolver) plus
projection unit tests; three independent-review gate rounds (8 -> 3 -> 1
findings, all fixed and pinned).
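The recursive-CTE lineage fixpoint mentioned in this commit can be sketched as an iterate-until-stable loop. This is only an illustration under simplifying assumptions: `recColumn` and `lineageFixpoint` are hypothetical names, and the recursive arm is modeled as, per output position, the base columns it reads plus the positions of the CTE's own columns it reads.

```go
package main

import (
	"fmt"
	"sort"
)

// recColumn models one output column of the recursive arm (hypothetical shape).
type recColumn struct {
	Base []string // base columns read directly
	Self []int    // positions of the CTE's own columns this expression reads
}

// lineageFixpoint seeds each position with the anchor arm's lineage, then
// re-applies the recursive arm until no position gains a new source.
func lineageFixpoint(anchor [][]string, rec []recColumn) ([][]string, error) {
	if len(anchor) != len(rec) {
		return nil, fmt.Errorf("arm widths differ: %d vs %d", len(anchor), len(rec))
	}
	sets := make([]map[string]bool, len(anchor))
	for i, srcs := range anchor {
		sets[i] = map[string]bool{}
		for _, s := range srcs {
			sets[i][s] = true
		}
	}
	// add reports whether s was new for position i.
	add := func(i int, s string) bool {
		if sets[i][s] {
			return false
		}
		sets[i][s] = true
		return true
	}
	for changed := true; changed; {
		changed = false
		for i, col := range rec {
			for _, b := range col.Base {
				if add(i, b) {
					changed = true
				}
			}
			for _, j := range col.Self {
				if j < 0 || j >= len(sets) {
					return nil, fmt.Errorf("self reference %d out of range", j)
				}
				for s := range sets[j] {
					if add(i, s) {
						changed = true
					}
				}
			}
		}
	}
	out := make([][]string, len(sets))
	for i, set := range sets {
		for s := range set {
			out[i] = append(out[i], s)
		}
		sort.Strings(out[i])
	}
	return out, nil
}

func main() {
	anchor := [][]string{{"emp.id"}, {"emp.name"}}
	rec := []recColumn{{Self: []int{1}}, {Base: []string{"dept.name"}}}
	fmt.Println(lineageFixpoint(anchor, rec))
}
```

In this example the first position only learns about `dept.name` on the second pass, after the second position has absorbed it, which is why a single pass over the recursive arm would under-attribute.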

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t-star fail-closed, base-field names

Three additions surfaced by the Spanner legacy-differential corpus (the
bigquery corpus stays green; the resolver is shared):

- resolveColumn: an unqualified column with exactly ONE base relation in scope
  is attributed to that relation (the strict pass already ruled out concrete
  CTE/derived projections). Without the attribution the consumer matches the
  bare name across ALL expanded tables, over-including whenever another table
  shares the column name (a 3-way UNION whose arms' tables overlap) and
  diverging from the legacy resolver's exact single-table attribution.
- resolveDotStar: a multi-part qualified star (schema.table.*) matches the
  relation by its trailing part with the schema prefix verified; an
  UNRESOLVABLE wild path now FAILS CLOSED instead of silently yielding zero
  output columns (zero result maskers = every output column unmasked). The
  legacy resolver errored on schema-qualified stars; resolving them is an
  improvement, and the fail-closed branch preserves the structural rule.
- BaseFieldName (projColumn -> ColumnInfo/StarSegment): marks the JOIN ...
  USING coalesced key as a base-table FIELD passthrough, so a consumer
  reproducing legacy naming renders it in the field's metadata case (the
  legacy resolver named it after the left PhysicalTable's field).

Existing projection tests updated to the table-attributed lineage (strictly
more precise; the corpus goldens — recorded from the legacy resolver — pin the
end-to-end behavior).
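The single-base-relation attribution in resolveColumn can be illustrated with a small sketch. The names `columnRef`, `attribute`, and `consumerMatch` are hypothetical; the consumer-side matching is an assumed simplification of how a bare column name would be matched across expanded tables.

```go
package main

import (
	"fmt"
	"sort"
)

// columnRef is a hypothetical lineage entry; Table is empty when the resolver
// could not attribute the column to a single relation.
type columnRef struct {
	Table  string
	Column string
}

// attribute assigns an unqualified column to the only base relation in scope,
// and leaves it unattributed otherwise.
func attribute(name string, baseInScope []string) columnRef {
	if len(baseInScope) == 1 {
		return columnRef{Table: baseInScope[0], Column: name}
	}
	return columnRef{Column: name}
}

// consumerMatch mimics a consumer that matches an unattributed name against
// every expanded table, but honors the table when one is given.
func consumerMatch(ref columnRef, catalog map[string][]string) []string {
	var hits []string
	for table, cols := range catalog {
		if ref.Table != "" && ref.Table != table {
			continue
		}
		for _, c := range cols {
			if c == ref.Column {
				hits = append(hits, table+"."+c)
			}
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	catalog := map[string][]string{
		"users":  {"id", "email"},
		"orders": {"id", "total"},
	}
	// One arm of a UNION reads only from users, so "id" is attributed to it.
	fmt.Println(consumerMatch(attribute("id", []string{"users"}), catalog))
	// Without attribution the bare name matches both tables.
	fmt.Println(consumerMatch(columnRef{Column: "id"}, catalog))
}
```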

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…losed

Gate-round P0: resolveDotStar's rel.* fast path fired on parts[0] for ANY path
length, so a struct-field star through a relation (`d.s.*`) returned the
relation's WHOLE projection — misaligning the positional masker (the first
result's masker lands on the struct's first sub-column → a sensitive column
returned unmasked). The fast path now requires a SINGLE-part qualifier;
a multi-part path with a leading relation is a struct star omni cannot
enumerate metadata-free and FAILS CLOSED (legacy errored there too). The
schema-qualified branch additionally requires the head NOT to name a relation.
Pinned by TestProjection_StructFieldStarFailsClosed (the gate's reproducer +
the plain t.s.* variant + both still-resolving shapes).
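A minimal sketch of the tightened fast path follows. `resolveStar` and `errStructStar` are hypothetical names, the relation map is an assumed stand-in for the resolver's scope, and schema-qualified resolution is deliberately left out of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errStructStar marks a star path omitted here because its sub-fields cannot
// be enumerated without catalog metadata.
var errStructStar = errors.New("struct-field star cannot be enumerated without metadata")

// resolveStar takes the path written before ".*" (["d"] for d.*, ["d", "s"]
// for d.s.*) and the projections of the relations in scope.
func resolveStar(path []string, relations map[string][]string) ([]string, error) {
	if len(path) == 0 {
		return nil, errors.New("empty star path")
	}
	projection, isRelation := relations[path[0]]
	switch {
	case isRelation && len(path) == 1:
		// Fast path: a single-part qualifier names the relation itself.
		return projection, nil
	case isRelation:
		// A longer path through a relation is a struct-field star: fail closed
		// rather than return the relation's whole projection.
		return nil, fmt.Errorf("%w: %v", errStructStar, path)
	default:
		// Schema-qualified stars are out of scope for this sketch.
		return nil, fmt.Errorf("unresolvable star path %v", path)
	}
}

func main() {
	relations := map[string][]string{"d": {"id", "s"}}
	fmt.Println(resolveStar([]string{"d"}, relations))
	fmt.Println(resolveStar([]string{"d", "s"}, relations))
}
```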

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…dings)

- A value-table (UNNEST) elem.* fails closed: the element's struct sub-fields
  expand to N engine output columns omni cannot enumerate — returning the
  relation's single projection column would shift every later position against
  the positional masker.
- The schema-qualified star branch requires the written prefix to match the
  relation's non-empty Schema OR Database/dataset qualifier: an unqualified
  FROM accepts no written prefix (the engine rejects the range variable; the
  legacy resolver errored on every schema-qualified star), and a BigQuery
  dataset-qualified star (ds.t.* FROM ds.t) now resolves via the Database
  bucket instead of the removed empty-schema acceptance.

Pinned by TestProjection_DotStarQualifierTightening.
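The qualifier rule above can be sketched as a small predicate. The `relation` struct and `matchQualifiedStar` are hypothetical, and the exact-string comparison is an assumption of this sketch rather than a statement about how omni compares identifiers.

```go
package main

import "fmt"

// relation is a hypothetical FROM source with its written qualifiers.
type relation struct {
	Database string // for BigQuery, the dataset qualifier
	Schema   string
	Name     string
}

// matchQualifiedStar reports whether a two-part star path (prefix, table)
// names the relation. The prefix must equal a non-empty Schema or Database,
// so an unqualified FROM accepts no written prefix.
func matchQualifiedStar(path []string, r relation) bool {
	if len(path) != 2 {
		return false
	}
	prefix, table := path[0], path[1]
	if table != r.Name {
		return false
	}
	if r.Schema != "" && prefix == r.Schema {
		return true
	}
	if r.Database != "" && prefix == r.Database {
		return true
	}
	return false
}

func main() {
	// ds.t.* FROM ds.t resolves through the Database qualifier.
	fmt.Println(matchQualifiedStar([]string{"ds", "t"}, relation{Database: "ds", Name: "t"}))
	// ds.t.* FROM t has nothing to verify the prefix against.
	fmt.Println(matchQualifiedStar([]string{"ds", "t"}, relation{Name: "t"}))
}
```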

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@h3n4l h3n4l merged commit 2e66c34 into main Jun 10, 2026
2 checks passed
h3n4l added a commit to bytebase/bytebase that referenced this pull request Jun 10, 2026
Drop the local-development replace directive and pin
github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade
relation-projection resolution for googlesql).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
h3n4l added a commit to bytebase/bytebase that referenced this pull request Jun 11, 2026
…0569)

* feat(bigquery): cut query-span/split/diagnose/query-type to omni googlesql

BigQuery plugin delegates to the omni hand-written parser + masking-grade
analysis (replace directive pins the local omni-harden during validation):
split/diagnose/classify via omni parser+diagnostics+analysis; the query-span
extractor consumes resolved relation projections (StarSegments with
ExceptColumns for USING-coalesced keys; StarMerge; deferred SetOpMerge with
BY NAME name-merge honoring MatchColumns) and expands base-table stars via
catalog metadata. Legacy ANTLR import removed from the plugin.

54-case legacy-recorded differential corpus green (goldens recorded from the
legacy resolver, never hand-faked); leak-pin unit tests cover shapes the
legacy resolver cannot record (lowercase-USING coalesce, UNNEST lineage,
BY NAME merges) per the structural rule: correct lineage or fail closed,
never silent under-attribution.
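The lowercase-USING coalesce case can be illustrated with the sketch below. The `column` type, `find`, and `joinUsing` are hypothetical; the output ordering (coalesced keys first, then the remaining left and right columns) and the choice to carry lineage from both sides on the key are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// column is a hypothetical output column with its base-column lineage.
type column struct {
	Name    string
	Sources []string
}

// find returns the index of the column named name, ignoring case, or -1.
func find(cols []column, name string) int {
	for i, c := range cols {
		if strings.EqualFold(c.Name, name) {
			return i
		}
	}
	return -1
}

// joinUsing builds the projection of "left JOIN right USING (keys...)":
// each key appears once, named after the left side's column.
func joinUsing(left, right []column, keys []string) ([]column, error) {
	isKey := func(name string) bool {
		for _, k := range keys {
			if strings.EqualFold(k, name) {
				return true
			}
		}
		return false
	}
	var out []column
	for _, k := range keys {
		l, r := find(left, k), find(right, k)
		if l < 0 || r < 0 {
			return nil, fmt.Errorf("USING key %q missing from one side", k)
		}
		srcs := append(append([]string{}, left[l].Sources...), right[r].Sources...)
		out = append(out, column{Name: left[l].Name, Sources: srcs})
	}
	for _, side := range [][]column{left, right} {
		for _, c := range side {
			if !isKey(c.Name) {
				out = append(out, c)
			}
		}
	}
	return out, nil
}

func main() {
	left := []column{{Name: "ID", Sources: []string{"a.ID"}}, {Name: "ssn", Sources: []string{"a.ssn"}}}
	right := []column{{Name: "id", Sources: []string{"b.id"}}, {Name: "note", Sources: []string{"b.note"}}}
	fmt.Println(joinUsing(left, right, []string{"id"}))
}
```

A case-sensitive key comparison would leave both `ID` and `id` in the output, producing four columns instead of three and shifting every later position against a positional masker.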

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(spanner): cut query-span/split/diagnose/query-type to omni googlesql

Spanner plugin delegates to the omni hand-written parser + masking-grade
analysis (DialectSpanner), mirroring the BigQuery cutover in this branch:

- split.go: omni block-aware splitter with the legacy parse-tree splitter's
  conventions (contiguous Text from the previous statement's end through the
  trailing ';', positions via the shared byte-offset mapper).
- dignose.go / query_type.go: omni diagnostics + classifier; the legacy
  spanner SET->Select special case re-applied.
- query_span_extractor.go: the Spanner name model (named schemas under one
  database; the db part of db.schema.table ignored like legacy), system-only
  queries early-return an EMPTY SelectInfoSchema span and mixed user+system
  rejects (exact legacy behavior; SPANNER_SYS included), predicate columns
  empty (legacy parity), star-derived USING keys in metadata case.
- spanner.go (legacy ANTLR wrapper) deleted; legacy import count is 0.

Validated against the 52-case legacy-resolver differential corpus (recorded
FROM the legacy spanner resolver; two ON+USING cases reshaped so both
resolvers agree, the lowercase-USING case moved to a leak pin — legacy's
case-fold non-coalesce is a positional masking leak omni fixes). Leak-pin
unit tests cover legacy-unrecordable shapes: lowercase-USING coalesce, UNNEST
lineage, BY NAME merge, schema-qualified dot-star (legacy errored; omni
resolves with schema-qualified lineage), mixed/system-only handling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: pin omni to the merged googlesql masking-grade analysis

Drop the local-development replace directive and pin
github.com/bytebase/omni to the bytebase/omni#299 merge (masking-grade
relation-projection resolution for googlesql).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(googlesql): extract the shared engine implementation

The bigquery and spanner plugins were structural copies (SonarCloud: 73%
duplication on new code). Extract the omni-backed implementation into
backend/plugin/parser/googlesql — the query-span extractor, splitter,
diagnostics adapter, and query-type mapping — parameterized by a Config
carrying the documented dialect deltas (name model, system-table handling,
SET classification, naming/split conventions). Each engine package shrinks to
a registration wrapper plus its dialect Config; the shared test harness moves
to googlesql/googlesqltest.

Besides the duplication gate, one shared code path is the right shape for
masking-critical lineage: the two copies could otherwise drift independently.
Behavior is unchanged — both legacy-recorded differential corpora (54 + 52
cases) and all leak-pin tests pass identically through the shared
implementation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(googlesql): quality-gate round — thread ctx, drop dead code, dedupe

- Thread context.Context as a parameter through the metadata-touching methods
  instead of storing it on the extractor (SonarCloud S8242).
- Remove the orderedColumns machinery that became dead once stars expand via
  omni StarSegments (golangci-lint unused-parameter), make applyStarModifiers
  a plain function, and move to slices.Sort/SortFunc/ContainsFunc.
- Collapse the per-engine wrappers onto googlesql.Register (registers split/
  diagnose/query-span and returns the handlers for the engine tests) and
  unify the table/view metadata lookup into a single lookup loop — removing
  the remaining duplication SonarCloud paired between the two plugin files
  and against the Trino extractor this implementation was templated from.

Behavior unchanged: both legacy-recorded differential corpora and all
leak-pin tests pass identically.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(googlesql): address cloud review — fail closed on metadata outages, BigQuery SET

- expandTablesToColumns: a metadata error that is NOT ResourceNotFoundError (a
  store outage, a canceled context) now fails the span FATALLY instead of
  degrading to a table-level fallback with silently-empty result lineage and no
  NotFoundError — the fail-open positional masker would have returned sensitive
  data unmasked after a transient infrastructure failure. NotFound still
  degrades non-fatally (recorded as span.NotFoundError, which the masking layer
  rejects). Star-expansion lookups only touch tables this pass already
  validated (omni records every physical base table into AccessTables), so the
  single fatal gate covers the lineage paths.
- BigQuery config: SetStatementIsSelect — the legacy BigQuery listener also
  classified SET as Select ("treat SAFE SET as select"); without it omni's
  Unknown would be rejected by the new-ACL access check, regressing read-only
  SET statements.
- googlesql.Register returns a Handlers struct so each engine re-exports
  SplitSQL/GetQuerySpan with its own declaration (revive: exported).
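The error triage described in the first bullet can be sketched as follows. Only the name `ResourceNotFoundError` comes from the commit message; its shape here, along with `span`, `lookupFunc`, `expandTables`, and the stub lookups, is hypothetical. The sketch also passes `context.Context` as a parameter, in line with the earlier quality-gate commit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ResourceNotFoundError is an assumed shape for the not-found error type.
type ResourceNotFoundError struct{ Resource string }

func (e *ResourceNotFoundError) Error() string { return "resource not found: " + e.Resource }

// span is a hypothetical query-span result.
type span struct {
	Columns       []string
	NotFoundError error // recorded, non-fatal; a masking layer can reject it
}

type lookupFunc func(ctx context.Context, table string) ([]string, error)

// expandTables degrades on not-found but fails the whole span on any other
// metadata error, so an outage never yields silently empty lineage.
func expandTables(ctx context.Context, tables []string, lookup lookupFunc) (*span, error) {
	s := &span{}
	for _, t := range tables {
		cols, err := lookup(ctx, t)
		var notFound *ResourceNotFoundError
		switch {
		case err == nil:
			s.Columns = append(s.Columns, cols...)
		case errors.As(err, &notFound):
			s.NotFoundError = err
		default:
			return nil, fmt.Errorf("metadata lookup for %q failed: %w", t, err)
		}
	}
	return s, nil
}

// missingLookup and outageLookup are stub metadata sources for the demo.
func missingLookup(_ context.Context, table string) ([]string, error) {
	return nil, &ResourceNotFoundError{Resource: table}
}

func outageLookup(_ context.Context, _ string) ([]string, error) {
	return nil, errors.New("metadata store unavailable")
}

func main() {
	ctx := context.Background()
	s, err := expandTables(ctx, []string{"t"}, missingLookup)
	fmt.Println(s.NotFoundError, err)
	_, err = expandTables(ctx, []string{"t"}, outageLookup)
	fmt.Println(err)
}
```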

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>