Comprehensive testing, Cassandra rebuild, Confluent conformance, and documentation overhaul #263
Closed
Conversation
Add 50 new tests across schema parsers and compatibility checker:
- Avro parser: deeply nested records, logical types, recursive types, records with defaults, PaymentEvent, namespaces, complex collections/unions
- Protobuf parser: deeply nested messages, complex maps, multiple top-level messages, PaymentEvent, proto3 optional, streaming services
- JSON Schema parser: cross-$ref, PaymentEvent, composition, deeply nested, conditional if/then/else, standalone non-object types
- Compatibility checker: all 7 modes across 3 schema types, transitive chains, edge cases, ParseMode, 4-version evolution scenarios
Add a reusable storage conformance test suite with 108 test cases that can run against any storage backend via RunAll(t, factoryFunc) (a sketch of the entry point follows the list):
- Schema CRUD (25 tests)
- Subject operations (9 tests)
- Config and mode management (16 tests)
- Users and API keys (21 tests)
- Import and ID management (8 tests)
- Sentinel error verification (30 tests)
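A minimal sketch of what the RunAll entry point could look like. The Store and StoreFactory shapes, and the group helpers hinted at in comments, are assumptions for illustration, not the project's actual types:

```go
package conformance

import "testing"

// Store stands in for the project's storage interface (shape assumed).
type Store interface{ Close() error }

// StoreFactory yields a fresh backend instance for a test group.
type StoreFactory func(t *testing.T) Store

// RunAll drives the shared conformance cases against whichever backend the
// factory produces, so memory, PostgreSQL, MySQL, and Cassandra all face
// exactly the same assertions.
func RunAll(t *testing.T, factory StoreFactory) {
	t.Run("SchemaCRUD", func(t *testing.T) {
		store := factory(t)
		t.Cleanup(func() { _ = store.Close() })
		// ...shared schema CRUD assertions (25 cases in the real suite)...
	})
	// ...remaining groups: Subjects, ConfigAndMode, UsersAndAPIKeys,
	// ImportAndIDs, SentinelErrors...
}
```

A backend then opts in from its own TestXxxBackend by passing a factory that returns a fresh store for that backend.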
…(Phase 4) Set up godog BDD test framework with in-process and Docker-based modes:
- godog test runner with tag filtering (~@operational for in-process); see the sketch after this list
- Docker Compose split files (base + per-backend overrides)
- Webhook sidecar for Docker container control (kill, restart, pause/unpause)
- Backend config files (memory, postgres, mysql, cassandra)
- Step definitions: schema, import, mode, reference, infrastructure
- Fresh httptest server per scenario for isolation
- BDD_REGISTRY_URL/BDD_WEBHOOK_URL env vars for Docker-based runs
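A sketch of the runner wiring; the option values and the step-registration body are illustrative rather than the project's exact code:

```go
package bdd

import (
	"os"
	"testing"

	"github.com/cucumber/godog"
)

// InitializeScenario registers step definitions; the real project wires its
// schema, import, mode, reference, and infrastructure steps here.
func InitializeScenario(sc *godog.ScenarioContext) { /* steps registered here */ }

func TestFeatures(t *testing.T) {
	tags := "~@operational" // in-process run: skip Docker-only scenarios
	if os.Getenv("BDD_REGISTRY_URL") != "" {
		tags = "" // Docker-based run: include operational scenarios too
	}
	suite := godog.TestSuite{
		ScenarioInitializer: InitializeScenario,
		Options: &godog.Options{
			Format:   "pretty",
			Paths:    []string{"features"},
			Tags:     tags,
			TestingT: t,
		},
	}
	if suite.Run() != 0 {
		t.Fatal("BDD scenarios failed")
	}
}
```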
…ase 5) Comprehensive Gherkin features covering all API functionality:
- Schema types: Avro (15), Protobuf (14), JSON Schema (18) scenarios covering all type variants, nesting, collections, round-trips
- Compatibility modes: all 7 levels across 3 schema types, transitive 3-version chains, per-subject overrides, check endpoint
- Schema references: cross-subject Avro, internal JSON $ref
- Import: bulk import with ID preservation, all schema types
- Mode management: READWRITE/READONLY/IMPORT, per-subject isolation
- API errors: all Confluent error codes (40401-50001), invalid schemas
- Health/metadata: cluster ID, server version, contexts endpoint
- Configuration: global/per-subject, all 7 levels, delete/fallback
- Deletion: soft/permanent delete, version isolation, deleted=true
Docker-based operational tests requiring webhook sidecar infrastructure:
- Memory: data loss on restart, ID reset after restart (2 scenarios)
- PostgreSQL: persistence, health on DB kill, recovery, pause/unpause, ID consistency (5 scenarios)
- MySQL: persistence, recovery, pause/unpause (3 scenarios)
- Cassandra: persistence, recovery (longer timeouts), pause/unpause (3 scenarios)
Add per-backend BDD test targets to Makefile:
- test-bdd-memory, test-bdd-postgres, test-bdd-mysql, test-bdd-cassandra
- test-bdd-all (runs all backends sequentially)
- test-bdd-functional (functional only, skip operational)
- test-all (unit + conformance + BDD in-process)
Add tests/PROGRESS.md documenting the full test inventory and phase status.
Redesign the BDD test infrastructure to run the webhook process directly inside the schema registry container instead of as a separate sidecar. This fixes operational resilience tests on Podman/macOS, where Docker socket access is unavailable.

Key changes:
- Add Dockerfile.registry that builds the registry + webhook into a single container with an entrypoint managing both processes
- Rewrite all webhook scripts for PID-based process control (restart, stop, start, kill, pause, unpause) instead of Docker API calls
- Fix zombie process reaping: start the registry via an intermediate shell so it is reparented to tini (PID 1) for proper wait() handling
- Add include-command-output-in-response to hooks.json for synchronous webhook execution
- Redirect registry stdout to /proc/1/fd/1 to avoid blocking the webhook response pipe
- Add a 5s HTTP client timeout to TestContext for pause/unpause scenarios
- Fix cleanup between operational scenarios (ensureRegistryRunning)
- Fix memory store version counter reset on permanent delete
- Fix hardcoded schema IDs in feature files to use stored values
- Expand operational_memory.feature from 2 to 13 scenarios covering restart, stop/start, SIGKILL recovery, pause/unpause, config/mode reset, and multiple restart cycles

All 160 BDD scenarios pass (147 functional + 13 operational).
Add 5 BDD test jobs to the CI pipeline:
- bdd-functional-tests: in-process, no Docker, fast gate
- bdd-memory-tests: Docker Compose, functional + operational
- bdd-postgres-tests: Docker Compose, functional + operational
- bdd-mysql-tests: Docker Compose, functional + operational
- bdd-cassandra-tests: Docker Compose, functional + operational
Backend jobs depend on the functional tests passing first to avoid wasting resources when tests are fundamentally broken. Also trigger CI on feature/** branch pushes.
- Fix gofmt import ordering in bdd_test.go (stdlib before third-party)
- Fix MySQL healthcheck: use a TCP query instead of the socket-based ping, which passes against MySQL's temporary init server before the real server is ready. Add start_period and increase retries for CI runners.
- Fix Cassandra healthcheck: add start_period and increase the interval for slower CI runners.
- Fix start-service.sh: send SIGCONT to paused (SIGSTOP'd) processes so ensureRegistryRunning works after pause scenarios.
- MySQL: backtick-quote table names in TRUNCATE (schemas is reserved)
- Cassandra: add retry loop in entrypoint.sh for DB connection timing
- PostgreSQL: fix health check scenario to use waitForUnhealthy
- All backends: remove register-during-pause step that causes timeouts
- PostgreSQL: fix stored key mismatch (before_id → schema_id)
Root cause: gocql CreateSession() fails with "no connections were made" when cluster.Keyspace is set but the keyspace doesn't exist. The regular CI pre-creates the keyspace before running tests, but the BDD Docker Compose didn't.

Fix: the Cassandra healthcheck now creates the keyspace (idempotent) so it exists before the registry starts. Also add start_period: 90s to the schema-registry healthcheck to give the entrypoint retry loop enough time for slow-starting backends.
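To illustrate the failure mode in Go terms (the shipped fix lives in the Docker healthcheck; the keyspace name and replication settings below are placeholders): a session opened without cluster.Keyspace can create the keyspace idempotently before anything connects to it directly.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

// ensureKeyspace connects without cluster.Keyspace set (pointing Keyspace at a
// missing keyspace is what triggers "no connections were made") and issues an
// idempotent CREATE KEYSPACE, mirroring what the healthcheck does.
func ensureKeyspace(host, keyspace string) error {
	cluster := gocql.NewCluster(host)
	session, err := cluster.CreateSession()
	if err != nil {
		return err
	}
	defer session.Close()

	return session.Query(`CREATE KEYSPACE IF NOT EXISTS ` + keyspace +
		` WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}`).Exec()
}

func main() {
	if err := ensureKeyspace("127.0.0.1", "schema_registry"); err != nil {
		log.Fatal(err)
	}
}
```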
- GetSchemaBySubjectVersion: return ErrVersionNotFound for deleted versions
- GetSchemasBySubject: return ErrSubjectNotFound when the subject has no versions
- DeleteSubject: return ErrSubjectNotFound when the subject doesn't exist
- GetSubjectsBySchemaID: validate the schema ID exists before scanning subjects
- GetVersionsBySchemaID: validate the schema ID exists before scanning subjects

These bugs were uncovered by BDD tests running against the Cassandra backend; the memory backend already handled these cases correctly.
Add PostgreSQL, MySQL, and Cassandra conformance tests that run the same ~100 tests against each backend, ensuring identical Storage interface behavior. Add storage-conformance CI job.
Run PostgreSQL, MySQL, and Cassandra conformance tests as independent CI jobs so they execute in parallel rather than serially.
Conformance jobs now depend on postgres-tests, mysql-tests, and cassandra-tests so they only run once all integration tests succeed.
All four conformance backends now appear as separate CI jobs.
GetSchemaByID calls GetVersionsBySchemaID, which was calling GetSchemaByID to validate schema existence, creating infinite recursion. Replace with direct schema_by_id table query in both GetSubjectsBySchemaID and GetVersionsBySchemaID.
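A sketch of the shape of the fix, assuming the store holds a gocql session and using the schema_by_id table named above (column details are assumptions): existence is validated with a direct read rather than a call back into GetSchemaByID.

```go
package cassandra

import (
	"context"
	"errors"

	"github.com/gocql/gocql"
)

// CassandraStore is trimmed down to what this sketch needs.
type CassandraStore struct{ session *gocql.Session }

// schemaIDExists validates existence with a direct single-partition read
// instead of calling GetSchemaByID, which would recurse back into the caller.
func (s *CassandraStore) schemaIDExists(ctx context.Context, id int64) (bool, error) {
	var found int64
	err := s.session.Query(
		`SELECT schema_id FROM schema_by_id WHERE schema_id = ? LIMIT 1`, id,
	).WithContext(ctx).Scan(&found)
	if errors.Is(err, gocql.ErrNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```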
Each sub-test calls defer store.Close(). For DB backends sharing a single connection, this killed the connection after the first test. Wrap shared stores with noCloseStore so Close() is a no-op in sub-tests; the real Close() happens in the parent TestXxxBackend.
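The wrapper itself is tiny; a sketch assuming the storage interface exposes Close() error:

```go
package conformance

// Storage stands in for the project's storage interface in this sketch.
type Storage interface{ Close() error }

// noCloseStore lets sub-tests call Close() freely without tearing down the
// single shared DB connection; the real Close() happens once in the parent
// TestXxxBackend.
type noCloseStore struct{ Storage }

func (noCloseStore) Close() error { return nil }
```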
- G704 (SSRF): the admin CLI uses the user-provided --server flag, which is not tainted input
- G705 (XSS): schema content comes from storage, and the response uses the registry content type
- G117 (secret): OIDC config struct field, not a hardcoded secret
G117 flags all config struct fields named Password/Secret — these are legitimate config structs, not hardcoded secrets. G202 flags parameterized SQL query building using $N placeholders. Both are false positives introduced by a newer gosec version.
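For context, this is the usual in-line shape of such a suppression; whether the project silences these via #nosec annotations or a rule exclude list in its gosec configuration is not shown in this commit, and the struct below is illustrative only:

```go
package config

// Config fields like these trip the secret-detection rule even though they
// only name where a secret comes from; the annotation records the rationale.
type OIDCConfig struct {
	ClientID     string `yaml:"client_id"`
	ClientSecret string `yaml:"client_secret"` // #nosec G117 -- config field, not a hardcoded credential
}
```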
PostgreSQL/MySQL fixes:
- Fix column name typo 'schema' -> 'schema_text' in GetSchemaByGlobalFingerprint
- Fix missing backticks on reserved table name in MySQL GetSchemaByFingerprint
- Fix SubjectExists to filter deleted rows (PostgreSQL)
- Fix GetSchemasBySubject to return empty slice vs ErrSubjectNotFound when subject exists but all versions are soft-deleted
- Fix ListSchemas LatestOnly query args mismatch
- Re-insert default global config/mode after table truncation

Cassandra fixes:
- Fix DeleteConfig/DeleteMode to return ErrNotFound when key doesn't exist
- Fix SubjectExists to check for non-deleted versions
- Fix GetLatestSchema to skip soft-deleted versions
- Fix GetSchemasBySubject to handle includeDeleted correctly
- Fix ListUsers to sort by ID
- Fix UpdateUser to detect duplicate usernames
- Fix CreateAPIKey to detect duplicate hashes
- Add reference tracking (schema_references + references_by_target) in CreateSchema

Conformance test fixes:
- Create users before API keys in auth tests (FK constraint compliance)
- Adjust schema dedup tests to work with all backends
PostgreSQL/MySQL:
- Fix UpdateAPIKey to include key_hash in UPDATE statement
- Fix UpdateAPIKeyLastUsed to check RowsAffected and return ErrAPIKeyNotFound

MySQL:
- Add id_alloc table for sequential NextID/SetNextID (replaces AUTO_INCREMENT read)
- Fix NextID off-by-one: use atomic SELECT FOR UPDATE + UPDATE on id_alloc

Cassandra:
- Fix CreateSchema to return ErrSchemaExists for duplicate fingerprint in same subject
- Use user-provided fingerprint when set (matches PostgreSQL/MySQL behavior)
- Fix GetSchemaBySubjectVersion to distinguish ErrSubjectNotFound vs ErrVersionNotFound
- Fix DeleteSchema to check existence before delete with proper error types

Tests:
- Fix error_tests.go: create users before API keys (FK constraint)
- Set valid ExpiresAt on API keys for MySQL compatibility
- Add id_alloc to MySQL truncation and re-initialization
Match PostgreSQL behavior: only count non-deleted schemas when checking if a subject exists.
…rt references
- GetSchemaByFingerprint: build the result directly instead of calling GetSchemaBySubjectVersion, which rejects deleted versions even when includeDeleted=true
- ImportSchema: write references to both schema_references and references_by_target tables, matching CreateSchema behavior
Add comprehensive handler-level unit tests:
- handlers_test.go: ~65 tests covering schema, subject, config, mode, and compatibility endpoints
- admin_test.go: ~40 tests covering user and API key admin endpoints
- account_test.go: ~9 tests covering self-service account endpoints
Total: 119 handler tests covering request parsing, response format, error codes, and Confluent API compatibility.
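These tests follow the standard net/http/httptest pattern; a sketch in which newTestRouter and the exact error mapping are assumptions rather than the project's code:

```go
package handlers_test

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
)

func TestRegisterSchema_InvalidSchema(t *testing.T) {
	// newTestRouter: hypothetical helper building the real router over the memory store.
	srv := httptest.NewServer(newTestRouter(t))
	defer srv.Close()

	body := `{"schema": "not valid avro", "schemaType": "AVRO"}`
	resp, err := http.Post(srv.URL+"/subjects/payments-value/versions",
		"application/vnd.schemaregistry.v1+json", strings.NewReader(body))
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	// Confluent maps invalid schemas to HTTP 422 with error_code 42201.
	if resp.StatusCode != http.StatusUnprocessableEntity {
		t.Fatalf("want 422, got %d", resp.StatusCode)
	}
}
```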
Schema references are a first-class Confluent feature (since Platform 5.5) but were not being resolved: any schema using cross-subject references would fail to parse, breaking Confluent compatibility.

Changes:
- Add Schema field to storage.Reference for resolved content
- Add resolveReferences() to the registry layer, wired into all Parse and compatibility check call sites (a simplified sketch follows this list)
- Avro parser: use avro.ParseWithCache to pre-register referenced named types
- JSON Schema parser: use compiler.AddResource for external $ref
- Protobuf resolver: store actual reference content for imports
- Add SchemaWithRefs type to the compatibility interface so checkers can parse schemas that have cross-subject references
- Avro checker: parse with the reference cache
- Protobuf checker: replace simpleResolver with checkerResolver that handles references and well-known types
- Add cross-subject reference tests for all three parser types
- Update all compatibility checker tests for the new interface
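A simplified sketch of that resolution step; the type shapes and lookup signature are assumptions standing in for the project's storage.Reference and registry types:

```go
package registry

import (
	"context"
	"fmt"
)

// Reference is a stand-in for storage.Reference; the Schema field is the one
// this PR adds to carry resolved content.
type Reference struct {
	Name    string
	Subject string
	Version int
	Schema  string
}

type refLookup interface {
	GetSchemaBySubjectVersion(ctx context.Context, subject string, version int, includeDeleted bool) (string, error)
}

// resolveReferences fetches each pinned subject/version and attaches its
// content so parsers and compatibility checkers can see referenced types.
func resolveReferences(ctx context.Context, store refLookup, refs []Reference) ([]Reference, error) {
	out := make([]Reference, 0, len(refs))
	for _, ref := range refs {
		schema, err := store.GetSchemaBySubjectVersion(ctx, ref.Subject, ref.Version, false)
		if err != nil {
			return nil, fmt.Errorf("resolve reference %q (%s v%d): %w", ref.Name, ref.Subject, ref.Version, err)
		}
		ref.Schema = schema
		out = append(out, ref)
	}
	return out, nil
}
```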
…rites, and block-based IDs

Replace RDBMS-style patterns with Cassandra-native approaches:
- Add SAI indexes on subject_versions (schema_id, deleted) and schemas_by_id (fingerprint), eliminating the schemas_by_fingerprint and subjects tables
- Batch reference writes in CreateSchema/ImportSchema with logged batches
- Batch soft-deletes in DeleteSubject with an unlogged batch (same partition)
- Block-based ID allocation (default block size 50) reduces LWT frequency ~50x (see the sketch after this list)
- IN-clause batch reads in GetSchemasBySubject (2N+1 → 3 queries)
- SAI queries replace O(S×V) full-table scans in GetSubjectsBySchemaID, GetVersionsBySchemaID, cleanupOrphanedSchema, findSchemaInSubject, etc.
- Propagate errors in cleanup methods via slog.Warn instead of silently discarding them
- Update the conformance test to remove dropped tables from the truncation list

Requires Cassandra 5.0+ for SAI support. Breaking change: drops legacy tables. All 1353 BDD tests pass against Cassandra.
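A sketch of the block-allocation idea; table and column names are assumptions that follow the id_alloc naming used elsewhere in this PR, and the real implementation may differ:

```go
package cassandra

import (
	"context"
	"sync"

	"github.com/gocql/gocql"
)

// idAllocator serves IDs from an in-memory block and only touches Cassandra
// (with a lightweight transaction) when the block is exhausted, cutting LWT
// round-trips by roughly the block size.
type idAllocator struct {
	mu      sync.Mutex
	session *gocql.Session
	next    int64 // next unissued ID in the current block
	limit   int64 // first ID beyond the current block
	block   int64 // block size, default 50
}

func (a *idAllocator) NextID(ctx context.Context) (int64, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for a.next >= a.limit {
		var cur int64
		if err := a.session.Query(
			`SELECT next_id FROM id_alloc WHERE name = 'schema_id'`,
		).WithContext(ctx).Scan(&cur); err != nil {
			return 0, err
		}
		// Claim [cur, cur+block) via compare-and-set; loop again if another node won.
		applied, err := a.session.Query(
			`UPDATE id_alloc SET next_id = ? WHERE name = 'schema_id' IF next_id = ?`,
			cur+a.block, cur,
		).WithContext(ctx).ScanCAS(&cur)
		if err != nil {
			return 0, err
		}
		if applied {
			a.next, a.limit = cur, cur+a.block
		}
	}
	id := a.next
	a.next++
	return id, nil
}
```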
The Cassandra storage layer now requires SAI (Storage Attached Index) which was introduced in Cassandra 5.0.
…d tables from cleanup
- Re-check findSchemaInSubject on CAS retry to detect concurrent registrations of the same schema (fixes TestSchemaIdempotency)
- Remove schemas_by_fingerprint and subjects from the BDD truncation list (tables were dropped in the SAI migration)
Block-based ID allocator caches IDs in-process, but GetMaxSchemaID reads from id_alloc table. After truncation, the table is empty and GetMaxSchemaID fails, causing fetchMaxId responses to omit maxId.
gocql sessions are expensive to create (~500-1000ms each due to topology discovery and connection pool setup). Previously, each BDD scenario cleanup created and closed a new session, adding significant overhead across 1355 scenarios. Now we lazily create a single long-lived session and reuse it for all cleanup operations.
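The pattern is essentially sync.Once around CreateSession; a sketch with assumed names:

```go
package bdd

import (
	"sync"
	"time"

	"github.com/gocql/gocql"
)

var (
	cleanupOnce    sync.Once
	cleanupSession *gocql.Session
	cleanupErr     error
)

// getCleanupSession creates the gocql session once (paying the ~0.5-1s
// topology-discovery cost a single time) and hands the same session to every
// scenario cleanup afterwards.
func getCleanupSession(hosts []string, keyspace string) (*gocql.Session, error) {
	cleanupOnce.Do(func() {
		cluster := gocql.NewCluster(hosts...)
		cluster.Keyspace = keyspace
		cluster.Timeout = 10 * time.Second
		cleanupSession, cleanupErr = cluster.CreateSession()
	})
	return cleanupSession, cleanupErr
}
```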
All 4 optimization phases are complete and CI-verified (23/23 green).
…s, OpenAPI spec, and bug fixes
- Makefile: 16 test targets (test-unit, test-bdd, test-integration, test-conformance, test-concurrency, test-migration, test-api, test-ldap, test-vault, test-oidc, test-auth, test-compatibility) with BACKEND= variable support and auto-detected container runtime (docker/podman)
- Helper scripts: start-db.sh, stop-db.sh, setup-ldap.sh, setup-vault.sh, setup-oidc.sh for Docker lifecycle management with sr-test-* container naming
- OpenAPI spec: complete 3100+ line spec with embedded serving at the /docs endpoint
- Fix LDAP bootstrap.ldif: reorder users before groups so the memberOf overlay works
- Fix migrate-from-confluent.sh: empty array expansion with set -u, container networking
- Fix concurrency test port conflict: 18081 → 28181 to avoid a BDD container collision
- Fix migration test: dedicated container network for Podman macOS compatibility
Replace the SAI-based fingerprint dedup in ensureGlobalSchema with a Lightweight Transaction (INSERT IF NOT EXISTS) on a new schema_fingerprints table where fingerprint is the partition key.

The previous approach used an eventually-consistent SAI index on schemas_by_id (where schema_id is the PK) to detect duplicate fingerprints. Under concurrent registration of the same schema, two writers could both miss each other's SAI entries, allocate different schema_ids, and create duplicate global schemas, causing TestSchemaIdempotency failures.

The new schema_fingerprints table provides a true CAS: exactly one writer wins the fingerprint claim and all others receive the winning schema_id in the LWT response. An ensureSchemaData helper handles crash recovery (fingerprint claimed but schemas_by_id data missing) by inserting the data on the next request. Also updates ImportSchema to claim fingerprints for consistency, and adds a migration backfill step that populates schema_fingerprints from existing schemas_by_id data for production upgrades.
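A sketch of the claim itself, with fingerprint as the partition key as described above; the column types and the surrounding recovery and backfill logic are simplified assumptions:

```go
package cassandra

import (
	"context"
	"fmt"

	"github.com/gocql/gocql"
)

// claimFingerprint attempts INSERT ... IF NOT EXISTS on schema_fingerprints.
// Exactly one writer gets applied=true; every other writer reads the winning
// schema_id straight out of the LWT response and reuses it.
func claimFingerprint(ctx context.Context, session *gocql.Session, fingerprint string, schemaID int64) (int64, bool, error) {
	prev := map[string]interface{}{}
	applied, err := session.Query(
		`INSERT INTO schema_fingerprints (fingerprint, schema_id) VALUES (?, ?) IF NOT EXISTS`,
		fingerprint, schemaID,
	).WithContext(ctx).MapScanCAS(prev)
	if err != nil {
		return 0, false, err
	}
	if applied {
		return schemaID, true, nil // we own this fingerprint
	}
	winner, ok := prev["schema_id"].(int64) // another writer won the race
	if !ok {
		return 0, false, fmt.Errorf("unexpected schema_id type %T in LWT response", prev["schema_id"])
	}
	return winner, false, nil
}
```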
Import mode preserves external IDs, so the same schema content can legitimately have different IDs across subjects/imports. The fingerprint LWT claim should not reject these — it's for CreateSchema dedup only. Also add schema_fingerprints to BDD Cassandra cleanup truncation list.
The schema_fingerprints table may not exist if the schema-registry hasn't finished migrating when the first BDD scenario cleanup runs. Handle the "unconfigured table" error gracefully instead of failing hard, which was causing the BDD Cassandra tests to hang.
Remove orphaned tracking and analysis files that are no longer relevant as the work they tracked has been completed and merged.
Add 14 documentation guides covering all aspects of the registry:
- Getting started, installation, and configuration reference
- Storage backends (PostgreSQL, MySQL, Cassandra, memory)
- Schema types (Avro, Protobuf, JSON Schema) with references
- Compatibility modes, migration from Confluent, deployment
- Authentication (6 methods), security hardening, RBAC
- Monitoring (Prometheus metrics, alerting, Grafana)
- Development guide, troubleshooting, and error code reference

Add auto-generated API reference from the OpenAPI spec:
- docs/api-reference.md (markdown, 7002 lines via widdershins)
- docs/api/index.html (ReDoc interactive HTML)
- scripts/generate-api-docs.sh for regeneration
- GitHub Actions workflow (workflow_dispatch) for CI generation

Rebuild README.md as a focused landing page with a feature comparison table, architecture diagrams, and a documentation index.
Add consistent "## Contents" section with anchor links to all 14 docs and README. Update generate-api-docs.sh to auto-generate and inject a TOC into the api-reference.md output, positioned right after the title.
Restyle README to match AxonOps Workbench branding with centered logo, shield badges, quick-links bar, centered tables, section dividers, legal notices, and "Made with love" footer. Add AxonOps logo to assets.
Confluent Schema Registry stores schemas in Kafka (the _schemas topic), not ZooKeeper. ZooKeeper was only used for leader election and was removed in Confluent Platform 7.0. Update messaging to accurately state the distinction: we use databases instead of Kafka for storage.
Move Feature Comparison to directly after "Why AxonOps Schema Registry" for immediate visual impact. Replace Yes/No text with emoji ticks and crosses for scannability. Update copyright year to 2026.
Replace Confluent-centric subtitle with one that highlights the product's own value proposition: multi-backend storage and enterprise security.
Add docs/testing.md covering all test layers in detail: unit tests, storage conformance, integration, concurrency, BDD (76 feature files, ~1400 scenarios), API endpoint, auth (LDAP/OIDC/Vault), migration, Confluent wire-compatibility, and OpenAPI validation. Includes the test pyramid, a quick reference table, the pre-commit workflow, and guidance on which tests to write for each type of change.

Also fix the Karapace OIDC/OAuth2 entry in the feature comparison (Karapace does support it), add the Confluent trademark to the legal notices, update the Overview link to point AxonOps to axonops.com, and add the testing doc to the README table.
Strip v1.0.0 from the auto-generated api-reference.md title via the generation script. Fix TOC generation by exporting the TOC env var. Add built-in API documentation (OpenAPI/Swagger UI/ReDoc) to the README "Why" section.
Expand the terse "Contexts are single-tenant" bullet in the README with a full explanation of what Confluent contexts are (multi-tenancy namespaces for Schema Linking) and why we return only the default context. Also clarify the cluster coordination difference. Update the OpenAPI spec /contexts endpoint description and regenerate API docs.
Create GitHub issue #264 for multi-tenant context support with detailed requirements, acceptance criteria, use cases, and implementation hints. Link to the issue from README known differences, OpenAPI spec /contexts endpoint, and auto-generated API reference. Add Multi-Tenant Contexts and Schema Linking rows to the feature comparison table.
Karapace does not support schema registry contexts — no evidence in their README, API docs, or codebase. Change from tick to cross.
Create docs/fundamentals.md covering what a schema registry is, the problem it solves, core concepts (schemas, subjects, versions, IDs, compatibility, references), producer/consumer serialization flow with Mermaid diagrams, wire format, subject naming strategies, schema evolution, compatibility modes, ID allocation, deduplication, modes, and architectural overview. Link from README with a callout above the "Why" section and in the documentation table.
Author
Closing to address code review feedback. Will reopen with fixes.
millerjp added a commit that referenced this pull request on Feb 16, 2026
…grity

Fixes 11 confirmed issues from PR review:
- Issues 1-2: Add schema_fingerprints table to PostgreSQL and MySQL for stable global schema IDs and reference preservation after permanent delete
- Issues 3-4: Enforce IMPORT mode for explicit ID registration and bulk import (error 42205)
- Issue 5: Propagate mode check errors instead of failing open
- Issue 7: Guard SetNextID against sequence rewind after import
- Issue 8: Include soft-deleted versions when computing the next version in RegisterSchemaWithID
- Issue 9: Handle the "latest" sentinel in findDeletedVersion for GET version?deleted=true
- Issue 10: Add external reference resolution to the JSON Schema compatibility checker
- Issue 11: Fix Cassandra GetMaxSchemaID to query the actual max instead of the block allocator ceiling

Also adds BDD conformance tests covering all fixes (pr_fixes_conformance.feature) and updates existing import feature files for IMPORT mode enforcement.
Summary
This is a major branch that brings the project to public-release quality. It spans 84 commits, 216 files changed, ~73,600 lines added across testing infrastructure, storage engine fixes, Confluent API conformance, and a complete documentation overhaul.
What Changed
1. Comprehensive Test Suite (~50,000 lines of test code)
Built a multi-layered test suite from scratch, covering every component of the system:
Unit Tests (~900 test functions)
- Coverage across packages: handlers, admin, account, audit, tls, ldap, oidc, association, cluster, context, rules, exporter, factory, compatibility/checker, compatibility/modes, compatibility/result, schema/types, schema/protobuf/resolver, auth, registry, config, avro/parser, jsonschema/parser, protobuf/parser, avro/checker, jsonschema/checker, protobuf/checker
- TestOpenAPISpecMatchesRoutes: a bidirectional sync test that fails the build if api/openapi.yaml and the chi router drift apart (a sketch of the idea follows)
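A sketch of the core idea behind that sync test; the spec path, router constructor, and normalization details are assumptions rather than the real test:

```go
package api_test

import (
	"net/http"
	"os"
	"strings"
	"testing"

	"github.com/go-chi/chi/v5"
	"gopkg.in/yaml.v3"
)

func TestOpenAPISpecMatchesRoutes(t *testing.T) {
	// Parse only the paths section of the spec.
	var spec struct {
		Paths map[string]map[string]interface{} `yaml:"paths"`
	}
	raw, err := os.ReadFile("../../api/openapi.yaml") // path assumed
	if err != nil {
		t.Fatal(err)
	}
	if err := yaml.Unmarshal(raw, &spec); err != nil {
		t.Fatal(err)
	}

	// Collect every route the router actually serves.
	router := newRouter() // hypothetical constructor returning the chi router
	served := map[string]bool{}
	_ = chi.Walk(router, func(method, route string, _ http.Handler, _ ...func(http.Handler) http.Handler) error {
		served[method+" "+strings.TrimSuffix(route, "/")] = true
		return nil
	})

	// Every documented operation must be served...
	for path, ops := range spec.Paths {
		for method := range ops {
			key := strings.ToUpper(method) + " " + path
			if !served[key] {
				t.Errorf("spec documents %s but the router does not serve it", key)
			}
		}
	}
	// ...and the reverse check (every served route appears in the spec) works the same way.
}
```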
Storage Conformance Suite (108 tests x 4 backends)
- A shared suite in tests/storage/conformance/ that runs identically against memory, PostgreSQL, MySQL, and Cassandra
- Verifies every backend implements the Storage interface with identical behavior

BDD Tests (76 feature files, ~1,400 scenarios, ~24,000 lines of Gherkin)
Integration Tests
Concurrency Tests (10 scenarios)
Auth Integration Tests
Confluent Wire-Compatibility Tests
Migration Tests
2. Cassandra Storage Engine Rebuild
Complete rewrite of the Cassandra storage backend for Cassandra 5.0+:
- GetSchemaByID, reference loading, orphaned schema cleanup, concurrent idempotency races, deleted schema handling in fingerprint lookup

3. Confluent API Conformance
Extensive fixes to match Confluent Schema Registry behavior:
4. Storage Backend Fixes
- countSchemasBySubject

5. OpenAPI Specification & API Documentation
- api/openapi.yaml expanded from basic to comprehensive (3,186 lines): all 47+ endpoints with parameters, request/response schemas, error codes, security schemes
- Spec embedded via api/embed.go
- scripts/generate-api-docs.sh generates the Markdown API reference (7,114 lines) and ReDoc HTML from the spec
- Served at GET /docs when server.docs_enabled: true

6. CI Pipeline Overhaul
22 CI jobs covering every test layer:
7. Makefile Overhaul
Self-contained test targets with automatic Docker container lifecycle:
- make test-unit, test-bdd, test-integration, test-concurrency, test-conformance, test-api, test-ldap, test-vault, test-oidc, test-auth, test-migration, test-compatibility
- BACKEND=postgres|mysql|cassandra|all
- make docs-api for API doc generation
- make docker-build, make docker-run, make dev (hot reload)

8. Documentation Overhaul (15 docs, ~14,000 lines)
Complete documentation suite for public release:
- README.md
- docs/getting-started.md
- docs/installation.md
- docs/configuration.md
- docs/storage-backends.md
- docs/schema-types.md
- docs/compatibility.md
- docs/api-reference.md
- docs/authentication.md
- docs/security.md
- docs/deployment.md
- docs/monitoring.md
- docs/migration.md
- docs/testing.md
- docs/development.md
- docs/troubleshooting.md

All docs follow RFC 2119 conventions, include a ## Contents TOC, and cross-reference each other.

Test Plan
- make test-unit
- make test-conformance BACKEND=all
- make test-bdd
- make test-bdd BACKEND=all
- make test-bdd BACKEND=confluent
- make test-integration BACKEND=all
- make test-concurrency BACKEND=all
- make test-auth
- make test-migration
- make test-compatibility
- make test-api
- make lint
- make docker-build
- TestOpenAPISpecMatchesRoutes
- make docs-api