
Add benchmark tests for ScyllaDB migrator #300

Draft

dkropachev wants to merge 16 commits into master from add-benchmark-tests

Conversation


@dkropachev dkropachev commented Feb 27, 2026

Summary

  • Add JMH microbenchmarks for CPU-bound transformations (explodeRow, convertValue, createSelection) in a new benchmarks sbt module
  • Add integration throughput benchmarks for Cassandra→Scylla and Scylla→Scylla migration paths at 100K and 500K row scales
  • Refactor convertRowTypes closure into public Cassandra.convertValue so JMH can call it directly
  • Add Benchmark munit tag excluded from regular test-integration runs
  • Add Makefile targets: benchmark-jmh, benchmark-jmh-quick, benchmark-integration, benchmark
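For orientation, a JMH microbenchmark of this shape might look as follows. This is an illustrative sketch only: the class name echoes the ExplodeRowBenchmark mentioned in the test plan, but the state, setup, and measured body are assumptions, not the actual sources.

```scala
import org.openjdk.jmh.annotations._
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
class ExplodeRowBenchmarkSketch {

  // Input prepared once per benchmark run, so the measured region
  // covers only the transformation itself, not data construction.
  var row: Map[String, Any] = _

  @Setup
  def setup(): Unit =
    row = Map("id" -> 1, "tags" -> List("a", "b", "c"))

  @Benchmark
  def explodeRow(): Seq[Map[String, Any]] =
    // Stand-in for the real explodeRow: fan a collection column out
    // into one row per element.
    row("tags").asInstanceOf[List[String]].map(t => row.updated("tags", t))
}
```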

Test plan

  • sbt migrator/compile — migrator compiles after refactor
  • sbt benchmarks/compile — JMH benchmarks compile
  • sbt scalafmtCheckAll — formatting passes
  • JMH smoke test runs successfully (ExplodeRowBenchmark)
  • All 91 unit tests pass (no regression)
  • make test-integration — existing integration tests still pass, benchmarks excluded
  • make benchmark-integration — integration benchmarks run against Docker services

Introduce JMH microbenchmarks for CPU-bound transformations (explodeRow,
convertValue, createSelection) and integration throughput benchmarks for
end-to-end migration paths (Cassandra→Scylla, Scylla→Scylla) at 100K
and 500K row scales.

- Refactor convertRowTypes closure into public Cassandra.convertValue
- Add sbt-jmh plugin and benchmarks module
- Add Benchmark munit tag, excluded from regular test-integration runs
- Add Makefile targets: benchmark-jmh, benchmark-jmh-quick,
  benchmark-integration, benchmark
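The sbt wiring for the points above could look roughly like this. Module and tag names come from the bullets; the exact settings, and the use of munit's tag-exclusion flag, are assumptions about the build rather than the actual definition.

```scala
// build.sbt (sketch, not the actual build definition)
lazy val benchmarks = (project in file("benchmarks"))
  .enablePlugins(JmhPlugin) // provided by sbt-jmh
  .dependsOn(migrator)

// Keep Benchmark-tagged munit suites out of regular test-integration runs.
Test / testOptions +=
  Tests.Argument(new TestFramework("munit.Framework"), "--exclude-tags=Benchmark")
```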

dkropachev and others added 14 commits February 26, 2026 21:28

Set Jmh/baseDirectory to the project root so that relative output paths
in Makefile targets resolve correctly from the forked JVM.
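In build.sbt terms this is presumably a one-line scoped setting; the exact form below is an assumption:

```scala
// build.sbt (sketch): run the forked JMH JVM from the repository root
// so relative output paths passed by Makefile targets resolve correctly.
Jmh / baseDirectory := (ThisBuild / baseDirectory).value
```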

Cover per-row hot paths that were missing benchmark coverage:
- DdbValue.from (flat, nested, set items)
- AttributeValueUtils.fromV1 (SDK v1→v2 conversion)
- DynamoDBS3Export.itemDecoder (simple through wide/deeply-nested JSON)
- compareCassandraRows and compareDynamoDBRows (with and without timestamps)
- stripTrailingZeros mapping (BigDecimal and mixed-type rows)
- DdbValue Java serialization roundtrip (serialize, deserialize, roundtrip)
- Cassandra.convertValue per type (UTF8String, Map, List, Set, ArrayBuffer)
- Wide-row explodeRow (50 columns vs existing 3-column benchmarks)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
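The serialization-roundtrip measurements presumably exercise plain Java serialization. A self-contained sketch of that roundtrip, using a stand-in case class since DdbValue's definition is not shown here:

```scala
import java.io._

// Stand-in for DdbValue; the real type lives in the migrator sources.
case class DemoValue(name: String, n: Long) extends Serializable

def roundtrip[A <: Serializable](a: A): A = {
  val bytes = new ByteArrayOutputStream()
  val out   = new ObjectOutputStream(bytes)
  out.writeObject(a)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[A]
}

// A benchmark would time serialize, deserialize, and the full roundtrip
// separately; here we only show the roundtrip is identity-preserving.
assert(roundtrip(DemoValue("row", 42L)) == DemoValue("row", 42L))
```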

Add TargetSettings.Parquet config (path + compression), a thin
writers.Parquet wrapper around Spark's native df.write.parquet, and
wire up the Cassandra → Parquet route in Migrator.scala.
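A thin wrapper of this kind could be as small as the sketch below. The config field names (path, compression) come from the commit message; the object name and signature are assumptions, and `mode("overwrite")` is one way to tolerate pre-existing output directories.

```scala
import org.apache.spark.sql.DataFrame

// Sketch in the spirit of writers.Parquet, not the actual implementation.
object ParquetWriterSketch {
  def write(df: DataFrame, path: String, compression: String): Unit =
    df.write
      .mode("overwrite")                  // tolerate leftover output dirs
      .option("compression", compression) // e.g. "snappy", "zstd"
      .parquet(path)
}
```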

Add TargetSettings.DynamoDBS3Export config (path), a writer that
produces gzipped DynamoDB JSON files with manifest-summary.json and
manifest-files.json compatible with the existing S3 Export reader,
and wire up the DynamoDB → S3 Export route in Migrator.scala.

Includes 14 roundtrip unit tests verifying all DynamoDB attribute
types (S, N, B, BOOL, NULL, SS, NS, BS, L, M) encode correctly
and decode back to identical values via the existing itemDecoder.
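For reference, DynamoDB's S3 export data files store one JSON object per line under an Item key, with type-tagged attribute values; a line of the gzipped output looks roughly like this (attribute names here are illustrative):

```json
{"Item": {"id": {"S": "user-1"}, "score": {"N": "42"}, "tags": {"SS": ["a", "b"]}}}
```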

Two-leg benchmark: Scylla -> Parquet (export) then Parquet -> Scylla
(import) using the same dataset. Row count configurable via
E2E_CQL_ROWS (default 5M). Parquet files written to Docker volume
at /app/parquet/bench_e2e.

Split ParquetE2EBenchmark into ScyllaToParquetE2EBenchmark and
ParquetToScyllaE2EBenchmark (each runnable independently). Rename
all Makefile targets to test-benchmark-e2e-{source}-{target}:

  test-benchmark-e2e-cassandra-scylla
  test-benchmark-e2e-scylla-scylla
  test-benchmark-e2e-dynamodb-alternator
  test-benchmark-e2e-scylla-parquet
  test-benchmark-e2e-parquet-scylla

Remove self-seeding fallback from ParquetToScyllaE2EBenchmark.
The test now fails with a clear message if Parquet files are
missing, requiring test-benchmark-e2e-scylla-parquet to run first.
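The fail-fast behaviour can be pictured as a precondition check like the one below. This is a sketch only: the directory default and variable names are assumptions, not taken from the Makefile.

```shell
# Sketch: refuse to import when the export leg has not produced files,
# instead of silently re-seeding data.
PARQUET_DIR="${PARQUET_DIR:-/tmp/bench_e2e_parquet_demo}"
if ls "$PARQUET_DIR"/*.parquet >/dev/null 2>&1; then
  PARQUET_READY=yes
else
  PARQUET_READY=no
  echo "No Parquet files under $PARQUET_DIR; run test-benchmark-e2e-scylla-parquet first" >&2
fi
```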

Two-leg benchmark mirroring the Parquet pattern:

  test-benchmark-e2e-dynamodb-s3export: Seeds DynamoDB Local, exports
  to S3 Export format on local filesystem, verifies files exist.

  test-benchmark-e2e-s3export-alternator: Uploads export files to
  LocalStack S3, imports to Alternator, verifies row count.
  Requires running dynamodb-s3export first.

Row count configurable via E2E_DDB_ROWS (default 500K).
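In Makefile terms, the overridable row counts might be wired like this. The variable names and defaults are the documented ones; the exact Makefile form is an assumption.

```make
# Sketch: row counts are env-overridable, e.g.
#   E2E_DDB_ROWS=1000 make test-benchmark-e2e-dynamodb-s3export
E2E_CQL_ROWS ?= 5000000
E2E_DDB_ROWS ?= 500000
```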

…ucture

- Add E2E benchmarks: Cassandra->Scylla, Scylla->Scylla, DynamoDB->Alternator, Cassandra->Parquet
- Refactor benchmark utilities into shared traits (E2EBenchmarkSuite, ThroughputBenchmarkSupport)
- Add E2E test category and TestFileUtils for config file management
- Add DynamoDBBenchmarkDataGenerator for Alternator E2E tests
- Add unit tests for Parquet and DynamoDB S3Export writers
- Add ParquetTargetValidationTest for config validation
- Refactor Makefile: sequential E2E execution, dependency targets, remove old benchmark-integration
- Extract version constants in build.sbt, forward e2e system properties to test JVM
- Refactor DynamoDB S3Export writer for improved encoding
- Remove old non-E2E benchmark infrastructure (BenchmarkSuite, Benchmark category)
- Add 5-minute timeout to COUNT(*) queries in ThroughputBenchmarkSupport
  and ParquetToScyllaE2EBenchmark to prevent read timeouts on large tables
- Set Parquet write mode to 'overwrite' in benchmark configs to handle
  pre-existing output directories from previous runs
- Add docker compose exec fallback in TestFileUtils.deleteRecursive for
  cleaning up root-owned files created by Docker containers
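The per-statement timeout on COUNT(*) presumably looks something like the sketch below, using the DataStax Java driver's statement-level timeout; the method name and keyspace/table handling are assumptions.

```scala
import com.datastax.oss.driver.api.core.CqlSession
import com.datastax.oss.driver.api.core.cql.SimpleStatement
import java.time.Duration

// Sketch (session setup elided): a per-statement timeout keeps counts
// over large tables from tripping the driver's default request timeout.
def countRows(session: CqlSession, keyspace: String, table: String): Long = {
  val stmt = SimpleStatement
    .newInstance(s"SELECT COUNT(*) FROM $keyspace.$table")
    .setTimeout(Duration.ofMinutes(5))
  session.execute(stmt).one().getLong(0)
}
```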

Add test-benchmark-e2e-sanity Makefile target that runs all E2E
migration path tests with minimal row counts (1000 CQL, 100 DynamoDB)
for fast CI validation (~2 min). Integrated into the existing
integration test job in the GitHub Actions workflow.

Also fix stop-services to run unconditionally (if: always()) so
Docker containers are cleaned up even when tests fail.
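In the workflow file, that fix is a one-line condition on the cleanup step; the step name and compose file below are assumptions, only `if: always()` is from the commit message.

```yaml
# .github/workflows (sketch): run cleanup even when a previous step failed.
- name: Stop services
  if: always()
  run: docker compose -f docker-compose-tests.yml down
```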

Rewrite the testing section to cover all test categories (unit,
integration, AWS, E2E benchmarks, JMH), migration paths, row count
configuration, CI pipeline, and the new E2E sanity suite.