Gemini Architecture

Introduction

Gemini is designed around the concept of Jobs. Each job performs a specific function, such as applying mutations or validating data consistency between clusters.

Jobs

Gemini has two primary job types:

  1. MutationJob (pkg/jobs/mutation.go): Applies mutations (INSERT, DELETE) to both clusters. Each mutation worker continuously generates and executes statements until the stop condition is met. The job tracks success/failure metrics and respects error budgets.

  2. ValidationJob (pkg/jobs/validation.go): Reads rows from both clusters and compares them. When differences are detected, an error is raised. Validation supports configurable retry attempts with stabilization delays to handle eventual consistency.
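The retry behaviour described for ValidationJob can be sketched as follows. This is a minimal illustration, not Gemini's code: `validateWithRetry`, `fetchFunc`, `errMismatch` and their parameters are hypothetical names, and rows are reduced to string slices.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
	"time"
)

// fetchFunc is a hypothetical stand-in for reading rows from one cluster.
type fetchFunc func() ([]string, error)

var errMismatch = errors.New("rows differ between clusters")

// validateWithRetry compares rows from both clusters. It tries up to
// attempts times (at least once) and sleeps for delay between tries so
// that an eventually consistent cluster has time to converge.
func validateWithRetry(oracle, test fetchFunc, attempts int, delay time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay) // stabilization delay before re-reading
		}
		want, err := oracle()
		if err != nil {
			lastErr = err
			continue
		}
		got, err := test()
		if err != nil {
			lastErr = err
			continue
		}
		if reflect.DeepEqual(want, got) {
			return nil
		}
		lastErr = errMismatch
	}
	return lastErr
}

func main() {
	calls := 0
	oracle := func() ([]string, error) { return []string{"a", "b"}, nil }
	// The test cluster only catches up on its second read.
	test := func() ([]string, error) {
		calls++
		if calls < 2 {
			return []string{"a"}, nil
		}
		return []string{"a", "b"}, nil
	}
	fmt.Println(validateWithRetry(oracle, test, 3, 10*time.Millisecond))
}
```

With a single attempt the same input would report a mismatch; the retry plus delay is what absorbs the temporary divergence.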

Note: UPDATE statements are currently disabled and converted to INSERTs internally. DDL operations (ALTER) are also temporarily disabled pending v2 stabilization.

Modes of Operation

| Mode | Description |
| --- | --- |
| mixed | (Default) Runs both mutation and validation workers concurrently |
| write | Only mutation workers, no validation |
| read | Only validation workers on existing data |

Concurrency Model

Gemini uses separate concurrency settings for mutations and reads:

  • --concurrency / --mutation-concurrency: Number of mutation workers
  • --read-concurrency: Number of validation workers

In mixed mode, mutation and validation workers run at the same time. All workers share one partition pool, and each selects partitions through a distribution function, which avoids direct conflicts.

Workers are managed using Go's errgroup for coordinated startup and shutdown. When the error budget is exhausted or the configured duration expires, a soft stop is signaled and the workers terminate gracefully.
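The worker lifecycle described above can be sketched with the standard library alone. Note the substitution: Gemini uses errgroup, while this sketch uses `sync.WaitGroup` plus a cancellable context to stay dependency-free. `workerCounts` and `run` are hypothetical helpers, and the context timeout stands in for the soft stop.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// workerCounts returns how many mutation and validation workers a mode
// starts. Unknown modes start nothing.
func workerCounts(mode string, mutationConcurrency, readConcurrency int) (mutators, validators int) {
	switch mode {
	case "mixed":
		return mutationConcurrency, readConcurrency
	case "write":
		return mutationConcurrency, 0
	case "read":
		return 0, readConcurrency
	}
	return 0, 0
}

// run starts the workers and blocks until all of them have returned.
// Each worker loops until ctx is cancelled, which plays the role of the
// soft stop.
func run(ctx context.Context, mutators, validators int, work func(kind string)) {
	var wg sync.WaitGroup
	start := func(kind string, n int) {
		for i := 0; i < n; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for ctx.Err() == nil {
					work(kind)
				}
			}()
		}
	}
	start("mutation", mutators)
	start("validation", validators)
	wg.Wait()
}

func main() {
	// The timeout stands in for the configured run duration.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()

	var ops int64
	m, v := workerCounts("mixed", 4, 2)
	run(ctx, m, v, func(string) {
		atomic.AddInt64(&ops, 1)
		time.Sleep(time.Millisecond)
	})
	fmt.Println("workers:", m+v, "operations:", atomic.LoadInt64(&ops))
}
```

An error budget could be modelled the same way: a worker that observes too many failures calls `cancel`, and every other worker leaves its loop after finishing the operation in flight.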

Partitions System

The partition system (pkg/partitions/) is the core component that manages partition key generation and tracking.

Partitions Structure

type Partitions struct {
    table   *typedef.Table                 // Table definition
    idxFunc distributions.DistributionFunc // Distribution for partition selection
    deleted *deletedPartitions             // Tracks recently deleted partitions
    r       random.GoRoutineSafeRandom     // Thread-safe random source
    config  typedef.PartitionRangeConfig   // Value range configuration
    parts   partitions                     // Slice of partition values
}

Key Operations

| Method | Description |
| --- | --- |
| New() | Creates a partition pool with pre-generated partition keys |
| Get(idx) | Returns partition values at the given index |
| Next() | Returns the next partition using the distribution function |
| Extend() | Adds a new partition to the pool |
| Replace(idx) | Generates new values for a partition, tracking the old ones as deleted |
| Deleted() | Returns a channel of recently deleted partition keys for validation |
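To make the operations above concrete, here is a toy pool that mirrors their shape. It is a sketch under stated assumptions, not the code in pkg/partitions/: the `pool` type, its fields and the return values are hypothetical, it is not safe for concurrent use, and deleted values are kept in a plain slice instead of being emitted on a channel.

```go
package main

import "fmt"

// pool is a toy, single-goroutine stand-in for the partition pool.
// Each element of parts holds the key values of one partition.
type pool struct {
	parts   [][]any
	deleted [][]any      // values retired by Replace
	idxFunc func() int   // distribution function; must return an index in [0, len(parts))
	gen     func() []any // produces fresh partition key values
}

// newPool pre-generates size partitions.
func newPool(size int, idxFunc func() int, gen func() []any) *pool {
	p := &pool{parts: make([][]any, 0, size), idxFunc: idxFunc, gen: gen}
	for i := 0; i < size; i++ {
		p.parts = append(p.parts, gen())
	}
	return p
}

// Get returns the partition values stored at idx.
func (p *pool) Get(idx int) []any { return p.parts[idx] }

// Next returns the partition chosen by the distribution function.
func (p *pool) Next() []any { return p.parts[p.idxFunc()] }

// Extend appends a newly generated partition and returns its values.
func (p *pool) Extend() []any {
	v := p.gen()
	p.parts = append(p.parts, v)
	return v
}

// Replace swaps in fresh values at idx, records the old values as
// deleted, and returns them.
func (p *pool) Replace(idx int) []any {
	old := p.parts[idx]
	p.deleted = append(p.deleted, old)
	p.parts[idx] = p.gen()
	return old
}

func main() {
	n := 0
	gen := func() []any { n++; return []any{n} } // deterministic "random" keys
	p := newPool(3, func() int { return 1 }, gen)

	fmt.Println("next:", p.Next())
	fmt.Println("replaced:", p.Replace(1))
	fmt.Println("now at index 1:", p.Get(1))
	fmt.Println("deleted so far:", len(p.deleted))
}
```

Keeping the old values on replacement is what lets a validator later confirm that the deleted partition is really gone from both clusters.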

Partition Key Generation

Partition keys are generated by delegating to the column types defined in the table schema:

func generateValue(r utils.Random, table *typedef.Table, config typedef.RangeConfig) []any {
    values := make([]any, 0, table.PartitionKeys.LenValues())
    for _, pk := range table.PartitionKeys {
        values = pk.Type.GenValueOut(values, r, config)
    }
    return values
}

Each partition key column type is responsible for generating its own random values within the configured ranges.

Deleted Partitions Tracking

When a partition is deleted, its keys are tracked in a time-bucketed heap structure (deletedPartitions). This allows validation workers to verify that deleted data is actually removed from both clusters after a configurable delay.

Key features:

  • Min-heap sorted by "ready at" time for efficient processing
  • Background goroutine emits ready partitions via channel
  • Configurable time buckets control when deleted partitions become eligible for validation
  • Memory optimized with pre-allocated backing arrays
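The min-heap ordering by "ready at" time can be illustrated with the standard `container/heap` package. This is a simplified sketch: `deletedEntry`, `readyHeap`, `add` and `popReady` are hypothetical names, times are plain Unix seconds, and the background goroutine and channel are left out.

```go
package main

import (
	"container/heap"
	"fmt"
)

// deletedEntry records the key values of a deleted partition and the
// time (Unix seconds) from which it may be validated.
type deletedEntry struct {
	values  []any
	readyAt int64
}

// readyHeap is a min-heap ordered by readyAt.
type readyHeap []deletedEntry

func (h readyHeap) Len() int           { return len(h) }
func (h readyHeap) Less(i, j int) bool { return h[i].readyAt < h[j].readyAt }
func (h readyHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *readyHeap) Push(x any)        { *h = append(*h, x.(deletedEntry)) }
func (h *readyHeap) Pop() any {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// add tracks a deleted partition that becomes eligible at readyAt.
func (h *readyHeap) add(values []any, readyAt int64) {
	heap.Push(h, deletedEntry{values: values, readyAt: readyAt})
}

// popReady removes and returns, earliest first, every entry whose
// readyAt is not later than now.
func popReady(h *readyHeap, now int64) []deletedEntry {
	var ready []deletedEntry
	for h.Len() > 0 && (*h)[0].readyAt <= now {
		ready = append(ready, heap.Pop(h).(deletedEntry))
	}
	return ready
}

func main() {
	h := make(readyHeap, 0, 16) // pre-allocated backing array
	h.add([]any{"pk-c"}, 30)
	h.add([]any{"pk-a"}, 5)
	h.add([]any{"pk-b"}, 10)

	for _, e := range popReady(&h, 10) {
		fmt.Println("ready:", e.values, "at", e.readyAt)
	}
	fmt.Println("still pending:", h.Len())
}
```

Because the earliest entry is always at the root, checking whether anything is ready costs O(1), and each removal costs O(log n).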

Distribution Functions

Partition selection supports multiple distributions (pkg/distributions/):

| Distribution | Description |
| --- | --- |
| uniform | Equal probability for all partitions |
| zipf | Power-law distribution: some partitions are accessed more frequently |
| normal | Gaussian distribution around a mean |
| lognormal | Log-normal distribution |
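The effect of the distribution choice on partition selection can be seen in a small experiment built on `math/rand`. The function names, the Zipf parameters (s = 1.1, v = 1) and the mean and standard deviation chosen for the normal case are illustrative assumptions, not values taken from pkg/distributions/.

```go
package main

import (
	"fmt"
	"math/rand"
)

// distributionFunc returns the index of the partition to use next.
type distributionFunc func() int

// seeded returns a deterministic random source for repeatable runs.
func seeded(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// uniform picks every index in [0, n) with equal probability.
func uniform(r *rand.Rand, n int) distributionFunc {
	return func() int { return r.Intn(n) }
}

// zipf favours low indexes following a power law.
func zipf(r *rand.Rand, n int) distributionFunc {
	z := rand.NewZipf(r, 1.1, 1.0, uint64(n-1)) // values in [0, n-1]
	return func() int { return int(z.Uint64()) }
}

// normal clusters picks around the middle index and clamps outliers
// to the valid range.
func normal(r *rand.Rand, n int) distributionFunc {
	mean, stddev := float64(n)/2, float64(n)/6
	return func() int {
		v := int(r.NormFloat64()*stddev + mean)
		if v < 0 {
			return 0
		}
		if v >= n {
			return n - 1
		}
		return v
	}
}

// histogram counts how often each of n partitions is selected.
func histogram(f distributionFunc, n, draws int) []int {
	counts := make([]int, n)
	for i := 0; i < draws; i++ {
		counts[f()]++
	}
	return counts
}

func main() {
	r := seeded(42)
	fmt.Println("uniform:", histogram(uniform(r, 10), 10, 10000))
	fmt.Println("zipf:   ", histogram(zipf(r, 10), 10, 10000))
	fmt.Println("normal: ", histogram(normal(r, 10), 10, 10000))
}
```

A skewed distribution such as zipf concentrates traffic on a few hot partitions, which exercises different code paths in the database than evenly spread load.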

Statement Generation

The statements.Generator (pkg/statements/) creates CQL statements using partition values:

type Generator struct {
    generator        partitions.Interface  // Partition value source
    random           utils.Random
    table            *typedef.Table
    ratioController  *RatioController     // Controls statement type ratios
    // ...
}

Statement types include:

  • SELECT: Single partition, multiple partition, clustering range, index queries
  • INSERT: Regular and JSON format
  • DELETE: Whole partition, single row, single column, multiple partitions

The RatioController manages the probability distribution of different statement types based on user configuration.
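One common way to turn configured ratios into a statement choice is cumulative-weight selection. The sketch below shows that general technique; it is an assumption about the approach, not a description of how RatioController is implemented, and `weighted` and `pick` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// weighted pairs a statement type with its relative weight.
type weighted struct {
	name   string
	weight float64
}

// pick maps u, a number in [0, 1), onto an entry in proportion to the
// entry weights. It returns "" when there are no entries.
func pick(entries []weighted, u float64) string {
	if len(entries) == 0 {
		return ""
	}
	total := 0.0
	for _, e := range entries {
		total += e.weight
	}
	target := u * total
	for _, e := range entries {
		if target < e.weight {
			return e.name
		}
		target -= e.weight
	}
	// Only reached through floating-point rounding at the upper edge.
	return entries[len(entries)-1].name
}

func main() {
	ratios := []weighted{
		{"insert", 70},
		{"delete", 20},
		{"select", 10},
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(ratios, rand.Float64())]++
	}
	fmt.Println(counts)
}
```

Passing `u` in as a parameter keeps the selection logic deterministic and easy to test; only the caller touches the random source.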

Data Structures

Schema (pkg/typedef/schema.go)

Top-level structure containing the keyspace definition and list of tables:

type Schema struct {
    Keyspace Keyspace     `json:"keyspace"`
    Tables   []*Table     `json:"tables"`
    Config   SchemaConfig `json:"-"`
}

Table (pkg/typedef/table.go)

Represents a CQL table with all its components:

type Table struct {
    Name              string
    PartitionKeys     Columns
    ClusteringKeys    Columns
    Columns           Columns
    Indexes           []IndexDef
    MaterializedViews []MaterializedView
    KnownIssues       KnownIssues
    TableOptions      []string
}

Columns (pkg/typedef/columns.go)

A slice of ColumnDef representing a set of columns (partition keys, clustering keys, or regular columns):

type Columns []ColumnDef

type ColumnDef struct {
    Name string
    Type Type
}

Types (pkg/typedef/types.go)

Types are responsible for generating random values. There are two categories:

  1. Simple Types: int, bigint, text, boolean, decimal, uuid, timestamp, etc.
  2. Complex Types: MapType, ListType, SetType, TupleType, UDTType - composed of simple types

Each type implements value generation:

type Type interface {
    GenValue(r Random, config RangeConfig) []any
    GenValueOut(out []any, r Random, config RangeConfig) []any
    LenValue() int
    // ...
}
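The append-style generation contract can be tried out with a cut-down version of the interface. Everything here is a simplified assumption for illustration: `valueType`, `rangeConfig`, `intType`, `textType` and `generate` are hypothetical, and the real interface has more methods and different parameter types.

```go
package main

import (
	"fmt"
	"math/rand"
)

// rangeConfig is a hypothetical, reduced range configuration.
type rangeConfig struct {
	MinInt, MaxInt       int
	MinStrLen, MaxStrLen int
}

// valueType is a reduced form of the Type interface shown above.
type valueType interface {
	GenValueOut(out []any, r *rand.Rand, cfg rangeConfig) []any
	LenValue() int
}

// intType generates integers in [MinInt, MaxInt].
type intType struct{}

func (intType) GenValueOut(out []any, r *rand.Rand, cfg rangeConfig) []any {
	return append(out, cfg.MinInt+r.Intn(cfg.MaxInt-cfg.MinInt+1))
}
func (intType) LenValue() int { return 1 }

// textType generates lowercase strings whose length lies in
// [MinStrLen, MaxStrLen].
type textType struct{}

func (textType) GenValueOut(out []any, r *rand.Rand, cfg rangeConfig) []any {
	n := cfg.MinStrLen + r.Intn(cfg.MaxStrLen-cfg.MinStrLen+1)
	b := make([]byte, n)
	for i := range b {
		b[i] = byte('a' + r.Intn(26))
	}
	return append(out, string(b))
}
func (textType) LenValue() int { return 1 }

// generate sizes the output slice from LenValue and lets every type
// append its own values, so no intermediate slices are allocated.
func generate(types []valueType, seed int64, cfg rangeConfig) []any {
	r := rand.New(rand.NewSource(seed))
	n := 0
	for _, t := range types {
		n += t.LenValue()
	}
	out := make([]any, 0, n)
	for _, t := range types {
		out = t.GenValueOut(out, r, cfg)
	}
	return out
}

func main() {
	cfg := rangeConfig{MinInt: 10, MaxInt: 20, MinStrLen: 3, MaxStrLen: 5}
	fmt.Println(generate([]valueType{intType{}, textType{}}, 1, cfg))
}
```

Passing the destination slice into `GenValueOut` is what allows a whole partition key to be built with a single allocation.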

Package Overview

| Package | Purpose |
| --- | --- |
| pkg/jobs | Job definitions (mutation, validation) |
| pkg/partitions | Partition key management and deleted tracking |
| pkg/statements | CQL statement generation |
| pkg/typedef | Core type definitions (Schema, Table, Columns, Types) |
| pkg/schema | Schema loading and generation |
| pkg/distributions | Partition selection distributions |
| pkg/store | Database interaction layer (oracle + test clusters) |
| pkg/status | Global status tracking (ops counters, errors) |
| pkg/stop | Graceful shutdown coordination |
| pkg/metrics | Prometheus metrics |
| pkg/stmtlogger | Statement logging for debugging |