Roadmap

This document describes the implementation phases and architecture of GitHub Activity DB.


Unified Architecture

This diagram shows how all components fit together into a cohesive system:

┌─────────────────────────────────────────────────────────────────────────┐
│                        Application Layer                                 │
│                    CLI Commands (cli/*.py)                               │
│    ghactivity sync pr | ghactivity sync repo | ghactivity sync all       │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                         Service Layer                                    │
│   PRIngestionService | BulkPRIngestionService | MultiRepoOrchestrator    │
│                                                                          │
│    Orchestrates "what to do": fetch PR → transform → store               │
│    Abstracted from rate limiting details                                 │
└───────────────┬─────────────────────────────────┬───────────────────────┘
                │                                 │
                │ fetches via                     │ stores via
                ▼                                 ▼
┌───────────────────────────────────┐   ┌─────────────────────────────────┐
│        GitHub Layer               │   │      Repository Layer           │
│  ┌─────────────────────────────┐  │   │     (db/repositories/)          │
│  │       GitHubClient          │  │   │                                 │
│  │   get_full_pull_request()   │  │   │  RepositoryRepository           │
│  │   iter_pull_requests()      │  │   │  PullRequestRepository          │
│  └──────────────┬──────────────┘  │   └────────────────┬────────────────┘
│                 │                 │                    │
│  ┌──────────────▼──────────────┐  │                    │
│  │   Orchestration Layer       │  │                    │
│  │   BatchExecutor             │  │                    │
│  │   RequestScheduler          │  │                    │
│  │   ProgressTracker           │  │                    │
│  └──────────────┬──────────────┘  │                    │
│                 │                 │                    │
│  ┌──────────────▼──────────────┐  │                    │
│  │   Control Layer             │  │                    │
│  │   RequestPacer              │  │                    │
│  │   RateLimitMonitor          │  │                    │
│  └──────────────┬──────────────┘  │                    │
│                 │                 │                    │
└─────────────────┼─────────────────┘                    │
                  │                                      │
                  ▼                                      ▼
           GitHub API                          ┌─────────────────┐
                                               │  Database Layer │
                                               │  (db/models.py) │
                                               │  SQLite + ORM   │
                                               └─────────────────┘

Phase 1: Core Implementation ✅ COMPLETE

Foundation layer providing database models, GitHub client, and CLI scaffold.

Components:

  • Project scaffolding (uv, pyproject.toml, ruff, mypy)
  • Database models (Repository, PullRequest, UserTag)
  • Async SQLAlchemy engine and session management
  • Configuration with pydantic-settings
  • Alembic migrations
  • CLI scaffold with typer
  • Pydantic schemas with factory pattern
  • GitHub client with error handling

Phase 1.5: Rate Limiting & Request Pacing ✅ COMPLETE

Internal layers within the GitHub module for request control, using a token bucket algorithm with adaptive throttling. These components are private to the GitHub layer; higher layers never interact with them directly.

Components: RateLimitMonitor (track state from headers), RequestPacer (calculate delays), RequestScheduler (priority queue), BatchExecutor (coordinate batches), ProgressTracker (observe progress).

CLI Commands:

ghactivity github rate-limit           # Check core rate limit
ghactivity github rate-limit --all     # Show all pools

See Rate Limit Handling for algorithm details, retry logic, and testing patterns.


Phase 1.6: Single PR Ingestion Pipeline ✅ COMPLETE

End-to-end pipeline to fetch a single PR from GitHub API, transform it through the schema hierarchy, and persist it to the database.

PR State Machine

We track two states: OPEN and MERGED.

                          create
            [None] ───────────────────▶ [OPEN]
                                          │
                    ┌─────────────────────┤
                    │                     │
                    │ update              │ merge
                    │                     ▼
                    │                 [MERGED]
                    │           (frozen after grace period)
                    │
                    └──────▶ [OPEN]

2-Week Grace Period: Merged PRs can still be updated for 14 days post-merge to capture any late changes. After the grace period, they are frozen.

is_frozen = (state == MERGED) and (now - close_date > grace_period)
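
A minimal sketch of that check, assuming timezone-aware datetimes (the helper name and signature are illustrative, not the project's actual API):

from datetime import UTC, datetime, timedelta

GRACE_PERIOD = timedelta(days=14)

def is_frozen(state: str, close_date: datetime | None, now: datetime | None = None) -> bool:
    """True when a merged PR is past the 14-day grace period (illustrative)."""
    if state != "MERGED" or close_date is None:
        return False
    now = now or datetime.now(UTC)
    return now - close_date > GRACE_PERIOD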

Schema Transformation Flow

GitHub API Response
       │
       ▼
 GitHubPullRequest (Pydantic)
       │
       ├──▶ .to_pr_create(repo_id)  →  PRCreate (immutable fields)
       │                                 - number, link, submitter
       │                                 - open_date, repository_id
       │
       └──▶ .to_pr_sync(files,      →  PRSync (synced fields)
            commits, reviews)            - title, state, additions
                                         - deletions, changed_files_count
                                         - commits_breakdown, participants
       │
       ▼
  SQLAlchemy Model
       │
       ▼
 PRRead.from_orm(model)  →  PRRead (output)
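
A hedged sketch of that flow end to end; api_response, files/commits/reviews, and the pr_repo.upsert() call are placeholders, not the project's actual persistence API:

# Parse raw API JSON into the typed GitHub schema
gh_pr = GitHubPullRequest.model_validate(api_response)

# Split into immutable vs synced field sets
pr_create = gh_pr.to_pr_create(repo_id=repo.id)
pr_sync = gh_pr.to_pr_sync(files, commits, reviews)

# Persist (placeholder call) and re-read through the output schema
model = await pr_repo.upsert(pr_create, pr_sync)
pr_read = PRRead.from_orm(model)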

Result Objects

@dataclass
class PRIngestionResult:
    pr: PullRequest | None
    created: bool             # New PR created
    updated: bool             # Existing PR updated
    skipped_frozen: bool      # Skipped - frozen state
    skipped_unchanged: bool   # Skipped - no changes detected
    error: Exception | None   # Error if failed

CLI Commands

ghactivity sync pr prebid/prebid-server 1234              # Basic usage
ghactivity sync pr prebid/prebid-server 1234 --verbose    # Detailed output
ghactivity sync pr prebid/prebid-server 1234 --quiet      # Silent (errors only)
ghactivity sync pr prebid/prebid-server 1234 --dry-run    # Preview without writing
ghactivity sync pr prebid/prebid-server 1234 --format json # JSON output

Phase 1.7: Bulk PR Ingestion ✅ COMPLETE

Extends the single PR pipeline to support bulk historical imports using lazy pagination for efficient date filtering.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                      Service Layer                                       │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                BulkPRIngestionService                              │  │
│  │                                                                    │  │
│  │  async def ingest_repository(owner, repo, config):                 │  │
│  │      async for pr in client.iter_pull_requests(...):               │  │
│  │          if pr.created_at < config.since: break  # Stops pagination│  │
│  │          pr_numbers.append(pr.number)                              │  │
│  │      return await batch_executor.execute(pr_numbers, ingest_pr)    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                    │                          │                          │
│                    │ discovers via            │ ingests via              │
│                    ▼                          ▼                          │
│            GitHubClient              PRIngestionService                  │
│       iter_pull_requests()                                               │
└─────────────────────────────────────────────────────────────────────────┘

Lazy Pagination

The iter_pull_requests() method returns an AsyncIterator that fetches pages on-demand. When the caller breaks out of the loop (e.g., when hitting PRs older than --since), pagination stops immediately, saving API calls.

# Lazy iteration - only fetches pages as needed
async for pr in client.iter_pull_requests(owner, repo, state="all"):
    if config.since and pr.created_at < config.since:
        break  # Stops network pagination immediately
    # ... process PR
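
For illustration, the producer side can be pictured as an async generator; the real implementation wraps githubkit, so fetch_page() here is a placeholder:

from collections.abc import AsyncIterator

async def iter_pull_requests(owner: str, repo: str, state: str = "all") -> AsyncIterator[GitHubPullRequest]:
    page = 1
    while True:
        batch = await fetch_page(owner, repo, state=state, page=page)  # placeholder HTTP call
        if not batch:
            return  # no more pages
        for pr in batch:
            yield pr  # suspends here; a caller's break means no further pages are fetched
        page += 1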

Cold vs Hot Path Handling

Hot Path (synced normally):

  • Open PRs
  • Merged PRs within 14-day grace period

Cold Path (skipped):

  • Merged PRs past 14-day grace period (frozen)

Filtered Out (excluded from discovery):

  • Closed but not merged PRs (abandoned)
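
The routing above, condensed into one illustrative helper; field names follow the GitHub API and assume full PR data (the function itself is not the project's code):

from datetime import UTC, datetime, timedelta

def classify(pr: GitHubPullRequest) -> str:
    """Route a discovered PR to the hot, cold, or filtered path (illustrative)."""
    if pr.state == "open":
        return "hot"                                   # always synced
    if pr.merged_at is not None:                       # merged
        frozen = datetime.now(UTC) - pr.merged_at > timedelta(days=14)
        return "cold" if frozen else "hot"
    return "filtered"                                  # closed, never merged (abandoned)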

Result Objects

@dataclass
class BulkIngestionResult:
    total_discovered: int     # PRs matching filters
    created: int              # New PRs created
    updated: int              # Existing PRs updated
    skipped_frozen: int       # Skipped - frozen state
    skipped_unchanged: int    # Skipped - no changes
    failed: int               # PRs that failed
    failed_prs: list[tuple[int, str]]  # (pr_number, error)
    duration_seconds: float

CLI Commands

ghactivity sync repo prebid/prebid-server                    # Sync all PRs
ghactivity sync repo prebid/prebid-server --since 2024-10-01 # Since date
ghactivity sync repo prebid/prebid-server --state open       # Only open PRs
ghactivity sync repo prebid/prebid-server --state merged     # Only merged PRs
ghactivity sync repo prebid/prebid-server --max 10           # Limit count
ghactivity sync repo prebid/prebid-server --dry-run          # Preview mode
ghactivity sync repo prebid/prebid-server --format json      # JSON output

API Cost Analysis

PRs   Discovery Calls   Per-PR Calls   Total   Est. Time (5 concurrent)
100   1                 400            ~401    ~5 min
300   3                 1200           ~1203   ~15 min
500   5                 2000           ~2005   ~25 min

With --since filtering, discovery calls are reduced significantly due to lazy pagination stopping early.
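
The table's arithmetic, assuming 100 PRs per discovery page and 4 API calls per PR (details, files, commits, reviews):

def estimate_calls(n_prs: int, page_size: int = 100, calls_per_pr: int = 4) -> int:
    discovery = -(-n_prs // page_size)  # ceiling division for page count
    return discovery + n_prs * calls_per_pr

assert estimate_calls(100) == 401
assert estimate_calls(300) == 1203
assert estimate_calls(500) == 2005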


Phase 1.7.5: Test Coverage & Documentation ✅ COMPLETE

Expanded test coverage and created comprehensive testing documentation before adding multi-repo orchestration.

Documentation

New: docs/testing.md

Comprehensive testing guide covering:

Section            Content
Philosophy         Test pyramid, mocking strategy, coverage goals
Test Categories    Unit vs Integration vs E2E with examples
Current Coverage   515+ tests, breakdown by module, known gaps
Infrastructure     Fixtures, factories, async patterns
Mocking Patterns   AsyncMock, async_iter helper, response fixtures
Running Tests      Commands, filtering, coverage reports
Writing Tests      Naming conventions, AAA pattern
Troubleshooting    Common async and database issues

Test Expansion

Current State (515 tests)

Module               Tests   Status
github/pacing/       113     ✅ Comprehensive (incl. integration)
github/sync/         43      ✅ Comprehensive (incl. MultiRepoOrchestrator)
github/rate_limit/   50+     ✅ Comprehensive (incl. state transitions)
db/repositories/     42      ✅ Good
schemas/             150+    ✅ Comprehensive
cli/                 31      ✅ Integration tests added
E2E                  11      ✅ Core paths

Tests Added

1. GitHubClient Tests

File: tests/github/test_client.py

Test                                        Purpose
test_iter_pull_requests_pagination          Verify lazy iteration works
test_iter_pull_requests_early_termination   Verify pagination stops on break
test_get_full_pull_request_success          Verify full PR fetch
test_get_full_pull_request_not_found        404 handling
test_rate_limit_header_extraction           Verify monitor updates from headers

2. CLI Integration Tests

File: tests/cli/test_sync_integration.py

Test                                   Purpose
test_sync_pr_creates_database_record   Real DB, mocked API
test_sync_repo_with_small_batch        End-to-end with --max 3
test_sync_repo_dry_run_no_writes       Verify dry-run doesn't persist
test_sync_repo_json_output_structure   Verify JSON schema

3. Pacer + Scheduler Integration

File: tests/github/pacing/test_integration.py

Test                                       Purpose
test_scheduler_uses_pacer_delays           Verify pacing applied
test_batch_executor_respects_rate_limits   Verify throttling
test_scheduler_priority_with_pacing        HIGH tasks get less delay

4. Rate Limit State Transitions

File: tests/github/rate_limit/test_monitor.py

Test                                        Purpose
test_state_healthy_to_warning_transition    State machine correctness
test_state_warning_to_critical_transition   State machine correctness
test_state_recovery_after_reset             Recovery behavior

5. GitHubClient Pacer Integration

File: tests/github/test_client.py (TestGitHubClientPacerIntegration class)

Test                                            Purpose
test_init_with_pacer                            Client accepts pacer parameter
test_apply_pacing_calls_get_recommended_delay   Verify pacer delay hook called
test_apply_pacing_calls_on_request_start        Verify request start hook called
test_apply_pacing_sleeps_when_delay_positive    Verify asyncio.sleep called
test_update_calls_pacer_on_request_complete     Verify completion hook called
test_update_passes_headers_to_pacer             Verify headers passed correctly
test_get_pull_request_calls_apply_pacing        Verify pacing before API calls

Phase 1.8: Multi-Repository Sync Orchestration ✅ COMPLETE

Extends bulk PR ingestion to support syncing all 9 Prebid repositories in a single command, verifying the implementation scales and is composable.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         CLI: ghactivity sync all                         │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                     MultiRepoOrchestrator                                │
│                                                                          │
│  async def sync_all(config):                                             │
│      repos = await initialize_repositories()                             │
│      for repo in repos:                                                  │
│          result = await bulk_service.ingest_repository(repo, config)     │
│          await update_last_synced_at(repo)                               │
│      return MultiRepoSyncResult.aggregate(results)                       │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
    RepositoryRepository   BulkPRIngestionService   Settings.tracked_repos
    (get_or_create)        (per-repo ingestion)     (9 Prebid repos)

Tracked Repositories

tracked_repos = [
    "prebid/prebid-server",
    "prebid/prebid-server-java",
    "prebid/Prebid.js",
    "prebid/prebid.github.io",
    "prebid/prebid-mobile-android",
    "prebid/prebid-mobile-ios",
    "prebid/prebid-universal-creative",
    "prebid/professor-prebid",
    "prebid/salesagent",
]

Result Objects

@dataclass
class RepoSyncResult:
    repository: str
    result: BulkIngestionResult
    started_at: datetime
    completed_at: datetime

@dataclass
class MultiRepoSyncResult:
    repo_results: list[RepoSyncResult]
    total_discovered: int
    total_created: int
    total_updated: int
    total_skipped: int
    total_failed: int
    duration_seconds: float
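
A plausible shape for the aggregate() step referenced in the diagram; this is an illustrative sketch, and the composition of total_skipped is an assumption:

@classmethod
def aggregate(cls, repo_results: list[RepoSyncResult]) -> "MultiRepoSyncResult":
    results = [r.result for r in repo_results]
    return cls(
        repo_results=repo_results,
        total_discovered=sum(r.total_discovered for r in results),
        total_created=sum(r.created for r in results),
        total_updated=sum(r.updated for r in results),
        total_skipped=sum(r.skipped_frozen + r.skipped_unchanged for r in results),
        total_failed=sum(r.failed for r in results),
        duration_seconds=sum(r.duration_seconds for r in results),
    )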

CLI Commands

ghactivity sync all                                    # Sync all 9 repos
ghactivity sync all --since 2024-10-01                 # With date filter
ghactivity sync all --state merged                     # Only merged PRs
ghactivity sync all --repos prebid/Prebid.js,prebid/prebid-server  # Specific repos
ghactivity sync all --max-per-repo 50                  # Limit per repo
ghactivity sync all --dry-run                          # Preview mode
ghactivity sync all --format json                      # JSON output

Progress Output

Syncing 9 repositories...

[1/9] prebid/prebid-server
      Discovered: 150 | Created: 45 | Updated: 80 | Skipped: 20 | Failed: 5

[2/9] prebid/Prebid.js
      Discovered: 230 | Created: 100 | Updated: 110 | Skipped: 15 | Failed: 5
...

Summary:
  Repositories synced: 8
  Total PRs processed: 1,240
  Created: 450 | Updated: 620 | Skipped: 150 | Failed: 20
  Duration: 45m 30s

File Structure

src/github_activity_db/github/sync/
├── __init__.py                    # Add exports
├── multi_repo_orchestrator.py     # NEW: MultiRepoOrchestrator + result classes
├── bulk_ingestion.py              # Existing
└── ingestion.py                   # Existing

Phase 1.9: Logging Infrastructure ✅ COMPLETE

Replaces stdlib logging with loguru for improved developer experience, structured context binding, and proper log level control.

Why Loguru?

Feature                stdlib logging                  loguru
Setup complexity       Handler/Formatter boilerplate   Zero-config
Context binding        Manual, verbose                 logger.bind(repo="...", pr=123)
Exception tracebacks   Basic                           Rich, colorized
File rotation          Requires RotatingFileHandler    Built-in
Type hints             Limited                         Full support

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         CLI Entry Point                                  │
│                     cli/app.py callback()                                │
│         setup_logging(level, verbose, quiet, log_file)                   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                      logging.py Module                                   │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │  setup_logging()     Configure handlers, levels, interception    │   │
│  │  get_logger(name)    Get logger with name context bound          │   │
│  │  bind_repo()         Bind repository context                     │   │
│  │  bind_pr()           Bind PR context                             │   │
│  │  InterceptHandler    Route stdlib → loguru                       │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              │                  │                  │
              ▼                  ▼                  ▼
         Console            File (opt)        SQLAlchemy
         (stderr)           (rotation)        (intercepted)
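
The InterceptHandler routing shown above follows loguru's documented recipe for bridging stdlib loggers; a sketch:

import logging

from loguru import logger

class InterceptHandler(logging.Handler):
    """Route stdlib logging records (e.g., SQLAlchemy) into loguru."""

    def emit(self, record: logging.LogRecord) -> None:
        # Map the stdlib level name to a loguru level when one exists
        try:
            level: str | int = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno
        logger.opt(depth=6, exception=record.exc_info).log(level, record.getMessage())

# Install once at setup time so all stdlib loggers flow through loguru
logging.basicConfig(handlers=[InterceptHandler()], level=0, force=True)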

Configuration

Environment Variables:

Variable             Default   Description
LOG_LEVEL            INFO      Base log level
LOGGING__LOG_FILE    None      Enable file logging with rotation
LOGGING__ROTATION    10 MB     When to rotate log file
LOGGING__RETENTION   7 days    How long to keep rotated logs

CLI Flags (Global):

Flag        Short   Effect
--verbose   -v      DEBUG level (overrides LOG_LEVEL)
--quiet     -q      WARNING level (overrides LOG_LEVEL)

Usage Examples

# Standard logging (INFO level)
ghactivity sync all --since 2024-12-17

# Debug logging (shows SQLAlchemy queries, detailed traces)
ghactivity -v sync pr prebid/prebid-server 1234

# Quiet mode (warnings and errors only)
ghactivity -q sync all

# With file logging
LOGGING__LOG_FILE=./sync.log ghactivity sync all --since 2024-12-17

Context Binding

Loguru's bind() enables structured logging with contextual data:

from github_activity_db.logging import bind_pr, get_logger

logger = get_logger(__name__)

# Bind PR context for all subsequent logs
pr_logger = bind_pr("prebid", "prebid-server", 1234)
pr_logger.info("Processing")
# Output: 10:15:30 | INFO | repo=prebid/prebid-server pr=1234 | Processing

File Structure

src/github_activity_db/
├── logging.py              # NEW: Core logging module
├── config.py               # Updated: LoggingConfig
└── cli/
    └── app.py              # Updated: Global -v/-q flags, logging setup

Phase 1.10: Bugfix - GitHub List API Missing Merged Status ✅ COMPLETE

Fixed critical bug where all merged PRs were incorrectly excluded during discovery. The GitHub list API always returns merged=False; the actual merge status is only available from the full PR endpoint.

Fix: Discovery phase now includes ALL closed PRs. The ingestion phase (which fetches full PR data) determines if closed PRs are merged or abandoned. Added skipped_abandoned field to result types.

Trade-off: Slightly increased API calls (4 per closed PR) but ensures correctness.

See GitHub API Quirks for detailed documentation of list vs full API differences.


Phase 1.11: Testing Strategy Improvements ✅ COMPLETE

Improved test infrastructure following the GitHub list API bug (Phase 1.10). Added API contract tests to verify assumptions about GitHub API responses, sleep mocking patterns for fast test execution, and improved mock accuracy validation.

Key improvements:

  • Real API response fixtures (tests/fixtures/real_pr_*.py)
  • Sleep mocking patterns (targeted mock, complete mock, disable retries)
  • Test suite optimized from 281s to ~18s (94% reduction)
  • 502 tests with comprehensive coverage
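
A targeted sleep mock, one of the patterns above, sketched with pytest's monkeypatch (the fixture name is illustrative):

import asyncio

import pytest

@pytest.fixture
def no_sleep(monkeypatch: pytest.MonkeyPatch) -> None:
    """Make asyncio.sleep return immediately so pacing delays cost no wall time."""
    async def instant(_delay: float) -> None:
        return None

    monkeypatch.setattr(asyncio, "sleep", instant)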

See Testing Guide for sleep mocking patterns, API contract testing, and mock accuracy guidelines.


Phase 1.12: Rate Limit Retry Handling Fix ✅ COMPLETE

Fixed rate limit errors being swallowed by the ingestion pipeline instead of propagating to the scheduler for proper retry handling.

Fix: Added GitHubRetryableError base class for errors that should be retried. Modified ingest_pr() to re-raise retryable errors. Added explicit retry loop in discovery phase with exponential backoff.
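
An illustrative sketch of that retry loop; the helper name and constants are assumptions, only GitHubRetryableError comes from the phase itself:

import asyncio
from collections.abc import Awaitable, Callable

async def with_retries(
    fetch: Callable[[], Awaitable[object]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
) -> object:
    """Retry a coroutine factory on retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except GitHubRetryableError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted; surface to the scheduler
            await asyncio.sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...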

Result: Rate limit errors now trigger proper wait-and-retry behavior with scheduler priority boosting.

See Rate Limit Handling for retry flow details and exponential backoff tables.


Phase 1.13: Sync Failure Management System ✅ COMPLETE

Persistent tracking of failed PR ingestion attempts with automatic and manual retry capabilities. Failed PRs are now recorded in the database with error details and retry state, enabling recovery from transient failures.

Components

Database Layer:

  • SyncFailure model with status tracking (PENDING, RESOLVED, PERMANENT)
  • SyncFailureRepository for CRUD operations and status transitions
  • Alembic migration for sync_failures table

Service Layer:

  • FailureRetryService orchestrates retry operations with max retry limits
  • BulkPRIngestionService integration for automatic failure persistence

CLI Commands:

ghactivity sync retry                      # Retry all pending failures
ghactivity sync retry -r owner/repo        # Retry for specific repository
ghactivity sync retry --max 10             # Limit retries
ghactivity sync retry --dry-run            # Preview mode

ghactivity sync repo owner/repo --auto-retry  # Auto-retry before main sync
ghactivity sync all --auto-retry              # Auto-retry across all repos

Result Objects:

  • RetryResult with succeeded/failed/marked_permanent counts
  • Integration with existing BulkIngestionResult for unified reporting
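
A sketch of the status lifecycle described under Components; the enum values come from the phase description, while the transition helper is illustrative:

from enum import Enum

class SyncFailureStatus(Enum):
    PENDING = "pending"
    RESOLVED = "resolved"
    PERMANENT = "permanent"

def next_status(retry_succeeded: bool, retry_count: int, max_retries: int) -> SyncFailureStatus:
    if retry_succeeded:
        return SyncFailureStatus.RESOLVED
    if retry_count >= max_retries:
        return SyncFailureStatus.PERMANENT  # stop retrying
    return SyncFailureStatus.PENDING        # eligible for another attempt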

Phase 1.14: Pacing Integration at GitHubClient Layer ✅ COMPLETE

Fixed critical architecture gap where pacing infrastructure existed but was disconnected from actual API calls.

The Problem

The pacing infrastructure (Phase 1.5) only controlled PR-level concurrency via the scheduler, but individual API calls were uncontrolled:

BEFORE (Broken):
┌─────────────────────────────────────────────────────────────────────────┐
│  Scheduler controls: "Start next PR ingestion" (max 5 concurrent)       │
│                                                                          │
│  BUT each PR makes 4+ API calls with ZERO pacing:                       │
│    get_pull_request() → get_files() → get_commits() → get_reviews()    │
│                                                                          │
│  5 concurrent PRs × 4 calls = 20 rapid requests                         │
│  With 50ms fallback delay = ~20 requests/second                         │
│  GitHub limit = 5000/hour = 1.39 requests/second                        │
│  Result: Rate limit exhausted in ~4 minutes                             │
└─────────────────────────────────────────────────────────────────────────┘

The Fix

Integrated pacing directly into GitHubClient so every API call is automatically paced:

AFTER (Fixed):
┌─────────────────────────────────────────────────────────────────────────┐
│                         GitHubClient                                     │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  Every API method now:                                             │  │
│  │    1. await self._apply_pacing()  ← Calculate and apply delay     │  │
│  │    2. Make API request                                             │  │
│  │    3. _update_rate_limit_from_response() → Notify pacer           │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Implementation Details

File: src/github_activity_db/github/client.py

  1. Constructor accepts optional RequestPacer:

    def __init__(
        self,
        token: str | None = None,
        rate_monitor: RateLimitMonitor | None = None,
        pacer: RequestPacer | None = None,  # NEW
    ) -> None:
  2. New _apply_pacing() method:

    async def _apply_pacing(self, pool: RateLimitPool = RateLimitPool.CORE) -> None:
        if self._pacer is None:
            return
        delay = self._pacer.get_recommended_delay(pool)
        if delay > 0:
            await asyncio.sleep(delay)
        self._pacer.on_request_start()
  3. Response headers feed back to pacer:

    def _update_rate_limit_from_response(self, response: Any) -> None:
        # ... extract headers ...
        if self._pacer is not None:
            self._pacer.on_request_complete(header_dict)
  4. All API methods call _apply_pacing() before requests:

    • get_rate_limit()
    • get_pull_request()
    • list_pull_requests() / iter_pull_requests()
    • get_pull_request_files()
    • get_pull_request_commits()
    • get_pull_request_reviews()

File: src/github_activity_db/cli/sync.py

CLI commands now properly initialize the pacing infrastructure:

async with GitHubClient() as base_client:
    # Initialize monitor and fetch current rate limit state
    monitor = RateLimitMonitor(base_client._github)
    await monitor.initialize()  # Fetches current rate limit from API

    # Create pacer with monitor state
    pacer = RequestPacer(monitor)

    # Create paced client for actual API calls
    async with GitHubClient(rate_monitor=monitor, pacer=pacer) as client:
        # All API calls through this client are now paced

Expected Behavior After Fix

Aspect                   Before                           After
API call pacing          None (scheduler only)            Every call paced
Header feedback          Monitor updated, pacer ignored   Both updated
Monitor initialization   Never called                     Called at CLI startup
Rate limit exhaustion    ~4 minutes                       Adaptive (never exhausted)

Delay Calculation Example

With 5,000 requests/hour and 60 minutes until reset:

  • base_delay = 3600 seconds / 5000 requests = 0.72 seconds per request
  • With 10% reserve buffer: effective = 4500 requests
  • base_delay = 3600 / 4500 = 0.8 seconds per request

As quota depletes to WARNING (20% remaining = 1000 requests):

  • base_delay = remaining_time / 1000
  • Multiplied by 1.5x throttle = ~1.2+ seconds per request
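
The same arithmetic as a tiny helper (the name and signature are illustrative):

def base_delay(seconds_until_reset: float, remaining_requests: int, reserve: float = 0.10) -> float:
    """Spread the remaining quota evenly over the time until reset."""
    effective = max(1, int(remaining_requests * (1 - reserve)))
    return seconds_until_reset / effective

print(base_delay(3600, 5000, reserve=0.0))  # 0.72 s/request, no reserve
print(base_delay(3600, 5000))               # 0.80 s/request with 10% reserve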

Phase 1.15: Database Write Resilience ✅ COMPLETE

Implements batch commit boundaries to prevent data loss during sync failures. Previously, all writes happened in a single transaction that only committed on session exit—any failure caused complete rollback of all work.

The Problem

Before (all-or-nothing):
PR #1 → flush → PR #2 → flush → ... → PR #200 → flush → [context exit] → COMMIT ALL
[Any failure at PR #150] → ROLLBACK ALL → 0 PRs saved

Impact by operation type:

  • Single PR sync: Loss of 1 PR (acceptable)
  • Bulk repo sync: Loss of 100+ PRs if failure near end
  • Multi-repo sync: Loss of 1000+ PRs across all repos if any failure

Solution: CommitManager

The CommitManager class commits in configurable batches (default: 25 PRs), limiting data loss to the last uncommitted batch.

After (batched with batch_size=25):
PR #1-25 → flush each → COMMIT BATCH → 25 PRs saved
PR #26-50 → flush each → COMMIT BATCH → 50 PRs saved
...
PR #126-150 → flush each → [FAILURE at #150]
→ ROLLBACK current batch only → 125 PRs saved (5 batches)

Architecture Integration

┌─────────────────────────────────────────────────────────────────────────┐
│                         CLI: ghactivity sync                             │
│                                                                          │
│  async with get_session(auto_commit=False) as session:                   │
│      write_lock = asyncio.Lock()                                         │
│      commit_manager = CommitManager(session, write_lock, batch_size=25)  │
│      ...                                                                 │
│      await commit_manager.finalize()                                     │
└───────────────────────────────────┬─────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
    BulkPRIngestionService   MultiRepoOrchestrator   PullRequestRepository
    (calls record_success)   (passes commit_manager)  (uses write_lock)

CommitManager API

class CommitManager:
    def __init__(
        self,
        session: AsyncSession,
        write_lock: asyncio.Lock | None = None,
        batch_size: int = 25,
    ) -> None: ...

    async def record_success(self) -> int:
        """Record successful operation, auto-commit at batch_size."""

    async def commit(self) -> int:
        """Force commit of pending changes."""

    async def finalize(self) -> int:
        """Commit any remaining uncommitted changes."""

    @property
    def uncommitted_count(self) -> int: ...
    @property
    def total_committed(self) -> int: ...
    @property
    def batch_size(self) -> int: ...
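
A sketch of what record_success() plausibly does internally; these are illustrative method bodies for CommitManager, and the private attribute names are assumptions:

async def record_success(self) -> int:
    """Count one successful write; commit when the batch fills."""
    self._uncommitted += 1
    if self._uncommitted >= self._batch_size:
        return await self.commit()
    return 0

async def commit(self) -> int:
    if self._write_lock is not None:
        async with self._write_lock:   # serialize commits across concurrent tasks
            await self._session.commit()
    else:
        await self._session.commit()
    committed, self._uncommitted = self._uncommitted, 0
    self._total_committed += committed
    return committed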

Configuration

Variable                  Default   Description
SYNC__COMMIT_BATCH_SIZE   25        PRs to commit per batch (1-100)

CLI Integration

All sync commands automatically use CommitManager:

# Default batch size (25)
ghactivity sync repo prebid/prebid-server --since 2024-10-01

# Custom batch size via environment
SYNC__COMMIT_BATCH_SIZE=50 ghactivity sync all --since 2024-10-01

File Structure

src/github_activity_db/
├── config.py                          # + commit_batch_size in SyncConfig
├── db/
│   └── engine.py                      # + auto_commit param to get_session()
└── github/
    └── sync/
        ├── __init__.py                # + export CommitManager
        ├── commit_manager.py          # NEW: CommitManager class
        ├── bulk_ingestion.py          # + CommitManager integration
        └── multi_repo_orchestrator.py # + CommitManager integration

tests/github/sync/
├── test_commit_manager.py             # Unit tests (13 tests)
└── test_commit_manager_integration.py # Integration tests (5 tests)

Phase 1.16: Code Quality & Technical Debt Reduction ✅ COMPLETE

Systematic elimination of lazy engineering patterns that bypass proper solutions, improving type safety, reducing code duplication, and establishing better patterns for future development.

Problem Statement

A code quality audit identified patterns that trade proper engineering for quick fixes:

Pattern                           Count   Impact
Duplicate split("/", 1) parsing   7       DRY violation, inconsistent error handling
type: ignore comments             10      Bypasses type safety
noqa lint suppressions            4       Hides code smells
Any type aliases                  2       Weakens type checking
except: pass in tests             3       Silent test failures
Architectural Improvements

1. Centralized Repository String Parsing

Problem: owner, name = repo.split("/", 1) duplicated 7 times across CLI and service layers.

Solution: Add parse_repo_string() function to schemas/repository.py:

┌─────────────────────────────────────────────────────────────────────────┐
│                     schemas/repository.py                                │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  def parse_repo_string(full_name: str) -> tuple[str, str]:        │  │
│  │      """Parse 'owner/repo' into (owner, name) tuple."""           │  │
│  │      if "/" not in full_name:                                     │  │
│  │          raise ValueError(f"Invalid: {full_name}")                │  │
│  │      return tuple(full_name.split("/", 1))                        │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              │                     │                     │
              ▼                     ▼                     ▼
        cli/sync.py           cli/github.py       multi_repo_orchestrator.py

Benefits:

  • Single validation logic with consistent error messages
  • Easier to enhance (e.g., add organization validation)
  • Type-safe return value
  • Exported from schemas/__init__.py for easy access

2. CLI Option Factory Pattern

Problem: Typer's idiom places typer.Option(...) calls in argument defaults, triggering the B008 lint rule (function call in default argument). The same noqa: B008 suppression was repeated 4 times.

Solution: Create cli/common.py with reusable option factories:

# cli/common.py (NEW FILE)
def output_format_option() -> OutputFormat:  # noqa: B008
    """Standard output format option."""
    return typer.Option(OutputFormat.TEXT, "--format", "-f", help="Output format")

# cli/sync.py - usage (no noqa needed at call site)
def sync_pr(output_format: OutputFormat = output_format_option()):

Benefits:

  • Single noqa comment in factory, not scattered
  • Consistent option naming and help text across commands
  • Extensible for future common options

3. TYPE_CHECKING for Complex Types

Problem: GitHub = Any alias used to avoid complex generic typing.

Solution: Use TYPE_CHECKING block for proper typing:

from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from githubkit import GitHub

class RateLimitMonitor:
    def __init__(self, github: GitHub | None = None) -> None:
        self._github: GitHub | None = github

Benefits:

  • Full type checking in IDE and mypy
  • No runtime overhead from complex imports
  • Aligns with project principle: "Type Safety - Strict mypy configuration"

Files to Modify

File                                     Changes
schemas/repository.py                    Add parse_repo_string() helper function
schemas/__init__.py                      Export parse_repo_string
cli/common.py                            NEW: Option factory functions
cli/sync.py                              Use helpers, fix type ignores, use option factories
cli/github.py                            Use helper, modernize async runner
github/sync/multi_repo_orchestrator.py   Use parse_repo_string() helper
github/rate_limit/monitor.py             Fix GitHub type alias with TYPE_CHECKING
github/sync/ingestion.py                 Fix type narrowing with assertions
github/pacing/batch.py                   Fix type narrowing with cast()
github/client.py                         Fix lazy init typing
2 test files                             Replace except: pass with pytest.raises()

Expected Outcome

Metric                     Before        After
type: ignore comments      10            2-3 (unavoidable Pydantic)
noqa comments in sync.py   4             0
Duplicate parsing logic    7 locations   1 location
except: pass in tests      3             0

Configuration Enforcement

Beyond code changes, this phase adds tooling to prevent regression:

Ruff Rules (pyproject.toml)

[tool.ruff.lint]
select = [
    # ... existing rules ...
    "ANN001", "ANN201", "ANN204",  # Type annotation enforcement
    "TC001", "TC002", "TC003",     # TYPE_CHECKING best practices
    "FA100",                        # Future annotations
]
ignore = ["ANN101", "ANN102", "ANN401"]  # self/cls/Any in dict patterns

[tool.ruff.lint.flake8-type-checking]
strict = true
runtime-evaluated-base-classes = ["pydantic.BaseModel"]

Mypy Enhancement (pyproject.toml)

[tool.mypy]
enable_error_code = ["ignore-without-code"]  # Require codes on type: ignore

Pre-commit Quality Gates (.pre-commit-config.yaml)

- repo: local
  hooks:
    - id: audit-type-ignores
      name: Audit type:ignore comments
      entry: bash -c 'count=$(grep -r "type: ignore" src/ | wc -l); echo "type:ignore: $count (target ≤5)"'
    - id: audit-noqa
      name: Audit noqa comments
      entry: bash -c 'count=$(grep -r "noqa:" src/ | wc -l); echo "noqa: $count (target ≤3)"'
    - id: prevent-any-aliases
      name: Prevent Any type aliases
      entry: bash -c 'grep -rn "^[A-Z].* = Any" src/ && exit 1 || exit 0'

Type Safety Policy (CLAUDE.md)

Document Any usage guidelines:

  • ✅ Allowed: dict[str, Any], Pydantic validators, ORM factories, log context
  • ❌ Forbidden: Lazy initialization aliases, avoiding generics
  • Policy: type: ignore must include error codes, noqa should be centralized

Verification

# Verify metrics after implementation
grep -r "type: ignore" src/ | wc -l  # Target: 2-3
grep -r "noqa:" src/ | wc -l         # Target: 2 (in common.py only)
grep -rn "split(\"/\", 1)" src/      # Target: 1 (in repository.py)

# Standard verification
uv run mypy src/
uv run ruff check src/ tests/
uv run pytest

Phase 1.16.2: CLI Refactoring ✅ COMPLETE

Building on the completed code quality work, this sub-phase focuses on:

  1. ✅ Consolidating CLI async execution patterns into a robust run_async_command helper
  2. ✅ Creating reusable repository argument/option factories

Results

Metric                      Before                   After Phase 1.16.2
CLI async boilerplate       ~90 lines (6 commands)   ~6 lines
Repo argument definitions   4 duplicated             3 centralized
Async pattern consistency   2 patterns               1 pattern (asyncio.run())
PEP 561 compliance          No                       Yes (py.typed marker)

Current State (Post Phase 1.16.1)

Metric                    Before   After Phase 1.16.1
type: ignore comments     10       2 (Pydantic-only)
noqa comments             4        0
Duplicate split("/", 1)   7        1
except: pass in tests     3        0

CLI Async Execution Consolidation

Problem: Two inconsistent async patterns exist:

File            Pattern                                          Issues
cli/sync.py     asyncio.get_event_loop().run_until_complete()   Legacy pattern, reuses event loop
cli/github.py   asyncio.run() via _run_async() helper            Modern pattern, isolated event loop

Solution: Unified run_async_command() helper in cli/common.py:

def run_async_command(
    coro: Coroutine[object, object, T],
    *,
    error_prefix: str = "Error",
) -> T:
    """Execute async code from synchronous CLI command with unified error handling."""
    try:
        return asyncio.run(coro)
    except typer.Exit:
        raise  # Re-raise deliberate exits
    except Exception as e:
        console.print(f"[red]{error_prefix}:[/red] {e}")
        raise typer.Exit(1) from None

Impact: Reduces ~90 lines of boilerplate across 6 CLI commands.

Repository Argument Factories

Problem: Three repo argument patterns exist with duplicated definitions:

Pattern                   Usage                Definition
Positional Argument       sync pr, sync repo   Required, owner/name format
Optional --repo Option    sync retry           Filtering, owner/name format
Comma-separated --repos   sync all             Override list

Solution: Annotated type aliases in cli/common.py:

RepoArgument = Annotated[
    str,
    typer.Argument(help="Repository in owner/name format"),
]

RepoFilterOption = Annotated[
    str | None,
    typer.Option("--repo", "-r", help="Filter by repository"),
]

ReposListOption = Annotated[
    str | None,
    typer.Option("--repos", "-r", help="Comma-separated list of repos"),
]

Plus validation helpers: validate_repo() and validate_repo_list().
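
A plausible shape for those helpers, built on parse_repo_string() from Phase 1.16 (illustrative, not the shipped code):

def validate_repo(value: str) -> str:
    """Raise ValueError via parse_repo_string() if not in owner/name format."""
    parse_repo_string(value)
    return value

def validate_repo_list(value: str | None) -> list[str] | None:
    """Validate a comma-separated repo list; None passes through."""
    if value is None:
        return None
    return [validate_repo(repo.strip()) for repo in value.split(",")]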

Files Modified

File                              Changes
cli/common.py                     Add run_async_command, repo types, validation helpers
cli/sync.py                       Use new helpers (4 async blocks, 4 repo definitions)
cli/github.py                     Remove _run_async, use run_async_command
src/github_activity_db/py.typed   PEP 561 marker file

Phase 1.16.3: Test Type Safety Enforcement ✅ COMPLETE

Properly enabled mypy for tests by fixing all 230 type errors. Pre-commit now runs mypy on both src/ and tests/ directories.

Results

Metric                          Before        After
Mypy errors in tests            230           0
Pre-commit runs mypy on tests   No            Yes
CI runs mypy on tests           No            Yes
Factory **overrides usage       2 functions   0
type: ignore in src/            -             2 (target: ≤5) ✅
noqa comments in src/           -             0 (target: ≤3) ✅

Fix Strategies Applied

1. Untyped Dict Unpacking (170+ errors)

Replaced Model(**dict) with Pydantic's model_validate():

# Before (causes arg-type errors)
pr = GitHubPullRequest(**GITHUB_PR_RESPONSE)

# After (uses Pydantic's built-in validation)
pr = GitHubPullRequest.model_validate(GITHUB_PR_RESPONSE)

2. Factory Dict Merging (16 errors)

Removed **overrides pattern, using explicit parameters instead.

3. Optional Access (37 errors)

Added null assertions before attribute access:

result = await repo.get_by_number(...)
assert result is not None  # Narrows type
assert result.state == PRState.MERGED  # Works

Pre-commit Configuration

- id: mypy
  additional_dependencies:
    - pydantic>=2.0
    - pydantic-settings>=2.0
    - sqlalchemy>=2.0
    - pytest-asyncio>=0.24
    - typer>=0.9.0
    - rich>=13.0
    - loguru>=0.7.0
    - githubkit>=0.11.0
  args: [--config-file=pyproject.toml, src, tests]
  pass_filenames: false
  files: ^(src|tests)/

Quality Gates

Pre-commit now includes automated quality audits:

  • Audit type:ignore comments - Reports count, target ≤5
  • Audit noqa comments - Reports count, target ≤3
  • Prevent Any type aliases - Blocks TypeName = Any patterns

Phase 2: Enhanced Features (Future)

2.1 GitHub User Identity (Normalized)

Normalize user identities into a dedicated github_users table to enable queries like "all PRs involving user X" and handle username changes gracefully.

Key Design Consideration: Verified vs Unverified Identities

GitHub's API returns two types of author information for commits:

  • GitHub User (commit.author.login): Verified GitHub account linked to the git email
  • Git Author (commit.commit.author.name): Name from git config user.name (fallback when email not linked)

The github_users table should track this distinction:

class GitHubUser(Base):
    id: int                      # Primary key
    username: str                # GitHub login OR git author name
    github_id: int | None        # GitHub user ID (null if unverified)
    is_verified: bool            # True = linked GitHub account, False = git-only
    email: str | None            # Git email (for potential future linking)
    display_name: str | None     # Full name if available
    first_seen_at: datetime      # When we first encountered this user

    # Relationships
    prs_submitted: list[PullRequest]
    prs_merged: list[PullRequest]
    commits: list[CommitAuthor]
    reviews: list[Review]

Benefits:

  • Query all contributions by a user across verified and unverified commits
  • Identify when a git author name gets linked to a GitHub account
  • Track contributors who haven't linked their git email to GitHub
  • Enable future deduplication (e.g., "Victor Gonzalez" → "optidigital-prebid")
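
A sketch of how a commit could be mapped onto this model; the field paths follow the GitHub API distinction described above, while the helper itself is hypothetical:

def resolve_author(commit) -> dict:
    """Return GitHubUser fields for a commit's author (hypothetical helper)."""
    if commit.author is not None:            # verified: linked GitHub account
        return {"username": commit.author.login,
                "github_id": commit.author.id,
                "is_verified": True}
    git_author = commit.commit.author        # unverified: raw git metadata fallback
    return {"username": git_author.name,
            "github_id": None,
            "is_verified": False,
            "email": git_author.email}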

2.2 GitHub Issues Support

  • Issue data model (similar to PR)
  • Issue sync from GitHub API
  • Issue tagging and search

2.3 Agent Integration

  • classify_tags generation pipeline
  • ai_summary generation on PR merge
  • Configurable prompts/models

2.4 Search Enhancements

  • Full-text search on title/description
  • Date range filtering
  • Export to CSV/JSON

Phase 3: Advanced Features (Future)

Real-time Sync

  • GitHub webhooks support
  • Incremental updates
  • Background sync daemon

Web Interface

  • REST API layer
  • Simple web UI for browsing
  • Dashboard with statistics

Multi-org Support

  • Support repos outside Prebid org
  • Configurable repo list via CLI
  • Per-repo sync settings