
CLAUDE.md - Jira Test Data Generator

Project Overview

Purpose: Generate realistic test data for Jira Cloud instances using production-based multipliers to simulate real-world usage patterns.

Key Features:

  • Bulk API operations (50 issues per call)
  • Parallel issue creation across projects for significant speedup
  • Async concurrency for high-volume items (comments, worklogs, watchers, votes, properties, remote links)
  • Optimized attachments with pre-generated pool of small files (1-5KB) and session reuse
  • Pre-generated random text pool for reduced CPU overhead at scale
  • Optimized connection pooling for both sync and async HTTP sessions
  • Memory-efficient task batching to avoid creating millions of task objects upfront
  • Intelligent rate limit handling with exponential backoff
  • Production-based multipliers loaded from CSV file
  • Dynamic project creation based on multipliers
  • Auto-assigns Project Administrator role for watcher permissions
  • Unique run ID labeling for easy JQL searching
  • Support for 4 instance size buckets (small/medium/large/xlarge)
  • Modular architecture for maintainability
  • Checkpointing for resumable large-scale runs (18M+ issues)
  • Benchmarking with time extrapolation and request statistics for planning large runs

Target User: Teams who need to test Jira backup/restore scenarios with realistic data volumes.


File Structure

.
├── jira_data_generator.py       # Main orchestrator (~800 lines)
├── jira_user_generator.py       # User/group creation helper (~480 lines)
├── generators/                   # Modular generators package
│   ├── __init__.py              # Package exports
│   ├── base.py                  # JiraAPIClient, RateLimitState (~300 lines)
│   ├── projects.py              # ProjectGenerator (~400 lines) - projects, categories, versions, components, properties
│   ├── issues.py                # IssueGenerator (~250 lines)
│   ├── issue_items.py           # IssueItemsGenerator (~450 lines)
│   ├── agile.py                 # AgileGenerator (~250 lines)
│   ├── filters.py               # FilterGenerator (~200 lines)
│   ├── custom_fields.py         # CustomFieldGenerator (~500 lines) - custom fields, contexts, options
│   ├── checkpoint.py            # CheckpointManager (~450 lines) - resumable runs
│   └── benchmark.py             # BenchmarkTracker (~400 lines) - timing, request stats, extrapolation
├── tests/                        # Unit tests (90%+ coverage)
│   ├── conftest.py              # Shared pytest fixtures
│   ├── test_base.py             # JiraAPIClient tests
│   ├── test_benchmark.py        # BenchmarkTracker tests
│   ├── test_checkpoint.py       # CheckpointManager tests
│   ├── test_projects.py         # ProjectGenerator tests
│   ├── test_issues.py           # IssueGenerator tests
│   ├── test_issue_items.py      # IssueItemsGenerator tests
│   ├── test_agile.py            # AgileGenerator tests
│   ├── test_filters.py          # FilterGenerator tests
│   ├── test_custom_fields.py    # CustomFieldGenerator tests
│   ├── test_jira_data_generator.py  # Main orchestrator tests
│   ├── test_jira_user_generator.py  # User generator tests
│   └── test_cli.py              # CLI entry point tests
├── .github/workflows/test.yml   # GitHub Actions CI/CD
├── item_type_multipliers.csv    # Multiplier configuration
├── requirements.txt             # Python dependencies
├── requirements-dev.txt         # Test dependencies (pytest, coverage, etc.)
├── pytest.ini                   # Pytest configuration
├── .env.example                 # API token template
├── .gitignore                   # Python venv and credentials
├── README.md                    # User-facing documentation
├── QUICKREF.md                  # Quick reference for common commands
├── example_usage.sh             # Example usage scenarios
└── CLAUDE.md                    # This file - for AI agents

Architecture & Design Patterns

Module Overview

The codebase uses a modular architecture with specialized generators:

| Module | Class | Responsibility |
|---|---|---|
| base.py | JiraAPIClient | HTTP sessions, rate limiting, base API calls, text pool, connection pooling |
| projects.py | ProjectGenerator | Projects, categories, versions, components, properties, role management |
| issues.py | IssueGenerator | Bulk issue creation (parallel across projects), attachments (pooled) |
| issue_items.py | IssueItemsGenerator | Comments, worklogs, links, watchers, votes, properties, remote links |
| agile.py | AgileGenerator | Boards, sprints, sprint issue assignment |
| filters.py | FilterGenerator | Saved filters, dashboards |
| custom_fields.py | CustomFieldGenerator | Custom fields (20 types), contexts, options |
| checkpoint.py | CheckpointManager | Progress tracking, resume support for large runs |
| benchmark.py | BenchmarkTracker | Per-phase timing, rates calculation, request statistics, time extrapolation |

Core Classes

RateLimitState (dataclass) - generators/base.py

  • Purpose: Track rate limiting state across API calls (thread-safe for async)
  • Fields:
    • retry_after: Seconds to wait (from Retry-After header)
    • consecutive_429s: Count of consecutive rate limit hits
    • current_delay: Current exponential backoff delay
    • max_delay: Maximum delay cap (60s)
    • _lock: asyncio.Lock for thread-safe updates in async context
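As a sketch, the dataclass might look like the following (field names come from the list above; the defaults and the `backoff()` helper are illustrative, not copied from `generators/base.py`):

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RateLimitState:
    retry_after: Optional[float] = None  # seconds from Retry-After header
    consecutive_429s: int = 0            # consecutive rate limit hits
    current_delay: float = 1.0           # exponential backoff delay
    max_delay: float = 60.0              # delay cap
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def backoff(self) -> float:
        """Return the wait time (preferring Retry-After), then double the delay."""
        wait = self.retry_after if self.retry_after else self.current_delay
        self.current_delay = min(self.current_delay * 2, self.max_delay)
        self.consecutive_429s += 1
        return wait
```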

JiraAPIClient (base class) - generators/base.py

  • Purpose: Base class for all generators with shared API functionality
  • Key Methods:
    • _api_call(): Synchronous API call with rate limiting and request tracking
    • _api_call_async(): Async API call with rate limiting and request tracking
    • _create_session(): Create requests session with optimized connection pooling
    • _get_async_session(): Get/create aiohttp session with optimized connection pooling
    • _init_text_pool(): Initialize pre-generated random text pool (class method)
    • generate_random_text(): Get random text from pre-generated pool (class method)
    • _record_request(): Record API request in benchmark stats
    • _record_rate_limit(): Record rate limit hit in benchmark stats
    • _record_error(): Record error in benchmark stats
    • get_current_user_account_id(): Fetch authenticated user
    • get_all_users(): Fetch users for watchers
  • Class-Level Attributes:
    • _TEXT_POOL_SIZE: Number of pre-generated text strings per category (1000)
    • _text_pool: Dict with 'short', 'medium', 'long' text pools
    • _LOREM_WORDS: Word list for text generation (60 words)
  • Constructor Parameters:
    • benchmark: Optional BenchmarkTracker for request statistics tracking

JiraDataGenerator (orchestrator) - jira_data_generator.py

  • Purpose: Orchestrates all generator modules
  • Key Properties:
    • project_gen, issue_gen, issue_items_gen, agile_gen, filter_gen: Generator instances
    • run_id: Unique label format PREFIX-YYYYMMDD-HHMMSS
    • checkpoint: Optional CheckpointManager for resumable runs
    • benchmark: BenchmarkTracker for timing and request statistics (passed to all generators)

CheckpointManager - generators/checkpoint.py

  • Purpose: Track progress and enable resumable data generation for large runs
  • Key Features:
    • JSON-based checkpoint file ({PREFIX}-checkpoint.json)
    • Phase-level progress tracking (pending/in_progress/complete)
    • Count-based tracking for high-volume items
    • Atomic file writes (temp file + rename)
    • Auto-archive on completion to {run_id}-checkpoint.json
  • Key Methods:
    • initialize(): Create new checkpoint for fresh run
    • load(): Load existing checkpoint for resume
    • save(): Persist current state to file
    • is_phase_complete(): Check if phase can be skipped
    • get_remaining_count(): Calculate items still needed
    • finalize(): Mark run complete, archive checkpoint

PhaseMetrics (dataclass) - generators/benchmark.py

  • Purpose: Track metrics for a single generation phase
  • Fields:
    • name: Phase name
    • start_time, end_time: Timing boundaries
    • items_created, items_target: Item counts
    • rate_limited: Count of 429 responses during this phase
    • errors: Count of errors during this phase
  • Key Properties:
    • duration_seconds, items_per_second, seconds_per_item
    • format_duration(), format_rate(): Human-readable formatting

BenchmarkTracker - generators/benchmark.py

  • Purpose: Track performance metrics and provide time extrapolations for large runs
  • Key Features:
    • Per-phase timing with items/second rate calculation
    • Per-phase rate limit (429) and error tracking with percentages
    • Request statistics tracking (total, rate limited, errors)
    • Time extrapolation to 18M issues based on observed rates
    • JSON export for programmatic analysis
  • Key Properties:
    • total_requests: Count of API requests made
    • rate_limited_requests: Count of 429 responses
    • error_count: Count of failed requests (non-429)
    • rate_limit_percentage: Percentage of requests that were rate limited
    • error_percentage: Percentage of requests that failed
    • _current_phase: Tracks active phase for per-phase 429/error attribution
  • Key Methods:
    • start_phase(name, target_count): Begin timing a phase (sets _current_phase)
    • end_phase(name, items_created): End timing, log rate (clears _current_phase)
    • record_request(): Increment request counter
    • record_rate_limit(): Increment rate limit counter (global + per-phase)
    • record_error(): Increment error counter (global + per-phase)
    • get_summary_report(): Generate human-readable summary with per-phase 429s
    • format_extrapolation(target, current): Generate time estimate for target
    • to_dict(): Export all data as JSON-serializable dict
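The time extrapolation behind `format_extrapolation()` amounts to scaling observed elapsed time linearly by the item-count ratio. A sketch of the arithmetic (not the real method signature):

```python
def extrapolate_seconds(elapsed_seconds: float,
                        current_items: int,
                        target_items: int) -> float:
    """Estimate total seconds for target_items at the observed rate."""
    rate = current_items / elapsed_seconds  # items per second so far
    return target_items / rate
```

For example, creating 1,000 items in 100 seconds (10 items/s) extrapolates to 1.8M seconds for 18M items.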

Async vs Sync Operations

The tool uses a hybrid approach optimized for 18M+ issue scale:

| Operation | Mode | Volume at 18M | Reason |
|---|---|---|---|
| Custom Fields | Async | 22K | Configuration items, created first |
| Project Categories | Sequential | 360 | Low volume, created before projects |
| Projects | Sequential | 5,760 | Low volume, dependencies |
| Project Properties | Async | 1.6M | High volume at scale |
| Issues (bulk) | Parallel | 18M | Async across projects (50/call per project) |
| Components | Async | 3.8M | High volume at scale |
| Versions | Async | 25.9M | Very high volume at scale |
| Boards | Sequential | 15K | Filter dependency required |
| Sprints | Async | 900K | Scrum boards only (kanban doesn't support sprints) |
| Filters | Async | 100K | Medium-high volume at scale |
| Dashboards | Async | 16K | Private or authenticated sharing (global disabled) |
| Comments | Async | 48.4M | Very high volume |
| Worklogs | Async | 4.3M | High volume |
| Issue Links | Async | 4.1M | High volume |
| Watchers | Async | 40.3M | Very high volume |
| Attachments | Async | 27.4M | Pooled 1-5KB files for fast uploads |
| Votes | Async | 5.4K | Low volume (but async for consistency) |
| Issue Properties | Async | 13.3M | High volume |
| Remote Links | Async | 5.9M | Medium-high volume |

Rate Limiting Strategy

Priority Order:

  1. Primary: Use Retry-After header from Jira response
  2. Fallback: Exponential backoff starting at 1s, doubling on each 429
  3. Reset: Return to 1s delay on successful request
  4. Max: Cap at 60s delay
  5. Thread-safe: Uses asyncio.Lock for shared state in async context

Adaptive Throttling (async only):

  • On 429: Increase adaptive_delay by 100ms (caps at 1s)
  • On success: Decrease adaptive_delay by 10ms every 10 successes
  • Effective delay = request_delay + adaptive_delay + jitter (±10%)

Jitter: ±20% on rate limit backoff, ±10% on request delay to prevent thundering herd

  • Sync implementation: JiraAPIClient._handle_rate_limit()
  • Async implementation: JiraAPIClient._handle_rate_limit_async() (with adaptive throttling)
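The adaptive-throttling arithmetic above can be sketched as pure functions (constants are the documented values; the function names are illustrative, not the real method names):

```python
import random


def on_429(adaptive_delay: float) -> float:
    """On a 429: increase adaptive delay by 100ms, capped at 1s."""
    return min(adaptive_delay + 0.1, 1.0)


def on_success(adaptive_delay: float, successes: int) -> float:
    """On success: decrease adaptive delay by 10ms every 10 successes."""
    if successes % 10 == 0:
        return max(adaptive_delay - 0.01, 0.0)
    return adaptive_delay


def effective_delay(request_delay: float, adaptive_delay: float) -> float:
    """request_delay + adaptive_delay with ±10% jitter."""
    base = request_delay + adaptive_delay
    return base * random.uniform(0.9, 1.1)
```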

API Call Patterns

Synchronous (_api_call)

Used for low-volume operations (projects, issues bulk, components, versions):

  1. Handles dry-run mode
  2. Implements retry logic (max 5 attempts)
  3. Processes rate limit responses
  4. Provides timeout protection (30s)
  5. Supports custom base_url for agile API
  6. Returns Response object or None on failure

Asynchronous (_api_call_async)

Used for high-volume operations (comments, worklogs, links, watchers, etc.):

  1. Uses aiohttp ClientSession with connection pooling
  2. Semaphore-controlled concurrency
  3. Shared rate limit tracking via asyncio.Lock
  4. Supports custom base_url for agile API
  5. Returns tuple: (success: bool, response_json: Optional[Dict])
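The semaphore-controlled concurrency in step 2 reduces to this pattern (a simplified sketch; the real `_api_call_async` wraps rate limiting and the aiohttp call around it):

```python
import asyncio


async def bounded_call(semaphore: asyncio.Semaphore, coro_fn):
    """Run coro_fn while holding a semaphore slot, capping in-flight requests."""
    async with semaphore:  # at most `concurrency` calls run at once
        return await coro_fn()
```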

Data Model

Multipliers (loaded from CSV)

Source: item_type_multipliers.csv

Loading:

def load_multipliers_from_csv(csv_path: Optional[str] = None) -> Dict[str, Dict[str, float]]:
    """Load multipliers from CSV file."""

CSV Structure:

Item Type,Small,Medium,Large,XLarge
comment,4.80,4.75,2.69,0.26
issue_worklog,7.27,1.49,0.24,0.06
project,0.00249,0.00066,0.00032,0.00001
...

Size Buckets:

  • small: 1-1K issues (high activity per issue)
  • medium: 1K-10K issues (balanced)
  • large: 10K-100K issues (moderate activity)
  • xlarge: 100K+ issues (low activity per issue)
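A minimal sketch of what load_multipliers_from_csv could look like given the CSV structure above (the actual implementation in jira_data_generator.py may differ):

```python
import csv
from typing import Dict


def load_multipliers_from_csv(
    csv_path: str = "item_type_multipliers.csv",
) -> Dict[str, Dict[str, float]]:
    """Map each item type to per-bucket multipliers, e.g. {'comment': {'small': 4.8, ...}}."""
    size_map = {"Small": "small", "Medium": "medium", "Large": "large", "XLarge": "xlarge"}
    multipliers: Dict[str, Dict[str, float]] = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            multipliers[row["Item Type"]] = {
                size_map[col]: float(row[col]) for col in size_map
            }
    return multipliers
```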

Created Item Types

Project-level (created sequentially via ProjectGenerator):

  • Project Categories (project_category) - created before projects
  • Projects (dynamically created based on multiplier)
  • Project Properties (project_property) - custom key-value metadata
  • Versions (project_version)
  • Components (project_component)

Issue-level (via IssueGenerator and IssueItemsGenerator):

  • Issues (bulk created in batches of 50)
  • Comments (comment) - async
  • Worklogs (issue_worklog) - async
  • Watchers (issue_watcher) - async
  • Issue Links (issue_link) - async
  • Attachments (issue_attachment) - async
  • Votes (issue_vote) - async
  • Properties (issue_properties) - async
  • Remote Links (issue_remote_link) - async

Agile items (via AgileGenerator):

  • Boards (board) - scrum and kanban
  • Sprints (sprint) - with issue assignment

Other items (via FilterGenerator):

  • Filters (filter) - saved JQL queries
  • Dashboards (dashboard) - with share permissions

Configuration items (via CustomFieldGenerator):

  • Custom Fields (issue_field) - 20 types with contexts and options

Not Implemented (system/configuration items):

  • Workflows, screens, permission schemes
  • Issue types, priorities, resolutions
  • These are configuration items that should already exist

Key Conventions

Naming

  • Prefixes: User-specified, appears in all created items (e.g., PERF, LOAD)
  • Project Keys: {PREFIX[:6]}{N} (e.g., PERF1, PERF2)
  • Run IDs: {PREFIX}-YYYYMMDD-HHMMSS format
  • Issue Summaries: {PROJECT_KEY} Test Issue {N}
  • Versions: {PREFIX} v{N}.0
  • Components: {PREFIX}-Component-{N}
  • Categories: {PREFIX} {Type} {N} (e.g., PERF Development 1)
  • Project Properties: {prefix}_property_{N}
  • Boards: {PREFIX} {Type} Board {N}
  • Sprints: {PREFIX} Sprint {N}
  • Filters: {PREFIX} Filter {N}
  • Dashboards: {PREFIX} {Type} Dashboard {N}
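The run ID and project key conventions can be sketched as small helpers (the helper names are hypothetical; only the format strings come from the list above):

```python
from datetime import datetime


def make_run_id(prefix: str, now: datetime) -> str:
    """{PREFIX}-YYYYMMDD-HHMMSS"""
    return f"{prefix}-{now:%Y%m%d-%H%M%S}"


def make_project_key(prefix: str, n: int) -> str:
    """{PREFIX[:6]}{N} - Jira project keys must be uppercase, max 10 chars."""
    return f"{prefix[:6].upper()}{n}"
```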

Logging

  • INFO: Progress updates, batch completions
  • WARNING: Rate limits, recoverable errors
  • ERROR: Fatal errors, API failures
  • DEBUG: Verbose mode only (--verbose flag)

Error Handling

  • Fail Fast: If project/issue creation fails, abort (everything depends on these)
  • Continue on Error: For comments/worklogs/watchers/etc. (log and continue)
  • Retry Logic: 5 attempts for transient errors
  • Graceful Degradation: Missing features are skipped with warnings

Extending the Project

Adding New Generator Module

  1. Create new file in generators/ directory
  2. Inherit from JiraAPIClient
  3. Implement sync and async methods following existing patterns
  4. Add to generators/__init__.py exports
  5. Initialize in JiraDataGenerator._init_generators()
  6. Call from generate_all() and generate_all_async()

Adding New Async Item Type

Template for adding a new async item type (in IssueItemsGenerator or new module):

async def create_new_items_async(self, issue_keys: List[str], count: int) -> int:
    """Create new item type concurrently"""
    self.logger.info(f"Creating {count} new items (concurrency: {self.concurrency})...")

    # Pre-generate all tasks
    tasks = []
    for i in range(count):
        issue_key = random.choice(issue_keys)
        item_data = {
            "field1": "value1",
            "field2": self.generate_random_text()
        }
        tasks.append(self._api_call_async('POST', f'issue/{issue_key}/newitem', data=item_data))

    # Execute with progress tracking
    created = 0
    for i in range(0, len(tasks), self.concurrency * 2):
        batch = tasks[i:i + self.concurrency * 2]
        results = await asyncio.gather(*batch, return_exceptions=True)
        for result in results:
            if isinstance(result, tuple) and result[0]:
                created += 1
        self.logger.info(f"Created {created}/{count} new items")

    return created

Adding New Sync Item Type

Template for sync item type:

def create_new_items(self, issue_keys: List[str], count: int) -> int:
    """Create new item type"""
    self.logger.info(f"Creating {count} new items...")

    created = 0
    for i in range(count):
        issue_key = random.choice(issue_keys)
        item_data = {
            "field1": "value1",
            "field2": self.generate_random_text()
        }

        response = self._api_call('POST', f'issue/{issue_key}/newitem', data=item_data)
        if response:
            created += 1

        if (i + 1) % 10 == 0:  # log every 10 attempts, even if some failed
            self.logger.info(f"Created {created}/{count} new items")
            time.sleep(0.2)

    return created

Updating Multipliers

  1. Edit item_type_multipliers.csv file
  2. Test with --dry-run to verify calculations
  3. Document changes in README.md

API Endpoints Used

Jira Cloud REST API v3

Base URL: {jira_url}/rest/api/3/

| Endpoint | Method | Generator | Purpose |
|---|---|---|---|
| projectCategory | POST | ProjectGenerator | Create project category |
| project | POST | ProjectGenerator | Create project |
| project/{key} | GET/PUT | ProjectGenerator | Get/update project details |
| project/{key}/role | GET | ProjectGenerator | Get project roles |
| project/{key}/role/{id} | POST | ProjectGenerator | Add user to role |
| project/{key}/properties/{key} | PUT | ProjectGenerator | Set project properties |
| issue/bulk | POST | IssueGenerator | Bulk issue creation |
| issue/{key}/attachments | POST | IssueGenerator | Upload attachments |
| issue/{key}/comment | POST | IssueItemsGenerator | Add comments |
| issue/{key}/worklog | POST | IssueItemsGenerator | Add worklogs |
| issue/{key}/watchers | POST | IssueItemsGenerator | Add watchers |
| issue/{key}/votes | POST | IssueItemsGenerator | Add votes |
| issue/{key}/properties/{key} | PUT | IssueItemsGenerator | Set issue properties |
| issue/{key}/remotelink | POST | IssueItemsGenerator | Add remote links |
| issueLink | POST | IssueItemsGenerator | Create issue links |
| issueLinkType | GET | IssueItemsGenerator | Get link types |
| version | POST | ProjectGenerator | Create versions |
| component | POST | ProjectGenerator | Create components |
| filter | POST | FilterGenerator | Create filters |
| dashboard | POST | FilterGenerator | Create dashboards |
| field | POST | CustomFieldGenerator | Create custom fields |
| field/{id}/context | GET/POST | CustomFieldGenerator | Get/create field contexts |
| field/{id}/context/{id}/option | POST | CustomFieldGenerator | Create field options |
| myself | GET | JiraAPIClient | Get current user |
| users/search | GET | JiraAPIClient | Search users |

Jira Software Cloud REST API (Agile)

Base URL: {jira_url}/rest/agile/1.0/

| Endpoint | Method | Generator | Purpose |
|---|---|---|---|
| board | GET/POST | AgileGenerator | Get/create boards |
| sprint | POST | AgileGenerator | Create sprints |
| sprint/{id}/issue | POST | AgileGenerator | Add issues to sprint |

Documentation: https://developer.atlassian.com/cloud/jira/platform/rest/v3/


Performance Optimizations

The tool includes several optimizations designed for 18M+ issue scale:

1. Pre-Generated Random Text Pool

Location: generators/base.py (class-level in JiraAPIClient)

# Class attributes
_TEXT_POOL_SIZE = 1000  # Strings per category
_text_pool = {'short': [...], 'medium': [...], 'long': [...]}
_LOREM_WORDS = [...]  # 60 words

How it works:

  • 3,000 pre-generated text strings (1,000 per size category)
  • Categories: short (3-10 words), medium (5-15 words), long (10-30 words)
  • generate_random_text() selects from pool based on requested word range
  • Thread-safe initialization with double-checked locking

Impact: ~38x faster text generation (3.8M/sec vs ~100K/sec)
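The pool initialization with double-checked locking might look like this (a simplified standalone sketch; the real class-level attributes live on JiraAPIClient and the word list has 60 entries):

```python
import random
import threading


class TextPool:
    _TEXT_POOL_SIZE = 1000
    _LOREM_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet"]  # real list: 60 words
    _text_pool = None
    _pool_lock = threading.Lock()

    @classmethod
    def _init_text_pool(cls):
        if cls._text_pool is None:          # first check without the lock (fast path)
            with cls._pool_lock:
                if cls._text_pool is None:  # second check under the lock
                    ranges = {"short": (3, 10), "medium": (5, 15), "long": (10, 30)}
                    cls._text_pool = {
                        name: [
                            " ".join(random.choices(cls._LOREM_WORDS,
                                                    k=random.randint(lo, hi)))
                            for _ in range(cls._TEXT_POOL_SIZE)
                        ]
                        for name, (lo, hi) in ranges.items()
                    }

    @classmethod
    def generate_random_text(cls, category: str = "medium") -> str:
        cls._init_text_pool()
        return random.choice(cls._text_pool[category])
```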

2. Connection Pool Tuning

Sync Session (_create_session()):

HTTPAdapter(
    pool_connections=20,  # Cache 20 host pools
    pool_maxsize=50,      # 50 connections per host
    pool_block=False      # Don't block on pool exhaustion
)

Async Session (_get_async_session()):

TCPConnector(
    limit=100,            # Total connections
    limit_per_host=50,    # Per-host limit
    ttl_dns_cache=300,    # 5-minute DNS cache
    enable_cleanup_closed=True
)

Impact: Reduces TCP handshake overhead by reusing connections

3. Memory-Efficient Task Batching

Before (creates all tasks upfront):

tasks = []
for _ in range(count):  # 48M iterations for comments
    tasks.append(...)   # 48M task objects in memory

After (generates per-batch):

for batch_start in range(0, count, batch_size):
    tasks = []  # Only batch_size tasks in memory
    for _ in range(current_batch_size):
        tasks.append(...)
    results = await asyncio.gather(*tasks)

Impact: Reduces peak memory from GBs to MBs at scale

4. Attachment Session Reuse

Location: generators/issues.py

async def _get_attachment_session(self) -> aiohttp.ClientSession:
    """Reusable session for all attachment uploads"""
    if self._attachment_session is None:
        connector = aiohttp.TCPConnector(limit=50, limit_per_host=20)
        self._attachment_session = aiohttp.ClientSession(...)
    return self._attachment_session

Impact: Eliminates ~27M session creation/teardown cycles at 18M scale

5. Attachment File Pooling

Location: generators/issues.py

  • Pool of 20 pre-generated files (1-5KB each)
  • Files reused across all attachment uploads
  • Eliminates per-upload content generation

Impact: ~100x smaller files, no per-upload generation overhead
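Pre-generating the pool could be as simple as the following (illustrative sketch; whether the real code keeps the files in memory or on disk is not specified here):

```python
import os
import random


def build_attachment_pool(pool_size: int = 20) -> list[bytes]:
    """Pre-generate pool_size random payloads of 1-5KB each for reuse across uploads."""
    return [os.urandom(random.randint(1024, 5 * 1024)) for _ in range(pool_size)]
```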

API Limitations

No Bulk Watcher API: Jira Cloud REST API does not support adding multiple watchers in a single call. Each watcher requires an individual POST to /rest/api/3/issue/{issueIdOrKey}/watchers. This is a Jira API limitation, not something we can optimize around.


Common Modifications

Change Concurrency Default

# In JiraDataGenerator.__init__()
concurrency: int = 5  # Change default value

Change Bulk API Limit

# In IssueGenerator class
BULK_CREATE_LIMIT = 50  # Change to Jira's limit

Adjust Rate Limit Behavior

# In RateLimitState dataclass
max_delay: float = 60.0  # Increase for more conservative rate limiting

# In JiraAPIClient._handle_rate_limit_async()
self.rate_limit.current_delay * 2  # Change multiplier for backoff

Add Custom Issue Fields

# In IssueGenerator.create_issues_bulk(), inside the issue_data dict
issue_data = {
    "fields": {
        "project": {"id": project_id},
        "summary": summary,
        "description": description,
        "issuetype": {"name": "Task"},
        "labels": [self.run_id, self.prefix],
        # Add custom fields here
        "customfield_10001": "custom value",
        "priority": {"name": "Medium"}
    }
}

Testing Guidelines

Unit Tests

The project has comprehensive unit tests with 90%+ code coverage.

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests in parallel (fastest)
pytest -n auto

# Run with coverage report
pytest -n auto --cov=generators --cov=jira_data_generator --cov=jira_user_generator --cov-report=term-missing

# Run a specific test file
pytest tests/test_cli.py -v

# Run tests matching a pattern
pytest -k "test_dry_run"

# Run only failed tests from last run
pytest --lf

Test Dependencies (requirements-dev.txt):

  • pytest - Test framework
  • pytest-asyncio - Async test support
  • pytest-cov - Coverage reporting
  • pytest-xdist - Parallel test execution
  • responses - Mock sync HTTP requests
  • aioresponses - Mock async HTTP requests

Mocking Strategy:

  • Sync HTTP calls mocked with responses library
  • Async HTTP calls mocked with aioresponses library
  • File I/O uses tmp_path fixture
  • CLI tests mock sys.argv and logging.FileHandler

CI/CD: Tests run on GitHub Actions (Python 3.12) with 90% coverage threshold.

Dry Run Testing

# Always test with --dry-run first
python jira_data_generator.py \
  --url https://mycompany.atlassian.net \
  --email test@company.com \
  --prefix DRYRUN \
  --count 100 \
  --size small \
  --dry-run

Small Scale Testing

# Start with 10-25 issues
python jira_data_generator.py ... --count 10 --size small

Test Different Concurrency Levels

# Low concurrency (safer)
python jira_data_generator.py ... --concurrency 3

# High concurrency (faster)
python jira_data_generator.py ... --concurrency 15

# Sequential (debugging)
python jira_data_generator.py ... --no-async

Verify in Jira

Search for everything created in a run with a JQL query on its run ID label:

labels = "PREFIX-20241204-143022"

Performance Characteristics

Timing Estimates

| Issues | Total Items | Sequential | Async (c=5) | Async (c=10) |
|---|---|---|---|---|
| 25 | ~250 | 1-2 min | 30-60s | 20-40s |
| 100 | ~1,000 | 5-7 min | 2-3 min | 1-2 min |
| 500 | ~5,000 | 20-30 min | 8-12 min | 5-8 min |
| 1,000 | ~10,000 | 45-60 min | 15-25 min | 10-15 min |

Times vary based on rate limits and network latency

Bottlenecks

  1. Rate Limits: Primary bottleneck - higher concurrency hits limits faster
  2. Network Latency: Each API call has ~100-300ms latency
  3. No Bulk API: Issue links, watchers, votes require individual calls

Optimization Features

  1. Async I/O: Concurrent requests for high-volume items
  2. Semaphore Control: Prevents overwhelming the API
  3. Connection Pooling: Reuses HTTP connections (both sync and async)
  4. Batch Processing: Issues created 50 at a time via bulk API
  5. Shared Rate Limit State: All async tasks respect the same limits

Dependencies

Required

  • requests>=2.31.0: HTTP library with retry support (sync)
  • aiohttp>=3.9.0: Async HTTP library (async)
  • urllib3>=2.0.0: Connection pooling and retries
  • python-dotenv>=1.0.0: Load environment variables from .env file

Python Version

  • Minimum: Python 3.7+ (dataclasses, type hints, asyncio)
  • Recommended: Python 3.9+ (improved type hints)
  • Tested: Python 3.11, 3.14

Command Line Options

| Option | Required | Description | Default |
|---|---|---|---|
| --url | Yes | Jira instance URL | - |
| --email | Yes | Your Jira email | - |
| --token | No* | API token | From env |
| --prefix | Yes | Prefix for items and project keys | - |
| --count | Yes | Number of issues to create | - |
| --projects | No | Override number of projects (issues spread evenly) | Calculated |
| --size | No | Instance size bucket | small |
| --concurrency | No | Number of concurrent API requests | 5 |
| --request-delay | No | Delay between requests in seconds (try 0.05-0.1) | 0 |
| --no-async | No | Disable async mode (sequential) | false |
| --dry-run | No | Preview only, no API calls | false |
| --verbose | No | Enable debug logging | false |
| --issues-only | No | Only create projects and issues, skip associated data | false |
| --resume | No | Resume from existing checkpoint | false |
| --no-checkpoint | No | Disable checkpointing entirely | false |

* Token can also be set via JIRA_API_TOKEN environment variable or .env file


Checkpointing System

Overview

The checkpointing system enables resumable data generation for large-scale runs (designed for 18M+ issues). Progress is automatically saved to a JSON file and can be resumed after interruptions.

Checkpoint File Format

{
  "run_id": "PERF-20241208-143022",
  "prefix": "PERF",
  "size": "small",
  "target_issue_count": 1000000,
  "started_at": "2024-12-08T14:30:22",
  "last_updated": "2024-12-08T16:45:00",
  "jira_url": "https://company.atlassian.net",
  "async_mode": true,
  "concurrency": 10,
  "project_keys": ["PERF1", "PERF2"],
  "project_ids": {"PERF1": "10001", "PERF2": "10002"},
  "issue_keys": ["PERF1-1", "PERF1-2", ...],
  "issues_per_project": {"PERF1": 500000, "PERF2": 450000},
  "phases": {
    "projects": {"status": "complete", "target_count": 2, "created_count": 2},
    "issues": {"status": "in_progress", "target_count": 1000000, "created_count": 950000},
    "comments": {"status": "pending", "target_count": 4800000, "created_count": 0},
    ...
  }
}

Phase Tracking

All phases are tracked with status (pending/in_progress/complete) and counts:

| Phase | Tracks |
|---|---|
| project_categories | Category creation |
| projects | Project keys and IDs |
| project_properties | Property count |
| issues | Issue keys, per-project counts |
| components | Component count |
| versions | Version count |
| comments | Comment count |
| worklogs | Worklog count |
| issue_links | Link count |
| watchers | Watcher count |
| attachments | Attachment count |
| votes | Vote count |
| issue_properties | Property count |
| remote_links | Remote link count |
| boards | Board count |
| sprints | Sprint count |
| filters | Filter count |
| dashboards | Dashboard count |

Usage Examples

# Start a new large run (checkpoint auto-created)
python jira_data_generator.py \
  --url https://company.atlassian.net \
  --email user@company.com \
  --prefix BIGRUN \
  --count 1000000 \
  --size large \
  --concurrency 10

# Resume after interruption
python jira_data_generator.py \
  --url https://company.atlassian.net \
  --email user@company.com \
  --prefix BIGRUN \
  --count 1000000 \
  --size large \
  --resume

# Disable checkpointing for small runs
python jira_data_generator.py ... --no-checkpoint

Resume Behavior

When --resume is specified:

  1. Looks for {PREFIX}-checkpoint.json in current directory
  2. Validates URL matches (warns if different)
  3. Restores run_id to maintain JQL label consistency
  4. Skips completed phases entirely
  5. For partial phases, calculates remaining items needed
  6. Fetches existing issue keys from Jira if needed
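Step 5's remaining-count calculation reduces to the following (a hypothetical sketch over the checkpoint phase dicts shown earlier; the real logic lives in CheckpointManager.get_remaining_count()):

```python
def remaining_for_phase(phase: dict, target: int) -> int:
    """Items still needed: 0 for complete phases, target minus created otherwise."""
    if phase.get("status") == "complete":
        return 0
    return max(target - phase.get("created_count", 0), 0)
```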

Checkpoint Lifecycle

  1. Start: Creates {PREFIX}-checkpoint.json
  2. During run: Updates after each phase/batch completion
  3. On interrupt: Last save preserved for resume
  4. On completion: Renames to {run_id}-checkpoint.json for archival

Design Decisions

  • Periodic checkpointing for high-volume items: Issues, comments, watchers, attachments, and versions all checkpoint every ~500 items instead of only at phase completion. This balances safety vs performance - max data loss is ~500 items per type instead of potentially millions, with reasonable disk write overhead.
  • Count-based tracking: For 18M issues, storing all keys would be ~500MB. Instead, we track counts per project and can reconstruct keys via JQL query on resume.
  • Phase granularity: Complete phases are skipped entirely; partial phases continue from last count.
  • Atomic writes: Uses temp file + rename to prevent corruption on interrupt.
  • No item-level tracking for high-volume items: Comments, worklogs, etc. track only total count (some items may be duplicated on resume, which is acceptable).
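The atomic-write decision can be sketched as a standalone function (illustrative; the real method is CheckpointManager.save()). os.replace is atomic on POSIX, so an interrupt never leaves a torn checkpoint file:

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):  # clean up the temp file on failure
            os.unlink(tmp_path)
        raise
```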

Troubleshooting Guide for Agents

Issue: "Rate limit hit" warnings

Solution: This is expected. The code handles it automatically. If persistent:

  • Reduce --concurrency value
  • The tool will auto-adjust with exponential backoff

Issue: "Failed to create projects. Aborting."

Check:

  1. API token has project creation permissions
  2. Prefix creates valid project keys (uppercase, max 10 chars)
  3. No existing projects conflict with generated keys

Issue: "Failed to create issues. Aborting."

Check:

  1. API token is valid
  2. User has permission in project
  3. Project was created successfully
  4. Jira URL is correct (https://company.atlassian.net)

Issue: Async-related errors

Solution: Try --no-async flag to use sequential mode

python jira_data_generator.py ... --no-async

Issue: Slow performance

Expected: Large datasets take time due to rate limits Try: Increase --concurrency (but may hit rate limits faster)

Issue: Board creation fails

Check:

  1. Jira Software is enabled (not just Jira Core)
  2. Project type is 'software'
  3. Filter creation succeeds (boards require filters)

Quick Reference for Common Agent Tasks

"Add support for [feature]"

  1. Check if endpoint exists in Jira API v3 docs
  2. Decide which generator module to use (or create new one)
  3. Decide sync vs async (based on volume)
  4. Add method following patterns above
  5. Add to JiraDataGenerator.generate_all() and generate_all_async()
  6. Add multiplier to CSV if needed
  7. Test with --dry-run
  8. Update README.md

"Fix rate limiting issue"

  1. Check JiraAPIClient._handle_rate_limit_async() method for async
  2. Check JiraAPIClient._handle_rate_limit() method for sync
  3. Verify asyncio.Lock is used for shared state
  4. Adjust max_delay or backoff multiplier
  5. Consider reducing default concurrency
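The lock-plus-backoff shape being checked in steps 1-4 can be sketched as below. This is a simplified stand-in, not the project's actual RateLimitState (which also tracks adaptive delay and recent 429/success counts); the method names are hypothetical:

```python
import asyncio
import random

class BackoffState:
    """Shared backoff state; the asyncio.Lock serializes updates
    so concurrent tasks don't race on current_delay."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.current_delay = base_delay
        self.lock = asyncio.Lock()

    async def on_rate_limited(self) -> float:
        """Double the delay (capped at max_delay), return it with +/-20% jitter."""
        async with self.lock:
            self.current_delay = min(self.current_delay * 2, self.max_delay)
            return self.current_delay * random.uniform(0.8, 1.2)

    async def on_success(self) -> None:
        """Decay back toward the base delay after successful calls."""
        async with self.lock:
            self.current_delay = max(self.base_delay, self.current_delay * 0.9)
```

Tuning `max_delay` raises the ceiling under sustained 429s; tuning the `* 2` multiplier changes how fast the ceiling is reached. The jitter prevents all workers from retrying in lockstep.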

"Change output format"

  1. Modify logging statements in relevant generator methods
  2. Keep structure: "Created X/Y items"
  3. For async: log progress in batch loop
  4. Update README.md examples

"Add new size bucket"

  1. Add column to item_type_multipliers.csv
  2. Update size_map in load_multipliers_from_csv()
  3. Add to argparse choices
  4. Document in README.md

User Generator Script

Overview

jira_user_generator.py is a helper script for creating sandbox test users and groups.

Key Classes

JiraUserGenerator

  • Purpose: Invite users and create groups in Jira Cloud
  • Key Properties:
    • VALID_PRODUCTS: List of valid Jira product names
    • products: List of products to grant access (default: ['jira-software'])
    • created_users[]: Track invited users
    • existing_users[]: Track users that already exist
    • created_groups[]: Track created groups
    • existing_groups[]: Track groups that already exist

Command Line Options

| Option | Required | Description | Default |
| --- | --- | --- | --- |
| `--url` | Yes | Jira instance URL | - |
| `--email` | Yes | Your Jira admin email | - |
| `--token` | No* | API token | From env |
| `--base-email` | Yes | Base email for sandbox users | - |
| `--users` | Yes | Number of users to create | - |
| `--groups` | No | Group names to create | None |
| `--products` | No | Products to grant access | jira-software |
| `--user-prefix` | No | Display name prefix | Sandbox |
| `--dry-run` | No | Preview only | false |
| `--verbose` | No | Debug logging | false |

Version History

  • v3.11 (2024-12-18): Project count override

    • Added --projects option to override the calculated number of projects
    • Issues are spread evenly across the specified number of projects
    • Useful for controlling data distribution regardless of multiplier calculations
    • Example: --count 18000000 --projects 100 creates 180,000 issues per project
  • v3.10 (2024-12-16): Periodic checkpointing for high-volume items

    • Issues: Checkpoint every 500 issues (10 bulk batches) instead of per-project
    • Comments: Checkpoint every 500 comments during async creation (48.4M at 18M scale)
    • Watchers: Checkpoint every 500 watchers during async creation (40.3M at 18M scale)
    • Attachments: Checkpoint every 500 attachments during async creation (27.4M at 18M scale)
    • Versions: Checkpoint every 500 versions across parallel project creation (25.9M at 18M scale)
    • Maximum data loss on interruption reduced from potentially millions to ~500 items per type
    • All high-volume generators now accept optional checkpoint parameter
    • Added start_count parameter to async methods for proper resume support
    • Critical for reliable 18M+ issue runs where crashes could lose hours of work
  • v3.9 (2024-12-10): Issues-only mode

    • Added --issues-only option to create only projects and issues, skipping all associated data
    • Useful for quick testing of issue creation performance or scenarios where only issues are needed
    • When enabled, skips: comments, worklogs, watchers, attachments, votes, properties, remote links, versions, components, categories, boards, sprints, filters, dashboards, custom fields
  • v3.8 (2024-12-10): Adaptive rate limiting and request smoothing

    • Added --request-delay option to add base delay between requests (recommended: 0.05-0.1s)
    • Implemented adaptive throttling: automatically increases delay when hitting 429s, decreases on success
    • Added jitter to backoff delays (±20%) and request delays (±10%) to prevent thundering herd
    • New RateLimitState fields: adaptive_delay, recent_429_count, recent_success_count
    • New methods: _get_effective_delay(), _apply_request_delay()
    • All async operations now benefit from smoothed request pacing
    • Updated all generator classes to accept request_delay parameter
    • Per-phase rate limit tracking: Benchmark summary now shows 429s and errors per phase with percentages
    • New PhaseMetrics fields: rate_limited, errors for per-phase tracking
  • v3.7 (2024-12-10): Fix dashboard sharing and sprint board filtering

    • Fixed dashboard creation error: Removed {"type": "global"} share permission which is disabled on most Jira Cloud instances
    • Dashboards now alternate between private and authenticated user sharing
    • Fixed sprint creation error: Sprints are now only created on scrum boards (kanban boards don't support sprints)
    • Added scrum_board_ids filtering in both sync and async agile item creation
  • v3.6 (2024-12-09): Performance optimizations for 18M+ scale

    • Session reuse for attachments: Dedicated aiohttp.ClientSession reused across all attachment uploads (eliminates ~27M session creation/teardown cycles at 18M scale)
    • Pre-generated random text pool: 3,000 pre-generated text strings in 3 size categories (short/medium/long), ~38x faster than generating on-the-fly
    • Memory-efficient task batching: All async methods now generate tasks per-batch instead of upfront (reduces peak memory from GBs to MBs at scale)
    • Optimized connection pooling:
      • Sync (requests): HTTPAdapter(pool_connections=20, pool_maxsize=50)
      • Async (aiohttp): TCPConnector(limit=100, limit_per_host=50, ttl_dns_cache=300)
    • Updated all async methods in issue_items.py, projects.py, and issues.py
  • v3.5 (2024-12-09): Parallel issue creation and attachment optimization

    • Added create_issues_bulk_async() method for parallel issue creation across projects
    • Issues now created concurrently across multiple projects (up to 5 parallel)
    • Added attachment pooling system with pre-generated 1-5KB files
    • Reduced attachment size from 1-500KB to 1-5KB (~100x smaller)
    • Pool of 20 reusable attachments eliminates per-upload content generation
    • Expected 10-20x speedup for attachment phase
    • Updated all documentation (README, QUICKREF, CLAUDE.md)
  • v3.4 (2024-12-08): Custom fields support

    • Added CustomFieldGenerator module for creating custom fields
    • Supports 20 field types: textfield, textarea, float, date, datetime, select, multiselect, radiobuttons, checkboxes, userpicker, grouppicker, labels, url, project, version, etc.
    • Automatically creates options for select-type fields
    • Both sync and async methods available
    • Created early as configuration items (before projects)
    • At 18M issues: ~22K custom fields
  • v3.3 (2024-12-08): Expanded async support for 18M+ issue scale

    • Added async methods for versions, components, project properties
    • Added async methods for sprints, filters, dashboards
    • Updated main generator to use async for all high-volume operations
    • Boards remain sequential (filter dependency required)
    • Volume analysis: versions (25.9M), components (3.8M), sprints (900K) at 18M scale
  • v3.2 (2024-12-08): Request statistics tracking

    • Added request statistics to benchmark summary (total requests, rate limited %, errors %)
    • Added record_request(), record_rate_limit(), record_error() methods to BenchmarkTracker
    • Integrated request tracking into JiraAPIClient._api_call() and _api_call_async()
    • Added benchmark parameter to all generator classes
    • Shows "(No requests recorded - dry-run mode)" in dry-run mode
    • Updated README, QUICKREF, and CLAUDE.md documentation
  • v3.1 (2024-12-08): Benchmarking and time extrapolation

    • Added BenchmarkTracker class for performance tracking
    • Per-phase timing with items/second rates
    • Automatic time extrapolation for 18M issues
    • Summary report with phase breakdown
    • Updated README and QUICKREF with benchmarking docs
  • v3.0 (2024-12-08): Checkpointing for resumable large runs

    • Added CheckpointManager for progress tracking
    • New --resume flag to continue interrupted runs
    • New --no-checkpoint flag to disable checkpointing
    • Phase-level and count-based progress tracking
    • Atomic checkpoint file writes
    • Support for 18M+ issue runs with interruption recovery
  • v2.1 (2024-12-08): Project categories and properties

    • Added project categories (organize projects)
    • Added project properties (custom key-value metadata)
    • Projects automatically assigned to categories
    • Updated all documentation
  • v2.0 (2024-12-08): Modular architecture refactor

    • Split monolithic file into generators/ package
    • Added votes, properties, remote links support
    • Added boards and sprints (agile API)
    • Added filters and dashboards
    • Improved code organization and maintainability
    • Updated CLAUDE.md with new architecture
  • v1.2 (2024-12-05): User generator script

    • New jira_user_generator.py helper script
    • Invite sandbox users with plus-addressing
    • Create groups
    • Grant Jira product access
  • v1.1 (2024-12-05): Async concurrency support

    • Added aiohttp for async HTTP requests
    • Concurrent creation for comments, worklogs, watchers, links
    • --concurrency flag to control parallel requests
    • --no-async flag for sequential mode
  • v1.0 (2024-12-04): Initial release

    • Bulk issue creation
    • Rate limit handling
    • 4 size buckets
    • Comments, worklogs, watchers, links
    • Versions and components

Last Updated: 2024-12-18 (v3.11 - Project Count Override)

AI Agent Note: This file is specifically for you. The user-facing docs are in README.md.