feat: Add synthetic data for staging deployment #24
Document decision to use custom Python seed script for staging environment synthetic data generation, with notes on exploring MapEHR/openFHIR for future plausibility research.

Key points:
- Custom Python script using Faker for immediate staging needs
- MapEHR/openFHIR unavailable as open source, requires vendor contact
- Synthea considered but adds complexity (FHIR → openEHR conversion)
- Will explore MapEHR/openFHIR for plausibility once staging is functional
Add comprehensive Railway deployment strategies for seed scripts:
- Option 1: Dockerfile CMD with chained commands (current pattern)
- Option 2: railway.toml startCommand override
- Option 3: Conditional environment-based seeding (recommended)

Recommends Option 3 with idempotent seed script that:
- Checks RAILWAY_ENVIRONMENT=staging before seeding
- Only seeds if patient count < threshold
- Completes quickly (<10s) to avoid deployment timeout
- Uses unique identifiers to avoid production conflicts

Includes Railway documentation references for start commands, migrations, and deployment actions.
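The gating described for Option 3 can be sketched as a small pure function. This is a minimal illustration, not the PR's actual code: the name `should_seed` follows the review discussion, but the signature, the `PATIENT_THRESHOLD` value, and the way the patient count is obtained are assumptions.

```python
import os

STAGING_ENV = "staging"
# Hypothetical value: the PR does not state the actual threshold.
PATIENT_THRESHOLD = 10


def should_seed(environment, patient_count, threshold=PATIENT_THRESHOLD):
    """Return True only in staging and only while the patient count is below the threshold."""
    if environment != STAGING_ENV:
        return False
    return patient_count < threshold


if __name__ == "__main__":
    env = os.environ.get("RAILWAY_ENVIRONMENT")
    # The real count would come from the API or the database; 0 is a stand-in.
    print(should_seed(env, patient_count=0))
```

Because the decision depends only on the environment name and the current count, running the script a second time after a successful seed is a no-op, which is what makes it idempotent.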
Add automated seeding of realistic synthetic clinical data for Railway staging deployments, following ADR-0005.

Changes:
- Add scripts/seed_staging.py: Environment-aware, idempotent seed script
  - Generates 15 synthetic patients using Faker library
  - Creates 2-5 realistic vital signs per patient
  - Clinically plausible values based on WHO guidelines
  - Only runs when RAILWAY_ENVIRONMENT=staging
  - Checks patient count threshold before seeding
  - Completes in <10s to avoid deployment timeout
- Update api/Dockerfile:
  - Copy scripts directory into container
  - Add conditional seeding in CMD before uvicorn starts
  - Grant appuser ownership of /scripts directory
- Update api/pyproject.toml:
  - Add faker>=22.0.0 dependency for realistic data generation

Implementation details:
- Blood pressure: systolic 90-140 mmHg, diastolic 60-90 mmHg
- Pulse rate: 60-100 bpm (normal resting adult)
- Timestamps: Spread over past 1-4 weeks
- MRN prefix: STAGING- to distinguish from production data
- Idempotent: Safe to run multiple times (checks existing data)

API endpoints used:
- POST /api/patients - Create synthetic patients
- POST /api/vital-signs - Record vital signs observations

Deployment: Set RAILWAY_ENVIRONMENT=staging in Railway environment variables to enable automatic seeding on container startup.
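The value ranges listed under "Implementation details" can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of scripts/seed_staging.py: the function names, the dictionary keys, and the numeric MRN format are hypothetical, and Faker is left out so the snippet stays self-contained.

```python
import random
from datetime import datetime, timedelta, timezone


def make_mrn(index):
    """Build an MRN carrying the STAGING- prefix (the zero-padded format is an assumption)."""
    return f"STAGING-{index:04d}"


def make_vital_sign(rng, now):
    """Generate one reading inside the ranges stated in the commit message."""
    hours_offset = rng.randint(24 * 7, 24 * 28)  # 1-4 weeks ago, bounds inclusive
    return {
        "systolic": rng.randint(90, 140),   # mmHg
        "diastolic": rng.randint(60, 90),   # mmHg
        "pulse": rng.randint(60, 100),      # bpm
        "recorded_at": now - timedelta(hours=hours_offset),
    }


def make_readings(rng, now):
    """Generate between 2 and 5 readings for one patient."""
    return [make_vital_sign(rng, now) for _ in range(rng.randint(2, 5))]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs produce the same data
    for reading in make_readings(rng, datetime.now(timezone.utc)):
        print(make_mrn(1), reading)
```

Passing the random generator and the current time in as arguments keeps the functions deterministic under test.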
Add comprehensive usage instructions for synthetic data generation:
- Local development seeding commands
- Railway staging automatic seeding setup
- Seed script behavior and guarantees
- Generated data specifications
- Manual trigger instructions

References ADR-0005 for detailed implementation rationale.
Walkthrough
This PR introduces synthetic data seeding for the staging environment. A new Python seed script generates 15 synthetic patients with vital signs using Faker, the Dockerfile is modified to conditionally execute this script during staging deployments, Faker is added as a dependency, and comprehensive documentation and architecture decision records are provided.
Sequence Diagram
sequenceDiagram
participant Docker as Docker Container<br/>(Startup)
participant Seed as Seed Script
participant API as API Server
participant DB as Database
Docker->>Seed: Execute seed_staging.py<br/>(if RAILWAY_ENVIRONMENT="staging")
Seed->>Seed: Check environment variable
activate Seed
Seed->>Seed: Verify conditions:<br/>staging/local mode
Seed->>API: GET /api/health (health check)
API-->>Seed: Health status
Seed->>API: GET /api/patients (count check)
API->>DB: Query patient count
DB-->>API: Current count
API-->>Seed: Patient count
alt Conditions Met
Seed->>Seed: Generate 15 synthetic patients<br/>(Faker demographics)
loop For each patient (1 to 15)
Seed->>API: POST /api/patients<br/>(MRN, name, birth date)
API->>DB: Insert patient
DB-->>API: Patient created
API-->>Seed: Patient ID + metadata
loop For each vital sign (3 readings)
Seed->>Seed: Generate vital signs<br/>(systolic, diastolic,<br/>pulse, timestamp)
Seed->>API: POST /api/vital-signs<br/>(reading data)
API->>DB: Insert vital observation
DB-->>API: Success
API-->>Seed: Vital sign ID
end
end
Seed->>Seed: Log summary:<br/>15 patients + 45 vitals
else Conditions Not Met
Seed->>Seed: Skip seeding<br/>(not staging or<br/>threshold reached)
end
deactivate Seed
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks
✅ Passed checks (3 passed)
Actionable comments posted: 2
Fix all issues with AI Agents 🤖
In @scripts/seed_staging.py:
- Around line 188-190: The comment for spreading readings over "past 2-4 weeks"
is inconsistent with the code that sets hours_offset via randint(24 * 7, 24 *
28) (which yields 1-4 weeks); update either the comment to say "past 1-4 weeks"
or change the randint call to randint(24 * 14, 24 * 28) to enforce 2-4 weeks,
locating the change in the loop that defines hours_offset in the for i in
range(num_readings) block.
🧹 Nitpick comments (2)
docs/adr/0005-synthetic-data-generation.md (1)
1-452: Comprehensive ADR with solid research and rationale.
The architecture decision record is well-structured, thoroughly researched, and provides clear justification for the custom Python seed script approach. The comparison of alternatives (MapEHR, openFHIR, Synthea) is particularly valuable.
Optional: Address markdown linting hints
The static analysis tool flagged several minor markdown formatting issues that could be addressed:
- Bare URLs (lines 36, 58, 76, 101, 442-447): Consider using proper markdown link syntax `[text](url)` instead of bare URLs
- Missing language specifiers (lines 94, 370): Add language identifiers to fenced code blocks (e.g., `text` or `bash`)

These are purely stylistic and don't affect functionality.
CLAUDE.md (1)
74-119: Clear and comprehensive seeding documentation.
The new synthetic data seeding section provides excellent guidance for both local development and Railway staging deployments. The breakdown of seed script behavior, configuration steps, and manual trigger options is particularly helpful.
Optional: Add language specifier to code block
At line 93, add a language identifier to the fenced code block (flagged by markdownlint):
-```
+```bash
 RAILWAY_ENVIRONMENT=staging

This is a minor formatting improvement for consistency.

📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro

📥 Commits
Reviewing files that changed from the base of the PR and between 5f5f894b8f5a191a580e46dd2980086c27898920 and aeec50a9af1afe8990807e08d49f1dfb8fd53e7e.

📒 Files selected for processing (5)
- CLAUDE.md
- api/Dockerfile
- api/pyproject.toml
- docs/adr/0005-synthetic-data-generation.md
- scripts/seed_staging.py

🧰 Additional context used
🧠 Learnings (2)

📚 Learning: 2026-01-02T20:46:00.339Z
Learnt from: CR
Repo: platzhersh/open-cis PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-02T20:46:00.339Z
Learning: Copy .env.example to .env locally and configure environment variables: DATABASE_URL for Prisma app database connection, EHRBASE_URL for EHRBase REST API endpoint, CORS_ORIGINS for allowed origins JSON array, and VITE_API_URL for frontend API base URL
Applied to files:
- api/Dockerfile
- CLAUDE.md

📚 Learning: 2026-01-02T20:46:00.339Z
Learnt from: CR
Repo: platzhersh/open-cis PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-02T20:46:00.339Z
Learning: EHRBase takes 30-60 seconds to become available after docker compose up; always check /ehrbase/rest/status before running the API
Applied to files:
- CLAUDE.md

🧬 Code graph analysis (1)
scripts/seed_staging.py (1)
api/tests/conftest.py (1)
- client (12-18)

🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
- 93-93: Fenced code blocks should have a language specified (MD040, fenced-code-language)

docs/adr/0005-synthetic-data-generation.md
- 36-36: Bare URL used (MD034, no-bare-urls)
- 58-58: Bare URL used (MD034, no-bare-urls)
- 76-76: Bare URL used (MD034, no-bare-urls)
- 94-94: Fenced code blocks should have a language specified (MD040, fenced-code-language)
- 101-101: Bare URL used (MD034, no-bare-urls)
- 370-370: Fenced code blocks should have a language specified (MD040, fenced-code-language)
- 442-442: Bare URL used (MD034, no-bare-urls)
- 443-443: Bare URL used (MD034, no-bare-urls)
- 444-444: Bare URL used (MD034, no-bare-urls)
- 445-445: Bare URL used (MD034, no-bare-urls)
- 446-446: Bare URL used (MD034, no-bare-urls)
- 447-447: Bare URL used (MD034, no-bare-urls)

🔇 Additional comments (3)
scripts/seed_staging.py (2)
115-141: Well-designed correlated vital signs generation.
The logic for generating clinically realistic vital signs with correlated systolic/diastolic values is excellent. The correlation logic (higher systolic → higher diastolic) reflects real physiological patterns.

34-74: Solid idempotent design with proper error handling.
The overall structure demonstrates good practices:
- Environment-aware execution (checks RAILWAY_ENVIRONMENT)
- Idempotent with patient count threshold
- Graceful error handling with informative messages
- Realistic synthetic data generation using Faker
- Appropriate timeouts (10s for checks, 30s for operations)

The logic is sound, though the execution model needs adjustment per the Dockerfile review comment.
Also applies to: 144-163, 207-247

api/pyproject.toml (1)
16-16: The Faker dependency constraint >=22.0.0 is appropriate. Version 22.0.0 and all subsequent versions (up to the latest 39.0.0 as of Dec 2025) are stable. No active security advisories or CVEs have been found for recent Faker releases.
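The correlation praised in the comment on lines 115-141 of scripts/seed_staging.py (higher systolic → higher diastolic) could be produced in many ways. The sketch below shows one hypothetical model, a linear mapping plus jitter, and is not taken from the script itself; the function name and the jitter width are assumptions.

```python
import random


def correlated_blood_pressure(rng):
    """Return (systolic, diastolic) where diastolic tends to rise with systolic."""
    systolic = rng.randint(90, 140)
    # Map systolic 90..140 linearly onto diastolic 60..90, then add jitter.
    base = 60 + (systolic - 90) * (90 - 60) / (140 - 90)
    diastolic = round(base + rng.uniform(-5, 5))
    # Clamp so the jitter cannot push the value outside the stated range.
    diastolic = max(60, min(90, diastolic))
    return systolic, diastolic


if __name__ == "__main__":
    rng = random.Random(7)
    for _ in range(5):
        print(correlated_blood_pressure(rng))
```

Drawing the two values independently would still stay inside the ranges but could pair a systolic of 140 with a diastolic of 60, which is the kind of implausible reading the correlation avoids.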
# Conditionally seed staging data if RAILWAY_ENVIRONMENT=staging
CMD sh -c "echo 'Running database migrations...' && \
    prisma migrate deploy && \
    if [ \"$RAILWAY_ENVIRONMENT\" = \"staging\" ]; then \
        echo 'Seeding staging data...' && python /scripts/seed_staging.py; \
    fi && \
    echo 'Starting uvicorn on port ${PORT:-8000}...' && \
    uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"
Critical: Seed script requires running API but executes before uvicorn starts.
The seed script is executed before starting uvicorn in the CMD chain, but scripts/seed_staging.py makes HTTP requests to API endpoints:
- Line 50: GET /health
- Line 61: GET /api/patients
- Lines 82-85: POST /api/patients
- Lines 101-104: POST /api/vital-signs
Since the API hasn't started yet, all HTTP requests will fail, causing should_seed() to return False (lines 55-56), and seeding will be skipped every time.
This means automatic staging seeding will never execute successfully despite the documentation claiming it does.
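The failure mode can be reproduced locally with a small check against a port where nothing is listening. This is a sketch for illustration only: `api_is_reachable` and `unused_local_port` are hypothetical helpers, and the `/health` path is taken from the line reference above rather than from the script itself.

```python
import socket
import urllib.error
import urllib.request


def api_is_reachable(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all end up here.
        return False


def unused_local_port():
    """Ask the OS for a free port, then release it so nothing is listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = unused_local_port()
    # Mirrors the container before uvicorn has started: nothing accepts connections.
    print(api_is_reachable(f"http://127.0.0.1:{port}"))
```

With no server bound to the port, the connection is refused and the check reports the API as unreachable, which matches the reviewer's point that a seed step placed before uvicorn in the CMD chain would skip seeding.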
Proposed solutions
Solution 1: Use database directly (Recommended)
Refactor seed_staging.py to use Prisma client directly instead of HTTP API:
from prisma import Prisma

async def main():
    db = Prisma()
    await db.connect()
    try:
        # Create patients directly in DB
        patient = await db.patient.create(data={...})
        # Create compositions via ehrbase_client (imported from api/src/ehrbase/client.py)
        await ehrbase_client.create_composition(...)
    finally:
        await db.disconnect()

Solution 2: Run seed after API starts
Use Railway deployment lifecycle hooks or a separate service to run the seed script after the API is healthy:
# Remove seeding from CMD, handle via Railway deployment actions
CMD sh -c "prisma migrate deploy && uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"

Then configure Railway to run the seed script as a post-deployment hook.
Solution 3: Background seeding
Start uvicorn first, then run seed in the background:
CMD sh -c "prisma migrate deploy && \
    uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000} & \
    sleep 5 && \
    if [ \"$RAILWAY_ENVIRONMENT\" = \"staging\" ]; then python /scripts/seed_staging.py; fi && \
    wait"

However, this adds complexity and fragility.
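Part of the fragility in Solution 3 comes from the fixed `sleep 5`, which is either too short on a slow start or wasted time on a fast one. One common alternative is to poll a health check until it succeeds or a deadline passes. The sketch below is a hypothetical helper, not part of the PR; `wait_until_healthy` and its parameters are illustrative names.

```python
import time


def wait_until_healthy(check, timeout=30.0, interval=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout seconds elapse.

    sleep and clock are injectable so the loop can be tested without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Stand-in check that succeeds on the third attempt.
    attempts = iter([False, False, True])
    print(wait_until_healthy(lambda: next(attempts), interval=0.1))
```

In a seed script, `check` would be whatever probe the script already uses for the API health endpoint, and seeding would proceed only when the function returns True.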
for i in range(num_readings):
    # Spread readings over past 2-4 weeks
    hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago
Fix comment/code inconsistency for time range.
Line 189 comment states "Spread readings over past 2-4 weeks" but line 190 generates readings from 1-4 weeks ago:
hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago
24 * 7 = 168 hours = 1 week, so the range is 1-4 weeks, not 2-4.
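The arithmetic can be checked directly by converting both bounds to days. `offset_bounds_in_days` is a hypothetical helper used only for this check.

```python
from datetime import timedelta


def offset_bounds_in_days(min_hours, max_hours):
    """Convert an inclusive hour range (as passed to randint) into whole days."""
    return timedelta(hours=min_hours).days, timedelta(hours=max_hours).days


print(offset_bounds_in_days(24 * 7, 24 * 28))   # (7, 28): 1-4 weeks, the current code
print(offset_bounds_in_days(24 * 14, 24 * 28))  # (14, 28): 2-4 weeks, the alternative
```

Since `random.randint` includes both endpoints, the current call can return exactly 168 hours, so readings as recent as one week old are possible.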
Proposed fix
Update the comment to match the code:
- # Spread readings over past 2-4 weeks
+ # Spread readings over past 1-4 weeks
  hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago

Or adjust the code if 2-4 weeks was intentional:

  # Spread readings over past 2-4 weeks
- hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago
+ hours_offset = randint(24 * 14, 24 * 28) # 2-4 weeks ago

🤖 Prompt for AI Agents
In @scripts/seed_staging.py around lines 188-190, The comment for spreading
readings over "past 2-4 weeks" is inconsistent with the code that sets
hours_offset via randint(24 * 7, 24 * 28) (which yields 1-4 weeks); update
either the comment to say "past 1-4 weeks" or change the randint call to
randint(24 * 14, 24 * 28) to enforce 2-4 weeks, locating the change in the loop
that defines hours_offset in the for i in range(num_readings) block.
Summary by CodeRabbit
New Features
Documentation
Chores