feat: Add synthetic data for staging deployment by platzhersh · Pull Request #24 · platzhersh/open-cis

platzhersh · 2026-01-05T22:44:19Z

Summary by CodeRabbit

New Features
- Staging environment now automatically generates synthetic patient data on deployment.
Documentation
- Added comprehensive documentation for data seeding workflows in local development and staging deployment.
- Added architectural decision record documenting the synthetic data generation approach and implementation strategy.
Chores
- Added dependency for synthetic data generation utilities.

✏️ Tip: You can customize this high-level summary in your review settings.

Document decision to use custom Python seed script for staging environment synthetic data generation, with notes on exploring MapEHR/openFHIR for future plausibility research.

Key points:
- Custom Python script using Faker for immediate staging needs
- MapEHR/openFHIR unavailable as open source, requires vendor contact
- Synthea considered but adds complexity (FHIR → openEHR conversion)
- Will explore MapEHR/openFHIR for plausibility once staging is functional

Add comprehensive Railway deployment strategies for seed scripts:
- Option 1: Dockerfile CMD with chained commands (current pattern)
- Option 2: railway.toml startCommand override
- Option 3: Conditional environment-based seeding (recommended)

Recommends Option 3 with idempotent seed script that:
- Checks RAILWAY_ENVIRONMENT=staging before seeding
- Only seeds if patient count < threshold
- Completes quickly (<10s) to avoid deployment timeout
- Uses unique identifiers to avoid production conflicts

Includes Railway documentation references for start commands, migrations, and deployment actions.
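The gating rule recommended in Option 3 can be sketched as a small pure function. This is only an illustration under stated assumptions: the name `should_seed` mirrors the function mentioned in the review, but the signature and the threshold value of 10 are hypothetical, since the commit message does not give them.

```python
import os
from typing import Optional

STAGING_ENV = "staging"   # value the commit message checks for
PATIENT_THRESHOLD = 10    # hypothetical; the real threshold is not stated here


def should_seed(environment: Optional[str], patient_count: int,
                threshold: int = PATIENT_THRESHOLD) -> bool:
    """Return True only for staging deployments holding fewer patients than the threshold."""
    if environment != STAGING_ENV:
        return False
    return patient_count < threshold


if __name__ == "__main__":
    env = os.environ.get("RAILWAY_ENVIRONMENT")
    # patient_count would come from the API or database; 0 is a placeholder.
    print(should_seed(env, patient_count=0))
```

Keeping the decision separate from the I/O makes the idempotency guarantee easy to check: a second run that sees the seeded patients returns False and does nothing.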

Add automated seeding of realistic synthetic clinical data for Railway staging deployments, following ADR-0005.

Changes:
- Add scripts/seed_staging.py: Environment-aware, idempotent seed script
  - Generates 15 synthetic patients using Faker library
  - Creates 2-5 realistic vital signs per patient
  - Clinically plausible values based on WHO guidelines
  - Only runs when RAILWAY_ENVIRONMENT=staging
  - Checks patient count threshold before seeding
  - Completes in <10s to avoid deployment timeout
- Update api/Dockerfile:
  - Copy scripts directory into container
  - Add conditional seeding in CMD before uvicorn starts
  - Grant appuser ownership of /scripts directory
- Update api/pyproject.toml:
  - Add faker>=22.0.0 dependency for realistic data generation

Implementation details:
- Blood pressure: systolic 90-140 mmHg, diastolic 60-90 mmHg
- Pulse rate: 60-100 bpm (normal resting adult)
- Timestamps: Spread over past 1-4 weeks
- MRN prefix: STAGING- to distinguish from production data
- Idempotent: Safe to run multiple times (checks existing data)

API endpoints used:
- POST /api/patients - Create synthetic patients
- POST /api/vital-signs - Record vital signs observations

Deployment: Set RAILWAY_ENVIRONMENT=staging in Railway environment variables to enable automatic seeding on container startup.
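For orientation, the value ranges and the MRN prefix listed above could be turned into payloads roughly as follows. This is a stdlib-only sketch, not the code in scripts/seed_staging.py: the helper names, the dictionary keys, and the four-digit MRN suffix are assumptions, and the real script draws demographics from Faker, which is left out here.

```python
import random
from datetime import datetime, timedelta, timezone

MRN_PREFIX = "STAGING-"  # prefix named in the commit message


def make_mrn(index: int) -> str:
    """Build an MRN that cannot be confused with a production identifier."""
    return f"{MRN_PREFIX}{index:04d}"  # zero-padded suffix is an assumption


def make_vital_reading(rng: random.Random, now: datetime) -> dict:
    """Draw one reading inside the ranges listed in the commit message."""
    hours_offset = rng.randint(24 * 7, 24 * 28)  # 1-4 weeks ago
    return {
        "systolic": rng.randint(90, 140),    # mmHg
        "diastolic": rng.randint(60, 90),    # mmHg
        "pulse_rate": rng.randint(60, 100),  # bpm
        "recorded_at": (now - timedelta(hours=hours_offset)).isoformat(),
    }


if __name__ == "__main__":
    rng = random.Random(24)  # fixed seed so repeated runs give the same data
    print(make_mrn(1), make_vital_reading(rng, datetime.now(timezone.utc)))
```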

Add comprehensive usage instructions for synthetic data generation:
- Local development seeding commands
- Railway staging automatic seeding setup
- Seed script behavior and guarantees
- Generated data specifications
- Manual trigger instructions

References ADR-0005 for detailed implementation rationale.

coderabbitai · 2026-01-05T22:44:26Z

Walkthrough

This PR introduces synthetic data seeding for the staging environment. A new Python seed script generates 15 synthetic patients with vital signs using Faker, the Dockerfile is modified to conditionally execute this script during staging deployments, Faker is added as a dependency, and comprehensive documentation and architecture decision records are provided.

Changes

Cohort / File(s)	Summary
Documentation `CLAUDE.md`, `docs/adr/0005-synthetic-data-generation.md`	Added comprehensive documentation of synthetic data seeding strategy, including local development and Railway staging deployment procedures, seed script behavior (idempotent, environment-aware), and architectural rationale with evaluated alternatives (MapEHR, openFHIR, Synthea, ehrbase/fhir-bridge vs. custom Python script).
Dependencies `api/pyproject.toml`	Added Faker library (`faker>=22.0.0`) to support realistic patient data generation.
Infrastructure & Startup `api/Dockerfile`	Copied seed scripts into image, granted ownership to appuser, and added conditional startup logic to execute seed script when `RAILWAY_ENVIRONMENT` equals "staging" (runs after migrations, before uvicorn).
Seed Script Implementation `scripts/seed_staging.py`	New standalone script that generates and seeds synthetic patients and vital signs. Includes environment-aware execution (staging/local only), API health checks, idempotent behavior with patient count thresholds, data generation using Faker, and robust error handling for API operations. Produces 15 synthetic patients with 3 vital-sign readings each, distributed over 1–4 weeks with clinically plausible values.

Sequence Diagram

sequenceDiagram
    participant Docker as Docker Container<br/>(Startup)
    participant Seed as Seed Script
    participant API as API Server
    participant DB as Database

    Docker->>Seed: Execute seed_staging.py<br/>(if RAILWAY_ENVIRONMENT="staging")
    
    Seed->>Seed: Check environment variable
    activate Seed
    Seed->>Seed: Verify conditions:<br/>staging/local mode
    Seed->>API: GET /api/health (health check)
    API-->>Seed: Health status
    Seed->>API: GET /api/patients (count check)
    API->>DB: Query patient count
    DB-->>API: Current count
    API-->>Seed: Patient count
    
    alt Conditions Met
        Seed->>Seed: Generate 15 synthetic patients<br/>(Faker demographics)
        loop For each patient (1 to 15)
            Seed->>API: POST /api/patients<br/>(MRN, name, birth date)
            API->>DB: Insert patient
            DB-->>API: Patient created
            API-->>Seed: Patient ID + metadata
            
            loop For each vital sign (3 readings)
                Seed->>Seed: Generate vital signs<br/>(systolic, diastolic,<br/>pulse, timestamp)
                Seed->>API: POST /api/vital-signs<br/>(reading data)
                API->>DB: Insert vital observation
                DB-->>API: Success
                API-->>Seed: Vital sign ID
            end
        end
        Seed->>Seed: Log summary:<br/>15 patients + 45 vitals
    else Conditions Not Met
        Seed->>Seed: Skip seeding<br/>(not staging or<br/>threshold reached)
    end
    deactivate Seed
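The call order in the diagram can be traced with a small in-memory stand-in for the API. Everything here is hypothetical scaffolding for illustration (`InMemoryApi`, the `seed` signature, the threshold of 10, the fixed vital values); the real script talks to the HTTP endpoints and also accepts a local mode, which this sketch omits.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryApi:
    """Hypothetical stand-in for the HTTP API, used only to trace the call order."""
    patients: list = field(default_factory=list)
    vitals: list = field(default_factory=list)

    def healthy(self) -> bool:
        return True

    def patient_count(self) -> int:
        return len(self.patients)

    def create_patient(self, payload: dict) -> int:
        self.patients.append(payload)
        return len(self.patients)  # acts as the new patient ID

    def create_vital(self, patient_id: int, payload: dict) -> None:
        self.vitals.append({"patient_id": patient_id, **payload})


def seed(api, environment: str, num_patients: int = 15,
         readings_per_patient: int = 3, threshold: int = 10) -> str:
    """Follow the diagram: check conditions, then create patients and their vitals."""
    if environment != "staging" or not api.healthy():
        return "skipped"
    if api.patient_count() >= threshold:
        return "skipped"
    for n in range(1, num_patients + 1):
        patient_id = api.create_patient({"mrn": f"STAGING-{n:04d}"})
        for _ in range(readings_per_patient):
            api.create_vital(patient_id, {"systolic": 120, "diastolic": 80, "pulse": 72})
    return f"{len(api.patients)} patients + {len(api.vitals)} vitals"
```

Calling `seed` twice against the same `InMemoryApi` exercises both branches of the `alt` block: the first call creates the data, the second is skipped because the count is already above the threshold.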

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Seeds of data, sprouting bright,
Faker weaves patients through staging night,
Vitals flow like morning dew,
Idempotent magic, trusty and true! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: Add synthetic data for staging deployment' accurately summarizes the main change—implementing synthetic data seeding for the staging environment across multiple files (Dockerfile, seed script, dependencies, and documentation).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch claude/add-synthetic-staging-data-OwTPD

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 2

Fix all issues with AI Agents 🤖

In @scripts/seed_staging.py:
- Around line 188-190: The comment for spreading readings over "past 2-4 weeks"
is inconsistent with the code that sets hours_offset via randint(24 * 7, 24 *
28) (which yields 1-4 weeks); update either the comment to say "past 1-4 weeks"
or change the randint call to randint(24 * 14, 24 * 28) to enforce 2-4 weeks,
locating the change in the loop that defines hours_offset in the for i in
range(num_readings) block.

🧹 Nitpick comments (2)

docs/adr/0005-synthetic-data-generation.md (1)

1-452: Comprehensive ADR with solid research and rationale.

The architecture decision record is well-structured, thoroughly researched, and provides clear justification for the custom Python seed script approach. The comparison of alternatives (MapEHR, openFHIR, Synthea) is particularly valuable.

Optional: Address markdown linting hints

The static analysis tool flagged several minor markdown formatting issues that could be addressed:

Bare URLs (lines 36, 58, 76, 101, 442-447): Consider using proper markdown link syntax [text](url) instead of bare URLs

Missing language specifiers (lines 94, 370): Add language identifiers to fenced code blocks (e.g., ```text or ```bash)

These are purely stylistic and don't affect functionality.

CLAUDE.md (1)

74-119: Clear and comprehensive seeding documentation.

The new synthetic data seeding section provides excellent guidance for both local development and Railway staging deployments. The breakdown of seed script behavior, configuration steps, and manual trigger options is particularly helpful.

Optional: Add language specifier to code block

At line 93, add a language identifier to the fenced code block (flagged by markdownlint):

-```
+```bash
 RAILWAY_ENVIRONMENT=staging

This is a minor formatting improvement for consistency.


<details>
<summary>📜 Review details</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 5f5f894b8f5a191a580e46dd2980086c27898920 and aeec50a9af1afe8990807e08d49f1dfb8fd53e7e.

</details>

<details>
<summary>📒 Files selected for processing (5)</summary>

* `CLAUDE.md`
* `api/Dockerfile`
* `api/pyproject.toml`
* `docs/adr/0005-synthetic-data-generation.md`
* `scripts/seed_staging.py`

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧠 Learnings (2)</summary>

<details>
<summary>📚 Learning: 2026-01-02T20:46:00.339Z</summary>
Learnt from: CR
Repo: platzhersh/open-cis PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-02T20:46:00.339Z
Learning: Copy .env.example to .env locally and configure environment variables: DATABASE_URL for Prisma app database connection, EHRBASE_URL for EHRBase REST API endpoint, CORS_ORIGINS for allowed origins JSON array, and VITE_API_URL for frontend API base URL
**Applied to files:**
- `api/Dockerfile`
- `CLAUDE.md`

</details>
<details>
<summary>📚 Learning: 2026-01-02T20:46:00.339Z</summary>
Learnt from: CR
Repo: platzhersh/open-cis PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-02T20:46:00.339Z
Learning: EHRBase takes 30-60 seconds to become available after docker compose up; always check /ehrbase/rest/status before running the API
**Applied to files:**
- `CLAUDE.md`

</details>

</details><details>
<summary>🧬 Code graph analysis (1)</summary>

<details>
<summary>scripts/seed_staging.py (1)</summary><blockquote>

<details>
<summary>api/tests/conftest.py (1)</summary>

* `client` (12-18)

</details>

</blockquote></details>

</details><details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>CLAUDE.md</summary>

93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>
<details>
<summary>docs/adr/0005-synthetic-data-generation.md</summary>

36-36: Bare URL used

(MD034, no-bare-urls)

---

58-58: Bare URL used

(MD034, no-bare-urls)

---

76-76: Bare URL used

(MD034, no-bare-urls)

---

94-94: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

101-101: Bare URL used

(MD034, no-bare-urls)

---

370-370: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

442-442: Bare URL used

(MD034, no-bare-urls)

---

443-443: Bare URL used

(MD034, no-bare-urls)

---

444-444: Bare URL used

(MD034, no-bare-urls)

---

445-445: Bare URL used

(MD034, no-bare-urls)

---

446-446: Bare URL used

(MD034, no-bare-urls)

---

447-447: Bare URL used

(MD034, no-bare-urls)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (3)</summary><blockquote>

<details>
<summary>scripts/seed_staging.py (2)</summary><blockquote>

`115-141`: **Well-designed correlated vital signs generation.**

The logic for generating clinically realistic vital signs with correlated systolic/diastolic values is excellent. The correlation logic (higher systolic → higher diastolic) reflects real physiological patterns.
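One way such a correlation could be produced is sketched here. The linear mapping and the ±5 mmHg jitter are illustrative assumptions, not the formula used at lines 115-141 of the script; only the 90-140 and 60-90 mmHg ranges come from the PR description.

```python
import random


def correlated_blood_pressure(rng: random.Random) -> tuple:
    """Draw systolic first, then a diastolic value that rises with it."""
    systolic = rng.randint(90, 140)
    # Map systolic 90..140 linearly onto a diastolic centre of 65..85 (assumed mapping).
    centre = 65 + (systolic - 90) * 20 / 50
    diastolic = round(centre + rng.uniform(-5, 5))
    # Clamp to the documented 60-90 mmHg range.
    diastolic = max(60, min(90, diastolic))
    return systolic, diastolic


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(correlated_blood_pressure(rng))
```

Drawing the two values independently would occasionally pair a high systolic with a low diastolic; deriving one from the other avoids those implausible readings.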

---

`34-74`: **Solid idempotent design with proper error handling.**

The overall structure demonstrates good practices:
- Environment-aware execution (checks `RAILWAY_ENVIRONMENT`)
- Idempotent with patient count threshold
- Graceful error handling with informative messages
- Realistic synthetic data generation using Faker
- Appropriate timeouts (10s for checks, 30s for operations)

The logic is sound, though the execution model needs adjustment per the Dockerfile review comment.




Also applies to: 144-163, 207-247

</blockquote></details>
<details>
<summary>api/pyproject.toml (1)</summary><blockquote>

`16-16`: The Faker dependency constraint `>=22.0.0` is appropriate. Version 22.0.0 and all subsequent versions (up to the latest 39.0.0 as of Dec 2025) are stable. No active security advisories or CVEs have been found for recent Faker releases.

</blockquote></details>

</blockquote></details>

</details>

coderabbitai · 2026-01-05T22:47:34Z

api/Dockerfile

+# Conditionally seed staging data if RAILWAY_ENVIRONMENT=staging
+CMD sh -c "echo 'Running database migrations...' && \
+  prisma migrate deploy && \
+  if [ \"$RAILWAY_ENVIRONMENT\" = \"staging\" ]; then \
+    echo 'Seeding staging data...' && python /scripts/seed_staging.py; \
+  fi && \
+  echo 'Starting uvicorn on port ${PORT:-8000}...' && \
+  uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"


⚠️ Potential issue | 🔴 Critical

Critical: Seed script requires running API but executes before uvicorn starts.

The seed script is executed before starting uvicorn in the CMD chain, but scripts/seed_staging.py makes HTTP requests to API endpoints:

Line 50: GET /health

Line 61: GET /api/patients

Lines 82-85: POST /api/patients

Lines 101-104: POST /api/vital-signs

Since the API hasn't started yet, all HTTP requests will fail, causing should_seed() to return False (lines 55-56), and seeding will be skipped every time.

This means automatic staging seeding will never execute successfully despite the documentation claiming it does.
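The failure mode can be reproduced outside the container with a health check pointed at a port where nothing is listening. This is a minimal sketch with the standard library; `api_is_healthy` and `unused_local_port` are hypothetical helpers, and the actual script may use a different HTTP client.

```python
import socket
import urllib.error
import urllib.request


def api_is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: all count as "not healthy".
        return False


def unused_local_port() -> int:
    """Ask the OS for a free port, then release it so nothing is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = unused_local_port()
    # Nothing is listening yet, which is the situation inside the CMD chain.
    print(api_is_healthy(f"http://127.0.0.1:{port}"))
```

Because the exception is swallowed and reported as "unhealthy", the script exits cleanly and the container still starts, which is why the skipped seeding would be easy to miss.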

Proposed solutions

Solution 1: Use database directly (Recommended)

Refactor seed_staging.py to use Prisma client directly instead of HTTP API:

from prisma import Prisma

async def main():
    db = Prisma()
    await db.connect()
    try:
        # Create patients directly in DB
        patient = await db.patient.create(data={...})
        # Create compositions via ehrbase_client (imported from api/src/ehrbase/client.py)
        await ehrbase_client.create_composition(...)
    finally:
        await db.disconnect()

Solution 2: Run seed after API starts

Use Railway deployment lifecycle hooks or a separate service to run the seed script after the API is healthy:

# Remove seeding from CMD, handle via Railway deployment actions
CMD sh -c "prisma migrate deploy && uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000}"

Then configure Railway to run the seed script as a post-deployment hook.

Solution 3: Background seeding

Start uvicorn first, then run seed in the background:

CMD sh -c "prisma migrate deploy && \
  uvicorn src.main:app --host 0.0.0.0 --port ${PORT:-8000} & \
  sleep 5 && \
  if [ \"$RAILWAY_ENVIRONMENT\" = \"staging\" ]; then python /scripts/seed_staging.py; fi && \
  wait"

However, this adds complexity and fragility.
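Part of that fragility comes from the fixed `sleep 5`, which only guesses when the API will be ready. A polling loop inside the seed script is one possible alternative, sketched here with the standard library; `wait_for_api` and its default values are assumptions for illustration, not part of the PR.

```python
import time
import urllib.error
import urllib.request


def wait_for_api(health_url: str, attempts: int = 30, delay: float = 1.0,
                 timeout: float = 2.0) -> bool:
    """Poll health_url until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(health_url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With this in place the script would start seeding as soon as the health endpoint answers and give up after a bounded wait, rather than depending on uvicorn booting within five seconds.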

coderabbitai · 2026-01-05T22:47:34Z

scripts/seed_staging.py

+    for i in range(num_readings):
+        # Spread readings over past 2-4 weeks
+        hours_offset = randint(24 * 7, 24 * 28)  # 1-4 weeks ago


⚠️ Potential issue | 🟡 Minor

Fix comment/code inconsistency for time range.

Line 189 comment states "Spread readings over past 2-4 weeks" but line 190 generates readings from 1-4 weeks ago:

hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago

24 * 7 = 168 hours = 1 week, so the range is 1-4 weeks, not 2-4.
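The arithmetic can be checked directly by converting the `randint` bounds from hours to weeks. The helper name `weeks_range` is hypothetical; the bounds are the ones quoted in this comment.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weeks_range(low_hours: int, high_hours: int) -> tuple:
    """Convert randint bounds given in hours into weeks."""
    return low_hours / HOURS_PER_WEEK, high_hours / HOURS_PER_WEEK


current = weeks_range(24 * 7, 24 * 28)       # bounds used in the code today
alternative = weeks_range(24 * 14, 24 * 28)  # bounds that would match the "2-4 weeks" comment

if __name__ == "__main__":
    print(current, alternative)
```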

Proposed fix

Update the comment to match the code:

- # Spread readings over past 2-4 weeks
+ # Spread readings over past 1-4 weeks
  hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago

Or adjust the code if 2-4 weeks was intentional:

  # Spread readings over past 2-4 weeks
- hours_offset = randint(24 * 7, 24 * 28) # 1-4 weeks ago
+ hours_offset = randint(24 * 14, 24 * 28) # 2-4 weeks ago

🤖 Prompt for AI Agents

In @scripts/seed_staging.py around lines 188-190, The comment for spreading readings over "past 2-4 weeks" is inconsistent with the code that sets hours_offset via randint(24 * 7, 24 * 28) (which yields 1-4 weeks); update either the comment to say "past 1-4 weeks" or change the randint call to randint(24 * 14, 24 * 28) to enforce 2-4 weeks, locating the change in the loop that defines hours_offset in the for i in range(num_readings) block.

claude added 4 commits January 5, 2026 21:47

coderabbitai bot reviewed Jan 5, 2026

platzhersh merged commit 9e48730 into main Jan 6, 2026
2 checks passed

platzhersh deleted the claude/add-synthetic-staging-data-OwTPD branch January 6, 2026 20:52

coderabbitai bot mentioned this pull request Jan 8, 2026

Fix API seed connection error #26

Merged

coderabbitai bot mentioned this pull request Jan 31, 2026

fix: Refactor seed script to use PatientRegistry model only #30

Merged
