Synthetic Data

Bell Eapen edited this page Jan 17, 2026 · 2 revisions

Synthetic Data Generation

Testing healthcare applications requires data, but real patient data carries privacy risk and is heavily regulated. DHTI addresses this by making it easy to generate realistic synthetic data.

LLM Synthetic Data

You can use Large Language Models (LLMs) to generate free-text synthetic data conformant to specific schemas or instructions.

npx dhti-cli synthetic [INPUT] [OUTPUT] [PROMPT]

Flags:

  • -r, --maxRecords: Number of records to generate.
  • -m, --maxCycles: Max cycles for iterative generation.
  • -i, --inputField, -o, --outputField: JSON fields to target.
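As a rough illustration of how the positional arguments and flags above fit together, the following Python sketch assembles the command line and prints it without running it. It assumes that INPUT, OUTPUT, and PROMPT are file paths and that each flag takes its value as the next argument; the file names, field names, and the `build_synthetic_command` helper are hypothetical and not part of DHTI.

```python
import shlex


def build_synthetic_command(input_path, output_path, prompt_path,
                            max_records=None, max_cycles=None,
                            input_field=None, output_field=None):
    """Assemble an argv list for `npx dhti-cli synthetic`.

    Only the flags listed on this page are used. Treating the three
    positional arguments as file paths is an assumption.
    """
    argv = ["npx", "dhti-cli", "synthetic", input_path, output_path, prompt_path]
    # Append each optional flag only when a value was supplied.
    if max_records is not None:
        argv += ["--maxRecords", str(max_records)]
    if max_cycles is not None:
        argv += ["--maxCycles", str(max_cycles)]
    if input_field is not None:
        argv += ["--inputField", input_field]
    if output_field is not None:
        argv += ["--outputField", output_field]
    return argv


# Hypothetical file and field names, for illustration only.
cmd = build_synthetic_command(
    "input.json", "output.json", "prompt.txt",
    max_records=10, input_field="question", output_field="answer",
)
print(shlex.join(cmd))
```

Once the printed command looks right, it could be executed with `subprocess.run(cmd, check=True)`. Building the arguments as a list rather than a single string avoids shell quoting problems if a path contains spaces.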

Structured Data (Synthea)

To generate complete patient records (FHIR bundles) with realistic histories, see Synthea for detailed instructions on generating and uploading cohorts.

MIMIC Data

For researchers who need de-identified real hospital data, DHTI also supports loading the MIMIC-IV demo dataset. This gives you access to rich ICU data structures for testing complex clinical scenarios.

Note: MIMIC data is de-identified but derived from real events, offering a different kind of realism compared to Synthea's fully generated histories.
