25 commits
d7c840e
Switch to click for arg/option processing
malessi Jan 22, 2026
a01f675
Use constant strings for all fields in claims_generator
malessi Jan 23, 2026
1ac98a2
Cleanup claims_generator; use constants; reorganize methods/classes/m…
malessi Feb 2, 2026
f9e0862
Completely refactor adjudicated claim tables generation into separate…
malessi Feb 4, 2026
e27c301
Refactor pac claim tables generation into separate functions
malessi Feb 5, 2026
9b7bc08
Update contract plan tables generation to allow regeneration
malessi Feb 5, 2026
8632ddd
Use tqdm to display progress when loading files
malessi Feb 6, 2026
7d8145b
Fix printing empty log message when not regenerating patient data; fo…
malessi Feb 6, 2026
97511e1
Byte-identical regeneration of adj/pac claims data; refactor everything
malessi Feb 6, 2026
69584dc
Order CLM_VAL rows in order of generation to avoid re-ordering during…
malessi Feb 6, 2026
384611f
Organize related functions and members into appropriate files
malessi Feb 6, 2026
96f300f
Write out CNTRCT_* and PRVDR_HSTRY tables
malessi Feb 6, 2026
edacd9a
Fix reordering during regeneration caused by non-deterministic BENE_S…
malessi Feb 6, 2026
c8456cb
Order by CLM's BENE_SKs first instead of BENE_HSTRY
malessi Feb 6, 2026
fcee1c8
Ensure MDCR_MBR rows can be tied uniquely to its parent CLM
malessi Feb 9, 2026
aa53bb3
Reword variable to make usage obvious
malessi Feb 9, 2026
639fdb4
Respect --patients even if a BENE_HSTRY file is provided
malessi Feb 9, 2026
037fa4c
Ensure claims_generator output is realtime if --claims is provided
malessi Feb 9, 2026
852b7e6
Explicitly provide patients count in pipeline CI
malessi Feb 9, 2026
b7995d5
Error if no input file or --patients specified
malessi Feb 9, 2026
38a885c
Re-introduce 4541's CLM_LINE rx change
malessi Feb 9, 2026
4aaf8f9
Add CLM_LINE_RPTD_GAP_DSCNT_AMT to the list of fields written to CSV
malessi Feb 9, 2026
04bac0b
Update generation guard to just check if BENE_HSTRY or CLM was provided
malessi Feb 10, 2026
998fde7
Update README
malessi Feb 10, 2026
fb8b2ba
Format README; move output file section to correct section
malessi Feb 10, 2026
2 changes: 1 addition & 1 deletion .github/workflows/ci-pipeline-synthetic.yml
@@ -38,7 +38,7 @@ jobs:
- name: Ensure synthetic data pipeline runs
run: |
cd ${{ github.workspace }}/apps/bfd-model-idr
uv run patient_generator.py
uv run patient_generator.py --patients 100
uv run claims_generator.py ./out/SYNTHETIC_BENE_HSTRY.csv
cd ../bfd-pipeline-idr
./run-db.sh
112 changes: 88 additions & 24 deletions apps/bfd-model-idr/README.md
@@ -49,23 +49,27 @@ uv sync
### Compile FSH Resources

To compile the .fsh files from this folder

```sh
cd sushi && sushi build && cd ..
```

This will generate the StructureDefinition and CodeSystem resources necessary for synthetic data generation. Running compile_resources.py is not necessary to generate synthetic data.

### Get Matchbox up and running
To reduce dependencies on tx.fhir.org as well as improve the speed of validation, we use matchbox to run a local FHIR server. Read more about matchbox at https://ahdis.github.io/matchbox/

To reduce dependencies on tx.fhir.org as well as improve the speed of validation, we use matchbox to run a local FHIR server. Read more about matchbox at <https://ahdis.github.io/matchbox/>

Note: Matchbox uses a significant amount of memory. Allocating at least 8GB of RAM is recommended, and more may be necessary in the future.

To start matchbox, run

```sh
docker compose up -d
```

Note that startup takes several minutes and requires a good bit of RAM. Matchbox is ready once it logs that packages have been loaded, along with an obviously untrue amount of RAM used (generally half of what it actually used). Alternatively, check the logs for "Finished engines during startup" or run a health check using

```sh
curl -X GET "http://localhost:8080/matchboxv3/actuator/health"
```
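
Because startup takes several minutes, it can be convenient to poll the health endpoint rather than re-running `curl` by hand. The following is a minimal sketch, not part of the repository: the helper names `fetch_health` and `wait_until_healthy` are hypothetical, and it assumes the endpoint returns the usual Spring Boot Actuator JSON body with a `status` field equal to `"UP"` once matchbox is ready.

```python
import json
import time
import urllib.request
from typing import Callable

# URL taken from the curl command above.
HEALTH_URL = "http://localhost:8080/matchboxv3/actuator/health"


def fetch_health(url: str = HEALTH_URL) -> str:
    """Return the raw response body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def wait_until_healthy(
    fetch: Callable[[], str] = fetch_health,
    attempts: int = 60,
    delay_seconds: float = 10.0,
) -> bool:
    """Poll until the body reports status "UP" (assumed format) or attempts run out."""
    for attempt in range(attempts):
        try:
            body = json.loads(fetch())
            if isinstance(body, dict) and body.get("status") == "UP":
                return True
        except (OSError, ValueError):
            # Connection refused or a non-JSON body: matchbox is still starting.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

Calling `wait_until_healthy()` with the defaults waits for roughly ten minutes at most; the `fetch` parameter exists only so the polling logic can be exercised without a running server.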
@@ -102,7 +106,7 @@ uv run compile_resources.py \

#### Patient Data - `patient_generator.py`

##### Usage
##### `patient_generator.py` usage

```text
usage: patient_generator.py [-h] [--patients PATIENTS] [--claims]
@@ -140,7 +144,7 @@ options:

```

##### Generating Data
##### Generating patient data

To generate synthetic patient data, the patient_generator.py script is used.
To utilize it to generate an entirely _new_ set of data from nothing:
@@ -173,33 +177,93 @@ The files output will be in the `out` folder:

The patient generator creates synthetic beneficiary data with realistic but _synthetic_ MBIs, coverage information, and historical records. It can generate multiple MBI versions per beneficiary and handles beneficiary cross-references with kill credit switches.

#### Claims data
#### Claims data - `claims_generator.py`

<!-- TODO: Provide an official location for downloading synthetic claims data -->
> [!IMPORTANT]
> Synthetic claims data is _much_ larger than patient data, so it is not stored in the repository under `./synthetic-data`. If you are looking to regenerate this data, please reach out in #bfd so that the existing dataset can be provided to you.

#### `claims_generator.py` usage

```text
Usage: claims_generator.py [OPTIONS] [PATHS]...

Generate synthetic claims data. Provided file PATHS will be updated with new
fields.

Options:
--sushi / --no-sushi Generate new StructureDefinitions. Use when
testing locally if new .fsh files have been
added.
--min-claims INTEGER Minimum number of claims to generate per
person
--max-claims INTEGER Maximum number of claims to generate per
person
--force-pac-claims / --no-force-pac-claims
Generate _new_ partially-adjudicated claims
when existing pac claims tables exist in the
synthetic data provided
--help Show this message and exit.
```

#### Generating claims data

> [!WARNING]
> Either `SYNTHETIC_CLM.csv` or `SYNTHETIC_BENE_HSTRY.csv` **must** be provided, as claims data generation requires an existing `BENE_SK` or `CLM` to generate or regenerate data.
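
The requirement above amounts to a guard over the `PATHS` arguments. The sketch below illustrates one way such a check could look; it is not the generator's actual code. The function names are hypothetical, and it assumes that a directory argument is searched one level deep for the two file names.

```python
from pathlib import Path

# File names come from the warning above; everything else is illustrative.
REQUIRED_INPUTS = ("SYNTHETIC_BENE_HSTRY.csv", "SYNTHETIC_CLM.csv")


def find_required_inputs(paths: list[str]) -> list[Path]:
    """Return every BENE_HSTRY or CLM file among the given files and directories."""
    found: list[Path] = []
    for raw in paths:
        path = Path(raw)
        # Assumption: a directory argument is scanned for files directly inside it.
        candidates = sorted(path.iterdir()) if path.is_dir() else [path]
        found.extend(c for c in candidates if c.name in REQUIRED_INPUTS)
    return found


def ensure_can_generate(paths: list[str]) -> list[Path]:
    """Raise if neither required input was supplied; otherwise return the matches."""
    found = find_required_inputs(paths)
    if not found:
        raise ValueError(
            "Provide SYNTHETIC_CLM.csv or SYNTHETIC_BENE_HSTRY.csv: "
            "claims generation needs an existing BENE_SK or CLM."
        )
    return found
```

Failing early with a clear message avoids running the rest of the generation only to discover that there are no beneficiaries or claims to work from.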

To generate synthetic claims data, the `claims_generator.py` script is used.

The generated synthetic claims data will be written to the `./out` folder as CSVs, one per table:

To generate synthetic claims data, the claims_generator.py script is used.
To utilize it:
- `SYNTHETIC_CLM.csv`
- `SYNTHETIC_CLM_RLT_COND_SGNTR_MBR.csv`
  - This file contains an extra column, `CLM_UNIQ_ID`, that is purely metadata used by the synthetic claims generator and is not consumed by the IDR Pipeline
- `SYNTHETIC_CLM_LINE.csv`
- `SYNTHETIC_CLM_LINE_RX.csv`
- `SYNTHETIC_CLM_VAL.csv`
- `SYNTHETIC_CLM_DT_SGNTR.csv`
- `SYNTHETIC_CLM_PROD.csv`
- `SYNTHETIC_CLM_INSTNL.csv`
- `SYNTHETIC_CLM_LINE_INSTNL.csv`
- `SYNTHETIC_CLM_DCMTN.csv`
- `SYNTHETIC_CLM_FISS.csv`
- `SYNTHETIC_CLM_PRFNL.csv`
- `SYNTHETIC_CLM_LINE_PRFNL.csv`
- `SYNTHETIC_CLM_ANSI_SGNTR.csv`
- `SYNTHETIC_PRVDR_HSTRY.csv`
- `SYNTHETIC_CNTRCT_PBP_NUM.csv`
- `SYNTHETIC_CNTRCT_PBP_CNTCT.csv`

These files represent the schema of the tables the information is sourced from, although for tables other than `CLM_DT_SGNTR`, the `CLM_UNIQ_ID` is propagated instead of the five-part unique key from the IDR.
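
Since `CLM_UNIQ_ID` is the key carried by these tables, rows in a child CSV can be tied back to their claim by grouping on that column. The following is a small sketch of that idea using only the standard library; `group_by_claim` is a hypothetical helper, and the `EXAMPLE_COL` column in the usage note is a placeholder rather than a real field.

```python
import csv
from collections import defaultdict
from typing import TextIO


def group_by_claim(csv_file: TextIO) -> dict[str, list[dict[str, str]]]:
    """Group the rows of one output CSV by their CLM_UNIQ_ID value."""
    grouped: dict[str, list[dict[str, str]]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["CLM_UNIQ_ID"]].append(row)
    return dict(grouped)
```

For example, passing an open handle for a file whose contents are `CLM_UNIQ_ID,EXAMPLE_COL`, `1,a`, `1,b`, `2,c` yields two groups, keyed `"1"` (two rows) and `"2"` (one row).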

##### Using `SYNTHETIC_BENE_HSTRY.csv`

The below will generate _entirely new claims_ for the given `BENE_SK`s in the provided file:

```sh
uv run claims_generator.py \
--sushi \
out/SYNTHETIC_BENE_HSTRY.csv
```

--sushi is not strictly needed, if you have a local copy of the compiled shorthand files, but recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named BENE_SK.
The files output will be in the out folder, there are several files:
SYNTHETIC_CLM.csv
SYNTHETIC_CLM_LINE.csv
SYNTHETIC_CLM_VAL.csv
SYNTHETIC_CLM_DT_SGNTR.csv
SYNTHETIC_CLM_PROD.csv
SYNTHETIC_CLM_INSTNL.csv
SYNTHETIC_CLM_LINE_INSTNL.csv
SYNTHETIC_CLM_DCMTN.csv
SYNTHETIC_CLM_FISS.csv
SYNTHETIC_CLM_PRFNL.csv
SYNTHETIC_CLM_LINE_PRFNL.csv
SYNTHETIC_CLM_ANSI_SGNTR.csv

These files represent the schema of the tables the information is sourced from, although for tables other than CLM_DT_SGNTR, the CLM_UNIQ_ID is propagated instead of the 5 part unique key from the IDR.
##### Regenerating existing claims data

The below will _re-generate_ **existing claims data** (assume `<PATH_TO_CLAIMS_DATA>` is a local directory containing synthetic claims data):

```sh
uv run claims_generator.py \
--sushi \
./synthetic-data <PATH_TO_CLAIMS_DATA>
```

If _any_ claims-related tables have had columns added to their respective generation functions, those new columns will be populated with values without impacting existing values in other columns.

> [!CAUTION]
> If an **existing column value** must be updated, that column value **MUST BE DELETED** from the respective table CSV first so that the values can be regenerated.
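
The regeneration rule described above (new columns are filled, existing values are kept, and a value must be deleted before it can be regenerated) can be pictured as "fill only what is missing". The sketch below illustrates that behavior under stated assumptions; it is not the generator's implementation. `regenerate_row`, `clear_column`, and the column names in the usage note are hypothetical.

```python
from typing import Callable

Row = dict[str, str]


def regenerate_row(row: Row, generators: dict[str, Callable[[], str]]) -> Row:
    """Fill only columns that are absent or empty; keep every existing value."""
    updated = dict(row)
    for column, generate in generators.items():
        if not updated.get(column):
            updated[column] = generate()
    return updated


def clear_column(rows: list[Row], column: str) -> list[Row]:
    """Blank one column in every row so the next regeneration repopulates it."""
    return [{**row, column: ""} for row in rows]
```

With generators for `EXISTING_COL` and `NEW_COL`, a row that already has `EXISTING_COL` set only gains `NEW_COL`. Running `clear_column(rows, "EXISTING_COL")` first is the in-memory equivalent of deleting that value from the CSV so that it is regenerated.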

#### `--sushi`

`--sushi` is not strictly needed if you have a local copy of the compiled shorthand files, but it is recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named `BENE_SK`.
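
A file with a `BENE_SK` column can be produced with a few lines of Python. This is a minimal sketch with hypothetical helper names; the warning earlier in this section suggests the generator looks for `SYNTHETIC_BENE_HSTRY.csv` by name, so saving the output under that file name is an assumption worth verifying against the script.

```python
import csv
import io
from typing import Iterable, TextIO


def write_bene_sk_csv(bene_sks: Iterable[int], out: TextIO) -> None:
    """Write a one-column CSV whose header is BENE_SK."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["BENE_SK"])
    for bene_sk in bene_sks:
        writer.writerow([bene_sk])


def bene_sk_csv_text(bene_sks: Iterable[int]) -> str:
    """Return the same CSV as a string, which is handy for a quick inspection."""
    buffer = io.StringIO()
    write_bene_sk_csv(bene_sks, buffer)
    return buffer.getvalue()
```

To write to disk, open the target file with `newline=""` and pass the handle to `write_bene_sk_csv`.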

## Data Dictionary
