BFD-4466: Update synthetic claims generation to be in-place and idempotent by malessi · Pull Request #2997 · CMSgov/beneficiary-fhir-data

malessi · 2026-02-06T19:48:54Z

JIRA Ticket:
BFD-4466

What Does This PR Do?

This PR significantly refactors the original claims_generator.py to enable column-level generation for all claims-related synthetic tables/CSVs that introduce new columns as well as byte-identical, idempotent regeneration of tables without any changes. Aside from clm_rlt_cond_sgntr_mbr (see comments), no changes have been made to the actual generation logic, such that the claims data generated with these changes should be functionally identical to data generated prior to these changes.

Additionally:

Claims data generation now uses tqdm to print progress of file loading, data generation, and export
patients_generator.py now respects --patients <num_patients> even when existing synthetic data is provided. --patients is now treated as the number of new patients to generate for a given run
patients_generator.py --claims now prints the progress of the claims data generation in realtime

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

Does the generation logic still make sense? Is this remotely maintainable?
Are the changes to the README understandable?
Any obvious simplifications that could be made?
Is the claims generation code organized understandably? Does it make sense?

What Security Implications Does This PR Have?

Please indicate if this PR does any of the following:

Adds any new software dependencies
Modifies any security controls
Adds new transmission or storage of data
Any other changes that could possibly affect security?
I have considered the above security implications as it relates to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (@sb-benohe) approval.)
I have created tests to sufficiently ensure the reliability of my code, if applicable. If this is a modification to an existing piece of code, I have audited the associated tests to ensure everything works as expected.

Validation

Have you fully verified and tested these changes? Is the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.

Running the updated claims generator without providing any prior claims data, verifying that:
- new claims data generation works
- run-db.sh works to load the data locally
Running generation providing existing claims data, verifying that:
- the data generated is identical in all but row order, column order, and new field values
- run-db.sh works to load the data locally
Adding a new column/field, new-col, to CLM_LINE (non-pharm, adjudicated) that randomly picks a value, then running generation on existing claims data, verifying that:
- the data generated is identical in all but row order, column order, and the new new-col column

malessi · 2026-02-09T17:07:38Z

apps/bfd-model-idr/claims_adj.py

+    # HACK: Of all claims tables that are derived from CLM, this particular table is the only one
+    # where rows can be orphaned from their root CLM row as some CLM rows may set their
+    # CLM_RLT_COND_SGNTR_SK to an invalid value. There are no other fields from which a foreign-key
+    # relationship can be derived, so we must store the CLM_UNIQ_ID of the parent CLM for each row
+    # of this table, otherwise we would have to special-case the logic for generating this table.
+    # CLM_UNIQ_ID will be ignored by the pipeline when loading, so this is OK
+    clm_rlt_cond_sgntr_mbr[f.CLM_UNIQ_ID] = clm[f.CLM_UNIQ_ID]


This is necessary as the regeneration logic assumes that all tables derive from CLM and that all rows/collection of rows per-table have some foreign-key relationship with CLM. For a lot of tables, that's the "four-part key" composite key, for some it's a unique identifier.

However, for this particular table, it seems that the original synthetic claims generation logic would orphan some rows in this table from its parent CLM during CLM generation by setting the CLM_RLT_COND_SGNTR_SK field to -1 or 0 in some specific circumstances. There are no other unique identifiers for a given clm_rlt_cond_sgntr_mbr row, nor anything we can use for a foreign-key relationship between it and CLM.

To ensure that each row is associated one-to-one with a CLM row, we store the CLM_UNIQ_ID of its parent CLM in the generated CSV only. The IDR Pipeline will ignore this additional field for each row, so it's only for the benefit of the synthetic claims generator.

Values of CLM_RLT_COND_SGNTR_SK are not unique to an individual claim (it's used to reduce cardinality among combinations of clm_rlt_cond_sgntr_mbr). Similar to CLM_DT_SGNTR_SK / others ending in _SK, the difference with this field is that the distribution of CLM_RLT_COND_SGNTR_SK is heavily concentrated in 0 or -1 rather than (seemingly) unique CLM_DT_SGNTR_SK values. In case that changes anything with the approach

Yeah, I figured as much given that CLM_RLT_COND_SGNTR_SK was being generated on CLM, then set on clm_rlt_cond_sgntr_mbr and then set to 0/-1 on CLM afterwards. The fact that we generate one clm_rlt_cond_sgntr_mbr for every CLM is really just coincidental and a quirk of how the synthetic data generator does generation of the two tables.

If we ever decide we want to change how we generate clm_rlt_cond_sgntr_mbr rows, e.g. make them one-to-many or entirely independent altogether, we can revisit this particular hack. But, having a one-to-one relationship here makes it convenient for regeneration (we just lookup the clm_rlt_cond_sgntr_mbr generated from a given CLM via CLM_UNIQ_ID) so we don't have to have a special case for this particular table and the corresponding CLM_RLT_COND_SGNTR_SK field on CLM.

# Conflicts: # apps/bfd-model-idr/pyproject.toml

# Conflicts: # apps/bfd-model-idr/claims_generator.py # Conflicts: # apps/bfd-model-idr/claims_generator.py

…embers

… functions

…rmatting

… regeneration

…K set ordering

Resolving the conflict through git's conflict resolution is not worth the effort for a single line change. This commit simply adds the additional field from 4541 to the CLM_LINE rx table generator.

bfd-sast · 2026-02-10T16:56:25Z

Quality Gate passed for 'bfd-parent'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

malessi · 2026-02-10T19:37:13Z

apps/bfd-model-idr/claims_generator.py

-    normalized_clms = pd.json_normalize(clm)  # type: ignore
-    normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"] = normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"].astype(
-        "Int64"
+class _ClaimsFile(StrEnum):


This enum was introduced to:

Collect all output CSVs/tables and their metadata into a single spot

Define a stable header order for each CSV

Due to (2), our existing synthetic claims data will not be byte-identical when regenerated, but each row's column values will remain the same. That is, the file will not be byte-identical but its data will be.

Subsequent generations should keep stable header and row ordering, so we should have more useful diffs after the first regeneration.

I am open to moving this out of claims_generator into a separate file/module, but I felt that since its only usage is in this file the only benefit would be to reduce the number of lines in claims_generator. I'm fine either way.

malessi · 2026-02-11T17:50:13Z

apps/bfd-model-idr/claims_adj.py

Any generation logic related to generation of adjudicated claims data was moved here.

malessi · 2026-02-11T17:55:12Z

apps/bfd-model-idr/claims_adj.py

+        # "*_static" dicts are used to match upon existing diagnosis rows so that their "*_data"
+        # columns can be updated during regeneration; otherwise, everytime claims data is
+        # regenerated these rows would have different column values
+        principal_static = {f.CLM_VAL_SQNC_NUM: "1", f.CLM_PROD_TYPE_CD: "P"}
+        principal_data = {
+            f.CLM_DGNS_CD: random.choice(get_icd_10_dgns_codes()),
+            f.CLM_DGNS_PRCDR_ICD_IND: "0",
+            f.CLM_POA_IND: "~",
+        } | principal_static
+        principal = match_diag(principal_static)
+        principal.extend(principal_data)


This generation function is unique in that it takes a list[RowAdapter] rather than a single RowAdapter. I opted for this because each set of CLM_PROD diagnoses rows are tightly coupled and have column values that depend on each other. Having separate generation functions for each type would make it too difficult to ensure those relationships/constraints are generated appropriately.

To support regeneration, this function matches rows based on what I've denoted as "static" column fields. For each type of diagnosis, there are usually at least two columns (CLM_VAL_SQNC_NUM and CLM_PROD_TYPE_CD) which have static or uniquely identifiable data that can be used to match a unique CLM_PROD row for a given batch of init_diagnoses. Once matched, new columns can be added.

malessi · 2026-02-11T17:59:54Z

apps/bfd-model-idr/claims_other.py

Any generation functions that are independent from CLM and/or non-adjudicated/non-pac were moved here.

malessi · 2026-02-11T18:00:38Z

apps/bfd-model-idr/claims_pac.py

All generation logic related to partially-adjudicated claims was moved here.

malessi · 2026-02-11T18:01:35Z

apps/bfd-model-idr/claims_static.py

Constants and static data were moved here.

malessi · 2026-02-11T18:03:22Z

apps/bfd-model-idr/field_constants.py

Short of creating dataclasss or Pydantic BaseModels for each claims table, there needed to be some way to ensure that the many lookups using long field names were consistent across all of the claims logic. I opted to replace every literal string lookup in the claims data generator with a corresponding constant and move that constant to this file.

malessi requested review from alex-dzeda, aschey-forpeople, mel1-G and mjburling as code owners February 6, 2026 19:48

malessi marked this pull request as draft February 6, 2026 19:49

malessi commented Feb 9, 2026

View reviewed changes

malessi force-pushed the alessio/BFD-4466__synthetic-claims-idempotency branch from c3fe9ec to 8902095 Compare February 9, 2026 22:28

malessi marked this pull request as ready for review February 10, 2026 16:36

malessi added 22 commits February 10, 2026 11:38

Switch to click for arg/option processing

d7c840e

# Conflicts: # apps/bfd-model-idr/pyproject.toml

Use constant strings for all fields in claims_generator

a01f675

# Conflicts: # apps/bfd-model-idr/claims_generator.py # Conflicts: # apps/bfd-model-idr/claims_generator.py

Cleanup claims_generator; use constants; reorganize methods/classes/m…

1ac98a2

…embers

Completely refactor adjudicated claim tables generation into separate…

f9e0862

… functions

Refactor pac claim tables generation into separate functions

e27c301

Update contract plan tables generation to allow regeneration

9b7bc08

Use tqdm to display progress when loading files

8632ddd

Fix printing empty log message when not regenerating patient data; fo…

7d8145b

…rmatting

Byte-identical regeneration of adj/pac claims data; refactor everything

97511e1

Order CLM_VAL rows in order of generation to avoid re-ordering during…

69584dc

… regeneration

Organize related functions and members into appropriate files

384611f

Write out CNTRCT_* and PRVDR_HSTRY tables

96f300f

Fix reordering during regeneration caused by non-deterministic BENE_S…

edacd9a

…K set ordering

Order by CLM's BENE_SKs first instead of BENE_HSTRY

c8456cb

Ensure MDCR_MBR rows can be tied uniquely to its parent CLM

fcee1c8

Reword variable to make usage obvious

aa53bb3

Respect --patients even if a BENE_HSTRY file is provided

639fdb4

Ensure claims_generator output is realtime if --claims is provided

037fa4c

Explicitly provide patients count in pipeline CI

852b7e6

Error if no input file or --patients specified

b7995d5

Re-introduce 4541's CLM_LINE rx change

38a885c

Resolving the conflict through git's conflict resolution is not worth the effort for a single line change. This commit simply adds the additional field from 4541 to the CLM_LINE rx table generator.

Add CLM_LINE_RPTD_GAP_DSCNT_AMT to the list of fields written to CSV

4aaf8f9

malessi added 2 commits February 10, 2026 11:38

Update generation guard to just check if BENE_HSTRY or CLM was provided

04bac0b

Update README

998fde7

malessi force-pushed the alessio/BFD-4466__synthetic-claims-idempotency branch from fd56693 to 998fde7 Compare February 10, 2026 16:38

Format README; move output file section to correct section

fb8b2ba

malessi commented Feb 10, 2026

View reviewed changes

malessi commented Feb 11, 2026

View reviewed changes

apps/bfd-model-idr/claims_pac.py

Copy link

Collaborator Author

malessi Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All generation logic related to partially-adjudicated claims was moved here.

malessi commented Feb 11, 2026

View reviewed changes

apps/bfd-model-idr/claims_static.py

Copy link

Collaborator Author

malessi Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constants and static data were moved here.

malessi commented Feb 11, 2026

View reviewed changes

Conversation

malessi commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Does This PR Do?

What Should Reviewers Watch For?

What Security Implications Does This PR Have?

Validation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

malessi Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bfd-sast bot commented Feb 10, 2026

Quality Gate passed for 'bfd-parent'

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

malessi Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

malessi commented Feb 6, 2026 •

edited

Loading

malessi Feb 10, 2026 •

edited

Loading

malessi Feb 11, 2026 •

edited

Loading