Skip to content

BFD-4466: Update synthetic claims generation to be in-place and idempotent#2997

Open
malessi wants to merge 25 commits intomasterfrom
alessio/BFD-4466__synthetic-claims-idempotency
Open

BFD-4466: Update synthetic claims generation to be in-place and idempotent#2997
malessi wants to merge 25 commits intomasterfrom
alessio/BFD-4466__synthetic-claims-idempotency

Conversation

@malessi
Copy link
Collaborator

@malessi malessi commented Feb 6, 2026

JIRA Ticket:
BFD-4466

What Does This PR Do?

This PR significantly refactors the original claims_generator.py to enable column-level generation for all claims-related synthetic tables/CSVs that introduce new columns as well as byte-identical, idempotent regeneration of tables without any changes. Aside from clm_rlt_cond_sgntr_mbr (see comments), no changes have been made to the actual generation logic, such that the claims data generated with these changes should be functionally identical to data generated prior to these changes.

Additionally:

  • Claims data generation now uses tqdm to print progress of file loading, data generation, and export
  • patients_generator.py now respects --patients <num_patients> even when existing synthetic data is provided. --patients is now treated as the number of new patients to generate for a given run
  • patients_generator.py --claims now prints the progress of the claims data generation in realtime

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

  • Does the generation logic still make sense? Is this remotely maintainable?
  • Are the changes to the README understandable?
  • Any obvious simplifications that could be made?
  • Is the claims generation code organized understandably? Does it make sense?

What Security Implications Does This PR Have?

Please indicate if this PR does any of the following:

  • Adds any new software dependencies

  • Modifies any security controls

  • Adds new transmission or storage of data

  • Any other changes that could possibly affect security?

  • I have considered the above security implications as it relates to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (@sb-benohe) approval.)

  • I have created tests to sufficiently ensure the reliability of my code, if applicable. If this is a modification to an existing piece of code, I have audited the associated tests to ensure everything works as expected.

Validation

Have you fully verified and tested these changes? Is the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.

  • Running the updated claims generator without providing any prior claims data, verifying that:
    • new claims data generation works
    • run-db.sh works to load the data locally
  • Running generation providing existing claims data, verifying that:
    • the data generated is identical in all but row order, column order, and new field values
    • run-db.sh works to load the data locally
  • Adding a new column/field, new-col, to CLM_LINE (non-pharm, adjudicated) that randomly picks a value, then running generation on existing claims data, verifying that:
    • the data generated is identical in all but row order, column order, and the new new-col column

@malessi malessi marked this pull request as draft February 6, 2026 19:49
Comment on lines +209 to +215
# HACK: Of all claims tables that are derived from CLM, this particular table is the only one
# where rows can be orphaned from their root CLM row as some CLM rows may set their
# CLM_RLT_COND_SGNTR_SK to an invalid value. There are no other fields from which a foreign-key
# relationship can be derived, so we must store the CLM_UNIQ_ID of the parent CLM for each row
# of this table, otherwise we would have to special-case the logic for generating this table.
# CLM_UNIQ_ID will be ignored by the pipeline when loading, so this is OK
clm_rlt_cond_sgntr_mbr[f.CLM_UNIQ_ID] = clm[f.CLM_UNIQ_ID]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary as the regeneration logic assumes that all tables derive from CLM and that all rows/collection of rows per-table have some foreign-key relationship with CLM. For a lot of tables, that's the "four-part key" composite key, for some it's a unique identifier.

However, for this particular table, it seems that the original synthetic claims generation logic would orphan some rows in this table from its parent CLM during CLM generation by setting the CLM_RLT_COND_SGNTR_SK field to -1 or 0 in some specific circumstances. There are no other unique identifiers for a given clm_rlt_cond_sgntr_mbr row, nor anything we can use for a foreign-key relationship between it and CLM.

To ensure that each row is associated one-to-one with a CLM row, we store the CLM_UNIQ_ID of its parent CLM in the generated CSV only. The IDR Pipeline will ignore this additional field for each row, so it's only for the benefit of the synthetic claims generator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Values of CLM_RLT_COND_SGNTR_SK are not unique to an individual claim (it's used to reduce cardinality among combinations of clm_rlt_cond_sgntr_mbr). Similar to CLM_DT_SGNTR_SK / others ending in _SK, the difference with this field is that the distribution of CLM_RLT_COND_SGNTR_SK is heavily concentrated in 0 or -1 rather than (seemingly) unique CLM_DT_SGNTR_SK values. In case that changes anything with the approach

Copy link
Collaborator Author

@malessi malessi Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figured as much given that CLM_RLT_COND_SGNTR_SK was being generated on CLM, then set on clm_rlt_cond_sgntr_mbr and then set to 0/-1 on CLM afterwards. The fact that we generate one clm_rlt_cond_sgntr_mbr for every CLM is really just coincidental and a quirk of how the synthetic data generator does generation of the two tables.

If we ever decide we want to change how we generate clm_rlt_cond_sgntr_mbr rows, e.g. make them one-to-many or entirely independent altogether, we can revisit this particular hack. But, having a one-to-one relationship here makes it convenient for regeneration (we just lookup the clm_rlt_cond_sgntr_mbr generated from a given CLM via CLM_UNIQ_ID) so we don't have to have a special case for this particular table and the corresponding CLM_RLT_COND_SGNTR_SK field on CLM.

@malessi malessi force-pushed the alessio/BFD-4466__synthetic-claims-idempotency branch from c3fe9ec to 8902095 Compare February 9, 2026 22:28
@malessi malessi marked this pull request as ready for review February 10, 2026 16:36
# Conflicts:
#	apps/bfd-model-idr/pyproject.toml
# Conflicts:
#	apps/bfd-model-idr/claims_generator.py

# Conflicts:
#	apps/bfd-model-idr/claims_generator.py
Resolving the conflict through git's conflict resolution is not worth the effort for a single line change. This commit simply adds the additional field from 4541 to the CLM_LINE rx table generator.
@malessi malessi force-pushed the alessio/BFD-4466__synthetic-claims-idempotency branch from fd56693 to 998fde7 Compare February 10, 2026 16:38
@bfd-sast
Copy link

bfd-sast bot commented Feb 10, 2026

Quality Gate passed Quality Gate passed for 'bfd-parent'

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

normalized_clms = pd.json_normalize(clm) # type: ignore
normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"] = normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"].astype(
"Int64"
class _ClaimsFile(StrEnum):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum was introduced to:

  1. Collect all output CSVs/tables and their metadata into a single spot
  2. Define a stable header order for each CSV

Due to (2), our existing synthetic claims data will not be byte-identical when regenerated, but each row's column values will remain the same. That is, the file will not be byte-identical but its data will be.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subsequent generations should keep stable header and row ordering, so we should have more useful diffs after the first regeneration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am open to moving this out of claims_generator into a separate file/module, but I felt that since its only usage is in this file the only benefit would be to reduce the number of lines in claims_generator. I'm fine either way.

Copy link
Collaborator Author

@malessi malessi Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any generation logic related to generation of adjudicated claims data was moved here.

Comment on lines +395 to +405
# "*_static" dicts are used to match upon existing diagnosis rows so that their "*_data"
# columns can be updated during regeneration; otherwise, everytime claims data is
# regenerated these rows would have different column values
principal_static = {f.CLM_VAL_SQNC_NUM: "1", f.CLM_PROD_TYPE_CD: "P"}
principal_data = {
f.CLM_DGNS_CD: random.choice(get_icd_10_dgns_codes()),
f.CLM_DGNS_PRCDR_ICD_IND: "0",
f.CLM_POA_IND: "~",
} | principal_static
principal = match_diag(principal_static)
principal.extend(principal_data)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generation function is unique in that it takes a list[RowAdapter] rather than a single RowAdapter. I opted for this because each set of CLM_PROD diagnoses rows are tightly coupled and have column values that depend on each other. Having separate generation functions for each type would make it too difficult to ensure those relationships/constraints are generated appropriately.

To support regeneration, this function matches rows based on what I've denoted as "static" column fields. For each type of diagnosis, there are usually at least two columns (CLM_VAL_SQNC_NUM and CLM_PROD_TYPE_CD) which have static or uniquely identifiable data that can be used to match a unique CLM_PROD row for a given batch of init_diagnoses. Once matched, new columns can be added.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any generation functions that are independent from CLM and/or non-adjudicated/non-pac were moved here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All generation logic related to partially-adjudicated claims was moved here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constants and static data were moved here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short of creating dataclasss or Pydantic BaseModels for each claims table, there needed to be some way to ensure that the many lookups using long field names were consistent across all of the claims logic. I opted to replace every literal string lookup in the claims data generator with a corresponding constant and move that constant to this file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants