BFD-4466: Update synthetic claims generation to be in-place and idempotent#2997
BFD-4466: Update synthetic claims generation to be in-place and idempotent#2997
Conversation
| # HACK: Of all claims tables that are derived from CLM, this particular table is the only one | ||
| # where rows can be orphaned from their root CLM row as some CLM rows may set their | ||
| # CLM_RLT_COND_SGNTR_SK to an invalid value. There are no other fields from which a foreign-key | ||
| # relationship can be derived, so we must store the CLM_UNIQ_ID of the parent CLM for each row | ||
| # of this table, otherwise we would have to special-case the logic for generating this table. | ||
| # CLM_UNIQ_ID will be ignored by the pipeline when loading, so this is OK | ||
| clm_rlt_cond_sgntr_mbr[f.CLM_UNIQ_ID] = clm[f.CLM_UNIQ_ID] |
There was a problem hiding this comment.
This is necessary as the regeneration logic assumes that all tables derive from CLM and that all rows/collection of rows per-table have some foreign-key relationship with CLM. For a lot of tables, that's the "four-part key" composite key, for some it's a unique identifier.
However, for this particular table, it seems that the original synthetic claims generation logic would orphan some rows in this table from its parent CLM during CLM generation by setting the CLM_RLT_COND_SGNTR_SK field to -1 or 0 in some specific circumstances. There are no other unique identifiers for a given clm_rlt_cond_sgntr_mbr row, nor anything we can use for a foreign-key relationship between it and CLM.
To ensure that each row is associated one-to-one with a CLM row, we store the CLM_UNIQ_ID of its parent CLM in the generated CSV only. The IDR Pipeline will ignore this additional field for each row, so it's only for the benefit of the synthetic claims generator.
There was a problem hiding this comment.
Values of CLM_RLT_COND_SGNTR_SK are not unique to an individual claim (it's used to reduce cardinality among combinations of clm_rlt_cond_sgntr_mbr). Similar to CLM_DT_SGNTR_SK / others ending in _SK, the difference with this field is that the distribution of CLM_RLT_COND_SGNTR_SK is heavily concentrated in 0 or -1 rather than (seemingly) unique CLM_DT_SGNTR_SK values. In case that changes anything with the approach
There was a problem hiding this comment.
Yeah, I figured as much given that CLM_RLT_COND_SGNTR_SK was being generated on CLM, then set on clm_rlt_cond_sgntr_mbr and then set to 0/-1 on CLM afterwards. The fact that we generate one clm_rlt_cond_sgntr_mbr for every CLM is really just coincidental and a quirk of how the synthetic data generator does generation of the two tables.
If we ever decide we want to change how we generate clm_rlt_cond_sgntr_mbr rows, e.g. make them one-to-many or entirely independent altogether, we can revisit this particular hack. But, having a one-to-one relationship here makes it convenient for regeneration (we just lookup the clm_rlt_cond_sgntr_mbr generated from a given CLM via CLM_UNIQ_ID) so we don't have to have a special case for this particular table and the corresponding CLM_RLT_COND_SGNTR_SK field on CLM.
c3fe9ec to
8902095
Compare
# Conflicts: # apps/bfd-model-idr/pyproject.toml
# Conflicts: # apps/bfd-model-idr/claims_generator.py # Conflicts: # apps/bfd-model-idr/claims_generator.py
Resolving the conflict through git's conflict resolution is not worth the effort for a single line change. This commit simply adds the additional field from 4541 to the CLM_LINE rx table generator.
fd56693 to
998fde7
Compare
|
| normalized_clms = pd.json_normalize(clm) # type: ignore | ||
| normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"] = normalized_clms["CLM_BLOOD_PT_FRNSH_QTY"].astype( | ||
| "Int64" | ||
| class _ClaimsFile(StrEnum): |
There was a problem hiding this comment.
This enum was introduced to:
- Collect all output CSVs/tables and their metadata into a single spot
- Define a stable header order for each CSV
Due to (2), our existing synthetic claims data will not be byte-identical when regenerated, but each row's column values will remain the same. That is, the file will not be byte-identical but its data will be.
There was a problem hiding this comment.
Subsequent generations should keep stable header and row ordering, so we should have more useful diffs after the first regeneration.
There was a problem hiding this comment.
I am open to moving this out of claims_generator into a separate file/module, but I felt that since its only usage is in this file the only benefit would be to reduce the number of lines in claims_generator. I'm fine either way.
There was a problem hiding this comment.
Any generation logic related to generation of adjudicated claims data was moved here.
| # "*_static" dicts are used to match upon existing diagnosis rows so that their "*_data" | ||
| # columns can be updated during regeneration; otherwise, everytime claims data is | ||
| # regenerated these rows would have different column values | ||
| principal_static = {f.CLM_VAL_SQNC_NUM: "1", f.CLM_PROD_TYPE_CD: "P"} | ||
| principal_data = { | ||
| f.CLM_DGNS_CD: random.choice(get_icd_10_dgns_codes()), | ||
| f.CLM_DGNS_PRCDR_ICD_IND: "0", | ||
| f.CLM_POA_IND: "~", | ||
| } | principal_static | ||
| principal = match_diag(principal_static) | ||
| principal.extend(principal_data) |
There was a problem hiding this comment.
This generation function is unique in that it takes a list[RowAdapter] rather than a single RowAdapter. I opted for this because each set of CLM_PROD diagnoses rows are tightly coupled and have column values that depend on each other. Having separate generation functions for each type would make it too difficult to ensure those relationships/constraints are generated appropriately.
To support regeneration, this function matches rows based on what I've denoted as "static" column fields. For each type of diagnosis, there are usually at least two columns (CLM_VAL_SQNC_NUM and CLM_PROD_TYPE_CD) which have static or uniquely identifiable data that can be used to match a unique CLM_PROD row for a given batch of init_diagnoses. Once matched, new columns can be added.
There was a problem hiding this comment.
Any generation functions that are independent from CLM and/or non-adjudicated/non-pac were moved here.
There was a problem hiding this comment.
All generation logic related to partially-adjudicated claims was moved here.
There was a problem hiding this comment.
Constants and static data were moved here.
There was a problem hiding this comment.
Short of creating dataclasss or Pydantic BaseModels for each claims table, there needed to be some way to ensure that the many lookups using long field names were consistent across all of the claims logic. I opted to replace every literal string lookup in the claims data generator with a corresponding constant and move that constant to this file.





JIRA Ticket:
BFD-4466
What Does This PR Do?
This PR significantly refactors the original
claims_generator.pyto enable column-level generation for all claims-related synthetic tables/CSVs that introduce new columns as well as byte-identical, idempotent regeneration of tables without any changes. Aside fromclm_rlt_cond_sgntr_mbr(see comments), no changes have been made to the actual generation logic, such that the claims data generated with these changes should be functionally identical to data generated prior to these changes.Additionally:
tqdmto print progress of file loading, data generation, and exportpatients_generator.pynow respects--patients <num_patients>even when existing synthetic data is provided.--patientsis now treated as the number of new patients to generate for a given runpatients_generator.py --claimsnow prints the progress of the claims data generation in realtimeWhat Should Reviewers Watch For?
If you're reviewing this PR, please check for these things in particular:
READMEunderstandable?What Security Implications Does This PR Have?
Please indicate if this PR does any of the following:
Adds any new software dependencies
Modifies any security controls
Adds new transmission or storage of data
Any other changes that could possibly affect security?
I have considered the above security implications as it relates to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (
@sb-benohe) approval.)I have created tests to sufficiently ensure the reliability of my code, if applicable. If this is a modification to an existing piece of code, I have audited the associated tests to ensure everything works as expected.
Validation
Have you fully verified and tested these changes? Is the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.
run-db.shworks to load the data locallyrun-db.shworks to load the data locallynew-col, toCLM_LINE(non-pharm, adjudicated) that randomly picks a value, then running generation on existing claims data, verifying that:new-colcolumn