 To generate synthetic patient data, the patient_generator.py script is used.
 To utilize it to generate an entirely _new_ set of data from nothing:
@@ -173,33 +173,87 @@ The files output will be in the `out` folder:
 
 The patient generator creates synthetic beneficiary data with realistic but _synthetic_ MBIs, coverage information, and historical records. It can generate multiple MBI versions per beneficiary and handles beneficiary cross-references with kill credit switches.
 
-#### Claims data
+#### Claims data - `claims_generator.py`
 
-To generate synthetic claims data, the claims_generator.py script is used.
-To utilize it:
+<!-- TODO: Provide an official location for downloading synthetic claims data -->
+> [!IMPORTANT]
+> Synthetic claims data is _much_ larger than patient data, so it is not stored in the repository under `./synthetic-data`. If you need to regenerate this data, please reach out in #bfd so that the existing dataset can be provided to you.
+
+#### `claims_generator.py` usage
+
+```text
+Usage: claims_generator.py [OPTIONS] [PATHS]...
+
+  Generate synthetic claims data. Provided file PATHS will be updated with new
+  fields.
+
+Options:
+  --sushi / --no-sushi            Generate new StructureDefinitions. Use when
+                                  testing locally if new .fsh files have been
+                                  added.
+  --min-claims INTEGER            Minimum number of claims to generate per
+                                  person
+  --max-claims INTEGER            Maximum number of claims to generate per
+                                  person
+  --force-pac-claims / --no-force-pac-claims
+                                  Generate _new_ partially-adjudicated claims
+                                  when existing pac claims tables exist in the
+                                  synthetic data provided
+  --help                          Show this message and exit.
+```
+
+#### Generating claims data
+
+> [!WARNING]
+> Either `SYNTHETIC_CLM.csv` or `SYNTHETIC_BENE_HSTRY.csv` **must** be provided, as claims data generation requires an existing `BENE_SK` or `CLM` to generate/regenerate data.
+
+To generate synthetic claims data, use the `claims_generator.py` script.
+
+##### Using `SYNTHETIC_BENE_HSTRY.csv`
+
+The command below generates _entirely new claims_ for the `BENE_SK`s in the provided file:
 
 ```sh
 uv run claims_generator.py \
     --sushi \
     out/SYNTHETIC_BENE_HSTRY.csv
 ```
 
---sushi is not strictly needed, if you have a local copy of the compiled shorthand files, but recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named BENE_SK.
-The files output will be in the out folder, there are several files:
-SYNTHETIC_CLM.csv
-SYNTHETIC_CLM_LINE.csv
-SYNTHETIC_CLM_VAL.csv
-SYNTHETIC_CLM_DT_SGNTR.csv
-SYNTHETIC_CLM_PROD.csv
-SYNTHETIC_CLM_INSTNL.csv
-SYNTHETIC_CLM_LINE_INSTNL.csv
-SYNTHETIC_CLM_DCMTN.csv
-SYNTHETIC_CLM_FISS.csv
-SYNTHETIC_CLM_PRFNL.csv
-SYNTHETIC_CLM_LINE_PRFNL.csv
-SYNTHETIC_CLM_ANSI_SGNTR.csv
-
-These files represent the schema of the tables the information is sourced from, although for tables other than CLM_DT_SGNTR, the CLM_UNIQ_ID is propagated instead of the 5 part unique key from the IDR.
+##### Regenerating existing claims data
+
+The command below _regenerates_ **existing claims data** (assuming `<PATH_TO_CLAIMS_DATA>` is a local directory containing synthetic claims data):
+
+```sh
+uv run claims_generator.py \
+    --sushi \
+    ./synthetic-data <PATH_TO_CLAIMS_DATA>
+```
+
+If _any_ claims-related tables have had columns added to their respective generation functions, those new columns will be populated with values without impacting existing values in other columns.
+
+> [!CAUTION]
+> If an **existing column value** must be updated, that column value **MUST BE DELETED** from the respective table CSV first so that the values can be regenerated.
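
As a hedged illustration of the caution above, the following Python sketch blanks one column of a table CSV before regeneration. It assumes that "deleted" means emptying the cells while keeping the header row, which the text does not confirm. `clear_column` and the column name in the usage note are hypothetical and are not part of `claims_generator.py`.

```python
import csv
from pathlib import Path


def clear_column(csv_path: Path, column: str) -> int:
    """Blank every value in `column`, keeping the header and all other columns.

    Returns the number of data rows that were rewritten.
    """
    with csv_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or [])
        if column not in fieldnames:
            raise KeyError(f"{column!r} is not a column in {csv_path}")
        rows = list(reader)

    for row in rows:
        # Assumption: an empty cell is what the generator treats as "deleted".
        row[column] = ""

    with csv_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, `clear_column(Path("out/SYNTHETIC_CLM.csv"), "EXAMPLE_COL")` would empty a placeholder column named `EXAMPLE_COL`. The sketch reads the whole file into memory, so a streaming rewrite may suit large claims tables better.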
+
+#### `--sushi`
+
+`--sushi` is not strictly needed if you have a local copy of the compiled shorthand files, but it is recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named `BENE_SK`.
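
A minimal sketch of building such a bene list in Python, assuming a single-column CSV with a `BENE_SK` header is enough. The helper name `write_bene_list`, the file name `benes.csv`, and the ID values are hypothetical. Whether `claims_generator.py` also expects a particular file name is not stated here.

```python
import csv
from pathlib import Path


def write_bene_list(path: Path, bene_sks: list[int]) -> None:
    """Write a one-column CSV whose header is BENE_SK."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["BENE_SK"])  # column name taken from the text above
        writer.writerows([sk] for sk in bene_sks)


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    write_bene_list(Path("benes.csv"), [1001, 1002])
```

The resulting file would then presumably be passed as one of the `PATHS` arguments shown in the usage text.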
+
+The output files are written to the `./out` folder:
+
+- `SYNTHETIC_CLM.csv`
+- `SYNTHETIC_CLM_LINE.csv`
+- `SYNTHETIC_CLM_VAL.csv`
+- `SYNTHETIC_CLM_DT_SGNTR.csv`
+- `SYNTHETIC_CLM_PROD.csv`
+- `SYNTHETIC_CLM_INSTNL.csv`
+- `SYNTHETIC_CLM_LINE_INSTNL.csv`
+- `SYNTHETIC_CLM_DCMTN.csv`
+- `SYNTHETIC_CLM_FISS.csv`
+- `SYNTHETIC_CLM_PRFNL.csv`
+- `SYNTHETIC_CLM_LINE_PRFNL.csv`
+- `SYNTHETIC_CLM_ANSI_SGNTR.csv`
+
+These files represent the schema of the tables the information is sourced from, although for tables other than `CLM_DT_SGNTR`, the `CLM_UNIQ_ID` is propagated instead of the 5-part unique key from the IDR.
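
One way to confirm a run produced every file listed above is a small check like the sketch below. The file names come from the list above. The helper `missing_outputs` is hypothetical and not part of the repository, and the default `out` directory is an assumption based on the text.

```python
from pathlib import Path

# File names copied from the list of generated claims tables.
EXPECTED_FILES = [
    "SYNTHETIC_CLM.csv",
    "SYNTHETIC_CLM_LINE.csv",
    "SYNTHETIC_CLM_VAL.csv",
    "SYNTHETIC_CLM_DT_SGNTR.csv",
    "SYNTHETIC_CLM_PROD.csv",
    "SYNTHETIC_CLM_INSTNL.csv",
    "SYNTHETIC_CLM_LINE_INSTNL.csv",
    "SYNTHETIC_CLM_DCMTN.csv",
    "SYNTHETIC_CLM_FISS.csv",
    "SYNTHETIC_CLM_PRFNL.csv",
    "SYNTHETIC_CLM_LINE_PRFNL.csv",
    "SYNTHETIC_CLM_ANSI_SGNTR.csv",
]


def missing_outputs(out_dir: Path) -> list[str]:
    """Return the expected file names that are not present in out_dir."""
    return [name for name in EXPECTED_FILES if not (out_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs(Path("out"))
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        print("All expected claims files are present.")
```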