To generate synthetic patient data, the patient_generator.py script is used.
119
119
To utilize it to generate an entirely _new_ set of data from nothing:
@@ -146,33 +146,87 @@ The files output will be in the `out` folder:
146
146
147
147
The patient generator creates synthetic beneficiary data with realistic but _synthetic_ MBIs, coverage information, and historical records. It can generate multiple MBI versions per beneficiary and handles beneficiary cross-references with kill credit switches.
148
148
149
-
#### Claims data
149
+
#### Claims data - `claims_generator.py`
150
150
151
-
To generate synthetic claims data, the claims_generator.py script is used.
152
-
To utilize it:
151
+
<!-- TODO: Provide an official location for downloading synthetic claims data -->
152
+
> [!IMPORTANT]
153
+
> Synthetic claims data is _much_ larger than patient data, so it is not stored in the repository under `./synthetic-data`. If you need to regenerate this data, please reach out in #bfd so that the existing dataset can be provided to you.
154
+
155
+
#### `claims_generator.py` usage
156
+
157
+
```text
158
+
Usage: claims_generator.py [OPTIONS] [PATHS]...
159
+
160
+
  Generate synthetic claims data. Provided file PATHS will be updated with new
161
+
  fields.
162
+
163
+
Options:
164
+
  --sushi / --no-sushi            Generate new StructureDefinitions. Use when
165
+
                                  testing locally if new .fsh files have been
166
+
                                  added.
167
+
  --min-claims INTEGER            Minimum number of claims to generate per
168
+
                                  person
169
+
  --max-claims INTEGER            Maximum number of claims to generate per
170
+
                                  person
171
+
  --force-pac-claims / --no-force-pac-claims
172
+
                                  Generate _new_ partially-adjudicated claims
173
+
                                  when existing pac claims tables exist in the
174
+
                                  synthetic data provided
175
+
  --help                          Show this message and exit.
176
+
```
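
For scripted runs, the options listed in the usage text can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the check that the minimum does not exceed the maximum is an assumption about sensible input rather than documented behavior of `claims_generator.py`.

```python
import shlex


def build_command(paths, min_claims=None, max_claims=None, sushi=True):
    """Assemble an argv list for claims_generator.py (hypothetical helper).

    Only flags shown in the usage text are emitted.
    """
    # Assumption: a minimum above the maximum is treated as a caller error.
    if min_claims is not None and max_claims is not None and min_claims > max_claims:
        raise ValueError("min_claims must not exceed max_claims")

    cmd = ["uv", "run", "claims_generator.py"]
    cmd.append("--sushi" if sushi else "--no-sushi")
    if min_claims is not None:
        cmd += ["--min-claims", str(min_claims)]
    if max_claims is not None:
        cmd += ["--max-claims", str(max_claims)]
    cmd += [str(p) for p in paths]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready command line; the claim counts are placeholders.
    print(shlex.join(build_command(["out/SYNTHETIC_BENE_HSTRY.csv"], 1, 5)))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues for paths containing spaces.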
177
+
178
+
#### Generating claims data
179
+
180
+
> [!WARNING]
181
+
> Either `SYNTHETIC_CLM.csv` or `SYNTHETIC_BENE_HSTRY.csv` **must** be provided, as claims data generation requires an existing `BENE_SK` or `CLM` to generate/regenerate data.
182
+
183
+
To generate synthetic claims data, the `claims_generator.py` script is used.
184
+
185
+
##### Using `SYNTHETIC_BENE_HSTRY.csv`
186
+
187
+
The command below generates _entirely new claims_ for the `BENE_SK`s in the provided file:
153
188
154
189
```sh
155
190
uv run claims_generator.py \
156
191
--sushi \
157
192
out/SYNTHETIC_BENE_HSTRY.csv
158
193
```
159
194
160
-
--sushi is not strictly needed, if you have a local copy of the compiled shorthand files, but recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named BENE_SK.
161
-
The files output will be in the out folder, there are several files:
162
-
SYNTHETIC_CLM.csv
163
-
SYNTHETIC_CLM_LINE.csv
164
-
SYNTHETIC_CLM_VAL.csv
165
-
SYNTHETIC_CLM_DT_SGNTR.csv
166
-
SYNTHETIC_CLM_PROD.csv
167
-
SYNTHETIC_CLM_INSTNL.csv
168
-
SYNTHETIC_CLM_LINE_INSTNL.csv
169
-
SYNTHETIC_CLM_DCMTN.csv
170
-
SYNTHETIC_CLM_FISS.csv
171
-
SYNTHETIC_CLM_PRFNL.csv
172
-
SYNTHETIC_CLM_LINE_PRFNL.csv
173
-
SYNTHETIC_CLM_ANSI_SGNTR.csv
174
-
175
-
These files represent the schema of the tables the information is sourced from, although for tables other than CLM_DT_SGNTR, the CLM_UNIQ_ID is propagated instead of the 5 part unique key from the IDR.
195
+
##### Regenerating existing claims data
196
+
197
+
The command below _regenerates_ **existing claims data** (assume `<PATH_TO_CLAIMS_DATA>` is a local directory containing synthetic claims data):
198
+
199
+
```sh
200
+
uv run claims_generator.py \
201
+
--sushi \
202
+
./synthetic-data <PATH_TO_CLAIMS_DATA>
203
+
```
204
+
205
+
If _any_ claims-related tables have had columns added to their respective generation functions, those new columns will be populated with values without impacting existing values in other columns.
206
+
207
+
> [!CAUTION]
208
+
> If an **existing column value** must be updated, that column value **MUST BE DELETED** from the respective table CSV first so that the values can be regenerated.
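
One way to clear a column before regenerating is sketched below. It is an illustration only: `clear_column` is a hypothetical helper, and it assumes that "deleted" means emptying the cells while keeping the header. Whether the generator instead expects the column to be dropped entirely is not confirmed here.

```python
import csv
from pathlib import Path


def clear_column(csv_path, column):
    """Blank every value in `column`, leaving the header and other columns intact.

    Returns the number of data rows rewritten.
    """
    csv_path = Path(csv_path)
    with csv_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        if column not in fieldnames:
            raise KeyError(f"{column!r} is not a column in {csv_path}")
        rows = list(reader)

    # Assumption: an empty cell is what marks a value for regeneration.
    for row in rows:
        row[column] = ""

    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, `clear_column("<PATH_TO_CLAIMS_DATA>/SYNTHETIC_CLM.csv", "SOME_COLUMN")`, where `SOME_COLUMN` is a placeholder for the column to regenerate. The file is rewritten in place, so keeping a backup copy first is advisable.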
209
+
210
+
#### `--sushi`
211
+
212
+
`--sushi` is not strictly needed if you have a local copy of the compiled shorthand files, but it is recommended to reduce drift. To specify a list of benes, pass in a .csv file containing a column named `BENE_SK`.
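
A file with a `BENE_SK` column can be produced with a few lines of Python. This is a minimal sketch: `write_bene_list` and the file name `benes.csv` are hypothetical, the keys are placeholder values, and whether the script accepts an arbitrary file name (rather than `SYNTHETIC_BENE_HSTRY.csv`) is an assumption that should be checked against the script.

```python
import csv
from pathlib import Path


def write_bene_list(bene_sks, path):
    """Write a one-column CSV with a BENE_SK header, one beneficiary key per row."""
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["BENE_SK"])
        for sk in bene_sks:
            writer.writerow([sk])
    return path


if __name__ == "__main__":
    # Placeholder keys, not real beneficiary identifiers.
    write_bene_list([1001, 1002], "benes.csv")
```

The resulting file can then be passed as one of the `PATHS` arguments to `claims_generator.py`.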
213
+
214
+
The output files are written to the `./out` folder:
215
+
216
+
- `SYNTHETIC_CLM.csv`
217
+
- `SYNTHETIC_CLM_LINE.csv`
218
+
- `SYNTHETIC_CLM_VAL.csv`
219
+
- `SYNTHETIC_CLM_DT_SGNTR.csv`
220
+
- `SYNTHETIC_CLM_PROD.csv`
221
+
- `SYNTHETIC_CLM_INSTNL.csv`
222
+
- `SYNTHETIC_CLM_LINE_INSTNL.csv`
223
+
- `SYNTHETIC_CLM_DCMTN.csv`
224
+
- `SYNTHETIC_CLM_FISS.csv`
225
+
- `SYNTHETIC_CLM_PRFNL.csv`
226
+
- `SYNTHETIC_CLM_LINE_PRFNL.csv`
227
+
- `SYNTHETIC_CLM_ANSI_SGNTR.csv`
228
+
229
+
These files mirror the schema of the tables the information is sourced from. However, for tables other than `CLM_DT_SGNTR`, the `CLM_UNIQ_ID` is propagated instead of the 5-part unique key from the IDR.
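
Because `CLM_UNIQ_ID` is carried across these tables, it can serve as a join key when inspecting the output. The sketch below groups the rows of one output file by that column. `rows_by_claim` is a hypothetical helper, and it assumes the file passed in is one of the tables that carries `CLM_UNIQ_ID`.

```python
import csv
from collections import defaultdict


def rows_by_claim(csv_path):
    """Map each CLM_UNIQ_ID to the list of rows (as dicts) that carry it."""
    grouped = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Raises KeyError if the file has no CLM_UNIQ_ID column.
            grouped[row["CLM_UNIQ_ID"]].append(row)
    return dict(grouped)
```

For example, `rows_by_claim("out/SYNTHETIC_CLM_LINE.csv")` returns a dictionary whose keys can be matched against the `CLM_UNIQ_ID` values in `out/SYNTHETIC_CLM.csv`.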
176
230
177
231
## Data Dictionary
178
232
@@ -193,4 +247,4 @@ Run:
193
247
DESCRIBE VIEW CMS_VDM_VIEW_MDCR_PRD.{TABLE_NAME}
194
248
```
195
249
196
-
Export the results as a CSV named {TABLE_NAME}.csv and save it under ReferenceTables.
250
+
Export the results as a CSV named {TABLE_NAME}.csv and save it under ReferenceTables.