CLAUDE.md
+97 -15 (97 additions & 15 deletions)
@@ -155,6 +155,7 @@ PopHIVE/Ingest/
```r
# Create new data source folder structure
+### Important!! When adding a new data source, you MUST run this function. Otherwise the process.json files will not be initialized correctly, causing the pipeline to fail.
dcf::dcf_add_source("source_name")
# Initialize processing record for tracking changes
@@ -330,6 +331,8 @@ if (!identical(process$raw_state, raw_state)) {
## measure_info.json Template

+Each `measure_info.json` file should include variable definitions and a centralized `_sources` object. Variables reference sources by ID.
+
```json
{
"variable_name": {
@@ -343,23 +346,47 @@ if (!identical(process$raw_state, raw_state)) {
"description": "Detailed narrative description of the data source, including methodology, coverage, limitations, and any important caveats for users.",
+"restrictions": "License and usage restrictions. Examples: 'Public domain. CDC data is generally not subject to copyright restrictions.' or 'CC BY 4.0. Attribution required for reuse.' or 'Attribution required. Cite [citation].'",
+"date_accessed": 2025
+}
}
}
```

+### _sources Field Requirements
+
+Every `_sources` entry MUST include:
+- **name**: Full name of the data source
+- **url**: Primary URL for the data source
+- **organization**: Name of the organization providing the data
+- **organization_url**: URL for the organization
+- **description**: Narrative description of the source (methodology, coverage, limitations)
+- **restrictions**: License and usage restrictions
+
+Special restriction wording:
+- **Epic Cosmos**: "The data can be re-used with appropriate attribution. A suggested citation relating to this data is 'Results of research performed with Epic Cosmos were obtained from the PopHIVE platform (https://github.com/PopHIVE/Ingest).'"
+- **Google Health Trends**: "Data can be reused with attribution of data from the Google Health Trends API, obtained via the PopHIVE platform (https://github.com/PopHIVE/Ingest)."
+- **CDC/CMS data**: "Public domain. CDC data is generally not subject to copyright restrictions."
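
The required fields listed above can be checked mechanically. The following is a minimal sketch in base R, assuming the `_sources` object has already been parsed into a named list (for example with `jsonlite::fromJSON(path, simplifyVector = FALSE)`). `missing_source_fields()` is a hypothetical helper and not part of `dcf`, and the example entry uses made-up values.

```r
# Required keys for every `_sources` entry, as listed above.
required_fields <- c(
  "name", "url", "organization",
  "organization_url", "description", "restrictions"
)

# Hypothetical helper (not part of dcf): returns the required fields
# that are absent or empty in a single `_sources` entry.
missing_source_fields <- function(entry) {
  is_missing <- vapply(required_fields, function(field) {
    value <- entry[[field]]
    is.null(value) || length(value) == 0 || !nzchar(as.character(value)[1])
  }, logical(1))
  required_fields[is_missing]
}

# Illustrative entry with made-up values; `restrictions` is left out on purpose.
example_sources <- list(
  example_source = list(
    name = "Example Source",
    url = "https://example.org/data",
    organization = "Example Organization",
    organization_url = "https://example.org",
    description = "Placeholder description."
  )
)

# Named list: source ID -> character vector of missing fields.
problems <- lapply(example_sources, missing_source_fields)
print(problems)
```

Running a check like this before committing a `measure_info.json` file would flag entries that lack a `restrictions` statement or any other required field.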
### Issue: Error "process file process.json does not exist"
-```r
-# Problem: dcf::dcf_process_record() fails on first run of new data source
-# Solution: Check if process.json exists before calling dcf_process_record()
-if (!file.exists("process.json")) {
-  process <- list(raw_state = NULL)
-} else {
-  process <- dcf::dcf_process_record()
+This error occurs when a new data source was not initialized with `dcf::dcf_add_source()`, which leaves the `process.json` file improperly initialized.
+
+**Preferred solution**: Run `dcf::dcf_add_source("source_name")` to create the folder structure properly.
+
+**Manual fix**: If you need to create `process.json` manually, use this structure (replace `source_name` with your data folder name):
# Also ensure project.Rproj and README.md exist in the source folder
```
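
Before running the pipeline on a new source, it may help to confirm that the expected files are in place. The following is a minimal sketch in base R, assuming `process.json`, `project.Rproj`, and `README.md` all sit at the top level of the source folder. `missing_source_files()` is a hypothetical helper and not part of `dcf`, and the demo uses a temporary folder rather than a real source.

```r
# Files this section expects in a source folder (their location is an assumption).
expected_files <- c("process.json", "project.Rproj", "README.md")

# Hypothetical helper (not part of dcf): returns the expected files
# that are absent from `source_dir`.
missing_source_files <- function(source_dir) {
  present <- file.exists(file.path(source_dir, expected_files))
  expected_files[!present]
}

# Demo in a throwaway folder that only contains a README.
demo_dir <- file.path(tempdir(), "source_name")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "README.md"))

missing_files <- missing_source_files(demo_dir)
if (length(missing_files) > 0) {
  message("Missing in ", demo_dir, ": ", paste(missing_files, collapse = ", "))
}
```

If `process.json` is reported missing, the preferred fix remains running `dcf::dcf_add_source("source_name")` rather than creating the file by hand.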

+### Issue: Error "vec_math.arrow_binary() not implemented" when running dcf_process()
+
+This error occurs when a script works fine when run directly but fails via `dcf_process()`. The cause is vroom's Arrow ALTREP (lazy loading) backend:
+
+- **When running directly**: Interactive sessions may materialize data earlier or have different environment state
+- **When running via dcf_process()**: Scripts run in a cleaner context where Arrow ALTREP stays active, keeping columns as Arrow binary types until an operation forces materialization
+
+The error typically triggers when using `if_else()` with mixed types (e.g., comparing integers with Arrow-backed columns) or when `cdlTools::fips()` returns integers that get mixed with other types.
+
+```r
+# Problem: cdlTools::fips() with if_else causes Arrow type issues
-8. **Commit changes**: Include raw data sample, ingest.R, measure_info.json, standard output
+8. **Update documentation**: The data source documentation is auto-generated from `measure_info.json` files
+   ```bash
+   Rscript scripts/build_docs.R
+   ```
+   This generates `docs/index.html` with variable tables and source information. The GitHub Action will also rebuild docs automatically when `measure_info.json` files change.
+
+9. **Commit changes**: Include raw data sample, ingest.R, measure_info.json, standard output, and updated docs/
"description": "The National Respiratory and Enteric Virus Surveillance System (NREVSS) is a voluntary, laboratory-based surveillance system that monitors temporal and geographic trends for respiratory syncytial virus (RSV), human parainfluenza viruses, respiratory adenoviruses, human metapneumovirus, human coronaviruses, and rotavirus circulation in the United States. Participating laboratories report weekly to CDC on the number of tests performed and the number positive for each virus. NREVSS data are used to characterize seasonal patterns of these viruses and to help public health officials anticipate and prepare for outbreaks. Data are aggregated at the HHS regional and national levels. The system has been operational since 1987 and includes approximately 300 participating laboratories across the United States.",
+"restrictions": "Public domain. CDC data is generally not subject to copyright restrictions."
"description": "The National Respiratory and Enteric Virus Surveillance System (NREVSS) is a voluntary, laboratory-based surveillance system that monitors temporal and geographic trends for respiratory syncytial virus (RSV), human parainfluenza viruses, respiratory adenoviruses, human metapneumovirus, human coronaviruses, and rotavirus circulation in the United States. Participating laboratories report weekly to CDC on the number of tests performed and the number positive for each virus. NREVSS data are used to characterize seasonal patterns of these viruses and to help public health officials anticipate and prepare for outbreaks. Data are aggregated at the HHS regional and national levels. The system has been operational since 1987 and includes approximately 300 participating laboratories across the United States.",
+"restrictions": "Public domain. CDC data is generally not subject to copyright restrictions."