Commit 38de7ce

Add claudecode_assistant D4D files and extraction reports
Added claudecode_assistant concatenated D4D files:
- AI_READI_d4d.yaml
- CHORUS_d4d.yaml
- CM4AI_d4d.yaml
- VOICE_d4d.yaml

Updated extraction reports:
- data/raw/organized_extraction_report.md
- data/raw/organized_extraction_summary.json

The claudecode_assistant method represents in-session synthesis with direct user interaction during generation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
1 parent 4144574 commit 38de7ce

6 files changed: +1418 -122 lines changed
Lines changed: 394 additions & 0 deletions
@@ -0,0 +1,394 @@
+# D4D Datasheet for AI-READI Dataset
+# Generated by: Claude Code Assistant (In-Session Synthesis)
+# Source: data/preprocessed/concatenated/AI_READI_preprocessed.txt (13 source files)
+# Schema: data_sheets_schema_all.yaml
+# Generation Date: 2025-12-06
+
+id: https://fairhub.io/datasets/2
+name: AI-READI Dataset
+title: Flagship Dataset of Type 2 Diabetes from the AI-READI Project
+description: 'The AI-READI (Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights)
+  dataset
+
+  consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM),
+
+  harmonized across 3 data collection sites (University of Washington, University of Alabama at
+
+  Birmingham, and University of California San Diego). The composition was designed with future
+
+  AI/Machine Learning studies in mind, including recruitment sampling procedures aimed at achieving
+
+  triple-balanced distribution (race/ethnicity, diabetes severity, biological sex) and a multi-domain
+
+  data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable
+
+  device data, environmental sensors) to enable downstream AI/ML analyses. Target enrollment is 4,000
+
+  participants with approximately 400 participants in longitudinal follow-up. The goal is to better
+
+  understand salutogenesis (the pathway from disease to health) in T2DM through pseudotime manifold
+
+  analysis.'
+page: https://fairhub.io/datasets/2
+language: en
+keywords:
+- Type 2 Diabetes
+- T2DM
+- AI-READI
+- Machine Learning
+- multimodal
+- harmonized
+- multi-site
+- salutogenesis
+- survey data
+- clinical data
+- imaging data
+- wearable device data
+- retinal images
+- ECG
+- blood glucose
+- laboratory results
+- environmental data
+- FAIR principles
+- Bridge2AI
+purposes:
+- description: 'Create a flagship ethically-sourced dataset to enable future generations of artificial
+
+    intelligence/machine learning (AI/ML) research to provide critical insights into type 2
+
+    diabetes mellitus (T2DM), including salutogenic pathways to return to health. Develop a
+
+    foundational dataset in diabetes, agnostic to existing classification criteria, which can
+
+    be used to reconstruct a temporal atlas of T2DM development and reversal towards health.'
+tasks:
+- description: 'Enable downstream AI/ML analyses including pseudotime manifold analysis to predict disease
+
+    trajectories across survey, clinical, imaging, wearable, and environmental domains related
+
+    to T2DM that may not be feasible with existing data sources such as claims or electronic
+
+    health records data.'
+addressing_gaps:
+- description: 'Address the lack of well-designed, high quality, and large multimodal datasets needed
+    to
+
+    understand and affect the course of complex, multi-organ diseases such as T2DM. Provide
+
+    a harmonized, multi-site, multi-domain dataset with triple-balanced recruitment (race/ethnicity,
+
+    diabetes severity, biological sex) enabling AI/ML analyses not feasible with existing sources.'
+creators:
+- name: AI-READI Consortium
+  description: 'Multi-institutional team led by Contact PI Aaron Lee (University of Washington, Department
+
+    of Ophthalmology), with teams across eight institutions working on six cross-disciplinary
+
+    project modules.'
+instances:
+- description: 'Individual participants with and without Type 2 Diabetes Mellitus (T2DM) with multi-domain
+
+    measurements. Each participant represents an instance with cross-sectional data including
+
+    survey responses, physical and clinical measurements, blood and urine lab results, retinal
+
+    imaging, ECG, wearable device time-series (10 days), continuous blood glucose monitoring
+
+    (10 days), and environmental sensor data. Target enrollment 4,000 participants across three
+
+    sites, with approximately 400 participants (10%) in longitudinal follow-up.'
+subsets:
+- id: aireadi:public-dataset
+  name: Public dataset
+  description: 'Includes survey data, blood and urine lab results, fitness activity levels from Garmin
+
+    tracker, clinical measurements (monofilament and cognitive function testing), retinal
+
+    images, ECG, continuous blood sugar levels from Dexcom CGM, and environmental variables
+
+    such as home air quality from environmental sensors. Available for public download upon
+
+    agreement with a license that defines how the data can be used.'
+- id: aireadi:controlled-access-dataset
+  name: Controlled-access dataset
+  description: 'Includes 5-digit zip code, sex, race, ethnicity, genetic sequencing data (future),
+
+    past health records, medications, and traffic and accident reports (environmental data).
+
+    Accessible by entering into a data use agreement.'
+sampling_strategies:
+- description: 'Recruitment sampling procedures aimed at achieving triple-balanced distribution across
+
+    four race/ethnic groups (Asian, Black, White, Hispanic), four categories of T2DM severity
+
+    (no diabetes, pre-diabetes/lifestyle-controlled, diabetes treated with oral/non-insulin
+
+    injectable medications, insulin-controlled diabetes), and biological sex (male, female).
+
+    Recruitment in waves with monitoring and adjustment through under- and oversampling as needed.'
+  is_sample:
+  - true
+  is_random:
+  - false
+  is_representative:
+  - false
+  strategies:
+  - Triple-balanced targeted recruitment across race/ethnicity, diabetes severity, and biological sex
+  - Wave-based recruitment with ongoing monitoring and adjustment
+  - Screening electronic health records using ICD-10 codes (R73.09 for pre-diabetes, E11.X for T2DM)
+  - Personalized invitation letters and emails with REDCap recruitment interface
+subpopulations:
+- description: 'Participants with and without Type 2 Diabetes Mellitus, stratified by diabetes severity.
+
+    Four T2DM categories: no diabetes, pre-diabetes/lifestyle-controlled, diabetes treated
+
+    with medications, insulin-controlled diabetes. Four race/ethnicity groups: Asian, Black,
+
+    White, Hispanic. Equal distribution by biological sex.'
+anomalies:
+- description: 'As enrollment is ongoing (began July 18, 2023, continues through November 30, 2026),
+
+    early data releases may not have achieved balanced distribution across all groups due
+
+    to wave-based recruitment and ongoing enrollment.'
+  anomaly_details:
+  - Early releases may exhibit unbalanced distributions across diabetes severity, race/ethnicity, or sex
+    groups
+  - Balanced distribution target is for final complete dataset
+external_resources:
+- name: AI-READI Dataset Documentation
+  description: 'Comprehensive documentation for the AI-READI dataset on the FAIRhub data portal,
+
+    including dataset landing page, data dictionary, and versioned documentation.'
+  external_resources:
+  - https://docs.aireadi.org
+  - https://fairhub.io/datasets/2
+  future_guarantees:
+  - Documentation versions correspond to dataset versions
+- name: Related Publications
+  description: AI-READI publications describing study protocol and design
+  external_resources:
+  - https://doi.org/10.1038/s42255-024-01165-x
+  - https://doi.org/10.1136/bmjopen-2024-097449
+- name: Zenodo Archive
+  description: Archived dataset record on Zenodo
+  external_resources:
+  - https://doi.org/10.5281/zenodo.10642459
+- name: NIH RePORTER
+  description: NIH project record for Bridge2AI Salutogenesis Data Generation
+  external_resources:
+  - https://reporter.nih.gov/project-details/10471118
+confidential_elements:
+- description: 'Contains protected health information elements under controlled access including
+
+    past health records, medications, genetic sequencing data (future), and 5-digit zip codes.'
+  confidential_elements_present: true
+  confidentiality_details:
+  - 5-digit zip code held under controlled access
+  - Genetic sequencing data (future) will be held under controlled access
+  - Past health records and medications held under controlled access
+  - Sex, race, ethnicity held under controlled access
+sensitive_elements:
+- description: 'Sensitive demographic and health data held under controlled access to protect
+
+    participant privacy.'
+  sensitive_elements_present: true
+  sensitivity_details:
+  - Sex, race, ethnicity held under controlled access
+  - Genetic sequencing data (future)
+  - Past health records and medications
+  - Traffic and accident reports (environmental data)
+acquisition_methods:
+- description: 'Harmonized, multi-domain data acquisition across three collection sites (University
+
+    of Washington, University of Alabama at Birmingham, University of California San Diego)
+
+    using surveys, clinical exams, imaging devices, wearable sensors, and environmental
+
+    monitors. In-person study visit for clinical assessments, followed by 10-day at-home
+
+    monitoring with wearable devices.'
+  was_directly_observed: true
+  was_reported_by_subjects: true
+  acquisition_details:
+  - Physical and clinical measurements directly observed during in-person visit
+  - Survey data self-reported by participants via REDCap
+  - Retinal imaging directly captured during visit
+  - ECG directly measured during visit
+  - Wearable device data passively collected over 10 days at home (Garmin fitness tracker)
+  - Continuous glucose monitoring over 10 days at home (Dexcom CGM)
+  - Environmental sensor data directly measured over 10 days at home
+  - Blood and urine laboratory tests from samples collected during visit
+collection_mechanisms:
+- description: 'Multi-modal data collection using hardware devices, clinical procedures, and
+
+    software-driven capture. REDCap used for patient-reported questionnaires and
+
+    clinical data entry.'
+  mechanism_details:
+  - REDCap for survey data and clinical measurements entry
+  - Retinal imaging devices exporting to DICOM format
+  - ECG devices exporting raw format
+  - Garmin fitness tracker for 10-day activity and heart rate monitoring
+  - Dexcom Continuous Glucose Monitor for 10-day glucose tracking
+  - Environmental sensors for home air quality monitoring
+  - Laboratory equipment for blood and urine analysis
+  - Clinical procedures (monofilament testing, cognitive function testing, visual acuity, contrast sensitivity)
+data_collectors:
+- description: 'Three data collection sites with trained study coordinators and clinical staff.
+
+    Community Advisory Board of 11 persons from three sites contributes to protocol development.'
+  collector_details:
+  - University of Washington (Seattle, WA)
+  - University of Alabama at Birmingham (Birmingham, AL)
+  - University of California San Diego (San Diego, CA)
+  - Study coordinators and clinical staff at each site
+  - Community Advisory Board representation from all three sites
+collection_timeframes:
+- description: 'Enrollment began July 18, 2023 and will continue through November 30, 2026.
+
+    Data collected in waves to facilitate efficient sampling. Periodic data releases
+
+    planned as enrollment proceeds.'
+  timeframe_details:
+  - 'Enrollment start: July 18, 2023'
+  - 'Enrollment end: November 30, 2026'
+  - Wave-based recruitment with periodic releases
+  - 'Each participant: single in-person visit plus 10-day at-home monitoring'
+  - 'Longitudinal cohort (10% of participants): follow-up visits'
+preprocessing_strategies:
+- description: 'Domain-specific processing and harmonization described in the Dataset Documentation
+
+    for each data domain. Images exported from devices in raw format with some requiring
+
+    conversion to DICOM standard format prior to upload. Data from wearable devices and
+
+    sensors exported in their raw formats.'
+  preprocessing_details:
+  - File formats, data standards, metadata, and example outputs provided per domain
+  - Harmonization across three collection sites
+  - Retinal images converted to DICOM format where needed
+  - Wearable device data exported in device-native formats
+  - Environmental sensor data exported in raw format
+cleaning_strategies:
+- description: 'Harmonization and processing across three sites with domain-specific details
+
+    provided in the documentation. Quality control procedures for clinical measurements
+
+    and imaging data.'
+  cleaning_details:
+  - Cross-site harmonization procedures
+  - Domain-specific data cleaning as documented
+  - Quality control for retinal images and ECG data
+  - Validation of laboratory results
+labeling_strategies:
+- description: 'Domain-specific labeling and annotation where applicable, as described in the
+
+    documentation for each data domain. Clinical test outputs and imaging outputs
+
+    labeled according to clinical standards.'
+  labeling_details:
+  - Clinical test outputs annotated per domain protocols
+  - Retinal images labeled with image quality and clinical findings
+  - Diabetes classification based on clinical criteria and ICD-10 codes
+human_subject_research:
+  involves_human_subjects: true
+  irb_approval:
+  - Approved by University of Washington Institutional Review Board (approval number STUDY00016228)
+  - Reliance agreements with IRBs of University of Alabama at Birmingham and University of California
+    San Diego
+  ethics_review_board:
+  - University of Washington IRB (lead institution)
+  - University of Alabama at Birmingham IRB (reliance agreement)
+  - University of California San Diego IRB (reliance agreement)
+informed_consent:
+- consent_obtained: true
+  consent_type:
+  - Written informed consent provided by all participants
+  - Informed consent document available for download during recruitment
+  - Consent covers data collection and sharing of de-identified research data
+  description: 'Written informed consent is provided by all participants. Participants can review
+
+    informed consent document during REDCap recruitment interface before enrollment.
+
+    Consent covers participation in study visit, at-home monitoring, and data sharing.'
+is_deidentified:
+  description: 'Public dataset is de-identified. Controlled-access dataset contains limited identifiers
+
+    (5-digit zip code, demographic data) under data use agreement protections.'
+future_use_impacts:
+- description: 'Dataset designed for AI/ML research on T2DM and salutogenesis. Early-release
+
+    imbalance across groups may affect AI/ML model performance and fairness until
+
+    final balanced dataset is complete. Users should account for group balance and
+
+    potential distribution shifts. Sensitive elements must be handled under DUA to
+
+    mitigate privacy risks.'
+  impact_details:
+  - Potential bias from unbalanced early releases (will be resolved in final dataset)
+  - Privacy considerations for controlled-access data
+  - Designed for pseudotime manifold analysis and trajectory modeling
+  - Multi-modal data enables novel AI/ML approaches not feasible with single-domain datasets
+discouraged_uses:
+- description: 'Users must adhere to license terms for public data and the Data Use Agreement
+
+    for controlled-access data. Uses that would violate participant privacy protections
+
+    or re-identification attempts are prohibited.'
+  discouragement_details:
+  - Uses not permitted by license terms
+  - Uses that would violate participant privacy protections
+  - Attempts to re-identify participants
+  - Uses that would discriminate against participants or populations
+distribution_formats:
+- description: 'Public dataset downloadable upon agreement with a license; full dataset available
+
+    via controlled access through Data Use Agreement. Data formats vary by domain
+
+    as documented in dataset documentation.'
+  access_urls:
+  - https://fairhub.io/datasets/2
+license_and_use_terms:
+  description: 'Public data available for download upon agreement with a license defining permitted
+
+    uses. Full dataset access (including controlled-access components) contingent on
+
+    entering into a Data Use Agreement. Adheres to FAIR (Findable, Accessible,
+
+    Interoperable, Reusable) principles.'
+  license_terms:
+  - Public data under license with defined permitted uses
+  - Full dataset requires Data Use Agreement for controlled-access components
+  - FAIR principles for data sharing
+  - Attribution required
+maintainers:
+- name: AI-READI Project Team
+  description: 'Multi-institutional project team responsible for maintaining the dataset.
+
+    Contact PI: Aaron Lee, University of Washington, Department of Ophthalmology.'
+  maintainer_details:
+  - See Documentation site Contact Us and GitHub references
+  - Contact information available through FAIRhub portal
+updates:
+  description: 'Periodic updates to data releases are planned as enrollment proceeds (enrollment
+
+    through November 30, 2026). Documentation versions align with dataset versions
+
+    to ensure users can track changes across releases.'
+  update_details:
+  - Periodic data releases as enrollment continues
+  - Documentation versioned with dataset (e.g., v1.0.0, v2.0.0)
+  - 'Target completion: November 2026'
+version_access:
+  description: 'Separate documentation versions align to dataset versions. Users can navigate
+
+    between versions via the documentation site. FAIRhub provides version control
+
+    and DOI assignment for each release.'
+  version_details:
+  - Version dropdown available in documentation
+  - Each dataset version has corresponding documentation version
+  - DOI assigned to each major release
+  - Version history tracked on FAIRhub
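
The datasheet above describes triple-balanced recruitment (four race/ethnicity groups, four diabetes severity categories, two sexes) and warns that early releases may be unbalanced. The following Python sketch is one way a reader might quantify that imbalance. It is not part of the commit or of any AI-READI tooling. Two things are assumptions: that "triple-balanced" means equal allocation across all 4 x 4 x 2 strata, and that participants arrive as `(race_ethnicity, severity, sex)` tuples, which is a hypothetical record layout. The function names `stratum_targets` and `imbalance_report` are likewise illustrative.

```python
from collections import Counter
from itertools import product

# Category names follow the datasheet; the exact label strings are illustrative.
RACE_ETHNICITY = ("Asian", "Black", "White", "Hispanic")
SEVERITY = (
    "no diabetes",
    "pre-diabetes/lifestyle-controlled",
    "oral/non-insulin injectable medication",
    "insulin-controlled",
)
SEX = ("male", "female")
STRATA = list(product(RACE_ETHNICITY, SEVERITY, SEX))

TARGET_ENROLLMENT = 4000  # target enrollment stated in the datasheet


def stratum_targets(total=TARGET_ENROLLMENT):
    """Per-stratum enrollment target, ASSUMING equal allocation across strata."""
    return {stratum: total / len(STRATA) for stratum in STRATA}


def imbalance_report(participants):
    """Return observed share minus balanced share for every stratum.

    `participants` is an iterable of (race_ethnicity, severity, sex) tuples.
    This layout is hypothetical, not the AI-READI file format. Records whose
    labels match no stratum still count toward the total but are not reported.
    """
    counts = Counter(participants)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no participants supplied")
    balanced_share = 1 / len(STRATA)
    return {s: counts.get(s, 0) / n - balanced_share for s in STRATA}


if __name__ == "__main__":
    targets = stratum_targets()
    per_stratum = targets[("Asian", "no diabetes", "male")]
    print(f"{len(targets)} strata, {per_stratum} participants per stratum")
    # prints "32 strata, 125.0 participants per stratum"
```

A positive value in the report marks an over-represented stratum and a negative value an under-represented one. For a release in which every record falls into a known stratum, the values sum to zero.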
