| 1 | +# D4D Datasheet for AI-READI Dataset |
| 2 | +# Generated by: Claude Code Assistant (In-Session Synthesis) |
| 3 | +# Source: data/preprocessed/concatenated/AI_READI_preprocessed.txt (13 source files) |
| 4 | +# Schema: data_sheets_schema_all.yaml |
| 5 | +# Generation Date: 2025-12-06 |
| 6 | + |
| 7 | +id: https://fairhub.io/datasets/2 |
| 8 | +name: AI-READI Dataset |
| 9 | +title: Flagship Dataset of Type 2 Diabetes from the AI-READI Project |
| 10 | +description: 'The AI-READI (Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights) |
| 11 | + dataset |
| 12 | +
|
| 13 | + consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM), |
| 14 | +
|
| 15 | + harmonized across 3 data collection sites (University of Washington, University of Alabama at |
| 16 | +
|
| 17 | + Birmingham, and University of California San Diego). The composition was designed with future |
| 18 | +
|
| 19 | + AI/Machine Learning studies in mind, including recruitment sampling procedures aimed at achieving |
| 20 | +
|
| 21 | + triple-balanced distribution (race/ethnicity, diabetes severity, biological sex) and a multi-domain |
| 22 | +
|
| 23 | + data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable |
| 24 | +
|
| 25 | + device data, environmental sensors) to enable downstream AI/ML analyses. Target enrollment is 4,000 |
| 26 | +
|
| 27 | + participants with approximately 400 participants in longitudinal follow-up. The goal is to better |
| 28 | +
|
| 29 | + understand salutogenesis (the pathway from disease to health) in T2DM through pseudotime manifold |
| 30 | +
|
| 31 | + analysis.' |
| 32 | +page: https://fairhub.io/datasets/2 |
| 33 | +language: en |
| 34 | +keywords: |
| 35 | +- Type 2 Diabetes |
| 36 | +- T2DM |
| 37 | +- AI-READI |
| 38 | +- Machine Learning |
| 39 | +- multimodal |
| 40 | +- harmonized |
| 41 | +- multi-site |
| 42 | +- salutogenesis |
| 43 | +- survey data |
| 44 | +- clinical data |
| 45 | +- imaging data |
| 46 | +- wearable device data |
| 47 | +- retinal images |
| 48 | +- ECG |
| 49 | +- blood glucose |
| 50 | +- laboratory results |
| 51 | +- environmental data |
| 52 | +- FAIR principles |
| 53 | +- Bridge2AI |
| 54 | +purposes: |
| 55 | +- description: 'Create a flagship ethically-sourced dataset to enable future generations of artificial |
| 56 | +
|
| 57 | + intelligence/machine learning (AI/ML) research to provide critical insights into type 2 |
| 58 | +
|
| 59 | + diabetes mellitus (T2DM), including salutogenic pathways to return to health. Develop a |
| 60 | +
|
| 61 | + foundational dataset in diabetes, agnostic to existing classification criteria, which can |
| 62 | +
|
| 63 | + be used to reconstruct a temporal atlas of T2DM development and reversal towards health.' |
| 64 | +tasks: |
| 65 | +- description: 'Enable downstream AI/ML analyses including pseudotime manifold analysis to predict disease |
| 66 | +
|
| 67 | + trajectories across survey, clinical, imaging, wearable, and environmental domains related |
| 68 | +
|
| 69 | + to T2DM that may not be feasible with existing data sources such as claims or electronic |
| 70 | +
|
| 71 | + health records data.' |
| 72 | +addressing_gaps: |
| 73 | +- description: 'Address the lack of well-designed, high quality, and large multimodal datasets needed |
| 74 | + to |
| 75 | +
|
| 76 | + understand and affect the course of complex, multi-organ diseases such as T2DM. Provide |
| 77 | +
|
| 78 | + a harmonized, multi-site, multi-domain dataset with triple-balanced recruitment (race/ethnicity, |
| 79 | +
|
| 80 | + diabetes severity, biological sex) enabling AI/ML analyses not feasible with existing sources.' |
| 81 | +creators: |
| 82 | +- name: AI-READI Consortium |
| 83 | + description: 'Multi-institutional team led by Contact PI Aaron Lee (University of Washington, Department |
| 84 | +
|
| 85 | + of Ophthalmology), with teams across eight institutions working on six cross-disciplinary |
| 86 | +
|
| 87 | + project modules.' |
| 88 | +instances: |
| 89 | +- description: 'Individual participants with and without Type 2 Diabetes Mellitus (T2DM) with multi-domain |
| 90 | +
|
| 91 | + measurements. Each participant represents an instance with cross-sectional data including |
| 92 | +
|
| 93 | + survey responses, physical and clinical measurements, blood and urine lab results, retinal |
| 94 | +
|
| 95 | + imaging, ECG, wearable device time-series (10 days), continuous blood glucose monitoring |
| 96 | +
|
| 97 | + (10 days), and environmental sensor data. Target enrollment 4,000 participants across three |
| 98 | +
|
| 99 | + sites, with approximately 400 participants (10%) in longitudinal follow-up.' |
| 100 | +subsets: |
| 101 | +- id: aireadi:public-dataset |
| 102 | + name: Public dataset |
| 103 | + description: 'Includes survey data, blood and urine lab results, fitness activity levels from Garmin |
| 104 | +
|
| 105 | + tracker, clinical measurements (monofilament and cognitive function testing), retinal |
| 106 | +
|
| 107 | + images, ECG, continuous blood sugar levels from Dexcom CGM, and environmental variables |
| 108 | +
|
| 109 | + such as home air quality from environmental sensors. Available for public download upon |
| 110 | +
|
| 111 | + agreement with a license that defines how the data can be used.' |
| 112 | +- id: aireadi:controlled-access-dataset |
| 113 | + name: Controlled-access dataset |
| 114 | + description: 'Includes 5-digit zip code, sex, race, ethnicity, genetic sequencing data (future), |
| 115 | +
|
| 116 | + past health records, medications, and traffic and accident reports (environmental data). |
| 117 | +
|
| 118 | + Accessible by entering into a data use agreement.' |
| 119 | +sampling_strategies: |
| 120 | +- description: 'Recruitment sampling procedures aimed at achieving triple-balanced distribution across |
| 121 | +
|
| 122 | + four race/ethnic groups (Asian, Black, White, Hispanic), four categories of T2DM severity |
| 123 | +
|
| 124 | + (no diabetes, pre-diabetes/lifestyle-controlled, diabetes treated with oral/non-insulin |
| 125 | +
|
| 126 | + injectable medications, insulin-controlled diabetes), and biological sex (male, female). |
| 127 | +
|
| 128 | + Recruitment in waves with monitoring and adjustment through under- and oversampling as needed.' |
| 129 | + is_sample: |
| 130 | + - true |
| 131 | + is_random: |
| 132 | + - false |
| 133 | + is_representative: |
| 134 | + - false |
| 135 | + strategies: |
| 136 | + - Triple-balanced targeted recruitment across race/ethnicity, diabetes severity, and biological sex |
| 137 | + - Wave-based recruitment with ongoing monitoring and adjustment |
| 138 | + - Screening electronic health records using ICD-10 codes (R73.09 for pre-diabetes, E11.X for T2DM) |
| 139 | + - Personalized invitation letters and emails with REDCap recruitment interface |
| 140 | +subpopulations: |
| 141 | +- description: 'Participants with and without Type 2 Diabetes Mellitus, stratified by diabetes severity. |
| 142 | +
|
| 143 | + Four T2DM categories: no diabetes, pre-diabetes/lifestyle-controlled, diabetes treated |
| 144 | +
|
| 145 | + with medications, insulin-controlled diabetes. Four race/ethnicity groups: Asian, Black, |
| 146 | +
|
| 147 | + White, Hispanic. Equal distribution by biological sex.' |
| 148 | +anomalies: |
| 149 | +- description: 'Because enrollment is ongoing (began July 18, 2023; continues through November 30, 2026) |
| 150 | +
|
| 151 | + and recruitment proceeds in waves, early data releases may not yet be balanced |
| 152 | +
|
| 153 | + across all groups.' |
| 154 | + anomaly_details: |
| 155 | + - Early releases may exhibit unbalanced distributions across diabetes severity, race/ethnicity, or sex |
| 156 | + groups |
| 157 | + - Balanced distribution target is for final complete dataset |
| 158 | +external_resources: |
| 159 | +- name: AI-READI Dataset Documentation |
| 160 | + description: 'Comprehensive documentation for the AI-READI dataset on the FAIRhub data portal, |
| 161 | +
|
| 162 | + including dataset landing page, data dictionary, and versioned documentation.' |
| 163 | + external_resources: |
| 164 | + - https://docs.aireadi.org |
| 165 | + - https://fairhub.io/datasets/2 |
| 166 | + future_guarantees: |
| 167 | + - Documentation versions correspond to dataset versions |
| 168 | +- name: Related Publications |
| 169 | + description: AI-READI publications describing study protocol and design |
| 170 | + external_resources: |
| 171 | + - https://doi.org/10.1038/s42255-024-01165-x |
| 172 | + - https://doi.org/10.1136/bmjopen-2024-097449 |
| 173 | +- name: Zenodo Archive |
| 174 | + description: Archived dataset record on Zenodo |
| 175 | + external_resources: |
| 176 | + - https://doi.org/10.5281/zenodo.10642459 |
| 177 | +- name: NIH RePORTER |
| 178 | + description: NIH project record for Bridge2AI Salutogenesis Data Generation |
| 179 | + external_resources: |
| 180 | + - https://reporter.nih.gov/project-details/10471118 |
| 181 | +confidential_elements: |
| 182 | +- description: 'Contains protected health information elements under controlled access including |
| 183 | +
|
| 184 | + past health records, medications, genetic sequencing data (future), and 5-digit zip codes.' |
| 185 | + confidential_elements_present: true |
| 186 | + confidentiality_details: |
| 187 | + - 5-digit zip code held under controlled access |
| 188 | + - Genetic sequencing data (future) will be held under controlled access |
| 189 | + - Past health records and medications held under controlled access |
| 190 | + - Sex, race, ethnicity held under controlled access |
| 191 | +sensitive_elements: |
| 192 | +- description: 'Sensitive demographic and health data held under controlled access to protect |
| 193 | +
|
| 194 | + participant privacy.' |
| 195 | + sensitive_elements_present: true |
| 196 | + sensitivity_details: |
| 197 | + - Sex, race, ethnicity held under controlled access |
| 198 | + - Genetic sequencing data (future) |
| 199 | + - Past health records and medications |
| 200 | + - Traffic and accident reports (environmental data) |
| 201 | +acquisition_methods: |
| 202 | +- description: 'Harmonized, multi-domain data acquisition across three collection sites (University |
| 203 | +
|
| 204 | + of Washington, University of Alabama at Birmingham, University of California San Diego) |
| 205 | +
|
| 206 | + using surveys, clinical exams, imaging devices, wearable sensors, and environmental |
| 207 | +
|
| 208 | + monitors. In-person study visit for clinical assessments, followed by 10-day at-home |
| 209 | +
|
| 210 | + monitoring with wearable devices.' |
| 211 | + was_directly_observed: true |
| 212 | + was_reported_by_subjects: true |
| 213 | + acquisition_details: |
| 214 | + - Physical and clinical measurements directly observed during in-person visit |
| 215 | + - Survey data self-reported by participants via REDCap |
| 216 | + - Retinal imaging directly captured during visit |
| 217 | + - ECG directly measured during visit |
| 218 | + - Wearable device data passively collected over 10 days at home (Garmin fitness tracker) |
| 219 | + - Continuous glucose monitoring over 10 days at home (Dexcom CGM) |
| 220 | + - Environmental sensor data directly measured over 10 days at home |
| 221 | + - Blood and urine laboratory tests from samples collected during visit |
| 222 | +collection_mechanisms: |
| 223 | +- description: 'Multi-modal data collection using hardware devices, clinical procedures, and |
| 224 | +
|
| 225 | + software-driven capture. REDCap used for patient-reported questionnaires and |
| 226 | +
|
| 227 | + clinical data entry.' |
| 228 | + mechanism_details: |
| 229 | + - REDCap for survey data and clinical measurements entry |
| 230 | + - Retinal imaging devices exporting to DICOM format |
| 231 | + - ECG devices exporting in raw format |
| 232 | + - Garmin fitness tracker for 10-day activity and heart rate monitoring |
| 233 | + - Dexcom Continuous Glucose Monitor for 10-day glucose tracking |
| 234 | + - Environmental sensors for home air quality monitoring |
| 235 | + - Laboratory equipment for blood and urine analysis |
| 236 | + - Clinical procedures (monofilament testing, cognitive function testing, visual acuity, contrast sensitivity) |
| 237 | +data_collectors: |
| 238 | +- description: 'Three data collection sites with trained study coordinators and clinical staff. |
| 239 | +
|
| 240 | + An 11-member Community Advisory Board drawn from the three sites contributes to protocol development.' |
| 241 | + collector_details: |
| 242 | + - University of Washington (Seattle, WA) |
| 243 | + - University of Alabama at Birmingham (Birmingham, AL) |
| 244 | + - University of California San Diego (San Diego, CA) |
| 245 | + - Study coordinators and clinical staff at each site |
| 246 | + - Community Advisory Board representation from all three sites |
| 247 | +collection_timeframes: |
| 248 | +- description: 'Enrollment began July 18, 2023 and will continue through November 30, 2026. |
| 249 | +
|
| 250 | + Data collected in waves to facilitate efficient sampling. Periodic data releases |
| 251 | +
|
| 252 | + planned as enrollment proceeds.' |
| 253 | + timeframe_details: |
| 254 | + - 'Enrollment start: July 18, 2023' |
| 255 | + - 'Enrollment end: November 30, 2026' |
| 256 | + - Wave-based recruitment with periodic releases |
| 257 | + - 'Each participant: single in-person visit plus 10-day at-home monitoring' |
| 258 | + - 'Longitudinal cohort (10% of participants): follow-up visits' |
| 259 | +preprocessing_strategies: |
| 260 | +- description: 'Domain-specific processing and harmonization described in the Dataset Documentation |
| 261 | +
|
| 262 | + for each data domain. Images exported from devices in raw format with some requiring |
| 263 | +
|
| 264 | + conversion to DICOM standard format prior to upload. Data from wearable devices and |
| 265 | +
|
| 266 | + sensors exported in their raw formats.' |
| 267 | + preprocessing_details: |
| 268 | + - File formats, data standards, metadata, and example outputs provided per domain |
| 269 | + - Harmonization across three collection sites |
| 270 | + - Retinal images converted to DICOM format where needed |
| 271 | + - Wearable device data exported in device-native formats |
| 272 | + - Environmental sensor data exported in raw format |
| 273 | +cleaning_strategies: |
| 274 | +- description: 'Harmonization and processing across three sites with domain-specific details |
| 275 | +
|
| 276 | + provided in the documentation. Quality control procedures for clinical measurements |
| 277 | +
|
| 278 | + and imaging data.' |
| 279 | + cleaning_details: |
| 280 | + - Cross-site harmonization procedures |
| 281 | + - Domain-specific data cleaning as documented |
| 282 | + - Quality control for retinal images and ECG data |
| 283 | + - Validation of laboratory results |
| 284 | +labeling_strategies: |
| 285 | +- description: 'Domain-specific labeling and annotation where applicable, as described in the |
| 286 | +
|
| 287 | + documentation for each data domain. Clinical test outputs and imaging outputs |
| 288 | +
|
| 289 | + labeled according to clinical standards.' |
| 290 | + labeling_details: |
| 291 | + - Clinical test outputs annotated per domain protocols |
| 292 | + - Retinal images labeled with image quality and clinical findings |
| 293 | + - Diabetes classification based on clinical criteria and ICD-10 codes |
| 294 | +human_subject_research: |
| 295 | + involves_human_subjects: true |
| 296 | + irb_approval: |
| 297 | + - Approved by University of Washington Institutional Review Board (approval number STUDY00016228) |
| 298 | + - Reliance agreements with IRBs of University of Alabama at Birmingham and University of California |
| 299 | + San Diego |
| 300 | + ethics_review_board: |
| 301 | + - University of Washington IRB (lead institution) |
| 302 | + - University of Alabama at Birmingham IRB (reliance agreement) |
| 303 | + - University of California San Diego IRB (reliance agreement) |
| 304 | +informed_consent: |
| 305 | +- consent_obtained: true |
| 306 | + consent_type: |
| 307 | + - Written informed consent provided by all participants |
| 308 | + - Informed consent document available for download during recruitment |
| 309 | + - Consent covers data collection and sharing of de-identified research data |
| 310 | + description: 'Written informed consent is provided by all participants. Participants can review |
| 311 | +
|
| 312 | + the informed consent document in the REDCap recruitment interface before enrollment. |
| 313 | +
|
| 314 | + Consent covers participation in the study visit, at-home monitoring, and data sharing.' |
| 315 | +is_deidentified: |
| 316 | + description: 'Public dataset is de-identified. Controlled-access dataset contains limited identifiers |
| 317 | +
|
| 318 | + (5-digit zip code, demographic data) under data use agreement protections.' |
| 319 | +future_use_impacts: |
| 320 | +- description: 'Dataset designed for AI/ML research on T2DM and salutogenesis. Early-release |
| 321 | +
|
| 322 | + imbalance across groups may affect AI/ML model performance and fairness until |
| 323 | +
|
| 324 | + final balanced dataset is complete. Users should account for group balance and |
| 325 | +
|
| 326 | + potential distribution shifts. Sensitive elements must be handled under DUA to |
| 327 | +
|
| 328 | + mitigate privacy risks.' |
| 329 | + impact_details: |
| 330 | + - Potential bias from unbalanced early releases (balance is targeted for the final complete dataset) |
| 331 | + - Privacy considerations for controlled-access data |
| 332 | + - Designed for pseudotime manifold analysis and trajectory modeling |
| 333 | + - Multi-modal data enables novel AI/ML approaches not feasible with single-domain datasets |
| 334 | +discouraged_uses: |
| 335 | +- description: 'Users must adhere to license terms for public data and the Data Use Agreement |
| 336 | +
|
| 337 | + for controlled-access data. Uses that would violate participant privacy protections, |
| 338 | +
|
| 339 | + and any attempts to re-identify participants, are prohibited.' |
| 340 | + discouragement_details: |
| 341 | + - Uses not permitted by license terms |
| 342 | + - Uses that would violate participant privacy protections |
| 343 | + - Attempts to re-identify participants |
| 344 | + - Uses that would discriminate against participants or populations |
| 345 | +distribution_formats: |
| 346 | +- description: 'Public dataset downloadable upon agreement with a license; full dataset available |
| 347 | +
|
| 348 | + via controlled access through Data Use Agreement. Data formats vary by domain |
| 349 | +
|
| 350 | + as documented in dataset documentation.' |
| 351 | + access_urls: |
| 352 | + - https://fairhub.io/datasets/2 |
| 353 | +license_and_use_terms: |
| 354 | + description: 'Public data available for download upon agreement with a license defining permitted |
| 355 | +
|
| 356 | + uses. Full dataset access (including controlled-access components) contingent on |
| 357 | +
|
| 358 | + entering into a Data Use Agreement. Adheres to FAIR (Findable, Accessible, |
| 359 | +
|
| 360 | + Interoperable, Reusable) principles.' |
| 361 | + license_terms: |
| 362 | + - Public data under license with defined permitted uses |
| 363 | + - Full dataset requires Data Use Agreement for controlled-access components |
| 364 | + - FAIR principles for data sharing |
| 365 | + - Attribution required |
| 366 | +maintainers: |
| 367 | +- name: AI-READI Project Team |
| 368 | + description: 'Multi-institutional project team responsible for maintaining the dataset. |
| 369 | +
|
| 370 | + Contact PI: Aaron Lee, University of Washington, Department of Ophthalmology.' |
| 371 | + maintainer_details: |
| 372 | + - See Documentation site Contact Us and GitHub references |
| 373 | + - Contact information available through FAIRhub portal |
| 374 | +updates: |
| 375 | + description: 'Periodic updates to data releases are planned as enrollment proceeds (enrollment |
| 376 | +
|
| 377 | + through November 30, 2026). Documentation versions align with dataset versions |
| 378 | +
|
| 379 | + to ensure users can track changes across releases.' |
| 380 | + update_details: |
| 381 | + - Periodic data releases as enrollment continues |
| 382 | + - Documentation versioned with dataset (e.g., v1.0.0, v2.0.0) |
| 383 | + - 'Target completion: November 2026' |
| 384 | +version_access: |
| 385 | + description: 'Separate documentation versions align to dataset versions. Users can navigate |
| 386 | +
|
| 387 | + between versions via the documentation site. FAIRhub provides version control |
| 388 | +
|
| 389 | + and DOI assignment for each release.' |
| 390 | + version_details: |
| 391 | + - Version dropdown available in documentation |
| 392 | + - Each dataset version has corresponding documentation version |
| 393 | + - DOI assigned to each major release |
| 394 | + - Version history tracked on FAIRhub |
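The datasheet asks users of early releases to account for group balance across the triple-balanced strata (four race/ethnicity groups, four diabetes severity categories, two sexes). One way to do that might look like the minimal Python sketch below. It is not part of the AI-READI tooling. The function names, the input layout (a dict keyed by stratum tuples), and the equal-share baseline are all assumptions made for illustration.

```python
from itertools import product

# Strata as listed under sampling_strategies in the datasheet.
RACE_ETHNICITY = ("Asian", "Black", "White", "Hispanic")
SEVERITY = (
    "no diabetes",
    "pre-diabetes/lifestyle-controlled",
    "oral/non-insulin injectable medications",
    "insulin-controlled",
)
SEX = ("male", "female")


def balance_report(counts):
    """Compare observed per-cell counts with an equal share of the release.

    `counts` maps (race_ethnicity, severity, sex) tuples to participant
    counts; missing cells are treated as zero. Returns a dict mapping each
    cell to observed_count / equal_share, so 1.0 means the cell holds
    exactly its balanced share. (Hypothetical helper, not a project API.)
    """
    cells = list(product(RACE_ETHNICITY, SEVERITY, SEX))
    total = sum(counts.get(cell, 0) for cell in cells)
    if total == 0:
        raise ValueError("release contains no participants in the known strata")
    equal_share = total / len(cells)
    return {cell: counts.get(cell, 0) / equal_share for cell in cells}


def inverse_frequency_weights(counts):
    """Assumed reweighting scheme: weight each cell by 1 / balance ratio.

    Empty cells get weight 0.0 because there are no samples to reweight.
    """
    return {
        cell: (1.0 / ratio if ratio > 0 else 0.0)
        for cell, ratio in balance_report(counts).items()
    }
```

Calling `balance_report` on the per-stratum counts of a given release flags under-represented cells (ratios well below 1.0). `inverse_frequency_weights` turns the same ratios into per-sample weights that a downstream model could use. Whether such reweighting is appropriate depends on the analysis, so treat it as a starting point rather than a recommendation.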