
Commit 6b5fd78

Merge pull request #19 from bridge2ai/schema-extend
Extend LinkML schema with LBNL DOE model card md template coverage
2 parents: 8477dcd + 8dc97a9

8 files changed (+3317 −2 lines)
.gitignore

Lines changed: 1 addition & 0 deletions
@@ -129,3 +129,4 @@ dmypy.json

 # Pyre type checker
 .pyre/
+.DS_Store

CLAUDE.md

Lines changed: 218 additions & 0 deletions
@@ -282,3 +282,221 @@ See `utils/README.md` for complete tool documentation.
- `modelcards.yaml` - Current production schema
- `modelcards_harmonized.yaml` - Proposed harmonized schema (conceptual, has naming conflicts)
- External reference pattern (recommended) - See examples in `src/data/examples/harmonized/`

## Model Card Extended Template

### Branch: `schema-extend`

The schema has been extended on the `schema-extend` branch to provide **100% coverage** for DOE scientific models through an extended template. The extended template emphasizes compute infrastructure, reproducibility, and mission relevance for scientific computing applications.

### Extensions Overview

- **Schema Size**: ~1,500 lines (up from the 967-line baseline)
- **New Classes**: 10 extended template classes
- **Enhanced Classes**: 6 existing classes
- **New Slots**: ~40 new fields
- **New Enums**: 1 (ContributorRoleEnum)

### New Classes (10)

1. **Contributor** - Role-based contributor attribution
   - Fields: name, role (ContributorRoleEnum), email, orcid, affiliation
   - Replaces/enhances the simple `owner` class
   - Example: `{name: "Jane Doe", role: developed_by, orcid: "https://orcid.org/0000-0002-1234-5678"}`

2. **ComputeInfrastructure** - Hardware/software used for training
   - Fields: hardware, hardware_list, software, software_dependencies, training_speed
   - Captures DOE facility information (NERSC, ALCF, OLCF)
   - Example: `hardware_list: ["64 nodes × 4 NVIDIA A100 GPUs", "NERSC Perlmutter"]`

3. **Hyperparameters** - Complete training hyperparameters
   - Fields: optimizer, learning_rate, batch_size, training_epochs, training_steps, etc.
   - Supports LLM-specific fields (prompting_template, fine_tuning_method)
   - Example: `{optimizer: AdamW, learning_rate: 0.0001, batch_size: 512}`

4. **ReproducibilityInfo** - Reproducibility documentation
   - Fields: random_seed, environment_config, pipeline_url, hyperparameters
   - Example: `{random_seed: 42, hyperparameters: {...}}`

5. **CodeExample** - Code snippets with language
   - Fields: code, code_language, description
   - Example: `{code: "import torch...", code_language: python}`

6. **UsageDocumentation** - Installation and usage
   - Fields: installation_instructions, training_configuration, inference_configuration, code_examples
   - Supports conda/docker/SLURM workflows

7. **MissionRelevance** - DOE mission alignment
   - Fields: doe_project, doe_facility, funding_source, description
   - Example: `{doe_facility: "NERSC Perlmutter", doe_project: "Climate Model Development"}`

8. **OutOfScopeUse** - Prohibited uses
   - Fields: description
   - Example: `{description: "Not for real-time weather forecasting"}`

9. **TrainingProcedure** - Training methodology
   - Fields: description, methodology, reproducibility_info, pre_training_info, training_data_separate
   - Nests hyperparameters and reproducibility info (see the combined sketch after this list)

10. **EvaluationProcedure** - Evaluation methodology
    - Fields: description, benchmarks, baselines, sota_comparison, uncertainty_quantification, evaluation_data_separate
    - Example: benchmark comparisons, SOTA references, uncertainty analysis

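A minimal, hypothetical sketch of how several of these classes could nest inside a model card instance. Slot names are taken from the field lists above and the values are illustrative only; consult `modelcards.yaml` on the `schema-extend` branch for the authoritative definitions:

```yaml
# Illustrative fragment; field names assumed from the class descriptions above.
model_parameters:
  training_procedure:                # TrainingProcedure
    methodology: "Distributed data-parallel training"
    reproducibility_info:            # ReproducibilityInfo
      random_seed: 42
      pipeline_url: "https://example.org/pipelines/climate-train"   # placeholder URL
      hyperparameters:               # Hyperparameters
        optimizer: AdamW
        learning_rate: 0.0001
        batch_size: 512

quantitative_analysis:
  evaluation_procedure:              # EvaluationProcedure
    benchmarks: "ClimateNet segmentation benchmark"
    sota_comparison: "Compared against prior CNN baselines"
    uncertainty_quantification: "Ensemble spread over 5 random seeds"

usage_documentation:                 # UsageDocumentation (new root-level field)
  installation_instructions: "conda env create -f environment.yml"
  code_examples:
    - code: "import torch"           # CodeExample
      code_language: python
```
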
### Enhanced Classes (6)

1. **Version** - Added `last_updated`, `superseded_by`
2. **License** - Added `license_name`, `license_link` for custom licenses (see the sketch below)
3. **ModelDetails** - Added `short_description`, `contributors` (role-based)
4. **ModelParameters** - Added `compute_infrastructure`, `training_procedure`
5. **QuantitativeAnalysis** - Added `evaluation_procedure`
6. **Considerations** - Added `out_of_scope_uses`

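A hypothetical fragment showing the enhanced fields on existing classes. Values are illustrative and the exact slot layout should be checked against the schema:

```yaml
# Illustrative fragment; slot names taken from the enhancement list above.
model_details:
  short_description: "Deep-learning segmentation of extreme weather events"   # new slot
  version:
    last_updated: "2024-06-01"        # new slot
    superseded_by: "v2.1.0"           # new slot
  licenses:
    - license_name: "Custom LBNL license"          # new slot
      license_link: "https://example.org/license"  # placeholder URL

considerations:
  out_of_scope_uses:                  # new slot
    - description: "Not for real-time weather forecasting"
```
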
### New Root-Level Fields (2)

Added to `modelCard` class:

- `mission_relevance` (MissionRelevance)
- `usage_documentation` (UsageDocumentation)

### Extended Template Coverage

| Template Section | Schema Mapping | Coverage |
|---------------|----------------|----------|
| Model Details → Description | `model_details.short_description` | ✅ 100% |
| Model Details → Developed By | `model_details.contributors` (role: developed_by) | ✅ 100% |
| Model Details → Shared By | `model_details.contributors` (role: contributed_by) | ✅ 100% |
| Model Details → Version | `model_details.version` (enhanced) | ✅ 100% |
| Model Details → License | `model_details.licenses` (enhanced) | ✅ 100% |
| Compute Infrastructure → Hardware | `compute_infrastructure.hardware_list` | ✅ 100% |
| Compute Infrastructure → Software | `compute_infrastructure.software_dependencies` | ✅ 100% |
| Training → Dataset | `model_parameters.data` | ✅ 100% |
| Training → Procedure | `model_parameters.training_procedure` | ✅ 100% |
| Training → Reproducibility | `training_procedure.reproducibility_info` | ✅ 100% |
| Training → Hyperparameters | `reproducibility_info.hyperparameters` | ✅ 100% |
| Evaluation → Metrics | `quantitative_analysis.performance_metrics` | ✅ 100% |
| Evaluation → Procedure | `quantitative_analysis.evaluation_procedure` | ✅ 100% |
| Uses → Intended Uses | `considerations.use_cases` | ✅ 100% |
| Uses → Out-of-Scope | `considerations.out_of_scope_uses` | ✅ 100% |
| Limitations | `considerations.limitations` | ✅ 100% |
| Ethical Considerations | `considerations.ethical_considerations` | ✅ 100% |
| DOE Mission Relevance | `mission_relevance` | ✅ 100% |
| Usage Documentation | `usage_documentation` | ✅ 100% |

**Overall Coverage**: ✅ **100%**

### Examples

**Extended Template Example**: `src/data/examples/extended/climate-model-extended.yaml`

- Complete ClimateNet-v2 model card
- Demonstrates all extended template features
- Realistic DOE scientific model (climate AI)
- Includes:
  - Role-based contributors with ORCID
  - NERSC Perlmutter compute infrastructure
  - Complete hyperparameters (optimizer, learning rate, batch size, etc.)
  - Reproducibility info (random seed, environment)
  - DOE mission relevance (BER funding, NERSC facility)
  - Complete usage documentation (conda/docker/SLURM)
  - Code examples in Python and Bash

**Example Documentation**: `src/data/examples/extended/README.md`

- Complete extended template feature documentation
- Before/after migration examples
- Coverage table
- Validation instructions

### Validation

The schema validates successfully with linkml-lint:

```bash
poetry run linkml-lint src/linkml/modelcards.yaml
```

Only non-blocking naming-convention warnings are reported (the same as the baseline schema).

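The extended example instance can likewise be checked against the schema. A hedged sketch using `linkml-validate`; the flags and the `modelCard` target class name are assumptions here, so defer to the validation instructions in the example README:

```bash
# Assumes linkml-validate is available in the poetry environment and that
# modelCard is the root class name in the schema; adjust as needed.
poetry run linkml-validate \
  --schema src/linkml/modelcards.yaml \
  --target-class modelCard \
  src/data/examples/extended/climate-model-extended.yaml
```
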
### Use Cases

The extended template is ideal for:

1. **DOE Scientific Models**
   - Climate models (E3SM, CESM, MPAS)
   - Materials science, fusion, bioinformatics
   - Any model trained at DOE facilities

2. **HPC/Supercomputing Applications**
   - Models trained on NERSC Perlmutter, ALCF Polaris/Aurora, OLCF Frontier
   - Large-scale distributed training
   - Petabyte-scale datasets

3. **Reproducible Science**
   - Complete environment specifications
   - Random seeds and hyperparameters
   - Training pipeline URLs
   - Detailed methodology

4. **DOE Mission-Aligned Projects**
   - Office of Science grants (BER, ASCR, NP, HEP)
   - Facility-specific documentation
   - Funding transparency

### Backward Compatibility

All extended template features are **fully backward compatible**:

- Existing model cards remain valid
- Extended fields are optional
- Legacy `owner` class preserved alongside the new `contributors` (see the sketch below)
- No breaking changes to the existing schema

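A hypothetical fragment showing the legacy owner entry coexisting with the new role-based `contributors` on one card; the slot name for the legacy owner class is assumed here:

```yaml
# Both attribution styles may appear together; cards that only use the legacy
# owner entry remain valid. The "owners" slot name is an assumption.
model_details:
  owners:
    - name: "Jane Doe"
  contributors:
    - name: "Jane Doe"
      role: developed_by
      orcid: "https://orcid.org/0000-0002-1234-5678"
```
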
### Migration Path

To upgrade an existing model card with extended template features:

1. **Add contributors** (optional, recommended):

   ```yaml
   model_details:
     contributors:
       - name: "Jane Doe"
         role: developed_by
         orcid: "https://orcid.org/0000-0002-1234-5678"
   ```

2. **Add compute infrastructure** (optional):

   ```yaml
   model_parameters:
     compute_infrastructure:
       hardware_list: ["64 × NVIDIA A100 GPUs"]
       software_dependencies: "pytorch=2.1.0\nhorovod=0.28.1"
   ```

3. **Add reproducibility info** (optional):

   ```yaml
   model_parameters:
     training_procedure:
       reproducibility_info:
         random_seed: 42
         hyperparameters:
           optimizer: AdamW
           learning_rate: 0.0001
   ```

4. **Add DOE mission relevance** (optional):

   ```yaml
   mission_relevance:
     doe_facility: "NERSC Perlmutter"
     doe_project: "My DOE Project"
   ```

5. **Add usage documentation** (optional):

   ```yaml
   usage_documentation:
     installation_instructions: "pip install my-model"
     code_examples:
       - code: "import my_model"
         code_language: "python"
   ```

### Related Files

- **Schema**: `src/linkml/modelcards.yaml` (on `schema-extend` branch)
- **Template Source**: `data/input_docs/KOGUT/model-card.md` (original LBNL DOE KOGUT template; path preserved for historical reference)
- **Example**: `src/data/examples/extended/climate-model-extended.yaml`
- **Example Docs**: `src/data/examples/extended/README.md`
