@cmungall
Member

Summary

This PR adds several new value sets across multiple domains and improves existing enums with better ontology mappings and documentation.

New Value Sets

Plant Experimental Conditions (PECO-based)

  • PlantExposureEnum - Dynamic enum for all plant exposures
  • PlantStudyConditionEnum - Field, greenhouse, growth chamber, laboratory
  • SeasonalEnvironmentExposureEnum - Standard seasons + South Asian agricultural seasons
  • EcologicalEnvironmentExposureEnum - Climate zones, altitude, land characteristics
  • PlantGrowthMediumExposureEnum - Soil, hydroponic, aeroponic, in vitro

Radiation Types (ENVO-based)

  • RadiationTypeEnum - Dynamic enum for all radiation types
  • ElectromagneticRadiationTypeEnum - Full EM spectrum
  • InfraredRadiationTypeEnum - Near, mid, far IR
  • AcousticRadiationTypeEnum - Acoustic, ultrasound, infrasound

Other

  • ArxivCategoryEnum - Complete arXiv subject classification
  • PlannedProcessCompletionStatus - COB-based process status
  • Domain-specific enums: genetics, geospatial formats, FAO soil, carcinogenicity, radionuclides

Fixes

  • PlantSexEnum: Removed inappropriate UBERON/GSSO mappings (animal ontologies). Changed HERMAPHRODITIC to use PATO:0001340 instead.

Enhancements

  • GOEvidenceCode: Added descriptions and structured annotations (is_experimental, is_manual, is_high_throughput)
  • MonthOfYearEnum: Use standard rank field instead of custom annotations
  • New CLI with typer
  • New utility modules and scripts

Test plan

  • Schema validates with just site
  • Review generated Python enums
  • Verify ontology mappings are correct

🤖 Generated with Claude Code

## New Value Sets

### Plant Experimental Conditions (PECO-based)
- `PlantExposureEnum` - Dynamic enum for all plant exposures (abiotic + biotic)
- `PlantStudyConditionEnum` - Field, greenhouse, growth chamber, laboratory studies
- `SeasonalEnvironmentExposureEnum` - Spring, summer, autumn, winter, dry/rainy seasons, kharif/rabi (South Asian agricultural seasons)
- `EcologicalEnvironmentExposureEnum` - Climate zones (tropical, sub-tropical, temperate, desert), altitude, atmospheric pressure, lowland/upland regions
- `PlantGrowthMediumExposureEnum` - Soil, hydroponic, aeroponic, in vitro media

### Radiation Types (ENVO-based)
- `RadiationTypeEnum` - Dynamic enum for all radiation types
- `ElectromagneticRadiationTypeEnum` - Full EM spectrum from gamma rays to radio waves, plus PAR and longwave radiation
- `InfraredRadiationTypeEnum` - Near, mid, and far infrared
- `AcousticRadiationTypeEnum` - Acoustic, ultrasound, infrasound

### Publishing
- `ArxivCategoryEnum` - Complete arXiv subject classification (physics, math, cs, q-bio, q-fin, stat, eess, econ)

### Other New Enums
- `PlannedProcessCompletionStatus` - COB-based status for planned processes (completed/failed)
- Various domain-specific enums: genetics, geospatial formats, FAO soil, carcinogenicity, radionuclides, stewardship

## Ontology Mapping Fixes

### PlantSexEnum
- Removed inappropriate UBERON and GSSO mappings (these ontologies are for animals, not plants)
- Changed HERMAPHRODITIC from UBERON:0007197 to PATO:0001340
- Removed GSSO mappings from DIOECIOUS and MONOECIOUS (no appropriate PATO/PO terms exist)

## Enum Enhancements

### GOEvidenceCode
- Added detailed descriptions for all evidence codes
- Added structured annotations: is_experimental, is_manual, is_high_throughput
- Enhanced documentation explaining evidence categories

### MonthOfYearEnum
- Changed month_number annotations to use standard LinkML `rank` field for ordering
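A minimal sketch of how a consumer might order values by `rank` once the custom annotation is gone. The permissible value names and the plain-dict layout here are illustrative assumptions, not taken from the schema in this PR:

```python
# Hypothetical, trimmed-down view of permissible values as plain dicts;
# in the schema these would sit under MonthOfYearEnum.permissible_values.
permissible_values = {
    "MARCH": {"rank": 3},
    "JANUARY": {"rank": 1},
    "FEBRUARY": {"rank": 2},
}


def ordered_names(pvs: dict[str, dict]) -> list[str]:
    """Sort permissible value names by `rank`; values without a rank go last."""
    return sorted(
        pvs,
        key=lambda name: (pvs[name].get("rank") is None, pvs[name].get("rank") or 0),
    )
```

Because `rank` is a standard LinkML metaslot, generic tooling can apply the same ordering without knowing about a month-specific annotation.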

### CellCyclePhase
- Added ontology mappings and enhanced metadata

## Infrastructure

### New CLI
- Added `src/valuesets/cli.py` with typer-based command interface

### New Utilities
- `classifier.py` - Value set classification utilities
- `query_describer.py` - Query description utilities

### Scripts
- `add_enum_metadata.py` - Add metadata to enums
- `add_instantiates.py` - Add instantiates declarations
- `enrich_enum_docs.py` - Enrich enum documentation
- `fix_enum_titles.py` - Fix enum titles

### Build System
- Updated justfile with new targets
- Added project.justfile for project-specific tasks
- Updated pyproject.toml dependencies

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copilot AI review requested due to automatic review settings December 15, 2025 18:13
@claude
Contributor

claude bot commented Dec 15, 2025

Claude finished @cmungall's task


PR Review Complete ✅

I've thoroughly reviewed this substantial PR that adds multiple new value sets and improves existing enums. Here's my comprehensive analysis:

Review Checklist

  • Set up development environment and validate schema
  • Review LinkML schema structure and best practices
  • Check ontology mappings and CURIE usage
  • Review new Python code (CLI, utilities, generators)
  • Analyze domain-specific value sets for accuracy
  • Check for potential security and performance issues
  • Assess test coverage and documentation
  • Provide comprehensive feedback

Strengths

1. Excellent LinkML Schema Organization

  • Well-structured modular approach with domain-specific organization
  • Consistent use of instantiates: valuesets_meta:ValueSetEnumDefinition
  • Proper CURIE prefix declarations and namespace usage
  • Good separation of static vs. dynamic enums using reachable_from

2. High-Quality Ontology Mappings

  • Systematic use of established ontologies (PECO, ENVO, ECO, IARC, etc.)
  • Proper meaning: vs close_mappings: distinctions
  • The PlantSexEnum fix removing inappropriate UBERON/GSSO mappings is excellent

3. Rich Metadata and Documentation

  • Comprehensive descriptions and titles
  • GOEvidenceCode enhancements with structured annotations (is_experimental, is_manual, is_high_throughput) are particularly valuable
  • Good use of examples and evidence levels in carcinogenicity classifications

4. Robust Python Code

  • New CLI with proper typer usage and error handling at src/valuesets/cli.py:49
  • Well-designed classifier utilities with regex patterns at src/valuesets/utils/classifier.py:8-11
  • Good separation of concerns in the modular generator approach

🔍 Areas for Improvement

1. Schema Validation Concerns

  • Cannot verify full schema validation without running just site (requires approval)
  • Some dynamic enums may need validation of reachable_from ontology queries
  • Consider adding schema validation to CI pipeline

2. Test Coverage

  • No visible test additions for the new value sets
  • CLI and utility functions would benefit from unit tests
  • Consider adding integration tests for dynamic enum expansion

3. Minor Code Quality Issues

src/valuesets/cli.py:21-51: The load_enum_class function could benefit from more specific exception handling:

# Current: Generic ImportError handling
except ImportError as e:
    raise typer.BadParameter(f"Cannot import module '{module}': {e}")

# Suggest: More specific error types for different failure modes
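
One possible shape for the more specific handling suggested above, as a sketch only. The real signature of `load_enum_class` is not shown in this thread, so the two-argument form below is an assumption, and `ValueError` stands in for `typer.BadParameter` to keep the example free of third-party imports:

```python
import enum
import importlib


def load_enum_class(module: str, name: str) -> type[enum.Enum]:
    """Hypothetical stand-in for the function under review."""
    try:
        mod = importlib.import_module(module)
    except ModuleNotFoundError as e:
        # The module path itself is wrong, or the package is not installed.
        raise ValueError(f"Module '{module}' not found: {e}") from e
    except ImportError as e:
        # The module exists but failed while importing one of its dependencies.
        raise ValueError(f"Module '{module}' failed to import: {e}") from e

    try:
        obj = getattr(mod, name)
    except AttributeError as e:
        raise ValueError(f"Module '{module}' has no attribute '{name}'") from e

    if not (isinstance(obj, type) and issubclass(obj, enum.Enum)):
        raise ValueError(f"'{module}.{name}' is not an Enum subclass")
    return obj
```

`ModuleNotFoundError` is a subclass of `ImportError`, so it has to be caught first for the two messages to differ.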

src/valuesets/utils/classifier.py:14-75: The detect_classifier_fields function uses regex patterns effectively, but consider caching compiled patterns as module-level constants.
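
A sketch of the module-level constant approach. The patterns and the `_min`/`_max` naming convention below are hypothetical, since the actual patterns in `classifier.py` are not quoted in this review:

```python
import re

# Hypothetical patterns, compiled once at import time rather than on every call.
_MIN_FIELD = re.compile(r"^(?P<base>\w+)_min$")
_MAX_FIELD = re.compile(r"^(?P<base>\w+)_max$")


def detect_range_fields(annotation_keys: list[str]) -> dict[str, set[str]]:
    """Group annotation keys such as 'age_min' / 'age_max' by their base name."""
    found: dict[str, set[str]] = {}
    for key in annotation_keys:
        for bound, pattern in (("min", _MIN_FIELD), ("max", _MAX_FIELD)):
            m = pattern.match(key)
            if m:
                found.setdefault(m.group("base"), set()).add(bound)
    return found
```

Python's `re` module already keeps an internal cache of recently compiled patterns, so the gain is likely to be modest: mainly clearer naming and one less cache lookup per call.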

4. Documentation

  • Some new enums like ArxivCategory could benefit from more detailed descriptions
  • Consider adding usage examples for the new CLI tools

🚨 Potential Issues

1. Performance Considerations

  • Large ArxivCategory enum (1000+ entries) may impact memory usage
  • Dynamic enum expansion could be slow for large ontologies
  • Consider lazy loading strategies for large value sets

2. Security

  • CLI input handling looks secure with proper validation
  • No obvious security concerns in the added code
  • File I/O operations use safe pathlib patterns

📋 Specific Recommendations

  1. Add Validation Tests: Consider adding tests that verify ontology CURIEs are resolvable
  2. CLI Documentation: Add usage examples and help text for the new CLI commands
  3. Performance Monitoring: Add metrics for dynamic enum expansion times
  4. Schema Versioning: Consider versioning strategy for backward compatibility as schemas evolve
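
Checking that CURIEs resolve needs network or ontology access. A cheaper first step, sketched here, is an offline test that every `meaning` is a well-formed CURIE with a declared prefix. The function name and the plain-dict schema layout are assumptions that mirror how a LinkML YAML file loads, and `DemoEnum` is purely illustrative:

```python
import re

CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z][\w.-]*):(?P<local>\S+)$")


def undeclared_curies(schema: dict) -> list[str]:
    """Return 'Enum.VALUE: curie' entries whose prefix is malformed or undeclared."""
    declared = set(schema.get("prefixes", {}))
    problems = []
    for enum_name, enum_def in schema.get("enums", {}).items():
        for pv_name, pv in (enum_def.get("permissible_values") or {}).items():
            meaning = (pv or {}).get("meaning")
            if meaning is None:
                continue  # values without a mapping are allowed
            m = CURIE_RE.match(meaning)
            if m is None or m.group("prefix") not in declared:
                problems.append(f"{enum_name}.{pv_name}: {meaning}")
    return problems


sample = {
    "prefixes": {"PATO": "http://purl.obolibrary.org/obo/PATO_"},
    "enums": {
        "PlantSexEnum": {
            "permissible_values": {
                "HERMAPHRODITIC": {"meaning": "PATO:0001340"},
                "DIOECIOUS": {},
            }
        },
        # Illustrative only: a value whose prefix was never declared.
        "DemoEnum": {"permissible_values": {"BAD": {"meaning": "EXAMPLE:0000001"}}},
    },
}
```

A test suite could assert that `undeclared_curies(schema)` is empty for every schema file, and leave true resolvability checks to a slower, network-enabled job.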

🎯 Domain-Specific Assessment

Plant Experimental Conditions (PECO-based): ✅ Excellent coverage of plant research conditions with proper PECO mappings

Radiation Types (ENVO-based): ✅ Comprehensive electromagnetic spectrum coverage with accurate scientific terminology

Carcinogenicity Classifications: ✅ Authoritative use of IARC, EPA IRIS classifications with proper evidence levels

arXiv Categories: ✅ Complete taxonomy with appropriate NCIT/MeSH mappings where available

🏁 Overall Assessment

This is a high-quality PR that significantly enhances the value sets collection. The code follows LinkML best practices, uses authoritative ontology sources, and includes well-designed Python utilities. The modular approach and rich metadata make these value sets highly usable for scientific applications.

Recommendation: Approve with minor suggestions

The improvements to GOEvidenceCode and the PlantSexEnum ontology mapping fixes alone make this a valuable contribution. The new domain-specific value sets are well-researched and properly implemented.


Contributor

Copilot AI left a comment

Pull request overview

This PR adds several new value sets across multiple domains and improves existing enums with better ontology mappings and documentation. The changes include new plant experimental conditions enums (PECO-based), radiation type enums (ENVO-based), the complete arXiv subject classification system, and various domain-specific enums for genetics, geospatial formats, FAO soil types, carcinogenicity classifications, and radionuclides. Additionally, the PR fixes incorrect ontology mappings in PlantSexEnum, enhances GOEvidenceCode with descriptions and structured annotations, standardizes MonthOfYearEnum to use the rank field, and introduces new utility modules for enum classification and query description along with a CLI interface using typer.

  • New value sets for plant experimental conditions, radiation types, arXiv categories, and various domain-specific classifications
  • Fixed ontology mappings in PlantSexEnum and enhanced GOEvidenceCode with structured annotations
  • Added utility modules for classifier functionality and query description, plus new CLI with typer

Reviewed changes

Copilot reviewed 53 out of 56 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/valuesets/utils/query_describer.py | Adds utility for describing dynamic enum queries in human-readable text |
| src/valuesets/utils/classifier.py | Implements classifier utilities for enum values based on numeric range annotations |
| src/valuesets/utils/__init__.py | Exports new classifier functions |
| src/valuesets/schema/valuesets.yaml | Adds import for arxiv_categories schema |
| src/valuesets/schema/time/temporal.yaml | Updates MonthOfYearEnum to use standard rank field instead of custom month_number annotation |
| src/valuesets/schema/publishing/arxiv_categories.yaml | Adds complete arXiv subject classification system with all categories |
| src/valuesets/schema/physics/radiation.yaml | Adds ENVO-based radiation type enums |
| src/valuesets/schema/investigation.yaml | Adds PlannedProcessCompletionStatus enum |
| src/valuesets/schema/demographics.yaml | Adds OccupationEnum dynamic enum |
| src/valuesets/schema/bio/plant_sex.yaml | Removes inappropriate UBERON/GSSO mappings for plant sex terms |
| src/valuesets/schema/bio/plant_experimental_conditions.yaml | Adds PECO-based plant experimental condition enums |
| src/valuesets/schema/bio/go_evidence.yaml | Enhances GOEvidenceCode with descriptions and structured annotations |
| src/valuesets/schema/bio/cell_cycle.yaml | Adds succeeded_by annotations to meiotic phases |
| src/valuesets/generators/modular_rich_generator.py | Adds support for rank field in metadata building |
| src/valuesets/enums/ (multiple files) | Generated enum files for new value sets |
| src/valuesets/cli.py | New CLI implementation using typer |
| scripts/ (multiple files) | New utility scripts for enum enrichment and metadata management |
| pyproject.toml | Adds typer dependency and CLI entry point |

Comment on lines +147 to +162
    output = sys.stdout if output_path is None else open(output_path, "w", newline="")
    try:
        writer = csv.DictWriter(output, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    finally:
        if output_path:
            output.close()
elif output_format == "jsonl":
    output = sys.stdout if output_path is None else open(output_path, "w")
    try:
        for row in results:
            output.write(json.dumps(row) + "\n")
    finally:
        if output_path:
            output.close()
Copilot AI Dec 15, 2025

File handle should be opened using a context manager to ensure proper closure even if an exception occurs. Wrap the file operations in a with statement or ensure output.close() is called in all code paths.

Suggested change
-    output = sys.stdout if output_path is None else open(output_path, "w", newline="")
-    try:
-        writer = csv.DictWriter(output, fieldnames=fieldnames)
-        writer.writeheader()
-        writer.writerows(results)
-    finally:
-        if output_path:
-            output.close()
-elif output_format == "jsonl":
-    output = sys.stdout if output_path is None else open(output_path, "w")
-    try:
-        for row in results:
-            output.write(json.dumps(row) + "\n")
-    finally:
-        if output_path:
-            output.close()
+    if output_path:
+        with open(output_path, "w", newline="") as output:
+            writer = csv.DictWriter(output, fieldnames=fieldnames)
+            writer.writeheader()
+            writer.writerows(results)
+    else:
+        output = sys.stdout
+        writer = csv.DictWriter(output, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(results)
+elif output_format == "jsonl":
+    if output_path:
+        with open(output_path, "w") as output:
+            for row in results:
+                output.write(json.dumps(row) + "\n")
+    else:
+        output = sys.stdout
+        for row in results:
+            output.write(json.dumps(row) + "\n")

Copilot uses AI. Check for mistakes.
Comment on lines +147 to +162
    output = sys.stdout if output_path is None else open(output_path, "w", newline="")
    try:
        writer = csv.DictWriter(output, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)
    finally:
        if output_path:
            output.close()
elif output_format == "jsonl":
    output = sys.stdout if output_path is None else open(output_path, "w")
    try:
        for row in results:
            output.write(json.dumps(row) + "\n")
    finally:
        if output_path:
            output.close()
Copilot AI Dec 15, 2025

File handle should be opened using a context manager to ensure proper closure even if an exception occurs. Wrap the file operations in a with statement or ensure output.close() is called in all code paths.

Suggested change
-    output = sys.stdout if output_path is None else open(output_path, "w", newline="")
-    try:
-        writer = csv.DictWriter(output, fieldnames=fieldnames)
-        writer.writeheader()
-        writer.writerows(results)
-    finally:
-        if output_path:
-            output.close()
-elif output_format == "jsonl":
-    output = sys.stdout if output_path is None else open(output_path, "w")
-    try:
-        for row in results:
-            output.write(json.dumps(row) + "\n")
-    finally:
-        if output_path:
-            output.close()
+    if output_path is None:
+        writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(results)
+    else:
+        with open(output_path, "w", newline="") as output:
+            writer = csv.DictWriter(output, fieldnames=fieldnames)
+            writer.writeheader()
+            writer.writerows(results)
+elif output_format == "jsonl":
+    if output_path is None:
+        for row in results:
+            sys.stdout.write(json.dumps(row) + "\n")
+    else:
+        with open(output_path, "w") as output:
+            for row in results:
+                output.write(json.dumps(row) + "\n")
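
Both suggested changes repeat the write logic once for stdout and once for a file. A possible alternative, sketched here, is a small context manager that yields either `sys.stdout` or an opened file, so each format is written in one place. The names `open_output` and `write_results` are hypothetical, and the `"csv"` label for the first branch is an assumption, since the quoted lines start after that condition:

```python
import csv
import json
import sys
from contextlib import contextmanager
from typing import IO, Iterator, Optional


@contextmanager
def open_output(output_path: Optional[str], newline: Optional[str] = None) -> Iterator[IO[str]]:
    """Yield sys.stdout when no path is given, else a file that is closed on exit."""
    if output_path is None:
        yield sys.stdout  # stdout is never closed here
    else:
        with open(output_path, "w", newline=newline) as handle:
            yield handle


def write_results(results, fieldnames, output_format, output_path=None):
    """Write rows (a list of dicts) as CSV or JSON Lines."""
    if output_format == "csv":
        with open_output(output_path, newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(results)
    elif output_format == "jsonl":
        with open_output(output_path) as out:
            for row in results:
                out.write(json.dumps(row) + "\n")
    else:
        raise ValueError(f"Unsupported format: {output_format}")
```

The file is closed even if a write raises, and the stdout path no longer needs its own copy of the writer code.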

@dragon-ai-agent dragon-ai-agent merged commit 7f536b4 into main Dec 15, 2025
5 checks passed