Copilot AI commented Dec 4, 2025

Implements a Croissant extension that integrates Darwin Core terms with FAIR4AI requirements to make biodiversity datasets AI-ready. Rather than redefining existing standards, ecoCroissant uses Darwin Core terms directly and adds AI-specific metadata for machine learning applications.

FAIR4AI Requirements Implemented

  1. Queryable Metadata: Metadata is stored in a queryable format (Parquet), so it can be queried without downloading large files
  2. Ontology Integration: Darwin Core terms queryable with synonyms from GBIF, NCBI, EOL
  3. Content/Context Extraction: recordType property distinguishes occurrence-based vs image-based data

Darwin Core Integration (docs/eco-spec.md)

ecoCroissant uses Darwin Core terms directly:

  • Taxonomic: dwc:scientificName, dwc:taxonRank, dwc:kingdom, dwc:family, dwc:genus, dwc:taxonID, dwc:higherClassification, dwc:vernacularName
  • Geographic: dwc:locality, dwc:habitat, dwc:continent, dwc:country, dwc:decimalLatitude, dwc:decimalLongitude, elevation/depth terms
  • Temporal: dwc:eventDate, dwc:year, dwc:month, dwc:day, dwc:lifeStage
  • Data Quality: dwc:basisOfRecord, dwc:identifiedBy, dwc:identificationVerificationStatus, dwc:samplingProtocol
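As a rough illustration of "using Darwin Core terms directly", the sketch below flags any `dwc:`-prefixed key in a metadata record that is not among the terms listed above. The function name `unlisted_dwc_terms` and the sample record are hypothetical, and the term set covers only the terms named in this description (the unspecified elevation/depth terms are left out), not the full Darwin Core vocabulary.

```python
# Darwin Core terms named in this PR description only (assumption: this is
# a convenience subset, NOT the complete Darwin Core term list).
LISTED_DWC_TERMS = {
    "scientificName", "taxonRank", "kingdom", "family", "genus", "taxonID",
    "higherClassification", "vernacularName",
    "locality", "habitat", "continent", "country",
    "decimalLatitude", "decimalLongitude",
    "eventDate", "year", "month", "day", "lifeStage",
    "basisOfRecord", "identifiedBy", "identificationVerificationStatus",
    "samplingProtocol",
}


def unlisted_dwc_terms(record):
    """Return the sorted dwc:-prefixed keys whose local name is not in the listed set."""
    prefix = "dwc:"
    return sorted(
        key for key in record
        if key.startswith(prefix) and key[len(prefix):] not in LISTED_DWC_TERMS
    )


# Hypothetical record: "dwc:speciesName" is a made-up key used to trigger the check.
sample = {
    "dwc:scientificName": "Heliconius Kluk, 1780",
    "dwc:habitat": ["tropical rainforest"],
    "dwc:speciesName": "not a listed term",
    "eco:recordType": "image-based",
}
print(unlisted_dwc_terms(sample))
```

A check like this could catch typos or accidental redefinitions early; a real validator would compare against the published Darwin Core term list instead of this subset.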

AI-Ready Properties (ecoCroissant Extensions)

  • Data Distribution: eco:dataDistribution, eco:trainTestSplit, eco:stratificationVariable, eco:dataSplitRationale
  • Preprocessing: eco:preprocessingSteps, eco:standardizationMethod
  • Model Provenance: eco:generatedBy, eco:modelConfidence, eco:humanVerified, eco:generationMethod
  • API/Streaming: eco:apiEndpoint, eco:streamingSupported, eco:rateLimitRequests, eco:rateLimitPeriod
  • Record Context: eco:recordType, eco:occurrenceToImageRatio, eco:imageAnnotationType
  • Ecological: eco:biome, eco:trophicLevel, eco:ecologicalRole, eco:speciesInteractions
  • Conservation: eco:iucnStatus, eco:populationTrend, eco:threats, eco:protectedArea
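One way the model provenance properties might be consumed downstream is sketched below. This is only an illustration under several assumptions that the description does not state: the properties are applied per record, `eco:humanVerified` is a boolean, `eco:modelConfidence` is a number between 0 and 1, and a record without `eco:generatedBy` was not model-generated. The function name `trusted_annotations` and the threshold are hypothetical.

```python
def trusted_annotations(records, min_confidence=0.9):
    """Keep records that are not model-generated, are human-verified,
    or carry a model confidence at or above min_confidence.

    Assumed value types (not defined in the PR description):
    eco:humanVerified -> bool, eco:modelConfidence -> float in [0, 1].
    """
    kept = []
    for rec in records:
        if "eco:generatedBy" not in rec:
            kept.append(rec)  # no model provenance: treated as not AI-generated
        elif rec.get("eco:humanVerified") is True:
            kept.append(rec)  # model output that a person has checked
        elif rec.get("eco:modelConfidence", 0.0) >= min_confidence:
            kept.append(rec)  # unverified but confident model output
    return kept


# Hypothetical records; "example-model" is a placeholder, not a real model name.
records = [
    {"id": 1},
    {"id": 2, "eco:generatedBy": "example-model", "eco:humanVerified": True},
    {"id": 3, "eco:generatedBy": "example-model", "eco:modelConfidence": 0.95},
    {"id": 4, "eco:generatedBy": "example-model", "eco:modelConfidence": 0.5},
]
kept_ids = [r["id"] for r in trusted_annotations(records)]
print(kept_ids)
```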

Schema (schema/eco-context.jsonld)

JSON-LD context defining the http://imageomics.org/ecoCroissant/ namespace with Darwin Core integration.

Example

{
  "@context": {
    "dct": "http://purl.org/dc/terms/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "eco": "http://imageomics.org/ecoCroissant/",
    "cr": "http://mlcommons.org/croissant/"
  },
  "dct:conformsTo": [
    "http://mlcommons.org/croissant/1.0",
    "http://imageomics.org/ecoCroissant/1.0"
  ],
  "dwc:scientificName": "Heliconius Kluk, 1780",
  "dwc:taxonRank": "genus",
  "dwc:habitat": ["tropical rainforest"],
  "dwc:basisOfRecord": "PreservedSpecimen",

  "eco:recordType": "image-based",
  "eco:dataDistribution": "long-tailed: 50 species, 10-1000 images each",
  "eco:trainTestSplit": "80/10/10 stratified by species",
  "eco:preprocessingSteps": ["resized to 224x224", "ImageNet normalization"]
}
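To show how the prefixes in `@context` resolve, here is a minimal sketch that expands the compact keys of a document like the one above into full IRIs using only the standard library. It handles simple `prefix:localName` keys at the top level and is not a full JSON-LD processor; the helper name `expand_keys` and the trimmed sample document are illustrative assumptions.

```python
import json

# Trimmed copy of the example above, kept small for illustration.
DOC = json.loads("""
{
  "@context": {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "eco": "http://imageomics.org/ecoCroissant/"
  },
  "dwc:scientificName": "Heliconius Kluk, 1780",
  "dwc:taxonRank": "genus",
  "eco:recordType": "image-based"
}
""")


def expand_keys(doc):
    """Replace 'prefix:name' keys with 'namespace + name' using the doc's @context.

    Keys whose prefix is not declared in @context are left unchanged,
    which makes undeclared prefixes easy to spot in the output.
    """
    context = doc.get("@context", {})
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        prefix, sep, local = key.partition(":")
        if sep and prefix in context:
            expanded[context[prefix] + local] = value
        else:
            expanded[key] = value
    return expanded


expanded = expand_keys(DOC)
for iri, value in expanded.items():
    print(iri, "->", value)
```

Because Darwin Core terms expand to their `http://rs.tdwg.org/dwc/terms/` IRIs, any tool that already understands Darwin Core can recognize them without knowing about ecoCroissant.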

Includes AI-ready TreeOfLife-200M example in examples/treeoflife-200m.json demonstrating Darwin Core integration with preprocessing, data distribution, and streaming metadata.

Original prompt

This repository is for developing a croissant-based biodiversity metadata schema following FAIR4AI principles. The idea is to allow for capture of ecologically-relevant information from datasets beyond what is included in the croissant format (ex: huggingface.co/datasets/imageomics/TreeOfLife-200M, in croissant: huggingface.co/api/datasets/imageomics/TreeOfLife-200M/croissant). See also https://github.com/mlcommons/croissant.



Copilot AI changed the title from "[WIP] Add biodiversity metadata schema under FAIR4AI principles" to "Add ecoCroissant biodiversity metadata schema extension" Dec 4, 2025
Copilot AI requested a review from egrace479 December 4, 2025 00:45
@egrace479
Member

@copilot, please re-evaluate the proposed structure using the definition of FAIR4AI provided below. Additionally, this should be compatible with Darwin Core, not redefining those terms. The idea is to build on the existing standards for better incorporation into the Croissant structure.

Distinguishing FAIR4AI and AI-Ready from FAIR

FAIR4AI Working Definitions

FAIR4AI extends FAIR standards with AI-specific requirements:

  1. Data/metadata can be queried without downloading large files or specialized file types.
  2. Ontology used can be queried with synonyms from other ontologies.
  3. Content/context of the data point is extractable: occurrence-based vs image-based.

Extended Definitions

AI-ready Data: Data that are ready to be included in AI infrastructure at scale.

AI-enabled Data: Data and/or metadata that has been augmented using AI, specifically for the purpose of being AI-ready.

FAIR4AI Data: Data that are FAIR enough for AI to be used in making data fully “AI-Ready”.

AI-Ready Working Distinction

Setting a "dividing line" to distinguish AI-ready from FAIR4AI.

  1. We use the term AI-ready to indicate that data are such that they can be included in AI infrastructure at scale.

  2. Appropriately formatted documentation regarding distribution information and other values required for plugging the data into a standard model pre-processing pipeline is provided.

These should not be taken to mean that the data are plug-and-play, but they may be. We provide some examples below to help clarify.

Example distinctions

  • Processing and standardization are not required. What is required is documentation of whether the data have been processed or standardized in some manner and, if so, how.

  • Data distribution should be documented in usable form (e.g., noting a long tail on a given value, so that a simple approach is to stratify on y = long_tail_value). Alternatively, splits may already be provided.
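The "stratify on y = long_tail_value" idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming an 80/10/10 split and per-item class labels; the function name `stratified_split` and the toy labels are hypothetical and do not come from the repository.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each class separately
    so that rare classes in a long-tailed distribution appear in every split
    when they have enough items.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = int(len(idxs) * fractions[1])
        n_test = int(len(idxs) * fractions[2])
        n_train = len(idxs) - n_val - n_test  # remainder goes to train
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# Toy long-tailed labels: one common class and one rare class.
labels = ["common"] * 100 + ["rare"] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Classes with very few items end up entirely in the training split under this rounding rule, which is one reason documenting the distribution alongside the split rationale is useful.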

Additional Considerations

  • Inclusion of model provenance for generated data.
  • Rate limiting information from data providers for server profiling (similar to what we did with distributed-downloader). If data can be streamed from the source, this is essential to success.
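To make the rate limiting consideration concrete, the sketch below shows how a downloader might turn provider limits into a client-side throttle. It assumes the limit is expressed as "at most N requests per period" with the period in seconds, read from `eco:rateLimitRequests` and `eco:rateLimitPeriod`; those units, the class name `SlidingWindowLimiter`, and the helper `limiter_from_metadata` are assumptions for illustration and are unrelated to how distributed-downloader is implemented.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window of period_seconds."""

    def __init__(self, max_requests, period_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.period = period_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps = deque()

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) < self.max_requests:
            return 0.0
        return self.period - (now - self._stamps[0])

    def record(self):
        """Register that a request was just sent."""
        self._stamps.append(self.clock())


def limiter_from_metadata(metadata, clock=time.monotonic):
    """Build a limiter from metadata (assumes numeric values, period in seconds)."""
    return SlidingWindowLimiter(
        metadata["eco:rateLimitRequests"], metadata["eco:rateLimitPeriod"], clock
    )


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
limiter = limiter_from_metadata(
    {"eco:rateLimitRequests": 2, "eco:rateLimitPeriod": 10},
    clock=lambda: fake_now[0],
)
limiter.record()
limiter.record()
print(limiter.wait_time())
```

In a real download loop the caller would sleep for `wait_time()` seconds before each request and call `record()` after sending it.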

- Use Darwin Core terms directly instead of redefining as eco: prefixed terms
- Add AI-specific properties for data distribution, preprocessing, splits
- Add model provenance properties for AI-generated data
- Add API/streaming properties for scalable data access
- Add recordType property to distinguish occurrence vs image-based data
- Update examples to demonstrate Darwin Core + AI-ready metadata
- Update README to clarify Darwin Core integration and FAIR4AI requirements

Co-authored-by: egrace479 <[email protected]>
Author

Copilot AI commented Dec 17, 2025

@copilot, please re-evaluate the proposed structure using the definition of FAIR4AI provided below. Additionally, this should be compatible with Darwin Core, not redefining those terms. The idea is to...

I've restructured ecoCroissant to properly implement FAIR4AI requirements and use Darwin Core directly. Key changes in commit c679a0f:

Darwin Core Integration:

  • Now using Darwin Core terms directly (e.g., dwc:scientificName, dwc:taxonRank, dwc:habitat) instead of redefining them
  • All biodiversity metadata uses standard Darwin Core vocabulary
  • Maintains queryability with synonyms from GBIF, NCBI, and EOL

FAIR4AI Requirements:

  1. Queryable without downloading: Metadata stored in Parquet format for querying without image downloads
  2. Ontology synonyms: Darwin Core terms link to multiple taxonomic databases
  3. Content/context extraction: Added eco:recordType property to distinguish "occurrence-based", "image-based", or "mixed" datasets
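A pipeline could use `eco:recordType` to decide how to handle a dataset before touching any data. The sketch below validates the value against the three options named above and reports whether image handling is needed; the function names and the error behaviour are hypothetical, not part of the specification.

```python
# The three values named in this comment for eco:recordType.
ALLOWED_RECORD_TYPES = {"occurrence-based", "image-based", "mixed"}


def record_type(metadata):
    """Return the dataset's eco:recordType, raising ValueError if missing or unknown."""
    value = metadata.get("eco:recordType")
    if value not in ALLOWED_RECORD_TYPES:
        raise ValueError(f"unexpected eco:recordType: {value!r}")
    return value


def needs_image_pipeline(metadata):
    """True when the dataset contains image records ('image-based' or 'mixed')."""
    return record_type(metadata) in {"image-based", "mixed"}


print(needs_image_pipeline({"eco:recordType": "image-based"}))
print(needs_image_pipeline({"eco:recordType": "occurrence-based"}))
```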

AI-Ready Properties:

  • eco:dataDistribution: Class distribution and stratification details
  • eco:preprocessingSteps: Documentation of standardization and augmentation
  • eco:trainTestSplit: Split rationale and proportions
  • eco:stratificationVariable: Variables used for stratification
  • eco:generatedBy: Model provenance for AI-generated annotations
  • eco:apiEndpoint, eco:streamingSupported: Streaming and rate limit info
  • eco:recordType, eco:imageAnnotationType: Context extraction for AI pipelines

The specification now clearly separates Darwin Core terms (used directly) from ecoCroissant AI-specific extensions.

Copilot AI changed the title from "Add ecoCroissant biodiversity metadata schema extension" to "Add ecoCroissant: Darwin Core integration with FAIR4AI for AI-ready biodiversity datasets" Dec 17, 2025
egrace479 requested a review from hlapp December 17, 2025 22:12
@egrace479
Member

@hlapp, made the changes we discussed.

egrace479 marked this pull request as ready for review December 17, 2025 22:13