Add ecoCroissant: Darwin Core integration with FAIR4AI for AI-ready biodiversity datasets #1
Co-authored-by: egrace479 <[email protected]>
@copilot, please re-evaluate the proposed structure using the definition of FAIR4AI provided below. Additionally, this should be compatible with Darwin Core, not redefining those terms. The idea is to build on the existing standards for better incorporation into the Croissant structure.

Distinguishing FAIR4AI and AI-Ready from FAIR

FAIR4AI Working Definitions
FAIR4AI extends FAIR standards with AI-specific requirements

Extended Definitions
AI-ready Data: Data that are ready to be included in AI infrastructure at scale.
AI-enabled Data: Data and/or metadata that has been augmented using AI, specifically for the purpose of being AI-ready.
FAIR4AI Data: Data that are FAIR enough for AI to be used in making data fully “AI-Ready”.

AI-Ready Working Distinction
Setting a "dividing line" to distinguish AI-ready from FAIR4AI.
These should not be taken to mean that the data are plug-and-play, but they may be. We provide some examples below to help clarify.

Example distinctions
Additional Considerations
- Use Darwin Core terms directly instead of redefining as eco: prefixed terms
- Add AI-specific properties for data distribution, preprocessing, splits
- Add model provenance properties for AI-generated data
- Add API/streaming properties for scalable data access
- Add recordType property to distinguish occurrence vs image-based data
- Update examples to demonstrate Darwin Core + AI-ready metadata
- Update README to clarify Darwin Core integration and FAIR4AI requirements

Co-authored-by: egrace479 <[email protected]>
I've restructured ecoCroissant to properly implement FAIR4AI requirements and use Darwin Core directly. Key changes in commit c679a0f:
Darwin Core Integration:
FAIR4AI Requirements:
AI-Ready Properties:
The specification now clearly separates Darwin Core terms (used directly) from ecoCroissant AI-specific extensions.
intended for reference (what we do and don't want), not use
@hlapp, made the changes we discussed.
Implements a Croissant extension that integrates Darwin Core terms with FAIR4AI requirements to make biodiversity datasets AI-ready. Rather than redefining existing standards, ecoCroissant uses Darwin Core terms directly and adds AI-specific metadata for machine learning applications.
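To illustrate what "uses Darwin Core terms directly" means in practice, here is a minimal Python sketch of compact-IRI expansion. It assumes the two prefix mappings shown in the example context of this description; the helper name `expand_term` is hypothetical and not part of the PR. Under that assumption, a `dwc:` term resolves to the TDWG namespace unchanged, while only `eco:` terms resolve to the extension namespace.

```python
# Prefix-to-namespace map, copied from the example context in this PR description.
CONTEXT = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "eco": "http://imageomics.org/ecoCroissant/",
}


def expand_term(compact: str, context: dict[str, str] = CONTEXT) -> str:
    """Expand a compact term such as 'dwc:habitat' into a full IRI.

    Hypothetical helper for illustration only; a real pipeline would
    normally rely on a JSON-LD processor instead.
    """
    prefix, sep, local = compact.partition(":")
    if not sep or prefix not in context:
        raise KeyError(f"undeclared prefix in {compact!r}")
    return context[prefix] + local


if __name__ == "__main__":
    # Darwin Core term: resolves to the TDWG namespace, not to an eco: copy.
    print(expand_term("dwc:scientificName"))
    # Extension term: resolves to the ecoCroissant namespace.
    print(expand_term("eco:recordType"))
```

Because the Darwin Core IRIs are untouched, tools that already understand `dwc:` terms should be able to read those fields without knowing anything about the extension.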
FAIR4AI Requirements Implemented
- recordType property distinguishes occurrence-based vs image-based data

Darwin Core Integration (docs/eco-spec.md)
ecoCroissant uses Darwin Core terms directly:
- dwc:scientificName, dwc:taxonRank, dwc:kingdom, dwc:family, dwc:genus, dwc:taxonID, dwc:higherClassification, dwc:vernacularName
- dwc:locality, dwc:habitat, dwc:continent, dwc:country, dwc:decimalLatitude, dwc:decimalLongitude, elevation/depth terms
- dwc:eventDate, dwc:year, dwc:month, dwc:day, dwc:lifeStage
- dwc:basisOfRecord, dwc:identifiedBy, dwc:identificationVerificationStatus, dwc:samplingProtocol

AI-Ready Properties (ecoCroissant Extensions)
- eco:dataDistribution, eco:trainTestSplit, eco:stratificationVariable, eco:dataSplitRationale
- eco:preprocessingSteps, eco:standardizationMethod
- eco:generatedBy, eco:modelConfidence, eco:humanVerified, eco:generationMethod
- eco:apiEndpoint, eco:streamingSupported, eco:rateLimitRequests, eco:rateLimitPeriod
- eco:recordType, eco:occurrenceToImageRatio, eco:imageAnnotationType
- eco:biome, eco:trophicLevel, eco:ecologicalRole, eco:speciesInteractions
- eco:iucnStatus, eco:populationTrend, eco:threats, eco:protectedArea

Schema (schema/eco-context.jsonld)
JSON-LD context defining the http://imageomics.org/ecoCroissant/ namespace with Darwin Core integration.

Example
{
  "@context": {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "eco": "http://imageomics.org/ecoCroissant/",
    "cr": "http://mlcommons.org/croissant/"
  },
  "dct:conformsTo": [
    "http://mlcommons.org/croissant/1.0",
    "http://imageomics.org/ecoCroissant/1.0"
  ],
  "dwc:scientificName": "Heliconius Kluk, 1780",
  "dwc:taxonRank": "genus",
  "dwc:habitat": ["tropical rainforest"],
  "dwc:basisOfRecord": "PreservedSpecimen",
  "eco:recordType": "image-based",
  "eco:dataDistribution": "long-tailed: 50 species, 10-1000 images each",
  "eco:trainTestSplit": "80/10/10 stratified by species",
  "eco:preprocessingSteps": ["resized to 224x224", "ImageNet normalization"]
}

Includes AI-ready TreeOfLife-200M example in examples/treeoflife-200m.json demonstrating Darwin Core integration with preprocessing, data distribution, and streaming metadata.
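As a quick sanity check on metadata like the example above, the following Python sketch reports any prefix that is used in a top-level key but not declared in `@context`. The function name `undeclared_prefixes` and the trimmed `EXAMPLE` document are assumptions made for illustration; this is not a validator shipped with the PR, and it only inspects top-level keys.

```python
import json

# Trimmed copy of the example above (a few keys only), for illustration.
EXAMPLE = json.loads("""
{
  "@context": {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "eco": "http://imageomics.org/ecoCroissant/",
    "cr": "http://mlcommons.org/croissant/"
  },
  "dct:conformsTo": ["http://mlcommons.org/croissant/1.0"],
  "dwc:scientificName": "Heliconius Kluk, 1780",
  "eco:recordType": "image-based"
}
""")


def undeclared_prefixes(doc: dict) -> set[str]:
    """Return prefixes used in top-level keys but missing from @context.

    Hypothetical helper: JSON-LD keywords (keys starting with '@') and
    keys without a colon are ignored.
    """
    declared = set(doc.get("@context", {}))
    used = {
        key.split(":", 1)[0]
        for key in doc
        if ":" in key and not key.startswith("@")
    }
    return used - declared


if __name__ == "__main__":
    print(undeclared_prefixes(EXAMPLE))
```

Run against the trimmed example, the check reports `dct`, since `dct:conformsTo` is used while the abbreviated context lists only `dwc`, `eco`, and `cr`. The `dct` prefix conventionally abbreviates the Dublin Core terms namespace `http://purl.org/dc/terms/`, so a complete file would presumably declare it alongside the others.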