Detailed step-by-step instructions for ontology curation workflow
This guide walks you through the complete curation workflow, from getting your assignment to submitting a pull request. Follow these steps carefully, especially if you're new to ontology curation or Git workflows.
- One-Time Setup
- Getting Your Assignment
- Understanding Your Assignment
- Creating Definitions with LLMs
- Finding Definition Sources
- Validating Your Work
- Git Workflow
- Syncing Your Fork
- Troubleshooting
- Best Practices
These steps only need to be done once.
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install just (command runner)
uv tool install rust-just # PyPI package that ships the just binary
# Verify installations
uv --version
just --version
git --version # Should already be installed
For Interns (using forks):
# 1. Fork the repo on GitHub (click "Fork" button)
# 2. Clone your fork
git clone https://github.com/YOUR-USERNAME/metpo-kgm-studio.git
cd metpo-kgm-studio
# 3. Add upstream remote
git remote add upstream https://github.com/berkeleybop/metpo-kgm-studio.git
git remote -v # Verify
For Team Members (direct access):
git clone https://github.com/berkeleybop/metpo-kgm-studio.git
cd metpo-kgm-studio
just setup
This will:
- Install Python dependencies
- Create necessary directories
- Verify everything is working
The METPO class definitions are stored in a Google Sheet.
To split this into individual curator assignments:
just fetch-assignments
This will:
- Download the current ROBOT template from Google Sheets
- Split it into 3 overlapping assignments
- Save files as assignments/curator1.tsv, assignments/curator2.tsv, etc.
- Include 30% overlap for inter-curator agreement assessment
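The actual splitting logic lives behind the fetch-assignments recipe and is not shown in this guide. Purely as an illustration of what "overlapping assignments" can mean, the sketch below gives a shared 30% subset to every curator and deals the remaining classes out round-robin. The function name `split_with_overlap`, the fixed seed, and the strategy itself are assumptions, not the project's implementation.

```python
import random


def split_with_overlap(class_ids, n_curators=3, overlap=0.3, seed=0):
    """Hypothetical splitter: a shared subset goes to everyone, the rest is dealt out."""
    ids = list(class_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_shared = round(len(ids) * overlap)
    shared, rest = ids[:n_shared], ids[n_shared:]
    assignments = [list(shared) for _ in range(n_curators)]
    for i, class_id in enumerate(rest):
        assignments[i % n_curators].append(class_id)  # round-robin the remainder
    return assignments


# Made-up IDs, only to exercise the function.
ids = [f"METPO:{n:07d}" for n in range(1, 101)]
parts = split_with_overlap(ids)
common = set(parts[0]) & set(parts[1]) & set(parts[2])
print([len(p) for p in parts], len(common))
```

Classes in the shared subset are curated by everyone, which is what makes an inter-curator agreement comparison possible.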
just assignment-stats
Output example:
Assignment Statistics:
=====================
curator1.tsv: 45 classes
curator2.tsv: 47 classes
curator3.tsv: 46 classes
Determine which curator you are (curator1, curator2, or curator3) and open your file:
head -n 20 assignments/curator1.tsv # Preview first 20 lines
Your assignment file is a TSV (tab-separated values) file following ROBOT template conventions.
Structure:
ID LABEL A definition A definition source
>ID RDFS:label IAO:0000115 ...
A SPLIT=| SPLIT=| ...
METPO:0000123 methanogenesis [TO BE FILLED] [TO BE FILLED]
METPO:0000124 anaerobic respiration [TO BE FILLED] [TO BE FILLED]
...
Key rows:
- Row 1 (header): Human-readable column names
- Row 2 (ROBOT markers): Starts with >, defines RDF properties
- Row 3 (ROBOT config): Starts with A, configuration options
- Row 4+: Actual class data
Your task: Fill in the A definition and A definition source columns for each class.
Classes may be:
- Missing definitions: Blank A definition column
- Missing sources: Definition exists but no A definition source
- Poor quality definitions: Don't follow OBO Foundry principles
Focus on these in order of priority (ask your supervisor for specific priorities).
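To see which classes still need work, the file can be scanned programmatically. The sketch below assumes the four-column layout shown earlier and treats both blank cells and the literal [TO BE FILLED] as missing. The function name `find_gaps` and the sample rows are made up for illustration.

```python
import csv
import io

# Made-up sample in the layout shown above: header, two ROBOT rows, then class rows.
SAMPLE = (
    "ID\tLABEL\tA definition\tA definition source\n"
    ">ID\tRDFS:label\tIAO:0000115\t...\n"
    "A\tSPLIT=|\tSPLIT=|\t...\n"
    "METPO:0000123\tmethanogenesis\t[TO BE FILLED]\t[TO BE FILLED]\n"
    "METPO:0000124\tanaerobic respiration\tAn energy-yielding process.\t\n"
)

PLACEHOLDER = "[TO BE FILLED]"


def find_gaps(handle):
    """Return (missing_definitions, missing_sources) as lists of class IDs."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, data = rows[0], rows[3:]  # rows 2 and 3 are ROBOT rows, not classes
    missing_defs, missing_srcs = [], []
    for row in data:
        if not row:
            continue  # skip blank lines
        record = dict(zip(header, row))
        definition = (record.get("A definition") or "").strip()
        source = (record.get("A definition source") or "").strip()
        if definition in ("", PLACEHOLDER):
            missing_defs.append(record["ID"])
        elif source in ("", PLACEHOLDER):
            missing_srcs.append(record["ID"])
    return missing_defs, missing_srcs


gaps = find_gaps(io.StringIO(SAMPLE))
print(gaps)
```

For a real file, pass an open handle instead of the sample, for example `open("assignments/curator1.tsv", newline="")`.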
Pick a class from your assignment file. Example:
- Class ID: METPO:0000123
- Class Label: methanogenesis
- Current Definition: [blank or needs improvement]
Before using an LLM:
- Research the term: Read about it in Wikipedia, textbooks, or papers
- Understand the biology: What is this process/trait/environment?
- Identify parent classes: Where does this fit in the ontology hierarchy?
Never blindly trust an LLM without understanding the domain!
- Open prompts/templates/definition-generation.md
- Copy the template
- Fill in the template variables:
  - {CLASS_ID}: METPO:0000123
  - {CLASS_LABEL}: methanogenesis
  - {PARENT_CLASSES}: anaerobic respiration (METPO:0000120)
  - {EXISTING_DEFINITION}: [current definition or "None"]
- Save your customized prompt to prompts/executed/:
  # Naming format: YYYY-MM-DD_CLASSID_description.md
  cp prompts/templates/definition-generation.md \
    prompts/executed/2025-10-02_METPO_0000123_definition-generation.md
  # Edit to fill in variables
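Filling the variables by hand works, but the same steps can be scripted. The following is a minimal sketch: the template text is a stand-in (the real one is in prompts/templates/definition-generation.md), and the helper names `fill_template` and `executed_prompt_name` are hypothetical.

```python
from datetime import date

# Stand-in template; it only borrows the placeholder names from the list above.
TEMPLATE = (
    "Class: {CLASS_ID} ({CLASS_LABEL})\n"
    "Parents: {PARENT_CLASSES}\n"
    "Existing definition: {EXISTING_DEFINITION}\n"
)


def fill_template(template, values):
    """Replace each {NAME} placeholder; any other braces are left untouched."""
    for name, value in values.items():
        template = template.replace("{" + name + "}", value)
    return template


def executed_prompt_name(class_id, description, day):
    """Build YYYY-MM-DD_CLASSID_description.md, turning ':' in the ID into '_'."""
    return f"{day.isoformat()}_{class_id.replace(':', '_')}_{description}.md"


prompt = fill_template(TEMPLATE, {
    "CLASS_ID": "METPO:0000123",
    "CLASS_LABEL": "methanogenesis",
    "PARENT_CLASSES": "anaerobic respiration (METPO:0000120)",
    "EXISTING_DEFINITION": "None",
})
name = executed_prompt_name("METPO:0000123", "definition-generation", date(2025, 10, 2))
print(name)
print(prompt)
```

Plain string replacement is used instead of `str.format` so that literal braces elsewhere in a template (for example a JSON output schema) do not raise errors.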
Copy your customized prompt and paste it into:
- Claude (recommended): https://claude.ai
- ChatGPT: https://chat.openai.com
- CBORG (if available): Ask your supervisor for access
Copy the LLM's response and save it to outputs/raw/[your-name]/:
# Create JSON file with output
cat > outputs/raw/curator1/2025-10-02_METPO_0000123_output.json << 'EOF'
{
"class_id": "METPO:0000123",
"class_label": "methanogenesis",
"proposed_definition": "An anaerobic respiration process in which methane is produced as the primary metabolic end product, typically using carbon dioxide or acetic acid as terminal electron acceptors.",
"confidence": "high",
"reasoning": "...",
"suggested_sources": ["PMID:15073711", "PMID:23645609"]
}
EOF
Ask yourself:
- ✅ Does this definition make biological sense?
- ✅ Does it follow genus-differentia form ("An [parent] that [characteristics]")?
- ✅ Is it clear and unambiguous?
- ✅ Does it avoid circularity?
- ✅ Are the suggested sources real and relevant?
If NO to any question: Revise the prompt or definition!
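Before the manual review, it can help to confirm that the saved file is well-formed. This sketch checks only structure, using the keys from the example output above; whether those keys are required by the project's tooling is an assumption, and `check_output` is a hypothetical helper.

```python
import json

# Keys taken from the example output; treating all of them as required is an assumption.
REQUIRED_KEYS = {
    "class_id", "class_label", "proposed_definition",
    "confidence", "reasoning", "suggested_sources",
}


def check_output(text):
    """Return a list of structural problems in a saved LLM output (empty means none)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(data))]
    if not isinstance(data.get("suggested_sources", []), list):
        problems.append("suggested_sources should be a list")
    return problems


good = json.dumps({
    "class_id": "METPO:0000123",
    "class_label": "methanogenesis",
    "proposed_definition": "...",
    "confidence": "high",
    "reasoning": "...",
    "suggested_sources": ["PMID:15073711"],
})
print(check_output(good))
print(check_output('{"class_id": "METPO:0000123"}'))
```

A clean result says nothing about biological accuracy; the review questions above still apply.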
Good definitions need authoritative sources. Use the definition-source-finding.md template if the LLM didn't suggest good sources.
- PubMed IDs: PMID:12345678
  - Search: https://pubmed.ncbi.nlm.nih.gov/
  - Verify: Open the paper and check it supports your definition
- DOIs: DOI:10.1234/example
  - Search: https://doi.org/ or journal websites
  - Verify: Read abstract/methods
- ISBNs: ISBN:1234567890 (for textbooks)
  - Example: Madigan's Brock Biology of Microorganisms
- URLs: https://example.com (only for highly authoritative sources)
  - Include access date in notes
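The formats listed above can be recognized with simple patterns. The regular expressions below are assumptions chosen to match the examples in this guide; the project's validator may apply different rules, and `source_kind` is a hypothetical helper.

```python
import re

# Illustrative patterns only; they mirror the example formats listed above.
PATTERNS = {
    "PMID": re.compile(r"PMID:\d{1,8}"),
    "DOI": re.compile(r"DOI:10\.\d{4,9}/\S+"),
    "ISBN": re.compile(r"ISBN:(\d{9}[\dX]|\d{13})"),
    "URL": re.compile(r"https?://\S+"),
}


def source_kind(source):
    """Return 'PMID', 'DOI', 'ISBN', 'URL', or None if the format is unrecognized."""
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(source.strip()):
            return kind
    return None


for example in ["PMID:15073711", "DOI:10.1038/nrmicro2386", "ISBN:1234567890", "pmid 123"]:
    print(example, "->", source_kind(example))
```

A matching format only shows that the identifier is shaped correctly. It does not show that the paper exists or that it supports the definition.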
✅ DO USE:
- Peer-reviewed journal articles
- Review articles (preferred over primary research)
- Standard microbiology textbooks
- Authoritative databases (e.g., KEGG, MetaCyc)
❌ DON'T USE:
- Wikipedia
- Non-peer-reviewed sources
- Preprints (unless exceptional)
- Blog posts or forums
CRITICAL: Always verify that PMIDs and DOIs exist and support your definition!
# Example: Verify PMID (open is the macOS command; on Linux use xdg-open)
open "https://pubmed.ncbi.nlm.nih.gov/15073711"
# Example: Verify DOI
open "https://doi.org/10.1038/nrmicro2386"
Run validators to check your definitions against OBO Foundry principles:
# Validate your assignment file
just validate-file assignments/curator1.tsv
Output example:
✓ Definition is present
✓ Definition length appropriate (127 chars)
✗ [WARNING]: Definition may not follow genus-differentia form
✓ No obvious circular definition detected
✓ Class ID format valid: METPO:0000123
✓ 2 valid source(s) found
| Error | Meaning | How to Fix |
|---|---|---|
| "Definition too short" | < 20 characters | Add more detail about distinguishing characteristics |
| "Definition too long" | > 500 characters | Make more concise; split into multiple sentences if needed |
| "Does not follow genus-differentia form" | Missing "An [X] that..." pattern | Rewrite as "An [parent class] that [characteristics]" |
| "Definition may be circular" | Uses term in its own definition | Remove term name from definition text |
| "Invalid source format" | Wrong PMID/DOI format | Check format: PMID:12345678 or DOI:10.1234/example |
| "Label violates FP-012" | Incorrect capitalization | Change to lowercase unless proper noun/acronym |
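To build intuition for what these messages mean, here is a rough sketch of how such checks could be written. The length thresholds come from the table; the genus-differentia and circularity tests are simple heuristics invented for this example and are not the validator's actual rules.

```python
import re

MIN_LEN, MAX_LEN = 20, 500  # thresholds from the table above


def check_definition(label, definition):
    """Return warning strings for one definition; an empty list means no findings."""
    warnings = []
    text = definition.strip()
    if len(text) < MIN_LEN:
        warnings.append("Definition too short")
    if len(text) > MAX_LEN:
        warnings.append("Definition too long")
    # Heuristic: starts with "A"/"An", then a genus, then a differentia connector.
    if not re.match(r"^(A|An)\s+.+?\s+(that|which|in which|where|whose)\b", text):
        warnings.append("Does not follow genus-differentia form")
    # Heuristic: the class label itself appears inside the definition text.
    if re.search(rf"\b{re.escape(label)}\b", text, flags=re.IGNORECASE):
        warnings.append("Definition may be circular")
    return warnings


good = (
    "An anaerobic respiration process in which methane is produced as the "
    "primary metabolic end product."
)
print(check_definition("methanogenesis", good))
print(check_definition("methanogenesis", "Methanogenesis."))
```

Heuristics like these can only flag surface patterns, which is why the manual questions that follow still matter.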
Beyond automated checks, ask yourself:
- Would a microbiologist understand this definition?
- Is it scientifically accurate?
- Does it clearly distinguish this class from siblings?
- Are the sources authoritative and accessible?
Follow the issue → branch → commits → PR workflow.
Naming convention: curator[number]-batch[number] or curator[number]-CLASSID
# Create and switch to new branch
just new-branch curator1 1
# Or manually:
git checkout -b curator1-batch1
Edit your assignment file (assignments/curator1.tsv) to add:
- Definitions in the A definition column
- Sources in the A definition source column (use | to separate multiple sources)
Example row after curation:
METPO:0000123 methanogenesis An anaerobic respiration process in which methane is produced as the primary metabolic end product, typically using carbon dioxide or acetic acid as terminal electron acceptors. PMID:15073711|PMID:23645609
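If you prepare rows in a script rather than a spreadsheet editor, the main things to get right are the tab separators and the | between sources. A minimal sketch, assuming the four-column layout used in this guide; `curated_row` is a hypothetical helper.

```python
def curated_row(class_id, label, definition, sources):
    """Join the four columns with tabs; multiple sources are separated by '|'."""
    fields = [class_id, label, definition.strip(), "|".join(sources)]
    if any("\t" in field or "\n" in field for field in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


row = curated_row(
    "METPO:0000123",
    "methanogenesis",
    "An anaerobic respiration process in which methane is produced as the "
    "primary metabolic end product.",
    ["PMID:15073711", "PMID:23645609"],
)
print(row)
```

Rejecting embedded tabs and newlines keeps a pasted definition from silently shifting columns.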
just validate-file assignments/curator1.tsv
Only commit if validation passes or you understand and accept the warnings!
# Add all changes
git add .
# Or selectively:
git add assignments/curator1.tsv
git add prompts/executed/2025-10-02_METPO_0000123_*.md
git add outputs/raw/curator1/
Good commit messages:
git commit -m "Add definitions for METPO:0000123-0000130 (methanogenesis pathway)"
git commit -m "Fix definition sources for METPO:0000145"
git commit -m "Improve genus-differentia form for respiratory processes"
Bad commit messages:
git commit -m "updates"
git commit -m "stuff"
git commit -m "fixed things"
# For forks:
git push origin curator1-batch1
# For direct access:
git push origin curator1-batch1
- Go to GitHub: https://github.com/berkeleybop/metpo-kgm-studio
- Click "Pull requests" → "New pull request"
- Select your branch
- Write a clear PR description:
## Summary
Added definitions for 10 methanogenesis-related classes (METPO:0000123-0000132)
## Changes
- Added genus-differentia definitions following FP-006
- Included PubMed sources for all definitions
- Validated with `just validate-file`
## Notes
- Some warnings about definition length, but necessary for accuracy
- Used Thauer et al. 2008 review as primary source (PMID:15073711)
## Checklist
- [x] Definitions follow genus-differentia form
- [x] All sources verified
- [x] Validation passes
- [x] Prompt templates saved to prompts/executed/
- Request review from Montana, Mark, or supervisor
- Address any feedback
- Celebrate when merged! 🎉
For interns using forks: Regularly sync with the main repository to get updates.
# Fetch upstream changes
git fetch upstream
# Switch to main branch
git checkout main
# Merge upstream changes
git merge upstream/main
# Or use the helper command:
just sync-upstream
# Push to your fork
git push origin main
When to sync:
- Before starting new work
- When main repository has updates
- Weekly (to avoid large divergence)
Problem: You don't have write access to the repository.
Solution (for interns):
- Make sure you forked the repository
- Check your remote URLs:
  git remote -v
- origin should point to YOUR fork, not the main repo
- If wrong:
  git remote set-url origin https://github.com/YOUR-USERNAME/metpo-kgm-studio.git
Problem: Pre-commit hooks failed.
Solution:
- Read the error messages carefully
- Fix the issues (usually formatting, spelling, or linting)
- Run just format to auto-fix some issues
- Try committing again
Problem: Your changes conflict with upstream changes.
Solution:
- Don't panic! This is normal.
- Git will mark conflicts in files with <<<<<<<, =======, >>>>>>>
- Edit files to resolve conflicts
- git add the resolved files
- git commit to complete the merge
- Ask for help if stuck!
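Leftover conflict markers are easy to miss in a long TSV file. The sketch below lists the lines that begin with one of the three markers, so you can confirm none remain before running git add. The helper name `conflict_lines` is hypothetical, and the check is a plain prefix match, so a line that legitimately starts with ======= would also be reported.

```python
def conflict_lines(text):
    """Return 1-based line numbers that start with a Git conflict marker."""
    markers = ("<<<<<<< ", "=======", ">>>>>>> ")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.startswith(markers)
    ]


# Made-up conflicted content for illustration.
conflicted = (
    "ID\tLABEL\tA definition\n"
    "<<<<<<< HEAD\n"
    "METPO:0000123\tmethanogenesis\tmy definition\n"
    "=======\n"
    "METPO:0000123\tmethanogenesis\tupstream definition\n"
    ">>>>>>> upstream/main\n"
)
print(conflict_lines(conflicted))
```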
Problem: Dependencies not installed.
Solution:
just install
# Or:
uv sync --group dev
Problem: PMID or DOI doesn't exist or doesn't support the definition.
Solution:
- Search PubMed/DOI yourself for the topic
- Find real sources that support the definition
- Update the LLM prompt to be more specific about verifiable sources
- Never use fake sources
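When checking many sources, it saves time to turn each identifier into the page you need to open. A minimal sketch using the same PubMed and doi.org URL forms shown earlier in this guide; `verification_url` is a hypothetical helper, and it does not contact either service.

```python
def verification_url(source):
    """Map 'PMID:...' or 'DOI:...' to the page a curator should open, else None."""
    prefix, _, ident = source.strip().partition(":")
    if prefix == "PMID" and ident.isdigit():
        return f"https://pubmed.ncbi.nlm.nih.gov/{ident}"
    if prefix == "DOI" and ident.startswith("10."):
        return f"https://doi.org/{ident}"
    return None


for source in "PMID:15073711|DOI:10.1038/nrmicro2386".split("|"):
    print(source, "->", verification_url(source))
```

Each URL can be passed to `webbrowser.open` from the standard library to open it in the default browser on any platform. Reading the paper is still the actual verification step.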
- ✅ Understand the biology before using LLMs
- ✅ Verify all facts against authoritative sources
- ✅ Ask domain experts (Montana, Mark, Chris) if unsure
- ✅ Prefer review articles over primary research for sources
- ✅ Start with approved prompt templates
- ✅ Save all prompts and outputs to git
- ✅ Critically review all LLM outputs
- ✅ Iterate on prompts if quality is poor
- ❌ Never trust LLM outputs blindly
- ❌ Never use LLM-suggested sources without verifying they exist
- ✅ Commit often with clear messages
- ✅ One logical change per commit
- ✅ Validate before committing
- ✅ Keep branches focused and short-lived
- ✅ Sync regularly with upstream
- ❌ Don't commit broken code
- ❌ Don't commit huge batches without validation
- ✅ Ask questions early and often
- ✅ Review others' PRs when asked
- ✅ Give constructive feedback
- ✅ Document your decisions (commit messages, PR descriptions)
- ✅ Communicate blockers or challenges
- ❌ Don't struggle silently
- ❌ Don't submit PRs without review
- ✅ Start with easier classes to build confidence
- ✅ Break work into small batches (5-10 classes)
- ✅ Take breaks between batches
- ✅ Track your progress: just progress curator1
- ✅ Set realistic goals (quality over quantity)
Stuck? Confused? Not sure if something is right?
- Ontology/Biology Questions: Montana or Mark
- Python/Technical Questions: Sujay
- LLM/Prompting Questions: Chris
- Git/GitHub Questions: Anyone on the team
- General Workflow: This guide or README.md
How to ask for help:
- Describe what you're trying to do
- Describe what's happening (include error messages)
- Show what you've tried
- Ask a specific question
Good example:
"I'm trying to create a definition for METPO:0000145 (sulfate reduction). The LLM suggested 'A process that reduces sulfate' but validation says it doesn't follow genus-differentia form. Should I include the parent class 'anaerobic respiration' in the definition?"
Less helpful:
"The validator doesn't like my definition. Help?"
- OBO Foundry Principles:
- Chris Mungall's Best Practices: https://berkeleybop.org/best_practice/
- ROBOT Documentation: http://robot.obolibrary.org/
- Git Tutorials:
- Python Type Hints: https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html
Questions? Suggestions for improving this guide?
Open an issue or ask in your next team meeting!
Last Updated: 2025-10-02