Commit 37659db

Merge pull request #575 from VariantEffect/release-2025.5.0
Release 2025.5.0
2 parents 6419b01 + 4f67293 commit 37659db

135 files changed (+16581, -4631 lines)

.github/instructions/ai-prompt-engineering-safety-best-practices.instructions.md

Lines changed: 867 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# MaveDB API Copilot Instructions

## Core Directives & Control Principles

### Hierarchy of Operations
**These rules have the highest priority and must not be violated:**

1. **Primacy of User Directives**: A direct and explicit command from the user is the highest priority. If the user instructs you to use a specific tool, edit a file, or perform a specific search, that command **must be executed without deviation**, even if other rules would suggest it is unnecessary.

2. **Factual Verification Over Internal Knowledge**: When a request involves information that could be version-dependent or time-sensitive, or that requires specific external data (e.g., bioinformatics library documentation, latest genomics standards, API details), prioritize using tools to find the current, factual answer over relying on general knowledge.

3. **Adherence to MaveDB Philosophy**: In the absence of a direct user directive or the need for factual verification, all other rules regarding interaction, code generation, and modification must be followed within the context of bioinformatics and software development best practices.

### Interaction Philosophy for Bioinformatics
- **Code on Request Only**: The default response should be a clear, natural-language explanation. Do NOT provide code blocks unless explicitly asked, or if a small example is essential to illustrate a bioinformatics concept.
- **Direct and Concise**: Answers must be precise and free from unnecessary filler. Get straight to the solution for genomic data processing challenges.
- **Bioinformatics Best Practices**: All suggestions must align with established bioinformatics standards (HGVS, VRS, GA4GH) and proven genomics research practices.
- **Explain the Scientific "Why"**: Don't just provide code; explain the biological reasoning. Why is this approach standard in genomics? What scientific problem does this pattern solve?

## Related Instructions

**Domain-Specific Guidance**: This file provides MaveDB-specific development guidance. For specialized topics, reference these additional instruction files:

- **AI Safety & Ethics**: See `.github/instructions/ai-prompt-engineering-safety-best-practices.instructions.md` for comprehensive AI safety protocols, bias mitigation, responsible AI usage, and security frameworks
- **Python Standards**: Follow `.github/instructions/python.instructions.md` for Python-specific coding conventions, PEP 8 compliance, type hints, docstring requirements, and testing practices
- **Documentation Standards**: Reference `.github/instructions/markdown.instructions.md` for documentation formatting, content creation guidelines, and validation requirements
- **Prompt Engineering**: Use `.github/instructions/prompt.instructions.md` for creating effective prompts and AI interaction optimization
- **Instruction File Management**: See `.github/instructions/instructions.instructions.md` for guidelines on creating and maintaining instruction files

**Integration Principle**: These specialized files provide expert-level guidance in their respective domains. Apply their principles alongside the MaveDB-specific patterns documented here. When conflicts arise, prioritize the specialized file's guidance within its domain scope.

**Hierarchy for Conflicts**:
1. **User directives** (highest priority)
2. **MaveDB-specific bioinformatics patterns** (this file)
3. **Domain-specific specialized files** (python.instructions.md, etc.)
4. **General best practices** (lowest priority)

## Architecture Overview

MaveDB API is a bioinformatics database API for Multiplex Assays of Variant Effect (MAVE) datasets. The architecture follows these key patterns:

### Core Domain Model
- **Hierarchical URN system**: ExperimentSet (`urn:mavedb:00000001`) → Experiment (`00000001-a`) → ScoreSet (`00000001-a-1`) → Variant (`00000001-a-1` + # + variant number)
- **Temporary URNs** during development: `tmp:uuid` format, converted to permanent URNs on publication
- **Resource lifecycle**: Draft → Published (with background worker processing)
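
The URN hierarchy above can be illustrated with a short sketch that walks from a variant URN up to its experiment set. The regular expression and the `parent_urns` helper are hypothetical and written only for illustration; the project's real patterns live in `src/mavedb/lib/validation/urn_re.py` and may differ in details such as digit counts or the experiment suffix alphabet.

```python
import re

# Illustrative pattern only; NOT copied from urn_re.py.
# Each deeper level is optional but requires the level above it.
URN_PATTERN = re.compile(
    r"^urn:mavedb:(?P<set>\d{8})"
    r"(?:-(?P<experiment>[a-z]+)"
    r"(?:-(?P<score_set>\d+)"
    r"(?:#(?P<variant>\d+))?)?)?$"
)


def parent_urns(urn: str) -> list[str]:
    """Return the URN followed by its ancestors, most specific first."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a recognized MaveDB URN: {urn!r}")

    # Build the chain from the experiment set downwards.
    chain = [f"urn:mavedb:{match['set']}"]
    if match["experiment"]:
        chain.append(f"{chain[-1]}-{match['experiment']}")
    if match["score_set"]:
        chain.append(f"{chain[-1]}-{match['score_set']}")
    if match["variant"]:
        chain.append(f"{chain[-1]}#{match['variant']}")
    return chain[::-1]
```

For example, `parent_urns("urn:mavedb:00000001-a-1#5")` yields the variant URN, then the score set, experiment, and experiment set URNs in that order.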

### Service Architecture
- **FastAPI application** (`src/mavedb/server_main.py`) with router-based endpoint organization
- **Background worker** using ARQ/Redis for async processing (mapping, publication, annotation)
- **Multi-container setup**: API server, worker, PostgreSQL, Redis, external services (cdot-rest, dcd-mapping, seqrepo)
- **External bioinformatics services**: HGVS data providers, SeqRepo for sequence data, VRS mapping for variant representation

## Development Patterns

### Database & Models
- **SQLAlchemy 2.0** with declarative models in `src/mavedb/models/`
- **Alembic migrations** with manual migrations in `alembic/manual_migrations/`
- **Association tables** for many-to-many relationships (contributors, publications, keywords)
- **Enum classes** for controlled vocabularies (UserRole, ProcessingState, MappingState)

### Key Dependencies & Injections
```python
# Database session
def get_db() -> Generator[Session, Any, None]

# Worker queue
async def get_worker() -> AsyncGenerator[ArqRedis, Any]

# External data providers
def hgvs_data_provider() -> RESTDataProvider
def get_seqrepo() -> SeqRepo
```
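
The `Generator[...]` return types above suggest yield-based dependencies that hand out a resource and clean it up afterwards. The following is a minimal sketch of that shape only; `FakeSession` is a hypothetical stand-in for a SQLAlchemy `Session` so the example runs without a database, and the real function bodies are not shown in this file.

```python
from typing import Any, Generator


class FakeSession:
    """Hypothetical stand-in for sqlalchemy.orm.Session (illustration only)."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def get_db() -> Generator[FakeSession, Any, None]:
    session = FakeSession()
    try:
        # The endpoint body runs while the generator is suspended here.
        yield session
    finally:
        # Cleanup runs whether the endpoint returned or raised.
        session.close()


# Driving the generator by hand, roughly as a DI framework would.
gen = get_db()
db = next(gen)       # session is open while the "request" is handled
gen.close()          # finishing the request triggers the finally block
```

The `try`/`finally` around the `yield` is the important part: it keeps session cleanup in one place instead of in every endpoint.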

### Authentication & Authorization
- **ORCID JWT tokens** and **API keys** for authentication
- **Role-based permissions** with `Action` enum and `assert_permission()` helper
- **User data context** available via `UserData` dataclass
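
As a rough illustration of the role-based pattern named above, the sketch below pairs an `Action` enum with an `assert_permission()` helper. Everything here is an assumption made for the example: the enum members, the `roles` field, the `ROLE_ACTIONS` table, and the helper's signature are hypothetical and do not reflect the real code in `src/mavedb/lib/`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    # Hypothetical members, lowercase per the enum naming convention in this file.
    read = "read"
    update = "update"
    publish = "publish"


@dataclass
class UserData:
    # Hypothetical fields; the real dataclass may carry different data.
    username: str
    roles: set[str] = field(default_factory=set)


# Hypothetical mapping from role name to the actions it grants.
ROLE_ACTIONS = {
    "admin": {Action.read, Action.update, Action.publish},
    "viewer": {Action.read},
}


def has_permission(user: UserData, action: Action) -> bool:
    return any(action in ROLE_ACTIONS.get(role, set()) for role in user.roles)


def assert_permission(user: UserData, action: Action) -> None:
    """Raise instead of returning False so callers cannot ignore a denial."""
    if not has_permission(user, action):
        raise PermissionError(f"{user.username} may not {action.value}")
```

In a router, a denial like this would typically be translated into an HTTP 403 response rather than surfaced as a bare `PermissionError`.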

### Router Patterns
- Endpoints organized by resource type in `src/mavedb/routers/`
- **Dependency injection** for auth, DB sessions, and external services
- **Structured exception handling** with custom exception types
- **Background job enqueueing** for publish/update operations

## Development Commands

### Environment Setup
```bash
# Local development with Docker
docker-compose -f docker-compose-dev.yml up --build -d

# Direct Python execution (requires env vars)
export PYTHONPATH="${PYTHONPATH}:`pwd`/src"
uvicorn mavedb.server_main:app --reload
```

### Testing
```bash
# Core dependencies only (--without dev replaces the deprecated --no-dev flag)
poetry install --without dev
poetry run pytest tests/

# Full test suite with optional dependencies
poetry install --with dev --extras server
poetry run pytest tests/ --cov=src
```

### Database Management
```bash
# Run migrations
alembic upgrade head

# Create new migration
alembic revision --autogenerate -m "Description"

# Manual migration (for complex data changes)
# Place in alembic/manual_migrations/ and reference in version file
```

## Project Conventions

### Naming Conventions
- **Variables & functions**: `snake_case` (e.g., `score_set_id`, `create_variants_for_score_set`)
- **Classes**: `PascalCase` (e.g., `ScoreSet`, `UserData`, `ProcessingState`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAPPING_QUEUE_NAME`, `DEFAULT_LDH_SUBMISSION_BATCH_SIZE`)
- **Enum values**: `snake_case` (e.g., `ProcessingState.success`, `MappingState.incomplete`)
- **Database tables**: `snake_case` with descriptive association table names (e.g., `scoreset_contributors`, `experiment_set_doi_identifiers`)
- **API endpoints**: kebab-case paths (e.g., `/score-sets`, `/experiment-sets`)

### Documentation Conventions
*For general Python documentation standards, see `.github/instructions/python.instructions.md`. The following are MaveDB-specific additions:*

- **Algorithm explanations**: Include comments explaining complex logic, especially URN generation and bioinformatics operations
- **Design decisions**: Comment on why certain architectural choices were made
- **External dependencies**: Explain purpose of external bioinformatics libraries (HGVS, SeqRepo, etc.)
- **Bioinformatics context**: Document biological reasoning behind genomic data processing patterns

### Commenting Guidelines
**Core Principle: Write self-explanatory code. Comment only to explain WHY, not WHAT.**

**WRITE Comments For:**
- **Complex bioinformatics algorithms**: Variant mapping algorithms, external service interactions
- **Business logic**: Why specific validation rules exist, regulatory requirements
- **External API constraints**: Rate limits, data format requirements
- **Non-obvious calculations**: Score normalization, statistical methods
- **Configuration values**: Why specific timeouts, batch sizes, or thresholds were chosen

**AVOID Comments For:**
- **Obvious operations**: Variable assignments, simple loops, basic conditionals
- **Redundant descriptions**: Comments that repeat what the code clearly shows
- **Outdated information**: Comments that don't match current implementation

### Error Handling Conventions
- **Structured logging**: Always use `logger` with `extra=logging_context()` for correlation IDs
- **HTTP exceptions**: Use FastAPI `HTTPException` with appropriate status codes and descriptive messages
- **Custom exceptions**: Define domain-specific exceptions in `src/mavedb/lib/exceptions.py`
- **Worker job errors**: Send Slack notifications via `send_slack_error()` and log with full context
- **Validation errors**: Use Pydantic validators and raise `ValueError` with clear messages
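
The structured-logging convention above relies on the standard library's `extra` argument, which attaches each key as an attribute on the log record. The sketch below shows one way this could work; the `logging_context()` body, the `ContextVar`, and `handle_request()` are assumptions for illustration and not the project's implementation.

```python
import logging
import uuid
from contextvars import ContextVar

# Hypothetical per-request storage for the correlation ID.
_correlation_id: ContextVar[str] = ContextVar("correlation_id", default="unset")

logger = logging.getLogger("mavedb.example")


def logging_context() -> dict:
    """Return the fields that every log line in this request should carry."""
    return {"correlation_id": _correlation_id.get()}


def handle_request() -> str:
    """Assign a correlation ID, log with it, and return it to the caller."""
    correlation_id = str(uuid.uuid4())
    _correlation_id.set(correlation_id)
    # Keys in `extra` become attributes on the LogRecord, so formatters
    # and handlers can read record.correlation_id.
    logger.info("Processing score set upload", extra=logging_context())
    return correlation_id
```

Because the ID travels on the record rather than inside the message text, a JSON formatter or log aggregator can index it without parsing strings.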

### Code Style and Organization Conventions
*For general Python style conventions, see `.github/instructions/python.instructions.md`. The following are MaveDB-specific patterns:*

- **Async patterns**: Use `async def` for I/O operations, regular functions for CPU-bound work
- **Database operations**: Use SQLAlchemy 2.0 style with `session.scalars(select(...)).one()`
- **Pydantic models**: Separate request/response models with clear inheritance hierarchies
- **Bioinformatics data flow**: Structure code to clearly show genomic data transformations

### Testing Conventions
*For general Python testing standards, see `.github/instructions/python.instructions.md`. The following are MaveDB-specific patterns:*

- **Test function naming**: Use descriptive names that reflect bioinformatics operations (e.g., `test_cannot_publish_score_set_without_variants`)
- **Fixtures**: Use `conftest.py` for shared fixtures, especially database and worker setup
- **Mocking**: Use `unittest.mock.patch` for external bioinformatics services and worker jobs
- **Constants**: Define test data including genomic sequences and variants in `tests/helpers/constants.py`
- **Integration testing**: Test full bioinformatics workflows including external service interactions
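
The mocking convention above might look like the following in practice. This is a sketch under stated assumptions: `MappingClient`, `mapped_id`, the placeholder HGVS string, and the returned identifier are all invented for the example and do not correspond to real MaveDB classes or data.

```python
from unittest.mock import patch


class MappingClient:
    """Hypothetical stand-in for an external variant-mapping service client."""

    def map_variant(self, hgvs: str) -> dict:
        raise RuntimeError("network access is not available in tests")


def mapped_id(client: MappingClient, hgvs: str) -> str:
    """Code under test: asks the external service and extracts the ID."""
    return client.map_variant(hgvs)["id"]


def test_mapped_id_uses_mapping_service():
    # Replace the network call so the test is fast and deterministic.
    with patch.object(
        MappingClient, "map_variant", return_value={"id": "ga4gh:VA.example"}
    ) as mocked:
        assert mapped_id(MappingClient(), "NM_000000.0:c.1A>G") == "ga4gh:VA.example"

    # The mock records its calls, so the test can verify what was sent.
    mocked.assert_called_once_with("NM_000000.0:c.1A>G")
```

Patching at the boundary of the external service keeps the test focused on the project's own logic while still checking the arguments passed across that boundary.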

## Codebase Conventions

### URN Validation
- Use regex patterns from `src/mavedb/lib/validation/urn_re.py`
- Validate URNs in Pydantic models with `@field_validator`
- URN generation logic in `src/mavedb/lib/urns.py` and `temp_urns.py`
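
For temporary URNs, which this file describes as using a `tmp:uuid` format, generation and validation could be sketched as below. The pattern and both function names are assumptions for illustration; the real logic lives in the `urns.py`, `temp_urns.py`, and `urn_re.py` modules listed above and may differ.

```python
import re
import uuid

# Illustrative pattern for "tmp:" followed by a canonical lowercase UUID.
# NOT copied from urn_re.py.
TMP_URN_PATTERN = re.compile(
    r"^tmp:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def generate_temp_urn() -> str:
    """Create a temporary URN for a resource that has not been published."""
    return f"tmp:{uuid.uuid4()}"


def is_temp_urn(urn: str) -> bool:
    """Check the shape of a temporary URN without touching the database."""
    return TMP_URN_PATTERN.match(urn) is not None
```

A check like `is_temp_urn` is the kind of function a Pydantic `@field_validator` could call, raising `ValueError` when the value matches neither the temporary nor the permanent form.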

### Worker Jobs (ARQ/Redis)
- **Job definitions**: All background jobs in `src/mavedb/worker/jobs.py`
- **Settings**: Worker configuration in `src/mavedb/worker/settings.py` with function registry and cron jobs
- **Job patterns**:
  - Use `setup_job_state()` for logging context with correlation IDs
  - Implement exponential backoff with `enqueue_job_with_backoff()`
  - Handle database sessions within job context
  - Send Slack notifications on failures via `send_slack_error()`
- **Key job types**:
  - `create_variants_for_score_set` - Process uploaded CSV data
  - `map_variants_for_score_set` - External variant mapping via VRS
  - `submit_score_set_mappings_to_*` - Submit to external annotation services
- **Enqueueing**: Use `ArqRedis.enqueue_job()` from routers with correlation ID for request tracing
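
The exponential backoff mentioned above comes down to a delay that doubles with each attempt until it reaches a ceiling. The sketch below shows only that calculation; `backoff_delay`, its base of 5 seconds, and its cap of 300 seconds are hypothetical values chosen for the example, and this is not the implementation of `enqueue_job_with_backoff()`.

```python
def backoff_delay(
    attempt: int, base_seconds: float = 5.0, max_seconds: float = 300.0
) -> float:
    """Return the wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and is capped so that a
    long-failing job does not wait indefinitely between retries.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(max_seconds, base_seconds * (2 ** attempt))


# Delays for the first seven attempts with the example defaults.
delays = [backoff_delay(n) for n in range(7)]
```

With ARQ, a computed delay like this could be passed to `enqueue_job()` through its `_defer_by` argument so the retry is scheduled rather than run immediately.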

### View Models (Pydantic)
- **Base model** (`src/mavedb/view_models/base/base.py`) converts empty strings to None and uses camelCase aliases
- **Inheritance patterns**: `Base` → `Create` → `Modify` → `Saved` model hierarchy
- **Field validation**: Use `@field_validator` for single fields, `@model_validator(mode="after")` for cross-field validation
- **URN validation**: Validate URNs with regex patterns from `urn_re.py` in field validators
- **Transform functions**: Use functions in `validation/transform.py` for complex data transformations
- **Separate models**: Request (`Create`, `Modify`) vs response (`Saved`) models with different field requirements
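
The two behaviors attributed to the base model above, camelCase aliases and empty-string-to-None conversion, can be shown in plain Python without Pydantic. The helper names below are hypothetical and this sketch only mirrors the described behavior; the real base model implements it through Pydantic configuration and validators.

```python
from typing import Any


def to_camel(name: str) -> str:
    """Convert a snake_case field name to its camelCase alias."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def empty_to_none(value: Any) -> Any:
    """Treat an empty string as a missing value; leave everything else alone."""
    if isinstance(value, str) and value == "":
        return None
    return value


def normalize(fields: dict[str, Any]) -> dict[str, Any]:
    """Apply both rules to a mapping of snake_case field names to values."""
    return {to_camel(key): empty_to_none(value) for key, value in fields.items()}
```

For example, `normalize({"score_set_urn": "urn:mavedb:00000001-a-1", "short_description": ""})` returns a dictionary with the keys `scoreSetUrn` and `shortDescription`, the latter mapped to `None`.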

### External Integrations
- **HGVS/SeqRepo** for genomic sequence operations
- **DCD Mapping** for variant mapping and VRS transformation
- **CDOT** for transcript/genomic coordinate conversion
- **GA4GH VRS** for variant representation standardization
- **ClinGen services** for allele registry and linked data hub submissions

## Key Files to Reference
- `src/mavedb/models/score_set.py` - Primary data model patterns
- `src/mavedb/routers/score_sets.py` - Complex router with worker integration
- `src/mavedb/worker/jobs.py` - Background processing patterns
- `src/mavedb/view_models/score_set.py` - Pydantic model hierarchy examples
- `src/mavedb/server_main.py` - Application setup and dependency injection
- `src/mavedb/data_providers/services.py` - External service integration patterns
- `src/mavedb/lib/authentication.py` - Authentication and authorization patterns
- `tests/conftest.py` - Test fixtures and database setup
- `docker-compose-dev.yml` - Service architecture and dependencies
