Commit e0fdc5d

docs(rna): update README with dual-pipeline strategy and ENA documentation
1 parent 1ce49cb commit e0fdc5d

55 files changed, +2143 -89 lines changed

.cursorrules

Lines changed: 30 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,11 @@
3030
- **Networks**: `output/networks/<network_type>/` (e.g., `ppi/`, `regulatory/`, `pathways/`)
3131
- **ML**: `output/ml/<task>/` (e.g., `classification/`, `regression/`, `features/`)
3232
- **Multi-Omics**: `output/multiomics/<integration>/` (e.g., `integrated/`, `plots/`)
33+
- **Long Read**: `output/longread/<analysis_type>/` (e.g., `basecalling/`, `assembly/`, `methylation/`)
34+
- **Metagenomics**: `output/metagenomics/<analysis_type>/` (e.g., `amplicon/`, `assembly/`, `functional/`)
35+
- **Structural Variants**: `output/structural_variants/<analysis_type>/` (e.g., `detection/`, `annotation/`)
36+
- **Spatial**: `output/spatial/<analysis_type>/` (e.g., `clustering/`, `deconvolution/`, `integration/`)
37+
- **Pharmacogenomics**: `output/pharmacogenomics/<analysis_type>/` (e.g., `alleles/`, `clinical/`, `reports/`)
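The `output/<module>/<analysis_type>/` convention in the list above can be sketched with `pathlib`. This is only an illustration: the helper name `module_output_dir` and its `base` parameter are hypothetical and do not appear in the repository.

```python
from __future__ import annotations

from pathlib import Path


def module_output_dir(module: str, analysis_type: str, base: str | Path = "output") -> Path:
    """Return <base>/<module>/<analysis_type> as a Path (hypothetical helper)."""
    for part in (module, analysis_type):
        # Reject empty names and anything that could escape the base directory.
        if not part or "/" in part or part in {".", ".."}:
            raise ValueError(f"invalid path component: {part!r}")
    return Path(base) / module / analysis_type


# Example: the long-read methylation directory named in the list above.
print(module_output_dir("longread", "methylation").as_posix())
```

The helper only builds the path; creating the directory (for example with `Path.mkdir(parents=True, exist_ok=True)`) is left to the caller.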
3338

3439
## Path and I/O
3540

@@ -167,6 +172,11 @@ with io.open_text_auto("data/large_file.txt.gz") as f:
167172
- **Networks Module**: Use prefix `NET_` (e.g., `NET_THREADS`, `NET_WORK_DIR`)
168173
- **ML Module**: Use prefix `ML_` (e.g., `ML_THREADS`, `ML_WORK_DIR`, `ML_MODEL_DIR`)
169174
- **Multi-Omics Module**: Use prefix `MULTI_` (e.g., `MULTI_THREADS`, `MULTI_WORK_DIR`)
175+
- **Long Read Module**: Use prefix `LR_` (e.g., `LR_THREADS`, `LR_WORK_DIR`)
176+
- **Metagenomics Module**: Use prefix `META_` (e.g., `META_THREADS`, `META_WORK_DIR`)
177+
- **Structural Variants Module**: Use prefix `SV_` (e.g., `SV_THREADS`, `SV_WORK_DIR`)
178+
- **Spatial Module**: Use prefix `SPATIAL_` (e.g., `SPATIAL_THREADS`, `SPATIAL_WORK_DIR`)
179+
- **Pharmacogenomics Module**: Use prefix `PHARMA_` (e.g., `PHARMA_THREADS`, `PHARMA_DB_PATH`)
170180

171181
### Configuration File Structure
172182
```yaml
@@ -202,7 +212,7 @@ def load_domain_config(config_file: str | Path, prefix: str = "DOMAIN") -> Domai
202212
- RNA: `AmalgkitWorkflowConfig` with prefix `"AK"`
203213
- GWAS: `GWASWorkflowConfig` with prefix `"GWAS"`
204214
- Life Events: `LifeEventsWorkflowConfig` with prefix `"LE"`
205-
- Other modules: Follow pattern `{MODULE}_` prefix (e.g., `DNA_`, `PROT_`, `EPI_`, `ONT_`, `PHEN_`, `ECO_`, `MATH_`, `INFO_`, `VIZ_`, `SIM_`, `SC_`, `QC_`, `NET_`, `ML_`, `MULTI_`)
215+
- Other modules: Follow pattern `{MODULE}_` prefix (e.g., `DNA_`, `PROT_`, `EPI_`, `ONT_`, `PHEN_`, `ECO_`, `MATH_`, `INFO_`, `VIZ_`, `SIM_`, `SC_`, `QC_`, `NET_`, `ML_`, `MULTI_`, `LR_`, `META_`, `SV_`, `SPATIAL_`, `PHARMA_`)
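One way to picture the `{MODULE}_` prefix convention is a small function that gathers the environment variables for one module. This is a sketch under assumptions: `collect_prefixed` is a hypothetical name, and the diff does not show how `load_domain_config` actually applies such overrides.

```python
from __future__ import annotations

import os
from collections.abc import Mapping


def collect_prefixed(prefix: str, environ: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return variables that start with `prefix`, with the prefix stripped and keys lowercased."""
    env = os.environ if environ is None else environ
    return {key[len(prefix):].lower(): value for key, value in env.items() if key.startswith(prefix)}


# Example with an explicit mapping instead of the real process environment.
overrides = collect_prefixed(
    "LR_",
    {"LR_THREADS": "8", "LR_WORK_DIR": "/tmp/lr", "SV_THREADS": "4"},
)
print(overrides)
```

Passing the mapping explicitly keeps the function easy to test; calling it with only the prefix reads from `os.environ`.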
206216

207217
## Code Quality Policy (STRICTLY NO MOCKS/FAKES/PLACEHOLDERS)
208218

@@ -303,6 +313,12 @@ Module-specific rules are organized in the `cursorrules/` directory. Each module
303313
- `cursorrules/networks.cursorrules` - Biological network analysis
304314
- `cursorrules/ml.cursorrules` - Machine learning for biological data
305315
- `cursorrules/multiomics.cursorrules` - Multi-omic data integration
316+
- `cursorrules/longread.cursorrules` - Long-read sequencing (PacBio/Nanopore)
317+
- `cursorrules/metagenomics.cursorrules` - Metagenomic analysis (amplicon, shotgun)
318+
- `cursorrules/structural_variants.cursorrules` - CNV/SV detection and annotation
319+
- `cursorrules/spatial.cursorrules` - Spatial transcriptomics (Visium, MERFISH, Xenium)
320+
- `cursorrules/pharmacogenomics.cursorrules` - Clinical pharmacogenomics
321+
- `cursorrules/menu.cursorrules` - Interactive menu and discovery system
306322

307323
**See `cursorrules/README.md` for detailed information about the modular structure.**
308324

@@ -489,6 +505,19 @@ Each module should have:
489505
- **Quality → All**: Quality control for all data types
490506
- **Simulation → All**: Synthetic data generation for testing
491507
- **Multi-Omics**: Integration of DNA, RNA, protein, epigenome, and other omics types
508+
- **Longread → DNA**: Long-read variant calling and genomic coordinates
509+
- **Longread → Epigenome**: Methylation from modified base detection
510+
- **Longread → Structural Variants**: SV detection complements short-read methods
511+
- **Metagenomics → Ecology**: Community diversity from amplicon/shotgun data
512+
- **Metagenomics → Networks**: Microbial co-occurrence networks
513+
- **Metagenomics → Ontology**: Functional annotation via GO/KEGG
514+
- **Structural Variants → DNA**: Genomic coordinates and variant calling
515+
- **Structural Variants → GWAS**: Structural variants in association studies
516+
- **Spatial → Single-Cell**: scRNA-seq reference for deconvolution
517+
- **Spatial → Networks**: Spatial interaction networks, ligand-receptor
518+
- **Pharmacogenomics → GWAS**: Variant data from association studies
519+
- **Pharmacogenomics → DNA**: Genomic coordinates and variant calling
520+
- **Pharmacogenomics → Phenotype**: Clinical phenotype data
492521

493522
### Workflow Patterns
494523
```python

config/gwas/gwas_amellifera.yaml

Lines changed: 24 additions & 0 deletions
@@ -27,6 +27,23 @@ genome:
2727
# Direct FTP URL for A. mellifera genome
2828
ftp_url: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/
2929

30+
# =============================================================================
31+
# DATA GENERATION (for synthetic/simulated data pipeline)
32+
# =============================================================================
33+
# Controls how run_amellifera_gwas.py generates VCF, phenotype, and metadata.
34+
# All values are defaults; CLI flags (--scale-factor, --n-variants, etc.) override.
35+
data_generation:
36+
subspecies:
37+
A.m.ligustica: {label: Italian, n_samples: 25, pop_effect: 0.0}
38+
A.m.carnica: {label: Carniolan, n_samples: 20, pop_effect: 0.3}
39+
A.m.mellifera: {label: Dark European, n_samples: 15, pop_effect: -0.2}
40+
A.m.caucasica: {label: Caucasian, n_samples: 10, pop_effect: 0.1}
41+
A.m.scutellata: {label: African, n_samples: 10, pop_effect: -0.5}
42+
n_drones: 10
43+
n_variants: 10000
44+
scale_factor: 5 # multiply all counts: 400 diploid + 50 drones = 450 samples
45+
seed: 42
46+
3047
# =============================================================================
3148
# VARIANT DATA SOURCES
3249
# =============================================================================
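The sample total quoted in the `scale_factor` comment of the `data_generation` block can be checked with a few lines of arithmetic. The sketch assumes, as that comment implies, that `scale_factor` multiplies both the per-subspecies `n_samples` values and `n_drones`; the variable names are illustrative only.

```python
# Values copied from the data_generation block in gwas_amellifera.yaml.
subspecies_samples = {
    "A.m.ligustica": 25,
    "A.m.carnica": 20,
    "A.m.mellifera": 15,
    "A.m.caucasica": 10,
    "A.m.scutellata": 10,
}
n_drones = 10
scale_factor = 5

# Assumption: the scale factor applies to every count, drones included.
diploid = sum(subspecies_samples.values()) * scale_factor
drones = n_drones * scale_factor
total = diploid + drones

print(diploid, drones, total)  # 400 50 450
```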
@@ -118,6 +135,13 @@ samples:
118135
# Important for honeybees: population/subspecies, sampling location, season
119136
# covariates_file: data/covariates/amellifera/covariates.tsv
120137

138+
# Subset options (all optional; omit or comment out to use all samples):
139+
# sample_list: path/to/sample_ids.txt # one ID per line
140+
# subset:
141+
# subspecies: [A.m.ligustica, A.m.carnica] # filter by subspecies
142+
# caste: [worker] # filter by caste
143+
# max_per_subspecies: 10 # balanced design cap
144+
121145
# =============================================================================
122146
# POPULATION STRUCTURE
123147
# =============================================================================
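The commented-out `subset` options added under `samples:` suggest a filter by subspecies and caste with a per-subspecies cap. The sketch below shows one plausible reading. The function `subset_samples`, the record keys (`id`, `subspecies`, `caste`), and the rule that the cap keeps the first matching samples in input order are all assumptions, not the repository's implementation.

```python
from __future__ import annotations

from collections import defaultdict


def subset_samples(
    samples: list[dict[str, str]],
    subspecies: list[str] | None = None,
    caste: list[str] | None = None,
    max_per_subspecies: int | None = None,
) -> list[dict[str, str]]:
    """Filter sample records; a filter set to None is skipped (hypothetical helper)."""
    kept: list[dict[str, str]] = []
    counts: dict[str, int] = defaultdict(int)
    for sample in samples:
        if subspecies is not None and sample["subspecies"] not in subspecies:
            continue
        if caste is not None and sample["caste"] not in caste:
            continue
        # Balanced-design cap: keep at most N samples per subspecies, in input order.
        if max_per_subspecies is not None and counts[sample["subspecies"]] >= max_per_subspecies:
            continue
        counts[sample["subspecies"]] += 1
        kept.append(sample)
    return kept


samples = [
    {"id": "S1", "subspecies": "A.m.ligustica", "caste": "worker"},
    {"id": "S2", "subspecies": "A.m.ligustica", "caste": "worker"},
    {"id": "S3", "subspecies": "A.m.carnica", "caste": "drone"},
    {"id": "S4", "subspecies": "A.m.scutellata", "caste": "worker"},
]
selected = subset_samples(
    samples,
    subspecies=["A.m.ligustica", "A.m.carnica"],
    caste=["worker"],
    max_per_subspecies=1,
)
print([s["id"] for s in selected])
```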

cursorrules/AGENTS.md

Lines changed: 6 additions & 0 deletions
@@ -24,6 +24,12 @@ Each `.cursorrules` file contains domain-specific guidelines:
2424
- `simulation.cursorrules` - Simulation patterns
2525
- `singlecell.cursorrules` - Single-cell patterns
2626
- `visualization.cursorrules` - Visualization patterns
27+
- `longread.cursorrules` - Long-read sequencing (PacBio/Nanopore) patterns
28+
- `metagenomics.cursorrules` - Metagenomics (amplicon, shotgun) patterns
29+
- `structural_variants.cursorrules` - Structural variant detection patterns
30+
- `spatial.cursorrules` - Spatial transcriptomics patterns
31+
- `pharmacogenomics.cursorrules` - Clinical pharmacogenomics patterns
32+
- `menu.cursorrules` - Interactive menu system patterns
2733

2834
## Usage
2935
These rules are automatically loaded by Cursor AI when working in the corresponding module directories. They ensure:

cursorrules/README.md

Lines changed: 6 additions & 0 deletions
@@ -23,6 +23,12 @@ This directory contains module-specific cursor rules for the METAINFORMANT proje
2323
- **`epigenome.cursorrules`**: Epigenetic modification analysis
2424
- **`ecology.cursorrules`**: Ecological metadata and community analysis
2525
- **`simulation.cursorrules`**: Synthetic data generation
26+
- **`longread.cursorrules`**: Long-read sequencing (PacBio/Nanopore)
27+
- **`metagenomics.cursorrules`**: Metagenomic analysis (amplicon, shotgun)
28+
- **`structural_variants.cursorrules`**: CNV/SV detection and annotation
29+
- **`spatial.cursorrules`**: Spatial transcriptomics (Visium, MERFISH, Xenium)
30+
- **`pharmacogenomics.cursorrules`**: Clinical pharmacogenomics
31+
- **`menu.cursorrules`**: Interactive menu and discovery system
2632

2733
## Usage
2834

cursorrules/SPEC.md

Lines changed: 6 additions & 0 deletions
@@ -53,6 +53,12 @@ Each `.cursorrules` file is a plain text file containing:
5353
| ecology.cursorrules | Community diversity |
5454
| simulation.cursorrules | Synthetic data |
5555
| life_events.cursorrules | Event sequences |
56+
| longread.cursorrules | Long-read sequencing (PacBio/Nanopore) |
57+
| metagenomics.cursorrules | Amplicon, shotgun metagenomics |
58+
| structural_variants.cursorrules | CNV/SV detection and annotation |
59+
| spatial.cursorrules | Spatial transcriptomics |
60+
| pharmacogenomics.cursorrules | Clinical pharmacogenomics |
61+
| menu.cursorrules | Interactive menu system |
5662

5763
## Interface
5864

cursorrules/core.cursorrules

Lines changed: 20 additions & 0 deletions
@@ -7,6 +7,26 @@ Shared utilities across all domains. Foundation for all other modules.
77
- **Required**: Standard library only
88
- **Optional**: Handled defensively (try/except imports)
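The "try/except imports" rule for optional dependencies can be sketched as follows. The helper name `optional_import` is hypothetical; the repository's own `optional_deps.py` is not shown in this diff and may work differently.

```python
from __future__ import annotations

import importlib
from types import ModuleType


def optional_import(name: str) -> ModuleType | None:
    """Import `name` if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# A standard-library module is always available; a made-up name is not.
json_mod = optional_import("json")
missing = optional_import("definitely_not_an_installed_package_xyz")

if missing is None:
    print("optional dependency unavailable; falling back to the standard library")
```

Callers check for `None` before using the module, so a missing optional package degrades one feature rather than breaking the import of the whole module.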
99

10+
## Source Structure
11+
```
12+
src/metainformant/core/
13+
├── data/
14+
│ ├── db.py, validation.py
15+
├── engine/
16+
│ └── workflow_manager.py
17+
├── execution/
18+
│ ├── discovery.py, parallel.py, workflow.py
19+
├── io/
20+
│ ├── atomic.py, cache.py, checksums.py, disk.py
21+
│ ├── download.py, download_manager.py, download_robust.py
22+
│ ├── errors.py, io.py, paths.py
23+
├── ui/
24+
│ └── tui.py
25+
└── utils/
26+
├── config.py, errors.py, hash.py, logging.py
27+
├── optional_deps.py, progress.py, symbols.py, text.py, timing.py
28+
```
29+
1030
## Package Management
1131
- **ALWAYS use `uv`** for all Python package management and environment operations
1232
- Use `uv venv` to create virtual environments

cursorrules/dna.cursorrules

Lines changed: 28 additions & 5 deletions
@@ -7,6 +7,29 @@ DNA sequence analysis, genomics, population genetics, and variant calling.
77
- **Required**: `core`
88
- **Optional**: `biopython`, `ncbi-datasets-pylib`, `pysam`
99

10+
## Source Structure
11+
```
12+
src/metainformant/dna/
13+
├── alignment/
14+
│ ├── distances.py, msa.py, pairwise.py
15+
├── expression/
16+
│ ├── codon.py, transcription.py, translation.py
17+
├── external/
18+
│ ├── entrez.py, genomes.py, ncbi.py
19+
├── integration/
20+
│ └── rna.py
21+
├── io/
22+
│ ├── fasta.py, fastq.py
23+
├── phylogeny/
24+
│ └── tree.py
25+
├── population/
26+
│ ├── analysis.py, core.py, visualization.py
27+
├── sequence/
28+
│ ├── composition.py, consensus.py, core.py, kmer.py, motifs.py, restriction.py
29+
└── variation/
30+
├── mutations.py, variants.py
31+
```
32+
1033
## Package Management
1134
- **ALWAYS use `uv`** for all Python package management and environment operations
1235
- Install optional dependencies: `uv add biopython`, `uv add ncbi-datasets-pylib`, `uv add pysam`
@@ -22,7 +45,7 @@ DNA sequence analysis, genomics, population genetics, and variant calling.
2245

2346
**Patterns**:
2447
```python
25-
from metainformant.dna import sequences
48+
from metainformant.dna.sequence import core as sequences
2649

2750
seqs = sequences.read_fasta("data/sequences.fasta")
2851
for seq_id, sequence in seqs:
@@ -49,9 +72,9 @@ for seq_id, sequence in seqs:
4972

5073
**Patterns**:
5174
```python
52-
from metainformant.dna import phylogeny
75+
from metainformant.dna.phylogeny import tree
5376

54-
tree = phylogeny.neighbor_joining_tree(sequences)
77+
tree_result = tree.neighbor_joining_tree(sequences)
5578
# Returns Newick format string or tree object
5679
```
5780

@@ -62,7 +85,7 @@ tree = phylogeny.neighbor_joining_tree(sequences)
6285

6386
**Patterns**:
6487
```python
65-
from metainformant.dna import population
88+
from metainformant.dna.population import core as population
6689

6790
stats = population.calculate_pi(sequences)
6891
# Returns: {"pi": 0.001, "segregating_sites": 42, ...}
@@ -95,7 +118,7 @@ stats = population.calculate_pi(sequences)
95118

96119
**Patterns**:
97120
```python
98-
from metainformant.dna import ncbi, genomes
121+
from metainformant.dna.external import ncbi, genomes
99122

100123
# Validate accession
101124
accession = genomes.validate_accession("GCF_000001405.40")

cursorrules/ecology.cursorrules

Lines changed: 33 additions & 14 deletions
@@ -7,6 +7,15 @@ Ecological metadata and community analysis: community structure analysis and div
77
- **Required**: `core`
88
- **Optional**: `math` (diversity calculations)
99

10+
## Source Structure
11+
```
12+
src/metainformant/ecology/
13+
├── analysis/
14+
│ ├── community.py, functional.py, indicators.py, macroecology.py, ordination.py
15+
└── visualization/
16+
└── visualization.py
17+
```
18+
1019
## Package Management
1120
- **ALWAYS use `uv`** for all Python package management and environment operations
1221
- Use `uv run` to execute commands: `uv run pytest`, `uv run metainformant ecology --help`
@@ -19,20 +28,30 @@ Ecological metadata and community analysis: community structure analysis and div
1928
- Species abundance
2029
- Diversity metrics
2130

22-
### Environmental (`environmental`)
23-
- Environmental metadata integration
24-
- Ecological parameter analysis
25-
- Environmental variable processing
26-
27-
### Interactions (`interactions`)
28-
- Ecological interaction analysis
29-
- Species interaction networks
30-
- Interaction pattern detection
31-
32-
### Workflow (`workflow`)
33-
- End-to-end ecology analysis workflows
34-
- Workflow orchestration
35-
- Configuration-based execution
31+
### Functional (`functional`)
32+
- Functional trait analysis
33+
- Functional diversity
34+
- Trait-based ecology
35+
36+
### Indicators (`indicators`)
37+
- Ecological indicators
38+
- Environmental health metrics
39+
- Biodiversity indicators
40+
41+
### Macroecology (`macroecology`)
42+
- Macroecological patterns
43+
- Species-area relationships
44+
- Abundance distributions
45+
46+
### Ordination (`ordination`)
47+
- Ordination methods (PCA, NMDS, CCA)
48+
- Community composition analysis
49+
- Multivariate statistics
50+
51+
### Visualization (`visualization`)
52+
- Ecological data visualization
53+
- Community structure plots
54+
- Diversity visualizations
3655

3756
## Patterns
3857

cursorrules/epigenome.cursorrules

Lines changed: 14 additions & 1 deletion
@@ -6,6 +6,19 @@ Epigenetic modification analysis: DNA methylation analysis and epigenomic track
66
## Dependencies
77
- **Required**: `core`, `dna` (for genomic coordinates)
88

9+
## Source Structure
10+
```
11+
src/metainformant/epigenome/
12+
├── analysis/
13+
│ └── tracks.py
14+
├── assays/
15+
│ ├── atacseq.py, chipseq.py, methylation.py
16+
├── visualization/
17+
│ └── visualization.py
18+
└── workflow/
19+
└── workflow.py
20+
```
21+
922
## Package Management
1023
- **ALWAYS use `uv`** for all Python package management and environment operations
1124
- Use `uv run` to execute commands: `uv run pytest`, `uv run metainformant epigenome --help`
@@ -20,7 +33,7 @@ Epigenetic modification analysis: DNA methylation analysis and epigenomic track
2033

2134
**Patterns**:
2235
```python
23-
from metainformant.epigenome import methylation
36+
from metainformant.epigenome.assays import methylation
2437

2538
methylation_data = methylation.analyze_methylation(
2639
bam_file="data/methylation.bam",

cursorrules/gwas.cursorrules

Lines changed: 25 additions & 3 deletions
@@ -7,6 +7,28 @@ Genome-wide association studies: variant quality control, association testing, p
77
- **Required**: `core`, `dna.variants`, `dna.population`, `math.popgen`, `ml.regression`
88
- **Optional**: External tools (bcftools, GATK) for variant calling
99

10+
## Source Structure
11+
```
12+
src/metainformant/gwas/
13+
├── analysis/
14+
│ ├── annotation.py, association.py, calling.py, correction.py
15+
│ ├── heritability.py, ld_pruning.py, mixed_model.py
16+
│ ├── quality.py, structure.py, summary_stats.py, utils.py
17+
├── data/
18+
│ ├── config.py, download.py, genome.py, metadata.py, sra_download.py
19+
├── visualization/
20+
│ ├── config.py, general.py, utils.py
21+
│ ├── visualization_comparison.py, visualization_composite.py
22+
│ ├── visualization_effects.py, visualization_finemapping.py
23+
│ ├── visualization_genome.py, visualization_geography.py
24+
│ ├── visualization_interactive.py, visualization_ld.py
25+
│ ├── visualization_phenotype.py, visualization_population.py
26+
│ ├── visualization_regional.py, visualization_statistical.py
27+
│ ├── visualization_suite.py, visualization_variants.py
28+
└── workflow/
29+
└── workflow.py
30+
```
31+
1032
## Package Management
1133
- **ALWAYS use `uv`** for all Python package management and environment operations
1234
- Use `uv run` to execute commands: `uv run pytest`, `uv run metainformant gwas run --config config/gwas/example.yaml`
@@ -22,7 +44,7 @@ Genome-wide association studies: variant quality control, association testing, p
2244

2345
**Patterns**:
2446
```python
25-
from metainformant.gwas import association
47+
from metainformant.gwas.analysis import association
2648

2749
results = association.test_association(
2850
genotypes=genotypes,
@@ -40,7 +62,7 @@ results = association.test_association(
4062

4163
**Patterns**:
4264
```python
43-
from metainformant.gwas import correction
65+
from metainformant.gwas.analysis import correction
4466

4567
corrected = correction.apply_bonferroni(p_values)
4668
corrected = correction.apply_fdr(p_values, method="bh")
@@ -79,7 +101,7 @@ corrected = correction.apply_fdr(p_values, method="bh")
79101

80102
**Patterns**:
81103
```python
82-
from metainformant.gwas import visualization
104+
from metainformant.gwas.visualization import general as visualization
83105

84106
visualization.plot_manhattan(
85107
results=association_results,
