Commit fe3d215

Merge pull request #177 from Ensembl/dev/metadata_patching
Dev/metadata patching
2 parents b0abd5c + c40b963 commit fe3d215

3 files changed: +327 -94 lines changed

src/python/ensembl/genes/metadata/README.md

Lines changed: 72 additions & 1 deletion
@@ -54,5 +54,76 @@ export TAXONOMY_URI="mysql+pymysql://user:pass@host:port/ncbi_taxonomy"

**Usage:**
```bash
# Basic usage
python beta_patcher.py patches.csv --jira-ticket EBD-1111 --output-dir ./patches/

# With team filter (only applies patches where all affected genomes belong to specified team)
python beta_patcher.py patches.csv --jira-ticket EBD-1111 --team-filter Genebuild
```
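
The team filter described in the comment above reads as an all-or-nothing rule: a patch is applied only when every genome it affects belongs to the requested team. The sketch below illustrates that rule only. The function name, the `team_by_genome` mapping, and the handling of an empty genome list are assumptions for illustration and are not taken from `beta_patcher.py`.

```python
from typing import Dict, List


def patch_passes_team_filter(
    affected_genome_uuids: List[str],
    team_by_genome: Dict[str, str],
    team_filter: str,
) -> bool:
    """Return True only if every affected genome belongs to team_filter.

    Hypothetical helper: team_by_genome maps genome_uuid -> team name.
    """
    # Assumption: a patch with no known affected genomes is skipped,
    # because there is nothing to verify against the team filter.
    if not affected_genome_uuids:
        return False
    # Genomes missing from the mapping count as "not this team".
    return all(
        team_by_genome.get(uuid) == team_filter
        for uuid in affected_genome_uuids
    )


teams = {"u1": "Genebuild", "u2": "Genebuild", "u3": "OtherTeam"}
print(patch_passes_team_filter(["u1", "u2"], teams, "Genebuild"))
print(patch_passes_team_filter(["u1", "u3"], teams, "Genebuild"))
```

With this rule, a single genome owned by another team is enough to exclude the whole patch, which matches the "all affected genomes" wording in the usage comment.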

### Finding genome_uuid for organism/assembly patches

When patching `organism` or `assembly` tables, you need to provide a genome_uuid. Use these queries to find genome UUIDs:

**Find all genomes for a specific assembly (by accession):**
```sql
SELECT DISTINCT
    genome.genome_uuid,
    genome.production_name,
    assembly.accession,
    assembly.name AS assembly_name,
    (SELECT da.value
     FROM genome_dataset gd
     JOIN dataset d ON gd.dataset_id = d.dataset_id AND d.name = 'genebuild'
     JOIN dataset_attribute da ON d.dataset_id = da.dataset_id
     JOIN attribute a ON da.attribute_id = a.attribute_id AND a.name = 'genebuild.team_responsible'
     WHERE gd.genome_id = genome.genome_id
     LIMIT 1) AS team_responsible
FROM genome
JOIN assembly ON genome.assembly_id = assembly.assembly_id
WHERE assembly.accession = 'GCA_000001405.14'
ORDER BY team_responsible, genome.production_name;
```
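
To try the assembly query without access to the metadata database, it can be run against a toy schema with the accession passed as a bound parameter. The sketch below uses Python's built-in `sqlite3` with an in-memory database. The table layouts are inferred only from the columns the query references, and every row (accession `GCA_TOY_1`, the UUIDs, the team names) is made-up illustration data. SQLite uses `?` placeholders; a MySQL driver may expect a different placeholder style.

```python
import sqlite3

# Same query as above, with the accession as a bound parameter.
QUERY = """
SELECT DISTINCT
    genome.genome_uuid,
    genome.production_name,
    assembly.accession,
    assembly.name AS assembly_name,
    (SELECT da.value
     FROM genome_dataset gd
     JOIN dataset d ON gd.dataset_id = d.dataset_id AND d.name = 'genebuild'
     JOIN dataset_attribute da ON d.dataset_id = da.dataset_id
     JOIN attribute a ON da.attribute_id = a.attribute_id
          AND a.name = 'genebuild.team_responsible'
     WHERE gd.genome_id = genome.genome_id
     LIMIT 1) AS team_responsible
FROM genome
JOIN assembly ON genome.assembly_id = assembly.assembly_id
WHERE assembly.accession = ?
ORDER BY team_responsible, genome.production_name
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a minimal, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE assembly (assembly_id INTEGER PRIMARY KEY, accession TEXT, name TEXT);
        CREATE TABLE genome (genome_id INTEGER PRIMARY KEY, genome_uuid TEXT,
                             production_name TEXT, assembly_id INTEGER);
        CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE genome_dataset (genome_id INTEGER, dataset_id INTEGER);
        CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dataset_attribute (dataset_id INTEGER, attribute_id INTEGER, value TEXT);

        -- Two toy genomes sharing one toy assembly, owned by different teams.
        INSERT INTO assembly VALUES (1, 'GCA_TOY_1', 'toy_asm');
        INSERT INTO genome VALUES (1, 'uuid-a', 'toy_species_a', 1);
        INSERT INTO genome VALUES (2, 'uuid-b', 'toy_species_b', 1);
        INSERT INTO dataset VALUES (10, 'genebuild');
        INSERT INTO dataset VALUES (20, 'genebuild');
        INSERT INTO genome_dataset VALUES (1, 10);
        INSERT INTO genome_dataset VALUES (2, 20);
        INSERT INTO attribute VALUES (100, 'genebuild.team_responsible');
        INSERT INTO dataset_attribute VALUES (10, 100, 'Genebuild');
        INSERT INTO dataset_attribute VALUES (20, 100, 'OtherTeam');
        """
    )
    return conn


def find_genomes_by_accession(conn: sqlite3.Connection, accession: str) -> list:
    """Run the lookup with the accession bound as a parameter."""
    return conn.execute(QUERY, (accession,)).fetchall()


rows = find_genomes_by_accession(build_toy_db(), "GCA_TOY_1")
for row in rows:
    print(row)
```

Both toy genomes come back for the single accession, each with its own `team_responsible` value, which is the situation the team filter has to account for.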

**Find all genomes for a specific organism (by biosample_id):**
```sql
SELECT DISTINCT
    genome.genome_uuid,
    genome.production_name,
    organism.biosample_id,
    organism.scientific_name,
    organism.strain,
    (SELECT da.value
     FROM genome_dataset gd
     JOIN dataset d ON gd.dataset_id = d.dataset_id AND d.name = 'genebuild'
     JOIN dataset_attribute da ON d.dataset_id = da.dataset_id
     JOIN attribute a ON da.attribute_id = a.attribute_id AND a.name = 'genebuild.team_responsible'
     WHERE gd.genome_id = genome.genome_id
     LIMIT 1) AS team_responsible
FROM genome
JOIN organism ON genome.organism_id = organism.organism_id
WHERE organism.biosample_id = 'SAMN04851098'
ORDER BY team_responsible, genome.production_name;
```

**Find genomes by organism scientific name (strain is returned for reference):**
```sql
SELECT DISTINCT
    genome.genome_uuid,
    genome.production_name,
    organism.scientific_name,
    organism.strain,
    (SELECT da.value
     FROM genome_dataset gd
     JOIN dataset d ON gd.dataset_id = d.dataset_id AND d.name = 'genebuild'
     JOIN dataset_attribute da ON d.dataset_id = da.dataset_id
     JOIN attribute a ON da.attribute_id = a.attribute_id AND a.name = 'genebuild.team_responsible'
     WHERE gd.genome_id = genome.genome_id
     LIMIT 1) AS team_responsible
FROM genome
JOIN organism ON genome.organism_id = organism.organism_id
WHERE organism.scientific_name = 'Homo sapiens'
ORDER BY team_responsible, genome.production_name;
```

Pick any one of the returned genome_uuid values to use in your CSV. The script will automatically detect and warn about all other genomes sharing that organism/assembly.
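
The detect-and-warn behaviour described above can be pictured as a lookup of "sibling" genomes that share the same organism or assembly as the chosen one. The sketch below is an illustration only, not the script's implementation: `find_sibling_genomes`, the dictionary layout, and the sample rows are all hypothetical.

```python
from typing import Dict, List


def find_sibling_genomes(
    chosen_uuid: str,
    genomes: List[Dict[str, str]],
    shared_key: str,
) -> List[str]:
    """Return UUIDs of other genomes sharing shared_key with chosen_uuid.

    Hypothetical helper: each genome is a dict holding 'genome_uuid' plus
    the shared field (for example 'accession' or 'biosample_id').
    """
    chosen = next((g for g in genomes if g["genome_uuid"] == chosen_uuid), None)
    if chosen is None:
        raise ValueError(f"unknown genome_uuid: {chosen_uuid}")
    return sorted(
        g["genome_uuid"]
        for g in genomes
        if g[shared_key] == chosen[shared_key] and g["genome_uuid"] != chosen_uuid
    )


# Made-up rows shaped like the result of the assembly query.
genomes = [
    {"genome_uuid": "uuid-a", "accession": "GCA_TOY_1"},
    {"genome_uuid": "uuid-b", "accession": "GCA_TOY_1"},
    {"genome_uuid": "uuid-c", "accession": "GCA_TOY_2"},
]

siblings = find_sibling_genomes("uuid-a", genomes, "accession")
for uuid in siblings:
    print(f"WARNING: genome {uuid} shares the same accession and is also affected")
```

Because a patch to an `organism` or `assembly` row affects every genome that points at it, choosing `uuid-a` or `uuid-b` in the CSV would touch the same assembly record in this toy data.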
