Commit b47a388

Merge pull request #2 from sourmash-bio/upd_aug2
WIP: working on stuff, round2
2 parents b38bc46 + c8af19b commit b47a388

25 files changed: +1008 -72 lines changed

.github/workflows/gh-pages.yml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+name: build and deploy mkdocs to github pages
+
+permissions:
+  pages: write
+  contents: write
+
+on:
+  push:
+    branches:
+      - main
+      - upd_aug2
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: 3.x
+      - run: pip install mkdocs-material
+      - run: mkdocs gh-deploy --force

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 *~
 __pycache__
-outputs
 .snakemake
+site/
+*.pickle

Makefile

Lines changed: 14 additions & 2 deletions
@@ -1,6 +1,18 @@
-all: Snakefile
+.PHONY: build format update_sourmash
+
+all: Snakefile build preview
+
+build:
 	snakemake --delete-all-output
-	snakemake -p
+	snakemake -p -j 1
+
+preview: build
+	rm -fr preview/generated
+	mkdir -p preview/generated/
+	cp outputs/md/*.md preview/generated/
+
+update_ctb_sourmash: build
+	cp -r outputs/md/ ~/dev/sourmash/doc/databases-md/
 
 format:
 	black scripts
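
For readers who want to run the `preview` step outside of `make`, its three shell commands can be mirrored in Python. This is only a sketch and not code from the repository: the function name `refresh_preview` is hypothetical, and the default paths are taken from the Makefile above.

```python
import shutil
from pathlib import Path


def refresh_preview(src="outputs/md", dest="preview/generated"):
    """Recreate dest and copy every *.md file from src into it."""
    src_dir, dest_dir = Path(src), Path(dest)
    # rm -fr preview/generated
    shutil.rmtree(dest_dir, ignore_errors=True)
    # mkdir -p preview/generated/
    dest_dir.mkdir(parents=True, exist_ok=True)
    # cp outputs/md/*.md preview/generated/
    copied = []
    for md_file in sorted(src_dir.glob("*.md")):
        shutil.copy(md_file, dest_dir / md_file.name)
        copied.append(md_file.name)
    return copied
```

Deleting the directory first means a Markdown file that is no longer generated cannot linger in the preview.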

README.md

Lines changed: 36 additions & 0 deletions
@@ -1 +1,37 @@
 # 2025-sourmash-databases-doc-template
+
+This repo uses Jinja2 templating to automatically produce Markdown
+files describing the available sourmash databases. It also produces
+automated scripts that check that the relevant files are available for
+download.
+
+## Basic instructions
+
+Run `make`. Then look at [outputs/md](outputs/md) to see output markdown.
+
+## Previewing formatting with mkdocs
+
+Run `mkdocs serve` to see the generated files.
+
+## Updating the database list in the sourmash docs
+
+Copy all the files in `outputs/md/` into the sourmash repo under `doc/databases-md/`.
+
+## Adding databases
+
+Edit [scripts/databases.py](scripts/databases.py) to add databases.
+
+You'll also need to:
+* add the new db to 'collections' list at top of `scripts/make-md.py`.
+* update `mkdocs.yml` if you want to preview the new db;
+* update `doc/databases.md` in sourmash to include a direct link to the generated file.
+
+## Validating database links
+
+Run `outputs/scripts/check-urls.py` to check that all the database and
+taxonomy URLs are valid.
+
+## Notes on database naming conventions:
+
+* all NCBI databases will be date-stamped
+* all GTDB databases will be version-stamped
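
The generated `outputs/scripts/check-urls.py` is not part of this view, so what follows is only a sketch of how a link checker of this kind could work, not the repository's script. It assumes a `HEAD` request is enough to tell whether a file is available; the names `head_status` and `find_broken` are hypothetical.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def find_broken(urls, get_status=head_status):
    """Return the URLs whose status is not 200."""
    return [url for url in urls if get_status(url) != 200]
```

Passing the status function as a parameter keeps `find_broken` testable without network access. Using `HEAD` also avoids downloading multi-gigabyte database files just to confirm they exist.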

Snakefile

Lines changed: 37 additions & 1 deletion
@@ -9,6 +9,9 @@ templates = [
     Templates_To_Output('gtdb220',
                         'complete',
                         'outputs/md/gtdb220.md'),
+    Templates_To_Output('gtdb226',
+                        'complete',
+                        'outputs/md/gtdb226.md'),
     Templates_To_Output('ncbi_viruses_2025_01',
                         'complete',
                         'outputs/md/ncbi_viruses_2025_01.md'),
@@ -17,6 +20,7 @@ templates = [
                         'outputs/md/ncbi_euks_2025_01.md'),
 ]
 
+# retrieve path from a Templates_To_Output object
 def get_template_path(w):
     print('gtp', w)
     for t in templates:
@@ -25,6 +29,7 @@ def get_template_path(w):
             print('get_template_path', t, x)
             return x
 
+# retrieve name from a Templates_To_Output object
 def get_template_name(w):
     for t in templates:
         if t.output_md == f'outputs/md/{w.db}.md':
@@ -37,19 +42,22 @@ def get_template_name(w):
 rule default:
     input:
         [ t.output_md for t in templates ],
+        "outputs/scripts/check-urls.py",
+        "outputs/md/databases.md",
 
 
 rule make_db_descr:
     input:
         script='scripts/make-list.py',
+        dbfoo='scripts/databases.py',
     output:
         pickle='outputs/collections.pickle',
     shell: """
         {input.script} --save-pickle {output.pickle}
     """
 
 
-rule make_gtdb:
+rule make_db_md:
     input:
         script='scripts/make-md.py',
         pickle='outputs/collections.pickle',
@@ -62,3 +70,31 @@ rule make_gtdb:
         {input.script} {input.pickle} {params.template} \
             --set-collection {wildcards.db} -o {output}
     """
+
+rule make_check_script:
+    input:
+        script="scripts/make-file.py",
+        pickle='outputs/collections.pickle',
+        template="templates/check-urls.py",
+    output:
+        "outputs/scripts/check-urls.py",
+    params:
+        template_name="check-urls.py"
+    shell: """
+        {input.script} {input.pickle} {params.template_name} -o {output}
+        chmod +x {output}
+    """
+
+rule make_databases:
+    input:
+        script="scripts/make-file.py",
+        pickle='outputs/collections.pickle',
+        template="templates/databases.md",
+    output:
+        "outputs/md/databases.md",
+    params:
+        template_name="databases.md",
+    shell: """
+        {input.script} {input.pickle} {params.template_name} -o {output}
+        chmod +x {output}
+    """

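The Snakefile hunks show only fragments of the lookup helpers, so here is a standalone sketch of how a `db` wildcard could be resolved to a template name. It assumes `Templates_To_Output` is a namedtuple; only the `output_md` field appears in the diff, and the field names `collection` and `template` are hypothetical.

```python
from collections import namedtuple
from types import SimpleNamespace

# Only `output_md` appears in the diff; the other field names are assumed.
Templates_To_Output = namedtuple(
    "Templates_To_Output", ["collection", "template", "output_md"]
)

templates = [
    Templates_To_Output("gtdb220", "complete", "outputs/md/gtdb220.md"),
    Templates_To_Output("gtdb226", "complete", "outputs/md/gtdb226.md"),
]


def get_template_name(wildcards):
    """Return the template name for the output file implied by wildcards.db."""
    wanted = f"outputs/md/{wildcards.db}.md"
    for t in templates:
        if t.output_md == wanted:
            return t.template
    raise ValueError(f"no template registered for {wanted}")


# Snakemake passes a wildcards object; SimpleNamespace stands in for it here.
print(get_template_name(SimpleNamespace(db="gtdb226")))  # prints: complete
```

Raising on an unknown collection, rather than silently returning `None`, makes a missing `templates` entry fail at the lookup instead of inside a later shell command.
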
mkdocs.yml

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+site_name: database-template-preview
+docs_dir: ./preview/
+
+nav:
+  - Home: index.md
+  - Databases: generated/databases.md
+  - GTDB RS220: generated/gtdb220.md
+  - GTDB RS226: generated/gtdb226.md
+  - NCBI eukaryotes (Jan 2025): generated/ncbi_euks_2025_01.md
+  - NCBI viruses (Jan 2025): generated/ncbi_viruses_2025_01.md
+
+markdown_extensions:
+  - toc:
+      permalink: true
+  - def_list
+  - attr_list
+  - admonition
+
+plugins:
+  - search
+
+theme:
+  name: material
+  features:
+    - content.code.annotate
+    - navigation.sections
+    - navigation.expand
+    - navigation.indexes
+    - navigation.top
+    - navigation.footer
+    - navigation.tracking
+    - search.highlight
+    - search.share
+    - search.suggest
+    - content.code.copy
+

outputs/md/databases.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+<!-- automatically generated by code in https://github.com/sourmash-bio/2025-sourmash-databases-doc-template/ -->
+<!-- template file: templates/databases.md -->
+
+
+
+[GTDB RS220](gtdb220.md) -- Bacterial and Archaeal genomes from GTDB RS220.
+
+
+
+[GTDB RS226](gtdb226.md) -- Bacterial and Archaeal genomes from GTDB RS226.
+
+
+
+[NCBI Viruses (Jan 2025)](ncbi_viruses_2025_01.md) -- All viruses from NCBI (NCBI:txid10239) as of January 2025.
+
+
+
+[NCBI Eukaryotes (Jan 2025)](ncbi_euks_2025_01.md) -- All eukaryotic reference genomes from NCBI (NCBI:txid2759) as of January 2025.
+
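
Each entry in the generated index follows the pattern `[title](name.md) -- description`. The repository renders it with a Jinja2 template (`templates/databases.md`, not shown here); the sketch below swaps in plain Python string formatting to illustrate the same pattern. The dictionary keys `name`, `title`, and `description` are hypothetical.

```python
def render_index(collections):
    """Render one Markdown link line per collection, separated by blank lines."""
    lines = [
        f"[{c['title']}]({c['name']}.md) -- {c['description']}"
        for c in collections
    ]
    return "\n\n".join(lines) + "\n"


collections = [
    {
        "name": "gtdb226",
        "title": "GTDB RS226",
        "description": "Bacterial and Archaeal genomes from GTDB RS226.",
    },
]
print(render_index(collections), end="")
```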

outputs/md/gtdb220.md

Lines changed: 56 additions & 5 deletions
@@ -1,18 +1,69 @@
-# Collection: GTDB RS220 - All Bacteria and Archaea from GTDB RS220
+<!-- automatically generated by code in https://github.com/sourmash-bio/2025-sourmash-databases-doc-template/ -->
+<!-- template file: templates/complete.md -->
+
+# Collection: GTDB RS220
+
+Bacterial and Archaeal genomes from GTDB RS220.
 
 Links:
+
 * [Announcement](https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r09-rs220/595)
 
 ## Database files:
 
 Files:
 
-* zip: [gtdb-rs220-k21.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220/gtdb-rs220-k21.zip) - all GTDB genomes - DNA, k=21, scaled=1000
-* zip: [gtdb-rs220-k31.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220/gtdb-rs220-k31.zip) - all GTDB genomes - DNA, k=31, scaled=1000
-* zip: [gtdb-rs220-k51.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220/gtdb-rs220-k51.zip) - all GTDB genomes - DNA, k=51, scaled=1000
+* zip: [gtdb-rs220-k21.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k21.dna.zip) - all GTDB genomes. - DNA, k=21, scaled=1000 (17.0 GB)
+* zip: [gtdb-rs220-k31.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k31.dna.zip) - all GTDB genomes. - DNA, k=31, scaled=1000 (17.0 GB)
+* zip: [gtdb-rs220-k51.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k51.dna.zip) - all GTDB genomes. - DNA, k=51, scaled=1000 (17.0 GB)
+
+
+* zip: [gtdb-reps-rs220-k21.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k21.dna.zip) - all GTDB species representative genomes. - DNA, k=21, scaled=1000 (2.8 GB)
+* zip: [gtdb-reps-rs220-k31.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k31.dna.zip) - all GTDB species representative genomes. - DNA, k=31, scaled=1000 (2.8 GB)
+* zip: [gtdb-reps-rs220-k51.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k51.dna.zip) - all GTDB species representative genomes. - DNA, k=51, scaled=1000 (2.8 GB)
 
 
 
 ## Taxonomy files:
 
-* [GTDB taxonomy for RS220](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/gtdb-rs220.lineages.csv)
+* [GTDB taxonomy for RS220.](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220.lineages.csv)
+
+
+## Advanced
+
+### Download via curl using the command line
+
+```shell
+# download gtdb-rs220-k21.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k21.dna.zip
+
+# download gtdb-rs220-k31.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k31.dna.zip
+
+# download gtdb-rs220-k51.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k51.dna.zip
+
+# download gtdb-reps-rs220-k21.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k21.dna.zip
+
+# download gtdb-reps-rs220-k31.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k31.dna.zip
+
+# download gtdb-reps-rs220-k51.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k51.dna.zip
+
+# download taxonomy file
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220.lineages.csv
+```
+
+### A list of all the URLs
+
+```
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k21.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k31.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220-k51.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k21.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k31.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-reps-rs220-k51.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs220/gtdb-rs220.lineages.csv
+```
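
The seven GTDB URLs in this file follow a regular pattern: a base URL, a release folder, a `gtdb` or `gtdb-reps` prefix, and a k-mer size. The sketch below reproduces the list from that pattern. It is inferred from the URLs shown in this diff rather than taken from the repository's `scripts/databases.py`, and the function name `gtdb_urls` is hypothetical.

```python
BASE = "https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new"


def gtdb_urls(release, ksizes=(21, 31, 51)):
    """List database URLs (all genomes, then reps) plus the taxonomy file."""
    folder = f"{BASE}/gtdb-{release}"
    urls = [
        f"{folder}/{prefix}-{release}-k{k}.dna.zip"
        for prefix in ("gtdb", "gtdb-reps")
        for k in ksizes
    ]
    urls.append(f"{folder}/gtdb-{release}.lineages.csv")
    return urls


for url in gtdb_urls("rs220"):
    print(url)
```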

outputs/md/gtdb226.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+<!-- automatically generated by code in https://github.com/sourmash-bio/2025-sourmash-databases-doc-template/ -->
+<!-- template file: templates/complete.md -->
+
+# Collection: GTDB RS226
+
+Bacterial and Archaeal genomes from GTDB RS226.
+
+Links:
+
+* [Announcement](https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r10-rs226/724)
+
+## Database files:
+
+Files:
+
+* zip: [gtdb-rs226-k21.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k21.dna.zip) - all GTDB genomes. - DNA, k=21, scaled=1000 (21.0 GB)
+* zip: [gtdb-rs226-k31.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k31.dna.zip) - all GTDB genomes. - DNA, k=31, scaled=1000 (21.0 GB)
+* zip: [gtdb-rs226-k51.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k51.dna.zip) - all GTDB genomes. - DNA, k=51, scaled=1000 (21.0 GB)
+
+
+* zip: [gtdb-reps-rs226-k21.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k21.dna.zip) - all GTDB species representative genomes. - DNA, k=21, scaled=1000 (21.0 GB)
+* zip: [gtdb-reps-rs226-k31.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k31.dna.zip) - all GTDB species representative genomes. - DNA, k=31, scaled=1000 (21.0 GB)
+* zip: [gtdb-reps-rs226-k51.dna.zip](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k51.dna.zip) - all GTDB species representative genomes. - DNA, k=51, scaled=1000 (21.0 GB)
+
+
+
+## Taxonomy files:
+
+* [GTDB taxonomy for RS226.](https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226.lineages.csv)
+
+
+## Advanced
+
+### Download via curl using the command line
+
+```shell
+# download gtdb-rs226-k21.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k21.dna.zip
+
+# download gtdb-rs226-k31.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k31.dna.zip
+
+# download gtdb-rs226-k51.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k51.dna.zip
+
+# download gtdb-reps-rs226-k21.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k21.dna.zip
+
+# download gtdb-reps-rs226-k31.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k31.dna.zip
+
+# download gtdb-reps-rs226-k51.dna.zip
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k51.dna.zip
+
+# download taxonomy file
+curl -O --no-clobber https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226.lineages.csv
+```
+
+### A list of all the URLs
+
+```
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k21.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k31.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226-k51.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k21.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k31.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-reps-rs226-k51.dna.zip
+https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db.new/gtdb-rs226/gtdb-rs226.lineages.csv
+```
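
Where curl is unavailable, a file from the URL list can be fetched with the Python standard library. The sketch below skips the download when a file of the same name already exists. That behaviour is deliberately different from curl's `--no-clobber`, which per the curl manual avoids overwriting by saving under a numbered filename instead. The function name `download_if_missing` and its `retrieve` parameter are hypothetical.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def download_if_missing(url, dest_dir=".", retrieve=urllib.request.urlretrieve):
    """Download url into dest_dir unless a file of that name already exists.

    Returns (target_path, downloaded), where downloaded is False if skipped.
    """
    filename = Path(urlsplit(url).path).name
    target = Path(dest_dir) / filename
    if target.exists():
        return target, False  # keep the existing file untouched
    retrieve(url, str(target))
    return target, True
```

Skipping existing files makes the function safe to re-run after an interrupted batch, at the cost of not detecting a partially written file; comparing sizes or checksums would be needed for that.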
