Commit c1307d1

Merge branch 'rel-1.5.0' into release

2 parents 9cf5b1d + caeabe6

65 files changed, +11836 −1672 lines


.github/workflows/build-containers.yml

Lines changed: 3 additions & 4 deletions

```diff
@@ -82,10 +82,9 @@ jobs:
         uses: actions/checkout@v4

       - name: Install Nextflow
-        run: |
-          curl -s https://get.nextflow.io | bash
-          chmod +x nextflow
-          mv nextflow /usr/local/bin/
+        uses: nf-core/setup-nextflow@v2
+        with:
+          version: '24.10.5'

       - name: Run Nextflow pipeline
         run: nextflow run main.nf -profile docker,test
```

README.md

Lines changed: 73 additions & 18 deletions

```diff
@@ -88,6 +88,9 @@ Convert OMA run into OMA Browser release
 | Parameter | Description | Type | Default | Required |
 |-----------|-----------|-----------|-----------|-----------|
 | `oma_source` | Selection of OMA data source. Can be either 'FastOMA' or 'Production'. The selection requires setting either the parameters for FastOMA or Production. | `string` | FastOMA | |
+| `oma_version` | Version of the OMA Browser instance. It defaults to 'All.<Mon><YEAR>' | `string` | | | |
+| `oma_release_char` | Release specific character (used in HOG ids) <details><summary>Help</summary><small>A single capital letter [A-Z] which makes the HOG-IDs unique across different releases.</small></details> | `string` | | | |

 ### FastOMA Input data

@@ -109,43 +112,95 @@ Input files generated from an OMA Production run
 | `matrix_file` | OMA Groups file | `string` | | |
 | `hog_orthoxml` | Hierarchical orthologous groups (HOGs) in OrthoXML format | `string` | | True |
 | `genomes_dir` | Folder containing genomes | `string` | | True |
+| `homoeologs_folder` | Folder containing the homoeologs files | `string` | | | |

 ### Domain data

 File paths for domain annotations

 | Parameter | Description | Type | Default | Required |
 |-----------|-----------|-----------|-----------|-----------|
-| `cath_names_path` | File containing CATH domain descriptions | `string` | http://download.cathdb.info/cath/releases/latest-release/cath-classification-data/cath-names.txt | |
-| `known_domains` | Folder containing known domain assignments files | `string` | | |
-| `pfam_names_path` | File containing Pfam descriptions | `string` | https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz | |
+| `infer_domains` | Flag indicating whether domains are inferred using the CATH/Gene3D pipeline. <details><summary>Help</summary><small>If set to true, the pipeline will run the CATH/Gene3D pipeline to infer domain assignments. This will require a substantial amount of compute time. The set of already known domains (see parameter 'known_domains') will be used to skip the inference of domains that are already known. If set to false, the pipeline will use the known domain assignments provided in the 'known_domains' parameter.</small></details> | `boolean` | | | |
+| `known_domains` | Folder containing known domain assignment files. <details><summary>Help</summary><small>The folder must contain csv/tsv files that contain three columns (md5 hash of sequence, CATH-domain-id, region on sequence). The output of a previous run of this pipeline can thus be used as input.</small></details> | `string` | | | |
+| `cath_names_path` | File containing CATH domain descriptions | `string` | http://download.cathdb.info/cath/releases/latest-release/cath-classification-data/cath-names.txt | | |
+| `hmm_db` | Path where the domain HMMs for the CATH/Gene3D pipeline are located. | `string` | ftp://orengoftp.biochem.ucl.ac.uk/gene3d/v21.0.0/gene3d_hmmsearch/hmms.tar.gz | | |
+| `cath_domain_list` | File with mapping from HMM id to CATH domain id. | `string` | http://download.cathdb.info/cath/releases/latest-release/cath-classification-data/cath-domain-list.txt | | |
+| `discontinuous_regs` | File provided by Gene3D to handle discontinuous regions | `string` | http://download.cathdb.info/gene3d/v21.0.0/gene3d_hmmsearch/discontinuous/discontinuous_regs.pkl | | |
+| `pfam_names_path` | File containing Pfam descriptions | `string` | https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz | | |

 ### Crossreferences

 Integrate crossreferences

-| Parameter | Description | Type | Default | Required |
-|-----------|-----------|-----------|-----------|-----------|
-| `xref_uniprot_swissprot` | UniProtKB/SwissProt annotation in text format | `string` | https://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz | |
-| `xref_uniprot_trembl` | UniProtKB/TrEMBL annotations in text format | `string` | /dev/null | |
-| `taxonomy_sqlite_path` | | `string` | | |
-| `xref_refseq` | Folder containing RefSeq gbff files. | `string` | | |
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `xref_uniprot_swissprot` | UniProtKB/SwissProt annotation in text format | `string` | https://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz | | |
+| `xref_uniprot_trembl` | UniProtKB/TrEMBL annotations in text format. <details><summary>Help</summary><small>If not provided, no TrEMBL cross-references will be included. The generic ftp url for TrEMBL is https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.dat.gz</small></details> | `string` | | | |
+| `taxonomy_sqlite_path` | Path to a sqlite database containing the combined NCBI/GTDB taxonomy data. <details><summary>Help</summary><small>If not provided, it will be generated automatically and cached.</small></details> | `string` | | | |
+| `xref_refseq` | 'download' or folder containing RefSeq gbff files. <details><summary>Help</summary><small>If not specified, no RefSeq crossreferences will be downloaded (default). If set to 'download', the latest RefSeq gbff files will be downloaded from the NCBI FTP server. Alternatively, a folder containing local *.gbff.gz files can be provided.</small></details> | `string` | | | |

 ### Gene Ontology

 Gene Ontology files to integrate

-| Parameter | Description | Type | Default | Required |
-|-----------|-----------|-----------|-----------|-----------|
-| `go_obo` | Gene Ontology OBO file | `string` | http://purl.obolibrary.org/obo/go/go-basic.obo | |
-| `go_gaf` | Gene Ontology annotations (GAF format). This can be the GOA database or a glob pattern with local files in GAF format. | `string` | https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz | |
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `go_obo` | Gene Ontology OBO file | `string` | http://purl.obolibrary.org/obo/go/go-basic.obo | | |
+| `go_gaf` | Gene Ontology annotations (GAF format). This can be the GOA database or a glob pattern with local files in GAF format. | `string` | https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz | | |
+
+### OMAmer
+
+Parameters regarding building OMAmer databases based on the generated OMA instance
+
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `omamer_levels` | Comma-separated list of taxonomic levels for which OMAmer databases should be built. <details><summary>Help</summary><small>The input string is parsed as a comma-separated list, e.g. given 'Mammalia,Primates' as parameter value would build two OMAmer databases, one for Mammalia and one for Primates. Note that the taxonomic levels must exist in the input species tree.</small></details> | `string` | | | |
+
+### Exporting as RDF
+
+Parameters regarding the export as RDF triples
+
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `rdf_export` | Flag to activate export as RDF triples <details><summary>Help</summary><small>Activating rdf_export will enable the dump of RDF ttl files which can be imported into a SPARQL endpoint.</small></details> | `boolean` | | | |
+| `rdf_orthOntology` | User-provided orthOntology file. If not provided, the default ontology will be used | `string` | | | |
+| `rdf_prefixes` | User-provided RDF prefix mapping. If not provided, default prefixes will be used. | `string` | | | |
+
+### Production OMA output settings
+
+Parameters concerning additional output files usually needed for the production OMA Browser instance
+
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `oma_dumps` | Flag to activate dumping various files for the download section <details><summary>Help</summary><small>Activating oma_dumps will enable species, sequences, and GO annotation files as text files for the download section.</small></details> | `boolean` | | | |

 ### Generic options

 Less common options for the pipeline, typically set in a config file.

-| Parameter | Description | Type | Default | Required |
-|-----------|-----------|-----------|-----------|-----------|
-| `custom_config_version` | version of configuration base to include (nf-core configs) | `string` | master | |
-| `custom_config_base` | location where to look for nf-core/configs | `string` | https://raw.githubusercontent.com/nf-core/configs/master | |
-
+| Parameter | Description | Type | Default | Required | Hidden |
+|-----------|-----------|-----------|-----------|-----------|-----------|
+| `help` | Display help text. | `boolean` | | | True |
+| `custom_config_version` | version of configuration base to include (nf-core configs) | `string` | master | | True |
+| `custom_config_base` | location where to look for nf-core/configs | `string` | https://raw.githubusercontent.com/nf-core/configs/master | | True |
```
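As the `omamer_levels` help text above describes, the parameter value is parsed as a comma-separated list of taxonomic levels, one OMAmer database per level. A minimal Python sketch of that parsing; the function name `parse_omamer_levels` is hypothetical and not part of the pipeline:

```python
def parse_omamer_levels(value: str) -> list[str]:
    """Split a comma-separated levels string into clean level names.

    Hypothetical helper illustrating how a value such as
    'Mammalia,Primates' yields one OMAmer database per level.
    """
    return [level.strip() for level in value.split(",") if level.strip()]


levels = parse_omamer_levels("Mammalia, Primates")
print(levels)  # ['Mammalia', 'Primates']
```

Each returned level name would then still need to exist in the input species tree, as the parameter help notes.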

config/base.config

Lines changed: 100 additions & 8 deletions

```diff
@@ -69,14 +69,14 @@ process {
     }

     withName: '.*:BUILD_HOG_H5' {
-        cpus = { 1 }
-        memory = { 500.MB + 500.B * orthoxml.size() * (2*task.attempt-1) }
-        time = { 1.min * (Math.ceil( meta.nr_of_genomes / 3 ) + 10) * task.attempt }
+        cpus = { meta.nr_of_taxa > 250 ? 3 : 1 }
+        memory = { 1.GB + 20.KB * Math.ceil(orthoxml.size() / 1024) * (2*task.attempt-1) }
+        time = { 1.min * (Math.ceil( meta.nr_of_genomes / 3 ) + Math.ceil( meta.nr_of_taxa / (meta.nr_of_taxa > 250 ? 9 : 3)) + 10) * task.attempt }
     }

     withName: ".*:ADD_PAIRWISE_ORTHOLOGS" {
         cpus = { meta.nr_of_genomes < 10 ? 2 : (meta.nr_of_genomes < 30 ? 4 : (meta.nr_of_genomes < 300 ? 6 : 12)) }
-        memory = { 3.GB + 10.B * (meta.max_nr_seqs_in_genome * meta.nr_of_genomes * 10 as Long) * task.attempt }
+        memory = { 3.GB + 100.KB * (Math.ceil(meta.max_nr_seqs_in_genome/1024) * meta.nr_of_genomes) * task.attempt }
         time = { 1.min * (Math.ceil( meta.nr_of_genomes / 2 ) + 10) * task.attempt }
     }

@@ -88,20 +88,112 @@ process {

     withName: "INFER_FINGERPRINTS" {
         cpus = { 1 }
-        memory = { (800.MB + meta.nr_of_amino_acids * 2.B) * task.attempt }
-        time = { (10.min + Math.ceil(meta.nr_of_sequences / 100) * 1.sec) * (2 * task.attempt - 1) }
+        memory = { (800.MB + meta.nr_of_amino_acids * 24.B) * task.attempt }
+        time = { 1.min * (0.5 * meta.nr_of_amino_acids / Math.pow(2,20) * Math.log(meta.nr_of_amino_acids / Math.pow(2, 20)) + 10) * task.attempt }
     }

     withName: "INFER_KEYWORDS" {
         cpus = { 1 }
-        memory = { (800.MB + meta.nr_of_sequences * 500.B) * task.attempt }
+        memory = { (800.MB + meta.nr_of_sequences * 300.B) * task.attempt }
         time = { (10.min + Math.ceil(meta.nr_of_sequences / 500) * 1.sec) * (2 * task.attempt - 1) }
     }

+    withName: "IDENTIFY_PROTEINS_WITHOUT_DOMAIN_ANNOTATION" {
+        cpus = { 1 }
+        memory = {
+            def tot_size = domain_files.collect { it.size() }.sum()
+            // scale memory with input size, but at least 6GB
+            def gb = Math.max(6, 4 * Math.ceil(tot_size / Math.pow(2,30)) + 3)
+            return gb * 1.GB * task.attempt
+        }
+        time = { 4.h * task.attempt }
+    }
+
+    withName: "INFER_HOG_PROFILES" {
+        cpus = { (meta.nr_of_sequences < 1000000 ? 2 : (meta.nr_of_sequences < 10000000 ? 4 : 6)) * task.attempt }
+        time = { (meta.nr_of_sequences > 10000000 ? 24.h : 8.h) * task.attempt }
+        // memory should be 6GB per cpu.
+        memory = { 6.GB * (meta.nr_of_sequences < 1000000 ? 2 : (meta.nr_of_sequences < 10000000 ? 4 : 6)) * task.attempt }
+    }
+
     withName: "HMMER_HMMSEARCH" {
         cpus = { 4 }
         memory = { 1.GB * (2*task.attempt-1) }
         time = { 2.h * (2*task.attempt-1) }
     }

-}
+    withName: ".*:BUILD_VPTAB_DATABASE" {
+        cpus = { 8 }
+        memory = { Math.max(Math.ceil(db.size()/Math.pow(2, 30)), 20) * 1.GB * task.attempt * task.cpus }
+        time = { 12.h * task.attempt }
+    }
+
+    withName: ".*:COMPUTE_CACHE" {
+        cpus = { 1 }
+        memory = { 6.GB * task.attempt }
+        time = { 24.h * task.attempt }
+    }
+
+    withName: ".*:COMBINE_H5_FILES" {
+        cpus = { 1 }
+        memory = { 3.GB + meta.nr_of_sequences * 2.KB * task.attempt }
+        time = { (2.h + meta.nr_of_genomes * 10.sec) * (2 * task.attempt - 1) }
+    }
+
+    withName: ".*:FILTER_AND_SPLIT" {
+        cpus = {
+            def nr_files = xref instanceof List ? xref.size() : 1
+            def base_nr = nr_files < 4 ? 1 : (nr_files < 12 ? 2 : 6)
+            return base_nr * task.attempt
+        }
+        memory = { 3.GB + task.cpus * 400.MB * task.attempt }
+        time = { 8.h * task.attempt }
+    }
+
+    withName: ".*:MAP_XREFS" {
+        cpus = { 6 }
+        memory = { 2.GB + task.cpus * Math.ceil(meta.nr_of_sequences) * 1.KB * (2 * task.attempt - 1) }
+        time = { 20.h * (2 * task.attempt - 1) }
+    }
+
+    withName: ".*:COLLECT_XREFS" {
+        cpus = { 1 }
+        memory = {
+            def nr_files = map_results instanceof List ? map_results.size() : 1
+            return 6.GB + (nr_files * 128.MB) * task.attempt
+        }
+        time = { 12.h * task.attempt }
+    }
+
+    withName: ".*:COMBINE_ALL_XREFS" {
+        cpus = { 1 }
+        memory = {
+            def total_size = 0
+            if (xref_dbs instanceof List) {
+                total_size = 2 * xref_dbs.collect { it.size() }.sum() / Math.pow(2,20)
+            } else {
+                total_size = 2 * xref_dbs.size() / Math.pow(2,20)
+            }
+            def log2_scale = Math.log(Math.max(total_size, 2)) / Math.log(2)
+            return (6.GB + 1.MB * total_size * log2_scale) * task.attempt
+        }
+        time = { 1.min * Math.ceil(meta.nr_of_sequences / 12000) * (2 * task.attempt - 1) }
+    }
+
+    withName: "OMAMER_BUILD" {
+        cpus = { 1 }
+        memory = {
+            def basemem = 36.GB
+            def multiplier = meta.id == "LUCA" ? 3 : (meta.id == "Metazoa" ? 2.5 : (meta.id == "Viridiplantae" ? 1 : 0.5))
+            return basemem * multiplier * (2*task.attempt-1)
+        }
+        time = { 8.h * task.attempt }
+    }
+
+    withName: "HOGPROP" {
+        // 6.GB -> 24.GB -> 54.GB; some big hogs might need a lot of memory
+        memory = { 6.GB * task.attempt * task.attempt }
+        maxRetries = 2
+    }
+}
```
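These resource closures scale memory with input size and escalate on retry via `task.attempt`; the recurring `(2*task.attempt-1)` factor yields 1x, 3x, 5x on successive attempts. A rough Python model of the new `BUILD_HOG_H5` memory rule, useful for sanity-checking the numbers (the unit constants and function name are illustrative, not part of the config):

```python
import math

KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3


def build_hog_h5_memory_bytes(orthoxml_size: int, attempt: int) -> int:
    """Model of: 1.GB + 20.KB * ceil(orthoxml.size() / 1024) * (2*attempt - 1).

    I.e. a 1 GiB base plus 20 KiB per KiB of orthoxml input,
    growing 1x, 3x, 5x across retry attempts.
    """
    return GB + 20 * KB * math.ceil(orthoxml_size / 1024) * (2 * attempt - 1)


# A 100 MiB orthoxml on the first attempt:
mem = build_hog_h5_memory_bytes(100 * MB, attempt=1)
print(mem / GB)  # 2.953125 GiB: 1 GiB base + ~1.95 GiB proportional part
```

On the second attempt the proportional part triples, so the same input would request roughly 6.9 GiB.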

config/euler_hpc.config

Lines changed: 5 additions & 2 deletions

```diff
@@ -1,19 +1,22 @@
 // Process scope
 process {
     // Node options
-    resourceLimits = [ cpus: 48, memory: 350.GB, time: 72.h ]
+    resourceLimits = [ cpus: 48, memory: 450.GB, time: 72.h ]
     scratch = true
     containerOptions = "--bind /scratch:/scratch"
     beforeScript = 'module load eth_proxy'

     withLabel: HIGH_IO_ACCESS {
-        stageInMode = "copy"
+        //stageInMode = "copy"
         scratch = true
     }
+
 }

 executor {
     name = "slurm"
     perCpuMemAllocation = true
     queueSize = 500
 }
+
+
```
config/test.config

Lines changed: 1 addition & 1 deletion

```diff
@@ -31,6 +31,6 @@ params {
     genomes_dir = "${projectDir}/testdata/fastoma/proteome"
     taxonomy_sqlite_path = "${projectDir}/testdata/taxonomy.sqlite"
     pfam_names_path = "${projectDir}/testdata/Pfam-A.clans.stub.tsv.gz"
-    xref_refseq = "${projectDir}/assets/NO_FILE"
+    cath_names_path = "${projectDir}/testdata/cath-names.txt"
     go_gaf = "${projectDir}/testdata/fastoma/*.goa"
 }
```
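The `go_gaf` test value is a glob pattern rather than a single file, matching the README's note that GO annotations may be given as "a glob pattern with local files in gaf format". A small Python illustration of how such a pattern expands (the file names are made up for the demo):

```python
import glob
import os
import tempfile

# Build a throwaway directory with a couple of .goa files to expand against.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("speciesA.goa", "speciesB.goa", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()

    # Equivalent of expanding "${projectDir}/testdata/fastoma/*.goa".
    matches = sorted(glob.glob(os.path.join(tmp, "*.goa")))
    names = [os.path.basename(m) for m in matches]

print(names)  # ['speciesA.goa', 'speciesB.goa']
```

Only the `.goa` files match; unrelated files in the same folder are ignored.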

containers/oma/Dockerfile

Lines changed: 1 addition & 1 deletion

```diff
@@ -7,7 +7,7 @@ FROM basis AS builder
 RUN apt-get update \
     && apt-get install -y --no-install-recommends \
         build-essential \
-        libhdf5-103 \
+        libhdf5-310 \
         libhdf5-dev \
         git-core \
         pkg-config \
```

0 commit comments