-
Notifications
You must be signed in to change notification settings - Fork 3
feat: Add VEP annotation with REVEL, Sift and PolyPhen Scores to fp… #154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds a VEP/REVEL annotation pipeline and new helper functions, updates FP/FN VCF output paths to an annotated_vcf directory, and introduces curl/htslib conda environment manifests. Changes
Sequence DiagramsequenceDiagram
participant FP_FN as FP/FN VCFs
participant Cache as VEP Cache/Plugins
participant Revel as REVEL data (tsv + tbi)
participant VEP as VEP Annotator
participant Output as Annotated VCF + stats
rect rgb(230,240,255)
Cache->>Cache: get_vep_cache
Cache->>Cache: get_vep_plugins
Revel->>Revel: download_revel
Revel->>Revel: process_revel_scores (GRCh37/GRCh38)
Revel->>Revel: tabix_revel_scores
end
rect rgb(230,255,230)
FP_FN->>VEP: annotate_shared_fn / annotate_unique_fp_fn (input VCFs)
Cache->>VEP: provide VEP cache & plugins
Revel->>VEP: provide REVEL TSV + index
VEP->>Output: produce `.annotated.vcf.gz` + `.html` stats
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (2)
workflow/rules/common.smk (1)
744-748: Type inconsistency: returns string vs empty list.
get_plugin_auxreturns a string whenplugin == "REVEL"but an empty list[]otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.🔎 Proposed fix for consistent return type
def get_plugin_aux(plugin, index=False): if plugin == "REVEL": suffix = ".tbi" if index else "" return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix) - return [] + return ""workflow/envs/htslib.yaml (1)
1-6: Consider updating htslib to version 1.22.1.htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
workflow/Snakefile(2 hunks)workflow/envs/curl.yaml(1 hunks)workflow/envs/htslib.yaml(1 hunks)workflow/rules/annotation.smk(1 hunks)workflow/rules/common.smk(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.
Applied to files:
workflow/Snakefileworkflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Testing
🔇 Additional comments (4)
workflow/rules/common.smk (1)
738-741: LGTM!The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.
workflow/Snakefile (2)
18-18: LGTM!The annotation rules are correctly included within the
variant-callsconfig block, making them conditionally available only when variant calls are configured.
33-41: LGTM!The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.
workflow/rules/annotation.smk (1)
76-99: Clarify theaccess.random()version requirement and confirm intentional parameter difference.The
access.random()function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters:annotate_shared_fnincludes--sift b --polyphen bwhileannotate_unique_fp_fnomits these flags. Verify this difference aligns with your analysis requirements.
| rule download_revel: | ||
| output: | ||
| temp("resources/revel_scores.zip"), | ||
| log: | ||
| "logs/vep_plugins/download_revel.log", | ||
| conda: | ||
| "../envs/curl.yaml" | ||
| shell: | ||
| "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add -L flag to curl for redirect handling.
The Zenodo URL may redirect. Adding -L ensures curl follows redirects. Also consider adding --fail to exit with error on HTTP failures.
🔎 Proposed fix
shell:
- "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+ "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 26 to 34, the curl command used to
download the Zenodo file doesn't follow HTTP redirects or fail on HTTP errors;
update the shell command to include the -L flag to follow redirects and add
--fail (and optionally --show-error) so curl exits with a failure status on HTTP
errors and prints errors to the log; keep the redirection to the rule log
intact.
| rule process_revel_scores: | ||
| input: | ||
| "resources/revel_scores.zip", | ||
| output: | ||
| "resources/revel_scores.tsv.gz", | ||
| params: | ||
| build=get_reference_genome_build(), | ||
| log: | ||
| "logs/vep_plugins/process_revel_scores.log", | ||
| conda: | ||
| "../envs/htslib.yaml" | ||
| shell: | ||
| """ | ||
| tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) | ||
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | ||
| if [ "{params.build}" == "GRCh38" ] ; then | ||
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | ||
| elif [ "{params.build}" == "GRCh37" ] ; then | ||
| cat $tmpfile > {output} | ||
| else | ||
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | ||
| exit 125 | ||
| fi | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing resources declaration for tmpdir and temp file cleanup.
The shell script references {resources.tmpdir} but no resources block is declared. This will cause a runtime error. Also, the temporary file $tmpfile is not removed after use.
🔎 Proposed fix
rule process_revel_scores:
input:
"resources/revel_scores.zip",
output:
"resources/revel_scores.tsv.gz",
params:
build=get_reference_genome_build(),
log:
"logs/vep_plugins/process_revel_scores.log",
+ resources:
+ tmpdir=temp("tmpdir"),
conda:
"../envs/htslib.yaml"
shell:
"""
- tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+ tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+ trap "rm -f $tmpfile" EXIT
unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
if [ "{params.build}" == "GRCh38" ] ; then
zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
elif [ "{params.build}" == "GRCh37" ] ; then
cat $tmpfile > {output}
else
echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
exit 125
fi
"""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| rule process_revel_scores: | |
| input: | |
| "resources/revel_scores.zip", | |
| output: | |
| "resources/revel_scores.tsv.gz", | |
| params: | |
| build=get_reference_genome_build(), | |
| log: | |
| "logs/vep_plugins/process_revel_scores.log", | |
| conda: | |
| "../envs/htslib.yaml" | |
| shell: | |
| """ | |
| tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) | |
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | |
| if [ "{params.build}" == "GRCh38" ] ; then | |
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | |
| elif [ "{params.build}" == "GRCh37" ] ; then | |
| cat $tmpfile > {output} | |
| else | |
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | |
| exit 125 | |
| fi | |
| """ | |
| rule process_revel_scores: | |
| input: | |
| "resources/revel_scores.zip", | |
| output: | |
| "resources/revel_scores.tsv.gz", | |
| params: | |
| build=get_reference_genome_build(), | |
| log: | |
| "logs/vep_plugins/process_revel_scores.log", | |
| resources: | |
| tmpdir=temp("tmpdir"), | |
| conda: | |
| "../envs/htslib.yaml" | |
| shell: | |
| """ | |
| tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX) | |
| trap "rm -f $tmpfile" EXIT | |
| unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile | |
| if [ "{params.build}" == "GRCh38" ] ; then | |
| zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} | |
| elif [ "{params.build}" == "GRCh37" ] ; then | |
| cat $tmpfile > {output} | |
| else | |
| echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} | |
| exit 125 | |
| fi | |
| """ |
🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 37 to 60, the rule
process_revel_scores uses {resources.tmpdir} but no resources block is declared
and the temporary file $tmpfile is never cleaned up; add a resources:
tmpdir=<appropriate_tmpdir_path> declaration to the rule (or use resources:
tmpdir=configured_tmpdir variable) so Snakemake provides the directory, and
modify the shell block to create the temp file inside that tmpdir and remove it
on exit (use a trap to rm -f $tmpfile on EXIT or ensure rm -f $tmpfile at the
end of the script) so the temp file is always cleaned up even on errors.
…s/dna-seq-benchmark into feat/add-revel-scores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)
26-34: Add-Land--failflags to curl command.The curl command should include
-Lto follow redirects and--failto exit with error status on HTTP failures, ensuring robust downloads from Zenodo.🔎 Proposed fix
shell: - "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}" + "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
37-60: Missingresourcesblock causes runtime error.The shell script references
{resources.tmpdir}on line 50, but noresourcesblock is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.🔎 Proposed fix
rule process_revel_scores: input: "resources/revel_scores.zip", output: "resources/revel_scores.tsv.gz", params: build=get_reference_genome_build(), log: "logs/vep_plugins/process_revel_scores.log", + resources: + tmpdir=config.get("tmpdir", "/tmp"), conda: "../envs/htslib.yaml" shell: """ - tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX) + tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX) + trap 'rm -f "$tmpfile"' EXIT unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile if [ "{params.build}" == "GRCh38" ] ; then zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output} elif [ "{params.build}" == "GRCh37" ] ; then cat $tmpfile > {output} else echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log} exit 125 fi """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)
102-125: LGTM! SIFT and PolyPhen flags are correctly included.The rule structure is correct and now includes the
--sift b --polyphen bflags in theextraparameter (line 118) as intended per the PR objectives.Note: There's a minor trailing space after
"...polyphen b "at line 118 that can be trimmed (optional formatting nitpick).
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
workflow/envs/curl.yaml(1 hunks)workflow/rules/annotation.smk(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- workflow/envs/curl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.
Applied to files:
workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Testing
🔇 Additional comments (5)
workflow/rules/annotation.smk (5)
15-23: LGTM!Standard VEP plugins setup with appropriate wrapper usage.
12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.
76-99: No issues found. The helper functionget_plugin_aux()is correctly implemented inworkflow/rules/common.smk, the lambda function calls with arguments are proper, andaccess.random()is the correct Snakemake API for this resource access pattern. The trailing space in theextraparameter at line 92 can optionally be trimmed for consistency.
63-73: Rule structure and implementation are correct.The
get_tabix_revel_params()function inworkflow/rules/common.smk(lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.
1-12: Helper function is properly implemented and returns correct build strings.The
get_reference_genome_build()function inworkflow/rules/common.smkis correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:
- "GRCh37" for grch37 configuration
- "GRCh38" for grch38 configuration
The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.
…/fn vcfs.
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.