feat: Add VEP annotation with REVEL, Sift and PolyPhen Scores to fp… #154

BiancaStoecker · 2025-12-19T13:10:51Z

…/fn vcfs.

Summary by CodeRabbit

New Features
- Added VEP+REVEL variant annotation to the pipeline, enriching FP/FN VCFs with effect predictions and REVEL scores.
- FP/FN shared and unique callset VCFs are now produced as annotated VCFs (with accompanying stats/HTML).
Chores
- Updated workflow tool environments for improved tooling and indexing support.

coderabbitai · 2025-12-19T13:11:01Z

Walkthrough

Adds a VEP/REVEL annotation pipeline and new helper functions, updates FP/FN VCF output paths to an annotated_vcf directory, and introduces curl/htslib conda environment manifests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Snakefile & path updates `workflow/Snakefile` | Includes the new annotation rules module and changes FP/FN VCF outputs to `results/fp-fn/annotated_vcf/...` (renaming from `vcf/` and updating filenames to `.annotated.vcf.gz`). |
| Annotation rules `workflow/rules/annotation.smk` | New Snakemake rules: `get_vep_cache`, `get_vep_plugins`, `download_revel`, `process_revel_scores`, `tabix_revel_scores`, `annotate_shared_fn`, `annotate_unique_fp_fn` (VEP + REVEL integration, per-build handling, logs, wrappers). |
| Shared helpers `workflow/rules/common.smk` | Added `get_tabix_revel_params()` (column selection by reference-genome) and `get_plugin_aux(plugin, index=False)` (REVEL resource/path helper). |
| Conda environments `workflow/envs/curl.yaml`, `workflow/envs/htslib.yaml` | `curl.yaml` updated to curl=8.17.0; `htslib.yaml` added (htslib=1.12, unzip=6.0) with channels conda-forge/bioconda. |

Sequence Diagram

sequenceDiagram
    participant FP_FN as FP/FN VCFs
    participant Cache as VEP Cache/Plugins
    participant Revel as REVEL data (tsv + tbi)
    participant VEP as VEP Annotator
    participant Output as Annotated VCF + stats

    rect rgb(230,240,255)
    Cache->>Cache: get_vep_cache
    Cache->>Cache: get_vep_plugins
    Revel->>Revel: download_revel
    Revel->>Revel: process_revel_scores (GRCh37/GRCh38)
    Revel->>Revel: tabix_revel_scores
    end

    rect rgb(230,255,230)
    FP_FN->>VEP: annotate_shared_fn / annotate_unique_fp_fn (input VCFs)
    Cache->>VEP: provide VEP cache & plugins
    Revel->>VEP: provide REVEL TSV + index
    VEP->>Output: produce `.annotated.vcf.gz` + `.html` stats
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Reviewers should focus on:
- workflow/rules/annotation.smk (shell/conda commands, per-build branching, wrapper parameters).
- workflow/Snakefile path updates to ensure all downstream rules reference annotated_vcf correctly.
- workflow/rules/common.smk helper logic for tabix column selection and plugin auxiliary paths.
- Environment files (workflow/envs/*.yaml) for correct versions and channels.

Possibly related PRs

feat: add fp-fn-filtering and write to vcf #137 — introduces the FP/FN VCF generation flow whose outputs are now consumed/renamed by this PR for annotation.

Suggested reviewers

famosab
johanneskoester

Poem

🐰 I nibbled through VCFs all night,
VEP and REVEL made variants bright.
Caches fetched, scores zipped with cheer,
Annotated files hop into here! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main changes in the PR, which add VEP annotation with REVEL scoring to false positive/false negative VCF files.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

workflow/rules/common.smk (1)
744-748: Type inconsistency: returns string vs empty list.

get_plugin_aux returns a string when plugin == "REVEL" but an empty list [] otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.
🔎 Proposed fix for consistent return type
 def get_plugin_aux(plugin, index=False):
     if plugin == "REVEL":
         suffix = ".tbi" if index else ""
         return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix)
-    return []
+    return ""
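
The alternative mentioned above, consistently returning a list, could look like the following sketch. It assumes callers only pass the result on as a Snakemake input and never treat it as a plain string; that assumption is not confirmed by the diff.

```python
def get_plugin_aux(plugin, index=False):
    """Return the auxiliary input files of a VEP plugin as a list (possibly empty)."""
    if plugin == "REVEL":
        # The tabix index sits next to the score table and only adds ".tbi".
        suffix = ".tbi" if index else ""
        return [f"resources/revel_scores.tsv.gz{suffix}"]
    # Plugins without auxiliary files contribute no inputs.
    return []
```

With a list in both branches, the caller never has to distinguish "no file" from "empty path".
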
workflow/envs/htslib.yaml (1)

1-6: Consider updating htslib to version 1.22.1.

htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d820558 and 30894c7.

📒 Files selected for processing (5)

workflow/Snakefile (2 hunks)
workflow/envs/curl.yaml (1 hunks)
workflow/envs/htslib.yaml (1 hunks)
workflow/rules/annotation.smk (1 hunks)
workflow/rules/common.smk (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-11T08:47:27.118Z

Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

workflow/Snakefile
workflow/rules/annotation.smk

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Testing

🔇 Additional comments (4)

workflow/rules/common.smk (1)

738-741: LGTM!

The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.

workflow/Snakefile (2)

18-18: LGTM!

The annotation rules are correctly included within the variant-calls config block, making them conditionally available only when variant calls are configured.

33-41: LGTM!

The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.

workflow/rules/annotation.smk (1)

76-99: Clarify the access.random() version requirement and confirm intentional parameter difference.

The access.random() function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters: annotate_shared_fn includes --sift b --polyphen b while annotate_unique_fp_fn omits these flags. Verify this difference aligns with your analysis requirements.

workflow/envs/curl.yaml

coderabbitai · 2025-12-19T13:18:18Z

workflow/rules/annotation.smk

+rule download_revel:
+    output:
+        temp("resources/revel_scores.zip"),
+    log:
+        "logs/vep_plugins/download_revel.log",
+    conda:
+        "../envs/curl.yaml"
+    shell:
+        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"


⚠️ Potential issue | 🟡 Minor

Add -L flag to curl for redirect handling.

The Zenodo URL may redirect. Adding -L ensures curl follows redirects. Also consider adding --fail to exit with error on HTTP failures.

🔎 Proposed fix

     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

🤖 Prompt for AI Agents

In workflow/rules/annotation.smk around lines 26 to 34, the curl command used to download the Zenodo file doesn't follow HTTP redirects or fail on HTTP errors; update the shell command to include the -L flag to follow redirects and add --fail (and optionally --show-error) so curl exits with a failure status on HTTP errors and prints errors to the log; keep the redirection to the rule log intact.
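
As an illustration of the suggested flags, the sketch below assembles the download command as a string. The helper name `build_curl_command` is hypothetical and not part of the workflow; `-L`, `--fail`, `--show-error` and `-o` are standard curl options, and `&>` relies on bash, which Snakemake uses for shell commands.

```python
import shlex


def build_curl_command(url, output, log):
    """Build a curl call that follows redirects and fails on HTTP errors."""
    args = [
        "curl",
        "-L",            # follow HTTP redirects
        "--fail",        # non-zero exit status on HTTP errors (4xx/5xx)
        "--show-error",  # still print the error message to stderr
        url,
        "-o",
        output,
    ]
    quoted = " ".join(shlex.quote(arg) for arg in args)
    # Redirect stdout and stderr to the rule log (bash syntax).
    return f"{quoted} &> {shlex.quote(log)}"
```
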

coderabbitai · 2025-12-19T13:18:18Z

workflow/rules/annotation.smk

+rule process_revel_scores:
+    input:
+        "resources/revel_scores.zip",
+    output:
+        "resources/revel_scores.tsv.gz",
+    params:
+        build=get_reference_genome_build(),
+    log:
+        "logs/vep_plugins/process_revel_scores.log",
+    conda:
+        "../envs/htslib.yaml"
+    shell:
+        """
+        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
+        if [ "{params.build}" == "GRCh38" ] ; then
+            zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
+        elif [ "{params.build}" == "GRCh37" ] ; then
+            cat $tmpfile > {output}
+        else
+            echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
+            exit 125
+        fi
+        """


⚠️ Potential issue | 🔴 Critical

Missing resources declaration for tmpdir and temp file cleanup.

The shell script references {resources.tmpdir} but no resources block is declared. This will cause a runtime error. Also, the temporary file $tmpfile is not removed after use.

🔎 Proposed fix

 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=temp("tmpdir"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """

🤖 Prompt for AI Agents

In workflow/rules/annotation.smk around lines 37 to 60, the rule process_revel_scores uses {resources.tmpdir} but no resources block is declared and the temporary file $tmpfile is never cleaned up; add a resources: tmpdir=<appropriate_tmpdir_path> declaration to the rule (or use resources: tmpdir=configured_tmpdir variable) so Snakemake provides the directory, and modify the shell block to create the temp file inside that tmpdir and remove it on exit (use a trap to rm -f $tmpfile on EXIT or ensure rm -f $tmpfile at the end of the script) so the temp file is always cleaned up even on errors.
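
To make the per-build branching of the shell pipeline easier to follow, here is a rough Python equivalent that works on plain text lines instead of compressed files. The function name is hypothetical, and the column names and values in the example are made up; only the column positions (chromosome first, GRCh38 position third) are taken from the `awk` and `sort` calls in the rule. Python's string ordering may also differ from `sort` under some locales.

```python
def process_revel_lines(csv_lines, build):
    """Mimic the rule's shell pipeline on an iterable of CSV lines (header first)."""
    if build not in ("GRCh37", "GRCh38"):
        raise ValueError(
            "Annotation of REVEL scores only supported for GRCh37 or GRCh38"
        )
    rows = [line.rstrip("\n").split(",") for line in csv_lines]
    header, records = rows[0], rows[1:]
    # tr "," "\t" plus sed '1s/.*/#&/': tab-separate and comment out the header.
    header_line = "#" + "\t".join(header)
    if build == "GRCh38":
        # awk '$3 != "."': drop records without a GRCh38 position.
        records = [r for r in records if r[2] != "."]
        # sort -k1,1 -k3,3n: by chromosome as text, then numerically by position.
        records.sort(key=lambda r: (r[0], int(r[2])))
    return [header_line] + ["\t".join(r) for r in records]


# Hypothetical input, for illustration only.
example = [
    "chr,hg19_pos,grch38_pos,ref,alt,REVEL",
    "1,35142,35142,G,A,0.027",
    "1,35143,.,G,C,0.035",
    "1,35150,35100,T,A,0.040",
]
grch38_table = process_revel_lines(example, "GRCh38")
```

For GRCh37 the records pass through unchanged, which matches the `cat $tmpfile > {output}` branch.
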

workflow/rules/annotation.smk

…s/dna-seq-benchmark into feat/add-revel-scores

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

workflow/rules/annotation.smk (2)

26-34: Add -L and --fail flags to curl command.

The curl command should include -L to follow redirects and --fail to exit with error status on HTTP failures, ensuring robust downloads from Zenodo.

🔎 Proposed fix

     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

37-60: Missing resources block causes runtime error.

The shell script references {resources.tmpdir} on line 50, but no resources block is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.

🔎 Proposed fix

 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=config.get("tmpdir", "/tmp"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX)
+        trap 'rm -f "$tmpfile"' EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
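
The cleanup guarantee that `trap ... EXIT` gives in the shell version can be illustrated in Python with `try`/`finally`. This is only a sketch of the pattern, not code from the workflow; the helper name `with_scratch_file` is hypothetical.

```python
import os
import tempfile


def with_scratch_file(work, tmpdir=None):
    """Run work(path) on a fresh temp file and always delete the file afterwards."""
    # mkstemp creates the file securely; dir=None falls back to the system tmpdir.
    fd, path = tempfile.mkstemp(prefix="revel_scores.", dir=tmpdir)
    os.close(fd)
    try:
        return work(path)
    finally:
        # Runs on success and on error, like a shell trap on EXIT.
        if os.path.exists(path):
            os.remove(path)
```
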

🧹 Nitpick comments (1)

workflow/rules/annotation.smk (1)

102-125: LGTM! SIFT and PolyPhen flags are correctly included.

The rule structure is correct and now includes the --sift b --polyphen b flags in the extra parameter (line 118) as intended per the PR objectives.

Note: There's a minor trailing space after "...polyphen b " at line 118 that can be trimmed (optional formatting nitpick).

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 30894c7 and ba99965.

📒 Files selected for processing (2)

workflow/envs/curl.yaml (1 hunks)
workflow/rules/annotation.smk (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

workflow/envs/curl.yaml

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-11T08:47:27.118Z

Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

workflow/rules/annotation.smk

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Testing

🔇 Additional comments (5)

workflow/rules/annotation.smk (5)

15-23: LGTM!

Standard VEP plugins setup with appropriate wrapper usage.

12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.

76-99: No issues found. The helper function get_plugin_aux() is correctly implemented in workflow/rules/common.smk, the lambda function calls with arguments are proper, and access.random() is the correct Snakemake API for this resource access pattern. The trailing space in the extra parameter at line 92 can optionally be trimmed for consistency.

63-73: Rule structure and implementation are correct.

The get_tabix_revel_params() function in workflow/rules/common.smk (lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.
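
The column selection described above can be sketched as follows. The real helper takes no arguments and presumably reads the build from the workflow configuration; the explicit `build` parameter here is an assumption made to keep the example self-contained.

```python
def get_tabix_revel_params(build):
    """Return tabix flags for indexing the REVEL table of the given build."""
    # Column 2 holds the GRCh37 (hg19) position, column 3 the GRCh38 position.
    column = 2 if build == "GRCh37" else 3
    # -f: overwrite an existing index, -s: sequence name column,
    # -b / -e: begin and end column (identical for single positions).
    return f"-f -s 1 -b {column} -e {column}"
```
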

1-12: Helper function is properly implemented and returns correct build strings.

The get_reference_genome_build() function in workflow/rules/common.smk is correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:

"GRCh37" for grch37 configuration

"GRCh38" for grch38 configuration

The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.
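
A minimal sketch of such a mapping is shown below. The config key `reference-genome` and the explicit `config` parameter are assumptions for illustration; the actual helper in `workflow/rules/common.smk` may read and validate the configuration differently.

```python
def get_reference_genome_build(config):
    """Map the configured reference genome to the build name expected by VEP."""
    builds = {"grch37": "GRCh37", "grch38": "GRCh38"}
    # "reference-genome" is a hypothetical key name.
    key = str(config.get("reference-genome", "")).lower()
    if key not in builds:
        raise ValueError(
            f"Unsupported reference genome {key!r}; expected grch37 or grch38"
        )
    return builds[key]
```
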

feat: added VEP annotation with REVEL, Sift and PolyPhen Scores to fp…

1e63fb0

…/fn vcfs.

Merge branch 'main' into feat/add-revel-scores

30894c7

coderabbitai bot reviewed Dec 19, 2025

View reviewed changes

BiancaStoecker added 3 commits December 19, 2025 14:22

fix: also all sift and polyphen to unique fp fn.

e991968

fix: update curl version

b6e3955

Merge branch 'feat/add-revel-scores' of github.com:snakemake-workflow…

ba99965

…s/dna-seq-benchmark into feat/add-revel-scores

coderabbitai bot reviewed Dec 19, 2025

View reviewed changes
