Collaborator

@BiancaStoecker BiancaStoecker commented Dec 19, 2025

…/fn vcfs.

Summary by CodeRabbit

  • New Features

    • Added VEP+REVEL variant annotation to the pipeline, enriching FP/FN VCFs with effect predictions and REVEL scores.
    • FP/FN shared and unique callset VCFs are now produced as annotated VCFs (with accompanying stats/HTML).
  • Chores

    • Updated workflow tool environments for improved tooling and indexing support.


Contributor

coderabbitai bot commented Dec 19, 2025

Walkthrough

Adds a VEP/REVEL annotation pipeline and new helper functions, updates FP/FN VCF output paths to an annotated_vcf directory, and introduces curl/htslib conda environment manifests.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Snakefile & path updates: workflow/Snakefile | Includes the new annotation rules module and changes FP/FN VCF outputs to results/fp-fn/annotated_vcf/... (renaming from vcf/ and updating filenames to .annotated.vcf.gz). |
| Annotation rules: workflow/rules/annotation.smk | New Snakemake rules: get_vep_cache, get_vep_plugins, download_revel, process_revel_scores, tabix_revel_scores, annotate_shared_fn, annotate_unique_fp_fn (VEP + REVEL integration, per-build handling, logs, wrappers). |
| Shared helpers: workflow/rules/common.smk | Added get_tabix_revel_params() (column selection by reference genome) and get_plugin_aux(plugin, index=False) (REVEL resource/path helper). |
| Conda environments: workflow/envs/curl.yaml, workflow/envs/htslib.yaml | curl.yaml updated to curl=8.17.0; htslib.yaml added (htslib=1.12, unzip=6.0) with channels conda-forge/bioconda. |

Sequence Diagram

sequenceDiagram
    participant FP_FN as FP/FN VCFs
    participant Cache as VEP Cache/Plugins
    participant Revel as REVEL data (tsv + tbi)
    participant VEP as VEP Annotator
    participant Output as Annotated VCF + stats

    rect rgb(230,240,255)
    Cache->>Cache: get_vep_cache
    Cache->>Cache: get_vep_plugins
    Revel->>Revel: download_revel
    Revel->>Revel: process_revel_scores (GRCh37/GRCh38)
    Revel->>Revel: tabix_revel_scores
    end

    rect rgb(230,255,230)
    FP_FN->>VEP: annotate_shared_fn / annotate_unique_fp_fn (input VCFs)
    Cache->>VEP: provide VEP cache & plugins
    Revel->>VEP: provide REVEL TSV + index
    VEP->>Output: produce `.annotated.vcf.gz` + `.html` stats
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Reviewers should focus on:
    • workflow/rules/annotation.smk (shell/conda commands, per-build branching, wrapper parameters).
    • workflow/Snakefile path updates to ensure all downstream rules reference annotated_vcf correctly.
    • workflow/rules/common.smk helper logic for tabix column selection and plugin auxiliary paths.
    • Environment files (workflow/envs/*.yaml) for correct versions and channels.

Suggested reviewers

  • famosab
  • johanneskoester

Poem

🐰 I nibbled through VCFs all night,
VEP and REVEL made variants bright.
Caches fetched, scores zipped with cheer,
Annotated files hop into here! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes in the PR, which add VEP annotation with REVEL scoring to false positive/false negative VCF files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
workflow/rules/common.smk (1)

744-748: Type inconsistency: returns string vs empty list.

get_plugin_aux returns a string when plugin == "REVEL" but an empty list [] otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.

🔎 Proposed fix for consistent return type
 def get_plugin_aux(plugin, index=False):
     if plugin == "REVEL":
         suffix = ".tbi" if index else ""
         return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix)
-    return []
+    return ""
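An alternative way to resolve the inconsistency, sketched here as an assumption rather than as the change that was applied, is to return a list in both branches. Snakemake input functions accept a list of paths, and an empty list stands for "no additional input", which an empty string does not express as clearly. The path is copied from the suggestion above.

```python
def get_plugin_aux(plugin, index=False):
    """Return auxiliary files for a VEP plugin as a list (possibly empty)."""
    if plugin == "REVEL":
        # The tabix index sits next to the bgzipped score table.
        suffix = ".tbi" if index else ""
        return [f"resources/revel_scores.tsv.gz{suffix}"]
    # Plugins without auxiliary files contribute no extra inputs.
    return []
```

With this variant every caller can treat the result as a list, regardless of the plugin name.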
workflow/envs/htslib.yaml (1)

1-6: Consider updating htslib to version 1.22.1.

htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d820558 and 30894c7.

📒 Files selected for processing (5)
  • workflow/Snakefile (2 hunks)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/envs/htslib.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
  • workflow/rules/common.smk (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

  • workflow/Snakefile
  • workflow/rules/annotation.smk
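
The naming convention described in the learning above can be illustrated with a small helper. This is only a sketch of the filename mapping; the function name and the example path are hypothetical, and the actual sort_vcf rule expresses the same mapping through Snakemake wildcards.

```python
def sorted_vcf_name(path):
    """Map {prefix}.vcf.gz to {prefix}.sorted.vcf.gz (hypothetical helper)."""
    suffix = ".vcf.gz"
    if not path.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix}: {path}")
    # Keep the prefix and swap in the sorted suffix.
    return path[: -len(suffix)] + ".sorted.vcf.gz"


# Example with a made-up path:
print(sorted_vcf_name("results/example.vcf.gz"))
```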
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Testing
🔇 Additional comments (4)
workflow/rules/common.smk (1)

738-741: LGTM!

The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.

workflow/Snakefile (2)

18-18: LGTM!

The annotation rules are correctly included within the variant-calls config block, making them conditionally available only when variant calls are configured.


33-41: LGTM!

The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.

workflow/rules/annotation.smk (1)

76-99: Clarify the access.random() version requirement and confirm intentional parameter difference.

The access.random() function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters: annotate_shared_fn includes --sift b --polyphen b while annotate_unique_fp_fn omits these flags. Verify this difference aligns with your analysis requirements.

Comment on lines +26 to +34
rule download_revel:
    output:
        temp("resources/revel_scores.zip"),
    log:
        "logs/vep_plugins/download_revel.log",
    conda:
        "../envs/curl.yaml"
    shell:
        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

⚠️ Potential issue | 🟡 Minor

Add -L flag to curl for redirect handling.

The Zenodo URL may redirect. Adding -L ensures curl follows redirects. Also consider adding --fail to exit with error on HTTP failures.

🔎 Proposed fix
     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 26 to 34, the curl command used to
download the Zenodo file doesn't follow HTTP redirects or fail on HTTP errors;
update the shell command to include the -L flag to follow redirects and add
--fail (and optionally --show-error) so curl exits with a failure status on HTTP
errors and prints errors to the log; keep the redirection to the rule log
intact.
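
To make the effect of the suggested flags concrete, the sketch below assembles the same curl invocation from Python. The helper names are hypothetical; the flags themselves (-L, --fail, --show-error, -o) are standard curl options. Redirecting both output streams to a log file mirrors the `&> {log}` in the rule.

```python
import subprocess


def build_curl_command(url, output):
    """Assemble a curl call that follows redirects and fails on HTTP errors."""
    return [
        "curl",
        "-L",            # follow redirects
        "--fail",        # non-zero exit status on HTTP errors
        "--show-error",  # still print the error message when failing
        url,
        "-o",
        output,
    ]


def download(url, output, log_path):
    """Run curl, sending stdout and stderr to the log file."""
    with open(log_path, "w") as log:
        # check=True raises CalledProcessError if curl exits non-zero.
        subprocess.run(build_curl_command(url, output), stdout=log, stderr=log, check=True)
```

Because `check=True` turns a non-zero exit status into an exception, a failed download cannot silently leave an HTML error page in place of the zip archive.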

Comment on lines +37 to +60
rule process_revel_scores:
    input:
        "resources/revel_scores.zip",
    output:
        "resources/revel_scores.tsv.gz",
    params:
        build=get_reference_genome_build(),
    log:
        "logs/vep_plugins/process_revel_scores.log",
    conda:
        "../envs/htslib.yaml"
    shell:
        """
        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
        unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
        if [ "{params.build}" == "GRCh38" ] ; then
            zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
        elif [ "{params.build}" == "GRCh37" ] ; then
            cat $tmpfile > {output}
        else
            echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
            exit 125
        fi
        """

⚠️ Potential issue | 🔴 Critical

Missing resources declaration for tmpdir and temp file cleanup.

The shell script references {resources.tmpdir} but no resources block is declared. This will cause a runtime error. Also, the temporary file $tmpfile is not removed after use.
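
The cleanup concern is independent of the shell: whatever creates the scratch file should also guarantee its removal, including on failure. As a language-neutral illustration, here is a minimal Python sketch of the same guarantee that a shell `trap ... EXIT` provides. The helper name and default prefix are hypothetical. Separately, Snakemake's documentation lists tmpdir among its standard resources, so it may be worth checking whether an explicit declaration is actually required for the Snakemake version in use.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_file(directory=None, prefix="revel_scores."):
    """Yield the path of a temporary file and always delete it afterwards."""
    # mkstemp creates the file securely; directory=None uses the system default.
    fd, path = tempfile.mkstemp(prefix=prefix, dir=directory)
    os.close(fd)
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions, like `trap ... EXIT` in a shell.
        if os.path.exists(path):
            os.remove(path)
```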

🔎 Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=temp("tmpdir"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
🤖 Prompt for AI Agents
In workflow/rules/annotation.smk around lines 37 to 60, the rule
process_revel_scores uses {resources.tmpdir} but no resources block is declared
and the temporary file $tmpfile is never cleaned up; add a resources:
tmpdir=<appropriate_tmpdir_path> declaration to the rule (or use resources:
tmpdir=configured_tmpdir variable) so Snakemake provides the directory, and
modify the shell block to create the temp file inside that tmpdir and remove it
on exit (use a trap to rm -f $tmpfile on EXIT or ensure rm -f $tmpfile at the
end of the script) so the temp file is always cleaned up even on errors.
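
For readers less fluent in shell, the data transformation performed by the process_revel_scores pipeline can be paraphrased in Python. This is a sketch under stated assumptions, not the workflow's implementation: the function name is hypothetical, the input is taken as an in-memory list of CSV lines, and Python's plain string ordering only approximates the locale-dependent ordering of `sort -k1,1`.

```python
def process_revel_lines(lines, build):
    """Mimic the shell pipeline: CSV to TSV, '#' header, GRCh38 filter and sort."""
    # tr "," "\t": turn every comma into a tab.
    tsv = [line.rstrip("\n").replace(",", "\t") for line in lines]
    if not tsv:
        return []
    # sed '1s/.*/#&/': mark the header line as a comment.
    header, records = "#" + tsv[0], tsv[1:]
    if build == "GRCh37":
        # The table is used as-is for GRCh37.
        return [header] + records
    if build == "GRCh38":
        rows = [record.split("\t") for record in records]
        # awk '$3 != "."': drop rows without a GRCh38 position.
        rows = [row for row in rows if row[2] != "."]
        # sort -k1,1 -k3,3n: order by chromosome, then numerically by column 3.
        rows.sort(key=lambda row: (row[0], int(row[2])))
        return [header] + ["\t".join(row) for row in rows]
    raise ValueError("Annotation of REVEL scores only supported for GRCh37 or GRCh38")
```

The re-sort in the GRCh38 branch is what allows the table to be indexed on column 3 afterwards, since tabix expects position-sorted input.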

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)

26-34: Add -L and --fail flags to curl command.

The curl command should include -L to follow redirects and --fail to exit with error status on HTTP failures, ensuring robust downloads from Zenodo.

🔎 Proposed fix
     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

37-60: Missing resources block causes runtime error.

The shell script references {resources.tmpdir} on line 50, but no resources block is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.

🔎 Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=config.get("tmpdir", "/tmp"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX)
+        trap 'rm -f "$tmpfile"' EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)

102-125: LGTM! SIFT and PolyPhen flags are correctly included.

The rule structure is correct and now includes the --sift b --polyphen b flags in the extra parameter (line 118) as intended per the PR objectives.

Note: There's a minor trailing space after "...polyphen b " at line 118 that can be trimmed (optional formatting nitpick).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30894c7 and ba99965.

📒 Files selected for processing (2)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/envs/curl.yaml
🔇 Additional comments (5)
workflow/rules/annotation.smk (5)

15-23: LGTM!

Standard VEP plugins setup with appropriate wrapper usage.


12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.


76-99: No issues found. The helper function get_plugin_aux() is correctly implemented in workflow/rules/common.smk, the lambda function calls with arguments are proper, and access.random() is the correct Snakemake API for this resource access pattern. The trailing space in the extra parameter at line 92 can optionally be trimmed for consistency.


63-73: Rule structure and implementation are correct.

The get_tabix_revel_params() function in workflow/rules/common.smk (lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.
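
Based only on the description above, the helper's logic can be reconstructed roughly as follows. This is a hedged sketch: the real get_tabix_revel_params() takes no arguments and presumably reads the build from the workflow configuration, whereas this version takes the build as a parameter so that it is self-contained.

```python
def get_tabix_revel_params(build):
    """Return tabix options that index the REVEL table on the build's position column."""
    # Column 2 holds the GRCh37 (hg19) position, column 3 the GRCh38 position.
    column = 2 if build == "GRCh37" else 3
    # -f overwrites an existing index, -s is the sequence-name column,
    # -b and -e are the start and end columns (identical for single positions).
    return f"-f -s 1 -b {column} -e {column}"


print(get_tabix_revel_params("GRCh38"))
```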


1-12: Helper function is properly implemented and returns correct build strings.

The get_reference_genome_build() function in workflow/rules/common.smk is correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:

  • "GRCh37" for grch37 configuration
  • "GRCh38" for grch38 configuration

The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.
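
The mapping described above can be sketched as a small validating lookup. The function name and signature here are assumptions made for illustration; the actual get_reference_genome_build() lives in workflow/rules/common.smk and its body is not shown in this review.

```python
def reference_genome_build(configured):
    """Map a configured genome name to the build string VEP expects (hypothetical helper)."""
    builds = {"grch37": "GRCh37", "grch38": "GRCh38"}
    try:
        return builds[configured]
    except KeyError:
        # Fail early so unsupported genomes never reach the annotation rules.
        raise ValueError(f"unsupported reference genome: {configured!r}") from None
```

Validating at this point means an unsupported genome is reported once, up front, instead of surfacing later inside a shell command.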
