-
Notifications
You must be signed in to change notification settings - Fork 36
Add coverage #273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add coverage #273
Changes from all commits
0756eb9
358213d
a0aca5a
9812834
e5374c7
1c8b00b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -202,4 +202,5 @@ def wrapper(*args, **kwargs): | |
| filterbycov, | ||
| header, | ||
| scaling, | ||
| coverage, | ||
| ) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,183 @@ | ||
| #!/usr/bin/env python | ||
| # -*- coding: utf-8 -*- | ||
| """ | ||
| Command-line interface for coverage analysis. | ||
|
|
||
| This module provides a Click-based CLI that wraps the coverage_api module. | ||
| """ | ||
|
|
||
| import click | ||
| import bioframe | ||
|
|
||
| from ..lib.coverage import calculate_coverage, read_pairs | ||
| from . import cli, common_io_options | ||
|
|
||
|
|
||
| UTIL_NAME = "pairtools_coverage" | ||
|
|
||
|
|
||
| @cli.command() | ||
| @click.argument( | ||
| "pairs_path", | ||
| type=str, | ||
| required=False, | ||
| ) | ||
| @click.option( | ||
| "--side", | ||
| "-s", | ||
| type=click.Choice(["both", "1", "2"]), | ||
| default="1", | ||
| show_default=True, | ||
| help="both: both sides, 1: first side, 2: second side", | ||
| ) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Might be a good idea to allow using (1) the midpoints of the reads for calculation and (2) whole length of the read (each nucleotide of the read adds +1 to the coverage)
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, will add it! At least the midpoint I was planning anyway for sure. |
||
| @click.option( | ||
| "--end", | ||
| type=click.Choice(["5", "3", "0"]), | ||
| default="0", | ||
| show_default=True, | ||
| help="5', 3' end, or 0 (default) - whatever is reported as pos1 and pos2 in pairs", | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| ) | ||
| @click.option( | ||
| "--shift", | ||
| type=int, | ||
| default=0, | ||
| show_default=True, | ||
| help="Shift value for strand-specific adjustment. Positive values shift downstream " | ||
| "relative to the corresponding strand, negative values shift upstream. " | ||
| "Can be useful to e.g. account for nucleosome size in micro-C by shifting by 73 bp", | ||
| ) | ||
| @click.option( | ||
| "--window-size", | ||
| type=int, | ||
| default=10, | ||
| show_default=True, | ||
| help="Window size for binning", | ||
| ) | ||
| @click.option( | ||
| "--chunksize", | ||
| type=int, | ||
| default=1000000, | ||
| show_default=True, | ||
| help="Number of pairs to process at once", | ||
| ) | ||
| @click.option( | ||
| "--threads", | ||
| "-t", | ||
| type=int, | ||
| default=1, | ||
| show_default=True, | ||
| help="Number of threads", | ||
| ) | ||
| @click.option( | ||
| "--chromsizes", | ||
| type=str, | ||
| default=None, | ||
| help="Chromosome order used to filter pairs: " | ||
| "path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose " | ||
| "first column lists scaffold names. If not provided, will be extracted " | ||
| "from the header of the input pairs file.", | ||
| ) | ||
| @click.option( | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a way to disable the output bgzip-/lz4c-file and have bigwig only? If bgzip-/lz4c- output is required, it shall have an enormous burden on the IO if you need bigwig only...
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think in pairtools we always have a text output, no? How bioframe.to_bigwig works is it saves to text and then converts anyway, so the burden is only double... But I guess we can at least default to saving to stdout and then if you don't want it you dump it to /dev/null? |
||
| "-o", | ||
| "--output", | ||
| type=str, | ||
| default="", | ||
| help="output file." | ||
| " If the path ends with .gz or .lz4, the output is bgzip-/lz4c-compressed." | ||
| " By default, the output is printed into stdout.", | ||
| ) | ||
| @click.option( | ||
| "--output-bigwig", | ||
| type=str, | ||
| default=None, | ||
| help="Output BigWig file (optional)", | ||
| ) | ||
| @common_io_options | ||
| def coverage( | ||
| pairs_path, | ||
| side, | ||
| end, | ||
| shift, | ||
| window_size, | ||
| chunksize, | ||
| threads, | ||
| chromsizes, | ||
| output, | ||
| output_bigwig, | ||
| **kwargs, | ||
| ): | ||
| """Generate coverage from pairs file. | ||
|
|
||
| PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the | ||
| input is decompressed by bgzip/lz4c. | ||
| By default, the input is read from stdin. | ||
| """ | ||
| coverage_py( | ||
| pairs_path=pairs_path, | ||
| output=output, | ||
| side=side, | ||
| end=end, | ||
| shift=shift, | ||
| window_size=window_size, | ||
| chunksize=chunksize, | ||
| threads=threads, | ||
| chromsizes=chromsizes, | ||
| output_bigwig=output_bigwig, | ||
| **kwargs, | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| coverage() | ||
|
|
||
|
|
||
| def coverage_py( | ||
| pairs_path, | ||
| output, | ||
| side, | ||
| end, | ||
| shift, | ||
| window_size, | ||
| chunksize, | ||
| threads, | ||
| chromsizes, | ||
| output_bigwig, | ||
| **kwargs, | ||
| ): | ||
| """Generate coverage from pairs file. | ||
|
|
||
| PAIRS_PATH : input .pairs/.pairsam file. If the path ends with .gz or .lz4, the | ||
| input is decompressed by bgzip/lz4c. | ||
| By default, the input is read from stdin. | ||
| """ | ||
|
|
||
| # Handle chromsizes - either from file or extract from pairs header | ||
| # Read pairs and get chromsizes from header | ||
| pairs_chunks, chromsizes_dict = read_pairs( | ||
| pairs_path if pairs_path else "-", chunksize, threads | ||
| ) | ||
| if chromsizes is not None: | ||
| chromsizes_dict = bioframe.read_chromsizes( | ||
| chromsizes, filter_chroms=False, natsort=False | ||
| ) | ||
|
|
||
| # Calculate coverage using the API | ||
| coverage_df = calculate_coverage( | ||
| pairs_chunks=pairs_chunks, | ||
| chromsizes=chromsizes_dict, | ||
| side=side, | ||
| end=end, | ||
| shift=shift, | ||
| window_size=window_size, | ||
| threads=threads, | ||
| ) | ||
|
|
||
| # Save the results | ||
| coverage_df.to_csv(output, sep="\t", index=False, header=False) | ||
|
|
||
| if output_bigwig is not None: | ||
| bioframe.to_bigwig( | ||
| coverage_df[["chrom", "start", "end", "count"]], | ||
| chromsizes_dict, | ||
| output_bigwig, | ||
| ) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -9,3 +9,4 @@ | |
| from . import restrict | ||
| from . import stats | ||
| from . import select | ||
| from . import coverage | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having two separate parameters "side" and "end", maybe it's better to allow the user to submit the list of columns to use for the coverage calculation? It's just a suggestion, might be not the smartest one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea. How would you specify it in case you want to have the coverage by midpoint or whole length, as you suggest below and I also wanted to implement anyway?