Commit 49eb8a9

revamp gtars docs
1 parent 742432a commit 49eb8a9

16 files changed: +776 -40 lines changed

docs/gtars/README.md

Lines changed: 76 additions & 3 deletions
@@ -4,10 +4,83 @@
44

55

66
<p align="center">
7-
<a href="https://pypi.org/project/gtars"><img src="https://img.shields.io/pypi/v/geniml" alt=""></a>
7+
<a href="https://pypi.org/project/gtars"><img src="https://img.shields.io/pypi/v/gtars" alt=""></a>
8+
<a href="https://crates.io/crates/gtars"><img src="https://img.shields.io/crates/v/gtars?&logo=rust" alt="crates.io"></a>
89
<a href="https://github.com/databio/gtars"><img src="https://img.shields.io/badge/source-github-354a75?logo=github"></a>
910
</p>
1011

11-
Gtars is a Rust package with Python bindings for genomic interval analysis.
1212

13-
Coming soon!
13+
14+
## Introduction
15+
16+
`gtars` is a high-performance toolkit of *genomic tools and algorithms in Rust*. Built with Rust for speed and reliability, gtars provides the core utilities for machine learning on genomic intervals used by the [geniml](https://github.com/databio/geniml) Python package. It is also useful as a standalone library for other downstream use cases.
17+
18+
## Installation
19+
20+
### Rust Library
21+
22+
Gtars uses a feature-flag system to allow you to include only the modules you need. Add to your `Cargo.toml`:
23+
24+
```toml
25+
[dependencies]
26+
# Install specific features
27+
gtars = { version = "0.5", features = ["overlaprs", "tokenizers"] }
28+
29+
# Or install from GitHub
30+
gtars = { git = "https://github.com/databio/gtars", features = ["overlaprs", "tokenizers"] }
31+
```
32+
33+
Modules:
34+
35+
- `core` - Core functionality and data structures
36+
- `tokenizers` - Genomic region tokenizers
37+
- `io` - I/O utilities
38+
- `refget` - Reference sequence access
39+
- `overlaprs` - Overlap operations
40+
- `uniwig` - Coverage computation
41+
- `igd` - Interval search
42+
- `bbcache` - BED file caching
43+
- `scoring` - Fragment scoring
44+
- `fragsplit` - Fragment splitting
45+
46+
47+
Example combinations:
48+
49+
```toml
50+
# For machine learning tasks
51+
gtars = { version = "0.5", features = ["tokenizers", "core"] }
52+
53+
# For genomic analysis
54+
gtars = { version = "0.5", features = ["overlaprs", "uniwig", "scoring"] }
55+
56+
# For data access
57+
gtars = { version = "0.5", features = ["refget", "bbcache", "io"] }
58+
```
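
As a rough illustration of what a feature flag does inside a crate that declares it, the sketch below uses Rust's built-in `cfg!` macro to report which features were switched on at compile time. The helper name `enabled_features` is hypothetical and is not part of gtars; only the feature names are taken from the module list above.

```rust
/// Hypothetical helper (not part of gtars): lists which of three feature
/// names were enabled when this crate was compiled.
/// `cfg!(feature = "...")` expands to a plain `true` or `false` literal.
#[allow(unexpected_cfgs)] // the features are not declared in this standalone sketch
pub fn enabled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(feature = "tokenizers") {
        enabled.push("tokenizers");
    }
    if cfg!(feature = "overlaprs") {
        enabled.push("overlaprs");
    }
    if cfg!(feature = "uniwig") {
        enabled.push("uniwig");
    }
    enabled
}
```

Code behind a disabled feature is typically excluded with `#[cfg(feature = "...")]`, so it is never compiled, which is why selecting fewer features can shorten build times.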
59+
60+
### Python Package
61+
62+
```bash
63+
pip install gtars
64+
```
65+
66+
See further documentation under Python bindings.
67+
68+
### Command-Line Interface
69+
70+
Install from source:
71+
```bash
72+
git clone https://github.com/databio/gtars
73+
cd gtars
74+
cargo install --path gtars-cli
75+
```
76+
77+
Or download precompiled binaries from the [releases page](https://github.com/databio/gtars/releases).
78+
79+
80+
## Development
81+
82+
Run tests with `cargo test` from the workspace root. Please see [CONTRIBUTING.md](https://github.com/databio/gtars/blob/master/CONTRIBUTING.md) for development guidelines.
83+
84+
## Module organization
85+
86+
`gtars` is organized into modules. The modules section gives an [overview of each module](modules.md).

docs/gtars/changelog.md

Lines changed: 24 additions & 24 deletions
@@ -5,36 +5,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
55
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

77

8-
## [0.3.0]
8+
## [0.3.0] -- 2023-07-30
99
- move digests functionality to refget
1010
- add RefgetStore to refget and its associated python bindings
1111
- integrate support for `bits` + backend types for tokenizers (AIList or BITS)
1212
- reworked the tokenization CLI to support the new `bits` and `backend` options
1313

14-
## [0.2.5]
14+
## [0.2.5] -- 2023-04-06
1515
- Rework tokenizer API to be more consistent with the HuggingFace tokenizers API.
1616
- Updates to `RegionSet` to improve performance and usability.
1717
- Added file_digest function to RegionSet struct
1818
- Fixed reqwest error in R bindings
1919
- Fixed [#107](https://github.com/databio/gtars/issues/107)
2020

21-
## [0.2.4]
21+
## [0.2.4] -- 2023-03-05
2222
- Attempt to fix failing python bindings in CI linux [#104](https://github.com/databio/gtars/issues/104)
2323

24-
## [0.2.3]
24+
## [0.2.3] -- 2023-03-05
2525
- Improved RegionSet by adding multiple new methods: `to_bed`, `to_bed_gz`, `to_bigbed`, `identifier()`, and others.
2626
- Fixed `fasta_digest` so that it accepts `Path` or `bytes` [#93](https://github.com/databio/gtars/issues/93)
2727

28-
## [0.2.2]
28+
## [0.2.2] -- 2023-02-18
2929
- fix [#90](https://github.com/databio/gtars/issues/90)
3030
- fix [#89](https://github.com/databio/gtars/issues/89)
3131

3232

33-
## [0.2.1]
33+
## [0.2.1] -- 2024-02-11
3434
- allow comments at the beginning of fragment files
3535
- bump bigtools to 0.5.5, fixing [#74](https://github.com/databio/gtars/issues/74) and [#77](https://github.com/databio/gtars/issues/77)
3636

37-
## [0.2.0]
37+
## [0.2.0] -- 2024-01-13
3838
- add position shift workflow for bam to bw (was previously added for bam to bed)
3939
- add scaling argument for bam to bw workflow [#53](https://github.com/databio/gtars/issues/53)
4040
- fix accumulation issue for bam workflow [#56](https://github.com/databio/gtars/issues/56)
@@ -58,10 +58,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
5858
- fix IGD overlap issue [#45](https://github.com/databio/gtars/issues/45)
5959
- add ga4gh refget digest functionality [#58](https://github.com/databio/gtars/pull/58)
6060

61-
## [0.1.1]
61+
## [0.1.1] -- 2023-12-03
6262
- hot fix for broken python bindings; remove IGD from the python bindings for now
6363

64-
## [0.1.0]
64+
## [0.1.0] -- 2023-12-03
6565
- Rust implementation of `uniwig` that expands on the C++ version
6666
- Uniwig now accepts a single sorted `.bed` file, `.narrowPeak` file, or `.bam` file.
6767
- Outputs now include `.wig`, `.npy`, `.bedGraph`, and `.bw`
@@ -70,64 +70,64 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7070
- Region scoring matrix calculation for region clustering
7171
- Fragment file splitter for pseudobulking
7272

73-
## [0.0.15]
73+
## [0.0.15] -- 2023-07-29
7474
- added meta tokenization tools and a new `MetaTokenizer` struct that can be used to tokenize regions using the meta-token strategy.
7575
- added some annotations to the `pyo3` `#[pyclass]` and `#[pymethods]` attributes to make the python bindings more readable.
7676

77-
## [0.0.14]
77+
## [0.0.14] -- 2023-06-11
7878
- renamed repository to `gtars` to better reflect the project's goals.
7979

80-
## [0.0.13]
80+
## [0.0.13] -- 2023-06-03
8181
- implemented a fragment file tokenizer that will generate `.gtok` files directly from `fragments.tsv.gz` files.
8282
- fix an off-by-one error in the `region-to-id` maps in the `Universe` structs. This was leading to critical bugs in our models.
8383

84-
## [0.0.12]
84+
## [0.0.12] -- 2023-05-28
8585
- optimize creation of `PyRegionSet` to reduce expensive cloning of `Universe` structs.
8686

87-
## [0.0.11]
87+
## [0.0.11] -- 2023-05-22
8888
- redesigned API for the tokenizers to better emulate the huggingface tokenizers API.
8989
- implemented new traits for tokenizers to allow for more flexibility when creating new tokenizers.
9090
- bumped the version of `pyo3` to `0.21.0`
9191
- added `rust-numpy` dependency to the python bindings for exporting tokenized regions as numpy arrays.
9292
- overall stability improvements to the tokenizers and the python bindings.
9393

94-
## [0.0.10]
94+
## [0.0.10] -- 2024-01-24
9595
- update file format specifications
9696

97-
## [0.0.9]
97+
## [0.0.9] -- 2024-01-22
9898
- start working on the concept of a `.gtok` file-format to store tokenized regions
9999
- added basic readers and writers for this format
100100

101-
## [0.0.8]
101+
## [0.0.8] -- 2024-01-17
102102
- add a new `ids_as_strs` getter to the `TokenizedRegionSet` struct so that we can get the ids as strings quickly; this is meant mostly for interfacing with geniml.
103103

104-
## [0.0.7]
104+
## [0.0.7] -- 2023-11-30
105105
- move things around based on rust club feedback
106106

107-
## [0.0.6]
107+
## [0.0.6] -- 2024-02-20
108108
- update python bindings to support the module/submodule structure (https://github.com/PyO3/pyo3/issues/759#issuecomment-1828431711)
109109
- change name of some submodules
110110
- remove `consts` submodule, just add to base
111111
- expose a `__version__` attribute in the python bindings
112112

113-
## [0.0.5]
113+
## [0.0.5] -- 2024-02-19
114114
- add many "core utils"
115115
- move `gtokenizers` into this package inside `gtars::tokenizers`
116116
- create `tokenize` cli
117117
- add tests for core utils and tokenizers
118118
- RegionSet is now backed by a polars DataFrame
119119
- new python bindings for core utils and tokenizers
120120

121-
## [0.0.4]
121+
## [0.0.4] -- 2023-11-06
122122
- add type annotations to the python bindings
123123

124-
## [0.0.3]
124+
## [0.0.3] -- 2023-11-06
125125
- work on python bindings initialization
126126

127-
## [0.0.2]
127+
## [0.0.2] -- 2023-09-20
128128
- prepare for first release
129129

130-
## [0.0.1]
130+
## [0.0.1] -- 2023-08-15
131131
- initial setup of repository
132132
- two main wrappers: 1) wrapper binary crate, and 2) wrapper library crate
133133
- `gtars` can be used as a library crate or as a command-line tool

docs/gtars/cli.md

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# gtars-cli
2+
3+
Command-line interface for gtars tools.
4+
5+
## Installation
6+
7+
```bash
8+
# From source
9+
cd /path/to/gtars
10+
cargo install --path gtars-cli
11+
12+
# Or download pre-built binary from releases page
13+
```
14+
15+
## Available Commands
16+
17+
The CLI provides the following subcommands (availability depends on features enabled during compilation):
18+
19+
### igd
20+
Build and query IGD (Integrated Genome Database) indexes:
21+
```bash
22+
gtars igd create --input regions.bed --output index.igd
23+
gtars igd query --database index.igd --query chr1:1000-2000
24+
```
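
The query above is written as `chrom:start-end`. A minimal sketch of how such a string could be split into its parts is shown below. `parse_region` is a hypothetical helper written for illustration, not the parser the CLI uses, and it assumes the chromosome name contains no `:`.

```rust
/// Hypothetical parser for a `chrom:start-end` query string.
/// Returns `None` if the string is malformed or if start > end.
pub fn parse_region(s: &str) -> Option<(String, u32, u32)> {
    // Split "chr1:1000-2000" into "chr1" and "1000-2000".
    let (chrom, range) = s.split_once(':')?;
    // Split "1000-2000" into "1000" and "2000".
    let (start, end) = range.split_once('-')?;
    let start: u32 = start.parse().ok()?;
    let end: u32 = end.parse().ok()?;
    if chrom.is_empty() || start > end {
        return None;
    }
    Some((chrom.to_string(), start, end))
}
```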
25+
26+
### overlaprs
27+
Compute overlaps between genomic intervals:
28+
```bash
29+
gtars overlaprs --input1 regions1.bed --input2 regions2.bed
30+
```
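
To make "overlap" concrete, here is a minimal sketch of the test for two half-open intervals `[start, end)`, plus a naive all-pairs comparison. The `Interval` type and both functions are assumptions made for this example. Real overlap tools use index structures to avoid the quadratic loop, and this sketch does not reproduce them.

```rust
/// Stand-in interval type for this sketch (half-open: start inclusive, end exclusive).
#[derive(Debug, Clone, PartialEq)]
pub struct Interval {
    pub chr: String,
    pub start: u32,
    pub end: u32,
}

/// Two intervals overlap if they share a chromosome and each one
/// starts before the other ends.
pub fn overlaps(a: &Interval, b: &Interval) -> bool {
    a.chr == b.chr && a.start < b.end && b.start < a.end
}

/// Naive O(n * m) comparison returning index pairs (i, j) where
/// `set1[i]` overlaps `set2[j]`.
pub fn overlapping_pairs(set1: &[Interval], set2: &[Interval]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (i, a) in set1.iter().enumerate() {
        for (j, b) in set2.iter().enumerate() {
            if overlaps(a, b) {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

With half-open coordinates, as in BED files, two intervals that merely touch (one ends where the other starts) do not count as overlapping.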
31+
32+
### uniwig
33+
Generate coverage tracks from BED/BAM files:
34+
```bash
35+
gtars uniwig --input reads.bam --output coverage.bw
36+
```
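
A coverage track records, for each base, how many reads or regions span it. The sketch below computes per-base coverage for a single chromosome with a difference array. It is a generic technique shown for illustration under assumed half-open coordinates; it is not claimed to be how `uniwig` computes coverage, and the `coverage` function is hypothetical.

```rust
/// Per-base coverage on one chromosome of length `chrom_len`.
/// Intervals are half-open `(start, end)` and are clipped to the chromosome.
pub fn coverage(intervals: &[(usize, usize)], chrom_len: usize) -> Vec<u32> {
    // diff[i] holds the change in depth when stepping onto base i.
    let mut diff = vec![0i64; chrom_len + 1];
    for &(start, end) in intervals {
        let s = start.min(chrom_len);
        let e = end.min(chrom_len);
        if s < e {
            diff[s] += 1; // depth rises at the interval start
            diff[e] -= 1; // and falls just past its last base
        }
    }
    // A running sum over the differences yields the depth at each base.
    let mut depth = 0i64;
    diff[..chrom_len]
        .iter()
        .map(|d| {
            depth += d;
            depth as u32
        })
        .collect()
}
```

Each interval costs two array updates regardless of its length, so the total work is proportional to the number of intervals plus the chromosome length.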
37+
38+
### bbcache
39+
Cache and manage BED files from bedbase.org:
40+
```bash
41+
gtars bbcache get --id GSM123456
42+
gtars bbcache list
43+
```
44+
45+
### scoring
46+
Score fragment overlaps against a reference:
47+
```bash
48+
gtars scoring --fragments frags.tsv.gz --universe peaks.bed --output scores.txt
49+
```
50+
51+
### fragsplit
52+
Split fragment files by cell barcodes or clusters:
53+
```bash
54+
gtars fragsplit --fragments frags.tsv.gz --barcodes clusters.csv --output-dir splits/
55+
```
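
The idea behind splitting can be sketched as grouping fragment lines by the cluster assigned to each barcode. The example assumes tab-separated fragment lines with the barcode in the fourth column and an in-memory barcode-to-cluster map; `split_by_cluster` is a hypothetical function, not the `fragsplit` implementation.

```rust
use std::collections::HashMap;

/// Groups fragment lines by cluster. Lines whose barcode is missing from
/// the map, or that have fewer than four columns, are skipped.
pub fn split_by_cluster<'a>(
    fragments: &[&'a str],
    barcode_to_cluster: &HashMap<&str, &str>,
) -> HashMap<String, Vec<&'a str>> {
    let mut out: HashMap<String, Vec<&'a str>> = HashMap::new();
    for &line in fragments {
        // Assumed layout: chrom, start, end, barcode, count.
        let Some(barcode) = line.split('\t').nth(3) else {
            continue;
        };
        if let Some(cluster) = barcode_to_cluster.get(barcode) {
            out.entry(cluster.to_string()).or_default().push(line);
        }
    }
    out
}
```

Each resulting group could then be written to its own file, giving one pseudobulk fragment file per cluster.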
56+
57+
58+
## Global Options
59+
60+
```bash
61+
gtars --help # Show help
62+
gtars --version # Show version
63+
gtars <command> --help # Show command-specific help
64+
```
65+
66+
## Building with Specific Features
67+
68+
To build the CLI with specific tools:
69+
```bash
70+
cargo build --release --features "uniwig,overlaprs,igd"
71+
```

docs/gtars/core.md

Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
# gtars-core
2+
3+
Core library providing fundamental data structures and utilities for genomic interval operations. This is the foundation that all other gtars modules build upon.
4+
5+
## Features
6+
7+
- Common genomic data structures (Region, RegionSet)
8+
- BED file parsing utilities
9+
- Shared constants and helper functions
10+
- Foundation for all gtars modules
11+
12+
## Core Data Types
13+
14+
### Region
15+
Represents a genomic interval with chromosome, start, and end coordinates:
16+
```rust
17+
use gtars_core::models::Region;
18+
19+
// Create a region
20+
let region = Region::new("chr1", 1000, 2000);
21+
22+
// Access properties
23+
println!("Chr: {}", region.chr);
24+
println!("Start: {}", region.start);
25+
println!("End: {}", region.end);
26+
```
27+
28+
### RegionSet
29+
Collection of genomic regions:
30+
```rust
31+
use gtars_core::models::RegionSet;
32+
use std::path::Path;
33+
34+
// Load from BED file
35+
let rs = RegionSet::try_from(Path::new("peaks.bed"))?;
36+
37+
// Access regions
38+
println!("Number of regions: {}", rs.regions.len());
39+
40+
// Iterate over regions
41+
for region in &rs.regions {
42+
println!("{}: {}-{}", region.chr, region.start, region.end);
43+
}
44+
```
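
For readers unfamiliar with the BED format, the sketch below shows what loading regions from BED text involves: skip comments and header lines, then read the first three whitespace-separated fields as chromosome, start, and end. `BedRecord` and `parse_bed` are stand-ins invented for this example. They do not reproduce the `RegionSet` loader or its error handling.

```rust
/// Stand-in record type mirroring the three fields shown above.
#[derive(Debug, Clone, PartialEq)]
pub struct BedRecord {
    pub chr: String,
    pub start: u32,
    pub end: u32,
}

/// Parses BED-formatted text, keeping only the first three columns.
pub fn parse_bed(text: &str) -> Result<Vec<BedRecord>, String> {
    let mut records = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        let line = line.trim();
        // Skip blank lines, comments, and track/browser header lines.
        if line.is_empty()
            || line.starts_with('#')
            || line.starts_with("track")
            || line.starts_with("browser")
        {
            continue;
        }
        let mut fields = line.split_whitespace();
        let (Some(chr), Some(start), Some(end)) = (fields.next(), fields.next(), fields.next())
        else {
            return Err(format!("line {}: expected at least 3 fields", idx + 1));
        };
        let start: u32 = start
            .parse()
            .map_err(|e| format!("line {}: bad start: {e}", idx + 1))?;
        let end: u32 = end
            .parse()
            .map_err(|e| format!("line {}: bad end: {e}", idx + 1))?;
        if start > end {
            return Err(format!("line {}: start is greater than end", idx + 1));
        }
        records.push(BedRecord { chr: chr.to_string(), start, end });
    }
    Ok(records)
}
```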
45+
46+
## Available Modules
47+
48+
- `models` - Core data structures (Region, RegionSet)
49+
- `utils` - Utility functions for file handling and parsing
50+
- `consts` - Shared constants
51+
52+
## Dependencies
53+
54+
Minimal external dependencies:
55+
56+
- `anyhow` - Error handling
57+
- `flate2` - Gzip compression support
58+
- Other standard bioinformatics libraries
59+
60+
This module serves as the foundation for all other gtars modules and maintains backward compatibility within major versions.
