Gtars is a Rust package with Python bindings for genomic interval analysis.
## Introduction

`gtars` is a high-performance toolkit of *genomic tools and algorithms in Rust*. Built with Rust for speed and reliability, it provides core utilities for machine learning on genomic intervals for the [geniml](https://github.com/databio/geniml) Python package. It also works as a standalone library for other downstream use cases.

## Installation

### Rust Library

Gtars uses a feature-flag system so that you can include only the modules you need. Add to your `Cargo.toml`:

```toml
[dependencies]
# Choose ONE of the `gtars` entries below; a manifest may declare the dependency only once.

# Install specific features
gtars = { version = "0.5", features = ["overlaprs", "tokenizers"] }
gtars = { version = "0.5", features = ["tokenizers", "core"] }

# For genomic analysis
gtars = { version = "0.5", features = ["overlaprs", "uniwig", "scoring"] }

# For data access
gtars = { version = "0.5", features = ["refget", "bbcache", "io"] }
```
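
Cargo features gate code at compile time, which is what lets a crate ship only the modules a user asks for. The sketch below illustrates the general mechanism with `#[cfg(feature = "...")]` and `cfg!`. The module body and the `enabled_features` helper are hypothetical and are not taken from the gtars source; only the feature names come from the manifest examples above.

```rust
// Generic illustration of Cargo feature gating. The module contents and the
// helper below are hypothetical and do not reflect the actual gtars layout.

/// Compiled only when the crate is built with the `tokenizers` feature.
#[cfg(feature = "tokenizers")]
pub mod tokenizers {
    pub fn describe() -> &'static str {
        "tokenizers module is available"
    }
}

/// Report which of the listed features were enabled at compile time.
pub fn enabled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(feature = "tokenizers") {
        enabled.push("tokenizers");
    }
    if cfg!(feature = "overlaprs") {
        enabled.push("overlaprs");
    }
    enabled
}
```

Built without any features, the `tokenizers` module is left out of the binary entirely and `enabled_features()` returns an empty list.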

### Python Package

```bash
pip install gtars
```

See further documentation under Python bindings.

### Command-Line Interface

Install from source:

```bash
git clone https://github.com/databio/gtars
cd gtars
cargo install --path gtars-cli
```

Or download precompiled binaries from the [releases page](https://github.com/databio/gtars/releases).

## Development

Run tests with `cargo test` from the workspace root. Please see [CONTRIBUTING.md](https://github.com/databio/gtars/blob/master/CONTRIBUTING.md) for development guidelines.

## Module organization

`gtars` is organized into modules. The modules section gives an [overview of each module](modules.md).

- hot fix for broken python bindings; remove IGD from the python bindings for now

## [0.1.0] -- 2023-12-03
- Rust implementation of `uniwig` that expands on the C++ version
- Uniwig now accepts a single sorted `.bed` file, `.narrowPeak` file, or `.bam` file.
- Outputs now include `.wig`, `.npy`, `.bedGraph`, and `.bw`
- Region scoring matrix calculation for region clustering
- Fragment file splitter for pseudobulking

## [0.0.15] -- 2023-07-29
- added meta tokenization tools and a new `MetaTokenizer` struct that can be used to tokenize regions using the meta-token strategy.
- added some annotations to the `pyo3` `#[pyclass]` and `#[pymethods]` attributes to make the python bindings more readable.

## [0.0.14] -- 2023-06-11
- renamed repository to `gtars` to better reflect the project's goals.

## [0.0.13] -- 2023-06-03
- implemented a fragment file tokenizer that will generate `.gtok` files directly from `fragments.tsv.gz` files.
- fixed an off-by-one error in the `region-to-id` maps in the `Universe` structs. This was leading to critical bugs in our models.

## [0.0.12] -- 2023-05-28
- optimize creation of `PyRegionSet` to reduce expensive cloning of `Universe` structs.

## [0.0.11] -- 2023-05-22
- redesigned API for the tokenizers to better emulate the huggingface tokenizers API.
- implemented new traits for tokenizers to allow for more flexibility when creating new tokenizers.
- bumped the `pyo3` version to `0.21.0`
- added `rust-numpy` dependency to the python bindings for exporting tokenized regions as numpy arrays.
- overall stability improvements to the tokenizers and the python bindings.

## [0.0.10] -- 2024-01-24
- update file format specifications

## [0.0.9] -- 2024-01-22
- start working on the concept of a `.gtok` file-format to store tokenized regions
- added basic readers and writers for this format

## [0.0.8] -- 2024-01-17
- add a new `ids_as_strs` getter to the `TokenizedRegionSet` struct so that we can get the ids as strings quickly; this is meant mostly for interfacing with geniml.

## [0.0.7] -- 2023-11-30
- move things around based on rust club feedback

## [0.0.6] -- 2024-02-20
- update python bindings to support the module/submodule structure (https://github.com/PyO3/pyo3/issues/759#issuecomment-1828431711)
- change name of some submodules
- remove `consts` submodule, just add to base
- expose a `__version__` attribute in the python bindings

## [0.0.5] -- 2024-02-19
- add many "core utils"
- move `gtokenizers` into this package inside `gtars::tokenizers`
- create `tokenize` cli
- add tests for core utils and tokenizers
- RegionSet is now backed by a polars DataFrame
- new python bindings for core utils and tokenizers

## [0.0.4] -- 2023-11-06
- add type annotations to the python bindings

## [0.0.3] -- 2023-11-06
- work on python bindings initialization

## [0.0.2] -- 2023-09-20
- prepare for first release

## [0.0.1] -- 2023-08-15
- initial setup of repository
- two main wrappers: 1) wrapper binary crate, and 2) wrapper library crate
- `gtars` can be used as a library crate or as a command-line tool

Core library providing fundamental data structures and utilities for genomic interval operations. This is the foundation that all other gtars modules build upon.

## Features

- Common genomic data structures (Region, RegionSet)
- BED file parsing utilities
- Shared constants and helper functions
- Foundation for all gtars modules

## Core Data Types

### Region

Represents a genomic interval with chromosome, start, and end coordinates:
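
A minimal sketch of such a type follows, assuming hypothetical field names (`chr`, `start`, `end`) and the half-open `[start, end)` convention used by BED files. The actual `Region` definition in gtars may differ, and `parse_bed_line` is an illustrative helper rather than part of the gtars API.

```rust
/// Illustrative stand-in for a genomic interval; NOT the actual gtars `Region` type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub chr: String,
    pub start: u32,
    pub end: u32,
}

impl Region {
    /// Width in base pairs under the half-open [start, end) convention.
    pub fn width(&self) -> u32 {
        self.end.saturating_sub(self.start)
    }

    /// True if both regions are on the same chromosome and share at least one base.
    pub fn overlaps(&self, other: &Region) -> bool {
        self.chr == other.chr && self.start < other.end && other.start < self.end
    }
}

/// Parse the first three tab-separated columns of a BED line (hypothetical helper).
/// Returns `None` if a column is missing or a coordinate is not a valid integer.
pub fn parse_bed_line(line: &str) -> Option<Region> {
    let mut fields = line.trim_end().split('\t');
    let chr = fields.next()?.to_string();
    let start = fields.next()?.parse::<u32>().ok()?;
    let end = fields.next()?.parse::<u32>().ok()?;
    Some(Region { chr, start, end })
}
```

With this sketch, `parse_bed_line("chr1\t100\t200\tpeak1")` yields a region of width 100 on `chr1`; extra columns after the third are ignored.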