Commit 6c8af0b

Concatenated tokenizers into one page

1 parent 4748dea

3 files changed: +57 -72 lines


docs/gtars/python/tokenizers.md

Lines changed: 0 additions & 70 deletions
This file was deleted.

docs/gtars/tokenizers.md

Lines changed: 57 additions & 1 deletion
@@ -1,4 +1,4 @@
-# gtars-tokenizers
+# gtars tokenizers
 
 The gtars package contains a genomic tokenizers module.
 These are used to convert genomic interval data from disparate sources into a consistent universe, or consensus set.
@@ -66,4 +66,60 @@ The tokenizers were designed to be as compatible as possible with HuggingFace Tr
 
     vocab_size = tokenizer.vocab_size
    special_tokens_map = tokenizer.special_tokens_map
+    ```
+
+## Using a tokenizer from a pre-trained model
+We can also download the universe (vocabulary) for a pre-trained model from Hugging Face and use that to instantiate our tokenizer.
+=== "Python"
+    ```python
+    from gtars.tokenizers import Tokenizer
+
+    # identical API to huggingface
+    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
+
+    tokens = tokenizer.tokenize("path/to/intervals.bed")
+    print(tokens)
+    # >>> ["chr1:100-200", "chr1:200-300", ...]
+    ```
+
+## Working with the tokenizer API
+We designed the tokenizer API to be congruent with the [Hugging Face Tokenizers library](https://github.com/huggingface/tokenizers), making it easy to integrate with modern machine learning workflows tailored to genomic data.
+
+### Getting input ids
+It is common to represent genomic intervals as input ids for machine learning models, particularly for transformer-based architectures. These input ids are typically derived from the tokenized representation of the genomic intervals. You can obtain the input ids from the tokenizer as follows:
+
+=== "Python"
+    ```python
+    from gtars.tokenizers import Tokenizer
+    from gtars.models import RegionSet
+
+    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
+    rs = RegionSet("path/to/intervals.bed")
+
+    tokens = tokenizer(rs)
+    print(tokens["input_ids"])
+    # >>> [101, 202, 111]
+    ```
+
+### Getting special tokens
+Special tokens are integral to the tokenizer's functionality, providing markers for padding, masking, and classification tasks. You can access the special tokens map from the tokenizer as follows:
+
+=== "Python"
+    ```python
+    from gtars.tokenizers import Tokenizer
+
+    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
+    print(tokenizer.special_tokens_map)
+    # >>> {'pad_token': '<pad>', 'mask_token': '<mask>', 'cls_token': '<cls>', ...}
+    ```
+
+### Decoding input ids
+For generative tasks, or when you need to convert input ids back to their original genomic interval representation, you can use the tokenizer's decode method:
+=== "Python"
+    ```python
+    from gtars.tokenizers import Tokenizer
+
+    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
+    decoded = tokenizer.decode([101, 202, 111])
+    # >>> chr1:100-200, chr1:200-300, ...
     ```
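
Taken together, the sections added above describe one encode/decode round trip. The following is a minimal sketch of that workflow, assembled only from the calls shown in this diff (the model name and BED path are the illustrative values from those examples), not a verified program:

```python
# Round-trip sketch: tokenize a BED file, inspect the input ids,
# then decode the ids back to interval strings.
from gtars.tokenizers import Tokenizer
from gtars.models import RegionSet

# Illustrative values reused from the documentation above.
tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
rs = RegionSet("path/to/intervals.bed")

encoded = tokenizer(rs)        # HuggingFace-style call returning a dict
ids = encoded["input_ids"]     # e.g. [101, 202, 111]
print(tokenizer.decode(ids))   # e.g. "chr1:100-200, chr1:200-300, ..."
```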

mkdocs.yml

Lines changed: 0 additions & 1 deletion
@@ -187,7 +187,6 @@ nav:
       - Overview: gtars/python-overview.md
       - Digests: gtars/python/digests.md
       - RefgetStore: gtars/python/refgetstore.md
-      - Tokenizers: gtars/python/tokenizers.md
   - Wasm:
       - Overview: gtars/wasm.md
       - Overlappers: gtars/wasm/overlappers.md
