# gtars tokenizers

The gtars package contains a genomic tokenizers module. These tokenizers are used to convert genomic interval data from disparate sources into a consistent universe, or consensus set, of regions.
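
To make the idea concrete, here is a toy, pure-Python sketch of what tokenizing against a universe means: each raw interval is replaced by the consensus region(s) it overlaps. The universe and the simple overlap rule below are illustrative assumptions, not the gtars implementation; the actual tokenizers (shown later) do this efficiently over real BED files.

=== "Python"
    ```python
    # Toy illustration only: map raw intervals onto a fixed "universe"
    # of consensus regions by overlap.
    universe = [("chr1", 100, 200), ("chr1", 200, 300)]  # hypothetical consensus set

    def to_universe(interval, universe):
        """Return the universe regions that a query interval overlaps."""
        chrom, start, end = interval
        return [
            f"{c}:{s}-{e}"
            for (c, s, e) in universe
            if c == chrom and s < end and start < e
        ]

    print(to_universe(("chr1", 150, 250), universe))
    # ['chr1:100-200', 'chr1:200-300']
    ```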

The tokenizers were designed to be as compatible as possible with HuggingFace Transformers.

=== "Python"
    ```python
    vocab_size = tokenizer.vocab_size
    special_tokens_map = tokenizer.special_tokens_map
    ```

## Using a tokenizer from a pre-trained model
We can also download the universe (vocabulary) of a pre-trained model from Hugging Face and use it to instantiate a tokenizer.

=== "Python"
    ```python
    from gtars.tokenizers import Tokenizer

    # identical API to Hugging Face
    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")

    tokens = tokenizer.tokenize("path/to/intervals.bed")
    print(tokens)
    # >>> ["chr1:100-200", "chr1:200-300", ...]
    ```

## Working with the tokenizer API
We designed the tokenizer API to be congruent with the [Hugging Face Tokenizers library](https://github.com/huggingface/tokenizers), making it easy to integrate genomic data into modern machine learning workflows.

### Getting input ids
It is common to represent genomic intervals as input ids for machine learning models, particularly for transformer-based architectures. These input ids are typically derived from the tokenized representation of the genomic intervals. You can obtain them from the tokenizer as follows:

=== "Python"
    ```python
    from gtars.tokenizers import Tokenizer
    from gtars.models import RegionSet

    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
    rs = RegionSet("path/to/intervals.bed")

    tokens = tokenizer(rs)
    print(tokens["input_ids"])
    # >>> [101, 202, 111]
    ```
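
Because the encoding mirrors the Hugging Face convention (a dict containing `"input_ids"`), the result drops into deep learning frameworks with little glue code. The following is a minimal sketch, assuming PyTorch is installed; the embedding layer is a stand-in for your own model and is not part of gtars.

=== "Python"
    ```python
    import torch
    from gtars.tokenizers import Tokenizer
    from gtars.models import RegionSet

    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
    rs = RegionSet("path/to/intervals.bed")

    # Wrap the ids in a tensor and run them through a stand-in embedding layer.
    input_ids = torch.tensor(tokenizer(rs)["input_ids"], dtype=torch.long).unsqueeze(0)
    embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=64)
    hidden = embedding(input_ids)
    print(hidden.shape)  # (1, sequence_length, 64)
    ```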

### Getting special tokens
Special tokens are integral to the tokenizer's functionality, providing markers for padding, masking, and classification tasks. You can access the special tokens map from the tokenizer as follows:

=== "Python"
    ```python
    from gtars.tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
    print(tokenizer.special_tokens_map)
    # >>> {'pad_token': '<pad>', 'mask_token': '<mask>', 'cls_token': '<cls>', ...}
    ```
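
A common use of the pad token is batching variable-length interval sets. The sketch below pads a batch of input id lists to a common length in plain Python; the pad id of 0 is an assumption for illustration only, so take the real id for `tokenizer.special_tokens_map["pad_token"]` from your tokenizer's vocabulary rather than hard-coding it.

=== "Python"
    ```python
    # Pad a batch of input id sequences to the same length (plain Python).
    PAD_ID = 0  # assumption for illustration; look up the real pad id from your tokenizer

    batch = [[101, 202, 111], [101, 202]]
    max_len = max(len(seq) for seq in batch)

    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

    print(padded)          # [[101, 202, 111], [101, 202, 0]]
    print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
    ```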

### Decoding input ids
For generative tasks, or when you need to convert input ids back to their original genomic interval representation, you can use the tokenizer's decode method:

=== "Python"
    ```python
    from gtars.tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("databio/atacformer-base-hg38")
    decoded = tokenizer.decode([101, 202, 111])
    print(decoded)
    # >>> chr1:100-200, chr1:200-300, ...
    ```
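
If you want to persist decoded regions, they can be written back out as a BED file. This is a minimal sketch in plain Python; it assumes the decoded regions come back as `chrom:start-end` strings like those shown above (hard-coded here for illustration), so adjust the parsing if your version of decode returns a different structure.

=== "Python"
    ```python
    # Write decoded regions back out as a 3-column BED file.
    # This list stands in for the output of tokenizer.decode(...).
    decoded = ["chr1:100-200", "chr1:200-300"]

    with open("decoded_intervals.bed", "w") as f:
        for region in decoded:
            chrom, coords = region.split(":")
            start, end = coords.split("-")
            f.write(f"{chrom}\t{start}\t{end}\n")
    ```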