Commit e24b4f5

feat(ai_grouping): Send token length metrics on stacktraces sent to Seer (#101477)
In preparation for switching from frame count to token length when evaluating errors, we record metrics on the token length of stacktraces sent to Seer, so that we can map out the statistics and the impact the switch would have. Instrumented `get_token_count` to monitor how long it takes. Introduces the `tokenizers` library for token counting. Added the local tokenization model to Sentry so that tokenization works without external dependencies. Redo of #99873, which removed the `tiktoken` dep by mistake; it is still used in `getsentry` and causes build errors if removed.
1 parent 644f987 commit e24b4f5

9 files changed: +31112 -4 lines changed


pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -97,6 +97,7 @@ dependencies = [
     "structlog>=22.1.0",
     "symbolic>=12.14.1",
     "tiktoken>=0.8.0",
+    "tokenizers>=0.22.0",
     "tldextract>=5.1.2",
     "toronado>=0.1.0",
     "typing-extensions>=4.12.0",
@@ -295,6 +296,7 @@ module = [
     "onelogin.saml2.idp_metadata_parser.*",
     "rb.*",
     "statsd.*",
+    "tokenizers.*",
     "u2flib_server.model.*",
 ]
 ignore_missing_imports = true

src/sentry/data/models/README.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Sentry ML Models
+
+This directory contains machine learning models used by Sentry.
+
+## Tokenizer Model
+
+### jina-embeddings-v2-base-en
+
+This directory contains the tokenizer model for the Jina AI embeddings v2 base English model.
+
+- **Model**: `jinaai/jina-embeddings-v2-base-en`
+- **File**: `jina-embeddings-v2-base-en/tokenizer.json`
+- **Usage**: Used by `src/sentry/seer/similarity/utils.py` for tokenizing stacktrace text
+
+### Updating the Model
+
+To update or re-download the tokenizer model, you can run:
+
+```python
+from tokenizers import Tokenizer
+import os
+from sentry.constants import DATA_ROOT
+
+# Download and save the model
+tokenizer = Tokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
+model_path = os.path.join(DATA_ROOT, "models", "jina-embeddings-v2-base-en", "tokenizer.json")
+os.makedirs(os.path.dirname(model_path), exist_ok=True)
+tokenizer.save(model_path)
+```
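
The commit message mentions instrumenting `get_token_count` to measure both the token length of a stacktrace and how long tokenization takes. The actual implementation in `src/sentry/seer/similarity/utils.py` is not shown in this excerpt, so the following is only a minimal sketch of the idea. The function signature, the `WhitespaceTokenizer` stand-in, and the `FakeEncoding` class are assumptions made for illustration; the stand-in exists only so the sketch runs without the `tokenizers` package or the model file.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeEncoding:
    """Hypothetical stand-in for an encoding result; only `ids` is used."""

    ids: list[int]


class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer that splits on whitespace.

    It mimics the one method this sketch relies on: encode(text) returning
    an object with an `ids` list. It is NOT the Jina tokenizer.
    """

    def encode(self, text: str) -> FakeEncoding:
        return FakeEncoding(ids=list(range(len(text.split()))))


def get_token_count(stacktrace_text: str, tokenizer) -> tuple[int, float]:
    """Return (token_count, elapsed_seconds) for one stacktrace string.

    Assumed signature: the real function in Sentry may differ.
    """
    start = time.perf_counter()
    token_count = len(tokenizer.encode(stacktrace_text).ids)
    elapsed = time.perf_counter() - start
    return token_count, elapsed


stacktrace = 'File "app.py", line 3, in main\n    run()'
count, seconds = get_token_count(stacktrace, WhitespaceTokenizer())
print(f"{count} tokens, tokenized in {seconds:.6f}s")
```

With the real library, one would presumably pass `Tokenizer.from_file(model_path)` (using the `tokenizer.json` path from the README snippet) in place of the stand-in; the two returned values are the kind of measurements that could then be reported as metrics.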
