# convert : allow partial update to the chkhsh pre-tokenizer list #13847
Changes to `convert_hf_to_gguf_update.py`:

```diff
@@ -1,28 +1,6 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-

-# This script downloads the tokenizer models of the specified models from Huggingface and
-# generates the get_vocab_base_pre() function for convert_hf_to_gguf.py
-#
-# This is necessary in order to analyze the type of pre-tokenizer used by the model and
-# provide the necessary information to llama.cpp via the GGUF header in order to implement
-# the same pre-tokenizer.
-#
-# ref: https://github.com/ggml-org/llama.cpp/pull/6920
-#
-# Instructions:
-#
-# - Add a new model to the "models" list
-# - Run the script with your huggingface token:
-#
-#   python3 convert_hf_to_gguf_update.py <huggingface_token>
-#
-# - The convert_hf_to_gguf.py script will have had its get_vocab_base_pre() function updated
-# - Update llama.cpp with the new pre-tokenizer if necessary
-#
-# TODO: generate tokenizer tests for llama.cpp
-#
-
 import logging
 import os
 import pathlib
```
```diff
@@ -32,15 +10,22 @@
 import sys
 import json
 import shutil
+import argparse

 from hashlib import sha256
 from enum import IntEnum, auto
 from transformers import AutoTokenizer
+from collections import OrderedDict

 logging.basicConfig(level=logging.DEBUG)
 logger = logging.getLogger("convert_hf_to_gguf_update")
 sess = requests.Session()

+convert_py_pth = pathlib.Path("convert_hf_to_gguf.py")
+convert_py = convert_py_pth.read_text(encoding="utf-8")
+hf_token_pth = pathlib.Path.home() / ".cache" / "huggingface" / "token"
+hf_token = hf_token_pth.read_text(encoding="utf-8").strip() if hf_token_pth.exists() else None
+

 class TOKENIZER_TYPE(IntEnum):
     SPM = auto()
```
```diff
@@ -49,20 +34,49 @@ class TOKENIZER_TYPE(IntEnum):
     UGM = auto()


+DOC_STRING = """
+This script downloads the tokenizer models of the specified models from Huggingface and
+generates the get_vocab_base_pre() function for convert_hf_to_gguf.py
+
+/!\\ It is intended to be used by contributors and is not meant to be run by end users
+
+This is necessary in order to analyze the type of pre-tokenizer used by the model and
+provide the necessary information to llama.cpp via the GGUF header in order to implement
+the same pre-tokenizer.
+
+ref: https://github.com/ggml-org/llama.cpp/pull/6920
+
+Instructions:
+
+- Add a new model to the "models" list
+- Run the script with your huggingface token
+    By default, token will be read from ~/.cache/huggingface/token
+- The convert_hf_to_gguf.py script will have had its get_vocab_base_pre() function updated
+- Update llama.cpp with the new pre-tokenizer if necessary
+"""
+# TODO: generate tokenizer tests for llama.cpp
+
+parser = argparse.ArgumentParser(description=DOC_STRING, formatter_class=argparse.RawTextHelpFormatter)
+parser.add_argument(
+    "--full", action="store_true",
+    help="download full list of models - make sure you have access to all of them",
+)
+parser.add_argument(
+    "hf_token",
+    help="optional HF token",
+    nargs="?",
+)
+args = parser.parse_args()
+hf_token = args.hf_token if args.hf_token is not None else hf_token
+
+if hf_token is None:
+    logger.error("HF token is required. Please provide it as an argument or set it in ~/.cache/huggingface/token")
+    sys.exit(1)
```
Review thread on lines +71 to +73 (the `hf_token is None` check):

**Collaborator:** Since this will now be used for mostly public models, I don't think we should require a token.

**Collaborator (author):** All of them are public, but some are gated, so a token is still needed. For example: gemma, llama, dbrx, command-r, etc.

**Collaborator:** Yeah, but isn't the point that regular people don't have to download them now?
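For context, here is a minimal sketch (not code from this PR; the helper name is made up) of the token-resolution order the diff implements: an explicit CLI token wins, otherwise the script falls back to the token cached at `~/.cache/huggingface/token`.

```python
# Hypothetical helper mirroring the fallback logic in the diff above.
import pathlib


def resolve_hf_token(cli_token: str | None) -> str | None:
    if cli_token is not None:
        return cli_token  # positional argument takes precedence
    cached = pathlib.Path.home() / ".cache" / "huggingface" / "token"
    # fall back to the token cached by a previous `huggingface-cli login`, if any
    return cached.read_text(encoding="utf-8").strip() if cached.exists() else None
```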
```diff
 # TODO: this string has to exercise as much pre-tokenizer functionality as possible
 # will be updated with time - contributions welcome
 CHK_TXT = '\n \n\n \n\n\n \t \t\t \t\n  \n   \n    \n     \n🚀 (normal) 😶‍🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天～ ------======= нещо на Български \'\'\'\'\'\'```````\"\"\"\"......!!!!!!?????? I\'ve been \'told he\'s there, \'RE you sure? \'M not sure I\'ll make it, \'D you like some tea? We\'Ve a\'lL'

-if len(sys.argv) == 2:
-    token = sys.argv[1]
-    if not token.startswith("hf_"):
-        logger.info("Huggingface token seems invalid")
-        logger.info("Usage: python convert_hf_to_gguf_update.py <huggingface_token>")
-        sys.exit(1)
-else:
-    logger.info("Usage: python convert_hf_to_gguf_update.py <huggingface_token>")
-    sys.exit(1)

 # TODO: add models here, base models preferred
 models = [
     {"name": "llama-spm", "tokt": TOKENIZER_TYPE.SPM, "repo": "https://huggingface.co/meta-llama/Llama-2-7b-hf", },
```
```diff
@@ -114,11 +128,19 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "trillion", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/trillionlabs/Trillion-7B-preview", },
     {"name": "bailingmoe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/inclusionAI/Ling-lite", },
     {"name": "llama4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct", },
-    {"name": "chatglm-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-chat", },
     {"name": "glm4", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-hf", },
     {"name": "pixtral", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistral-community/pixtral-12b", },
     {"name": "seed-coder", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base", },
 ]

+# some models are known to be broken upstream, so we will skip them as exceptions
+pre_computed_hashes = [
+    # chatglm-bpe has 2 hashes, why?
+    {"name": "chatglm-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-chat", "chkhsh": "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b"},
+    {"name": "chatglm-bpe", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/THUDM/glm-4-9b-chat", "chkhsh": "81d72c7348a9f0ebe86f23298d37debe0a5e71149e29bd283904c02262b27516"},
+]


 def download_file_with_auth(url, token, save_path):
     headers = {"Authorization": f"Bearer {token}"}
```
```diff
@@ -169,9 +191,29 @@ def download_model(model):
         if os.path.isfile(save_path):
             logger.info(f"{name}: File {save_path} already exists - skipping")
             continue
-        download_file_with_auth(f"{repo}/resolve/main/{file}", token, save_path)
+        download_file_with_auth(f"{repo}/resolve/main/{file}", hf_token, save_path)


+# get list of existing models and chkhsh from the convert_hf_to_gguf.py file
+# returns mapping res --> chkhsh
+def get_existing_models(convert_py):
+    pattern = r'if chkhsh == "([a-f0-9]{64})":\s*\n\s*.*\s*res = "([^"]+)"'
+    matches = re.findall(pattern, convert_py)
+    output = OrderedDict()  # make sure order is preserved
+    for chkhsh, res in matches:
+        output[res] = chkhsh
+    return output
+
+
+existing_models = {}
+all_models = models.copy()
+if not args.full:
+    # Filter out models that already exist in convert_hf_to_gguf.py
+    existing_models = get_existing_models(convert_py)
+    all_models = models.copy()
+    models = [model for model in all_models if model["name"] not in existing_models]
+
+logging.info(f"Downloading {len(models)} models...")
 for model in models:
     try:
         download_model(model)
```
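As a self-contained illustration of what `get_existing_models()` matches, the sketch below runs the same regex against a hand-written fragment shaped like the generated `get_vocab_base_pre()` body (the hash, repo, and name in the fragment are illustrative placeholders):

```python
# Standalone demo (not from the PR) of the (chkhsh, res) extraction regex.
import re
from collections import OrderedDict

snippet = '''
        if chkhsh == "0ef9807a4087ebef797fc749390439009c3b9eda9ad1a097abbe738f486c01e5":
            # ref: https://huggingface.co/meta-llama/Meta-Llama-3-8B
            res = "llama-bpe"
'''

pattern = r'if chkhsh == "([a-f0-9]{64})":\s*\n\s*.*\s*res = "([^"]+)"'
mapping = OrderedDict((res, chkhsh) for chkhsh, res in re.findall(pattern, snippet))
print(mapping)  # OrderedDict with 'llama-bpe' mapped to the 64-char hash
```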
```diff
@@ -182,9 +224,10 @@ def download_model(model):
 # generate the source code for the convert_hf_to_gguf.py:get_vocab_base_pre() function:

 src_ifs = ""
-for model in models:
+for model in [*all_models, *pre_computed_hashes]:
     name = model["name"]
     tokt = model["tokt"]
+    chkhsh = model.get("chkhsh")

     if tokt == TOKENIZER_TYPE.SPM or tokt == TOKENIZER_TYPE.UGM:
         continue
```
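A tiny illustration (hypothetical entries) of why the loop body can rely on `model.get("chkhsh")`: regular entries in `models` carry no `chkhsh` key, while `pre_computed_hashes` entries do, so `.get()` returns `None` exactly for the models whose hash still needs to be computed or looked up:

```python
# Hypothetical entries; only the pre-computed one carries a "chkhsh" key.
regular = [{"name": "llama-spm"}]
pre_computed = [{"name": "chatglm-bpe", "chkhsh": "b6e8e151..."}]

for model in [*regular, *pre_computed]:
    print(model["name"], model.get("chkhsh"))
# llama-spm None
# chatglm-bpe b6e8e151...
```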
```diff
@@ -195,35 +238,44 @@ def download_model(model):
         continue

-    # create the tokenizer
-    try:
-        if name == "t5":
-            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
-        else:
-            tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
-    except OSError as e:
-        logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
-        continue  # Skip to the next model if the tokenizer can't be loaded
-
-    chktok = tokenizer.encode(CHK_TXT)
-    chkhsh = sha256(str(chktok).encode()).hexdigest()
-
-    logger.info(f"model: {name}")
-    logger.info(f"tokt: {tokt}")
-    logger.info(f"repo: {model['repo']}")
-    logger.info(f"chktok: {chktok}")
-    logger.info(f"chkhsh: {chkhsh}")
-
-    # print the "pre_tokenizer" content from the tokenizer.json
-    with open(f"models/tokenizers/{name}/tokenizer.json", "r", encoding="utf-8") as f:
-        cfg = json.load(f)
-        normalizer = cfg["normalizer"]
-        logger.info("normalizer: " + json.dumps(normalizer, indent=4))
-        pre_tokenizer = cfg["pre_tokenizer"]
-        logger.info("pre_tokenizer: " + json.dumps(pre_tokenizer, indent=4))
-        if "ignore_merges" in cfg["model"]:
-            logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
-
-    logger.info("")
+    if chkhsh is not None:
+        # if the model has a pre-computed hash, use it
+        logger.info(f"Using pre-computed hash for model {name}: {chkhsh}")
+    elif name in existing_models:
+        # if the model already exists in convert_hf_to_gguf.py, skip compute hash
+        chkhsh = existing_models[name]
+    else:
+        # otherwise, compute the hash of the tokenizer
+        try:
+            logger.info(f"Loading tokenizer from {f'models/tokenizers/{name}'}...")
+            if name == "t5":
+                tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}", use_fast=False)
+            else:
+                tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
+        except OSError as e:
+            logger.error(f"Error loading tokenizer for model {name}. The model may not exist or is not accessible with the provided token. Error: {e}")
+            continue  # Skip to the next model if the tokenizer can't be loaded
+
+        chktok = tokenizer.encode(CHK_TXT)
+        chkhsh = sha256(str(chktok).encode()).hexdigest()
+
+        logger.info(f"model: {name}")
+        logger.info(f"tokt: {tokt}")
+        logger.info(f"repo: {model['repo']}")
+        logger.info(f"chktok: {chktok}")
+        logger.info(f"chkhsh: {chkhsh}")
+
+        # print the "pre_tokenizer" content from the tokenizer.json
+        with open(f"models/tokenizers/{name}/tokenizer.json", "r", encoding="utf-8") as f:
+            cfg = json.load(f)
+            normalizer = cfg["normalizer"]
+            logger.info("normalizer: " + json.dumps(normalizer, indent=4))
+            pre_tokenizer = cfg["pre_tokenizer"]
+            logger.info("pre_tokenizer: " + json.dumps(pre_tokenizer, indent=4))
+            if "ignore_merges" in cfg["model"]:
+                logger.info("ignore_merges: " + json.dumps(cfg["model"]["ignore_merges"], indent=4))
+
+        logger.info("")

     src_ifs += f"        if chkhsh == \"{chkhsh}\":\n"
     src_ifs += f"            # ref: {model['repo']}\n"
```
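To make the hashing step concrete, here is a minimal reproduction of how a `chkhsh` is derived in the branch above; the token ids are made-up stand-ins for `tokenizer.encode(CHK_TXT)`:

```python
# Minimal reproduction of the chkhsh derivation: sha256 over the Python string
# form of the token-id list.
from hashlib import sha256

chktok = [101, 3087, 29871, 13]  # hypothetical token ids for CHK_TXT
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # 64-char hex digest fingerprinting the tokenizer's behavior
```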
```diff
@@ -271,8 +323,6 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         return res
 """

-convert_py_pth = pathlib.Path("convert_hf_to_gguf.py")
-convert_py = convert_py_pth.read_text(encoding="utf-8")
 convert_py = re.sub(
     r"(# Marker: Start get_vocab_base_pre)(.+?)( +# Marker: End get_vocab_base_pre)",
     lambda m: m.group(1) + src_func + m.group(3),
```
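The marker-based splice is easy to demo in isolation. Below is a toy version with a stand-in source string; note that `flags=re.DOTALL` is an assumption (the diff cuts off before the call's final arguments, but `.+?` has to span newlines for the pattern to match at all):

```python
# Toy demonstration (not from the PR) of the marker splice used above.
import re

src = (
    "# Marker: Start get_vocab_base_pre\n"
    "old generated body\n"
    "    # Marker: End get_vocab_base_pre\n"
)
src_func = "\nnew generated body\n"

updated = re.sub(
    r"(# Marker: Start get_vocab_base_pre)(.+?)( +# Marker: End get_vocab_base_pre)",
    lambda m: m.group(1) + src_func + m.group(3),
    src,
    flags=re.DOTALL,  # assumed: lets .+? cross line breaks
)
print(updated)
```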
Review thread on the minerva-7b pre-tokenizer entry:

**Collaborator:** What happened here, was the model updated?

**Collaborator:** They updated tokenizer.json and removed `{ "type": "Digits", "individual_digits": true }`. Might warrant an updated regex?

**Collaborator:** Yep, we need to preserve the old hash as `minerva-7b`, using this regex (llama.cpp/src/llama-vocab.cpp, lines 337 to 341 in c3a2624), and add a new name for the new hash, using this regex (llama.cpp/src/llama-vocab.cpp, lines 347 to 351 in c3a2624).

**Collaborator (author):** I brought back the old hash, so nothing changes for this model. Tbh, I don't think anyone is actually using it, so it's not worth the time to fix.
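In generated-code terms, "preserving the old hash" means the emitted `get_vocab_base_pre()` ends up with two branches returning the same `res` name, as with the two chatglm-bpe hashes in `pre_computed_hashes` above. A simplified sketch (plain function for illustration; the hashes are copied from the diff):

```python
# Simplified sketch: two distinct chkhsh values resolving to one pre-tokenizer
# name, which is the effect of keeping both the old and the new hash.
OLD_HASH = "b6e8e1518dc4305be2fe39c313ed643381c4da5db34a98f6a04c093f8afbe99b"
NEW_HASH = "81d72c7348a9f0ebe86f23298d37debe0a5e71149e29bd283904c02262b27516"


def get_vocab_base_pre(chkhsh: str) -> str:
    if chkhsh == OLD_HASH or chkhsh == NEW_HASH:
        # both known chatglm-bpe hashes map to the same pre-tokenizer name
        return "chatglm-bpe"
    raise ValueError(f"unknown pre-tokenizer hash: {chkhsh}")
```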