Update rabbit_hole.py: tokenizer encounters a special token (`\n`) #1054
canapaio wants to merge 2 commits into cheshire-cat-ai:main from canapaio:patch-2

Conversation
The problem is that the tokenizer encounters a special token (`\n`) that is not allowed by default, causing an error. To fix it, we explicitly allow `\n` using `allowed_special={"\n"}` and disable checks for other special tokens with `disallowed_special=()`.
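The behavior can be sketched with a simplified, pure-Python model of the tokenizer's special-token guard (an illustration of the logic described above, not tiktoken's actual implementation; `encode_with_guard` and its stand-in encoding are hypothetical):

```python
import re

def encode_with_guard(text, special_tokens,
                      allowed_special=frozenset(), disallowed_special="all"):
    # Simplified model of the special-token check: by default every
    # special token found in the text raises an error.
    if disallowed_special == "all":
        disallowed_special = set(special_tokens) - set(allowed_special)
    if disallowed_special:
        pattern = "|".join(re.escape(tok) for tok in disallowed_special)
        found = re.search(pattern, text)
        if found:
            raise ValueError(
                f"Encountered disallowed special token {found.group()!r}")
    return text.split()  # stand-in for real BPE encoding

# With the defaults, a text containing the special token "\n" raises ...
try:
    encode_with_guard("line one\nline two", special_tokens={"\n"})
except ValueError as err:
    print(err)

# ... while the settings from this PR let it through:
tokens = encode_with_guard("line one\nline two", special_tokens={"\n"},
                           allowed_special={"\n"}, disallowed_special=())
print(tokens)  # ['line', 'one', 'line', 'two']
```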
The error in the log is resolved with the added code.
The reason is that I often import documentation about prompts and LLM systems, and these documents often contain special characters, which leads to this type of error. I have also implemented a function to keep the data flow from blocking process execution:

```python
import re

########################################
# string cleanup: strip harmful characters
########################################
# def kre(text: str, cat) -> str:
def kre(text: str) -> str:
    """
    Clean the text of harmful characters.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    # settings = cat.mad_hatter.get_plugin().load_settings()
    sostituzioni = [
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        (r'\[', '['),
        (r'\]', ']'),
        (r'\|', '|'),
        ('<', '<'),
        ('>', '>'),
        ('@', '@'),
        ('{', '{'),
        ('}', '}')
    ]
    for old, new in sostituzioni:
        text = re.sub(old, new, text)
    return text
```
```python
########################################
# string cleanup: strip harmful characters
########################################
# def kre(text: str, cat) -> str:
def krec(text: str, cat) -> str:
    """
    Clean the text of harmful characters.

    Args:
        text (str): The text to modify.

    Returns:
        str: The modified text.
    """
    settings = cat.mad_hatter.get_plugin().load_settings()
    sostituzioni = [
        ('- AI', '- KaguraAI'),
        ('<think>', '<Ragionamento>'),
        ('</think>', '</Ragionamento>'),
        # ('- Human', '- Canapaio'),
        ('- Human', f" - {settings['user_name']}"),
        (r'\[', '['),
        (r'\]', ']'),
        (r'\|', '|'),
        ('<', '<'),
        ('>', '>'),
        ('@', '@'),
        ('{', '{'),
        ('}', '}')
    ]
    for old, new in sostituzioni:
        text = re.sub(old, new, text)
    return text
```