The bug
I have fine-tuned a Mistral 7B v3 model on this dataset https://huggingface.co/datasets/maurya/alpaca_ccass_motivations_sommaires_titres, which includes special tokens that need to be added to the tokenizer, as mentioned in the README. However, when I use this model with its specific tokenizer in guidance, I get a ParserError:
Warning: Parser Error: token " N" doesn't satisfy the grammar; forced bytes: got 'N'; applying ' '; stopping
ValueError Traceback (most recent call last)
Cell In[14], line 1
----> 1 model_guidance + select(["INJONCTION DE PAYER", "PROCEDURE"])
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/models/_base/_model.py:104, in Model.__add__(self, other)
102 return other(self)
103 if isinstance(other, ASTNode):
--> 104 self = self._apply_node(other)
105 self = self._update_open_block_captures()
106 return self
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/models/_base/_model.py:132, in Model._apply_node(self, node)
129 else:
130 self._update_trace_node(self._id, self._parent_id, StatelessGuidanceInput(value=node))
--> 132 for i, output_attr in enumerate(self._interpreter.run(node)):
133 if isinstance(output_attr, TokenOutput) and not output_attr.is_input:
134 # TODO: put this elsewhere (inside state?)
135 self.token_count += 1
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/models/_base/_interpreter.py:36, in Interpreter.run(self, node, **kwargs)
35 def run(self, node: ASTNode, **kwargs) -> Iterator[OutputAttr]:
---> 36 yield from node.simplify()._run(self, **kwargs)
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/models/_engine/_interpreter.py:77, in EngineInterpreter.grammar(self, node, **kwargs)
69 engine_gen = self.engine(
70 state=self.state,
71 grammar=node.ll_grammar(),
72 ensure_bos_token=True,
73 echo=False,
74 )
76 delayed_bytes = b""
---> 77 for chunk in engine_gen:
78 new_bytes = recode_special_tokens(self.engine.tokenizer, chunk.new_bytes)
79 new_text, delayed_bytes = partial_decode(delayed_bytes + new_bytes)
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/models/_engine/_engine.py:181, in Engine.__call__(self, state, grammar, ensure_bos_token, echo)
179 recode = True
180 else:
--> 181 backtrack, ff_tokens, mask_fut = parser.advance(
182 token_id=engine_output.issued_token.token_id
183 )
185 if backtrack:
186 backtracked_bytes = self.tokenizer.decode(tokens[-backtrack:])
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/_parser.py:66, in TokenParser.advance(self, token_id)
63 if self.done():
64 raise TokenParserException("Cannot advance on a done parser")
---> 66 return self._generator.send(token_id)
File ~/.conda/envs/axolotl/lib/python3.11/site-packages/guidance/_parser.py:171, in TokenParser._parse(self)
163 if not mask[token_id]:
164 # Note: we could punt this probem to ll_interpreter.post_process,
165 # but it's a bit clearer to handle it here
166 raise InvalidTokenException(
167 token=token_id,
168 valid_tokens=[i for i in range(len(mask)) if mask[i]],
169 )
--> 171 backtrack, ff_tokens = self.ll_interpreter.commit_token(
172 token_id
173 )
ValueError: Parser Error: token " N" doesn't satisfy the grammar; forced bytes: got 'N'; applying ' '
<state>
Tokens: ⟦I‧ N⟧
2 tokens, 1 bytes; grm_prefix: ""
Flags:
Parser: {
"compute_time_us": 262,
"rows": 2,
"cached_rows": 0,
"all_items": 2,
"lexer_cost": 2174,
"slices_applied": 0,
"trie_nodes_walked": 461,
"definitive_bytes": 19,
"lexer_ops": 0,
"num_lex_errors": 0,
"num_lexemes": 0
}
Stop: ParserTooComplex
Error: Parser Error: token " N" doesn't satisfy the grammar; forced bytes: got 'N'; applying ' '
</state>
<grammar>
%llguidance {}
start: START
START: SELECT
SELECT: "INJONCTION DE PAYER"
| "PROCEDURE"
</grammar>
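The state block above shows what goes wrong: after the token "I", the grammar forces the byte 'N' (continuing "INJONCTION …"), but the model issues the token " N", which starts with a space. A minimal pure-Python sketch of this byte-alignment check (a simplified illustration, not llguidance's actual implementation):

```python
# Hypothetical sketch of the forced-bytes check that constrained decoding applies.
# An issued token is acceptable only if its bytes line up with the bytes the
# grammar forces next; the helper name is made up for this illustration.
def extends_forced_prefix(forced: bytes, issued: bytes) -> bool:
    """Return True if the issued token's bytes agree with the forced bytes."""
    n = min(len(forced), len(issued))
    return forced[:n] == issued[:n]

forced = b"NJONCTION DE PAYER"   # bytes the grammar forces after the "I" token
issued = " N".encode("utf-8")    # token the fine-tuned tokenizer actually issued

print(extends_forced_prefix(forced, issued))  # False -> "token \" N\" doesn't satisfy the grammar"
print(extends_forced_prefix(forced, b"NJ"))   # True  -> a byte-aligned token would be accepted
```

The leading space in " N" is the mismatch: it cannot extend the forced prefix, so the parser raises the error even though 'N' itself is the right character.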
To Reproduce
from guidance import select, models
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sommaire-titres-model-2025-mistral-v3")
model_guidance = models.Transformers("sommaire-titres-model-2025-mistral-v3", tokenizer=tokenizer, echo=False, device_map = 'cuda')
model_guidance + select(["INJONCTION DE PAYER", "PROCEDURE"])

System info (please complete the following information):
- Ubuntu, python 3.11.11
- Guidance Version 0.2.3