Expose get_capture and get_captures API to Python #289

Open
kmaehashi wants to merge 2 commits into guidance-ai:main from kmaehashi:expose-capture-to-python

@kmaehashi

When debugging grammars, it is handy if users can see which parts of a string are captured by which rule. This PR adds get_capture and get_captures to the Matcher and exposes them to Python so that matched strings for [capture] rules can be inspected from Python.

@hudson-ai
Contributor

Hey @kmaehashi thanks for the contribution! Exposing the captures through python is reasonable, esp. from a debugging standpoint. Would you mind adding a test or two to python/torch_tests/test_matcher.py? (note to self: this needs to be renamed to something other than torch_tests...).

I also notice that the python api is a little inconsistent with the rust api, namely captures vs get_captures. I might actually agree with you here, with get_captures feeling a bit more idiomatic for python. Looks good :)

@kmaehashi
Author

Hi @hudson-ai, thanks for the review! I've just added the tests. I was actually waiting to make sure we were on the same page regarding the API design before writing them, so I'm glad we agree on get_captures :)

@hudson-ai
Contributor

Looking good! One more request (would have mentioned sooner, but I had to re-acquaint myself with this code...) -- would you document (and encode in the test) that get_capture returns the last matching capture, and get_captures includes all captures, even when there are repeats? Not sure how to phrase that nicely...

I.e.

grm = r"""start: "hello " group1 group2+
group1[capture,lazy]: /[a-z]+/
group2[capture="body"]: /[a-z]{4}/"""
m = matcher(grm)

m.consume_tokens(tokenizer().tokenize_str("hello worldabcd"))

assert m.get_capture("group1") == b"w"
assert m.get_capture("body") == b"abcd"
assert m.get_captures() == [("group1", b"w"), ("body", b"orld"), ("body", b"abcd")]

This answers the question that I found myself asking... "why doesn't get_captures return a dict?"
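
The point about repeats can be illustrated without llguidance itself. The following is only a minimal sketch: it assumes get_captures() returns a list of (name, bytes) pairs as in the example above, the `captures` list is hard-coded rather than produced by a matcher, and `last_capture` is a hypothetical helper that mimics the "last matching capture" behavior (returning None for an unknown name is an assumption, not confirmed by this thread).

```python
# Hard-coded stand-in for the list shown in the example above;
# no llguidance matcher is involved in this sketch.
captures = [("group1", b"w"), ("body", b"orld"), ("body", b"abcd")]


def last_capture(captures, name):
    """Hypothetical helper: return the last value captured under `name`.

    Returning None when the name never matched is an assumption made
    for this sketch only.
    """
    for cap_name, value in reversed(captures):
        if cap_name == name:
            return value
    return None


# Building a dict keeps only the last value per name, so the earlier
# "body" capture (b"orld") is silently dropped.
as_dict = dict(captures)

# Collecting every value per name requires the list form.
all_body = [value for cap_name, value in captures if cap_name == "body"]
```

Here `as_dict` has two entries while `captures` has three, which is the information a dict return type would lose.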

@kmaehashi
Author

Makes sense! Updated accordingly, let me know what you think.

@hudson-ai
Contributor

hudson-ai commented Feb 26, 2026

Looks good to me, but please remove the nfs temp file you committed by accident:
python/llguidance/.nfs000000000077e0330000005c

Beyond this, any objections, @riedgar-ms?

@kmaehashi kmaehashi force-pushed the expose-capture-to-python branch from e84c7a5 to 279d366 Compare February 27, 2026 00:47
2 participants