support for INDENT/DEDENT tokens #128

Draft
mmoskal wants to merge 7 commits into main from indent

@mmoskal
Member

mmoskal commented Feb 24, 2025

Fixes #107

@nchammas

Hello @mmoskal. I am trying to have Guidance generate code with significant indentation, so I am following your work here.

Do you intend, as part of this work, to implement support for Lark's %declare statement? It's a critical part of how Lark supports significant indentation, at least as far as I understand from reviewing the relevant docs and Lark's Python grammar.

@mmoskal
Member Author

mmoskal commented Mar 17, 2025

%declare in Lark just says the definition of the token is provided elsewhere. For constrained generation we actually need to know what the INDENT and DEDENT tokens do. Unfortunately, it's far from simple.
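
To illustrate what "knowing what the tokens do" involves, here is a minimal sketch of an indentation post-lexer in plain Python. It is not the LLGuidance implementation and not Lark's `Indenter`. The function name `indent_tokens` and the `LINE`/`INDENT`/`DEDENT` token shapes are assumptions made for illustration, and it ignores parentheses, comments, and continuation lines. The point is that whether INDENT or DEDENT is emitted depends on a stack of earlier indentation widths, not on the whitespace of the current line alone.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text); hypothetical token shape for this sketch


def indent_tokens(text: str, tab_len: int = 8) -> Iterator[Token]:
    """Yield LINE/INDENT/DEDENT tokens for indentation-structured text."""
    stack: List[int] = [0]  # indentation widths of the currently open blocks
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines do not change the indentation level
        body = raw.lstrip(" \t")
        prefix = raw[: len(raw) - len(body)]
        # Width of the leading whitespace; each tab counts as tab_len columns.
        width = prefix.count(" ") + prefix.count("\t") * tab_len
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", "")
        else:
            while width < stack[-1]:
                stack.pop()
                yield ("DEDENT", "")
            if width != stack[-1]:
                raise ValueError(f"inconsistent dedent on line: {raw!r}")
        yield ("LINE", body.rstrip())
    # Close any blocks that are still open at the end of the input.
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", "")


sample = "a\n    b\n    c\n        d\n    e\n"
kinds = [kind for kind, _ in indent_tokens(sample)]
```

For `sample`, one DEDENT is emitted before `e` and another at end of input, so a single line of text can trigger zero, one, or several structural tokens depending on what came before it.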

@nchammas

Yes, in Lark you need to provide an instance of lark.indenter.Indenter:

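from lark import Lark
from lark.indenter import Indenter

class TreeIndenter(Indenter):
    NL_type = '_NL'
    OPEN_PAREN_types = []
    CLOSE_PAREN_types = []
    INDENT_type = '_INDENT'
    DEDENT_type = '_DEDENT'
    tab_len = 8

# tree_grammar is the Lark grammar string, defined elsewhere
parser = Lark(tree_grammar, parser='lalr', postlex=TreeIndenter())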

I already have an indentation-significant DSL built using Lark, and I am hoping to use the same grammar mostly as-is with Guidance to have an LLM output queries in my DSL. To do that, Guidance's Lark interface would probably need to accept some bit of configuration equivalent to the above.

Is that something you are planning to do, or will the approach be very different? I know this is a work in progress, so I don't expect any definite answers. Just sharing my use case.

Separately, would it help at all if Lark itself provided some kind of API to help with next token prediction? I can see that you have built your own implementation of Lark in Rust (roughly speaking), but I wonder if direct support from Lark itself would also be useful somehow.

@mmoskal
Member Author

mmoskal commented Mar 18, 2025

There is a similar setup in this PR.

When designing a new DSL to be written by LLMs I would suggest not using indentation. AFAIU it makes the LLM stupider, as it has to keep track of it, instead of simply following braces.

Unfortunately, changes to Lark's Python code cannot be used in LLGuidance.

@nchammas

> When designing a new DSL to be written by LLMs I would suggest not using indentation. AFAIU it makes the LLM stupider, as it has to keep track of it, instead of simply following braces.

Oh, that's surprising to hear. I assumed that due to the popularity of Python and YAML, LLMs wouldn't have a particular problem with indentation-significant languages. I wonder why tracking indentation level would be much harder for LLMs than tracking braces.

The DSL I designed is primarily for humans to write, but I am exploring how practical it is to have an LLM assist non-technical users by converting the queries they write in English into DSL queries.

If indentation is such a problem, perhaps I should develop some kind of additional JSON format for my DSL just for LLMs to target. Then I would be able to use the much more mature support for JSON schema-constrained generation. Not sure if this would be a lot of work, or if it would work well, as I haven't worked with JSON schemas before. But that's what I'll research next, I think.
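
As a rough sketch of that idea, the snippet below assumes a purely hypothetical tree-shaped DSL in which every node has a `text` line and optional `children`. The schema, the field names, and the helpers `render` and `to_dsl` are all invented for illustration and do not describe my actual DSL or any Guidance API. The LLM would target the JSON form, and a small converter would turn it back into indented text.

```python
import json
from typing import Any, Dict, List

# Hypothetical recursive JSON Schema for one DSL node (assumption, not a real format).
QUERY_SCHEMA: Dict[str, Any] = {
    "$defs": {
        "node": {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
                "children": {"type": "array", "items": {"$ref": "#/$defs/node"}},
            },
            "required": ["text"],
            "additionalProperties": False,
        }
    },
    "$ref": "#/$defs/node",
}


def render(node: Dict[str, Any], depth: int = 0, indent: str = "    ") -> List[str]:
    """Return the indented DSL lines for a node and all of its descendants."""
    lines = [indent * depth + node["text"]]
    for child in node.get("children", []):
        lines.extend(render(child, depth + 1, indent))
    return lines


def to_dsl(json_text: str) -> str:
    """Convert the JSON form (as produced by the model) into indented DSL text."""
    return "\n".join(render(json.loads(json_text))) + "\n"


example = json.dumps({
    "text": "filter",
    "children": [
        {"text": "age > 30"},
        {"text": "any", "children": [{"text": "city = Paris"}]},
    ],
})
dsl_text = to_dsl(example)
```

With this split, nesting is carried by JSON brackets during generation, and indentation is produced deterministically afterwards, so the model never has to count spaces.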

Development

Successfully merging this pull request may close these issues.

Indent/Dedent

2 participants