Closed
Description
The blog post at https://www.boundaryml.com/blog/schema-aligned-parsing proposes a schema-aligned parsing technique for structured extraction using LLMs. It highlights:
- Defining explicit schemas for JSON outputs
- Prompting the model to adhere strictly to the schema, reducing hallucinations
- Validating the generated output against the schema
- Iteratively refining prompts and schema definitions
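The generate, validate, and refine cycle summarized above could look roughly like the sketch below. It is not taken from the blog post: `call_model` is a placeholder for whatever LLM client we use, `REQUIRED_KEYS` is a hypothetical stand-in for a real schema, and re-prompting with the list of violations is only one possible way to refine.

```python
import json
from typing import Callable

# Hypothetical minimal "schema": the keys every extracted block must carry.
REQUIRED_KEYS = {"type", "text"}


def find_violations(raw: str) -> list[str]:
    """Return human-readable problems found in a raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    missing = sorted(REQUIRED_KEYS - set(data))
    return [f"missing key: {key}" for key in missing]


def extract_with_retries(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Generate, validate, and re-prompt with the violations until valid."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        violations = find_violations(raw)
        if not violations:
            return json.loads(raw)
        # Feed the concrete violations back so the next attempt can fix them.
        current_prompt = (
            f"{prompt}\n\nYour previous answer was rejected:\n- "
            + "\n- ".join(violations)
        )
    raise ValueError(f"no schema-conformant output after {max_attempts} attempts")
```

Capping the number of attempts keeps a persistently malformed response from looping forever; the right cap for our pipeline would need to be measured.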
This approach could improve our document-to-JSON converters by:
- Defining JSON schemas for each content block (heading, paragraph, code, list, etc.)
- Enhancing prompts in our parsing pipeline to enforce schema conformance
- Adding a validation step post-generation to catch schema violations
- Experimenting with automatic schema adaptation based on document type
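As a starting point for discussion, per-block schemas and a schema-aware prompt might be sketched as follows. Everything here is an assumption for illustration: the field names (`level`, `language`, `source`, `ordered`, `items`), the document types in `BLOCKS_BY_DOCUMENT_TYPE`, and the prompt wording are hypothetical and do not reflect our current converters.

```python
import json

# Hypothetical schemas in JSON Schema style; field names are illustrative only.
BLOCK_SCHEMAS = {
    "heading": {
        "type": "object",
        "properties": {
            "type": {"enum": ["heading"]},
            "level": {"type": "integer"},
            "text": {"type": "string"},
        },
        "required": ["type", "level", "text"],
    },
    "paragraph": {
        "type": "object",
        "properties": {
            "type": {"enum": ["paragraph"]},
            "text": {"type": "string"},
        },
        "required": ["type", "text"],
    },
    "code": {
        "type": "object",
        "properties": {
            "type": {"enum": ["code"]},
            "language": {"type": "string"},
            "source": {"type": "string"},
        },
        "required": ["type", "source"],
    },
    "list": {
        "type": "object",
        "properties": {
            "type": {"enum": ["list"]},
            "ordered": {"type": "boolean"},
            "items": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["type", "items"],
    },
}

# Hypothetical mapping for schema adaptation: which blocks a document type may use.
BLOCKS_BY_DOCUMENT_TYPE = {
    "api_reference": ["heading", "paragraph", "code"],
    "meeting_notes": ["heading", "paragraph", "list"],
}


def build_prompt(document_text: str, block_types: list[str]) -> str:
    """Embed only the schemas relevant to this document in the prompt."""
    schemas = {name: BLOCK_SCHEMAS[name] for name in block_types}
    return (
        "Convert the document into a JSON array of content blocks.\n"
        "Each block must conform to one of these JSON schemas:\n"
        f"{json.dumps(schemas, indent=2)}\n"
        "Return only the JSON array.\n\n"
        f"Document:\n{document_text}"
    )
```

For example, `build_prompt(text, BLOCKS_BY_DOCUMENT_TYPE["meeting_notes"])` would leave the `code` schema out of the prompt entirely, which is one simple form the document-type adaptation could take.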
Proposed tasks:
- Research and design core JSON schemas for existing content-block types
- Update the parsing agent to include schema definitions in prompts
- Implement a lightweight JSON schema validator in our pipeline
- Measure the change in extraction accuracy and schema-violation rate against the current pipeline
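To make the validator and measurement tasks concrete, here is a minimal sketch of a validator covering only a small subset of JSON Schema (`type`, `enum`, `required`, `properties`, `items`), plus a violation-rate metric. `HEADING_SCHEMA` and the function names are hypothetical; a full implementation would more likely rely on an existing validation library than on hand-rolled checks.

```python
from typing import Any

# Mapping from the JSON Schema type names handled here to Python types.
JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "boolean": bool,
}

# Hypothetical schema for a heading block, used only for illustration.
HEADING_SCHEMA = {
    "type": "object",
    "properties": {
        "type": {"enum": ["heading"]},
        "level": {"type": "integer"},
        "text": {"type": "string"},
    },
    "required": ["type", "level", "text"],
}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of violations; an empty list means the value conforms."""
    errors: list[str] = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for "integer".
        bool_as_int = expected == "integer" and isinstance(value, bool)
        if not isinstance(value, JSON_TYPES[expected]) or bool_as_int:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: value not in {schema['enum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required property missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors


def violation_rate(outputs: list[Any], schema: dict) -> float:
    """Fraction of outputs with at least one schema violation."""
    if not outputs:
        return 0.0
    return sum(1 for output in outputs if validate(output, schema)) / len(outputs)
```

Running `violation_rate` over the same document set before and after the prompt changes would give one comparable error-rate number; extraction accuracy would still need a labeled reference set.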
References:
- Boundary ML blog: https://www.boundaryml.com/blog/schema-aligned-parsing