Ensuring accurate table extraction and JSON validation using OpenAI models or other (for Pydantic users) #3383

@MansourKama

Question

Hi everyone,

I know this isn’t directly a Pydantic-related question, but since many of you work with structured data validation and model outputs, I believe some might have faced this issue.

When I use OpenAI’s o4-mini model (or others) to extract a large table and format it as JSON,
the model sometimes skips rows or hallucinates entries.
Interestingly, ChatGPT itself handles the same task perfectly, but my API-based agents do not.

I’m looking for ways to make the output structurally accurate before validation with Pydantic.

🧩 Approaches I’ve seen mentioned

  • Splitting the document into chunks before feeding it to the model
  • Creating an agent that dynamically generates code to parse and reformat the table
  • Enforcing a strict Pydantic schema with retries until valid output is returned

I haven’t tested these yet, so if anyone has experience with these methods and found them reliable and efficient,
I’d really appreciate it if you could share your code or any implementation examples.
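For the third approach, here is a minimal sketch of a validate-and-retry loop using Pydantic v2. The `Row`/`Table` schema and the `fake_model` stub are illustrative assumptions; in practice `call_model` would wrap your actual OpenAI API call, and `feedback` would be appended to the prompt so the model can correct itself.

```python
from typing import Callable, List

from pydantic import BaseModel, ValidationError


# Hypothetical schema for the extracted table; adapt to your columns.
class Row(BaseModel):
    name: str
    value: float


class Table(BaseModel):
    rows: List[Row]


def extract_with_retries(
    call_model: Callable[[str], str], max_retries: int = 3
) -> Table:
    """Call the model, validate its JSON against Table, and retry,
    feeding the validation error back on each failed attempt."""
    feedback = ""
    for _ in range(max_retries):
        raw = call_model(feedback)
        try:
            return Table.model_validate_json(raw)
        except ValidationError as exc:
            feedback = f"Previous output failed validation:\n{exc}"
    raise RuntimeError("No schema-valid output after retries")


# Stub standing in for the API: first reply is malformed, second is valid.
_replies = iter([
    '{"rows": [{"name": "a", "value": "not a number"}]}',
    '{"rows": [{"name": "a", "value": 1.5}, {"name": "b", "value": 2.0}]}',
])


def fake_model(feedback: str) -> str:
    return next(_replies)


table = extract_with_retries(fake_model)
```

Note that this guarantees only *structural* validity; it cannot detect silently dropped rows. One mitigation is to also have the model report the row count it sees and cross-check it against `len(table.rows)`.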

Thanks a lot!

Additional Context

No response
