Ensuring accurate table extraction and JSON validation using OpenAI models or other (for Pydantic users) #3383

@MansourKama

Question

Hi everyone,

I know this isn’t directly a Pydantic-related question, but since many of you work with structured data validation and model outputs, I believe some might have faced this issue.

When I use OpenAI’s o4-mini model (or others) to extract a large table and format it as JSON,
the model sometimes skips rows or hallucinates entries.
Interestingly, ChatGPT itself handles the same task perfectly, but my API-based agents do not.

I’m looking for ways to make the output structurally accurate before validation with Pydantic.

🧩 Approaches I’ve seen mentioned

  • Splitting the document into chunks before feeding it to the model
  • Creating an agent that dynamically generates code to parse and reformat the table
  • Enforcing a strict Pydantic schema with retries until valid output is returned

I haven’t tested these yet, so if anyone has experience with these methods and found them reliable and efficient,
I’d really appreciate it if you could share your code or any implementation examples.
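For the third approach, here is a minimal sketch of a validate-and-retry loop using Pydantic v2. The `Row`/`Table` schema and the `fake_model` stub are illustrative assumptions; in practice `call_model` would wrap your actual OpenAI API call, and `feedback` would be appended to the prompt so the model can correct itself.

```python
from typing import Callable, List

from pydantic import BaseModel, ValidationError


# Hypothetical schema for the extracted table; adapt to your columns.
class Row(BaseModel):
    name: str
    value: float


class Table(BaseModel):
    rows: List[Row]


def extract_with_retries(
    call_model: Callable[[str], str], max_retries: int = 3
) -> Table:
    """Call the model, validate its JSON against Table, and retry,
    feeding the validation error back on each failed attempt."""
    feedback = ""
    for _ in range(max_retries):
        raw = call_model(feedback)
        try:
            return Table.model_validate_json(raw)
        except ValidationError as exc:
            feedback = f"Previous output failed validation:\n{exc}"
    raise RuntimeError("No schema-valid output after retries")


# Stub standing in for the API: first reply is malformed, second is valid.
_replies = iter([
    '{"rows": [{"name": "a", "value": "not a number"}]}',
    '{"rows": [{"name": "a", "value": 1.5}, {"name": "b", "value": 2.0}]}',
])


def fake_model(feedback: str) -> str:
    return next(_replies)


table = extract_with_retries(fake_model)
```

Note that this guarantees only *structural* validity; it cannot detect silently dropped rows. One mitigation is to also have the model report the row count it sees and cross-check it against `len(table.rows)`.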

Thanks a lot!

Additional Context

No response
