bottleneck identification using instructor and pydantic output models. #51

Draft
bhupatiraju wants to merge 2 commits into main from PFM-bottlenecks

@bhupatiraju
Contributor

This branch contains the refactored code from the PFM-bottleneck extraction project.

The data_extraction notebook extracts the PEF and PFR docs from the WB docs API, stores a metadata table, and constructs a table of chunks from the chunked document texts. The LLM pipeline reads its input from this table of chunks.
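
To make the table of chunks concrete, here is a minimal sketch of how it could be built. It is an illustration only, not the code in the notebook: the `ChunkRow` fields, the function names, and the fixed-size word-window strategy are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChunkRow:
    doc_id: str    # hypothetical document identifier from the metadata table
    chunk_id: int  # position of the chunk within its document
    text: str      # the chunk text handed to the LLM pipeline


def chunk_document(doc_id: str, text: str, max_words: int = 200) -> list[ChunkRow]:
    """Split one document into fixed-size word windows (an assumed strategy)."""
    words = text.split()
    return [
        ChunkRow(doc_id, start // max_words, " ".join(words[start:start + max_words]))
        for start in range(0, len(words), max_words)
    ]


def build_chunk_table(docs: dict[str, str], max_words: int = 200) -> list[ChunkRow]:
    """Flatten all documents into one table with a row per chunk."""
    return [
        row
        for doc_id, text in docs.items()
        for row in chunk_document(doc_id, text, max_words)
    ]
```

Keeping `doc_id` on every row means any pipeline output for a chunk can be joined back to the metadata table.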

The runner (notebook) is the entry point for running the LLM pipeline. The models folder contains the Pydantic models defining the output structure from the LLMs. Currently, it's organized by stage, with the extraction models and the validation models grouped separately. An alternative was to organize the models bottleneck-wise, so that the extraction and validation models for a specific bottleneck stay in one place. Since we have only 4 bottlenecks in place for now, this is the organization I went with.
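
For readers who have not opened the models folder, a stage-wise layout might look roughly like the sketch below. The class and field names are hypothetical and are not copied from the branch; only `BaseModel`, `Field`, and `ValidationError` are real Pydantic names.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


# --- extraction stage (hypothetical models) ---
class BottleneckEvidence(BaseModel):
    """One piece of evidence for a bottleneck found in a chunk."""
    bottleneck_id: str = Field(description="Identifier of the bottleneck")
    quote: str = Field(description="Verbatim passage supporting the bottleneck")
    reasoning: str = Field(description="Why the passage indicates the bottleneck")


class ExtractionResult(BaseModel):
    """Everything extracted from a single chunk; empty if nothing was found."""
    evidence: List[BottleneckEvidence] = Field(default_factory=list)


# --- validation stage (hypothetical model) ---
class ValidationVerdict(BaseModel):
    """Second-pass judgement on one extracted piece of evidence."""
    is_valid: bool
    confidence: float = Field(ge=0.0, le=1.0)  # rejected if outside [0, 1]
    comment: Optional[str] = None
```

With instructor, a model like `ExtractionResult` would be passed as the `response_model` argument of the chat completion call, so the LLM response comes back already parsed and validated against the schema.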

The construction of the client is separated into the azure_service.py file. The prompts.py file contains the methods that format the required prompts in the extraction and validation stages. Finally, consts contains the LLM model signature and the initial descriptions of the bottlenecks, which are subsequently used in formatting the prompts.
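
As a rough picture of how consts and prompts.py relate, the sketch below formats one prompt per stage from a stored bottleneck description. The dictionary contents, function names, and prompt wording are placeholders for illustration, not what the branch contains.

```python
# Hypothetical stand-in for the bottleneck descriptions kept in consts.
BOTTLENECK_DESCRIPTIONS = {
    "example_bottleneck": "Placeholder description of a bottleneck.",
}


def format_extraction_prompt(bottleneck_id: str, chunk_text: str) -> str:
    """Build the extraction-stage prompt for one bottleneck and one chunk."""
    description = BOTTLENECK_DESCRIPTIONS[bottleneck_id]
    return (
        f"Bottleneck: {description}\n\n"
        "Identify any evidence of this bottleneck in the text below and "
        "quote the supporting passage verbatim.\n\n"
        f"Text:\n{chunk_text}"
    )


def format_validation_prompt(bottleneck_id: str, quote: str) -> str:
    """Build the validation-stage prompt for one extracted quote."""
    description = BOTTLENECK_DESCRIPTIONS[bottleneck_id]
    return (
        f"Bottleneck: {description}\n\n"
        "Does the following passage genuinely support this bottleneck?\n\n"
        f"Passage:\n{quote}"
    )
```

Keeping the descriptions in one place means a change to a bottleneck definition is picked up by both stages.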

@yukinko-iwasaki
Contributor

yukinko-iwasaki commented Jun 26, 2025

This is great! Thanks for sharing your code!
Should we create a separate repo for this? It looks unrelated to our boost project.
I could request to create a new repo under dime-worldbank.
