bottleneck identification using instructor and pydantic output models.#51
Draft
bhupatiraju wants to merge 2 commits into main from
Conversation
Contributor
This is great! Thanks for sharing your code!
This branch contains the refactored code from the PFM-bottleneck extraction project.
The data_extraction notebook handles extraction of the PEF and PFR docs from the WB docs API, stores a metadata table, and constructs a table of chunks from the chunked document texts. The LLM pipeline reads its chunks from this table.
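The chunking step above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `Chunk` fields, the character-based splitting, and the size/overlap defaults are all assumptions.

```python
# Hypothetical sketch of building a table of chunks from document texts.
# Field names and the fixed-size overlapping strategy are assumptions.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str       # document identifier from the metadata table
    chunk_index: int  # position of this chunk within the document
    text: str         # chunk text that the LLM pipeline will consume


def chunk_document(doc_id: str, text: str, size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Split one document into overlapping fixed-size character chunks."""
    chunks: list[Chunk] = []
    step = size - overlap
    start, index = 0, 0
    while start < len(text):
        chunks.append(Chunk(doc_id, index, text[start:start + size]))
        start += step
        index += 1
    return chunks
```

The resulting list of rows can then be written out as the table of chunks the pipeline iterates over.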
The runner notebook is the entry point for running the LLM pipeline. The models folder contains the Pydantic models defining the output structure from the LLMs. Currently, it's organized by stage, with the extraction and validation models and their related models kept separately. An alternative was to organize the models bottleneck-wise, so that the extraction and validation models for a specific bottleneck live in one place. Since we only have 4 bottlenecks for now, I went with the stage-wise organization.
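A stage-wise pair of output models might look like the sketch below. The class and field names (`BottleneckExtraction`, `BottleneckValidation`, `evidence`, `confidence`) are illustrative assumptions, not the PR's actual models.

```python
# Hypothetical extraction- and validation-stage output models.
# All names here are assumptions for illustration.
from pydantic import BaseModel, Field


class BottleneckExtraction(BaseModel):
    """Extraction stage: the LLM flags a bottleneck and quotes evidence."""
    bottleneck_present: bool = Field(
        description="Whether the chunk mentions this bottleneck"
    )
    evidence: str = Field(
        default="", description="Verbatim supporting text from the chunk"
    )


class BottleneckValidation(BaseModel):
    """Validation stage: a second pass scores the extracted evidence."""
    is_valid: bool = Field(
        description="Whether the evidence actually supports the bottleneck"
    )
    confidence: float = Field(
        ge=0.0, le=1.0, description="Model-reported confidence in [0, 1]"
    )
```

With instructor, such a model is passed as the `response_model` so the LLM's reply is parsed and validated into a typed object.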
The construction of the client is separated into the azure_service.py file. The prompts.py file contains the helper methods that format the required prompts for the extraction and validation stages, and consts contains the LLM model signature and the initial descriptions of the bottlenecks, which are subsequently used in formatting the prompts.
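The interaction between consts and the prompt helpers could be sketched like this. The dictionary name, the single bottleneck entry, and `format_extraction_prompt` are hypothetical stand-ins for the actual contents of consts and prompts.py; the commented-out client lines show one common way to build an instructor-patched Azure OpenAI client.

```python
# Hypothetical consts: bottleneck descriptions keyed by name.
BOTTLENECK_DESCRIPTIONS = {
    "procurement_delay": (
        "Delays in the procurement process hold up project execution."
    ),
}


def format_extraction_prompt(bottleneck: str, chunk_text: str) -> str:
    """Build the extraction-stage prompt from a description and a chunk."""
    description = BOTTLENECK_DESCRIPTIONS[bottleneck]
    return (
        f"Bottleneck definition: {description}\n\n"
        f"Document excerpt:\n{chunk_text}\n\n"
        "Does the excerpt describe this bottleneck? "
        "Extract supporting evidence."
    )


# azure_service.py's role (sketch only; requires credentials, so kept as
# comments here):
# import instructor
# from openai import AzureOpenAI
# client = instructor.from_openai(
#     AzureOpenAI(api_key=..., api_version=..., azure_endpoint=...)
# )
```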