Finetuning Embedding Model #4

@Lawrence-Godfrey

Description

We need to figure out a way to fine-tune the embedding model, the idea being to improve its performance over using a foundation model like CodeBERT out of the box. In this case "performance" means producing embedding vectors which capture the meaning of a code unit and which are similar to the embeddings of the natural language queries used to search the code units.

CodeBERT is pre-trained on pairs of code and documentation, so it already has some understanding of the relationship between natural language and code units. But it still might not be optimised for the exact kind of semantic search we're using it for.

For fine-tuning, we could use "contrastive learning", which is where you train the model to produce similar embeddings for related pieces of text (in our case, a query and the code unit we want it to match with) and very different embeddings for unrelated pieces of text. So each training sample has query | correct code unit | unrelated code unit.
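To make the triplet idea concrete, here's a minimal sketch of the loss we'd be optimising, written with plain NumPy instead of a real training framework (the function and variable names are just illustrative, not part of any existing code):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(query_emb, pos_emb, neg_emb, margin=0.2):
    # Push the query's similarity to the correct code unit above its
    # similarity to the unrelated code unit by at least `margin`.
    # Loss is zero once the triplet is already separated by the margin.
    return max(0.0, margin - cosine_sim(query_emb, pos_emb) + cosine_sim(query_emb, neg_emb))

# Toy embeddings: the "positive" points roughly the same way as the
# query, the "negative" points elsewhere.
query = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])
negative = np.array([0.0, 1.0, 0.0])

print(triplet_loss(query, positive, negative))  # 0.0: already separated by the margin
```

In actual fine-tuning the embeddings would come from the model being trained and the gradient of this loss (or an in-batch variant like InfoNCE) would update its weights, but the objective is the same.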

To start we could create our own dataset by hand, maybe using an open-source codebase or a synthetic one. Down the line we could look into generating these training samples using GPT/Claude, etc.
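A hand-built dataset could be as simple as a JSONL file of triplets. A sketch, assuming a hypothetical `triplets.jsonl` file name and made-up example code units:

```python
import json

# Hand-written triplets; in practice the positives could come from real
# code units in the target codebase and the negatives could be sampled
# from unrelated units (or generated with an LLM later on).
triplets = [
    {
        "query": "parse a date string into a datetime object",
        "positive": "def parse_date(s):\n    return datetime.strptime(s, '%Y-%m-%d')",
        "negative": "def send_email(to, subject, body):\n    ...",
    },
]

with open("triplets.jsonl", "w") as f:
    for t in triplets:
        f.write(json.dumps(t) + "\n")
```

One record per line keeps the format easy to append to and easy to stream during training.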
