Finetuning Embedding Model #4

@Lawrence-Godfrey

Description

We need to figure out a way to fine-tune the embedding model, the idea being to improve its performance over using a foundation model like CodeBERT out of the box. In this case "performance" means producing embedding vectors which capture the meaning of a code unit and which are similar to the embeddings of the natural language queries used to search the code units.

CodeBERT is pre-trained on pairs of code and documentation, so it already has some understanding of the relationship between natural language and code units. But it still might not be optimised for the exact kind of semantic search we're using it for.

For fine-tuning, we could use "contrastive learning", which is where you train the model to produce similar embeddings for related pieces of text (in our case, a query and the code unit we want it to match with) and very different embeddings for unrelated pieces of text. So each training sample has query | correct code unit | unrelated code unit.
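To make the triplet idea concrete, here's a minimal sketch of the loss we'd be optimising, written with plain NumPy instead of a real training framework (the function and variable names are just illustrative, not part of any existing code):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(query_emb, pos_emb, neg_emb, margin=0.2):
    # Push the query's similarity to the correct code unit above its
    # similarity to the unrelated code unit by at least `margin`.
    # Loss is zero once the triplet is already separated by the margin.
    return max(0.0, margin - cosine_sim(query_emb, pos_emb) + cosine_sim(query_emb, neg_emb))

# Toy embeddings: the "positive" points roughly the same way as the
# query, the "negative" points elsewhere.
query = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])
negative = np.array([0.0, 1.0, 0.0])

print(triplet_loss(query, positive, negative))  # 0.0: already separated by the margin
```

In actual fine-tuning the embeddings would come from the model being trained and the gradient of this loss (or an in-batch variant like InfoNCE) would update its weights, but the objective is the same.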

To start we could create our own dataset by hand, maybe using an open-source codebase or a synthetic one. Down the line we could look into generating these training samples using GPT/Claude, etc.
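A hand-built dataset could be as simple as a JSONL file of triplets. A sketch, assuming a hypothetical `triplets.jsonl` file name and made-up example code units:

```python
import json

# Hand-written triplets; in practice the positives could come from real
# code units in the target codebase and the negatives could be sampled
# from unrelated units (or generated with an LLM later on).
triplets = [
    {
        "query": "parse a date string into a datetime object",
        "positive": "def parse_date(s):\n    return datetime.strptime(s, '%Y-%m-%d')",
        "negative": "def send_email(to, subject, body):\n    ...",
    },
]

with open("triplets.jsonl", "w") as f:
    for t in triplets:
        f.write(json.dumps(t) + "\n")
```

One record per line keeps the format easy to append to and easy to stream during training.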
