Merged
Changes from 2 commits
2 changes: 1 addition & 1 deletion docs/building-with-codegen/symbol-api.mdx
@@ -38,7 +38,7 @@ All symbols share common APIs for manipulation:
- [symbol.source](/api-reference/core/Symbol#source)
- [symbol.docstring](/api-reference/core/Symbol#docstring)
- Edit operations
- [symbol.set_docstring](/api-reference/core/Symbol#add_comment)
- [symbol.set_docstring](/api-reference/core/Symbol#set-docstring)
- [symbol.move_to_file](/api-reference/core/Symbol#move-to-file) (see [Moving Symbols](/building-with-codegen/moving-symbols))
- Graph relations (See [Usages and Dependencies](/building-with-codegen/dependencies-and-usages))
- [symbol.usages](/api-reference/core/Symbol#usages)
1 change: 1 addition & 0 deletions docs/mint.json
@@ -75,6 +75,7 @@
"tutorials/modularity",
"tutorials/deleting-dead-code",
"tutorials/increase-type-coverage",
"tutorials/training-data",
"tutorials/manage-feature-flags",
"tutorials/managing-typescript-exports",
"tutorials/converting-default-exports",
235 changes: 235 additions & 0 deletions docs/tutorials/training-data.mdx
@@ -0,0 +1,235 @@
---
title: "Generating Training Data for LLMs"
sidebarTitle: "Training Data"
description: "Learn how to generate training data for large language models using Codegen"
icon: "network-wired"
iconType: "solid"
---

This guide demonstrates how to use Codegen to generate high-quality training data for large language models (LLMs) by extracting function implementations along with their dependencies and usages. The approach is similar to [word2vec](https://www.tensorflow.org/text/tutorials/word2vec) or [node2vec](https://snap.stanford.edu/node2vec/): given the context of a function, the model learns to predict the function's implementation.

<Info>View the full code in our [examples repository](https://github.com/codegen-sh/codegen-examples/blob/main/generate_training_data/run.py)</Info>

<Tip>This example works with both Python and TypeScript repositories without modification</Tip>

## Overview

The process involves three main steps:

1. Finding all functions in the codebase
2. Extracting their implementations, dependencies, and usages
3. Generating structured training data

Let's walk through each step using Codegen.

## Step 1: Finding Functions and Their Context

We start with a "graph expansion" for each function: grabbing the function's source, as well as the full source of all usages of the function and all of its dependencies.

<Info>See [dependencies and usages](/building-with-codegen/dependencies-and-usages) to learn more about navigating the code graph</Info>

First, let's import the types we need from Codegen:

```python
import codegen
from codegen import Codebase
from codegen.sdk.core.external_module import ExternalModule
from codegen.sdk.core.import_resolution import Import
from codegen.sdk.core.symbol import Symbol
```

Here's how we get the full context for each function:

```python
def get_function_context(function) -> dict:
    """Get the implementation, dependencies, and usages of a function."""
    context = {
        "implementation": {"source": function.source, "filepath": function.filepath},
        "dependencies": [],
        "usages": [],
    }

    # Add dependencies
    for dep in function.dependencies:
        # Hop through imports to find the root symbol source
        if isinstance(dep, Import):
            dep = hop_through_imports(dep)

        context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})

    # Add usages
    for usage in function.usages:
        context["usages"].append({
            "source": usage.usage_symbol.source,
            "filepath": usage.usage_symbol.filepath,
        })

    return context
```

Notice how we use `hop_through_imports` to resolve dependencies. When working with imports, symbols can be re-exported multiple times. For example, a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation:

```python
def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
    """Finds the root symbol for an import."""
    if isinstance(imp.imported_symbol, Import):
        return hop_through_imports(imp.imported_symbol)
    return imp.imported_symbol
```

This creates a structured representation of each function's context:

```json
{
  "implementation": {
    "source": "def process_data(input: str) -> dict: ...",
    "filepath": "src/data_processor.py"
  },
  "dependencies": [
    {
      "source": "def validate_input(data: str) -> bool: ...",
      "filepath": "src/validators.py"
    }
  ],
  "usages": [
    {
      "source": "result = process_data(user_input)",
      "filepath": "src/api.py"
    }
  ]
}
```
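Because LLMs have finite context windows, it can also help to budget each example's size before training. The helper below is our own addition (not part of the Codegen SDK) that approximates a context's size in characters, a cheap proxy for token count, so overly large samples can be filtered out:

```python
def context_size(context: dict) -> int:
    """Rough size of a context in characters, a cheap proxy for token count."""
    size = len(context["implementation"]["source"])
    for dep in context["dependencies"]:
        size += len(dep["source"])
    for usage in context["usages"]:
        size += len(usage["source"])
    return size
```

You could then skip any context whose size exceeds your model's sequence budget.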

## Step 2: Processing the Codebase

Next, we process all functions in the codebase to generate our training data:

```python
def run(codebase: Codebase):
    """Generate training data using a node2vec-like approach for code embeddings."""
    # Track all function contexts
    training_data = {
        "functions": [],
        "metadata": {
            "total_functions": len(codebase.functions),
            "total_processed": 0,
            "avg_dependencies": 0,
            "avg_usages": 0,
        },
    }

    # Process each function in the codebase
    for function in codebase.functions:
        # Skip if function is too small
        if len(function.source.split("\n")) < 2:
            continue

        # Get function context
        context = get_function_context(function)

        # Only keep functions with enough context
        if len(context["dependencies"]) + len(context["usages"]) > 0:
            training_data["functions"].append(context)

    # Update metadata
    training_data["metadata"]["total_processed"] = len(training_data["functions"])
    if training_data["functions"]:
        training_data["metadata"]["avg_dependencies"] = sum(
            len(f["dependencies"]) for f in training_data["functions"]
        ) / len(training_data["functions"])
        training_data["metadata"]["avg_usages"] = sum(
            len(f["usages"]) for f in training_data["functions"]
        ) / len(training_data["functions"])

    return training_data
```

## Step 3: Running the Generator

Finally, we can run our training data generator on any codebase.

<Note>See [parsing codebases](/building-with-codegen/parsing-codebases) to learn more</Note>

```python
import json

if __name__ == "__main__":
    print("Initializing codebase...")
    codebase = Codebase.from_repo("fastapi/fastapi")

    print("Generating training data...")
    training_data = run(codebase)

    print("Saving training data...")
    with open("training_data.json", "w") as f:
        json.dump(training_data, f, indent=2)
    print("Training data saved to training_data.json")
```

This will:
1. Load the target codebase
2. Process all functions
3. Save the structured training data to a JSON file
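Once the script has run, a quick sanity check of the output file confirms the data looks as expected. This is a hypothetical helper of our own, assuming the JSON layout produced above:

```python
import json


def load_training_data(path: str) -> dict:
    """Load the generated training data and print a metadata summary."""
    with open(path) as f:
        data = json.load(f)
    meta = data["metadata"]
    print(f"{meta['total_processed']}/{meta['total_functions']} functions kept")
    return data
```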

<Tip>
You can use any Git repository as your source codebase by passing the repo URL
to [Codebase.from_repo(...)](/api-reference/core/codebase#from-repo).
</Tip>

## Using the Training Data

The generated data can be used to train LLMs in several ways:

1. **Masked Function Prediction**: Hide a function's implementation and predict it from dependencies and usages
2. **Code Embeddings**: Generate embeddings that capture semantic relationships between functions
3. **Dependency Prediction**: Learn to predict which functions are likely to be dependencies
4. **Usage Pattern Learning**: Train models to understand common usage patterns

For example, to create a masked prediction task:

```python
def create_training_example(function_data):
    """Create a masked prediction example from function data."""
    return {
        "context": {
            "dependencies": function_data["dependencies"],
            "usages": function_data["usages"],
        },
        "target": function_data["implementation"],
    }


# Create training examples
examples = [create_training_example(f) for f in training_data["functions"]]
```
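To feed these examples into a fine-tuning pipeline, a common next step is serializing them as JSONL prompt/completion pairs. This is a sketch under the assumption that your trainer accepts that format; `examples_to_jsonl` is our own helper, not an SDK function:

```python
import json


def examples_to_jsonl(examples: list[dict], path: str) -> None:
    """Write masked-prediction examples as one JSON object per line (JSONL)."""
    with open(path, "w") as f:
        for example in examples:
            # Flatten the context into a single prompt string;
            # the masked implementation is the completion target.
            prompt_parts = [d["source"] for d in example["context"]["dependencies"]]
            prompt_parts += [u["source"] for u in example["context"]["usages"]]
            record = {
                "prompt": "\n\n".join(prompt_parts),
                "completion": example["target"]["source"],
            }
            f.write(json.dumps(record) + "\n")
```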

## Best Practices

1. **Filter Small Functions**: Skip trivial functions that won't provide meaningful training data:
```python
   if len(function.source.split("\n")) < 2:
       continue
```

2. **Ensure Sufficient Context**: Only use functions with dependencies or usages:
```python
   if len(context["dependencies"]) + len(context["usages"]) > 0:
       training_data["functions"].append(context)
```

3. **Track Metadata**: Keep statistics about your training data:
```python
   training_data["metadata"] = {
       "total_functions": len(codebase.functions),
       "total_processed": len(training_data["functions"]),
       "avg_dependencies": average_dependencies,
       "avg_usages": average_usages,
   }
```

4. **Handle Import Chains**: Follow import chains to find root implementations:
```python
   def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
       if isinstance(imp.imported_symbol, Import):
           return hop_through_imports(imp.imported_symbol)
       return imp.imported_symbol
```

By following these guidelines, you can generate high-quality training data for your LLM projects while maintaining code quality and consistency.
20 changes: 20 additions & 0 deletions src/codegen/cli/api/endpoints.py
@@ -9,3 +9,23 @@
LOOKUP_ENDPOINT = f"https://{MODAL_PREFIX}--cli-lookup.modal.run"
RUN_ON_PR_ENDPOINT = f"https://{MODAL_PREFIX}--cli-run-on-pull-request.modal.run"
PR_LOOKUP_ENDPOINT = f"https://{MODAL_PREFIX}--cli-pr-lookup.modal.run"

# Base URLs
CODEGEN_API_URL = "https://api.codegen.sh"
CODEGEN_WEB_URL = "https://codegen.sh"

# API endpoints
CODEGEN_API_DOCS = f"{CODEGEN_API_URL}/docs"
CODEGEN_API_EXAMPLES = f"{CODEGEN_API_URL}/examples"
CODEGEN_API_CODEMOD = f"{CODEGEN_API_URL}/codemod"
CODEGEN_API_CODEMOD_DEPLOY = f"{CODEGEN_API_URL}/codemod/deploy"
CODEGEN_API_CODEMOD_DEPLOY_STATUS = f"{CODEGEN_API_URL}/codemod/deploy/status"
CODEGEN_API_CODEMOD_DEPLOY_CANCEL = f"{CODEGEN_API_URL}/codemod/deploy/cancel"
CODEGEN_API_CODEMOD_DEPLOY_LOGS = f"{CODEGEN_API_URL}/codemod/deploy/logs"

# Web URLs
CODEGEN_WEB_PLAYGROUND = f"{CODEGEN_WEB_URL}/playground"
CODEGEN_WEB_DOCS = f"{CODEGEN_WEB_URL}/docs"

# System prompt URL
CODEGEN_SYSTEM_PROMPT_URL = "https://gist.githubusercontent.com/jayhack/15681a2ceaccd726f19e6fdb3a44738b/raw/17c08054e3931b3b7fdf424458269c9e607541e8/codegen-system-prompt.txt"
3 changes: 1 addition & 2 deletions src/codegen/cli/commands/init/render.py
@@ -6,5 +6,4 @@ def get_success_message(codegen_dir: Path, docs_dir: Path, examples_dir: Path) -
return """📁 .codegen configuration folder created:
[dim]config.toml[/dim] Project configuration
[dim]codemods/[/dim] Your codemod implementations
[dim]jupyter/[/dim] Notebooks for codebase exploration
[dim]prompts/[/dim] AI system prompts (gitignored)"""
[dim]codegen-system-prompt.txt[/dim] AI system prompt (gitignored)"""
13 changes: 13 additions & 0 deletions src/codegen/cli/workspace/initialize_workspace.py
@@ -2,6 +2,7 @@
from contextlib import nullcontext
from pathlib import Path

import requests
import rich
import toml
from rich.status import Status
@@ -78,6 +79,7 @@ def initialize_codegen(
CONFIG_PATH = CODEGEN_FOLDER / "config.toml"
JUPYTER_DIR = CODEGEN_FOLDER / "jupyter"
CODEMODS_DIR = CODEGEN_FOLDER / "codemods"
SYSTEM_PROMPT_PATH = CODEGEN_FOLDER / "codegen-system-prompt.txt"

# If status is a string, create a new spinner
context = create_spinner(f" {status} folders...") if isinstance(status, str) else nullcontext()
@@ -91,6 +93,16 @@
JUPYTER_DIR.mkdir(parents=True, exist_ok=True)
CODEMODS_DIR.mkdir(parents=True, exist_ok=True)

# Download system prompt
try:
    from codegen.cli.api.endpoints import CODEGEN_SYSTEM_PROMPT_URL

    response = requests.get(CODEGEN_SYSTEM_PROMPT_URL)
    response.raise_for_status()
    SYSTEM_PROMPT_PATH.write_text(response.text)
except Exception as e:
    rich.print(f"[yellow]Warning: Could not download system prompt: {e}[/yellow]")

if not repo:
rich.print("No git repository found. Please run this command in a git repository.")
else:
@@ -152,6 +164,7 @@ def modify_gitignore(codegen_folder: Path):
"examples/",
"prompts/",
"jupyter/",
"codegen-system-prompt.txt", # Add system prompt to gitignore
"",
"# Python cache files",
"__pycache__/",