Environment Trajectory Utils

Utility scripts for converting and tokenizing environment-native trajectories ($\mathcal{D}^{\text{env}}$) from daVinci-Dev.

These tools transform SWE-agent trajectory outputs into LLM-trainable formats compatible with frameworks like SLIME.

Quick Start

0. Setup Environment

We recommend using uv to manage your Python environment:

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install the Hugging Face CLI as a tool
uv tool install huggingface_hub

# Install dependencies for the scripts
uv pip install transformers numpy pyarrow

1. Download the Dataset

Download the environment-native trajectories from Hugging Face:

⚠️ Storage Warning: env-native.jsonl is approximately 15.6 GB. Ensure you have sufficient disk space before downloading.

hf download GAIR/daVinci-Dev env-native.jsonl --repo-type dataset --local-dir .

2. Convert Trajectories to XML Format

The convert_trajectories.py script converts SWE-agent's native tool-call format to an XML function-calling format:

uv run convert_trajectories.py env-native.jsonl -o env-native-converted.jsonl

What it does:

  • Converts JSON tool calls to XML format (e.g., <function=bash><parameter=command>ls</parameter></function>)
  • Wraps reasoning content in <think>...</think> tags
  • Replaces the system prompt with an XML-compatible version
  • Converts tool response messages to user messages

Input/Output format example

Input format (native):

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "reasoning_content": "...", "tool_calls": [{"function": {"name": "bash", "arguments": "{\"command\": \"ls\"}"}}]},
    {"role": "tool", "content": "file1.py\nfile2.py"}
  ]
}

Output format (XML):

{
  "messages": [
    {"role": "system", "content": "...XML function definitions..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "<think>...</think>\n\n...\n\n<function=bash>\n<parameter=command>ls</parameter>\n</function>"},
    {"role": "user", "content": "file1.py\nfile2.py"}
  ]
}
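The mapping between the two formats above can be sketched in a few lines of Python. This is a minimal illustration, not the code of convert_trajectories.py: the function name `convert_message` is hypothetical, the system prompt is passed through unchanged rather than replaced, and the layout (blank lines between the reasoning, the content, and each function block) is inferred only from the example above.

```python
import json


def convert_message(message: dict) -> dict:
    """Convert one native-format message to the XML-style format (sketch)."""
    role = message["role"]

    # Tool responses become plain user messages.
    if role == "tool":
        return {"role": "user", "content": message.get("content") or ""}

    # System and user messages are passed through unchanged in this sketch
    # (the real script also swaps in an XML-compatible system prompt).
    if role != "assistant":
        return {"role": role, "content": message.get("content") or ""}

    parts = []
    reasoning = message.get("reasoning_content")
    if reasoning:
        parts.append(f"<think>{reasoning}</think>")
    if message.get("content"):
        parts.append(message["content"])

    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the native format, arguments are a JSON-encoded string.
        arguments = json.loads(function["arguments"])
        lines = [f"<function={function['name']}>"]
        for key, value in arguments.items():
            # Assumption: non-string values are serialized back to JSON.
            text = value if isinstance(value, str) else json.dumps(value)
            lines.append(f"<parameter={key}>{text}</parameter>")
        lines.append("</function>")
        parts.append("\n".join(lines))

    return {"role": "assistant", "content": "\n\n".join(parts)}
```

Applied to the assistant message in the input example, this sketch produces the assistant content shown in the output example.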

3. Tokenize and Filter by Length

The tokenize_trajectories.py script tokenizes trajectories, computes statistics, and filters by token length:

# Analyze token lengths
uv run tokenize_trajectories.py env-native-converted.jsonl -m Qwen/Qwen2.5-72B

# Filter to max 128K tokens and output filtered JSONL
uv run tokenize_trajectories.py env-native-converted.jsonl \
    -m Qwen/Qwen2.5-72B \
    --max-tokens 131072 \
    --filtered-output env-native-128k.jsonl

# Also output tokenized parquet (for mid-training framework)
uv run tokenize_trajectories.py env-native-converted.jsonl \
    -m Qwen/Qwen2.5-72B \
    --max-tokens 131072 \
    --filtered-output env-native-128k.jsonl \
    --parquet-output env-native-128k.parquet \
    --rows-per-file 10000
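The length filter in the commands above can be pictured as follows. This is a rough sketch under stated assumptions, not the script's implementation: `filter_by_length` and `whitespace_token_count` are hypothetical names, the limit is assumed to be inclusive, and the stand-in counter splits on whitespace instead of calling a real tokenizer.

```python
import json
from typing import Callable, Iterable, Iterator


def filter_by_length(
    lines: Iterable[str],
    count_tokens: Callable[[list], int],
    max_tokens: int,
) -> Iterator[dict]:
    """Yield trajectories whose token count does not exceed max_tokens."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL input
        record = json.loads(line)
        # Assumption: the limit is inclusive (<=), which may differ from the script.
        if count_tokens(record["messages"]) <= max_tokens:
            yield record


def whitespace_token_count(messages: list) -> int:
    """Stand-in counter: whitespace-separated words, NOT a real tokenizer."""
    return sum(len((m.get("content") or "").split()) for m in messages)
```

In practice the counter would wrap the tokenizer selected with -m; any callable that maps a message list to an integer fits this interface.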

Output statistics example
==================================================
TOKEN LENGTH STATISTICS
==================================================
Count:                         1,234
Mean:                       12,345.67
Std:                         5,678.90
Min:                           1,234
Max:                          65,432
--------------------------------------------------
PERCENTILES
--------------------------------------------------
5%:                          3,456.00
25% (Q1):                    7,890.00
50% (Median):               11,234.00
75% (Q3):                   15,678.00
90%:                        23,456.00
95%:                        28,901.00
99%:                        45,678.00
==================================================
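For readers who want to reproduce a report like the one above from their own list of token lengths, here is a small standard-library sketch. It is not taken from tokenize_trajectories.py: the names `percentile` and `summarize` are hypothetical, percentiles use linear interpolation (the default method of NumPy's `np.percentile`), and the standard deviation is the population form; whether the script makes the same choices is an assumption.

```python
import statistics


def percentile(sorted_values: list, q: float) -> float:
    """Linear-interpolated percentile of an ascending-sorted list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no values")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(lengths: list) -> dict:
    """Compute the summary fields shown in the statistics report (sketch)."""
    values = sorted(lengths)
    stats = {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": values[0],
        "max": values[-1],
    }
    for q in (5, 25, 50, 75, 90, 95, 99):
        stats[f"p{q}"] = percentile(values, q)
    return stats
```

Calling `summarize` on the per-trajectory token counts returns a dictionary with the same fields as the report, which can then be formatted or saved as JSON.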

Command References

convert_trajectories.py

usage: convert_trajectories.py [-h] [-o OUTPUT] input

positional arguments:
  input                 Input JSONL file with trajectories

optional arguments:
  -o, --output OUTPUT   Output JSONL file (default: input_converted.jsonl)

tokenize_trajectories.py

usage: tokenize_trajectories.py [-h] [-m MODEL] [-b BATCH_SIZE] [-o OUTPUT]
                                [--max-tokens MAX_TOKENS]
                                [--filtered-output FILTERED_OUTPUT]
                                [--parquet-output PARQUET_OUTPUT]
                                [--rows-per-file ROWS_PER_FILE]
                                input

positional arguments:
  input                 Input JSONL file

optional arguments:
  -m, --model MODEL     Tokenizer model name (default: Qwen/Qwen2.5-72B)
  -b, --batch-size      Batch size for reading (default: 1024)
  -o, --output OUTPUT   Save statistics to JSON file
  --max-tokens          Filter max tokens
  --filtered-output     Output JSONL file for filtered trajectories
  --parquet-output      Output parquet file base name (e.g., data.parquet)
  --rows-per-file       Number of rows per Parquet file (default: 20000)
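To give a feel for how --rows-per-file interacts with the Parquet base name, the sketch below plans how rows might be split across files. It is purely illustrative: `plan_shards` is a hypothetical helper, and the `<stem>-00000<suffix>` naming scheme is an assumption, not the naming the script actually uses.

```python
from pathlib import Path


def plan_shards(num_rows: int, rows_per_file: int, base_name: str) -> list:
    """Return (file_name, start_row, end_row) tuples, with end_row exclusive."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    base = Path(base_name)
    shards = []
    for index, start in enumerate(range(0, num_rows, rows_per_file)):
        end = min(start + rows_per_file, num_rows)
        # Hypothetical naming scheme: <stem>-<5-digit index><suffix>
        name = f"{base.stem}-{index:05d}{base.suffix}"
        shards.append((name, start, end))
    return shards
```

With 25,000 rows and 10,000 rows per file, this plan yields three shards, the last one holding the remaining 5,000 rows.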

Training with SLIME

After converting and filtering, the output can be used with the SLIME framework for supervised fine-tuning. See the official SLIME documentation for details.

License

These scripts are released under the Apache-2.0 license as part of the daVinci-Dev project.