Environment Trajectory Utils

Utility scripts for converting and tokenizing environment-native trajectories ($\mathcal{D}^{\text{env}}$) from daVinci-Dev.

These tools transform SWE-agent trajectory outputs into LLM-trainable formats compatible with frameworks like SLIME.

Quick Start

0. Setup Environment

We recommend using uv to manage your Python environment:

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install the Hugging Face CLI as a tool
uv tool install huggingface_hub

# Install dependencies for the scripts
uv pip install transformers numpy pyarrow

1. Download the Dataset

Download the environment-native trajectories from Hugging Face:

⚠️ Storage Warning: env-native.jsonl is approximately 15.6 GB. Ensure you have sufficient disk space before downloading.

hf download GAIR/daVinci-Dev env-native.jsonl --repo-type dataset --local-dir .
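
Because the file is large, it can help to inspect a record or two before running the full pipeline. The following is a minimal sketch, not part of the provided scripts: `peek_jsonl` is a hypothetical helper that reads only the first few lines, assuming each line holds one JSON object with a `messages` list as shown in the format example in step 2.

```python
import json
from itertools import islice


def peek_jsonl(path, n=1):
    """Return the first n records of a JSONL file without reading the rest."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so a multi-GB file is never fully loaded.
        for line in islice(f, n):
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_roles(record):
    """List the message roles of one trajectory, in order."""
    return [message["role"] for message in record["messages"]]
```

For example, `summarize_roles(peek_jsonl("env-native.jsonl")[0])` lists the roles in the first trajectory.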

2. Convert Trajectories to XML Format

The convert_trajectories.py script converts SWE-agent's native tool-call format to an XML function-calling format:

uv run convert_trajectories.py env-native.jsonl -o env-native-converted.jsonl

What it does:

Converts JSON tool calls to XML format (e.g., <function=bash><parameter=command>ls</parameter></function>)
Wraps reasoning content in <think>...</think> tags
Replaces the system prompt with an XML-compatible version
Converts tool response messages to user messages

Input/Output format example

Input format (native):

{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "reasoning_content": "...", "tool_calls": [{"function": {"name": "bash", "arguments": "{\"command\": \"ls\"}"}}]},
    {"role": "tool", "content": "file1.py\nfile2.py"}
  ]
}

Output format (XML):

{
  "messages": [
    {"role": "system", "content": "...XML function definitions..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "<think>...</think>\n\n...\n\n<function=bash>\n<parameter=command>ls</parameter>\n</function>"},
    {"role": "user", "content": "file1.py\nfile2.py"}
  ]
}
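
The per-message transformation shown above can be sketched as follows. This is an illustrative approximation, not the actual code of convert_trajectories.py: the helper names `tool_call_to_xml` and `convert_message` are hypothetical, the handling of non-string argument values is an assumption, and the system-prompt replacement is left out because the replacement prompt is not shown here.

```python
import json


def tool_call_to_xml(tool_call):
    """Render one native tool call as an XML-style function call string."""
    fn = tool_call["function"]
    args = fn["arguments"]
    # In the native format, arguments arrive as a JSON-encoded string.
    if isinstance(args, str):
        args = json.loads(args)
    lines = [f"<function={fn['name']}>"]
    for key, value in args.items():
        # Assumption: non-string values are serialized back to JSON text.
        text = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"<parameter={key}>{text}</parameter>")
    lines.append("</function>")
    return "\n".join(lines)


def convert_message(message):
    """Convert one native message to the XML-style format (hypothetical helper)."""
    role = message["role"]
    if role == "tool":
        # Tool responses become user messages.
        return {"role": "user", "content": message.get("content", "")}
    if role != "assistant":
        # System prompt replacement is omitted in this sketch.
        return {"role": role, "content": message.get("content", "")}
    parts = []
    reasoning = message.get("reasoning_content")
    if reasoning:
        parts.append(f"<think>{reasoning}</think>")
    if message.get("content"):
        parts.append(message["content"])
    for call in message.get("tool_calls") or []:
        parts.append(tool_call_to_xml(call))
    return {"role": "assistant", "content": "\n\n".join(parts)}
```

Applied to the assistant and tool messages of the input example, this sketch reproduces the corresponding messages of the output example.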

3. Tokenize and Filter by Length

The tokenize_trajectories.py script tokenizes trajectories, computes statistics, and filters by token length:

# Analyze token lengths
uv run tokenize_trajectories.py env-native-converted.jsonl -m Qwen/Qwen2.5-72B

# Filter to max 128K tokens and output filtered JSONL
uv run tokenize_trajectories.py env-native-converted.jsonl \
    -m Qwen/Qwen2.5-72B \
    --max-tokens 131072 \
    --filtered-output env-native-128k.jsonl

# Also output tokenized parquet (for mid-training framework)
uv run tokenize_trajectories.py env-native-converted.jsonl \
    -m Qwen/Qwen2.5-72B \
    --max-tokens 131072 \
    --filtered-output env-native-128k.jsonl \
    --parquet-output env-native-128k.parquet \
    --rows-per-file 10000
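
The length filter itself can be sketched as below. This is a simplified illustration, not the script's implementation: it assumes a trajectory's length is the sum of token counts over its message contents and that trajectories with exactly `max_tokens` tokens are kept, neither of which is confirmed by this README. The token counter is passed in as a callable; `whitespace_count` is only a stand-in so the sketch runs without downloading a tokenizer.

```python
import json


def trajectory_length(record, count_tokens):
    """Total token count over all message contents in one trajectory."""
    return sum(count_tokens(m.get("content") or "") for m in record["messages"])


def filter_by_length(in_path, out_path, count_tokens, max_tokens):
    """Copy trajectories with at most max_tokens tokens; return (kept, total)."""
    kept = total = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:  # stream line by line; the input may be many GB
            if not line.strip():
                continue
            total += 1
            if trajectory_length(json.loads(line), count_tokens) <= max_tokens:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept, total


def whitespace_count(text):
    """Stand-in token counter for illustration only (not a real tokenizer)."""
    return len(text.split())
```

With a real tokenizer, the counter could be something like `lambda text: len(tokenizer.encode(text))`, where `tokenizer` comes from `transformers.AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B")`.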

Output statistics example

==================================================
TOKEN LENGTH STATISTICS
==================================================
Count:                         1,234
Mean:                       12,345.67
Std:                         5,678.90
Min:                           1,234
Max:                          65,432
--------------------------------------------------
PERCENTILES
--------------------------------------------------
5%:                          3,456.00
25% (Q1):                    7,890.00
50% (Median):               11,234.00
75% (Q3):                   15,678.00
90%:                        23,456.00
95%:                        28,901.00
99%:                        45,678.00
==================================================
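
For readers who want to reproduce a summary like this from their own list of token lengths, here is a minimal standard-library sketch. It assumes population standard deviation and linearly interpolated percentiles (the defaults of `numpy.std` and `numpy.percentile`); whether the script uses exactly these conventions is an assumption, and `length_stats` is a hypothetical name.

```python
import statistics


def percentile(sorted_values, q):
    """Percentile q (0-100) using linear interpolation between neighbors."""
    if not sorted_values:
        raise ValueError("no values")
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def length_stats(lengths):
    """Summary statistics for a list of per-trajectory token lengths."""
    values = sorted(lengths)
    stats = {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (ddof=0)
        "min": values[0],
        "max": values[-1],
    }
    for q in (5, 25, 50, 75, 90, 95, 99):
        stats[f"p{q}"] = percentile(values, q)
    return stats
```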

Command References

convert_trajectories.py

usage: convert_trajectories.py [-h] [-o OUTPUT] input

positional arguments:
  input                 Input JSONL file with trajectories

optional arguments:
  -o, --output OUTPUT   Output JSONL file (default: input_converted.jsonl)

tokenize_trajectories.py

usage: tokenize_trajectories.py [-h] [-m MODEL] [-b BATCH_SIZE] [-o OUTPUT]
                                [--max-tokens MAX_TOKENS]
                                [--filtered-output FILTERED_OUTPUT]
                                [--parquet-output PARQUET_OUTPUT]
                                [--rows-per-file ROWS_PER_FILE]
                                input

positional arguments:
  input                 Input JSONL file

optional arguments:
  -m, --model MODEL     Tokenizer model name (default: Qwen/Qwen2.5-72B)
  -b, --batch-size      Batch size for reading (default: 1024)
  -o, --output OUTPUT   Save statistics to JSON file
  --max-tokens          Filter max tokens
  --filtered-output     Output JSONL file for filtered trajectories
  --parquet-output      Output parquet file base name (e.g., data.parquet)
  --rows-per-file       Number of rows per Parquet file (default: 20000)
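
The --rows-per-file option implies that the Parquet output is split into several files. A rough sketch of such sharding is shown below. The shard naming scheme (`data-00000.parquet`, `data-00001.parquet`, ...) and the helper name `shard_paths_and_rows` are hypothetical; the actual file names produced by tokenize_trajectories.py may differ.

```python
from pathlib import Path


def shard_paths_and_rows(rows, base_name, rows_per_file):
    """Yield (path, chunk) pairs, each chunk holding at most rows_per_file rows."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    base = Path(base_name)
    chunk, index = [], 0
    for row in rows:
        chunk.append(row)
        if len(chunk) == rows_per_file:
            # Hypothetical naming: <stem>-<5-digit index><suffix>
            yield base.with_name(f"{base.stem}-{index:05d}{base.suffix}"), chunk
            chunk, index = [], index + 1
    if chunk:
        # Final, possibly smaller, shard.
        yield base.with_name(f"{base.stem}-{index:05d}{base.suffix}"), chunk
```

Each chunk could then be written with pyarrow, for example via `pyarrow.parquet.write_table(pyarrow.Table.from_pylist(chunk), path)` when the rows are dictionaries.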

Training with SLIME

After conversion and filtering, the output can be used with the SLIME framework for supervised fine-tuning. See the official SLIME documentation for details.

License

These scripts are released under the Apache-2.0 license as part of the daVinci-Dev project.
