Utility scripts for converting and tokenizing environment-native trajectories.
These tools transform SWE-agent trajectory outputs into LLM-trainable formats compatible with frameworks like SLIME.
We recommend using uv to manage your Python environment:
# Create and activate a virtual environment
uv venv
source .venv/bin/activate
# Install the Hugging Face CLI as a tool
uv tool install huggingface_hub
# Install dependencies for the scripts
uv pip install transformers numpy pyarrow
Download the environment-native trajectories from Hugging Face:
⚠️ Storage Warning: env-native.jsonl is approximately 15.6 GB. Ensure you have sufficient disk space before downloading.
hf download GAIR/daVinci-Dev env-native.jsonl --repo-type dataset --local-dir .
The convert_trajectories.py script converts the SWE-agent native tool-call format to an XML function-calling format:
uv run convert_trajectories.py env-native.jsonl -o env-native-converted.jsonl
What it does:
- Converts JSON tool calls to XML format (e.g., <function=bash><parameter=command>ls</parameter></function>)
- Wraps reasoning content in <think>...</think> tags
- Replaces the system prompt with an XML-compatible version
- Converts tool response messages to user messages
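The per-message part of these steps can be sketched as follows. This is an illustrative approximation based only on the input/output example in this README, not the actual code of convert_trajectories.py. The function names `tool_call_to_xml` and `convert_message` are hypothetical, the JSON serialization of non-string arguments is an assumption, and the system prompt replacement is omitted because the replacement text is not shown here.

```python
import json


def tool_call_to_xml(tool_call: dict) -> str:
    """Render one native tool call as an XML-style function block."""
    fn = tool_call["function"]
    # In the native format, arguments arrive as a JSON-encoded string.
    args = json.loads(fn["arguments"])
    lines = [f"<function={fn['name']}>"]
    for key, value in args.items():
        # Assumption: non-string values are written as JSON text.
        text = value if isinstance(value, str) else json.dumps(value)
        lines.append(f"<parameter={key}>{text}</parameter>")
    lines.append("</function>")
    return "\n".join(lines)


def convert_message(msg: dict) -> dict:
    """Convert one native message to the XML-style message format."""
    role = msg["role"]
    if role == "tool":
        # Tool responses become plain user messages.
        return {"role": "user", "content": msg.get("content", "")}
    if role != "assistant":
        # System prompt replacement is not modeled in this sketch.
        return {"role": role, "content": msg.get("content", "")}

    parts = []
    if msg.get("reasoning_content"):
        parts.append(f"<think>{msg['reasoning_content']}</think>")
    if msg.get("content"):
        parts.append(msg["content"])
    for call in msg.get("tool_calls") or []:
        parts.append(tool_call_to_xml(call))
    # Segments are separated by blank lines, as in the output example.
    return {"role": "assistant", "content": "\n\n".join(parts)}
```

Folding reasoning, visible content, and tool calls into a single string keeps each assistant turn as one plain-text training target, which is what makes the converted data usable without a tool-call-aware chat template.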
Input/Output format example
Input format (native):
{
"messages": [
{"role": "system", "content": "..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "...", "reasoning_content": "...", "tool_calls": [{"function": {"name": "bash", "arguments": "{\"command\": \"ls\"}"}}]},
{"role": "tool", "content": "file1.py\nfile2.py"}
]
}
Output format (XML):
{
"messages": [
{"role": "system", "content": "...XML function definitions..."},
{"role": "user", "content": "..."},
{"role": "assistant", "content": "<think>...</think>\n\n...\n\n<function=bash>\n<parameter=command>ls</parameter>\n</function>"},
{"role": "user", "content": "file1.py\nfile2.py"}
]
}
The tokenize_trajectories.py script tokenizes trajectories, computes statistics, and filters by token length:
# Analyze token lengths
uv run tokenize_trajectories.py env-native-converted.jsonl -m Qwen/Qwen2.5-72B
# Filter to max 128K tokens and output filtered JSONL
uv run tokenize_trajectories.py env-native-converted.jsonl \
-m Qwen/Qwen2.5-72B \
--max-tokens 131072 \
--filtered-output env-native-128k.jsonl
# Also output tokenized parquet (for mid-training framework)
uv run tokenize_trajectories.py env-native-converted.jsonl \
-m Qwen/Qwen2.5-72B \
--max-tokens 131072 \
--filtered-output env-native-128k.jsonl \
--parquet-output env-native-128k.parquet \
--rows-per-file 10000
Output statistics example
==================================================
TOKEN LENGTH STATISTICS
==================================================
Count: 1,234
Mean: 12,345.67
Std: 5,678.90
Min: 1,234
Max: 65,432
--------------------------------------------------
PERCENTILES
--------------------------------------------------
5%: 3,456.00
25% (Q1): 7,890.00
50% (Median): 11,234.00
75% (Q3): 15,678.00
90%: 23,456.00
95%: 28,901.00
99%: 45,678.00
==================================================
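A report like the one above can be reproduced from a list of per-trajectory token counts. The sketch below is a standard-library approximation, not the script's own code: `percentile` and `summarize` are hypothetical names, the percentiles use linear interpolation, and the standard deviation is the population form. Which conventions tokenize_trajectories.py actually uses is not stated in this README.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) over pre-sorted data."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def summarize(lengths):
    """Summary statistics over a non-empty collection of token counts."""
    data = sorted(lengths)
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        # Assumption: population standard deviation.
        "std": statistics.pstdev(data),
        "min": data[0],
        "max": data[-1],
        "percentiles": {
            q: percentile(data, q) for q in (5, 25, 50, 75, 90, 95, 99)
        },
    }
```

The upper percentiles are the useful ones when choosing a --max-tokens value, since they show how many trajectories a given cutoff would discard.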
Command References
usage: convert_trajectories.py [-h] [-o OUTPUT] input
positional arguments:
input Input JSONL file with trajectories
optional arguments:
-o, --output OUTPUT Output JSONL file (default: input_converted.jsonl)
usage: tokenize_trajectories.py [-h] [-m MODEL] [-b BATCH_SIZE] [-o OUTPUT]
[--max-tokens MAX_TOKENS]
[--filtered-output FILTERED_OUTPUT]
[--parquet-output PARQUET_OUTPUT]
[--rows-per-file ROWS_PER_FILE]
input
positional arguments:
input Input JSONL file
optional arguments:
-m, --model MODEL Tokenizer model name (default: Qwen/Qwen2.5-72B)
-b, --batch-size Batch size for reading (default: 1024)
-o, --output OUTPUT Save statistics to JSON file
--max-tokens Filter max tokens
--filtered-output Output JSONL file for filtered trajectories
--parquet-output Output parquet file base name (e.g., data.parquet)
--rows-per-file Number of rows per Parquet file (default: 20000)
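The filtering that --max-tokens and --filtered-output describe can be sketched as a streaming pass over the JSONL input. This is a hedged illustration rather than the script's implementation: `filter_by_length` and `count_whitespace_tokens` are hypothetical names, the whitespace counter is only a stand-in for a real tokenizer, and keeping records whose count is less than or equal to the limit is an assumption.

```python
import json


def count_whitespace_tokens(messages):
    """Stand-in counter: whitespace-separated words across all messages."""
    return sum(len((m.get("content") or "").split()) for m in messages)


def filter_by_length(lines, max_tokens, count_tokens=count_whitespace_tokens):
    """Yield parsed JSONL records whose token count fits within max_tokens.

    `lines` is any iterable of JSONL strings, such as an open file handle,
    so large inputs are processed one record at a time.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: the limit is inclusive.
        if count_tokens(record["messages"]) <= max_tokens:
            yield record
```

A real run would pass a tokenizer-backed counter as `count_tokens`. Streaming matters here because the source file is many gigabytes and should not be loaded into memory at once.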
After converting and filtering, the output can be used with the SLIME framework for supervised fine-tuning. See the official documentation for details.
These scripts are released under the Apache-2.0 license as part of the daVinci-Dev project.