Add Turkish tool-calling dataset v1 with synthetic single-turn conversations generated using ToolsGen and Qwen models

atasoglu · atasoglu · commit d541dd2edf9a · 2025-11-16T20:57:40.000+03:00
diff --git a/examples/turkish_tool_calling_v1/README.md b/examples/turkish_tool_calling_v1/README.md
@@ -0,0 +1,111 @@
+# Turkish Tool Calling v1
+
+A synthetic Turkish tool-calling dataset generated using [ToolsGen](https://github.com/atasoglu/toolsgen) with Qwen models via OpenRouter.
+
+## Dataset Details
+
+- **Generated with**: ToolsGen
+- **Total Samples**: 1,000
+- **Language**: Turkish
+- **Format**: Single-turn conversations with tool calls
+
+### Models Used
+
+- **Problem Generator**: qwen/qwen3-235b-a22b-2507 (temp=1.0)
+- **Tool Caller**: qwen/qwen3-235b-a22b-2507 (temp=0.0)
+- **Judge**: qwen/qwen3-235b-a22b-2507 (temp=0.0)
+
+## Dataset Structure
+
+Each record contains:
+
+```json
+{
+  "id": "record_000000",
+  "language": "turkish",
+  "tools": [...],
+  "messages": [
+    {"role": "user", "content": "İstanbul'da hava durumu nasıl?"}
+  ],
+  "assistant_calls": [
+    {
+      "id": "call_...",
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "arguments": "{\"location\": \"Istanbul, Turkey\"}"
+      }
+    }
+  ],
+  "problem_metadata": {...},
+  "judge": {
+    "tool_relevance": 0.4,
+    "argument_quality": 0.38,
+    "clarity": 0.2,
+    "score": 0.98,
+    "verdict": "accept",
+    "rationale": "...",
+    "rubric_version": "0.1.0",
+    "model": "qwen/qwen3-235b-a22b-2507",
+    "temperature": 0.0
+  },
+  "quality_tags": [],
+  "tools_metadata": {"num_tools": 2}
+}
+```
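In the example record above, `function.arguments` is a JSON-encoded string rather than a nested object, so it has to be decoded before the values can be used. The following is a minimal sketch of that step. The `decode_calls` helper and the `call_0` ID are hypothetical (the README elides the real call ID), and the record is a trimmed copy of the example.

```python
import json

# Trimmed copy of the README's example record.
record = {
    "id": "record_000000",
    "messages": [{"role": "user", "content": "İstanbul'da hava durumu nasıl?"}],
    "assistant_calls": [
        {
            "id": "call_0",  # placeholder: the README elides the real ID
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"location\": \"Istanbul, Turkey\"}",
            },
        }
    ],
}


def decode_calls(record):
    """Return (function name, parsed arguments dict) pairs for one record."""
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in record["assistant_calls"]
    ]


calls = decode_calls(record)
print(calls)
```

If a record's arguments are malformed, `json.loads` raises `json.JSONDecodeError`, which may be worth catching when iterating over the full dataset.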
+
+## Generation Details
+
+### Configuration
+
+- **Strategy**: Random tool sampling
+- **Tools per sample**: 1-8 (k_min=1, k_max=8)
+- **Max attempts**: 1
+- **Train split**: 80%
+- **Seed**: Random (drawn from the range 1 to 10,000,000)
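One plausible reading of "random tool sampling" with `k_min=1` and `k_max=8` is sketched below: draw a count `k` uniformly from the inclusive range, then pick `k` distinct tools. This is an illustration of the configuration, not ToolsGen's actual code. The `sample_tools` function and the `tool_N` names are hypothetical.

```python
import random


def sample_tools(tools, k_min=1, k_max=8, rng=None):
    """Pick between k_min and k_max distinct tools uniformly at random."""
    rng = rng or random.Random()
    # Cap the upper bound so we never ask for more tools than exist.
    upper = min(k_max, len(tools))
    if upper < k_min:
        raise ValueError("not enough tools to satisfy k_min")
    k = rng.randint(k_min, upper)  # inclusive on both ends
    return rng.sample(tools, k)    # sampling without replacement


# Hypothetical tool names, for illustration only.
catalog = [f"tool_{i}" for i in range(20)]
subset = sample_tools(catalog, rng=random.Random(42))
print(len(subset), subset)
```

Passing a seeded `random.Random` makes a draw reproducible, which is the role the recorded seed would play in this reading.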
+
+### Quality Control
+
+All samples passed through an LLM-as-a-judge evaluation with a multi-dimensional rubric:
+
+- **Tool Relevance** (40%): Are the selected tools appropriate?
+- **Argument Quality** (38%): Are arguments valid and plausible?
+- **Clarity** (22%): Is the response complete and clear?
+
+Only samples with `score >= 0.7` and `verdict == "accept"` are included in the dataset.
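The inclusion rule can be re-checked on any record's `judge` block. The sketch below applies it to the judge values from the README's example record. It also checks whether the three sub-scores add up to the overall score, which holds for that example but is an assumption about how the score is computed in general. Both helper names are hypothetical.

```python
import math

SCORE_THRESHOLD = 0.7  # inclusion threshold stated in the README


def is_accepted(judge):
    """Apply the README's inclusion rule to one judge block."""
    return judge["score"] >= SCORE_THRESHOLD and judge["verdict"] == "accept"


def subscores_sum_to_score(judge, tol=1e-6):
    """Assumption: the overall score is the plain sum of the three sub-scores."""
    total = judge["tool_relevance"] + judge["argument_quality"] + judge["clarity"]
    return math.isclose(total, judge["score"], abs_tol=tol)


# Judge block copied from the README's example record (rationale omitted).
example_judge = {
    "tool_relevance": 0.4,
    "argument_quality": 0.38,
    "clarity": 0.2,
    "score": 0.98,
    "verdict": "accept",
}
print(is_accepted(example_judge), subscores_sum_to_score(example_judge))
```

The comparison uses `math.isclose` rather than `==` because sums of decimal fractions are not always exact in floating point.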
+
+## Usage
+
+```python
+from datasets import load_dataset
+
+dataset = load_dataset("atasoglu/turkish-tool-calling-v1")
+
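+# Inspect the first training record
+sample = dataset["train"][0]
+print(sample["messages"])         # the user turn
+print(sample["assistant_calls"])  # tool calls; each function's "arguments" is a JSON-encoded string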
+```
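Once loaded, a split can be iterated like a sequence of dicts, so simple statistics need no extra tooling. The sketch below counts how often each function is called. It assumes records on the Hub keep the nested `assistant_calls` layout shown in the README's example. The helper name and the toy records (including `convert_currency`) are hypothetical stand-ins for `dataset["train"]`.

```python
from collections import Counter


def count_called_functions(records):
    """Count how often each function name appears across assistant_calls."""
    counts = Counter()
    for record in records:
        for call in record["assistant_calls"]:
            counts[call["function"]["name"]] += 1
    return counts


# Tiny hand-written stand-in for dataset["train"]; names are hypothetical.
toy_records = [
    {"assistant_calls": [{"function": {"name": "get_weather", "arguments": "{}"}}]},
    {"assistant_calls": [
        {"function": {"name": "get_weather", "arguments": "{}"}},
        {"function": {"name": "convert_currency", "arguments": "{}"}},
    ]},
]
function_counts = count_called_functions(toy_records)
print(function_counts.most_common())
```

With the real data, `count_called_functions(dataset["train"])` should work the same way under the layout assumption above.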
+
+## Limitations
+
+- Single-turn conversations only
+- Turkish language only
+- Synthetic data generated by LLMs (may contain artifacts)
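+- Tool calls were never executed, so their arguments are not validated against real tools
+- Judge scores are model-based assessments, produced by the same model that generated the samples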
+
+## Citation
+
+```bibtex
+@software{toolsgen2025,
+  title = {ToolsGen: Synthetic Tool-Calling Dataset Generator},
+  author = {Ataşoğlu, Ahmet},
+  year = {2025},
+  url = {https://github.com/atasoglu/toolsgen}
+}
+```
+
+## License
+
+MIT License