
Commit 3d861cb

Merge pull request #6 from atasoglu/develop
v0.5.0
2 parents d3adb98 + 87fb7ed commit 3d861cb

11 files changed: +574 additions, −4 deletions

.env.example

Lines changed: 3 additions & 0 deletions
@@ -16,3 +16,6 @@ OPENAI_API_KEY=sk-your-api-key-here
 # vLLM / Local server
 # OPENAI_BASE_URL=http://localhost:8000/v1
 # OPENAI_API_KEY=dummy-key
+
+# Optional: Hugging Face token to push generated datasets
+# HF_TOKEN=your-huggingface-token-here

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -8,6 +8,19 @@ The format is based on Keep a Changelog, and this project adheres to Semantic Ve
 
 Nothing yet.
 
+## [0.5.0] - 2025-01-11
+### Added
+- Hugging Face Hub integration for direct dataset uploads
+- `push_to_hub()` function in new `hf_hub` module to upload datasets to HF Hub
+- Uploads JSONL files (train.jsonl, val.jsonl), manifest.json, and auto-generated README.md
+- CLI flags: `--push-to-hub`, `--repo-id`, `--hf-token`, `--private`
+- Support for both public and private repositories
+- Auto-generated dataset cards with dataset statistics, model info, usage examples, and citation
+- Optional dependency: `huggingface_hub>=0.20.0` (install with `pip install toolsgen[hf]`)
+- Example in `examples/hf_hub_upload/` with dotenv configuration
+- Test suite for HF Hub functionality in `tests/test_hf_hub.py`
+- `push_to_hub` exported from main `toolsgen` package for easier imports
+
 ## [0.4.0] - 2025-01-10
 ### Added
 - Quality tagging system for generated records

README.md

Lines changed: 57 additions & 3 deletions
@@ -76,16 +76,26 @@ toolsgen generate \
 --num 500 \
 --workers 6 \
 --worker-batch-size 4
+
+# Generate and push directly to Hugging Face Hub
+export HF_TOKEN="your-hf-token-here"
+toolsgen generate \
+--tools tools.json \
+--out output_dir \
+--num 100 \
+--push-to-hub \
+--repo-id username/dataset-name
 ```
 
 ### Python API Usage
 
 ```python
-import os
 from pathlib import Path
+from dotenv import load_dotenv
+
 from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
 
-os.environ["OPENAI_API_KEY"] = "your-api-key-here"
+load_dotenv()  # Load from .env file
 
 # Configuration
 tools_path = Path("tools.json")
@@ -119,6 +129,50 @@ print(f"Generated {manifest['num_generated']}/{manifest['num_requested']} record
 print(f"Failed: {manifest['num_failed']} attempts")
 ```
 
+### Push to Hugging Face Hub
+
+```python
+from pathlib import Path
+from dotenv import load_dotenv
+
+from toolsgen import GenerationConfig, ModelConfig, generate_dataset, push_to_hub
+
+load_dotenv()  # Load from .env file
+
+tools_path = Path("tools.json")
+output_dir = Path("output")
+
+gen_config = GenerationConfig(
+    num_samples=100,
+    strategy="random",
+    seed=42,
+    train_split=0.9,
+)
+
+model_config = ModelConfig(
+    model="gpt-4o-mini",
+    temperature=0.7,
+)
+
+# Generate dataset
+manifest = generate_dataset(
+    output_dir=output_dir,
+    gen_config=gen_config,
+    model_config=model_config,
+    tools_path=tools_path,
+)
+
+# Push to Hub
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="username/dataset-name",
+    private=False,
+)
+
+print(f"Generated: {manifest['num_generated']} records")
+print(f"Repository: {hub_info['repo_url']}")
+```
+
 See `examples/` directory for complete working examples.
 
 **Note**: The examples in `examples/` use `python-dotenv` for convenience (load API keys from `.env` file). Install it with `pip install python-dotenv` if you want to use this approach.
@@ -228,7 +282,7 @@ For detailed information about the system architecture, pipeline, and core compo
 - [ ] Custom prompt template system
 - [x] Parallel generation with multiprocessing
 - [ ] Additional sampling strategies (coverage-based, difficulty-based)
-- [ ] Integration with Hugging Face Hub for direct dataset uploads
+- [x] Integration with Hugging Face Hub for direct dataset uploads
 - [ ] Support for more LLM providers (Anthropic, Cohere, etc.)
 - [ ] Web UI for dataset inspection and curation
 - [ ] Advanced filtering and deduplication

examples/hf_hub_upload/README.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+# Hugging Face Hub Upload Example
+
+This example demonstrates how to generate a dataset and push it directly to Hugging Face Hub.
+
+## Prerequisites
+
+1. OpenAI API key
+2. Hugging Face account and token with write access
+
+## Setup
+
+```bash
+# Install dependencies
+pip install toolsgen huggingface_hub python-dotenv
+
+# Create .env file from example
+cp .env.example .env
+
+# Edit .env and add your API keys
+# OPENAI_API_KEY=your-openai-api-key
+# HF_TOKEN=your-huggingface-token
+```
+
+## Usage
+
+### Python API
+
+```bash
+python example.py
+```
+
+Make sure to update the `repo_id` in the script to your own repository name.
+
+### CLI
+
+```bash
+toolsgen generate \
+--tools ../basic/tools.json \
+--out output \
+--num 50 \
+--push-to-hub \
+--repo-id your-username/your-dataset-name
+```
+
+## What Gets Uploaded
+
+The following files are automatically uploaded to your HF Hub repository:
+
+- `train.jsonl` - Training dataset
+- `val.jsonl` - Validation dataset (if train_split < 1.0)
+- `manifest.json` - Generation metadata
+- `README.md` - Auto-generated dataset card
+
+## Repository Visibility
+
+By default, repositories are public. To create a private repository:
+
+**Python API:**
+```python
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="username/dataset-name",
+    private=True,
+)
+```
+
+**CLI:**
+```bash
+toolsgen generate ... --push-to-hub --private
+```
+
+## Notes
+
+- The HF token can be provided via `--hf-token` flag or `HF_TOKEN` environment variable
+- If a repository already exists, it will be updated with new files
+- A dataset card (README.md) is automatically generated if not present
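Once uploaded, the JSONL splits listed under "What Gets Uploaded" can be loaded back with the `datasets` library. A minimal sketch, assuming that file layout; the repo id is a placeholder and `datasets` must be installed separately:

```python
from datasets import load_dataset

# Load the uploaded splits directly from the Hub (placeholder repo id).
ds = load_dataset(
    "your-username/your-dataset-name",
    data_files={"train": "train.jsonl", "validation": "val.jsonl"},
)
print(ds)
print(ds["train"][0])
```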

examples/hf_hub_upload/example.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+"""Example: Generate dataset and push to Hugging Face Hub."""
+
+from pathlib import Path
+
+from dotenv import load_dotenv
+
+from toolsgen import GenerationConfig, ModelConfig, generate_dataset, push_to_hub
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configuration
+tools_path = Path(__file__).parent.parent / "basic" / "tools.json"
+output_dir = Path(__file__).parent / "output"
+
+gen_config = GenerationConfig(
+    num_samples=50,
+    strategy="random",
+    seed=42,
+    train_split=0.9,
+)
+
+model_config = ModelConfig(
+    model="gpt-4o-mini",
+    temperature=0.7,
+)
+
+# Generate dataset
+manifest = generate_dataset(
+    output_dir=output_dir,
+    gen_config=gen_config,
+    model_config=model_config,
+    tools_path=tools_path,
+)
+
+# Push to Hub
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="your-username/your-dataset-name",  # Change this!
+    private=False,
+)
+
+print("\n✓ Dataset generated and uploaded!")
+print(f"  Generated: {manifest['num_generated']} records")
+print(f"  Repository: {hub_info['repo_url']}")

pyproject.toml

Lines changed: 6 additions & 1 deletion
@@ -13,7 +13,7 @@ toolsgen = ["prompts/*.txt"]
 
 [project]
 name = "toolsgen"
-version = "0.4.0"
+version = "0.5.0"
 description = "Generate tool-calling datasets from OpenAI-compatible tool specs"
 readme = "README.md"
 requires-python = ">=3.9"
@@ -40,6 +40,11 @@ dependencies = [
     "tqdm>=4.66.0",
 ]
 
+[project.optional-dependencies]
+hf = [
+    "huggingface_hub>=0.20.0",
+]
+
 [project.urls]
 Homepage = "https://github.com/atasoglu/toolsgen"
 Repository = "https://github.com/atasoglu/toolsgen"

requirements-dev.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ pytest-cov>=5.0.0
 pre-commit>=3.7.0
 ruff>=0.6.0
 python-dotenv>=1.0.0
+huggingface_hub>=0.20.0

src/toolsgen/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,7 @@
     load_tool_specs,
     write_dataset_jsonl,
 )
+from .hf_hub import push_to_hub
 from .judge import JudgeResponse, judge_tool_calls
 from .problem_generator import generate_problem
 from .tool_caller import generate_tool_calls
@@ -55,6 +56,8 @@
     "generate_dataset",
     "load_tool_specs",
     "write_dataset_jsonl",
+    # HF Hub
+    "push_to_hub",
     # Judge
     "JudgeResponse",
     "judge_tool_calls",

src/toolsgen/cli.py

Lines changed: 42 additions & 0 deletions
@@ -168,6 +168,28 @@ def create_parser() -> argparse.ArgumentParser:
         help="Temperature for judging (defaults to --temperature)",
     )
 
+    # Hugging Face Hub options
+    gen_parser.add_argument(
+        "--push-to-hub",
+        action="store_true",
+        help="Push dataset to Hugging Face Hub after generation",
+    )
+    gen_parser.add_argument(
+        "--repo-id",
+        default=None,
+        help="HF Hub repository ID (e.g., 'username/dataset-name')",
+    )
+    gen_parser.add_argument(
+        "--hf-token",
+        default=None,
+        help="HF API token (defaults to HF_TOKEN env var)",
+    )
+    gen_parser.add_argument(
+        "--private",
+        action="store_true",
+        help="Create private repository on HF Hub",
+    )
+
     return parser
 
 
@@ -275,9 +297,15 @@ def cmd_generate(args: argparse.Namespace) -> None:
         max_tokens=args.max_tokens,
     )
 
+    # Validate HF Hub options
+    if args.push_to_hub and not args.repo_id:
+        print("Error: --repo-id is required when using --push-to-hub", file=sys.stderr)
+        sys.exit(1)
+
     # Generate dataset
     try:
         print(f"Generating {args.num} samples using {args.model}...")
+
         manifest = generate_dataset(
             args.out, gen_config, model_config, tools_path=args.tools
         )
@@ -297,6 +325,20 @@ def cmd_generate(args: argparse.Namespace) -> None:
 
         print(f"  - Manifest: {args.out / 'manifest.json'}")
 
+        if args.push_to_hub:
+            from .hf_hub import push_to_hub
+
+            print("\nPushing to Hugging Face Hub...")
+            hub_info = push_to_hub(
+                output_dir=args.out,
+                repo_id=args.repo_id,
+                token=args.hf_token,
+                private=args.private,
+            )
+            print("✓ Pushed to Hugging Face Hub")
+            print(f"  - Repository: {hub_info['repo_url']}")
+            print(f"  - Files uploaded: {', '.join(hub_info['files_uploaded'])}")
+
     except ValueError as e:
         print(f"Error: {e}", file=sys.stderr)
         sys.exit(1)
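The new `src/toolsgen/hf_hub.py` module itself is not shown in this commit view. Below is a minimal sketch of what `push_to_hub` might look like, inferred only from the call site in `cli.py` (parameters `output_dir`, `repo_id`, `token`, `private`; returned keys `repo_url` and `files_uploaded`) and the file list in the changelog; the dataset-card generation described there is omitted, and the committed implementation may differ.

```python
# Hypothetical sketch; not the committed src/toolsgen/hf_hub.py.
import os
from pathlib import Path
from typing import Optional

from huggingface_hub import HfApi


def push_to_hub(
    output_dir: Path,
    repo_id: str,
    token: Optional[str] = None,
    private: bool = False,
) -> dict:
    """Upload generated dataset files from output_dir to a HF dataset repo."""
    token = token or os.environ.get("HF_TOKEN")
    api = HfApi(token=token)

    # Create the dataset repo if needed; exist_ok makes re-pushes update in place.
    api.create_repo(repo_id, repo_type="dataset", private=private, exist_ok=True)

    files_uploaded = []
    for name in ("train.jsonl", "val.jsonl", "manifest.json", "README.md"):
        path = Path(output_dir) / name
        if path.exists():
            api.upload_file(
                path_or_fileobj=str(path),
                path_in_repo=name,
                repo_id=repo_id,
                repo_type="dataset",
            )
            files_uploaded.append(name)

    return {
        "repo_url": f"https://huggingface.co/datasets/{repo_id}",
        "files_uploaded": files_uploaded,
    }
```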
