
Commit 1f55a33

feat: Add Hugging Face Hub push functionality with CLI options and environment support
- Introduced command-line arguments for pushing datasets to Hugging Face Hub (`--push-to-hub`, `--repo-id`, `--hf-token`, `--private`)
- Updated generate command to optionally push generated datasets directly to HF Hub
- Added environment variable support for HF token via `.env` file using `python-dotenv`
- Created new `hf_hub.py` module for HF Hub upload implementation
- Enhanced README with example commands for generating and pushing datasets
- Updated `.env.example` to include optional `HF_TOKEN` for HF Hub authentication
1 parent 2889299 commit 1f55a33

9 files changed: +555 -3 lines changed

.env.example

Lines changed: 3 additions & 0 deletions
@@ -16,3 +16,6 @@ OPENAI_API_KEY=sk-your-api-key-here
 # vLLM / Local server
 # OPENAI_BASE_URL=http://localhost:8000/v1
 # OPENAI_API_KEY=dummy-key
+
+# Optional: Hugging Face token to push generated datasets
+# HF_TOKEN=your-huggingface-token-here

README.md

Lines changed: 57 additions & 3 deletions
@@ -76,16 +76,26 @@ toolsgen generate \
   --num 500 \
   --workers 6 \
   --worker-batch-size 4
+
+# Generate and push directly to Hugging Face Hub
+export HF_TOKEN="your-hf-token-here"
+toolsgen generate \
+  --tools tools.json \
+  --out output_dir \
+  --num 100 \
+  --push-to-hub \
+  --repo-id username/dataset-name
 ```
 
 ### Python API Usage
 
 ```python
-import os
 from pathlib import Path
+from dotenv import load_dotenv
+
 from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
 
-os.environ["OPENAI_API_KEY"] = "your-api-key-here"
+load_dotenv()  # Load from .env file
 
 # Configuration
 tools_path = Path("tools.json")
@@ -119,6 +129,50 @@ print(f"Generated {manifest['num_generated']}/{manifest['num_requested']} record
 print(f"Failed: {manifest['num_failed']} attempts")
 ```
 
+### Push to Hugging Face Hub
+
+```python
+from pathlib import Path
+from dotenv import load_dotenv
+
+from toolsgen import GenerationConfig, ModelConfig, generate_dataset, push_to_hub
+
+load_dotenv()  # Load from .env file
+
+tools_path = Path("tools.json")
+output_dir = Path("output")
+
+gen_config = GenerationConfig(
+    num_samples=100,
+    strategy="random",
+    seed=42,
+    train_split=0.9,
+)
+
+model_config = ModelConfig(
+    model="gpt-4o-mini",
+    temperature=0.7,
+)
+
+# Generate dataset
+manifest = generate_dataset(
+    output_dir=output_dir,
+    gen_config=gen_config,
+    model_config=model_config,
+    tools_path=tools_path,
+)
+
+# Push to Hub
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="username/dataset-name",
+    private=False,
+)
+
+print(f"Generated: {manifest['num_generated']} records")
+print(f"Repository: {hub_info['repo_url']}")
+```
+
 See `examples/` directory for complete working examples.
 
 **Note**: The examples in `examples/` use `python-dotenv` for convenience (load API keys from `.env` file). Install it with `pip install python-dotenv` if you want to use this approach.
@@ -228,7 +282,7 @@ For detailed information about the system architecture, pipeline, and core compo
 - [ ] Custom prompt template system
 - [x] Parallel generation with multiprocessing
 - [ ] Additional sampling strategies (coverage-based, difficulty-based)
-- [ ] Integration with Hugging Face Hub for direct dataset uploads
+- [x] Integration with Hugging Face Hub for direct dataset uploads
 - [ ] Support for more LLM providers (Anthropic, Cohere, etc.)
 - [ ] Web UI for dataset inspection and curation
 - [ ] Advanced filtering and deduplication

examples/hf_hub_upload/README.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+# Hugging Face Hub Upload Example
+
+This example demonstrates how to generate a dataset and push it directly to Hugging Face Hub.
+
+## Prerequisites
+
+1. OpenAI API key
+2. Hugging Face account and token with write access
+
+## Setup
+
+```bash
+# Install dependencies
+pip install toolsgen huggingface_hub python-dotenv
+
+# Create .env file from example
+cp .env.example .env
+
+# Edit .env and add your API keys
+# OPENAI_API_KEY=your-openai-api-key
+# HF_TOKEN=your-huggingface-token
+```
+
+## Usage
+
+### Python API
+
+```bash
+python example.py
+```
+
+Make sure to update the `repo_id` in the script to your own repository name.
+
+### CLI
+
+```bash
+toolsgen generate \
+  --tools ../basic/tools.json \
+  --out output \
+  --num 50 \
+  --push-to-hub \
+  --repo-id your-username/your-dataset-name
+```
+
+## What Gets Uploaded
+
+The following files are automatically uploaded to your HF Hub repository:
+
+- `train.jsonl` - Training dataset
+- `val.jsonl` - Validation dataset (if train_split < 1.0)
+- `manifest.json` - Generation metadata
+- `README.md` - Auto-generated dataset card
+
+## Repository Visibility
+
+By default, repositories are public. To create a private repository:
+
+**Python API:**
+```python
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="username/dataset-name",
+    private=True,
+)
+```
+
+**CLI:**
+```bash
+toolsgen generate ... --push-to-hub --private
+```
+
+## Notes
+
+- The HF token can be provided via `--hf-token` flag or `HF_TOKEN` environment variable
+- If a repository already exists, it will be updated with new files
+- A dataset card (README.md) is automatically generated if not present
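
The first note above implies a precedence between the `--hf-token` flag and the `HF_TOKEN` environment variable. A minimal sketch of that lookup follows, assuming the explicit flag wins and the environment is the fallback, as the CLI help text ("defaults to HF_TOKEN env var") suggests. The helper name `resolve_hf_token` is hypothetical and does not appear in this commit.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    cli_token: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Hypothetical helper: prefer the explicit token, else fall back to HF_TOKEN."""
    # Allow a custom mapping so the lookup can be exercised without touching os.environ.
    env = os.environ if environ is None else environ
    if cli_token:
        return cli_token
    # Treat an empty HF_TOKEN the same as a missing one.
    return env.get("HF_TOKEN") or None
```

Returning `None` when neither source is set leaves the decision to the caller, for example letting `huggingface_hub` fall back to a locally cached login.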

examples/hf_hub_upload/example.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+"""Example: Generate dataset and push to Hugging Face Hub."""
+
+from pathlib import Path
+
+from dotenv import load_dotenv
+
+from toolsgen import GenerationConfig, ModelConfig, generate_dataset, push_to_hub
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configuration
+tools_path = Path(__file__).parent.parent / "basic" / "tools.json"
+output_dir = Path(__file__).parent / "output"
+
+gen_config = GenerationConfig(
+    num_samples=50,
+    strategy="random",
+    seed=42,
+    train_split=0.9,
+)
+
+model_config = ModelConfig(
+    model="gpt-4o-mini",
+    temperature=0.7,
+)
+
+# Generate dataset
+manifest = generate_dataset(
+    output_dir=output_dir,
+    gen_config=gen_config,
+    model_config=model_config,
+    tools_path=tools_path,
+)
+
+# Push to Hub
+hub_info = push_to_hub(
+    output_dir=output_dir,
+    repo_id="your-username/your-dataset-name",  # Change this!
+    private=False,
+)
+
+print("\n✓ Dataset generated and uploaded!")
+print(f" Generated: {manifest['num_generated']} records")
+print(f" Repository: {hub_info['repo_url']}")
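
The `hf_hub.py` module that provides `push_to_hub` is not shown in this view, so the following is only a rough sketch of what such a function might do, not the commit's implementation. It assumes the four file names listed in the example README, uses `HfApi.create_repo` and `HfApi.upload_file` from `huggingface_hub`, and returns the `repo_url` and `files_uploaded` keys that the example and the CLI read. The name `push_to_hub_sketch`, the injectable `api` parameter, and the choice of `ValueError` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional

# File names taken from the "What Gets Uploaded" list in the example README.
CANDIDATE_FILES = ("train.jsonl", "val.jsonl", "manifest.json", "README.md")


def push_to_hub_sketch(
    output_dir: Path,
    repo_id: str,
    token: Optional[str] = None,
    private: bool = False,
    api: Any = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for toolsgen's push_to_hub, not the real module."""
    output_dir = Path(output_dir)
    # Only upload the candidate files that actually exist (val.jsonl is optional).
    files: List[Path] = [
        output_dir / name for name in CANDIDATE_FILES if (output_dir / name).is_file()
    ]
    if not files:
        raise ValueError(f"No dataset files found in {output_dir}")

    if api is None:
        # Imported lazily so the rest of the package works without huggingface_hub.
        from huggingface_hub import HfApi

        api = HfApi(token=token)

    # exist_ok=True lets repeated pushes update an existing repository.
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=private, exist_ok=True)
    for path in files:
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
            repo_type="dataset",
        )

    return {
        "repo_url": f"https://huggingface.co/datasets/{repo_id}",
        "files_uploaded": [path.name for path in files],
    }
```

Passing a fake `api` object makes the upload path testable offline; with `api=None` the sketch talks to the real Hub and needs a valid token.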

requirements-dev.txt

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ pytest-cov>=5.0.0
 pre-commit>=3.7.0
 ruff>=0.6.0
 python-dotenv>=1.0.0
+huggingface_hub>=0.20.0

src/toolsgen/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,7 @@
     load_tool_specs,
     write_dataset_jsonl,
 )
+from .hf_hub import push_to_hub
 from .judge import JudgeResponse, judge_tool_calls
 from .problem_generator import generate_problem
 from .tool_caller import generate_tool_calls
@@ -55,6 +56,8 @@
     "generate_dataset",
     "load_tool_specs",
     "write_dataset_jsonl",
+    # HF Hub
+    "push_to_hub",
     # Judge
     "JudgeResponse",
     "judge_tool_calls",

src/toolsgen/cli.py

Lines changed: 42 additions & 0 deletions
@@ -168,6 +168,28 @@ def create_parser() -> argparse.ArgumentParser:
         help="Temperature for judging (defaults to --temperature)",
     )
 
+    # Hugging Face Hub options
+    gen_parser.add_argument(
+        "--push-to-hub",
+        action="store_true",
+        help="Push dataset to Hugging Face Hub after generation",
+    )
+    gen_parser.add_argument(
+        "--repo-id",
+        default=None,
+        help="HF Hub repository ID (e.g., 'username/dataset-name')",
+    )
+    gen_parser.add_argument(
+        "--hf-token",
+        default=None,
+        help="HF API token (defaults to HF_TOKEN env var)",
+    )
+    gen_parser.add_argument(
+        "--private",
+        action="store_true",
+        help="Create private repository on HF Hub",
+    )
+
     return parser
 
 
@@ -275,9 +297,15 @@ def cmd_generate(args: argparse.Namespace) -> None:
         max_tokens=args.max_tokens,
     )
 
+    # Validate HF Hub options
+    if args.push_to_hub and not args.repo_id:
+        print("Error: --repo-id is required when using --push-to-hub", file=sys.stderr)
+        sys.exit(1)
+
     # Generate dataset
     try:
         print(f"Generating {args.num} samples using {args.model}...")
+
         manifest = generate_dataset(
             args.out, gen_config, model_config, tools_path=args.tools
         )
@@ -297,6 +325,20 @@ def cmd_generate(args: argparse.Namespace) -> None:
 
         print(f" - Manifest: {args.out / 'manifest.json'}")
 
+        if args.push_to_hub:
+            from .hf_hub import push_to_hub
+
+            print("\nPushing to Hugging Face Hub...")
+            hub_info = push_to_hub(
+                output_dir=args.out,
+                repo_id=args.repo_id,
+                token=args.hf_token,
+                private=args.private,
+            )
+            print("✓ Pushed to Hugging Face Hub")
+            print(f" - Repository: {hub_info['repo_url']}")
+            print(f" - Files uploaded: {', '.join(hub_info['files_uploaded'])}")
+
     except ValueError as e:
         print(f"Error: {e}", file=sys.stderr)
         sys.exit(1)
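
The commit checks the `--push-to-hub` / `--repo-id` dependency inside `cmd_generate` with a manual `print` and `sys.exit(1)`. As a point of comparison, a minimal standalone sketch of the same check using `argparse`'s own `parser.error` is shown below; that call prints the usage line and exits with status 2 rather than 1. The names `build_parser` and `parse_args` and the `toolsgen-sketch` program name are hypothetical, and only the four Hub flags from this diff are reproduced.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser carrying only the four Hub flags from the diff."""
    parser = argparse.ArgumentParser(prog="toolsgen-sketch")
    parser.add_argument("--push-to-hub", action="store_true")
    parser.add_argument("--repo-id", default=None)
    parser.add_argument("--hf-token", default=None)
    parser.add_argument("--private", action="store_true")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Same rule as the commit: pushing requires a target repository.
    if args.push_to_hub and not args.repo_id:
        parser.error("--repo-id is required when using --push-to-hub")
    return args
```

Either approach fails before any generation work starts, which matters here because generation calls a paid API.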
