feat: add modelLocalPaths support and Qwen model #337

Merged
Ravenyjh merged 433 commits into main from feature/qwen-support
Feb 10, 2026

Conversation

@zeyuyuyu
Collaborator

Summary

This PR adds support for using pre-downloaded models directly from the provider's local filesystem, eliminating the need to download large models from 0G Storage for every fine-tuning task.

Features

  • Add modelLocalPaths configuration option to map model hashes to local file paths
  • Support local model paths for any model (including predefined models)
  • Create symlink to local model instead of downloading from 0G Storage
  • Add Qwen2.5-0.5B-Instruct model support

Benefits

  • Significantly faster task setup for large models (e.g., Qwen2.5-32B)
  • Reduced bandwidth usage
  • More reliable in TEE environments with network restrictions

Test Plan

  • Tested on Phala Cloud CVM with H200 GPU
  • Verified localPath feature loads model from /dstack/persistent/models/
  • Confirmed full fine-tuning flow: Setup → Train → Deliver
  • Documentation added: docs/localPath-feature-test.md

Configuration Example

service:
  modelLocalPaths:
    "0xb4f76a886b8655c92bb021922d60b5e4d9271a5c9da98b6cb10937a06c2c75a7": "/dstack/persistent/models/Qwen2.5-0.5B-Instruct"

Ravenyjh and others added 30 commits February 25, 2025 16:01
- Since the value is a comma-separated hex string rather than a pure hex string
Co-authored-by: Peter Zhang <peter@0g.ai>
- Modify Dockerfile for serving
  - Add config file for CocktailSGD
  - See the CocktailSGD repository for more details: https://github.com/0glabs/CocktailSGD

Co-authored-by: Sidonie <sidonie@0g.ai>
- If an image with the same 'repo:tag' exists, skip pulling
* Fix(ft): Update dockerfile for cocktail

- Update requirements.txt
- Update env for cocktail
- Use the same Dockerfile; install dependencies into the existing conda environment

* Fix(ft): Update dockerfile and hash for cocktail

---------

Co-authored-by: Sidonie <sidonie@0g.ai>
Feat(fine-tuning): Add retry for submitting transaction
zeyu and others added 25 commits February 10, 2026 19:53
…ling

- Restructure prepareModel with clear fallback chain:
  1. Local path (symlink) - fastest, no download
  2. HuggingFace download - to task directory
  3. 0G Storage download - last resort
- HuggingFace fallback now triggers independently if configured
- Better error logging showing all failed sources
- Simplify prepareData with cleaner fallback logic
- Consistent error messages for debugging
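The fallback chain above can be sketched as an ordered list of sources tried in turn, with every failure recorded so the final error names all of them. This is a hedged Python sketch of the pattern, not the PR's Go code; the source names are illustrative.

```python
def prepare_model(sources):
    """Try each (name, fetch) source in order and return the first
    success; if all fail, raise with every failure listed, mirroring
    the chain: local path -> HuggingFace -> 0G Storage."""
    errors = []
    for name, fetch in sources:
        try:
            return fetch()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all model sources failed: " + "; ".join(errors))
```

Collecting the per-source errors is what makes "better error logging showing all failed sources" possible: the operator sees every attempt, not just the last one.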
- Add skipStorageUpload config option in Service struct
- When enabled, finalizer skips encryption and 0G Storage upload
- Task still marked as finished, LoRA available via TEE API
- Useful for testing or when 0G Storage is unavailable
- Document broker configuration options
- Step-by-step testing instructions
- Model/dataset fallback chain explanation
- Troubleshooting guide
- LoRA download from TEE instructions
…oRA delivery

Major improvements:
- Add dataset upload API (POST /v1/user/:address/dataset)
- Encrypt LoRA locally even when skipStorageUpload=true
- Return encrypted LoRA via TEE download endpoint
- Support persistent storage via dataDir config
- Add file retention and cleanup logic
- Update documentation to English

Changes:
- finalizer.go: Always encrypt LoRA, save to {output}_encrypted.data
- ctrl/task.go: Add SaveDataset(), GetLoRAModel() returns encrypted file
- handler/task.go: Add UploadDataset endpoint
- setup.go: Support user-uploaded datasets (3-tier fallback)
- settlement.go: Clean up .data encrypted files
- config.go: Add dataDir and fileRetentionHours fields
- task_log.go: Use configurable data directory
- .gitignore: Add binary files and environment-specific configs
- TESTING.md: Complete English documentation with encryption flow

Dataset fallback: config paths → uploaded datasets → 0G Storage
Model fallback: local paths → HuggingFace → 0G Storage

This enables users to upload datasets, train models, and download
encrypted LoRA adapters with proper key exchange via smart contracts.
Changes:
- ctrl/task.go: Add convertJSONLToHF() to convert uploaded JSONL datasets
  to HuggingFace DatasetDict format required by token counter
- setup.go: Prefer HF format (_hf suffix) over raw JSONL when loading
  user-uploaded datasets

Dataset loading priority for user uploads:
1. {dataDir}/datasets/{user}/{hash}_hf (HuggingFace format)
2. {dataDir}/datasets/{user}/{hash} (raw JSONL fallback)

Note: The conversion runs via Docker container (qwen-lora:v3) which
requires the broker to have Docker socket access.
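The core of such a conversion is reshaping row-oriented JSONL records into the column-oriented layout that `datasets.Dataset.from_dict()` accepts. A minimal sketch, with a hypothetical helper name (the PR's actual conversion runs inside the container):

```python
import json

def jsonl_to_columns(jsonl_text: str) -> dict:
    """Convert JSONL records (one JSON object per line) into the
    column-oriented dict that datasets.Dataset.from_dict() expects:
    one list per field, aligned by row."""
    records = [json.loads(line)
               for line in jsonl_text.splitlines() if line.strip()]
    if not records:
        return {}
    keys = records[0].keys()
    # Missing fields default to "" so every column has equal length.
    return {k: [r.get(k, "") for r in records] for k in keys}
```

The resulting dict can then be wrapped in a `DatasetDict` (e.g. under a `"train"` split) and saved to the `_hf` directory that the loader prefers.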
Comprehensive guide covering:
- Complete test flow from dataset upload to LoRA download
- Signature generation for task creation
- Task progress monitoring
- Contract interaction (acknowledgeDeliverable)
- Troubleshooting common issues
- Model hash reference table
1. Remove binary files from git (broker-*)
   - ~325 MB of binaries were bloating git history
   - Added broker-* pattern to .gitignore

2. Remove environment-specific config files
   - config.yaml, config-local-model.yaml, config-model-local-paths.yaml
   - Updated .gitignore to properly exclude config.yaml pattern
   - Only config-example.yaml and config.example.yaml are tracked

3. Add authentication to LoRA download endpoint
   - Requires signature of keccak256(taskID) signed by user
   - Prevents unauthorized access to encrypted LoRA files
   - Added VerifyDownloadSignature function in ctrl

4. Add nil check in GetLoRAModel
   - Defense-in-depth for task ID parameter
- Try direct Python execution first before falling back to Docker
- Add support for both instruction/input/output and messages (chat) formats
- Better error handling for dataset conversion
- Add CLI-based testing commands (0g-compute-cli)
- Document skipStorageUpload configuration for direct TEE download
- Add Docker volume mount requirements (/tmp, dstack.sock)
- Include complete test session example script
- Add troubleshooting section for common issues
- Update flow diagram with new TEE-direct download path
- Explain why /tmp:/tmp mount is required
- Describe Docker bind mount requirements
- Add visual flow diagram
- Document alternative for large models using /dstack/persistent
1. Revert docker-compose.yml to dstack deploy format
   - Port: #PORT#:3080 (placeholder for dstack)
   - Config path: /dstack/user_config (fixed dstack path)

2. Remove /tmp/train_lora_fixed.py workaround in executor.go
   - This was a temporary fix for fsspec file:// prefix issue

3. Align VerifyDownloadSignature with validateSignature format
   - Use same message format: TextHash(keccak256(binaryTaskID))
   - Consistent signature validation across all endpoints

4. Consolidate config example files
   - Merge config-example.yaml into config.example.yaml
   - Single comprehensive example showing all config options

5. Add design documentation for SaveDataset
   - Explain why content hash (not task ID) is used as filename
   - Task ID doesn't exist at upload time
   - Content hash ensures deduplication and integrity
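The content-hash naming scheme is easy to sketch: the filename is a digest of the uploaded bytes, so identical uploads collapse to one file and the name itself verifies integrity. SHA-256 here is an assumption for illustration; the PR does not state which hash the broker uses.

```python
import hashlib

def dataset_filename(content: bytes) -> str:
    """Name an uploaded dataset by its content hash: identical uploads
    deduplicate to a single file, and re-hashing the stored bytes
    verifies integrity. (Hash choice is illustrative.)"""
    return hashlib.sha256(content).hexdigest()
```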
- Add HF cache directory configuration to fix permission errors
- Pin dependency versions to resolve pyarrow/numpy compatibility issues
- Fix labels for causal LM loss computation
- Add keep_in_memory=True to avoid cache file writes during tokenization
- Support additional config key aliases (num_epochs, batch_size, etc.)

Tested successfully with Qwen2.5-0.5B-Instruct LoRA fine-tuning.
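The "fix labels for causal LM loss" item boils down to a standard pattern: labels are the input ids themselves (the model shifts them internally), with padding positions set to -100 so the loss ignores them. A hedged sketch of that step, independent of the actual train_lora.py code:

```python
def make_labels(input_ids, pad_token_id):
    """Build causal-LM labels from input ids: copy each token through,
    but replace padding with -100, the ignore index used by
    cross-entropy loss in transformers."""
    return [-100 if tok == pad_token_id else tok for tok in input_ids]
```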
- Update Dockerfile to use transformers>=4.51.0 for Qwen3 architecture
- Update accelerate>=0.30.0 and peft>=0.11.0 for compatibility
- Remove deprecated tokenizer parameter from Trainer (transformers 5.x)
- Use data_collator to pass tokenizer instead

Fixes the "TypeError: Trainer.__init__() got an unexpected keyword
argument 'tokenizer'" error when using transformers 5.0+
- Add complete Qwen3-32B test script with CLI commands
- Document model setup and broker configuration
- Add timing reference for 32B model (~5-6 minutes)
- Include Qwen3-specific troubleshooting section
- Update output structure with LoRA adapter sizes by model
The database config was using an incorrect nested structure (mysql.host, mysql.port, etc.)
but the Go code expects a single DSN string in database.fineTune field.

Fixed to match the expected format in config/config.go:
  database:
    fineTune: "user:password@tcp(host:port)/database?parseTime=true"

Co-authored-by: Cursor <cursoragent@cursor.com>
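A helper that composes the single-string DSN can make the expected shape explicit. This is an illustrative sketch (the helper is hypothetical); the format follows the go-sql-driver/mysql DSN convention that the `database.fineTune` field expects.

```python
def build_dsn(user: str, password: str, host: str,
              port: int, database: str) -> str:
    """Compose the single-string MySQL DSN the Go driver expects,
    matching the database.fineTune config field:
    user:password@tcp(host:port)/database?parseTime=true"""
    return f"{user}:{password}@tcp({host}:{port})/{database}?parseTime=true"
```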
…ancelTask

- Change HTTP method from GET to POST for /user/:userAddress/task/:taskID/lora
- Move signature from query parameter to request body (JSON)
- Update swagger documentation accordingly

This change makes the DownloadLoRA endpoint consistent with CancelTask,
where the signature is passed in the request body for better security.

Co-authored-by: Cursor <cursoragent@cursor.com>
1. ctrl/task.go: Replace hardcoded "qwen-lora:v3" with
   c.config.Images.ExecutionImageName to use configured image

2. api/Dockerfile: Explicitly add huggingface_hub package and
   add comments explaining it provides huggingface-cli

3. setup.go: Add comment explaining huggingface-cli dependency

Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace deprecated `torch_dtype` with `dtype` in model loading
- Remove `save_safetensors` parameter from TrainingArguments (removed in
  transformers 5.x, safetensors is now the default format)

These changes fix the empty output_model issue where training would crash
with TypeError on `save_safetensors` before producing any LoRA weights.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pin transformers to >=5.0.0,<6.0.0 to prevent future API breakage.
The train_lora.py script uses transformers 5.x API (dtype instead of
torch_dtype, no save_safetensors parameter).

Co-authored-by: Cursor <cursoragent@cursor.com>
Ravenyjh force-pushed the feature/qwen-support branch from 6117ed7 to 008be32 on February 10, 2026 11:54
Ravenyjh merged commit 2b77eb4 into main on Feb 10, 2026
1 check passed
Ravenyjh deleted the feature/qwen-support branch on February 10, 2026 11:55


6 participants