feat: add modelLocalPaths support and Qwen model #337

Merged
Ravenyjh merged 433 commits into main from feature/qwen-support
Feb 10, 2026

Conversation

@zeyuyuyu
Collaborator

Summary

This PR adds support for using pre-downloaded models directly from the provider's local filesystem, eliminating the need to download large models from 0G Storage for every fine-tuning task.

Features

  • Add modelLocalPaths configuration option to map model hashes to local file paths
  • Support local model paths for any model (including predefined models)
  • Create symlink to local model instead of downloading from 0G Storage
  • Add Qwen2.5-0.5B-Instruct model support

Benefits

  • Significantly faster task setup for large models (e.g., Qwen2.5-32B)
  • Reduced bandwidth usage
  • More reliable in TEE environments with network restrictions

Test Plan

  • Tested on Phala Cloud CVM with H200 GPU
  • Verified localPath feature loads model from /dstack/persistent/models/
  • Confirmed full fine-tuning flow: Setup → Train → Deliver
  • Documentation added: docs/localPath-feature-test.md

Configuration Example

service:
  modelLocalPaths:
    "0xb4f76a886b8655c92bb021922d60b5e4d9271a5c9da98b6cb10937a06c2c75a7": "/dstack/persistent/models/Qwen2.5-0.5B-Instruct"

Ravenyjh and others added 30 commits February 25, 2025 16:01
- Since the value is a comma-separated hex string rather than a pure hex string
Co-authored-by: Peter Zhang <peter@0g.ai>
- Modify Dockerfile for serving
  - Add config file for CocktailSGD
  - See the CocktailSGD repository for more details: https://github.com/0glabs/CocktailSGD

Co-authored-by: Sidonie <sidonie@0g.ai>
- If an image with the same 'repo:tag' exists, skip pulling
* Fix(ft): Update dockerfile for cocktail

- Update requirements.txt
- Update env for cocktail
- Use the same Dockerfile; install dependencies into the existing conda environment

* Fix(ft): Update dockerfile and hash for cocktail

---------

Co-authored-by: Sidonie <sidonie@0g.ai>
Feat(fine-tuning): Add retry for submitting transaction
zeyu and others added 25 commits February 10, 2026 19:53
…ling

- Restructure prepareModel with clear fallback chain:
  1. Local path (symlink) - fastest, no download
  2. HuggingFace download - to task directory
  3. 0G Storage download - last resort
- HuggingFace fallback now triggers independently if configured
- Better error logging showing all failed sources
- Simplify prepareData with cleaner fallback logic
- Consistent error messages for debugging
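The fallback chain above can be sketched as an ordered list of sources tried in turn, with every failure recorded so the final error names all of them. This is a hedged Python sketch of the pattern, not the PR's Go code; the source names are illustrative.

```python
def prepare_model(sources):
    """Try each (name, fetch) source in order and return the first
    success; if all fail, raise with every failure listed, mirroring
    the chain: local path -> HuggingFace -> 0G Storage."""
    errors = []
    for name, fetch in sources:
        try:
            return fetch()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all model sources failed: " + "; ".join(errors))
```

Collecting the per-source errors is what makes "better error logging showing all failed sources" possible: the operator sees every attempt, not just the last one.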
- Add skipStorageUpload config option in Service struct
- When enabled, finalizer skips encryption and 0G Storage upload
- Task still marked as finished, LoRA available via TEE API
- Useful for testing or when 0G Storage is unavailable
- Document broker configuration options
- Step-by-step testing instructions
- Model/dataset fallback chain explanation
- Troubleshooting guide
- LoRA download from TEE instructions
…oRA delivery

Major improvements:
- Add dataset upload API (POST /v1/user/:address/dataset)
- Encrypt LoRA locally even when skipStorageUpload=true
- Return encrypted LoRA via TEE download endpoint
- Support persistent storage via dataDir config
- Add file retention and cleanup logic
- Update documentation to English

Changes:
- finalizer.go: Always encrypt LoRA, save to {output}_encrypted.data
- ctrl/task.go: Add SaveDataset(), GetLoRAModel() returns encrypted file
- handler/task.go: Add UploadDataset endpoint
- setup.go: Support user-uploaded datasets (3-tier fallback)
- settlement.go: Clean up .data encrypted files
- config.go: Add dataDir and fileRetentionHours fields
- task_log.go: Use configurable data directory
- .gitignore: Add binary files and environment-specific configs
- TESTING.md: Complete English documentation with encryption flow

Dataset fallback: config paths → uploaded datasets → 0G Storage
Model fallback: local paths → HuggingFace → 0G Storage

This enables users to upload datasets, train models, and download
encrypted LoRA adapters with proper key exchange via smart contracts.
Changes:
- ctrl/task.go: Add convertJSONLToHF() to convert uploaded JSONL datasets
  to HuggingFace DatasetDict format required by token counter
- setup.go: Prefer HF format (_hf suffix) over raw JSONL when loading
  user-uploaded datasets

Dataset loading priority for user uploads:
1. {dataDir}/datasets/{user}/{hash}_hf (HuggingFace format)
2. {dataDir}/datasets/{user}/{hash} (raw JSONL fallback)

Note: The conversion runs via Docker container (qwen-lora:v3) which
requires the broker to have Docker socket access.
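The core of such a conversion is reshaping row-oriented JSONL records into the column-oriented layout that `datasets.Dataset.from_dict()` accepts. A minimal sketch, with a hypothetical helper name (the PR's actual conversion runs inside the container):

```python
import json

def jsonl_to_columns(jsonl_text: str) -> dict:
    """Convert JSONL records (one JSON object per line) into the
    column-oriented dict that datasets.Dataset.from_dict() expects:
    one list per field, aligned by row."""
    records = [json.loads(line)
               for line in jsonl_text.splitlines() if line.strip()]
    if not records:
        return {}
    keys = records[0].keys()
    # Missing fields default to "" so every column has equal length.
    return {k: [r.get(k, "") for r in records] for k in keys}
```

The resulting dict can then be wrapped in a `DatasetDict` (e.g. under a `"train"` split) and saved to the `_hf` directory that the loader prefers.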
Comprehensive guide covering:
- Complete test flow from dataset upload to LoRA download
- Signature generation for task creation
- Task progress monitoring
- Contract interaction (acknowledgeDeliverable)
- Troubleshooting common issues
- Model hash reference table
1. Remove binary files from git (broker-*)
   - ~325 MB of binaries were bloating git history
   - Added broker-* pattern to .gitignore

2. Remove environment-specific config files
   - config.yaml, config-local-model.yaml, config-model-local-paths.yaml
   - Updated .gitignore to properly exclude config.yaml pattern
   - Only config-example.yaml and config.example.yaml are tracked

3. Add authentication to LoRA download endpoint
   - Requires signature of keccak256(taskID) signed by user
   - Prevents unauthorized access to encrypted LoRA files
   - Added VerifyDownloadSignature function in ctrl

4. Add nil check in GetLoRAModel
   - Defense-in-depth for task ID parameter
- Try direct Python execution first before falling back to Docker
- Add support for both instruction/input/output and messages (chat) formats
- Better error handling for dataset conversion
- Add CLI-based testing commands (0g-compute-cli)
- Document skipStorageUpload configuration for direct TEE download
- Add Docker volume mount requirements (/tmp, dstack.sock)
- Include complete test session example script
- Add troubleshooting section for common issues
- Update flow diagram with new TEE-direct download path
- Explain why /tmp:/tmp mount is required
- Describe Docker bind mount requirements
- Add visual flow diagram
- Document alternative for large models using /dstack/persistent
1. Revert docker-compose.yml to dstack deploy format
   - Port: #PORT#:3080 (placeholder for dstack)
   - Config path: /dstack/user_config (fixed dstack path)

2. Remove /tmp/train_lora_fixed.py workaround in executor.go
   - This was a temporary fix for fsspec file:// prefix issue

3. Align VerifyDownloadSignature with validateSignature format
   - Use same message format: TextHash(keccak256(binaryTaskID))
   - Consistent signature validation across all endpoints

4. Consolidate config example files
   - Merge config-example.yaml into config.example.yaml
   - Single comprehensive example showing all config options

5. Add design documentation for SaveDataset
   - Explain why content hash (not task ID) is used as filename
   - Task ID doesn't exist at upload time
   - Content hash ensures deduplication and integrity
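The content-hash naming scheme is easy to sketch: the filename is a digest of the uploaded bytes, so identical uploads collapse to one file and the name itself verifies integrity. SHA-256 here is an assumption for illustration; the PR does not state which hash the broker uses.

```python
import hashlib

def dataset_filename(content: bytes) -> str:
    """Name an uploaded dataset by its content hash: identical uploads
    deduplicate to a single file, and re-hashing the stored bytes
    verifies integrity. (Hash choice is illustrative.)"""
    return hashlib.sha256(content).hexdigest()
```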
- Add HF cache directory configuration to fix permission errors
- Pin dependency versions to resolve pyarrow/numpy compatibility issues
- Fix labels for causal LM loss computation
- Add keep_in_memory=True to avoid cache file writes during tokenization
- Support additional config key aliases (num_epochs, batch_size, etc.)

Tested successfully with Qwen2.5-0.5B-Instruct LoRA fine-tuning.
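The "fix labels for causal LM loss" item boils down to a standard pattern: labels are the input ids themselves (the model shifts them internally), with padding positions set to -100 so the loss ignores them. A hedged sketch of that step, independent of the actual train_lora.py code:

```python
def make_labels(input_ids, pad_token_id):
    """Build causal-LM labels from input ids: copy each token through,
    but replace padding with -100, the ignore index used by
    cross-entropy loss in transformers."""
    return [-100 if tok == pad_token_id else tok for tok in input_ids]
```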
- Update Dockerfile to use transformers>=4.51.0 for Qwen3 architecture
- Update accelerate>=0.30.0 and peft>=0.11.0 for compatibility
- Remove deprecated tokenizer parameter from Trainer (transformers 5.x)
- Use data_collator to pass tokenizer instead

Fixes the "TypeError: Trainer.__init__() got an unexpected keyword
argument 'tokenizer'" error when using transformers 5.0+
- Add complete Qwen3-32B test script with CLI commands
- Document model setup and broker configuration
- Add timing reference for 32B model (~5-6 minutes)
- Include Qwen3-specific troubleshooting section
- Update output structure with LoRA adapter sizes by model
The database config was using an incorrect nested structure (mysql.host, mysql.port, etc.)
but the Go code expects a single DSN string in database.fineTune field.

Fixed to match the expected format in config/config.go:
  database:
    fineTune: "user:password@tcp(host:port)/database?parseTime=true"

Co-authored-by: Cursor <cursoragent@cursor.com>
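A helper that composes the single-string DSN can make the expected shape explicit. This is an illustrative sketch (the helper is hypothetical); the format follows the go-sql-driver/mysql DSN convention that the `database.fineTune` field expects.

```python
def build_dsn(user: str, password: str, host: str,
              port: int, database: str) -> str:
    """Compose the single-string MySQL DSN the Go driver expects,
    matching the database.fineTune config field:
    user:password@tcp(host:port)/database?parseTime=true"""
    return f"{user}:{password}@tcp({host}:{port})/{database}?parseTime=true"
```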
…ancelTask

- Change HTTP method from GET to POST for /user/:userAddress/task/:taskID/lora
- Move signature from query parameter to request body (JSON)
- Update swagger documentation accordingly

This change makes the DownloadLoRA endpoint consistent with CancelTask,
where the signature is passed in the request body for better security.

Co-authored-by: Cursor <cursoragent@cursor.com>
1. ctrl/task.go: Replace hardcoded "qwen-lora:v3" with
   c.config.Images.ExecutionImageName to use configured image

2. api/Dockerfile: Explicitly add huggingface_hub package and
   add comments explaining it provides huggingface-cli

3. setup.go: Add comment explaining huggingface-cli dependency

Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace deprecated `torch_dtype` with `dtype` in model loading
- Remove `save_safetensors` parameter from TrainingArguments (removed in
  transformers 5.x, safetensors is now the default format)

These changes fix the empty output_model issue where training would crash
with TypeError on `save_safetensors` before producing any LoRA weights.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pin transformers to >=5.0.0,<6.0.0 to prevent future API breakage.
The train_lora.py script uses transformers 5.x API (dtype instead of
torch_dtype, no save_safetensors parameter).

Co-authored-by: Cursor <cursoragent@cursor.com>
Ravenyjh force-pushed the feature/qwen-support branch from 6117ed7 to 008be32 on February 10, 2026 11:54
Ravenyjh merged commit 2b77eb4 into main on Feb 10, 2026
1 check passed
Ravenyjh deleted the feature/qwen-support branch on February 10, 2026 11:55


6 participants