feat: add modelLocalPaths support and Qwen model #337
Merged
Conversation
- Since a comma-separated hex string rather than a pure hex
Co-authored-by: Peter Zhang <peter@0g.ai>
- Modify Dockerfile for serving
- Add config file for CocktailSGD
- See more details of CocktailSGD in https://github.com/0glabs/CocktailSGD
Co-authored-by: Sidonie <sidonie@0g.ai>
- If an image with the same 'repo:tag' exists, skip pulling
…ks are in progress
* Fix(ft): Update dockerfile for cocktail
  - Update requirements.txt
  - Update env for cocktail
  - Use the same dockerfile; add dependencies installed with the existing conda
* Fix(ft): Update dockerfile and hash for cocktail
Co-authored-by: Sidonie <sidonie@0g.ai>
Feat(fine-tuning): Add retry for submitting transaction
…ling
- Restructure prepareModel with a clear fallback chain:
  1. Local path (symlink) - fastest, no download
  2. HuggingFace download - to task directory
  3. 0G Storage download - last resort
- HuggingFace fallback now triggers independently if configured
- Better error logging showing all failed sources
- Simplify prepareData with cleaner fallback logic
- Consistent error messages for debugging
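The fallback chain above can be sketched in Go. This is illustrative only: the broker's real prepareModel has different signatures; the helper parameters and return values here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// prepareModel sketches the three-step fallback chain: local path first,
// then HuggingFace, then 0G Storage. Every failed source is collected so
// the final error reports all of them, as the commit describes.
func prepareModel(hash string, localPaths map[string]string,
	fromHF, fromStorage func(hash string) error) (string, error) {
	var errs []error

	// 1. Local path (symlink): fastest, no download.
	if p, ok := localPaths[hash]; ok {
		return "local:" + p, nil
	}
	errs = append(errs, errors.New("no local path configured"))

	// 2. HuggingFace download into the task directory.
	if fromHF != nil {
		if err := fromHF(hash); err == nil {
			return "huggingface", nil
		} else {
			errs = append(errs, err)
		}
	}

	// 3. 0G Storage download as the last resort.
	if fromStorage != nil {
		if err := fromStorage(hash); err == nil {
			return "0g-storage", nil
		} else {
			errs = append(errs, err)
		}
	}

	return "", fmt.Errorf("all model sources failed: %v", errs)
}

func main() {
	src, _ := prepareModel("abc123",
		map[string]string{"abc123": "/dstack/persistent/models/qwen"},
		nil, nil)
	fmt.Println(src) // local:/dstack/persistent/models/qwen
}
```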
- Add skipStorageUpload config option in Service struct
- When enabled, finalizer skips encryption and 0G Storage upload
- Task still marked as finished, LoRA available via TEE API
- Useful for testing or when 0G Storage is unavailable
- Document broker configuration options
- Step-by-step testing instructions
- Model/dataset fallback chain explanation
- Troubleshooting guide
- LoRA download from TEE instructions
…oRA delivery
Major improvements:
- Add dataset upload API (POST /v1/user/:address/dataset)
- Encrypt LoRA locally even when skipStorageUpload=true
- Return encrypted LoRA via TEE download endpoint
- Support persistent storage via dataDir config
- Add file retention and cleanup logic
- Update documentation to English
Changes:
- finalizer.go: Always encrypt LoRA, save to {output}_encrypted.data
- ctrl/task.go: Add SaveDataset(), GetLoRAModel() returns encrypted file
- handler/task.go: Add UploadDataset endpoint
- setup.go: Support user-uploaded datasets (3-tier fallback)
- settlement.go: Clean up .data encrypted files
- config.go: Add dataDir and fileRetentionHours fields
- task_log.go: Use configurable data directory
- .gitignore: Add binary files and environment-specific configs
- TESTING.md: Complete English documentation with encryption flow
Dataset fallback: config paths → uploaded datasets → 0G Storage
Model fallback: local paths → HuggingFace → 0G Storage
This enables users to upload datasets, train models, and download
encrypted LoRA adapters with proper key exchange via smart contracts.
Changes:
- ctrl/task.go: Add convertJSONLToHF() to convert uploaded JSONL datasets
to HuggingFace DatasetDict format required by token counter
- setup.go: Prefer HF format (_hf suffix) over raw JSONL when loading
user-uploaded datasets
Dataset loading priority for user uploads:
1. {dataDir}/datasets/{user}/{hash}_hf (HuggingFace format)
2. {dataDir}/datasets/{user}/{hash} (raw JSONL fallback)
Note: The conversion runs via Docker container (qwen-lora:v3) which
requires the broker to have Docker socket access.
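The loading priority above can be sketched as a small path-resolution helper. The helper name and return shape are assumptions; only the directory layout and the `_hf`-before-raw-JSONL order come from the commit.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// datasetPath resolves a user-uploaded dataset: prefer the converted
// HuggingFace directory ({hash}_hf), then fall back to the raw JSONL file.
func datasetPath(dataDir, user, hash string) (path string, isHF bool) {
	base := filepath.Join(dataDir, "datasets", user)

	// 1. HuggingFace DatasetDict format, produced by convertJSONLToHF().
	hf := filepath.Join(base, hash+"_hf")
	if _, err := os.Stat(hf); err == nil {
		return hf, true
	}

	// 2. Raw JSONL fallback.
	raw := filepath.Join(base, hash)
	if _, err := os.Stat(raw); err == nil {
		return raw, false
	}
	return "", false // neither form present
}

func main() {
	p, hf := datasetPath("/data", "0xabc", "deadbeef")
	fmt.Println(p, hf)
}
```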
Comprehensive guide covering:
- Complete test flow from dataset upload to LoRA download
- Signature generation for task creation
- Task progress monitoring
- Contract interaction (acknowledgeDeliverable)
- Troubleshooting common issues
- Model hash reference table
1. Remove binary files from git (broker-*)
   - ~325 MB of binaries were bloating git history
   - Added broker-* pattern to .gitignore
2. Remove environment-specific config files
   - config.yaml, config-local-model.yaml, config-model-local-paths.yaml
   - Updated .gitignore to properly exclude the config.yaml pattern
   - Only config-example.yaml and config.example.yaml are tracked
3. Add authentication to the LoRA download endpoint
   - Requires a signature of keccak256(taskID) signed by the user
   - Prevents unauthorized access to encrypted LoRA files
   - Added VerifyDownloadSignature function in ctrl
4. Add nil check in GetLoRAModel
   - Defense in depth for the task ID parameter
- Try direct Python execution first before falling back to Docker
- Add support for both instruction/input/output and messages (chat) formats
- Better error handling for dataset conversion
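Distinguishing the two supported JSONL record shapes can be sketched as a small format probe. The function name and return labels are hypothetical; only the two formats (instruction/input/output vs. a chat-style "messages" array) come from the commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// recordFormat inspects one JSONL line and reports which of the two
// supported dataset shapes it matches.
func recordFormat(line []byte) string {
	var rec map[string]json.RawMessage
	if err := json.Unmarshal(line, &rec); err != nil {
		return "invalid"
	}
	if _, ok := rec["messages"]; ok {
		return "chat" // chat-style: {"messages": [{"role": ..., "content": ...}]}
	}
	if _, ok := rec["instruction"]; ok {
		return "instruction" // {"instruction": ..., "input": ..., "output": ...}
	}
	return "unknown"
}

func main() {
	fmt.Println(recordFormat([]byte(`{"instruction":"q","input":"","output":"a"}`)))
	fmt.Println(recordFormat([]byte(`{"messages":[{"role":"user","content":"hi"}]}`)))
}
```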
- Add CLI-based testing commands (0g-compute-cli)
- Document skipStorageUpload configuration for direct TEE download
- Add Docker volume mount requirements (/tmp, dstack.sock)
- Include complete test session example script
- Add troubleshooting section for common issues
- Update flow diagram with new TEE-direct download path
- Explain why the /tmp:/tmp mount is required
- Describe Docker bind mount requirements
- Add visual flow diagram
- Document alternative for large models using /dstack/persistent
1. Revert docker-compose.yml to dstack deploy format
   - Port: #PORT#:3080 (placeholder for dstack)
   - Config path: /dstack/user_config (fixed dstack path)
2. Remove the /tmp/train_lora_fixed.py workaround in executor.go
   - This was a temporary fix for the fsspec file:// prefix issue
3. Align VerifyDownloadSignature with the validateSignature format
   - Use the same message format: TextHash(keccak256(binaryTaskID))
   - Consistent signature validation across all endpoints
4. Consolidate config example files
   - Merge config-example.yaml into config.example.yaml
   - Single comprehensive example showing all config options
5. Add design documentation for SaveDataset
   - Explain why the content hash (not the task ID) is used as the filename
   - The task ID doesn't exist at upload time
   - The content hash ensures deduplication and integrity
- Add HF cache directory configuration to fix permission errors
- Pin dependency versions to resolve pyarrow/numpy compatibility issues
- Fix labels for causal LM loss computation
- Add keep_in_memory=True to avoid cache file writes during tokenization
- Support additional config key aliases (num_epochs, batch_size, etc.)
Tested successfully with Qwen2.5-0.5B-Instruct LoRA fine-tuning.
- Update Dockerfile to use transformers>=4.51.0 for the Qwen3 architecture
- Update accelerate>=0.30.0 and peft>=0.11.0 for compatibility
- Remove the deprecated tokenizer parameter from Trainer (transformers 5.x)
- Use data_collator to pass the tokenizer instead
Fixes the "TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer'" error when using transformers 5.0+.
- Add complete Qwen3-32B test script with CLI commands
- Document model setup and broker configuration
- Add timing reference for the 32B model (~5-6 minutes)
- Include Qwen3-specific troubleshooting section
- Update output structure with LoRA adapter sizes by model
The database config was using an incorrect nested structure (mysql.host, mysql.port, etc.), but the Go code expects a single DSN string in the database.fineTune field. Fixed to match the expected format in config/config.go:

database:
  fineTune: "user:password@tcp(host:port)/database?parseTime=true"
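For illustration, assembling that flat DSN from the values the old nested mysql.* structure held looks like this. The helper itself is hypothetical and not part of the repo; only the DSN format is from the fix.

```go
package main

import "fmt"

// buildDSN assembles the single DSN string that database.fineTune expects
// from the individual connection values.
func buildDSN(user, pass, host string, port int, db string) string {
	return fmt.Sprintf("%s:%s@tcp(%s:%d)/%s?parseTime=true",
		user, pass, host, port, db)
}

func main() {
	fmt.Println(buildDSN("user", "password", "host", 3306, "database"))
	// prints: user:password@tcp(host:3306)/database?parseTime=true
}
```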
Co-authored-by: Cursor <cursoragent@cursor.com>
…ancelTask
- Change HTTP method from GET to POST for /user/:userAddress/task/:taskID/lora
- Move the signature from a query parameter to the request body (JSON)
- Update the swagger documentation accordingly
This change makes the DownloadLoRA endpoint consistent with CancelTask, where the signature is passed in the request body for better security.
Co-authored-by: Cursor <cursoragent@cursor.com>
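Decoding the signature from the new JSON body could look like the sketch below. The JSON field name "signature" and the helper are assumptions, not taken from the repo.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// downloadLoRARequest is a hypothetical request body carrying the
// signature that was previously passed as a query parameter.
type downloadLoRARequest struct {
	Signature string `json:"signature"`
}

// parseDownloadBody extracts the signature from a POST body.
func parseDownloadBody(body string) (string, error) {
	var req downloadLoRARequest
	if err := json.NewDecoder(strings.NewReader(body)).Decode(&req); err != nil {
		return "", err
	}
	return req.Signature, nil
}

func main() {
	sig, _ := parseDownloadBody(`{"signature":"0xdeadbeef"}`)
	fmt.Println(sig) // prints: 0xdeadbeef
}
```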
1. ctrl/task.go: Replace the hardcoded "qwen-lora:v3" with c.config.Images.ExecutionImageName to use the configured image
2. api/Dockerfile: Explicitly add the huggingface_hub package and add comments explaining that it provides huggingface-cli
3. setup.go: Add a comment explaining the huggingface-cli dependency
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Replace the deprecated `torch_dtype` with `dtype` in model loading
- Remove the `save_safetensors` parameter from TrainingArguments (removed in transformers 5.x; safetensors is now the default format)
These changes fix the empty output_model issue where training would crash with a TypeError on `save_safetensors` before producing any LoRA weights.
Co-authored-by: Cursor <cursoragent@cursor.com>
Pin transformers to >=5.0.0,<6.0.0 to prevent future API breakage. The train_lora.py script uses transformers 5.x API (dtype instead of torch_dtype, no save_safetensors parameter). Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from 6117ed7 to 008be32
Ravenyjh approved these changes on Feb 10, 2026.
Summary
This PR adds support for using pre-downloaded models directly from the provider's local filesystem, eliminating the need to download large models from 0G Storage for every fine-tuning task.
Features
- modelLocalPaths configuration option to map model hashes to local file paths

Benefits
Test Plan
- /dstack/persistent/models/
- docs/localPath-feature-test.md

Configuration Example
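A minimal sketch of what the new option might look like in config.yaml. The hash and path values are placeholders, and whether these keys sit at the top level or under another section is an assumption; only the option names (modelLocalPaths, dataDir, fileRetentionHours) come from this PR.

```yaml
# Hypothetical fragment: maps a model's content hash to a pre-downloaded
# local directory so the broker symlinks it instead of downloading.
modelLocalPaths:
  "<model-hash>": /dstack/persistent/models/Qwen2.5-0.5B-Instruct

# Persistent storage for uploaded datasets and task output.
dataDir: /dstack/persistent/data
fileRetentionHours: 24
```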