Open
Labels
bug, enhancement
Description
Problem
Uploaded dataset files stored at {dataDir}/datasets/{userAddress}/{datasetHash} lack an automated cleanup mechanism, leading to indefinite storage accumulation.
Current Behavior
Dataset upload (ctrl/task.go:294-333):
- Files saved to {dataDir}/datasets/{userAddress}/{hash}
- HF converted format saved to {dataDir}/datasets/{userAddress}/{hash}_hf/
- No cleanup mechanism targets these directories

Existing cleanup (settlement.go:358-378):
- Only removes task-specific workspace files: {dataDir}/{taskID}/data/, {dataDir}/{taskID}/model/, {dataDir}/{taskID}/output_model/
- Does NOT clean uploaded dataset files

Unused config field:
- fileRetentionHours is defined in config.go:45-48 but never used in code
Impact
- Storage bloat: Dataset files accumulate indefinitely
- Partial mitigation: Content-addressable storage provides deduplication (same content = same hash = same file)
- Still problematic: Different datasets never get cleaned up
Proposed Solution
Implement a cleanup mechanism for uploaded datasets. Options:
Option A: Time-based Cleanup (Recommended)
- Use the existing fileRetentionHours config field
- Add a cleanup worker to periodically scan {dataDir}/datasets/
- Remove files older than the retention period
- Similar to the existing task cleanup in settlement.go
Option B: Reference Counting
- Track which tasks reference each dataset hash
- Only delete datasets when no tasks reference them
- More complex but safer
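
The bookkeeping behind Option B could be sketched roughly as follows. DatasetRefs and its methods are hypothetical names, and the in-memory map is an assumption made to keep the example short; a real implementation would presumably persist references alongside the task records so they survive a restart.

```go
package main

import "sync"

// DatasetRefs tracks which tasks reference each dataset hash.
// (Hypothetical type; kept in memory only for illustration.)
type DatasetRefs struct {
	mu   sync.Mutex
	refs map[string]map[string]struct{} // dataset hash -> set of task IDs
}

// NewDatasetRefs returns an empty tracker.
func NewDatasetRefs() *DatasetRefs {
	return &DatasetRefs{refs: make(map[string]map[string]struct{})}
}

// Acquire records that taskID uses the dataset identified by hash.
func (r *DatasetRefs) Acquire(hash, taskID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs[hash] == nil {
		r.refs[hash] = make(map[string]struct{})
	}
	r.refs[hash][taskID] = struct{}{}
}

// Release drops taskID's reference to hash. It reports true when no task
// references the dataset any more, meaning it is safe to delete.
func (r *DatasetRefs) Release(hash, taskID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	tasks := r.refs[hash]
	delete(tasks, taskID) // deleting from a nil map is a no-op
	if len(tasks) == 0 {
		delete(r.refs, hash)
		return true
	}
	return false
}

// InUse reports whether at least one task still references hash.
func (r *DatasetRefs) InUse(hash string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.refs[hash]) > 0
}
```

Because storage is content-addressable, several tasks can share one hash, so the tracker keeps a set of task IDs per hash rather than a bare counter. This makes Acquire and Release idempotent for a given task.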
Option C: Combined Approach
- Keep datasets while referenced by any task
- After all referencing tasks are deleted, apply time-based retention
Implementation Checklist
- Choose cleanup strategy (recommend Option A or C)
- Implement cleanup logic in settlement service
- Make fileRetentionHours config actually functional
- Add logging for cleanup operations
- Add metrics for storage usage
- Update documentation
Related Files
- api/fine-tuning/internal/ctrl/task.go (SaveDataset)
- api/fine-tuning/internal/services/settlement.go (existing cleanup)
- api/fine-tuning/config/config.go (fileRetentionHours config)