Implement automated cleanup for uploaded dataset files #349

@claude

Problem

Uploaded dataset files stored at {dataDir}/datasets/{userAddress}/{datasetHash} lack an automated cleanup mechanism, leading to indefinite storage accumulation.

Current Behavior

  1. Dataset upload (ctrl/task.go:294-333):

    • Files saved to {dataDir}/datasets/{userAddress}/{hash}
    • HF converted format saved to {dataDir}/datasets/{userAddress}/{hash}_hf/
    • No cleanup mechanism targets these directories
  2. Existing cleanup (settlement.go:358-378):

    • Only removes task-specific workspace files:
      • {dataDir}/{taskID}/data/
      • {dataDir}/{taskID}/model/
      • {dataDir}/{taskID}/output_model/
    • Does NOT clean uploaded dataset files
  3. Unused config field:

    • fileRetentionHours defined in config.go:45-48 but never used in code

Impact

  • Storage bloat: Dataset files accumulate indefinitely
  • Partial mitigation: Content-addressable storage provides deduplication (same content = same hash = same file)
  • Still problematic: Different datasets never get cleaned up

Proposed Solution

Implement a cleanup mechanism for uploaded datasets. Options:

Option A: Time-based Cleanup (Recommended)

  • Use the existing fileRetentionHours config field
  • Add a cleanup worker that periodically scans {dataDir}/datasets/
  • Remove dataset files (and their {hash}_hf/ directories) older than the retention period
  • Similar to existing task cleanup in settlement.go

Option B: Reference Counting

  • Track which tasks reference each dataset hash
  • Only delete datasets when no tasks reference them
  • More complex but safer
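A rough sketch of what the reference tracking in Option B could look like. This is an in-memory illustration only: the type datasetRefs and its methods are hypothetical, and a real implementation would presumably persist references alongside the task records so they survive restarts. The key is assumed to combine user address and dataset hash, since files are stored per user.

```go
package main

import (
	"fmt"
	"sync"
)

// datasetRefs tracks which task IDs reference each dataset.
// Hypothetical type; keys are assumed to be "{userAddress}/{datasetHash}".
type datasetRefs struct {
	mu   sync.Mutex
	refs map[string]map[string]struct{} // dataset key -> set of task IDs
}

func newDatasetRefs() *datasetRefs {
	return &datasetRefs{refs: make(map[string]map[string]struct{})}
}

// Acquire records that taskID uses the dataset identified by key.
func (d *datasetRefs) Acquire(key, taskID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.refs[key] == nil {
		d.refs[key] = make(map[string]struct{})
	}
	d.refs[key][taskID] = struct{}{}
}

// Release drops taskID's reference and reports whether the dataset
// is now unreferenced and therefore eligible for deletion.
func (d *datasetRefs) Release(key, taskID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	tasks := d.refs[key]
	delete(tasks, taskID) // deleting from a nil map is a no-op
	if len(tasks) == 0 {
		delete(d.refs, key)
		return true
	}
	return false
}

func main() {
	refs := newDatasetRefs()
	refs.Acquire("0xUser/hash1", "task-1")
	refs.Acquire("0xUser/hash1", "task-2")
	fmt.Println(refs.Release("0xUser/hash1", "task-1")) // false: task-2 still holds it
	fmt.Println(refs.Release("0xUser/hash1", "task-2")) // true: safe to delete
}
```

Using a set of task IDs rather than a bare counter makes Acquire and Release idempotent per task, which avoids double-counting if a task is retried.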

Option C: Combined Approach

  • Keep datasets while referenced by any task
  • After all referencing tasks are deleted, apply time-based retention

Implementation Checklist

  • Choose a cleanup strategy (Option A or C recommended)
  • Implement the cleanup logic in the settlement service
  • Wire the fileRetentionHours config field into the cleanup logic so it actually takes effect
  • Add logging for cleanup operations
  • Add metrics for storage usage
  • Update documentation

Related Files

  • api/fine-tuning/internal/ctrl/task.go (SaveDataset)
  • api/fine-tuning/internal/services/settlement.go (existing cleanup)
  • api/fine-tuning/config/config.go (fileRetentionHours config)

Context

Identified during PR #337 code review by @Ravenyjh.

Labels: bug, enhancement