
[Bug] Harvester abandons entire directory tree when encountering I/O errors during recursive plot scanning #19383

@wallentx

Description

What happened?

When using recursive_plot_scan, if the harvester encounters an I/O error from a mount that is a subdirectory of a plot_directories entry, it permanently abandons the entire plot_directories entry instead of just skipping the problematic subdirectory. A single bad mount can therefore suddenly stop the farm from farming every plot under that entry.

Chia behavior

My current plot directories are:
chia plots show

/mnt/ssdarray1/buffer
/mnt/plot

My farm setup

/mnt/plot
  ├── plots1
  ├── plots2
  ├── plots3
  ├── plots4
  ├── plots5
  └── ... (over 160 more plot directories)

Each plots# folder represents a drive mount.
I have chia configured with recursive_plot_scan: true.

I'll be farming normally:

Local Harvester
   10507 plots of size: 837.501 TiB on-disk, 1.016 PiBe (effective)

And I'll also be plotting, writing new plots to a drive. Life is good.
But then, out of nowhere, it hit me... and it felt like I suddenly had a wallet full of Fine Art BUCKS.


But I ain't that lucky. Instead, I had a case of the BAD I/O.

Chia sees this:

94144:2025-03-11T19:26:03.143 2.5.3.dev123 harvester chia.plotting.util      : WARNING  Error reading directory /mnt/plot [Errno 5] Input/output error: '/mnt/plot/plots42'

And my harvester drops to:

Local Harvester
   2 plots of size: 162 GiB or something on-disk. pathetic.
   Estimated time to win: 900000 years

This happens because when the harvester hits an I/O error during normal harvesting operation, it stops recursively scanning that directory tree altogether.
Since all of my HDD plots live under /mnt/plot/, the only plots my harvester can still see are the freshly created plots on my SSD buffer volume, which are only there until they finish transferring to their final HDD resting place.
A single drive throwing an I/O error causes the harvester to abandon the entire directory structure.
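
To make the failure mode concrete, here is a minimal sketch in Python of the pattern I suspect (my own guess, not chia's actual code in chia.plotting.util, and the function name is made up): when the whole recursive walk of a plot_directories entry sits inside a single try/except, one unreadable subdirectory throws away every plot already found under that entry and logs exactly one warning for the top-level directory, which lines up with the single "Error reading directory /mnt/plot" line above.

import logging
import os

log = logging.getLogger(__name__)

def scan_entry_fragile(entry: str) -> list[str]:
    plots: list[str] = []
    try:
        stack = [entry]
        while stack:
            current = stack.pop()
            # os.scandir raises OSError ([Errno 5] Input/output error) when the
            # directory sits on a faulted mount such as /mnt/plot/plots42
            with os.scandir(current) as it:
                for child in it:
                    if child.is_dir(follow_symlinks=False):
                        stack.append(child.path)
                    elif child.name.endswith(".plot"):
                        plots.append(child.path)
    except OSError as e:
        log.warning(f"Error reading directory {entry} {e}")
        return []  # every plot under the entry is dropped, not just the bad subdirectory
    return plots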

The only way to recover from this is to unmount the bad mount, and restart the harvester service.

Bonus details because I have ADHD

I have a few drives that seem to have some pretty crappy write cache. That's my diagnosis so far. But it is pretty common that if I'm writing plots to these drives, this will happen:

dmesg

I/O error, dev sdaa, sector 16944 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
I/O error, dev sdaa, sector 16944 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
I/O error, dev sdaa, sector 16944 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
I/O error, dev sdaa, sector 0 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 0
Buffer I/O error on dev sdaa, logical block 0, lost sync page write
EXT4-fs (sdaa): I/O error while writing superblock

Any attempt to interact with the mount when this happens just looks like this:

ls /mnt/plot/plots* 1>/dev/null
"/mnt/plot/plots97": Permission denied (os error 13)

I can unmount the volume, but I cannot re-mount it. Linux can still see the /dev/sdXXX entry for it, but I am entirely unable to interact with the device until I reboot. Or that would be the case if I were some normie, but I found a way to get these back into a good state with a script I made.

For these drives, I'm blaming poor write cache, but I've had I/O errors in the past for other reasons, and I imagine this could also happen with drives that are starting to go bad, network-based mounts, power hiccups, a flaky HBA, etc.

I/O errors happen.
Chia doesn't handle them well when they do.

What I'd expect instead

The harvester should be more resilient to I/O errors:

  1. If it hits an I/O error on a specific subdirectory, it should log it but continue with other subdirectories (see the sketch after this list)
  2. It should retry drives that had errors during the next refresh cycle
  3. There should be a way to force a complete refresh without having to restart the service
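
As a rough sketch of points 1 and 2, assuming nothing about chia's internals (scan_entry_resilient and bad_dirs are made-up names, not existing code), per-subdirectory error handling could look like this:

import logging
import os

log = logging.getLogger(__name__)

def scan_entry_resilient(entry: str, bad_dirs: set[str]) -> list[str]:
    plots: list[str] = []
    stack = [entry]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as it:
                for child in it:
                    if child.is_dir(follow_symlinks=False):
                        stack.append(child.path)
                    elif child.name.endswith(".plot"):
                        plots.append(child.path)
        except OSError as e:
            # log and remember only the bad subdirectory, keep walking the rest
            log.warning(f"Skipping {current}: {e}")
            bad_dirs.add(current)  # candidates to retry on the next refresh cycle
    return plots

Point 3 would then amount to clearing bad_dirs and re-walking the directories on demand, with no harvester restart required.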

Who this affects

This is relevant for farmers who have:

  • Multiple drive mounts
  • Drives with poor write cache or beginning to fail
  • Network-attached storage
  • USB drive farms
  • Occasional I/O issues on individual drives

Making the harvester more resilient to these kinds of errors would improve farming reliability and reduce the need for manual intervention.

Version

All

What platform are you using?

Linux

What ui mode are you using?

CLI
