Bad file descriptor (os error 9) causing window post to fail for provable sectors, reproduced on multiple systems running miner #7961
Lotus Version
Describe the Bug
Hello, I'm having a miner problem where at the beginning of the proving period it reports all the sectors are good, but then immediately starts throwing errors. The errors include the text "Bad file descriptor (os error 9)" and repeat for the remainder of the proving period. The errors stop after that, and the miner appears to be operating fine for the rest of the day. The sector files are there, which is confirmed by the faults being declared resolved before each day's deadline, as well as by several successful runs of

The issue started around the 8th or 9th of Jan, but I'm not sure why. I think that's around the time I did a small network reconfiguration, but nothing on the Lotus machines changed during that time, and I don't think I've changed anything else with them since then. Between the 8th and today I've been able to successfully prove at least one round of deadlines.

I consulted with @f8-ptrk and @LexLuthr who suggested the errors may indicate a hardware problem, and after moving the miner to a different machine using the backup/restore function, the next deadline was successful. But then the following day the same thing started happening again on the new miner.

I also fully rebuilt Lotus on the new miner and checked for missing dependencies. The new miner reported some mismatched parameter cache file checksums (I'm not sure if the files were corrupted or if the parameters changed?) but I was able to download correct files before the proving deadline started, and still got the errors.

FD limit on the miners is 1048576. I also attempted to increase it in the service file after seeing some related issues, but that didn't have any effect on the problem. Here's the service config section for my miner. It's basically the same on each machine I've tried.
The
Below are the errors that repeat throughout the proving period. I'm also attaching a combined rust/go log with debugging on that covers the last few days. On one or two of the days I just shut everything down right after the errors began recurring, so that's why there's a break at that point on some day(s).
Logging Information
Repo Steps
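A note on the error text itself, since the FD limit came up above: on Linux, "os error 9" is EBADF, which a process typically gets when it uses a descriptor that is already closed or otherwise invalid. Running out of descriptors would normally surface as EMFILE ("Too many open files") instead, which may be why raising the limit made no difference. The Python sketch below only illustrates that distinction. The helper names are made up for this example and nothing in it touches Lotus.

```python
import errno
import os


def describe_os_error(code: int) -> str:
    """Return the symbolic name and the OS message for an error number."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


def read_from_closed_fd() -> int:
    """Provoke EBADF by reading from a descriptor that was already closed."""
    read_end, write_end = os.pipe()
    os.close(write_end)
    os.close(read_end)
    try:
        os.read(read_end, 1)  # the descriptor number is no longer valid
    except OSError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    # Error raised when a closed descriptor is used.
    print(describe_os_error(read_from_closed_fd()))
    # Error that hitting the open-file limit would normally raise.
    print(describe_os_error(errno.EMFILE))
```

If the two printed names differ on your machine, the message in the miner log points at an invalid descriptor rather than at the limit. That is only a hint about where to look, not a diagnosis.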
Replies: 1 comment 8 replies
Hi @shotcollin
Sorry to hear that; that's a lot of slashing.
I usually go one RAM stick at a time and boot up the machine
This will check the params, if they are already downloaded, and you should be able to get the "mismatched parameter cache file checksums" ERROR/WARN just by doing that. When you get the error, you have found the broken RAM…
I'm not able to reproduce the issue, so this is not Lotus software related; it is a hardware malfunction, a network problem, or incorrect configs, permissions, etc. I have encountered this in the past, and it turned out to be broken hardware every time. Moving this issue to Lotus Discussions for help and troubleshooting.
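The idea behind the stick-by-stick check can be sketched generically: hash the same large file several times and see whether the digest ever changes. The Python below is a minimal sketch of that idea only. It is not the Lotus parameter check, SHA-256 is an assumed stand-in for whatever digest Lotus uses, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    hasher = hashlib.sha256()  # assumed digest, chosen only for illustration
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def digests_are_stable(path: Path, rounds: int = 3) -> bool:
    """Hash the same file several times and report whether every digest matched."""
    return len({file_digest(path) for _ in range(rounds)}) == 1
```

A file that hashes differently from one round to the next has not changed on disk between reads, so the RAM, the disk, or the path between them becomes the suspect. A stable result does not prove the hardware is healthy.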