Getting intermittent stales with every version after 1.2.7 #9685

sobertram · 2021-12-28T16:52:24Z


I noticed that I get intermittent stales on every version after 1.2.7 and I am puzzled by what the issue might be. I have compared the time it takes for a proof to be found and there is no difference in the average. I know at some point I will not be able to continue using 1.2.7 and so I am hoping to find a solution before then.

It became a problem mainly during the dust storms. 1.2.7 behaves terribly during these events, and I read somewhere that the latest version handles them better. I installed 1.2.11, and indeed it worked better during the storm. But when the event was over, I noticed that I was still getting a few stales: I was operating at approximately 75% good to 25% stale. I left it for a few hours but this didn't change, so I switched back to 1.2.7 as a test and the stales went away. I then installed every later version except 1.2.9, and they all had intermittent stales. So I am assuming this is something that started in 1.2.8. But I see no errors or anything else in my debug logs that would point to the issue.

Here are the systems I am running. I have 2 harvesters. I will refer to the one running the blockchain as the farmer and the second as the harvester. Both harvesters produced stales whenever a 1.2.x release with x > 7 was installed. The chia version currently running is 1.2.7.

Farmer:

chia farm summary
Farming status: Farming
Total chia farmed: 0.5
User transaction fees: 0.0
Block rewards: 0.5
Last height farmed: 1013765
Local Harvester
   103 plots of size: 10.196 TiB
Remote Harvester for IP: 192.168.88.217
   1560 plots of size: 154.399 TiB
Plot count for all harvesters: 1663

Linux 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

   Speedtest by Ookla

     Server: Comcast - Boston, MA (id = 1774)
        ISP: Comcast Cable
    Latency:     8.28 ms   (1.08 ms jitter)
   Download:    93.85 Mbps (data used: 44.4 MB )
     Upload:    41.17 Mbps (data used: 31.7 MB )
Packet Loss: Not available.

top - 11:24:29 up 7 days,  3:42,  1 user,  load average: 0.09, 0.16, 0.17
Tasks: 219 total,   1 running, 218 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.3 us,  0.2 sy,  0.0 ni, 95.9 id,  0.5 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7847.4 total,    123.1 free,   2885.9 used,   4838.4 buff/cache
MiB Swap:    976.0 total,    778.2 free,    197.8 used.   4634.3 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 208107 user  20   0 1623096 978.9m  19768 S  14.0  12.5   8:38.84 chia_full_node
 207982 user  20   0  206232  55048  16672 S   0.3   0.7   0:04.56 chia_daemon
 208105 user  20   0 2204864  61060  17504 S   0.3   0.8   0:09.17 chia_harvester
 208106 user  20   0  362744  71440  17932 S   0.3   0.9   0:42.04 chia_farmer

lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           60
Model name:                      Intel(R) Core(TM) i3-4330 CPU @ 3.50GHz
Stepping:                        3
CPU MHz:                         883.515
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6983.27
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        4 MiB
NUMA node0 CPU(s):               0-3

lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000cfffffff  3.3G online       yes  0-25
0x0000000100000000-0x000000022fffffff  4.8G online       yes 32-69

Memory block size:       128M
Total online memory:       8G
Total offline memory:      0B

Harvester:

Sample find times from logs:
2021-12-28T11:29:14.239 harvester chia.harvester.harvester: INFO     4 plots were eligible for farming f3ad6c4892... Found 1 proofs. Time: 0.37120 s. Total 1560 plots
2021-12-28T11:32:04.437 harvester chia.harvester.harvester: INFO     3 plots were eligible for farming f3ad6c4892... Found 2 proofs. Time: 1.00212 s. Total 1560 plots
2021-12-28T11:32:22.897 harvester chia.harvester.harvester: INFO     3 plots were eligible for farming f3ad6c4892... Found 1 proofs. Time: 0.84253 s. Total 1560 plots
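
To compare lookup times between versions more systematically than by eye, the harvester lines above could be summarized with a short script. This is only a sketch: the regular expression is inferred from the three sample lines shown here, the function names are made up for illustration, and the 5-second `slow_threshold` is an assumed cutoff rather than a value taken from this thread.

```python
import re
from statistics import mean

# Pattern inferred from the three harvester INFO lines quoted above;
# it is an assumption about the log format, not taken from the Chia source.
LOOKUP_RE = re.compile(
    r"(?P<eligible>\d+) plots were eligible for farming \S+ "
    r"Found (?P<proofs>\d+) proofs\. Time: (?P<secs>\d+\.\d+) s\. "
    r"Total (?P<total>\d+) plots"
)


def lookup_times(lines):
    """Return the lookup time in seconds for every matching log line."""
    times = []
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match:
            times.append(float(match.group("secs")))
    return times


def summarize(times, slow_threshold=5.0):
    """Summarize lookup times; slow_threshold is a hypothetical cutoff."""
    if not times:
        return {"count": 0, "mean": None, "max": None, "slow": 0}
    return {
        "count": len(times),
        "mean": mean(times),
        "max": max(times),
        "slow": sum(1 for t in times if t > slow_threshold),
    }


# The three sample lines from the post above.
SAMPLE = [
    "2021-12-28T11:29:14.239 harvester chia.harvester.harvester: INFO     "
    "4 plots were eligible for farming f3ad6c4892... Found 1 proofs. "
    "Time: 0.37120 s. Total 1560 plots",
    "2021-12-28T11:32:04.437 harvester chia.harvester.harvester: INFO     "
    "3 plots were eligible for farming f3ad6c4892... Found 2 proofs. "
    "Time: 1.00212 s. Total 1560 plots",
    "2021-12-28T11:32:22.897 harvester chia.harvester.harvester: INFO     "
    "3 plots were eligible for farming f3ad6c4892... Found 1 proofs. "
    "Time: 0.84253 s. Total 1560 plots",
]

print(summarize(lookup_times(SAMPLE)))
```

Running the same summary over a full debug log from 1.2.7 and from a later version would show whether the maximum (not just the average) lookup time changes between them.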

Linux 5.4.0-81-generic #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

   Speedtest by Ookla

     Server: Comcast - Boston, MA (id = 1774)
        ISP: Comcast Cable
    Latency:     7.25 ms   (0.90 ms jitter)
   Download:    87.96 Mbps (data used: 85.7 MB )
     Upload:    40.60 Mbps (data used: 18.4 MB )
Packet Loss: Not available.

top - 11:39:44 up 122 days, 22:13,  1 user,  load average: 0.00, 0.00, 0.00
Tasks: 228 total,   1 running, 227 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.5 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15854.8 total,    175.5 free,    436.1 used,  15243.3 buff/cache
MiB Swap:   4096.0 total,   4091.2 free,      4.8 used.  15079.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1206544 user 20   0 4353668  90800  18212 S   1.3   0.6   0:47.32 chia_harvester

lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i3-10100 CPU @ 3.60GHz
Stepping:                        3
CPU MHz:                         800.688
CPU max MHz:                     4300.0000
CPU min MHz:                     800.0000
BogoMIPS:                        7200.00
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7

lsmem
RANGE                                  SIZE  STATE REMOVABLE  BLOCK
0x0000000000000000-0x000000008fffffff  2.3G online       yes   0-17
0x0000000100000000-0x000000046fffffff 13.8G online       yes 32-141

Memory block size:       128M
Total online memory:      16G
Total offline memory:      0B

Looking for help to debug this issue and hopefully solve it. Thanks.

Jacek-ghub · 2022-01-04T06:23:41Z


There are potentially two sources of stales. One is something wrong with the HDs; the other is the farmer (full node) being overwhelmed with network traffic / peer handling. Since you compared your proof timings and found no differences, that points to an overwhelmed farmer. And by "overwhelmed" farmer I don't really mean a slow CPU or a slow drive holding the blockchain, but rather poorly written synchronization between the peer and db processes that goes out of whack.

Basically, in v1.2.8, the network protocol / db synchronization was partially broken. Patches that were added in v1.2.11 didn't address the core problems.

As your CPUs are on the rather weak side, I would drop your peer count down to 10 peers (config.yaml -> full_node / target_peer_count: 10). That will reduce both the network load and the blockchain db r/w. I would hope that with that change, your stales will go down to below 0.1%. I would not go (much) below 10 peers, as the fewer peers you have, the more likely your node is to end up with "slow" peers. There is no magic behind either 10 or 80; 10 just reduces the node's resource requirements while still keeping the node afloat and useful to the network.
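
For anyone who prefers to script that change rather than edit the file by hand, here is a minimal sketch. It assumes `target_peer_count` sits as an indented key under a top-level `full_node:` section, as the path above suggests. The helper name and the sample text are hypothetical, and the edit is line-based on purpose so that comments and key order in the file are left alone.

```python
import re


def set_full_node_peer_count(config_text, count):
    """Return config_text with full_node / target_peer_count set to count.

    Only the first target_peer_count line inside the top-level
    `full_node:` section is changed; other sections are untouched.
    """
    out = []
    in_full_node = False
    changed = False
    for line in config_text.splitlines(keepends=True):
        # A non-indented "key:" line starts a new top-level section.
        top = re.match(r"^([A-Za-z_][\w-]*):", line)
        if top:
            in_full_node = top.group(1) == "full_node"
        elif in_full_node and not changed:
            m = re.match(r"^(\s+target_peer_count:\s*)\d+(.*)$", line, re.DOTALL)
            if m:
                # Keep indentation and anything after the number (comment, newline).
                line = f"{m.group(1)}{count}{m.group(2)}"
                changed = True
        out.append(line)
    if not changed:
        raise ValueError("full_node / target_peer_count not found")
    return "".join(out)


# Hypothetical excerpt for illustration only, not a real config.yaml.
SAMPLE = (
    "full_node:\n"
    "  port: 8444\n"
    "  target_peer_count: 80\n"
    "wallet:\n"
    "  target_peer_count: 3\n"
)

print(set_full_node_peer_count(SAMPLE, 10))
```

To apply it, one would read config.yaml into a string, pass it through the function, and write the result back after keeping a backup copy of the original file.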

0 replies

sobertram · 2022-01-04T15:44:29Z

Author

Excellent, glad it's something tangible. I was losing my mind a bit on this one. I may try switching the roles of the two servers. The CPU on the secondary node is better and may solve this problem without sacrificing peer count. Will try a few things and respond. Thanks.


1 reply

Jacek-ghub Jan 4, 2022

I would just drop the number of peers to 10 on your current full node, and that should do the trick.

There is no magic in the number of connected nodes, so I would not really look at that as "sacrificing peer count." Basically, you need to do manually what the installer should be doing in the first place (not using a Threadripper approach for every node), and what the code should be doing automatically when there are problems on the network (dust storms).

By the way, I was running on v1.2.6 for quite some time and didn't have problems with stales. About two weeks ago, I updated to v1.2.10 (just couldn't update to v1.2.11 for some reason) and again didn't have any stales. Yesterday, I finally upgraded to v1.2.11, and my ChiaDog complains about some missing signage points (which basically never happened before). Also, the number of stales went up by 2x-3x (it is in the range of 0.2% right now; it was below 0.1% before).
