Closed
Labels
bug (Something isn't working)
Description
Internal/External
External otherwise.
Area
Other Stakepool node operation
Summary
About 48 hours after epoch start, when the rewards calculation takes place, the network slows to a halt for about 2.5 minutes. This was fixed in 1.21.1 but returned with 1.23.0.
Steps to reproduce
Steps to reproduce the behavior:
- Run pool on 1.23.0
- Wait until epoch switch + 48 hours
  In the worst case, have a slot within this time window + 2.5 minutes, as I did:
  12 | leader | - | 15897752 | 172952 | 2020-12-08 22:47:23 CET
- Look at the CPU graph and block height graph
  In the worst case, notice a missing block: the BP will produce it, but it will fail to propagate in time and get lost
- See error
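To check whether a scheduled slot falls inside the affected window, the arithmetic can be sketched as below. This is only an illustration: it assumes 1-second slots (as on Shelley-era mainnet) and that the fifth column of the leader schedule row above (172952) is the slot within the epoch. The function names are hypothetical.

```python
# Assumption: Shelley-era mainnet slot length of 1 second.
SLOT_LENGTH_SECONDS = 1
# Rewards calculation starts at epoch start + 48 hours (per the report above).
REWARD_CALC_OFFSET_SLOTS = 48 * 60 * 60 // SLOT_LENGTH_SECONDS


def seconds_after_reward_calc_start(slot_in_epoch: int) -> int:
    """Seconds between the 48h mark and the given slot (negative if earlier)."""
    return (slot_in_epoch - REWARD_CALC_OFFSET_SLOTS) * SLOT_LENGTH_SECONDS


def in_stall_window(slot_in_epoch: int, stall_seconds: int) -> bool:
    """True if the slot lies within `stall_seconds` after the 48h mark."""
    offset = seconds_after_reward_calc_start(slot_in_epoch)
    return 0 <= offset < stall_seconds


if __name__ == "__main__":
    slot = 172952  # slot-in-epoch taken from the leader schedule row above
    print(seconds_after_reward_calc_start(slot))
    print(in_stall_window(slot, stall_seconds=180))
```

The stall duration is passed in explicitly because the observed pause varies (about 2.5 minutes in this report).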
Expected behavior
A node should sail through the reward calculation on 1.23.0 as it did on 1.21.1, without block production stopping.
System info (please complete the following information):
- OS: Ubuntu
- Version: 20.04.1 LTS
- Node version:
cardano-node 1.23.0 - linux-x86_64 - ghc-8.10
git rev eed2505
Screenshots and attachments
Shows the BP did wake up; compares non-leader to leader count:
CPU spike on all nodes (BPs and Relays)
Stuck blockchain for 2.5 minutes:
Additional context
See old issue which was fixed before
glitch40 commentedon Dec 9, 2020
Have also noticed loss of incoming peers and pauses in transactions at the 48h mark, as we experienced previously. Nodes that have been updated to 1.23 show increasing instability on relays the longer they run: in my experience, after 15h or more the nodes start to "lag" and lose incoming connections, and this keeps getting worse until they are restarted.
AndrewWestberg commentedon Dec 9, 2020
papacarp commentedon Dec 9, 2020
About 20% of reporters dropped out (200+ reporters right now)
papacarp commentedon Feb 7, 2021
Bug still there in 1.25.1. If I understand this correctly, it's related to reward processing. Won't this just keep getting worse as we get more wallets staking?
gitmachtl commentedon Feb 7, 2021
This is getting worse; we have to address this ASAP! 6 minutes without blocks and 100% usage.
MarcelKlammer commentedon Feb 7, 2021
Yep, could we please do something about this heavy calculation?
JaredCorduan commentedon Feb 8, 2021
Yes, this is almost certainly due to the reward calculation. We are actively testing out some options (such as this) so that we can address this soon.
AndrewWestberg commentedon Feb 9, 2021
Hopefully, the epoch cutover calculations can be similarly spread out with the pulse technique.
JaredCorduan commentedon Feb 9, 2021
Indeed, that's on our radar as well (it's the stake distribution calculation needed for the snapshot that is consuming the resources). The pulse technique is certainly one candidate solution.
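For readers unfamiliar with the idea, the general shape of a "pulsed" computation can be sketched as follows: instead of aggregating the whole stake distribution in one blocking step, the work is split into small chunks and control is handed back between chunks. This is a generic illustration only, not the cardano-ledger implementation; all names (`pulsed_stake_by_pool`, `run_with_pulses`, the dictionary layout) are hypothetical.

```python
from typing import Callable, Dict, Iterator


def pulsed_stake_by_pool(
    delegations: Dict[str, str],  # hypothetical: account -> pool id
    stake: Dict[str, int],        # hypothetical: account -> staked amount
    chunk_size: int,
) -> Iterator[Dict[str, int]]:
    """Aggregate stake per pool chunk by chunk, yielding after each chunk."""
    totals: Dict[str, int] = {}
    items = list(delegations.items())
    for start in range(0, len(items), chunk_size):
        for account, pool in items[start:start + chunk_size]:
            totals[pool] = totals.get(pool, 0) + stake.get(account, 0)
        # Hand control back to the caller; the totals are partial until
        # the generator is exhausted.
        yield totals


def run_with_pulses(
    pulser: Iterator[Dict[str, int]],
    between_pulses: Callable[[], None],
) -> Dict[str, int]:
    """Drive the pulser to completion, doing other work between chunks."""
    result: Dict[str, int] = {}
    for result in pulser:
        between_pulses()  # e.g. the node's regular per-block work
    return result


if __name__ == "__main__":
    delegations = {"a": "p1", "b": "p2", "c": "p1"}
    stake = {"a": 10, "b": 5, "c": 7}
    final = run_with_pulses(
        pulsed_stake_by_pool(delegations, stake, chunk_size=2),
        between_pulses=lambda: None,
    )
    print(final)
```

The total amount of work is unchanged; the point of the design is that no single step blocks for long, so other processing can continue in between.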
angelstakepool commentedon Feb 12, 2021
My relays are losing a considerable number of peers at the 48h mark. The network becomes really unstable at this time, and several SPOs (using decent hardware specs) have reported losing blocks because of this.
papacarp commentedon Mar 4, 2021
Red Alert!! 5 minutes without blocks today.
JaredCorduan commentedon Mar 4, 2021
Our fix for this has been merged in the ledger repo (IntersectMBO/cardano-ledger#2142), but we are still working on integrating the changes into the node.
mmahut commentedon Mar 9, 2021
Today, we have seen 10+ minutes.
Straightpool commentedon Apr 10, 2021
With just 22% of nodes upgraded to 1.26.1, the issue was greatly improved on the last run through this:
This was the epoch before, on 1.25.1:
I therefore consider this fixed and am closing the issue.