[BUG] - The return of the epoch + 48 hour rewards calculation freeze #2205

@Straightpool

Description

Internal/External
External

Area
Other Stakepool node operation

Summary
At epoch start + 48 hours, when the rewards calculation takes place, the network slows to a halt for about 2.5 minutes. This was fixed in 1.21.1 but returned in 1.23.0.

Steps to reproduce
Steps to reproduce the behavior:

  1. Run a pool on 1.23.0
  2. Wait until epoch switch + 48 hours
    In the worst case, have a leader slot scheduled within 2.5 minutes after this point, as I did:
    12 | leader | - | 15897752 | 172952 | 2020-12-08 22:47:23 CET
  3. Look at the CPU graph and the block height graph
    In the worst case, notice a missing block: the BP will produce it, but it will fail to propagate in time and get lost
  4. See error
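
To make the timing in step 2 concrete, here is a minimal sketch of how far past the 48-hour mark the reported leader slot fell. It assumes that the fifth column of the leader-log line (172952) is the slot within the epoch and that slots are 1 second long, as on Shelley-era mainnet. The function names and the 180-second window are hypothetical, chosen only for illustration.

```python
# Assumption: Shelley-era mainnet slots are 1 second long.
SLOT_LENGTH_SECONDS = 1
# Assumption from this report: the reward calculation starts 48 hours
# into the epoch.
REWARD_CALC_START_SLOT = 48 * 60 * 60 // SLOT_LENGTH_SECONDS  # 172800


def seconds_after_reward_calc_start(slot_in_epoch: int) -> int:
    """Seconds between the 48-hour mark and the slot (negative if earlier)."""
    return (slot_in_epoch - REWARD_CALC_START_SLOT) * SLOT_LENGTH_SECONDS


def in_freeze_window(slot_in_epoch: int, window_seconds: int) -> bool:
    """True if the slot falls within window_seconds after the 48-hour mark."""
    offset = seconds_after_reward_calc_start(slot_in_epoch)
    return 0 <= offset <= window_seconds


# Slot-in-epoch value taken from the leader-log line in step 2.
offset = seconds_after_reward_calc_start(172952)
print(offset)  # 152
```

Under these assumptions the slot came 152 seconds after the 48-hour mark, which is consistent with a stall of roughly 2.5 minutes.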

Expected behavior
A node should sail through the reward calculation on 1.23.0, as it did on 1.21.1, without block production stalling.

System info (please complete the following information):

  • OS: Ubuntu
  • Version: 20.04.1 LTS
  • Node version:
    cardano-node 1.23.0 - linux-x86_64 - ghc-8.10
    git rev eed2505

Screenshots and attachments
Shows that the BP did wake up; compares non-leader to leader count:
Screen Shot 2020-12-08 at 23 03 15
CPU spike on all nodes (BPs and Relays)
Screen Shot 2020-12-08 at 22 58 16
Stuck blockchain for 2.5 minutes:
Screen Shot 2020-12-08 at 22 56 15

Additional context
See old issue which was fixed before

Activity

glitch40 commented on Dec 9, 2020

I have also noticed loss of incoming peers and pauses in transactions at the 48h mark, as we experienced previously. In my experience, relays updated to 1.23 become increasingly unstable the longer they run: after 15 hours or more, nodes start to "lag" and lose incoming connections, and this keeps getting worse until they are restarted.

AndrewWestberg commented on Dec 9, 2020

Screenshot from 2020-12-08 22-16-20
Screenshot from 2020-12-08 22-16-51
Screenshot from 2020-12-08 22-17-08
Screenshot from 2020-12-08 22-17-21
Screenshot from 2020-12-08 22-17-34
Screenshot from 2020-12-08 22-17-47
Screenshot from 2020-12-08 22-17-56

papacarp commented on Dec 9, 2020

About 20% of reporters dropped out (200+ reporters right now)

image

papacarp commented on Feb 7, 2021

The bug is still there in 1.25.1. If I understand this correctly, it's related to reward processing. Won't this just keep getting worse as more wallets start staking?

image

gitmachtl (Contributor) commented on Feb 7, 2021

This is getting worse; we have to address this ASAP! 6 minutes without blocks and 100% usage.

MarcelKlammer commented on Feb 7, 2021

Yep, could we please do something about this heavy calculation?

JaredCorduan (Contributor) commented on Feb 8, 2021

Yes, this is almost certainly due to the reward calculation. We are actively testing out some options (such as this) so that we can address this soon.

AndrewWestberg commented on Feb 9, 2021

Hopefully, the epoch cutover calculations can be similarly spread out with the pulse technique.

JaredCorduan (Contributor) commented on Feb 9, 2021

> Hopefully, the epoch cutover calculations can be similarly spread out with the pulse technique.

Indeed, that's on our radar as well (it's the stake distribution calculation needed for the snapshot that is consuming the resources). The pulse technique is certainly one candidate solution.
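
For readers unfamiliar with the idea, the following is a minimal sketch of what "pulsing" a large calculation could look like: the work is split into fixed-size chunks so that only a small part runs per step, instead of everything at once. This is an illustration only and not the ledger's actual implementation. The function name `pulsed_rewards`, the `chunk_size` parameter, and the proportional reward formula are all hypothetical simplifications.

```python
from typing import Dict, Iterator


def pulsed_rewards(
    stake: Dict[str, int], total_reward: int, chunk_size: int
) -> Iterator[Dict[str, int]]:
    """Yield partial reward maps, one chunk of accounts per pulse.

    Toy formula (an assumption, not Cardano's real reward formula):
    each account gets a share of total_reward proportional to its stake.
    """
    total_stake = sum(stake.values())
    if total_stake == 0:
        return  # nothing to distribute
    accounts = sorted(stake)
    for start in range(0, len(accounts), chunk_size):
        chunk = accounts[start:start + chunk_size]
        # Only chunk_size accounts are processed before control returns
        # to the caller, which can do other work between pulses.
        yield {a: total_reward * stake[a] // total_stake for a in chunk}


# The caller drives the calculation one pulse at a time, for example
# one pulse per block or per slot (hypothetical scheduling).
rewards: Dict[str, int] = {}
pulses = 0
for partial in pulsed_rewards({"a": 50, "b": 30, "c": 20}, 1000, chunk_size=2):
    rewards.update(partial)
    pulses += 1

print(rewards, pulses)
```

The design point is that the total amount of work stays the same, but no single step holds the CPU long enough to stall block production or propagation.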

angelstakepool commented on Feb 12, 2021

My relays are losing a considerable number of peers at the 48h mark. The network is really unstable at this time, and several SPOs (with decent hardware specs) have reported losing blocks because of this.

papacarp commented on Mar 4, 2021

Red Alert!! 5 minutes without blocks today.

image

image

JaredCorduan (Contributor) commented on Mar 4, 2021

Our fix for this has been merged in the ledger repo (IntersectMBO/cardano-ledger#2142), but we are still working on integrating the changes into the node.

mmahut (Contributor) commented on Mar 9, 2021

Today, we have seen 10+ minutes.

Straightpool (Author) commented on Apr 10, 2021

With just 22% of nodes upgraded to 1.26.1, the issue was greatly improved at the most recent 48-hour mark:

image

This was the epoch before, on 1.25.1:

image

I therefore consider this fixed and am closing the issue.
