Skip to content

bug: 2nd validator node errors and does not recover #172

@Anmol1696

Description

@Anmol1696

Overview

If there is a bug on a validator node (after the genesis node), then it does not seem to recover and get into a state of CrashLoopBackOff, specially from postStartHook which performs the create-validator txn

Proposal

Inorder to make a robust setup, we need to make the nodes self-healing, using the primitives of k8s itself.

We can utilize the liveliness and readiness probes, to check the state and as well force validator nodes to restart properly.

Option 1: Clean start on failure

Delete ~/.<chain> after it fails

Option 2: PostStartHook fallback

Since we use postStartHook for registring the validator node, we can make the post startup hook more robust, and be aware of the failure

Problem

Validator node can be failing for multiple reasons, and one way of recovery can cause issues in other types of transient errors. We need a more robust way of recovering failing nodes.

Nodes can also be manually shut down, in that case the postStartHook should not run itself.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions