-
Notifications
You must be signed in to change notification settings - Fork 36
Description
Overview
If there is a bug on a validator node (after the genesis node), then it does not seem to recover and get into a state of CrashLoopBackOff
, specially from postStartHook
which performs the create-validator
txn
Proposal
Inorder to make a robust setup, we need to make the nodes self-healing, using the primitives of k8s itself.
We can utilize the liveliness and readiness probes, to check the state and as well force validator nodes to restart properly.
Option 1: Clean start on failure
Delete ~/.<chain>
after it fails
Option 2: PostStartHook fallback
Since we use postStartHook for registring the validator node, we can make the post startup hook more robust, and be aware of the failure
Problem
Validator node can be failing for multiple reasons, and one way of recovery can cause issues in other types of transient errors. We need a more robust way of recovering failing nodes.
Nodes can also be manually shut down, in that case the postStartHook
should not run itself.