duplicate validator protection #1915
Conversation
I've started drafting a design document that we can discuss in Discord, expand over time, and eventually share with our users: https://github.com/status-im/nimbus-eth2/wiki/Accidental-slashing-protection-techniques

A few points from the document relevant to this PR:
We also have to decide what would be the best default value for
Another option is whether both sides probe, or only one side; it just has to be at least one side probing. But if that could be guaranteed, then at some level none of this would be necessary to begin with, which is why I have it hard-coded to both-sides-probe for now. Or, conceivably, there are three, four, etc. beacon nodes validating with that validator. I didn't put the probability calculations in for that, or for the general case. It might be worth documenting that, though: the calculation motivating at least some of the variations of the two- or three-beacon-node cases, with overlapping-or-not assumptions.
Apropos "Upon quitting, a supervisor is likely to restart the process.": with the current exit code of 1, that's true. https://freedesktop.org/software/systemd/man/systemd.exec.html#Process%20exit%20codes and https://refspecs.linuxbase.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html document several other quasi-standard exit codes which might signal to systemd, or other supervisors, that this isn't an ordinary crash or other failure, but something they should avoid simply restarting.

It'd also be possible to add a "flag" file and check for its presence, but that seems to duplicate the job of a supervisor: if one isn't using a supervisor, it's not a problem, and if one is, the supervisor's own mechanism should be used. Since we're supporting mainly systemd, that's all we have to focus on currently.

https://manpages.debian.org/stretch/systemd/systemd.service.5.en.html has a useful table, "Table 1. Exit causes and the effect of the Restart= settings on them". "no" and "always" aren't interesting and don't depend on the exit code, and "on-watchdog" isn't relevant either. "on-failure" doesn't distinguish between kinds of failures, so it's (probably) not the best choice. I'd be looking at "on-abnormal" or "on-abort": catch the usual signal-based exits, but don't restart when an unclean (nonzero) exit code is returned.
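To make that concrete, a minimal unit-file sketch along those lines (the binary path, data directory, and exit code below are illustrative assumptions, not something this PR ships):

```ini
# Example only: restart after signals and similar abnormal terminations,
# but do not restart after a deliberate unclean (nonzero) exit such as
# a duplicate-validator shutdown.
[Service]
ExecStart=/usr/local/bin/nimbus_beacon_node --data-dir=/var/lib/nimbus
Restart=on-abnormal

# Alternative: keep Restart=on-failure, but exempt one dedicated exit code
# reserved for "do not restart me" situations.
#Restart=on-failure
#RestartPreventExitStatus=129
```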
Force-pushed 96328a1 to 163525b
To pick this up again: as written, it doesn't care whether the validator client is running as a separate process. I'm not sure how integrating with the slashing protection database would help? We don't care here about the content of the attestation, just the index of the validator sending it; it's not really a stateful operation in that way. There are edge cases where, if we had waited part of the probing period in question, crashed, and come back, we might not want to start waiting from the beginning again, but even then, I'd be wary of that.
What I meant is that the functionality won't work as expected when a validator client is used. The validator client has its own polling mechanism for determining its duties, and it selects on its own when attestations are broadcast. It won't execute
Sure, but it can't actually send out the replies if the beacon node has quit. So this only matters in the disable-validators case. For that, it looks like the most reasonable thing to do is filter the responses from
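A minimal sketch of that filtering idea, with simplified stand-in types rather than the actual nimbus-eth2 duty and validator types:

```nim
type
  Duty = object
    pubkey: string    # stand-in for ValidatorPubKey
    slot: uint64
  AttachedValidator = object
    pubkey: string
    disabled: bool    # set when a duplicate of this validator is detected

# Return only the duties whose validator has not been disabled, so a separate
# validator client is never handed work that could get it slashed.
proc activeDuties(duties: seq[Duty],
                  validators: seq[AttachedValidator]): seq[Duty] =
  for duty in duties:
    var disabled = false
    for v in validators:
      if v.pubkey == duty.pubkey and v.disabled:
        disabled = true
        break
    if not disabled:
      result.add duty
```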
873915e covers the validator client aspects and allows just disabling validators.
Force-pushed 096b5db to ff70138
00b3080 effectively feature-gates this by setting the default guard epochs to 0.
Rather than waiting N epochs, an option is to scan the last N epochs for attestations and compare them with the slashing protection database: if there are attestations in blocks that are not in the slashdb, something is wrong and we should then wait N epochs. The N-epoch wait generally solves the auto-restart problem: the app will keep detecting the other validator, which turns the restart into a harmless annoyance rather than anything else.

Regarding being selective about the validators and doing complicated things for partial recovery, this seems foolish from an economic point of view - trying to micro-optimize a tiny reward while risking a large misconfiguration penalty seems to be the wrong way to go - providing ways to / making it easier to monitor rewards seems overall like a better option.

Also, waiting 1 epoch with attestations is fairly cheap - waiting with block production less so, but it's also more risky: there's no random per-node delay like with attestations.
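A minimal sketch of the scan idea, with simplified stand-in types rather than the real chain DAG or slashing protection interfaces:

```nim
import std/sets

type
  Epoch = uint64
  PubKey = string                                  # stand-in for ValidatorPubKey
  AttRecord = tuple[pubkey: PubKey, epoch: Epoch]

# If recent blocks contain an attestation from one of our keys that the local
# slashing protection database never recorded, another instance must have
# produced it, so fall back to the full N-epoch guard wait.
proc guardWaitNeeded(ourKeys: HashSet[PubKey],
                     recentBlockAtts: seq[AttRecord],
                     slashingDb: seq[AttRecord]): bool =
  for att in recentBlockAtts:
    if att.pubkey in ourKeys and att notin slashingDb:
      return true
  false
```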
beacon_chain/beacon_node_types.nim (outdated):

```nim
    index*: Option[ValidatorIndex]

    # When a validator is detected elsewhere on the network
    disabled*: bool
```
I'm not at all convinced this is the right tradeoff - especially not at the introduction of the feature - this looks more like a micro-optimization than anything else. There is a large risk that more than one validator is affected when the setup in general is wrong, but the chance of detecting this condition in time is low. With multiple validators running on the same node, the chance of detection increases, especially for block production, so I'd say it's a lot more prudent to shut down processing completely in the face of this error.
beacon_chain/eth2_processor.nim (outdated):

```nim
    if self.validatorPool[].getValidator(validatorPubkey) !=
        default(AttachedValidator):
      self.validatorPool[].disableValidator(validatorPubkey)
      notice "Found another validator using same public key; would be slashed",
```
warn
Also, let's reuse language from the option: "Duplicate validator detected" or something along those lines - users should be taught to look out for this / set up alerts, etc.
We should also have a metric so it can be picked up by Grafana.
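For instance, with the nim-metrics library the beacon node already exposes Prometheus metrics through, a counter along these lines could be added (the counter name here is just an example, not something defined by this PR):

```nim
import metrics

# Example counter (name illustrative): incremented whenever another instance
# of an attached validator is detected, so alerts can be set up in Grafana.
declareCounter validator_duplicate_detections,
  "Number of times a duplicate of an attached validator was detected"

proc onDuplicateDetected*() =
  validator_duplicate_detections.inc()
```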
This feels too subtle - it's not something that's obvious. It feels more like an internal tuning parameter, and it's somewhat questionable whether it should be exposed to users at all. Instead, it would be nice to have a single
edit: er, oops, I didn't notice my comment above already highlighting some of the things here. I knew I had thought about it already :)

Regarding restart strategies, see https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart= - for example, we could recommend

For the UX, as mentioned above I'd probably like to see an option like:

I also wouldn't really promote the slot distance as a user option - I mean, we can keep it there for debugging and testing, but this feels like something for which we shouldn't have to provide an option of that sort.

Also, even if the service stops and then restarts, it's not really a problem - it should re-detect the faulty attestations during startup / sync (attestations are in a block but not in the validation protection database) and act accordingly.
…eth2_network_simulation and local network sim to disable protection
Yup, or there is also the
Something needs to be there to allow CI finalization tests and local simulations not to waste multiple epochs doing nothing.
confutils seems to strip out any explicit newlines and, generally, word-wraps the help text to the terminal width it detects. It could use:

desc: "What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future). warn: print a warning to the log. stop: stop the client."

But that's not much better.
?
Sure, that works well enough (it needs its trailing "." removed, because confutils automatically appends a period to all the descriptions).
This doesn't catch the case where two nodes, neither of which had been attesting or proposing (because both were down due to Internet issues, not running, or otherwise inaccessible), then start simultaneously. The wait-N-epochs scheme is meant to detect that.
Suggestions:
Explanatory message: We believe you are currently running another instance of the same validator. We've disconnected you from the network as this presents a significant slashing risk. Possible next steps are:
A. Make sure you've disconnected your validator from your old machine before restarting the client. If you've already done so, wait a little longer before restarting the client.
B. Run the client again with the
agree.
…message upon detection when enabled
It also adds the message explaining next actions on quitting. The default already is disabled (

I modified the phrasing of the explanatory message a bit. In general, and maybe this could/should change, I wanted to keep things consistent: (a) multiline messages with paragraph formatting aren't typical of the logging output and don't tend to be preserved in any useful way; and (b) the voice used hasn't typically had a "we [the software? the developers?] think X" kind of tone, so I'm a bit cautious there.
Rebased here: I've also introduced an enum type in the configuration to detect invalid values for the option.
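For illustration, the kind of enum that rejects invalid values might look like this (the names and the standalone parsing helper are assumptions for the sketch, not the actual configuration code from the rebase):

```nim
import std/strutils

type
  DoppelgangerAction* {.pure.} = enum
    warn = "warn"   # log a warning and keep validating
    stop = "stop"   # shut the client down

# parseEnum raises ValueError for anything other than "warn" or "stop",
# which is what turns an invalid command-line value into an early error.
proc parseDoppelgangerAction*(s: string): DoppelgangerAction =
  parseEnum[DoppelgangerAction](s)

when isMainModule:
  doAssert parseDoppelgangerAction("stop") == DoppelgangerAction.stop
```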
One take on #1834.
It's not that useful to detect the second or third time there's a slashing condition, because at that point one has already lost one's ETH or more (practically, all 32 ETH for phases 0/1). This provides an adjustable, probabilistic balance of asset risk.
It doesn't store anything to disk, because that'd only be useful for less, not more, conservative takes on this approach.
What remains is to remove/disable the specific local validators rather than shutting down the client as a whole, which I agree is a better approach.