
Conversation

@tersec
Contributor

@tersec tersec commented Oct 27, 2020

One take on #1834.

It's not that useful to detect the second or third time there's a slashing condition, because at that point one has already lost Gwei or more (practically, the whole 32 ETH stake for phases 0/1). This provides an adjustable, probabilistic balance of asset risk.

It doesn't store anything to disk, because that'd only be useful for less, not more, conservative takes on this approach.

What remains is to remove/disable specific local validators rather than shutting down the client as a whole, which I agree is the better approach.

@tersec tersec changed the title pre-emptive duplicate validator detection heuristic [WIP] pre-emptive duplicate validator detection heuristic Oct 27, 2020
@zah
Contributor

zah commented Nov 13, 2020

I've started drafting a design document that we can discuss in Discord, expand over time and eventually share with our users:

https://github.com/status-im/nimbus-eth2/wiki/Accidental-slashing-protection-techniques

A few points from the document relevant to this PR:

  • It would be good if the thinking here is extended to the scenario of running a beacon node with a validator client
  • You may consider possible integrations with the slashing protection database
  • It would be relatively trivial to add a disabled flag to the AttachedValidator object to implement the alternative strategy of not restarting the node.

We also have to decide what would be the best default value for selfSlashingDetectionEpochs for the initial mainnet release.

@tersec
Contributor Author

tersec commented Nov 13, 2020

Another option is whether both sides probe, or only one side. It just has to be at least one side probing. But if that could be guaranteed, then at some level none of this would be necessary to begin with; that's why I have it hard-coded to both-sides-probe for now.

Or, conceivably, there are three, four, etc. beacon nodes validating with that validator. I didn't put in the probability calculations for that, and the general case of $n$ validators, all started simultaneously, gets pretty unfortunate in terms of the selfSlashingDetectionEpochs required.

It might be worth documenting that, though: the calculation motivating at least some of the variations on the two- or three-beacon-node cases, with overlapping-or-not assumptions.

@tersec
Contributor Author

tersec commented Nov 13, 2020

Apropos "Upon quitting, a supervisor is likely to restart the process.", with the current exit code of 1, that's true. https://freedesktop.org/software/systemd/man/systemd.exec.html#Process%20exit%20codes and https://refspecs.linuxbase.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html document several other quasi-standard exit codes which might signal to systemd, or other supervisors, that it's not an ordinary crash or other failure, but something to avoid simply restarting.

It'd also be possible to add a "flag" file and check for its presence, but it seems like that's duplicating the job of a supervisor. If one's not using a supervisor, it's not a problem, and if one is, then the supervisor's approach should be used. Since we're supporting mainly systemd, that's all we have to focus on currently.

https://manpages.debian.org/stretch/systemd/systemd.service.5.en.html has a useful table, "Table 1. Exit causes and the effect of the Restart= settings on them". "no" and "always" aren't interesting and don't depend on exit code. "on-watchdog" isn't relevant either. "on-failure" doesn't distinguish between kinds of failure, so it (probably) isn't the best choice. I'd be looking at "on-abnormal" or "on-abort", which catch the usual signal-based exits but don't restart when an unclean (nonzero) exit code is returned.
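
As a rough sketch of what that would look like in a unit file (paths and flags here are purely illustrative, not our actual service definition):

```ini
# illustrative excerpt of a beacon node service
[Service]
ExecStart=/usr/local/bin/nimbus_beacon_node --data-dir=/var/lib/nimbus
# Restart after signal-based exits (SIGSEGV, SIGABRT, core dumps, watchdog
# timeouts), but not when the process exits on its own with a nonzero code,
# e.g. after detecting a duplicate validator.
Restart=on-abnormal
```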

@tersec
Contributor Author

tersec commented Nov 27, 2020

  • It would be good if the thinking here is extended to the scenario of running a beacon node with a validator client
  • You may consider possible integrations with the slashing protection database
  • It would be relatively trivial to add a disabled flag to the AttachedValidator object to implement the alternative strategy of not restarting the node.

We also have to decide what would be the best default value for selfSlashingDetectionEpochs for the initial mainnet release.

To pick this up again, it doesn't care, as written, whether the validator client's running as a separate process.

Not sure how integrating with the slashing protection database would help? We don't care here about the content of the attestation, just the index of the validator sending an attestation. It's not really a stateful operation in that way. There are edge cases around, say, having waited through part of the probing period, crashed, and come back: maybe we wouldn't start waiting from the beginning again, but even then I'd be wary of that.

@zah
Contributor

zah commented Nov 27, 2020

To pick this up again, it doesn't care, as written, whether the validator client's running as a separate process.

What I meant is that the functionality won't work as expected when a validator client is used. The validator client has its own polling mechanism for determining its duties and it selects on its own the time when attestations are broadcasted. It won't execute checkForPotentialSelfSlashing.

@tersec
Contributor Author

tersec commented Nov 29, 2020

What I meant is that the functionality won't work as expected when a validator client is used. The validator client has its own polling mechanism for determining its duties and it selects on its own the time when attestations are broadcasted. It won't execute checkForPotentialSelfSlashing.

Sure, but it can't actually send out the replies if the beacon node has quit. So this only matters in the disable-validators case. For that, it looks like the most reasonable thing to do is filter the responses from rpc/validator_api.

@tersec tersec changed the title [WIP] pre-emptive duplicate validator detection heuristic pre-emptive duplicate validator detection heuristic Nov 30, 2020
@tersec
Contributor Author

tersec commented Nov 30, 2020

873915e covers the validator client aspects and allows just disabling validators.

@tersec tersec force-pushed the ede branch 2 times, most recently from 096b5db to ff70138 on December 3, 2020 08:39
@tersec tersec changed the title pre-emptive duplicate validator detection heuristic duplicate validator protection Dec 3, 2020
@tersec
Contributor Author

tersec commented Dec 7, 2020

00b3080 effectively feature-gates this by setting the default guard epochs to 0.

@tersec tersec changed the base branch from devel to unstable December 17, 2020 07:36
@arnetheduck
Member

https://github.com/status-im/nimbus-eth2/wiki/Accidental-slashing-protection-techniques
...waiting N epochs...

Not sure how integrating with the slashing protection database would help?

rather than waiting N epochs, an option is to scan the last N epochs for attestations and compare them with the slashing protection database: if there are attestations in the blocks that are not in the slashdb, something is wrong and we should now wait N epochs.

The N-epoch wait generally solves the auto-restart problem: the app will keep detecting the other validator, which turns the restart into a harmless annoyance rather than anything else.

Regarding being selective about the validators and doing complicated things for partial recovery, this seems foolish from an economic point of view - trying to microoptimize a tiny reward but risking a large misconfiguration penalty seems to be the wrong way to go - providing ways / making it easier to monitor rewards seems overall like a better option.

Also, waiting 1 epoch with attestations is fairly cheap - waiting with block production less so, but more risky: there's no random per-node delay like with attestations.
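
A rough sketch of that scan (every helper name below is hypothetical, not an actual nimbus-eth2 API):

```nim
# Hypothetical sketch only: blocksFromLastEpochs, attestingIndicesOf,
# isAttachedValidator and slashingProtectionDb.contains are illustrative
# helpers, not the real nimbus-eth2 APIs.
proc recentAttestationsLookSuspicious(
    node: BeaconNode, lookbackEpochs: uint64): bool =
  result = false
  for blck in node.blocksFromLastEpochs(lookbackEpochs):
    for attestation in blck.message.body.attestations:
      for index in node.attestingIndicesOf(attestation):
        if node.isAttachedValidator(index) and
            not node.slashingProtectionDb.contains(index, attestation.data):
          # Someone else has signed with one of our keys: something is wrong,
          # so fall back to the N-epoch observation period (or refuse to
          # start validating) instead of proceeding immediately.
          return true
```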

index*: Option[ValidatorIndex]

# When a validator is detected elsewhere on the network
disabled*: bool
Member

I'm not at all convinced this is the right tradeoff - especially not at the introduction of the feature - this looks more like a microoptimization than anything else - there is a large risk that more than one validator is affected when the setup in general is wrong, but the chance of detecting this condition in time is low - with multiple validators running in the same node, the chance of detection increases - especially for block production, so I'd say it's a lot more prudent to shut down processing completely in the face of this error.

if self.validatorPool[].getValidator(validatorPubkey) !=
    default(AttachedValidator):
  self.validatorPool[].disableValidator(validatorPubkey)
  notice "Found another validator using same public key; would be slashed",
Member

warn

Member

also, let's reuse language from the option: Duplicate validator detected or something along those lines - users should be taught to look out for this / set up alerts etc.

We should also have a metric so it can be picked up by Grafana.
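
e.g. something along these lines, assuming the nim-metrics declareGauge used elsewhere in the codebase (metric name purely illustrative):

```nim
import metrics

# Illustrative metric name; would be set when a duplicate validator is
# detected so Grafana dashboards and alerts can pick it up.
declareGauge validator_duplicate_detected,
  "Set when another instance appears to be running the same validator keys"
```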

@arnetheduck
Member

00b3080 effectively feature-gates this by setting the default guard epochs to 0.

this feels too subtle - it's not something that's obvious - this feels more like an internal tuning parameter, and it's somewhat questionable whether it should be exposed to users at all...

instead, it would be nice to have a single --dup-protection=[warn|disable-validator|quit] option that we would initially set to warn as default, then to quit - we'd run it for a week or two on warn to see that it doesn't bomb, then we'd switch over to quit, while still allowing the user to disable it in case it turns out to be too aggressive for whatever reason - we'd then use a single observation period across the board, which is far easier to support

@arnetheduck
Member

arnetheduck commented Jan 8, 2021

edit: er, oops, I didn't notice my comment above already highlighting some of the things here. I knew I had thought about it already :)

Regarding restart strategies, systemd has many - it would be a matter of aligning our recommended systemd setup with whatever the service produces:

https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

For example, we could recommend Restart=on-failure, designate a particular return code to this error and then add it to SuccessExitStatus such that systemd treats it as a success - it's merely a matter of documenting it correctly.

For the UX, as mentioned above, I'd probably like to see an option like:

--duplicate-validator=[warn|disable|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)
  warn: print a warning to the log
  disable: disable the one validator but keep running (unsafe)
  stop: stop the client

I also wouldn't promote the slot distance as a user option, really - I mean, we can keep it there for debugging and testing, but this feels like something for which we shouldn't have to provide an option of that sort.

Also, even if the service stops and then restarts, it's not really a problem - it should re-detect the faulty attestations during startup / sync (attestations that are in blocks but not in the validation protection database) and act accordingly

@kdeme
Contributor

kdeme commented Jan 8, 2021

For example, we could recommend Restart=on-failure, designate a particular return code to this error and then add it to SuccessExitStatus such that systemd treats it as a success - it's merely a matter of documenting it correctly.

Yup, or there is also the RestartPreventExitStatus option which could be used: https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartPreventExitStatus=
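
A minimal sketch of either variant, assuming the node exits with a designated code on detection (198 here is purely illustrative):

```ini
[Service]
Restart=on-failure
# Variant A: treat the designated duplicate-validator exit code as success,
# so Restart=on-failure won't restart the service for it.
SuccessExitStatus=198
# Variant B: keep reporting it as a failure, but suppress the automatic
# restart for that specific exit code.
RestartPreventExitStatus=198
```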

@tersec
Contributor Author

tersec commented Jan 8, 2021

I also wouldn't promote the slot distance as a user option, really - I mean, we can keep it there for debugging and testing, but this feels like something for which we shouldn't have to provide an option of that sort.

Something needs to be there to allow CI finalization tests and local simulations not to waste multiple epochs doing nothing.

@tersec
Contributor Author

tersec commented Jan 8, 2021

this feels too subtle - it's not something that's obvious - this feels more like an internal tuning parameter, and it's somewhat questionable whether it should be exposed to users at all...

instead, it would be nice to have a single --dup-protection=[warn|disable-validator|quit] option that we would initially set to warn as default, then to quit - we'd run it for a week or two on warn to see that it doesn't bomb, then we'd switch over to quit, while still allowing the user to disable it in case it turns out to be too aggressive for whatever reason - we'd then use a single observation period across the board, which is far easier to support

4dee5c4

@tersec
Contributor Author

tersec commented Jan 8, 2021

need to list the valid options here though

confutils seems to strip out any explicit newlines and, generally, word-wraps the help text to the terminal width it detects. It could use:

  desc: "What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future). warn: print a warning to the log. stop: stop the client."

But that's not much better.

@arnetheduck
Member

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

@tersec
Contributor Author

tersec commented Jan 8, 2021

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

Sure, that works well enough (it needs its trailing "." removed, because confutils appends a period to all the descriptions automatically).

@tersec
Contributor Author

tersec commented Jan 8, 2021

rather than waiting N epochs, an option is to scan the last N epochs for attestations and compare them with the slashing protection database: if there are attestations in the blocks that are not in the slashdb, something is wrong and we should now wait N epochs.

This doesn't catch the simultaneous start of two nodes, neither of which had been attesting or proposing beforehand (because both were down due to Internet issues, not running, or otherwise inaccessible). The wait-N-epochs scheme is meant to detect that.

@tersec
Contributor Author

tersec commented Jan 8, 2021

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

Sure, that works well enough (it needs its trailing "." removed, because confutils appends a period to all the descriptions automatically).

8519b64
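
For reference, a rough sketch of how such an option might be declared with confutils (the type, flag name, and parsing helper below are illustrative, not necessarily what that commit does):

```nim
import confutils, strutils

type
  DuplicateValidatorAction* {.pure.} = enum
    warn = "warn"
    stop = "stop"

  ExampleConf* = object
    duplicateValidator* {.
      defaultValue: DuplicateValidatorAction.warn
      desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)"
      name: "duplicate-validator" }: DuplicateValidatorAction

# In case confutils doesn't parse this enum out of the box, a small overload
# keeps invalid values from getting past argument parsing.
proc parseCmdArg*(T: type DuplicateValidatorAction, arg: string): T =
  parseEnum[DuplicateValidatorAction](arg)

when isMainModule:
  let conf = ExampleConf.load()
  echo "duplicate validator action: ", conf.duplicateValidator
```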

@unixpi
Contributor

unixpi commented Jan 11, 2021

Suggestions:

  • let's call it additional-slashing-protection and make it clear it's an experimental feature

  • disabled by default for the time being (until we've tested it more thoroughly)

  • if enabled, it should always quit on detection with an explanatory message that clearly outlines possible next steps


Explanatory message

We believe you are currently running another instance of the same validator. We've disconnected you from the network as this presents a significant slashing risk.

Possible next steps are:

A. Make sure you've disconnected your validator from your old machine before restarting the client. If you've already done so, wait a little longer before restarting the client.

not sure if the last sentence is needed?

B. Run the client again with the additional-slashing-protection option disabled (only do this if you are absolutely sure this is the only instance of your validator client that is running) and report this false positive to us here: https://github.com/status-im/nimbus-eth2/issues.

@arnetheduck
Member

additional sounds weird though - additional on top of what, and what if we get one more feature doing slashing protection?

remote, network, p2p, gossip slashing protection could all work - ideally the word would embody its non-deterministic nature: it will detect the cases based somewhat on luck, on what's received over the wire, and in what order

@unixpi
Contributor

unixpi commented Jan 11, 2021

agree.

remote, network, p2p, gossip slashing protection could all work

gossip is the most specific here? If so, let's go with that

@tersec
Contributor Author

tersec commented Jan 11, 2021

agree.

remote, network, p2p, gossip slashing protection could all work

gossip is the most specific here? If so, let's go with that

4eea984

It also adds the message explaining next actions on quitting.

The default already is disabled (warn). dontcheck isn't meant for general use.

I modified the phrasing of the explanatory message a bit. In general (and maybe this could/should change) I wanted to keep things consistent: (a) multiline messages with paragraph formatting aren't typical of the logging output and don't tend to be preserved in any useful way; and (b) the voice used hasn't typically had a "we [the software? the developers?] think X" kind of tone, so I'm a bit cautious there.

@zah
Contributor

zah commented Jan 18, 2021

Rebased here:
#2244

I've also introduced an enum type in the configuration to detect invalid values for the option.

@zah zah closed this Jan 18, 2021
@tersec tersec deleted the ede branch January 21, 2021 17:36