
Conversation

@tersec
Contributor

@tersec tersec commented Oct 27, 2020

One take on #1834.

It's not that useful to detect the second or third time there's a slashing condition, because at that point one has already lost Gwei or more (practically, the whole 32 ETH stake for phases 0/1). This provides an adjustable, probabilistic balance of asset risk.

It doesn't store anything to disk, because that'd only be useful for less, not more, conservative takes on this approach.

What remains is to remove/disable specific local validators rather than shutting down the client as a whole, which I agree is the better approach.

@tersec tersec changed the title pre-emptive duplicate validator detection heuristic [WIP] pre-emptive duplicate validator detection heuristic Oct 27, 2020
@zah
Contributor

zah commented Nov 13, 2020

I've started drafting a design document that we can discuss in Discord, expand over time and eventually share with our users:

https://github.com/status-im/nimbus-eth2/wiki/Accidental-slashing-protection-techniques

A few points from the document relevant to this PR:

  • It would be good if the thinking here is extended to the scenario of running a beacon node with a validator client
  • You may consider possible integrations with the slashing protection database
  • It would be relatively trivial to add a disabled flag to the AttachedValidator object to implement the alternative strategy of not restarting the node.

We also have to decide what would be the best default value for selfSlashingDetectionEpochs for the initial mainnet release.

@tersec
Contributor Author

tersec commented Nov 13, 2020

Another option is whether both sides probe, or only one side. It just has to be at least one side probing. But if that could be guaranteed, then at some level none of this would be necessary to begin with; that's why I have it hard-coded to both-sides-probe for now.

Or, conceivably, there are three, four, etc. beacon nodes validating with that validator. I didn't put in the probability calculations for that, and the general case of $n$ validators, all started simultaneously, gets pretty unfortunate in terms of the selfSlashingDetectionEpochs required.

It might be worth documenting that, though: the calculation motivating at least some of the variations on the two- or three-beacon-node cases, with overlapping-or-not assumptions.

@tersec
Contributor Author

tersec commented Nov 13, 2020

Apropos "Upon quitting, a supervisor is likely to restart the process.", with the current exit code of 1, that's true. https://freedesktop.org/software/systemd/man/systemd.exec.html#Process%20exit%20codes and https://refspecs.linuxbase.org/LSB_5.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html document several other quasi-standard exit codes which might signal to systemd, or other supervisors, that it's not an ordinary crash or other failure, but something to avoid simply restarting.

It'd also be possible to add a "flag" file and check for its presence, but it seems like that's duplicating the job of a supervisor. If one's not using a supervisor, it's not a problem, and if one is, then the supervisor's approach should be used. Since we're supporting mainly systemd, that's all we have to focus on currently.

https://manpages.debian.org/stretch/systemd/systemd.service.5.en.html has a useful table, "Table 1. Exit causes and the effect of the Restart= settings on them". "no" and "always" aren't interesting and don't depend on exit code. "on-watchdog" isn't relevant either. "on-failure" doesn't distinguish between kinds of failure, so it (probably) isn't the best choice. I'd be looking at "on-abnormal" or "on-abort", which catch the usual signal-based exits but don't restart when an unclean (nonzero) exit code is returned.
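
As a rough sketch of what that would look like in a unit file (paths and flags here are purely illustrative, not our actual service definition):

```ini
# illustrative excerpt of a beacon node service
[Service]
ExecStart=/usr/local/bin/nimbus_beacon_node --data-dir=/var/lib/nimbus
# Restart after signal-based exits (SIGSEGV, SIGABRT, core dumps, watchdog
# timeouts), but not when the process exits on its own with a nonzero code,
# e.g. after detecting a duplicate validator.
Restart=on-abnormal
```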

@tersec
Contributor Author

tersec commented Nov 27, 2020

  • It would be good if the thinking here is extended to the scenario of running a beacon node with a validator client
  • You may consider possible integrations with the slashing protection database
  • It would be relatively trivial to add a disabled flag to the AttachedValidator object to implement the alternative strategy of not restarting the node.

We also have to decide what would be the best default value for selfSlashingDetectionEpochs for the initial mainnet release.

To pick this up again, it doesn't care, as written, whether the validator client's running as a separate process.

Not sure how integrating with the slashing protection database would help? We don't care here about the content of the attestation, just the index of the validator sending an attestation. It's not really a stateful operation in that way. There are edge cases around, say, having waited through part of the probing period, crashed, and come back: maybe we wouldn't start waiting from the beginning again, but even then I'd be wary of that.

@zah
Contributor

zah commented Nov 27, 2020

To pick this up again, it doesn't care, as written, whether the validator client's running as a separate process.

What I meant is that the functionality won't work as expected when a validator client is used. The validator client has its own polling mechanism for determining its duties and it selects on its own the time when attestations are broadcasted. It won't execute checkForPotentialSelfSlashing.

@tersec
Contributor Author

tersec commented Nov 29, 2020

What I meant is that the functionality won't work as expected when a validator client is used. The validator client has its own polling mechanism for determining its duties and it selects on its own the time when attestations are broadcasted. It won't execute checkForPotentialSelfSlashing.

Sure, but it can't actually send out the replies if the beacon node has quit. So this only matters in the disable-validators case. For that, it looks like the most reasonable thing to do is filter the responses from rpc/validator_api.

@tersec tersec changed the title [WIP] pre-emptive duplicate validator detection heuristic pre-emptive duplicate validator detection heuristic Nov 30, 2020
@tersec
Contributor Author

tersec commented Nov 30, 2020

873915e covers the validator client aspects and allows just disabling validators.

@tersec tersec force-pushed the ede branch 2 times, most recently from 096b5db to ff70138 on December 3, 2020 08:39
@tersec tersec changed the title pre-emptive duplicate validator detection heuristic duplicate validator protection Dec 3, 2020
@tersec
Contributor Author

tersec commented Dec 7, 2020

00b3080 effectively feature-gates this by setting the default guard epochs to 0.

@tersec tersec changed the base branch from devel to unstable December 17, 2020 07:36
@arnetheduck
Member

https://github.com/status-im/nimbus-eth2/wiki/Accidental-slashing-protection-techniques
...waiting N epochs...

Not sure how integrating with the slashing protection database would help?

rather than waiting N epochs, an option is to scan the last N epochs for attestations and compare them with the slashing protection database: if there are attestations in the blocks that are not in the slashdb, something is wrong and we should now wait N epochs.

The N-epoch wait generally solves the auto-restart problem: the app will keep detecting the other validator, which turns the restart into a harmless annoyance rather than anything else.

Regarding being selective about the validators and doing complicated things for partial recovery, this seems foolish from an economic point of view - trying to microoptimize a tiny reward but risking a large misconfiguration penalty seems to be the wrong way to go - providing ways / making it easier to monitor rewards seems overall like a better option.

Also, waiting 1 epoch with attestations is fairly cheap - waiting with block production less so, but more risky: there's no random per-node delay like with attestations.
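
A rough sketch of that scan (every helper name below is hypothetical, not an actual nimbus-eth2 API):

```nim
# Hypothetical sketch only: blocksFromLastEpochs, attestingIndicesOf,
# isAttachedValidator and slashingProtectionDb.contains are illustrative
# helpers, not the real nimbus-eth2 APIs.
proc recentAttestationsLookSuspicious(
    node: BeaconNode, lookbackEpochs: uint64): bool =
  result = false
  for blck in node.blocksFromLastEpochs(lookbackEpochs):
    for attestation in blck.message.body.attestations:
      for index in node.attestingIndicesOf(attestation):
        if node.isAttachedValidator(index) and
            not node.slashingProtectionDb.contains(index, attestation.data):
          # Someone else has signed with one of our keys: something is wrong,
          # so fall back to the N-epoch observation period (or refuse to
          # start validating) instead of proceeding immediately.
          return true
```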

index*: Option[ValidatorIndex]

# When a validator is detected elsewhere on the network
disabled*: bool
Member

I'm not at all convinced this is the right tradeoff - especially not at the introduction of the feature - this looks more like a microoptimization than anything else - there is a large risk that more than one validator is affected when the setup in general is wrong, but the chance of detecting this condition in time is low - with multiple validators running in the same node, the chance of detection increases - especially for block production, so I'd say it's a lot more prudent to shut down processing completely in the face of this error.

if self.validatorPool[].getValidator(validatorPubkey) !=
    default(AttachedValidator):
  self.validatorPool[].disableValidator(validatorPubkey)
  notice "Found another validator using same public key; would be slashed",
Member

warn

Member

also, let's reuse language from the option: Duplicate validator detected or something along those lines - users should be taught to look out for this / set up alerts etc.

We should also have a metric so it can be picked up by Grafana.
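
e.g. something along these lines, assuming the nim-metrics declareGauge used elsewhere in the codebase (metric name purely illustrative):

```nim
import metrics

# Illustrative metric name; would be set when a duplicate validator is
# detected so Grafana dashboards and alerts can pick it up.
declareGauge validator_duplicate_detected,
  "Set when another instance appears to be running the same validator keys"
```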

@arnetheduck
Member

00b3080 effectively feature-gates this by setting the default guard epochs to 0.

this feels too subtle - it's not something that's obvious - this feels more like an internal tuning parameter, and it's somewhat questionable whether it should be exposed to users at all...

instead, it would be nice to have a single --dup-protection=[warn|disable-validator|quit] option that we would initially set to warn as default, then to quit - we'd run it for a week or two on warn to see that it doesn't bomb, then we'd switch over to quit, while still allowing the user to disable it in case it turns out to be too aggressive for whatever reason - we'd then use a single observation period across the board, which is far easier to support

@arnetheduck
Member

arnetheduck commented Jan 8, 2021

edit: er, oops, I didn't notice my comment above already highlighting some of the things here. I knew I had thought about it already :)

Regarding restart strategies, systemd has many - it would be a matter of aligning our recommended systemd setup with whatever the service produces:

https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

For example, we could recommend Restart=on-failure, designate a particular return code to this error and then add it to SuccessExitStatus such that systemd treats it as a success - it's merely a matter of documenting it correctly.

For the UX, as mentioned above, I'd probably like to see an option like:

--duplicate-validator=[warn|disable|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)
  warn: print a warning to the log
  disable: disable the one validator but keep running (unsafe)
  stop: stop the client

I also wouldn't promote the slot distance as a user option, really - I mean, we can keep it there for debugging and testing, but this feels like something for which we shouldn't have to provide an option of that sort.

Also, even if the service stops and then restarts, it's not really a problem - it should re-detect the faulty attestations during startup / sync (attestations that are in blocks but not in the validation protection database) and act accordingly

@kdeme
Contributor

kdeme commented Jan 8, 2021

For example, we could recommend Restart=on-failure, designate a particular return code to this error and then add it to SuccessExitStatus such that systemd treats it as a success - it's merely a matter of documenting it correctly.

Yup, or there is also the RestartPreventExitStatus option which could be used: https://www.freedesktop.org/software/systemd/man/systemd.service.html#RestartPreventExitStatus=
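
A minimal sketch of either variant, assuming the node exits with a designated code on detection (198 here is purely illustrative):

```ini
[Service]
Restart=on-failure
# Variant A: treat the designated duplicate-validator exit code as success,
# so Restart=on-failure won't restart the service for it.
SuccessExitStatus=198
# Variant B: keep reporting it as a failure, but suppress the automatic
# restart for that specific exit code.
RestartPreventExitStatus=198
```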

@tersec
Contributor Author

tersec commented Jan 8, 2021

I also wouldn't promote the slot distance as a user option, really - I mean, we can keep it there for debugging and testing, but this feels like something for which we shouldn't have to provide an option of that sort.

Something needs to be there to allow CI finalization tests and local simulations not to waste multiple epochs doing nothing.

@tersec
Contributor Author

tersec commented Jan 8, 2021

this feels too subtle - it's not something that's obvious - this feels more like an internal tuning parameter, and it's somewhat questionable whether it should be exposed to users at all...

instead, it would be nice to have a single --dup-protection=[warn|disable-validator|quit] option that we would initially set to warn as default, then to quit - we'd run it for a week or two on warn to see that it doesn't bomb, then we'd switch over to quit, while still allowing the user to disable it in case it turns out to be too aggressive for whatever reason - we'd then use a single observation period across the board, which is far easier to support

4dee5c4

@tersec
Contributor Author

tersec commented Jan 8, 2021

need to list the valid options here though

confutils seems to strip out any explicit newlines and, generally, word-wraps the help text to the terminal width it detects. It could use:

  desc: "What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future). warn: print a warning to the log. stop: stop the client."

But that's not much better.

@arnetheduck
Member

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

@tersec
Contributor Author

tersec commented Jan 8, 2021

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

Sure, that works well enough (it needs its trailing "." removed, because confutils appends a period to all the descriptions automatically).

@tersec
Contributor Author

tersec commented Jan 8, 2021

rather than waiting N epochs, an option is to scan the last N epochs for attestations and compare them with the slashing protection database: if there are attestations in the blocks that are not in the slashdb, something is wrong and we should now wait N epochs.

This doesn't catch the simultaneous start of two nodes, neither of which had been attesting or proposing beforehand (because both were down due to Internet issues, not running, or otherwise inaccessible). The wait-N-epochs scheme is meant to detect that.

@tersec
Contributor Author

tersec commented Jan 8, 2021

desc: "What to do when another validator is detected to be running the same validator keys (default warn, will become stop in the future). warn: print a warning to the log. stop: stop the client."

  desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)."

?

Sure, that works well enough (it needs its trailing "." removed, because confutils appends a period to all the descriptions automatically).

8519b64
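
For reference, a rough sketch of how such an option might be declared with confutils (the type, flag name, and parsing helper below are illustrative, not necessarily what that commit does):

```nim
import confutils, strutils

type
  DuplicateValidatorAction* {.pure.} = enum
    warn = "warn"
    stop = "stop"

  ExampleConf* = object
    duplicateValidator* {.
      defaultValue: DuplicateValidatorAction.warn
      desc: "[=warn*|stop] What to do when another validator is detected to be running the same validator keys (default `warn`, will become `stop` in the future)"
      name: "duplicate-validator" }: DuplicateValidatorAction

# In case confutils doesn't parse this enum out of the box, a small overload
# keeps invalid values from getting past argument parsing.
proc parseCmdArg*(T: type DuplicateValidatorAction, arg: string): T =
  parseEnum[DuplicateValidatorAction](arg)

when isMainModule:
  let conf = ExampleConf.load()
  echo "duplicate validator action: ", conf.duplicateValidator
```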

@unixpi
Contributor

unixpi commented Jan 11, 2021

Suggestions:

  • let's call it additional-slashing-protection and make it clear it's an experimental feature

  • disabled by default for the time being (until we've tested it more thoroughly)

  • if enabled, it should always quit on detection with an explanatory message that clearly outlines possible next steps


Explanatory message

We believe you are currently running another instance of the same validator. We've disconnected you from the network as this presents a significant slashing risk.

Possible next steps are:

A. Make sure you've disconnected your validator from your old machine before restarting the client. If you've already done so, wait a little longer before restarting the client.

not sure if the last sentence is needed?

B. Run the client again with the additional-slashing-protection option disabled (only do this if you are absolutely sure this is the only instance of your validator client that is running) and report this false positive to us here: https://github.com/status-im/nimbus-eth2/issues.

@arnetheduck
Member

additional sounds weird though - additional on top of what, and what if we get one more feature doing slashing protection?

remote, network, p2p, gossip slashing protection could all work - ideally the word would embody its non-deterministic nature: it will detect the cases based somewhat on luck, on what's received over the wire, and in what order

@unixpi
Contributor

unixpi commented Jan 11, 2021

agree.

remote, network, p2p, gossip slashing protection could all work

gossip is the most specific here? If so, let's go with that

@tersec
Contributor Author

tersec commented Jan 11, 2021

agree.

remote, network, p2p, gossip slashing protection could all work

gossip is the most specific here? If so, let's go with that

4eea984

It also adds the message explaining next actions on quitting.

The default already is disabled (warn). dontcheck isn't meant for general use.

I modified the phrasing of the explanatory message a bit. In general (and maybe this could/should change) I wanted to keep things consistent: (a) multiline messages with paragraph formatting aren't typical of the logging output and don't tend to be preserved in any useful way; and (b) the voice used hasn't typically had a "we [the software? the developers?] think X" kind of tone, so I'm a bit cautious there.

@zah
Contributor

zah commented Jan 18, 2021

Rebased here:
#2244

I've also introduced an enum type in the configuration to detect invalid values for the option.

@zah zah closed this Jan 18, 2021
@tersec tersec deleted the ede branch January 21, 2021 17:36