Seed Agent Cluster Auto-Configuration

### Specification

The seed node cluster behind `mainnet.polykey.io` and `testnet.polykey.io` requires some auto-configuration so that the seed nodes gain knowledge of each other and can share their DHT workload, which includes signalling and relaying.

Currently seed nodes are launched without knowledge of any other seed nodes. This makes sense, since they are the first seed nodes. However, as we scale the number of seed nodes, it would make sense for seed nodes to automatically discover each other and establish connections. This would make it easier to launch clusters of seed nodes.

There are several challenges here and questions we must work out:

* Does it mean it is possible to run multiple seed nodes with the same NodeID?
* If we need to have multiple seed nodes, we must pregenerate their root keys and preserve their recovery codes (see MatrixAI/Polykey#285)
* If multiple seed nodes have different NodeIDs, are their root keys connected to each other in a trust relationship (either hierarchically via PKI, or as a loose mesh via the gestalt graph + root chain)?
  - How does this impact how this trust information is propagated eventually-consistently across the network?
  - How does this deal with attacks/impersonation/DHT poisoning/sybil...?
  - How does this deal with revocation?
  - What does this mean for the default seed node list that is configured in the PK software distribution?
* If seed nodes are scaled up and down, how do they acquire their recovery keys securely and without conflict?
  - See: https://gitlab.com/MatrixAI/Engineering/Polykey/Polykey-Infrastructure/-/issues/6
* When seed nodes need to discover each other automatically, we have to use one of the auto-configuration networking technologies.
  - Multicast DNS - MatrixAI/js-mdns#1 
  - AWS service discovery
  - Should support IPv6 MatrixAI/Polykey#400 
  - https://en.wikipedia.org/wiki/Zero-configuration_networking
* If the seed nodes in the cluster are all behind 1 IP address/hostname (like our NLB), this means we have to consider the following mappings:
  - Multiple node ids - multiple host names - multiple IP addresses
  - Multiple node ids to 1 IP address
  - Multiple node ids to 1 host name
  - 1 hostname can resolve to multiple IP addresses (randomly too)
  - The same node id on multiple IP addresses and multiple host names
  - https://github.com/MatrixAI/js-polykey/pull/396#issuecomment-1175200064 - discussion on the multi-level complexity of AWS
* Using a network load balancer means we need to preserve stickiness for "flows". We must ensure that load balancing doesn't break our network connections mid-flight and mid-conversation.
  - AWS sets a 120s timeout for a UDP flow; this is not configurable.
  - AWS load balances according to origin IP address, and maintains the stickiness for the lifetime of a flow
  - The stickiness must be preserved from the NLB to multiple listeners, from listener to multiple target groups, and from target group to multiple targets.
  - https://github.com/MatrixAI/js-polykey/pull/396#issuecomment-1175040632 - discussion about how stickiness works on NLB
  - ![image](https://user-images.githubusercontent.com/640797/178093887-b3c941a6-b8f9-4c32-95de-3f410cde71de.png)
* Load balancing introduces network proxies. These network proxies **must** preserve the client IP address, otherwise NAT-busting signalling will not work.
  - We've enabled this option for the NLB
  - There's a special protocol (the PROXY protocol) for preserving client IPs when this cannot be done at the IP layer and must instead be done at the UDP/TCP layer
    * https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#proxy-protocol
    * https://github.com/haproxy/haproxy/blob/master/doc/proxy-protocol.txt
    * We could integrate this into our `Proxy` class
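
The PROXY protocol linked above prepends a small header carrying the original client address. As a rough sketch of what integrating it into `Proxy` might involve, the following parses a version 2 binary header with an IPv4 address block, following the layout in the HAProxy specification. The names `parseProxyV2` and `ProxyInfo` are hypothetical and not part of Polykey; IPv6 addresses and TLV extensions are left out.

```typescript
// Hypothetical result type for illustration; not a Polykey type.
interface ProxyInfo {
  command: 'LOCAL' | 'PROXY';
  transport: 'STREAM' | 'DGRAM' | 'UNSPEC';
  sourceHost?: string;
  sourcePort?: number;
  destHost?: string;
  destPort?: number;
  // Offset at which the proxied payload begins
  payloadOffset: number;
}

// Fixed 12-byte signature that starts every v2 header
const SIGNATURE = Uint8Array.from([
  0x0d, 0x0a, 0x0d, 0x0a, 0x00, 0x0d, 0x0a, 0x51, 0x55, 0x49, 0x54, 0x0a,
]);

// Returns undefined when the data does not start with a complete v2 header.
function parseProxyV2(data: Uint8Array): ProxyInfo | undefined {
  if (data.length < 16) return undefined;
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  for (let i = 0; i < SIGNATURE.length; i++) {
    if (view.getUint8(i) !== SIGNATURE[i]) return undefined;
  }
  // Byte 13: high nibble is the version, low nibble is the command
  const version = view.getUint8(12) >> 4;
  const commandBits = view.getUint8(12) & 0x0f;
  if (version !== 2 || commandBits > 1) return undefined;
  // Byte 14: high nibble is the address family, low nibble is the transport
  const family = view.getUint8(13) >> 4;
  const transportBits = view.getUint8(13) & 0x0f;
  // Bytes 15-16: length of the rest of the header, in network byte order
  const length = view.getUint16(14);
  if (data.length < 16 + length) return undefined;
  const info: ProxyInfo = {
    command: commandBits === 1 ? 'PROXY' : 'LOCAL',
    transport:
      transportBits === 1 ? 'STREAM' : transportBits === 2 ? 'DGRAM' : 'UNSPEC',
    payloadOffset: 16 + length,
  };
  // Only the IPv4 address block (family 0x1, 12 bytes) is decoded here
  if (info.command === 'PROXY' && family === 1 && length >= 12) {
    info.sourceHost = data.subarray(16, 20).join('.');
    info.destHost = data.subarray(20, 24).join('.');
    info.sourcePort = view.getUint16(24);
    info.destPort = view.getUint16(26);
  }
  return info;
}
```

A caller would strip `payloadOffset` bytes from the front of the data and use `sourceHost`/`sourcePort` as the client address for signalling, instead of the address of the load balancer.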

### Additional context

* MatrixAI/Polykey#396 - the initial automation of testnet.polykey.io uncovered these challenges when deploying to AWS
* https://gitlab.com/MatrixAI/Engineering/Polykey/Polykey-Infrastructure/-/issues/6 - recovery code injection from secret managers
* MatrixAI/Polykey#285 - maintaining recovery keys for the testnet
* https://adam-p.ca/blog/2022/03/x-forwarded-for/ - getting the real IP on layer 7 (note that we are preserving the client IP by default right now, but not all systems do this)
* Cloudflare is increasingly used as the gateway to all Polykey services. It's interesting to see that they are becoming that API gateway and then adding services on top. They skipped VMs and containers and went straight to serverless with WASM. WASM with WASI is the new unikernel system

### Tasks

1. Research DNS load balancing as an alternative
2. Work out how distributed PK with multiple nodes sharing the same IP address will work
3. Answer every question above
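
As a starting point for task 2, the node ID to address mappings listed in the specification can be made explicit with a two-way index. This is only a sketch under assumptions: `SeedAddress`, `indexAddresses`, `sharedHosts` and `multiHomedNodes` are hypothetical names, node IDs are treated as opaque strings, and nothing here reflects how Polykey's node graph actually stores addresses.

```typescript
// Hypothetical record of where a node ID was observed; not a Polykey type.
interface SeedAddress {
  nodeId: string;
  host: string; // hostname or IP address
  port: number;
}

interface AddressIndex {
  // node ID -> every host it has been seen on
  byNode: Map<string, Set<string>>;
  // host -> every node ID seen behind it
  byHost: Map<string, Set<string>>;
}

// Add a value to the set stored under a key, creating the set if needed
function addTo(map: Map<string, Set<string>>, key: string, value: string): void {
  let set = map.get(key);
  if (set === undefined) {
    set = new Set<string>();
    map.set(key, set);
  }
  set.add(value);
}

function indexAddresses(records: SeedAddress[]): AddressIndex {
  const index: AddressIndex = { byNode: new Map(), byHost: new Map() };
  for (const record of records) {
    addTo(index.byNode, record.nodeId, record.host);
    addTo(index.byHost, record.host, record.nodeId);
  }
  return index;
}

// Hosts that front more than one node ID, e.g. a load balancer hostname
function sharedHosts(index: AddressIndex): string[] {
  const hosts: string[] = [];
  index.byHost.forEach((nodeIds, host) => {
    if (nodeIds.size > 1) hosts.push(host);
  });
  return hosts;
}

// Node IDs that are reachable on more than one host
function multiHomedNodes(index: AddressIndex): string[] {
  const nodeIds: string[] = [];
  index.byNode.forEach((hosts, nodeId) => {
    if (hosts.size > 1) nodeIds.push(nodeId);
  });
  return nodeIds;
}
```

With such an index, a host reported by `sharedHosts` would be a case where connecting to the address alone does not determine which node ID answers, while `multiHomedNodes` covers the same node ID appearing on multiple IP addresses or hostnames.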

