Conversation

tbagrel1 (Contributor)

Co-authored-by: nbacquey [email protected]

Description

Please include a meaningful description of the PR and link the relevant issues
this PR might resolve.

Also note that:

  • New code should be properly tested (even if it does not add new features).
  • The fix for a regression should include a test that reproduces said regression.

@tbagrel1 tbagrel1 changed the title Peras/experimental-object-diffusion-inbound-v2 [WIP] Experimental ObjectDiffusion v2 (just inbound side changes) for efficient vote retrieval Sep 29, 2025
@tbagrel1 tbagrel1 force-pushed the peras/experimental-object-diffusion-inbound-v2 branch 2 times, most recently from 9de5d64 to 443cfc8 Compare October 13, 2025 14:03
@tbagrel1 tbagrel1 force-pushed the peras/experimental-object-diffusion-inbound-v2 branch from 96b7016 to 8bd6007 Compare October 15, 2025 12:00
@tbagrel1 tbagrel1 changed the base branch from main-pr/object-diffusion to peras-staging October 15, 2025 12:03
tbagrel1 (Contributor Author)

@nbacquey nbacquey force-pushed the peras/experimental-object-diffusion-inbound-v2 branch from d28fe49 to f794d6f Compare October 20, 2025 18:08
- the inbound peer that submitted the object to the pool might have already acked it by the time the object is rejected by the pool. However, the rejection indicates that the outbound peer which sent us the object is adversarial, and we should disconnect from it anyway. So no harm is done by having acked the object to that adversarial outbound peer, as we won't want to re-download this object (or any other object whatsoever) from it again.
- any other inbound peer that has this ID available from its outbound peer won't be able to ack it, because this ID isn't in **their** `dpsObjectsOwtPool` and the object isn't in the pool either; so we will still be able to download it from these other peers until we find a valid copy.

As in TxSubmission, acknowledgement is done by indicating to the outbound peer the length of the (longest) prefix of the outstanding FIFO that we no longer care about (i.e. for which all IDs are eligible for acknowledgment by the definition above). The field `dpsOutstandingFifo` on the inbound peer is supposed to mirror exactly the state of the FIFO of the outbound peer, bar possible discrepancies due to in-flight information.
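As a rough illustration (this is a sketch, not the actual implementation: the `eligible` predicate and the plain-list FIFO representation are assumptions), the acknowledgment count sent to the outbound peer is the length of the longest all-eligible prefix, and applying an ack just drops that prefix from the mirrored FIFO:

```haskell
-- Hypothetical sketch: the acknowledgment count is the length of the
-- longest prefix of the outstanding FIFO whose IDs are all eligible
-- for acknowledgment (e.g. already in the pool, or known-uninteresting).
ackCount :: (objectId -> Bool) -> [objectId] -> Int
ackCount eligible = length . takeWhile eligible

-- After sending the ack, the mirrored FIFO is shortened accordingly,
-- keeping it in sync with the outbound peer's own FIFO.
applyAck :: Int -> [objectId] -> [objectId]
applyAck n = drop n
```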
Contributor:

I probably missed something along the way, but what happens when a peer receives all but the first object it requests from an outbound peer? Can that object be re-requested from the same peer, or do we rely on some other peer sending it successfully and thus allowing us to move ahead in the outstanding FIFO?

Contributor Author:

The outbound peer is not allowed to reply partially to one of our requests. It could (maybe) reply out of order to different requests we made, but if request A asks for 30 specific objects, the reply to request A must contain exactly those 30 objects; otherwise we disconnect immediately from that peer for a protocol fault.
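A minimal sketch of this "all or nothing" rule (the names `checkReply` and `ReplyVerdict` are hypothetical, not from the codebase): a reply is valid only if it contains exactly the requested objects, and anything else is a protocol fault leading to disconnection.

```haskell
import qualified Data.Set as Set

-- Hypothetical sketch: the reply to a request for N specific objects
-- must contain exactly those N objects, otherwise the outbound peer has
-- committed a protocol fault and we disconnect from it.
data ReplyVerdict = ReplyOk | ProtocolFault
  deriving (Eq, Show)

checkReply :: Ord objectId => [objectId] -> [objectId] -> ReplyVerdict
checkReply requested received
  | Set.fromList requested == Set.fromList received
      && length requested == length received = ReplyOk  -- exact match
  | otherwise = ProtocolFault                           -- partial/extra: fault
```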

Contributor:

That makes sense. I had missed that the requested objects all have to be sent in the same reply and not in individual ones.

Contributor Author:

I think it is a requirement of the typed-protocols machinery that a single request leads to a single reply, because we count (as a type-level natural) how many requests have been pipelined in order to know how many replies we can expect.
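The type-level counting can be sketched like this (a toy model, not the actual typed-protocols API): pipelining a request increments a type-level natural, and collecting a reply is only well-typed when at least one request is outstanding, decrementing it by one.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- Toy sketch of counting pipelined requests at the type level.
data N = Z | S N

-- A witness of how many requests are currently outstanding.
data Outstanding (n :: N) where
  None    :: Outstanding 'Z
  OneMore :: Outstanding n -> Outstanding ('S n)

-- Pipelining a request increments the counter ...
pipeline :: Outstanding n -> Outstanding ('S n)
pipeline = OneMore

-- ... and collecting a reply is only possible when at least one request
-- is outstanding; it decrements the counter. A request yielding zero or
-- two replies simply cannot be expressed.
collect :: Outstanding ('S n) -> Outstanding n
collect (OneMore o) = o

-- Term-level view of the counter, for inspection.
count :: Outstanding n -> Int
count None        = 0
count (OneMore o) = 1 + count o
```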

When making decisions, we first divide the peers into two groups:

- Those who are currently executing a decision, i.e., those for which the (previous) decision in the decision channel satisfies `pdStatus == DecisionBeingActedUpon`. These are further called _frozen peers_.
- Those who are not currently executing a decision, i.e., those for which the (previous) decision in the decision channel satisfies `pdStatus == DecisionUnread || pdStatus == DecisionCompleted`. The former are the ones who haven't had time to read the previous decision yet, so it makes sense to recompute a more up-to-date decision for them. The latter are the ones who have finished executing the previous decision, so it also makes sense to compute a new decision for them. These two categories of peers are further called _active peers_.
Contributor:

> The former are the ones who didn't have time to read the previous decision yet, so it makes sense to recompute a more up-to-date decision for them.

Could this somehow lead to starvation? I.e., could the peer never leave the `DecisionUnread` state because the decision thread keeps recomputing a new one?

Contributor Author:

Usually, as soon as a peer has completed a decision, it will start again at the top of `go` and ask for a new one. As soon as `psaReadDecision` returns, the decision is marked as `DecisionBeingActedUpon`. And because the decision thread is rate-limited, there is always a window during which the "lock" on the decision (I'm not sure exactly how the STM operates to ensure atomic mutations of shared mutable variables) is not held by the decision thread, allowing the peer to read it and mark it as being acted upon.

The only case where a peer would not ask for a decision immediately after completing one is when it has made a blocking request for IDs, in which case it will block and collect the reply before starting at the top of `go` again. This is the only case I see where the decision logic could update the decision (with status `DecisionUnread`) a significant number of times before the peer asks for it, but in that case the decision should be trivial (i.e. do nothing, because there are no IDs in the FIFO, as this is the precondition for making a blocking request for IDs).

I'm not sure I answered your question actually. Feel free to ask for more details.
Also if you see how the documentation could be improved to make it clearer for future readers, please suggest an edit ;)
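For intuition, the read-and-mark step can be sketched as a single STM transaction (hypothetical names; this is not the actual `psaReadDecision` implementation): the pure transition says a decision can only be handed over while unread, and the STM wrapper retries until that is the case, atomically flipping the status so the decision thread sees the peer as frozen.

```haskell
import Control.Concurrent.STM

data DecisionStatus = DecisionUnread | DecisionBeingActedUpon | DecisionCompleted
  deriving (Eq, Show)

-- Pure transition: a decision can be handed to the peer only if it is
-- unread; handing it over marks it as being acted upon.
markRead :: (DecisionStatus, d) -> Maybe (DecisionStatus, d)
markRead (DecisionUnread, d) = Just (DecisionBeingActedUpon, d)
markRead _                   = Nothing

-- STM wrapper: block (retry) until a fresh decision is available, then
-- atomically return it and flip its status in one transaction, so there
-- is no window where the peer reads a decision without freezing it.
readDecision :: TVar (DecisionStatus, d) -> IO d
readDecision var = atomically $ do
  st <- readTVar var
  case markRead st of
    Nothing         -> retry
    Just st'@(_, d) -> writeTVar var st' >> pure d
```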

Contributor:

Thanks. Yes, I think this answers my question.

Maybe a follow-up question: let's say that all our inbound peers are blocked waiting for IDs. Would this make the decision thread enter a temporary busy-waiting loop where it does nothing but update its decisions over and over, or would it block as well until some condition flag switches state?

Sorry in advance if this is nonsensical :)

tbagrel1 (Contributor Author), Oct 23, 2025:

Indeed, at the beginning of the protocol, all inbound peers would block waiting for IDs, so the decision loop would keep replacing each trivial decision with a new trivial decision.

It could make sense to pre-filter out the peers that are waiting for IDs, so that they are not considered by the decision thread. Even simpler, actually, would be to only call `psaOnDecisionExecuted` once the IDs have been received in the blocking case, so those peers would be filtered out automatically (considered as "frozen" peers). If we make this change, there will theoretically be no case where a decision is updated a significant number of times before being read and frozen in.

NOTE: a peer can only make a blocking request for IDs when there are no longer any IDs in its FIFO, so the decision for this peer must be really trivial, as there is nothing we can ask it to do.

- They are not already in the pool
- They are available from the peer, and won't be acked by the peer on its next request for IDs according to the partial decision computed by `computeAck` at the previous step (NOTE: theoretically, this second condition is redundant with other constraints and invariants of the current implementation).

Then we "reverse" this mapping to obtain a map of object IDs to the set of active peers that have the corresponding interesting objects available (according to the criteria above), further called _potential providers_.
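The "reversal" is a standard map inversion; a minimal sketch (the name `potentialProviders` and the map-of-sets representation are assumptions, not the actual types):

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical sketch of "reversing" the per-peer availability map into
-- a map from object ID to its set of potential providers.
potentialProviders ::
  (Ord peer, Ord objectId) =>
  Map peer (Set objectId) ->  -- interesting objects available per active peer
  Map objectId (Set peer)     -- potential providers per object ID
potentialProviders available =
  Map.fromListWith Set.union
    [ (oid, Set.singleton peer)
    | (peer, oids) <- Map.toList available
    , oid <- Set.toList oids
    ]
```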
Contributor:

Is there any notion of quality of service that one could use to prioritize certain potential providers over others at this point?

Contributor Author:

We could try to balance the load, e.g. prioritize peers with the fewest objects in flight at a given time.

Given that any invalid object triggers protocol termination for a peer, we can't have a "quality" metric for the objects provided by an outbound peer: they must be perfect, or we disconnect.

It could maybe make sense to use information from the network layer, to see how fast the peers are at distributing objects and prioritize the fastest providers. But that would probably require some architectural changes; or require that we measure the time taken between a request and its reply ourselves, at the application layer, which might be inconvenient/inaccurate. I really don't know at this point.
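The load-balancing idea alone needs no extra infrastructure; a minimal sketch (hypothetical names, assuming we already track an in-flight count per peer):

```haskell
import           Data.List (sortOn)
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical load-balancing sketch: order the potential providers of
-- an object by how many objects we currently have in flight from each,
-- so the least-loaded peer is asked first. (Latency-based ranking would
-- need timing data we don't currently collect.)
rankProviders :: Ord peer => Map peer Int -> [peer] -> [peer]
rankProviders inFlight =
  sortOn (\p -> Map.findWithDefault 0 p inFlight)
```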

Contributor:

Probably this should be part of the optimization effort for voting logic, which is not due for several months.

@nbacquey nbacquey force-pushed the peras/experimental-object-diffusion-inbound-v2 branch from a459164 to 7fa719c Compare October 23, 2025 21:35

Development

Successfully merging this pull request may close these issues.

Figure out non-trivial inbound logic for votes retrieval
