remove FindPeer error propagation in RemoteFetch call
#1523
Conversation
…turn behaviour
- RemoteFetch does not return error if no peers found
- add self node id to debug loglines
- pass node id to netstore in test
- remove Sleeps in pushsync simulation code
// * downloader downloads the chunk
// Trials are run concurrently
func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
Note there is still some issue with the pushsync simulation tests being very slow. Especially for a low(!) number of nodes.
I have a hard time understanding what we are simulating with 4 nodes... most probably depth is 0 or 1 with 4 nodes, so I don't think this simulation makes sense with less than 16 nodes. What do you think?
I would say you are right, were it not for the fact that it was a small network that showed the problem here, and it likely still has issues, so we need both.
log.Trace(err.Error(), "ref", ref)
osp.LogFields(olog.String("err", err.Error()))
osp.Finish()
return ErrNoSuitablePeer
After setting the loglevel higher you see that the simulation errors with no peers available coming from RemoteFetch. This is already suspicious since such an error should not reach the user.
With some debugging you can see which FindPeer case results in an error. Then grepping the address in the log, you can easily track the series of events:
- requesting node calls remote fetch, which calls FindPeer to find a peer
- the peer starts serving the request
- before delivery, the search timeout passes and the following call to FindPeer results in an error
- RemoteFetch propagates this error to the NetStore, then the downloader, then the simulation.
It is not correct to return this error; instead, we should return only when the request times out.
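For illustration, a minimal sketch of the intended behaviour (findPeer, sendRetrieveRequest, searchTimeout and the types here are placeholders, not the actual retrieval API):

```go
package retrieval

import (
	"context"
	"time"
)

// Placeholder types and helpers standing in for the real retrieval code.
type Address []byte
type Peer struct{}

var searchTimeout = 500 * time.Millisecond

func findPeer(ref Address) (*Peer, error)      { return nil, nil }
func sendRetrieveRequest(p *Peer, ref Address) {}

// remoteFetch keeps the retrieve request for ref open until ctx expires.
// A failed peer search is retried after searchTimeout rather than being
// returned to the caller: the chunk may still arrive through an earlier
// request, push sync or pull sync before the deadline.
func remoteFetch(ctx context.Context, ref Address, delivered <-chan struct{}) error {
	for {
		if peer, err := findPeer(ref); err == nil {
			sendRetrieveRequest(peer, ref)
		}
		select {
		case <-delivered:
			return nil // chunk arrived, however it got here
		case <-time.After(searchTimeout):
			// no delivery yet, try the next suitable peer
		case <-ctx.Done():
			return ctx.Err() // only the global timeout aborts the request
		}
	}
}
```

The point is only that the ErrNoSuitablePeer branch disappears: the loop ends on delivery or on the caller's deadline, nothing else.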
ErrNoSuitablePeer should be removed
I don't really understand this. If n.RemoteGet errors, what are we waiting for... obviously we haven't made any request to remote peers, so?
If we don't care about any errors from n.RemoteGet, then we should change the interface. However, intuitively, if the chunk we are looking for does not exist, why should we hang until the global timeout expires (i.e. the timeout set on the context when the user calls NetStore.Get)?
no peers available means that all suitable peers have been tried and they did not respond within the searchTimeout. if this happens before the global timeout, then NetStore gets this error, and not context deadline exceeded.
Let me try to clarify.
Your line of thinking is based on the assumption that
- a request prompts you to actively look for the chunk
- if these searches are exhausted, you can conclude the chunk is not found
However, this is not what your incentive tells you to do.
- the requestor tells you they require a chunk and are willing to pay for it (up to the request TTL; the proxy for this is currently the request timeout)
- therefore you really want to be prepared to deliver the chunk to the requestor until the TTL expires, and you really don't care how it gets there. Simply put, you are not incentivised to close a request as long as you can still earn from the response.
Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.) This is because there are actually multiple ways the chunk can still appear in our store before the deadline:
- a peer asked earlier delivers it (later than searchTimeout)
- new peers are connected which are suitable to ask and they successfully deliver
- the chunk is pull synced to us as a result of neighbourhood reorganisation
- the chunk is pull synced or push synced to us as a result of it being recently uploaded
- the chunk is actually uploaded by our own node
This interpretation is much less motivated in traditional server based retrieval, since an asset is either found or not found at the unique place it is meant to be.
Note also that your early-abort scheme could be useful if a distinct 404 response were sent back. However, if this were incentivised, peers could just frivolously respond with it and earn without work. A better way to implement this functionality (tell me if chunk x is currently found in the network) is by asking for a proof of custody with a short TTL.
Hope this helps
Thanks for the detailed explanation of your thinking. Maybe I find it hard to wrap my head around because at the moment we have very little, close to none, of the incentivisation: no peers paying, or willing to pay, for requests.
> Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.)
If there are no suitable peers for 1 chunk, it doesn't mean that there are no suitable peers for another request. So I am not sure why the node would stop receiving retrieval requests, if they can't seem to find peers that can deliver a particular chunk.
If chunk accounting was part of our tests, and we're optimising for nodes to get the highest amount of currency for chunk deliveries, then this would make sense, but we have none of that.
Viktor seems to be thinking about an edge case - the chunk isn't found, but we should not say that it isn't found because there is a small chance that we encounter the chunk after we would have sent a not-found message but before the request times out on its own.
The risk is that this degrades the performance of the entire network (slow/sluggish responses) at minimal gain (if any).
@zelig: The incentive of the storer is to maximise total revenue, not to maximise revenue from a particular retrieval request. It only makes sense to get rid of 404 messages entirely and let every request time out if this behaviour doesn't degrade the performance of the network overall to a degree that there are fewer retrieval requests to serve (fewer users) and thus total income is reduced.
To answer this we would really have to investigate the relative frequency of the chunk-appearing-just-in-time phenomenon. I suspect it occurs far more frequently in the tests than in real life.
Adding @homotopycolimit here as well, so that after this PR is reviewed and merged we are all on the same page on how retrieve requests work in Swarm. It seems to me that we have different expectations on how the code should behave.
@zelig please submit these changes towards …
func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
	chunkCnt := 500
	trials := 10
I find this confusing. A trial means that you try one and the same thing multiple times... a synonym for attempt.
However, what this test is doing is running 10 independent uploads and downloads and making sure that none of them fails. So I suggest we rename this to n or to testcases, but not trials.
Also, in the context of Swarm, when I read trials I am thinking "we are probably uploading and trying to download immediately, this is why we need to retry multiple times, as chunks might not be synced" - which is not at all what is happening here.
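A sketch of how the loop might read with such a rename (uploadAndDownload and the counts are placeholders, not the real test body):

```go
package pushsync

import "testing"

// uploadAndDownload stands in for one independent upload/download round.
func uploadAndDownload(nodeCnt, chunkCnt int) error { return nil }

func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
	chunkCnt := 500
	testcases := 10 // independent rounds, not retries of one and the same attempt

	errc := make(chan error, testcases)
	for i := 0; i < testcases; i++ {
		go func() {
			errc <- uploadAndDownload(nodeCnt, chunkCnt)
		}()
	}
	for i := 0; i < testcases; i++ {
		if err := <-errc; err != nil {
			t.Fatal(err)
		}
	}
}
```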
- storer needs to take netstore not localstore to put the chunk so that fetchers created earlier could respond
}

// setup netstore
netStore := storage.NewNetStore(lstore, enode.HexID(hexutil.Encode(kad.BaseAddr())))
@nonsense so this was the actual problem, mea culpa.
- pss prox works ok and delivers the chunk in the right places
- but the chunk is synced after only one receipt
- a request can therefore reach a node before push sync reaches it, and the node is not asked again until the peerskip delay times out
- the pushsync storer mistakenly used localstore, not netstore, to put the chunk, therefore no existing fetcher on the node was notified, and the request was only served when the same peer was asked again (see the sketch below)
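A rough sketch of the fix being described (the Store interface and Put signatures here are simplified placeholders, not the real storage API):

```go
package pushsync

import "context"

// Chunk and Store are simplified stand-ins for the real storage types.
type Chunk interface{ Address() []byte }

type Store interface {
	Put(ctx context.Context, ch Chunk) error
}

// storer holds both stores; only a Put through the NetStore notifies
// fetchers that are already waiting for the chunk.
type storer struct {
	localStore Store // writes to disk only
	netStore   Store // wraps the LocalStore and tracks open fetchers
}

// handleChunkDelivery stores a chunk that arrived via push sync.
// Before the fix this called s.localStore.Put, so the chunk was
// persisted but a fetcher created by an earlier retrieve request was
// never notified, and the request hung until the same peer was asked
// again after the peerskip delay.
func (s *storer) handleChunkDelivery(ctx context.Context, ch Chunk) error {
	return s.netStore.Put(ctx, ch)
}
```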
> a request can therefore reach a node before push sync reaches it and not asked again until peerskip delay times out.
How can a request reach a node before push sync, when we are explicitly waiting for push sync receipts before triggering any downloads?
> the pushsync storer mistakenly used localstore not netstore to put the chunk, therefore no existing fetcher on the node was notified and therefore, the request was only served when the same peer was asked again.
good catch I didn't notice that.
force-pushed 4ad628b to 2e6b9fc
force-pushed 8cc9bda to 2be809c
what's the status on this? can we have a rebase?
@acud this PR was triggered by the failing tests on the … Viktor also found a bug that … The only reason this PR is still open, as far as I know, is so that we decide if we want to change the way retrieve requests time out, which should have been discussed yesterday, but I don't know what the outcome is.
force-pushed 8c07b70 to b1bab1e
@nonsense, @zelig, this is going out of date, has conflicts, and IMO going nowhere.
force-pushed 5811f12 to fdbcae7
force-pushed 62c04ce to 1eb077e
Closing this, as it has diverged from …
This PR fixes the incorrect return logic in the RemoteFetch function call. In short, RemoteFetch should not propagate a no peer found error up to NetStore, because it would abort a correctly served request which would deliver before the request times out. After this simple fix, all pushsync tests pass. I tried the simulation tests with up to 64 nodes numerous times, all successful.
The PR also fixes the random unique id generation for tags in the chunk package. For some reason tests kept failing with tag already exists, which now seems fixed by eliminating the rng field and just using rand directly instead.
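The tag fix itself is not shown in the diff above; as a rough illustration of "drop the rng field, use rand directly" (the Tag type and NewTag constructor here are assumptions, not the actual chunk package API):

```go
package chunk

import "math/rand"

// Tag tracks an upload; Uid must be unique across tags.
type Tag struct {
	Uid  uint32
	Name string
}

// NewTag draws the uid from the package-level math/rand source instead
// of a per-struct rng field. The global source is safe for concurrent
// use and is not re-seeded identically by every caller, which is one
// way duplicate uids ("tag already exists") can show up in tests.
func NewTag(name string) *Tag {
	return &Tag{
		Uid:  rand.Uint32(),
		Name: name,
	}
}
```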