remove FindPeer error propagation in RemoteFetch call
#1523
Conversation
…turn behaviour
- RemoteFetch does not return error if no peers found
- add self node id to debug loglines
- pass node id to netstore in test
- remove Sleeps in pushsync simulation code
// * downloader downloads the chunk
// Trials are run concurrently
func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
Note there is still some issue with the pushsync simulation tests being very slow. Especially for a low(!) number of nodes.
I have a hard time understanding what we are simulating with 4 nodes... most probably depth is 0 or 1 with 4 nodes, so I don't think this simulation makes sense with less than 16 nodes. What do you think?
I would say you are right, were it not for the fact that it was a small network that showed the problem here, and it likely still has issues, so we need both.
log.Trace(err.Error(), "ref", ref)
osp.LogFields(olog.String("err", err.Error()))
osp.Finish()
return ErrNoSuitablePeer
After setting the loglevel higher you see that the simulation errors with no peers available coming from RemoteFetch. This is already suspicious since such an error should not reach the user.
With some debugging you can see which FindPeer case results in an error. Then grepping the address in the log, you can easily track the series of events:
- requesting node calls remote fetch, which calls FindPeer to find a peer
- the peer starts serving the request
- before delivery, the search timeout passes and the following call to FindPeer results in an error
- RemoteFetch propagates this error to the NetStore, then the downloader, then the simulation.
It is not correct to return this error; instead, we should return only when the request times out.
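For illustration, a minimal sketch of the intended behaviour (findPeer, sendRetrieveRequest, searchTimeout and the types here are placeholders, not the actual retrieval API):

```go
package retrieval

import (
	"context"
	"time"
)

// Placeholder types and helpers standing in for the real retrieval code.
type Address []byte
type Peer struct{}

var searchTimeout = 500 * time.Millisecond

func findPeer(ref Address) (*Peer, error)      { return nil, nil }
func sendRetrieveRequest(p *Peer, ref Address) {}

// remoteFetch keeps the retrieve request for ref open until ctx expires.
// A failed peer search is retried after searchTimeout rather than being
// returned to the caller: the chunk may still arrive through an earlier
// request, push sync or pull sync before the deadline.
func remoteFetch(ctx context.Context, ref Address, delivered <-chan struct{}) error {
	for {
		if peer, err := findPeer(ref); err == nil {
			sendRetrieveRequest(peer, ref)
		}
		select {
		case <-delivered:
			return nil // chunk arrived, however it got here
		case <-time.After(searchTimeout):
			// no delivery yet, try the next suitable peer
		case <-ctx.Done():
			return ctx.Err() // only the global timeout aborts the request
		}
	}
}
```

The point is only that the ErrNoSuitablePeer branch disappears: the loop ends on delivery or on the caller's deadline, nothing else.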
ErrNoSuitablePeer should be removed
I don't really understand this. If n.RemoteGet errors, what are we waiting for... obviously we haven't made any request to remote peers, so?
If we don't care about any errors from n.RemoteGet, then we should change the interface. However, intuitively, if the chunk we are looking for does not exist, why should we hang until the global timeout expires (i.e. the timeout set on the context when the user calls NetStore.Get)?
no peers available means that all suitable peers have been tried and they did not respond within the searchTimeout. if this happens before the global timeout, then NetStore gets this error, and not context deadline exceeded.
Let me try to clarify.
Your line of thinking is based on the assumption that
- a request prompts you to actively look for the chunk
- if these searches are exhausted, you can conclude the chunk is not found
However, this is not what your incentive tells you to do.
- the requestor tells you they require a chunk and are willing to pay for it (up to the request TTL; the proxy for this is currently the request timeout)
- therefore you really want to be prepared to deliver the chunk to the requestor until the TTL expires, and you really don't care how it gets there. Simply put, you are not incentivised to close a request as long as you can still earn from the response.
Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.) This is because there are actually multiple ways the chunk can still appear in our store before the deadline:
- a peer asked earlier delivers it (later than searchTimeout)
- new peers are connected which are suitable to ask and they successfully deliver
- the chunk is pull synced to us as a result of neighbourhood reorganisation
- the chunk is pull synced or push synced to us as a result of it being recently uploaded
- the chunk is actually uploaded by our own node
This interpretation is much less motivated in traditional server based retrieval, since an asset is either found or not found at the unique place it is meant to be.
Note also that your early-abort scheme could be useful if a distinct 404 response were sent back. However, if this were incentivised, peers could just frivolously respond with it and earn without work. A better way to implement this functionality (tell me if chunk x is currently found in the network) is by asking for a proof of custody with a short TTL.
Hope this helps
Thanks for the detailed explanation of your thinking. Maybe I find it hard to wrap my head around because at the moment we have very little, close to none, of the incentivisation: no peers paying, or willing to pay, for requests.
> Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.)
If there are no suitable peers for 1 chunk, it doesn't mean that there are no suitable peers for another request. So I am not sure why the node would stop receiving retrieval requests, if they can't seem to find peers that can deliver a particular chunk.
If chunk accounting was part of our tests, and we're optimising for nodes to get the highest amount of currency for chunk deliveries, then this would make sense, but we have none of that.
Viktor seems to be thinking about an edge case - the chunk isn't found, but we should not say that it isn't found because there is a small chance that we encounter the chunk after we would have sent a not-found message but before the request times out on its own.
The risk is that this degrades the performance of the entire network (slow/sluggish responses) at minimal gain (if any).
@zelig: The incentive of the storer is to maximise total revenue, not to maximise revenue from a particular retrieval request. It only makes sense to get rid of 404 messages entirely and let every request time out if this behaviour doesn't degrade the performance of the network overall to a degree that there are fewer retrieval requests to serve (fewer users) and thus total income is reduced.
To answer this we would really have to investigate the relative frequency of the chunk-appearing-just-in-time phenomenon. I suspect it occurs far more frequently in the tests than in real life.
Adding @homotopycolimit here as well, so that after this PR is reviewed and merged we are all on the same page on how retrieve requests work in Swarm. It seems to me that we have different expectations on how the code should behave.
@zelig please submit these changes towards …
func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
	chunkCnt := 500
	trials := 10
I find this confusing. A trial means that you try one and the same thing multiple times... a synonym for attempt.
However, what this test is doing is running 10 independent uploads and downloads and making sure that none of them fails. So I suggest we rename this to n or to testcases, but not trials.
Also, in the context of Swarm, when I read trials I am thinking "we are probably uploading and trying to download immediately, this is why we need to retry multiple times, as chunks might not be synced" - which is not at all what is happening here.
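A sketch of how the loop might read with such a rename (uploadAndDownload and the counts are placeholders, not the real test body):

```go
package pushsync

import "testing"

// uploadAndDownload stands in for one independent upload/download round.
func uploadAndDownload(nodeCnt, chunkCnt int) error { return nil }

func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
	chunkCnt := 500
	testcases := 10 // independent rounds, not retries of one and the same attempt

	errc := make(chan error, testcases)
	for i := 0; i < testcases; i++ {
		go func() {
			errc <- uploadAndDownload(nodeCnt, chunkCnt)
		}()
	}
	for i := 0; i < testcases; i++ {
		if err := <-errc; err != nil {
			t.Fatal(err)
		}
	}
}
```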
- storer needs to take netstore not localstore to put the chunk so that fetchers created earlier could respond
}

// setup netstore
netStore := storage.NewNetStore(lstore, enode.HexID(hexutil.Encode(kad.BaseAddr())))
@nonsense so this was the actual problem, mea culpa.
- pss prox works ok and delivers the chunk in the right places
- but the chunk is synced after only one receipt
- a request can therefore reach a node before push sync reaches it, and the node is not asked again until the peerskip delay times out
- the pushsync storer mistakenly used localstore, not netstore, to put the chunk, therefore no existing fetcher on the node was notified, and the request was only served when the same peer was asked again (see the sketch below)
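A rough sketch of the fix being described (the Store interface and Put signatures here are simplified placeholders, not the real storage API):

```go
package pushsync

import "context"

// Chunk and Store are simplified stand-ins for the real storage types.
type Chunk interface{ Address() []byte }

type Store interface {
	Put(ctx context.Context, ch Chunk) error
}

// storer holds both stores; only a Put through the NetStore notifies
// fetchers that are already waiting for the chunk.
type storer struct {
	localStore Store // writes to disk only
	netStore   Store // wraps the LocalStore and tracks open fetchers
}

// handleChunkDelivery stores a chunk that arrived via push sync.
// Before the fix this called s.localStore.Put, so the chunk was
// persisted but a fetcher created by an earlier retrieve request was
// never notified, and the request hung until the same peer was asked
// again after the peerskip delay.
func (s *storer) handleChunkDelivery(ctx context.Context, ch Chunk) error {
	return s.netStore.Put(ctx, ch)
}
```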
> a request can therefore reach a node before push sync reaches it and not asked again until peerskip delay times out.
How can a request reach a node before push sync, when we are explicitly waiting for push sync receipts before triggering any downloads?
> the pushsync storer mistakenly used localstore not netstore to put the chunk, therefore no existing fetcher on the node was notified and therefore, the request was only served when the same peer was asked again.
good catch I didn't notice that.
force-pushed 4ad628b to 2e6b9fc
force-pushed 8cc9bda to 2be809c
what's the status on this? can we have a rebase?
@acud this PR was triggered by the failing tests on the … Viktor also found a bug that … The only reason this PR is still open, as far as I know, is so that we decide if we want to change the way retrieve requests time out, which should have been discussed yesterday, but I don't know what the outcome is.
force-pushed 8c07b70 to b1bab1e
@nonsense, @zelig, this is going out of date, has conflicts, and IMO going nowhere.
force-pushed 5811f12 to fdbcae7
force-pushed 62c04ce to 1eb077e
Closing this, as it has diverged from …
This PR fixes the incorrect return logic in the RemoteFetch function call. In short, RemoteFetch should not propagate a no peer found error up to NetStore, because it would abort a correctly served request which would deliver before the request times out. After this simple fix, all pushsync tests pass. I tried the simulation tests with up to 64 nodes numerous times, all successful.
The PR also fixes the random unique id generation for tags in the chunk package. For some reason tests kept failing with tag already exists, which now seems fixed by eliminating the rng field and just using rand directly instead.
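The tag fix itself is not shown in the diff above; as a rough illustration of "drop the rng field, use rand directly" (the Tag type and NewTag constructor here are assumptions, not the actual chunk package API):

```go
package chunk

import "math/rand"

// Tag tracks an upload; Uid must be unique across tags.
type Tag struct {
	Uid  uint32
	Name string
}

// NewTag draws the uid from the package-level math/rand source instead
// of a per-struct rng field. The global source is safe for concurrent
// use and is not re-seeded identically by every caller, which is one
// way duplicate uids ("tag already exists") can show up in tests.
func NewTag(name string) *Tag {
	return &Tag{
		Uid:  rand.Uint32(),
		Name: name,
	}
}
```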