This repository was archived by the owner on Aug 2, 2021. It is now read-only.

Conversation

@zelig (Member)

@zelig zelig commented Jun 27, 2019

This PR fixes the incorrect return logic in the RemoteFetch function call.
In short, RemoteFetch should not propagate a "no peer found" error up to NetStore, because doing so aborts a correctly served request that would have delivered before the request timed out.

After this simple fix, all pushsync tests pass. I ran simulation tests with up to 64 nodes numerous times, all successful.

The PR also fixes the random unique id generation for tags in the chunk package. For some reason tests kept failing with "tag already exists"; this seems to be fixed by eliminating the rng field and just using rand directly instead.
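
Roughly, the idea is something like the following minimal sketch; the type and method names here are illustrative, not the actual chunk package API:

// A minimal sketch, assuming unique tag ids are drawn from the package-level
// math/rand source instead of a per-Tags rng field.
package chunk

import (
	"math/rand"
	"sync"
)

type Tags struct {
	mu   sync.Mutex
	used map[uint32]struct{} // ids already handed out
}

func NewTags() *Tags {
	return &Tags{used: make(map[uint32]struct{})}
}

// newUniqueID draws ids from rand until it finds one that is not yet in use,
// avoiding the "tag already exists" collisions seen with the old rng field.
func (ts *Tags) newUniqueID() uint32 {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	for {
		id := rand.Uint32()
		if _, taken := ts.used[id]; !taken {
			ts.used[id] = struct{}{}
			return id
		}
	}
}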

nonsense and others added 5 commits June 18, 2019 15:39
…turn behaviour

- RemoteFetch does not return error if no peers found
- add self node id to debug loglines
- pass node id to netstore in test
- remove Sleeps in pushsync simulation code
@zelig zelig requested review from acud and nonsense June 27, 2019 12:23
@zelig zelig self-assigned this Jun 27, 2019
// * downloader downloads the chunk
// Trials are run concurrently
func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
Member Author

Note there is still some issue with the pushsync simulation tests being very slow, especially for a low(!) number of nodes.

Contributor

I have a hard time understanding what we are simulating with 4 nodes... most probably depth is 0 or 1 with 4 nodes, so I don't think this simulation makes sense with fewer than 16 nodes. What do you think?

Member Author

I would say you are right, were it not for the fact that it was the small network that exposed the problem here (and it likely still has issues), so we need both.

log.Trace(err.Error(), "ref", ref)
osp.LogFields(olog.String("err", err.Error()))
osp.Finish()
return ErrNoSuitablePeer
Member Author

After raising the log level you can see that the simulation errors with "no peers available", coming from RemoteFetch. This is already suspicious, since such an error should not reach the user.
With some debugging you can see which FindPeer case results in an error. Then, grepping for the address in the log, you can easily track the series of events:

  • the requesting node calls RemoteFetch, which calls FindPeer to find a peer
  • the peer starts serving the request
  • before delivery, the search timeout passes and the next call to FindPeer results in an error
  • RemoteFetch propagates this error to the NetStore, then the downloader, then the simulation.

It is not correct to return this error; instead we should only return once the request times out.
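
To make the intended behaviour concrete, here is a minimal sketch assuming a simplified signature; findPeer, forwardRequest and the fetched channel are illustrative stand-ins for the real netstore plumbing:

package storage

import (
	"context"
	"time"
)

// Illustrative stand-ins; the real types and helpers differ.
type Address []byte
type Peer struct{}

var searchTimeout = 500 * time.Millisecond

func findPeer(ctx context.Context, ref Address) (*Peer, error) { return &Peer{}, nil }
func forwardRequest(ctx context.Context, p *Peer, ref Address) {}

// remoteFetch keeps the request open until the chunk is delivered (fetched is
// closed) or the request context expires. A failed peer search only ends the
// current search round; it is never propagated to the caller.
func remoteFetch(ctx context.Context, ref Address, fetched <-chan struct{}) error {
	for {
		if peer, err := findPeer(ctx, ref); err == nil {
			forwardRequest(ctx, peer, ref)
		}
		select {
		case <-fetched:
			return nil // chunk arrived, however it got here
		case <-time.After(searchTimeout):
			// search timeout: loop and try findPeer again
		case <-ctx.Done():
			return ctx.Err() // only the global timeout ends the request with an error
		}
	}
}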

Member Author

ErrNoSuitablePeer should be removed

Contributor

I don't really understand this. If n.RemoteGet errors, what are we waiting for... obviously we haven't made any request to remote peers, so?

If we don't care about any errors from n.RemoteGet, then we should change the interface. However, intuitively, if the chunk we are looking for does not exist, why should we hang until the global timeout expires (i.e. the timeout set on the context when the user calls NetStore.Get)?

Contributor

"no peers available" means that all suitable peers have been tried and none responded within the searchTimeout. If this happens before the global timeout, then NetStore gets this error, and not "context deadline exceeded".

Member Author

Let me try to clarify.
Your line of thinking is based on the assumption that

  • a request prompts you to actively look for the chunk
  • if these searches are exhausted, you can conclude the chunk is not found

However, this is not what your incentive tells you to do.

  • the requestor tells you they require a chunk and are willing to pay for it (up to the request TTL; the proxy for this is currently the request timeout)
  • therefore you really want to be prepared to deliver the chunk to the requestor until the TTL expires,
    and you really don't care how it gets there. Simply put, you are not incentivised to close a request before you can no longer earn from the response.

Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.) This is because there are actually multiple ways the chunk can still appear in our store before the deadline:

  • a peer asked earlier delivers it (later than the search timeout)
  • new peers are connected which are suitable to ask and they successfully deliver
  • the chunk is pull synced to us as a result of neighbourhood reorganisation
  • the chunk is pull synced or push synced to us as a result of it being recently uploaded
  • the chunk is actually uploaded by our own node

This interpretation is much less motivated in traditional server-based retrieval, since an asset is either found or not found at the unique place it is meant to be.

Note also that your early-abort scheme could be useful if a distinct 404 response were sent back. However, if this were incentivised, peers could just frivolously respond with it and earn without doing work. A better way to implement this functionality (tell me if chunk x is currently found in the network) is by asking for a proof of custody with a short TTL.

Hope this helps

Contributor

Thanks for the detailed explanation of your thinking. Maybe I find it hard to wrap my head around because we currently have very little, close to none, of the incentivisation, or of peers paying or being willing to pay for requests.

> Surely, it does make sense to leave the request open even if you find no further suitable peers to ask. (Note that if there are no suitable peers to ask for a request, then the node should stop receiving retrieval requests.)

If there are no suitable peers for one chunk, it doesn't mean that there are no suitable peers for another request. So I am not sure why the node would stop receiving retrieval requests just because it can't seem to find peers that can deliver a particular chunk.

If chunk accounting were part of our tests, and we were optimising for nodes to get the highest amount of currency for chunk deliveries, then this would make sense, but we have none of that.


Viktor seems to be thinking about an edge case: the chunk isn't found, but we should not say that it isn't found, because there is a small chance that we encounter the chunk after we would have sent a not-found message but before the request times out on its own.

The risk is that this degrades the performance of the entire network (slow/sluggish responses) at minimal gain (if any).

@zelig: The incentive of the storer is to maximise total revenue, not to maximise the revenue of a particular retrieval request. It only makes sense to get rid of 404 messages entirely and let every request time out if this behaviour doesn't degrade the performance of the network overall to a degree that there are fewer retrieval requests to serve (fewer users) and thus total income is reduced.
To answer this we would really have to investigate the relative frequency of the chunk-appearing-just-in-time phenomenon. I suspect it occurs far more frequently in the tests than in real life.

@nonsense nonsense requested a review from cobordism June 27, 2019 13:12
@nonsense (Contributor)

Adding @homotopycolimit here as well, so that after this PR is reviewed and merged we are all on the same page on how retrieve requests work in Swarm. It seems to me that we have different expectations on how the code should behave.

@nonsense (Contributor)

@zelig please submit these changes against master so that the tests run. I don't mind if we move to this PR and away from the pushsync-anton branch, but merging things without having feedback from the tests is not a good idea.

func TestPushSyncSimulation(t *testing.T) {
	nodeCnt := 4
	chunkCnt := 500
	trials := 10
Contributor

I find this confusing. trial means that you try one and the same thing multiple times... a synonym for attempt.

However, what this test does is run 10 independent uploads and downloads and make sure that none of them fails. So I suggest we rename this to n or to testcases, but not trials.

Contributor

Also, in the context of Swarm, when I read trial I am thinking "we are probably uploading and trying to download immediately, which is why we need to retry multiple times, as chunks might not be synced yet" - which is not at all what is happening here.

@nonsense nonsense requested a review from nagydani June 28, 2019 13:27
- storer needs to take netstore not localstore to put the chunk
  so that fetchers created earlier could respond
}

// setup netstore
netStore := storage.NewNetStore(lstore, enode.HexID(hexutil.Encode(kad.BaseAddr())))
Member Author

@nonsense so this was the actual problem, mea culpa.

  • pss prox works ok and delivers the chunk in the right places
  • but the chunk is synced after only one receipt
  • a request can therefore reach a node before push sync reaches it, and that node is not asked again until the peer-skip delay times out.
  • the pushsync storer mistakenly used localstore rather than netstore to put the chunk, therefore no existing fetcher on the node was notified, and the request was only served when the same peer was asked again (see the sketch below).
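
For illustration only, a sketch of the difference, using a made-up minimal interface rather than the real pushsync/storage types:

package pushsync

import "context"

// Illustrative minimal types; the real pushsync storer and chunk types differ.
type Chunk interface{ Address() []byte }

// ChunkStore is the store the storer writes into. NetStore plays this role
// and, unlike LocalStore, notifies fetchers already waiting for the chunk.
type ChunkStore interface {
	Put(ctx context.Context, ch Chunk) (exists bool, err error)
}

type Storer struct {
	store ChunkStore // must be the NetStore, not the LocalStore
}

func (s *Storer) processChunkMsg(ctx context.Context, ch Chunk) error {
	// Putting via NetStore wakes any fetcher already waiting for this chunk, so
	// an open retrieve request on this node is served immediately instead of
	// waiting for the peer-skip delay and a repeated request to the same peer.
	_, err := s.store.Put(ctx, ch)
	return err
}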

Contributor

> a request can therefore reach a node before push sync reaches it, and that node is not asked again until the peer-skip delay times out.

How can a request reach a node before push sync, when we are explicitly waiting for push sync receipts before triggering any downloads?

> the pushsync storer mistakenly used localstore rather than netstore to put the chunk, therefore no existing fetcher on the node was notified, and the request was only served when the same peer was asked again

Good catch, I didn't notice that.

@nonsense nonsense force-pushed the pushsync-anton branch 11 times, most recently from 4ad628b to 2e6b9fc Compare July 10, 2019 13:58
@nonsense nonsense force-pushed the pushsync-anton branch 6 times, most recently from 8cc9bda to 2be809c Compare July 17, 2019 09:43
@acud (Contributor)

acud commented Jul 19, 2019

What's the status on this? Can we have a rebase?

@nonsense (Contributor)

@acud this PR was triggered by the failing tests on the pushsync PR. Since then we've found that in small networks, tests were failing because nodes with depth 0 were responding with a receipt for a chunk they are not really responsible for. With such a definition for returning receipts, a receipt gives no guarantee to the uploader apart from the fact that some node got their chunk, so it cannot be used to determine whether push syncing is complete and chunks are synced. We changed push syncing to send back receipts only if the chunk falls within the NN and there is no closer peer to the chunk: a stricter rule for sending back receipts that should give better guarantees on when syncing is actually complete.
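
For reference, a rough sketch of that stricter receipt rule; proximity here is the usual common-bit-prefix measure, and the depth and peer list are passed in explicitly only to keep the example self-contained (the real code reads them from the kademlia):

package pushsync

import "math/bits"

// proximity returns the proximity order (length of the common bit prefix)
// of two equal-length addresses.
func proximity(a, b []byte) int {
	for i := range a {
		if x := a[i] ^ b[i]; x != 0 {
			return i*8 + bits.LeadingZeros8(x)
		}
	}
	return len(a) * 8
}

// shouldSendReceipt sketches the stricter rule: a receipt is returned only if
// the chunk falls within this node's neighbourhood depth and no connected peer
// is closer to the chunk than this node.
func shouldSendReceipt(chunkAddr, baseAddr []byte, depth int, peerAddrs [][]byte) bool {
	po := proximity(chunkAddr, baseAddr)
	if po < depth {
		return false // chunk is not in our neighbourhood
	}
	for _, p := range peerAddrs {
		if proximity(chunkAddr, p) > po {
			return false // a connected peer is closer to the chunk than we are
		}
	}
	return true
}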

Viktor also found a bug that pushsync was using LocalStore and not NetStore to save chunks.


The only reason this PR is still open, as far as I know, is so that we decide whether we want to change the way retrieve requests time out, which should have been discussed yesterday, but I don't know what the outcome was.

@nonsense nonsense force-pushed the pushsync-anton branch 3 times, most recently from 8c07b70 to b1bab1e Compare July 29, 2019 10:03
@acud (Contributor)

acud commented Jul 30, 2019

@nonsense, @zelig, this is going out of date, has conflicts, and IMO is going nowhere.
@zelig, if I were to support your approach, it would be after seeing empirical proof that this approach is indeed necessary and, most of all, beneficial. IMHO right now it degrades network performance and user experience. The claim about maximising the profits of some supposed node is irrelevant at this point, since we don't have incentivisation enabled. Please close the PR if it is not actively being worked on.

@acud acud closed this Aug 4, 2019
@acud acud reopened this Aug 4, 2019
@zelig zelig added the on hold label Aug 14, 2019
@zelig zelig mentioned this pull request Aug 30, 2019
2 tasks
@nonsense (Contributor)

Closing this, as it has diverged from master a lot and relies on branches that have also not been aligned with master for months, such as pushsync-anton.

@nonsense nonsense closed this Sep 30, 2019