Skip to content

Conversation

@AnnaShaleva
Copy link
Member

Note: this PR is not for merge, since I've just discovered that this problem is already solved on Geth master (although the fix itself is not for our problem originally) and fresh bane-main.

Problem

Chain skeleton sync (batched headers sync) is impossible with the current block size. Once the new stale peer is trying to sync its headers, the following errors occurs during skeleton filling (some useless logs are omitted):

DEBUG[02-20|14:00:46.499] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=4       skip=0  reverse=false
DEBUG[02-20|14:00:46.908] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=2       skip=0  reverse=false
DEBUG[02-20|14:00:47.153] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=1       skip=0  reverse=false
DEBUG[02-20|14:00:47.426] Found common ancestor                    peer=8205ce1a number=0       hash=000000..000000
DEBUG[02-20|14:00:47.426] Directing header downloads               peer=8205ce1a origin=1
TRACE[02-20|14:00:47.427] Fetching skeleton headers                peer=8205ce1a count=192 from=1
DEBUG[02-20|14:00:47.427] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=128 fromnum=192     skip=191 reverse=false
DEBUG[02-20|14:00:47.427] Downloading receipts                     origin=1
DEBUG[02-20|14:00:47.427] Downloading block bodies                 origin=1
DEBUG[02-20|14:00:48.255] Filling up skeleton                      from=1
TRACE[02-20|14:00:48.256] Requesting new batch of headers          peer=8205ce1a from=1
DEBUG[02-20|14:00:48.256] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=192 fromnum=1       skip=0   reverse=false
...
TRACE[02-20|14:00:48.657] Skeleton filling not accepted            peer=8205ce1a7d69b885 from=1
DEBUG[02-20|14:00:48.657] Failed to deliver retrieved headers      peer=8205ce1a         err="delivery not accepted"
TRACE[02-20|14:00:48.657] Requesting new batch of headers          peer=8205ce1a         from=193
DEBUG[02-20|14:00:48.657] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=192 fromnum=193     skip=0   reverse=false
...

So there's a "delivery not accepted" error on attempt to build the skeleton from received batch of received headers. Extended logs give us that every batch of blocks contain only one header (the last from requested batch):

TRACE[02-20|14:00:48.656] Delivering headers                       accepted=false len(headers)=1 headers[0].Hash=5bb77b..06997d headers[0].Number=192 MaxHeaderFetch=192

So based on this information it's impossible to build the skeleton

Explaination

The problem is traced down to the node's peer. During headers retrieval peer fetches the last block of the batch from its cache and tries to retrieve the rest of them from ancients. Given the old maximum RLP header size constraint of 700 bytes per header, the peer can't retrieve the whole set of requested headers (MaxHeaderFetch, 192 headers each at least of 1006 bytes). Since the number of retrieved headers doesn't match the requested one, chain accessor returns only the first header that was retrieved from cache:

	// read remaining from ancients
	max := count * 700
	data, err := db.AncientRange(ChainFreezerHeaderTable, i+1-count, count, max)
	if err == nil && uint64(len(data)) == count {
		// the data is on the order [h, h+1, .., n] -- reordering needed
		for i := range data {
			rlpHeaders = append(rlpHeaders, data[len(data)-1-i])
		}
	}
	return rlpHeaders

Solution

Since ReadHeaderRange function specifies that the caller should limit the number of requested headers to prevent DoS attack, it's safe to remove the maximum bytes constraint.

Additional notes

After the fix I found that almost the same code change is present in the original Geth commit ported to NeoX node implementation in #130: 447945e. And thus, the problem should not be reproducible on the post-sync NeoX node.

**Problem**

Chain skeleton sync (batched headers sync) is impossible with
the current block size. Once the new stale peer is trying to sync its
headers, the following errors occurs during skeleton filling (some useless
logs are omitted):
```
DEBUG[02-20|14:00:46.499] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=4       skip=0  reverse=false
DEBUG[02-20|14:00:46.908] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=2       skip=0  reverse=false
DEBUG[02-20|14:00:47.153] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=1  fromnum=1       skip=0  reverse=false
DEBUG[02-20|14:00:47.426] Found common ancestor                    peer=8205ce1a number=0       hash=000000..000000
DEBUG[02-20|14:00:47.426] Directing header downloads               peer=8205ce1a origin=1
TRACE[02-20|14:00:47.427] Fetching skeleton headers                peer=8205ce1a count=192 from=1
DEBUG[02-20|14:00:47.427] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=128 fromnum=192     skip=191 reverse=false
DEBUG[02-20|14:00:47.427] Downloading receipts                     origin=1
DEBUG[02-20|14:00:47.427] Downloading block bodies                 origin=1
DEBUG[02-20|14:00:48.255] Filling up skeleton                      from=1
TRACE[02-20|14:00:48.256] Requesting new batch of headers          peer=8205ce1a from=1
DEBUG[02-20|14:00:48.256] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=192 fromnum=1       skip=0   reverse=false
...
TRACE[02-20|14:00:48.657] Skeleton filling not accepted            peer=8205ce1a7d69b885 from=1
DEBUG[02-20|14:00:48.657] Failed to deliver retrieved headers      peer=8205ce1a         err="delivery not accepted"
TRACE[02-20|14:00:48.657] Requesting new batch of headers          peer=8205ce1a         from=193
DEBUG[02-20|14:00:48.657] Fetching batch of headers                id=8205ce1a7d69b885 conn=staticdial      count=192 fromnum=193     skip=0   reverse=false
...
```

So there's a "delivery not accepted" error on attempt to build the skeleton
from received batch of received headers. Extended logs give us that every batch
of blocks contain only one header (the last from requested batch):
```
TRACE[02-20|14:00:48.656] Delivering headers                       accepted=false len(headers)=1 headers[0].Hash=5bb77b..06997d headers[0].Number=192 MaxHeaderFetch=192
```
So based on this information it's impossible to build the skeleton

**Explaination**

The problem is traced down to the node's peer. During headers retrieval
peer fetches the last block of the batch from its cache and tries to
retrieve the rest of them from ancients. Given the old maximum RLP header
size constraint of 700 bytes per header, the peer can't retrieve the whole
set of requested headers (MaxHeaderFetch, 192 headers each at least of
1006 bytes). Since the number of retrieved headers doesn't match the
requested one, chain accessor returns only the first header that was
retrieved from cache:
```
	// read remaining from ancients
	max := count * 700
	data, err := db.AncientRange(ChainFreezerHeaderTable, i+1-count, count, max)
	if err == nil && uint64(len(data)) == count {
		// the data is on the order [h, h+1, .., n] -- reordering needed
		for i := range data {
			rlpHeaders = append(rlpHeaders, data[len(data)-1-i])
		}
	}
	return rlpHeaders
```

**Solution**

Since `ReadHeaderRange` function specifies that the caller should limit the
number of requested headers to prevent DoS attack, it's safe to remove the maximum
bytes constraint.

**Additional notes**

After the fix I found that almost the same code change is present in the original
Geth commit ported to NeoX node implementation in #130:
447945e.
And thus, the problem should not be reproducible on the post-sync NeoX node.

Signed-off-by: Anna Shaleva <[email protected]>
@AnnaShaleva
Copy link
Member Author

@roman-khimov agree to roll this branch on T2? It's needed to be able to sync new node from genesis.

@roman-khimov
Copy link
Contributor

There is not a lot of options that we have. OK for T2. Maybe we need a temporary branch for it?

@AnnaShaleva
Copy link
Member Author

Maybe we need a temporary branch for it?

Also thought about it, here we have the same problem like in NeoGo. Let's create a separate branch.

@AnnaShaleva AnnaShaleva changed the base branch from bane-main to t2 February 20, 2024 12:46
@AnnaShaleva
Copy link
Member Author

AnnaShaleva commented Feb 20, 2024

Changed the base, let's use t2 branch for T2 testnet. PR can be merged then. t2 is based on 81ea1ce.

@roman-khimov
Copy link
Contributor

It'll go away with T3, no big deal.

@roman-khimov roman-khimov merged commit 62c42b3 into t2 Feb 20, 2024
@roman-khimov roman-khimov deleted the fix-skeleton-sync branch February 20, 2024 13:00
@AnnaShaleva AnnaShaleva added the bug Something isn't working label Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants