Description
In #2833 we reduced the frequency with which full states are stored in the hot database. However, this just works around the underlying issue: database writes take substantial time for full states.
In lieu of more drastic database restructuring I think we might be able to take the database write time off the critical path by serializing all our I/O and completing it on a dedicated background thread. We want to avoid a situation where out-of-order writes violate an invariant of the database like `block in db --> block's state in db` or `block in fork choice --> block in db`. I think we're in a good position to guarantee this by hooking `HotColdDB::do_atomically` to run in the background. For example, during block processing we would push the storage ops for the state and block in a single batch to the background thread. Later we may push fork choice to the background thread in a separate batch. Because `do_atomically` serializes requests (completes them in order), there's no way for the fork choice write to commit before the block/state write. In case of a crash (or shutdown), any incomplete I/O ops will just get dropped and the on-disk database will revert to whatever was most recently written successfully.
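To make the ordering argument concrete, here's a minimal, self-contained sketch of the idea, assuming a bounded channel feeding a dedicated writer thread. `StoreOp`, `BackgroundStore` and `commit_batch` here are simplified stand-ins, not Lighthouse's real `StoreOp<E>` / `HotColdDB` types:

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

// Simplified stand-in for the real storage ops.
#[derive(Debug)]
enum StoreOp {
    PutBlock(u64, Vec<u8>),
    PutState(u64, Vec<u8>),
}

struct BackgroundStore {
    // Bounded queue: if I/O saturates, `send` blocks and performance degrades
    // to today's synchronous behaviour instead of memory growing unbounded.
    tx: SyncSender<Vec<StoreOp>>,
    writer: JoinHandle<()>,
}

impl BackgroundStore {
    fn new(queue_len: usize) -> Self {
        let (tx, rx) = sync_channel::<Vec<StoreOp>>(queue_len);
        let writer = thread::spawn(move || {
            // Batches are committed strictly in enqueue order, so a fork choice
            // batch can never reach disk before the block/state batch sent
            // ahead of it. On shutdown the channel closes and any batches not
            // yet received are dropped, leaving the DB at the last commit.
            for batch in rx {
                commit_batch(&batch);
            }
        });
        Self { tx, writer }
    }

    /// Sketch of a backgrounded `do_atomically`: enqueue and return.
    fn do_atomically(&self, batch: Vec<StoreOp>) {
        self.tx.send(batch).expect("background writer alive");
    }
}

// Placeholder for the real atomic key-value write.
fn commit_batch(batch: &[StoreOp]) {
    println!("committing batch of {} ops: {:?}", batch.len(), batch);
}

fn main() {
    let store = BackgroundStore::new(16);
    // Block processing: block + state go to disk in a single batch.
    store.do_atomically(vec![
        StoreOp::PutState(1, vec![0; 4]),
        StoreOp::PutBlock(1, vec![1; 4]),
    ]);
    // Fork choice (or any later write) goes in a separate, later batch.
    store.do_atomically(vec![StoreOp::PutBlock(2, vec![2; 4])]);

    // Shutdown: close the queue and let the writer drain what it already has.
    let BackgroundStore { tx, writer } = store;
    drop(tx);
    writer.join().unwrap();
}
```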
The key part of this scheme is a background thread within the store which keeps a queue of `Vec<StoreOp<E>>` batches for completion. We should bound the size of this queue to keep memory usage under control in case of I/O saturation (at which point we block and performance returns to what it is currently). The other important thing to keep in mind is that writes need to be observable by other threads as soon as `do_atomically` returns. To achieve this I think we can cache the to-be-written blocks and states in memory and return them from `get_block`/`get_state`. Other writes are trickier to make observable, because at that level we only see generic `key -> value` mappings. We could limit the background writing to batches of blocks and states, and continue blocking for every other write (clearing the pending block/state queue before doing so). Or we could push the I/O queuing down a level into the key-value store, so that it caches the raw `key -> value` mappings in memory (à la `MemoryStore`) and the higher-level DB code doesn't need to change... This may actually be the cleanest + most generic option 🤔 (a rough sketch follows the list below). Potential downsides of the KV-queuing approach are:
- We pay a serialization/deserialization cost for writes/reads because the KV-store caches the bytes in memory rather than the objects.
- We can't take advantage of in-memory de-duplication of `BeaconState`s (if Persistent copy-on-write beacon states #2806 is implemented).
- We duplicate what fast KV-stores try to do anyway: write to memory (a mem-mapped file) first and flush to disk later (on eviction from the OS page cache). This isn't so much of a downside, as actually switching the BN's KV-store to MDBX would be a lot of work and require a breaking schema change (unlike a queue on top of LevelDB).
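For the KV-queuing variant, here's a rough read-through sketch (hypothetical names, not the actual key-value store trait) showing how queued writes become observable to readers as soon as the batch is accepted, before anything reaches disk:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Key = Vec<u8>;
type Value = Vec<u8>;

struct QueuedKvStore {
    // Raw bytes awaiting commit, à la MemoryStore; reads check this first.
    pending: Mutex<HashMap<Key, Value>>,
    // Stand-in for the on-disk LevelDB instance.
    disk: Mutex<HashMap<Key, Value>>,
}

impl QueuedKvStore {
    fn new() -> Self {
        Self {
            pending: Mutex::new(HashMap::new()),
            disk: Mutex::new(HashMap::new()),
        }
    }

    /// Queue a batch of writes. Returns once the batch is visible to readers,
    /// before anything reaches disk (a real version would also hand the batch
    /// to the background writer thread).
    fn put_batch(&self, batch: Vec<(Key, Value)>) {
        let mut pending = self.pending.lock().unwrap();
        for (k, v) in batch {
            pending.insert(k, v);
        }
    }

    /// Read-through: pending bytes shadow whatever is on disk.
    fn get(&self, key: &[u8]) -> Option<Value> {
        if let Some(v) = self.pending.lock().unwrap().get(key) {
            return Some(v.clone());
        }
        self.disk.lock().unwrap().get(key).cloned()
    }

    /// Called by the background writer once a batch has committed to disk.
    fn mark_committed(&self, keys: &[Key]) {
        let mut pending = self.pending.lock().unwrap();
        let mut disk = self.disk.lock().unwrap();
        for k in keys {
            if let Some(v) = pending.remove(k) {
                disk.insert(k.clone(), v);
            }
        }
    }
}

fn main() {
    let store = QueuedKvStore::new();
    store.put_batch(vec![(b"block:1".to_vec(), vec![0xaa])]);
    // Visible immediately, even though nothing has been flushed yet.
    assert_eq!(store.get(b"block:1"), Some(vec![0xaa]));
    store.mark_committed(&[b"block:1".to_vec()]);
    assert_eq!(store.get(b"block:1"), Some(vec![0xaa]));
}
```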
There are also potentially other issues with synchronising the cold DB and hot DB during migrations. We may need to block in such cases.