How to represent a sequence of bytes #39

@spl

I'm opening this issue to discuss the appropriate representation for a buffer (i.e. an arbitrary contiguous sequence of bytes) in terminus-store. The discussion should help me understand the motivation and mechanics of the current approach and gauge reactions to an alternative approach, which I propose at the end. Please feel free to comment on anything or to correct my understanding where necessary.

Currently, the predominant view of a buffer appears to be M: AsRef<[u8]>. This type implies two things:

  1. A given data: M has the operation data.as_ref() that returns &[u8]. This gives a read-only view of a buffer that can be shared between threads without the option of writing to it.
  2. The struct containing the data: M owns the value referencing the buffer. There is no borrowing of references here.
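To make these two points concrete, here is a minimal sketch of a struct that is generic over M: AsRef<[u8]>. The names ByteReader, first_byte and demo_lengths are hypothetical and only for illustration; they are not taken from terminus-store.

```rust
use std::sync::Arc;

/// Hypothetical container, generic over any owned value that can be viewed as bytes.
pub struct ByteReader<M: AsRef<[u8]>> {
    // Owned by the struct: no borrowed reference, so no lifetime parameter.
    data: M,
}

impl<M: AsRef<[u8]>> ByteReader<M> {
    pub fn new(data: M) -> Self {
        ByteReader { data }
    }

    /// Read-only access: `as_ref()` yields `&[u8]`, never `&mut [u8]`.
    pub fn first_byte(&self) -> Option<u8> {
        self.data.as_ref().first().copied()
    }

    pub fn len(&self) -> usize {
        self.data.as_ref().len()
    }
}

/// The same struct accepts a plain `Vec<u8>` or a reference-counted `Arc<[u8]>`.
pub fn demo_lengths() -> (usize, usize) {
    let owned = ByteReader::new(vec![1u8, 2, 3]);
    let shared = ByteReader::new(Arc::<[u8]>::from(vec![4u8, 5]));
    (owned.len(), shared.len())
}
```

Note that the bound by itself says nothing about Clone, Send or Sync; those have to be added separately wherever they are needed.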

This appears to have been changed from a previously predominant view of a buffer as a slice: data: &'a [u8] (1deedbf, bf6416b, ad7dd42, e5a50a0, c6a14f9). This view meant:

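  1. The data: &'a [u8] cannot be moved into a spawned thread unless 'a is 'static (std::thread::spawn requires 'static), so in practice it is hard to share between threads.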
  2. The view into the data lasts no longer than the buffer's owner, who has the 'a lifetime.
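For comparison, a minimal sketch of the slice-based view, focusing on the lifetime point. SliceReader, sum and demo_sum are hypothetical names, not the code from the commits listed above.

```rust
/// Hypothetical borrowed view; `'a` ties it to the buffer's owner.
pub struct SliceReader<'a> {
    data: &'a [u8],
}

impl<'a> SliceReader<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        SliceReader { data }
    }

    /// Sums the bytes, widening to u32 to avoid overflow on small inputs.
    pub fn sum(&self) -> u32 {
        self.data.iter().map(|&b| u32::from(b)).sum()
    }
}

pub fn demo_sum() -> u32 {
    let owner = vec![1u8, 2, 3];
    // The view borrows from `owner` and cannot outlive it: returning `view`
    // from this function would be rejected by the borrow checker.
    let view = SliceReader::new(&owner);
    view.sum()
}
```

Here the compiler tracks the relationship between the view and its owner, with no reference counting at runtime.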

Now, given that the buffers currently seem to be backed by one of the two following structs:

  • pub struct SharedVec(pub Arc<Vec<u8>>);
  • pub struct SharedMmap(Option<Arc<FileBacking>>);

both of which wrap an Arc, I presume that the data is being shared read-only between threads. (I'm not yet clear on where the sharing occurs, so if you want to enlighten me, I'd appreciate it!) If there were no sharing, I think the slice approach would be better, since (a) there is less runtime work to manage usage of the buffers and (b) the type system keeps track of the lifetimes.
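As a sketch of the kind of sharing I have in mind, the following passes clones of a SharedVec to spawned threads. The struct definition matches the one quoted above; the Clone derive, the AsRef<[u8]> impl and the sum_on_threads helper are my assumptions for illustration, not necessarily what terminus-store does.

```rust
use std::sync::Arc;
use std::thread;

#[derive(Clone)]
pub struct SharedVec(pub Arc<Vec<u8>>);

// Assumed impl: exposes the shared bytes as a read-only slice.
impl AsRef<[u8]> for SharedVec {
    fn as_ref(&self) -> &[u8] {
        self.0.as_slice()
    }
}

/// Hypothetical helper: each worker thread reads the same buffer and sums it.
pub fn sum_on_threads(buf: &SharedVec, workers: usize) -> Vec<u32> {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Cloning bumps the reference count; the bytes are not copied.
            let local = buf.clone();
            thread::spawn(move || {
                local.as_ref().iter().map(|&b| u32::from(b)).sum::<u32>()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

The runtime cost mentioned in (a) is the atomic reference count update on every clone and drop.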

I think using M: AsRef<[u8]> is a somewhat painful scheme for typing a buffer. It's too general and leads to trait bounds such as M: 'static + AsRef<[u8]> + Clone + Send + Sync in many places.

After doing some research, I think something like Bytes from the bytes crate would work better. Bytes is a thread-shareable container representing a contiguous sequence of bytes. It satisfies 'static + AsRef<[u8]> + Clone + Send + Sync. It also supports operations like split_to and split_off, which I think would work well when you want to segment a buffer into different representations. Replacing data: M with data: Bytes would make many of the trait bounds disappear.
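To illustrate the idea without pulling in a dependency, here is a std-only sketch of a Bytes-like struct. It is not the bytes crate: SharedBytes, from_vec, parse_header and assert_bounds are hypothetical names, and the backing store is assumed to be an Arc<[u8]>. The split_to and split_off methods follow the semantics of their Bytes counterparts: split_to returns the first `at` bytes and leaves the rest in self, while split_off returns the bytes from `at` onward and leaves the front in self.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a `Bytes`-like type: a shared buffer plus a window into it.
#[derive(Clone, Debug)]
pub struct SharedBytes {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBytes {
    pub fn from_vec(v: Vec<u8>) -> Self {
        let range = 0..v.len();
        SharedBytes { buf: Arc::from(v), range }
    }

    pub fn len(&self) -> usize {
        self.range.len()
    }

    /// Returns the first `at` bytes; `self` keeps the remainder. Panics if `at > len`.
    pub fn split_to(&mut self, at: usize) -> SharedBytes {
        assert!(at <= self.len(), "split_to out of bounds");
        let mid = self.range.start + at;
        let head = SharedBytes { buf: Arc::clone(&self.buf), range: self.range.start..mid };
        self.range.start = mid;
        head
    }

    /// Returns the bytes from `at` onward; `self` keeps the front. Panics if `at > len`.
    pub fn split_off(&mut self, at: usize) -> SharedBytes {
        assert!(at <= self.len(), "split_off out of bounds");
        let mid = self.range.start + at;
        let tail = SharedBytes { buf: Arc::clone(&self.buf), range: mid..self.range.end };
        self.range.end = mid;
        tail
    }
}

impl AsRef<[u8]> for SharedBytes {
    fn as_ref(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Compiles only if `T` satisfies the bounds that `M` carries today.
pub fn assert_bounds<T: 'static + AsRef<[u8]> + Clone + Send + Sync>() {}

/// With a concrete type, a function signature needs no trait bounds at all.
pub fn parse_header(mut data: SharedBytes) -> (SharedBytes, SharedBytes) {
    let header = data.split_to(2); // first two bytes
    (header, data) // `data` now holds the remainder
}
```

Both halves returned by a split point into the same allocation, so segmenting a buffer into several representations copies no bytes.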

Unfortunately, Bytes does not support memmap::Mmap, which means it would not suit terminus-store's current usage of AsRef<[u8]>. However, I've already implemented an adaptation of Bytes that does support memmap::Mmap. Others have, too. See tokio-rs/bytes#359.

Here are some questions prompted by the above:

  • What's the best way to represent a contiguous sequence of bytes in terminus-store?
  • Does it need to be read-only?
  • Does it need to be shared between threads?
  • Would it be useful to use a less general type than AsRef<[u8]>? Could that type be a struct instead of a set of trait bounds?
