
Proposal for ostree pull support (and object id maps) #141

@alexlarsson

Description

Background

Ostree updates come in two forms, delta or generic:

For a delta update, you go from a specific source commit to a new one. A delta is addressed by its source commit, so it can assume that all the objects in the source commit are already available, and it contains only (but all of) the remaining objects. Additionally, objects that are new and/or large can be stored as "fallback" objects, which are downloaded by the generic mechanism. (In practice, fallbacks are only used for file objects.)

Deltas can also be "from nothing", which means there is no source commit, so the delta contains everything (except the fallbacks).

For "generic" updates, we only have the commit id and we download each object separately CAS-style, recursing into other referenced objects. Normally we have existing objects stored locally, so we can stop recursing whenever we reach an object we already have locally available (modulo some .partialcommit bullshit that we need not care about here). This is in theory efficient, since we can stop early, but in practice it is very chatty over HTTP with many small connections, so it is not great.

Proposal for composefs-rs

For composefs-rs, the goal is to "support" everything, but focus on the delta case. I.e., the generic approach needs to work, but doesn't have to perform optimally.

At a high level, the ideal effect of a "composefs-rs pull ostree" operation would look something like this:

  • Each file object is stored as a separate content object, addressed by its
    fs-verity digest.
  • One splitstream that has all the metadata for the commit (commit, dirmeta,
    dirtree, file metadata objects) and a reference to the content object for
    each file metadata object.

From this we can efficiently generate an erofs/composefs image. Additionally, given such a splitstream, we have all that we need to implement a delta pull. We can parse the original splitstream to reconstruct the ostree objects from the source commit, and then reconstruct the remainder from the delta.

In the above operation we never download any objects that were in the previous version of the image. However, we still always download new files, whereas a regular ostree pull operation would notice cross-image sharing even at pull time and avoid those downloads. This is not critical, as everything still works and ends up with the same on-disk result, but it is not ideal, so we should have some way to optimize it. That way doesn't need to be 100% perfect, but it should be simple and robust.

To handle this, I propose adding objects to the composefs repo that are "object id maps". These would be used for file objects only (remember, those are the only fallback objects in deltas) and contain a mapping from the ostree file object checksum to the fs-verity digest of the content plus some extra data containing the file metadata (uid, gid, mode, xattrs). Such a file format should be efficient to mmap and perform object id lookups in. I have a more detailed proposal below.

Now, such a map file is really what we need for the ostree image splitstream as well. So, this implies we want to have this structure for the ostree commit in the repo:

  • One content object per file in the commit
  • One object id map that maps ostree file object ids to content objects + metadata. This is not a splitstream, so it doesn't actually reference the objects (in the repo gc sense); they are kept alive by the splitstream.
  • One splitstream, that contains:
    • all the metadata objects (commit, dirtree, dirmeta).
    • by-fs-verity references to all the file content objects (for GC keepalive).
    • by-fs-verity reference to an object-id-map.
  • One splitstream ref, like streams/ref/ostree/the/ostree/ref referencing the splitstream.
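
Sketched as Rust types (purely illustrative, the field names are mine):

```rust
type FsVerityDigest = [u8; 32];
type OstreeChecksum = [u8; 32];

/// Illustrative shape of the per-commit splitstream: metadata objects are
/// stored inline, and the fs-verity references keep the content objects
/// and the object id map alive across gc.
struct CommitSplitstream {
    metadata_objects: Vec<Vec<u8>>,    // commit, dirtree, dirmeta objects
    content_refs: Vec<FsVerityDigest>, // keepalive refs to file contents
    object_map_ref: FsVerityDigest,    // ref to the object id map
}

/// One logical entry in the object id map (the on-disk format is below).
/// The map itself does not keep its targets alive; the splitstream does.
struct ObjectMapEntry {
    object_id: OstreeChecksum, // ostree file object checksum
    fs_verity: FsVerityDigest, // digest of the content object
    extra_data: Vec<u8>,       // file metadata gvariant (uid, gid, mode, xattrs)
}
```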

Given the above it is easy to see how we can use this to generate an erofs, or to do the (non-optimal) delta update. But suppose that we also have one or more "cache" object id maps, of the same format as the one referenced by the splitstream. These could be used to opportunistically reuse already downloaded files from outside the image we are updating. They need not be perfect, but the better they are, the better we do. Such caches would be stored in the composefs repo and will (intentionally) not keep their targets alive, so after a lookup we have to check that the target object really exists.

For a regular (i.e. non-delta) update, the proposal is that we start with the current image version of the ref being updated, then download all the metadata objects that were not in the current version. Then we can use the cache object map to see which file objects we can avoid downloading, and download the rest. This is potentially more HTTP-chatty than ostree would be, since non-file objects are never reused between unrelated commits. However, in practice I think these are less commonly shared than the files themselves, and the primary use case is supposed to be the delta case anyway.
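
Put together, the flow would look roughly like this (made-up helper names throughout; the cache lookup is fleshed out in the next section):

```rust
type ObjectId = [u8; 32];

// Made-up helpers for illustration; not real API.
fn missing_metadata_objects(_commit: &ObjectId) -> Vec<ObjectId> {
    todo!("walk from the new commit, stopping at metadata objects we already have")
}
fn file_objects(_commit: &ObjectId) -> Vec<ObjectId> {
    todo!("file objects referenced by the new commit's dirtree objects")
}
fn cache_lookup(_id: &ObjectId) -> Option<[u8; 32]> {
    todo!("consult the cache object id maps, see the next section")
}
fn download(_id: &ObjectId) {
    todo!("one generic-style HTTP fetch")
}

/// The proposed non-delta pull: fetch only the metadata objects missing
/// relative to the current image, then skip every file object the cache
/// already maps to an existing content object.
fn pull_plain(commit: &ObjectId) {
    for meta in missing_metadata_objects(commit) {
        download(&meta);
    }
    for file in file_objects(commit) {
        if cache_lookup(&file).is_none() {
            download(&file);
        }
    }
}
```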

Cache management

To manage the caches, I propose that we have a splitstream image for the cache, named by the type of things it caches, say streams/refs/caches/ostree. Such a splitstream would reference N cache files and contain the list of the splitstream digests that are indexed. It also has a bloom filter to efficiently check whether an object id is contained in any of the cache files.
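
A lookup against such a cache would then look roughly like this (again, made-up types; the per-map lookup itself is sketched under "Object map format" below):

```rust
type ObjectId = [u8; 32];
type FsVerityDigest = [u8; 32];

// Made-up shapes for illustration.
struct ObjectIdMap; // one mmap'd cache file, format described below
struct BloomFilter;

struct OstreeCache {
    indexed: Vec<FsVerityDigest>, // splitstream digests covered by the cache
    bloom: BloomFilter,
    maps: Vec<ObjectIdMap>,
}

impl ObjectIdMap {
    fn lookup(&self, _id: &ObjectId) -> Option<FsVerityDigest> {
        todo!("bucket + binary search, see the format below")
    }
}

impl BloomFilter {
    fn may_contain(&self, _id: &ObjectId) -> bool {
        todo!("false means definitely absent; true means maybe present")
    }
}

fn object_exists(_digest: &FsVerityDigest) -> bool {
    todo!("check that the content object is actually in the composefs repo")
}

/// Cache lookup: the bloom filter short-circuits most misses, and since the
/// cache files intentionally don't keep their targets alive, any hit must be
/// verified against the repo before it can be trusted.
fn cache_lookup(cache: &OstreeCache, id: &ObjectId) -> Option<FsVerityDigest> {
    if !cache.bloom.may_contain(id) {
        return None;
    }
    cache.maps.iter().filter_map(|m| m.lookup(id)).find(object_exists)
}
```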

Such caches can be easily and cheaply updated incrementally: you just keep the old cache files and add a new one covering all the images that were not listed in the previous version of the cache. The bloom filter can also be extended. Generating the new cache file is cheap too; we just combine the pre-existing object maps referenced by the ostree splitstream images.

An incremental update will replace a splitstream ref with a new one, but not mutate any existing object file or splitstream. It is safe even when potentially concurrent (last update wins; the other is just wasted time). One would then semi-regularly, say during normal gc operations, create a non-incremental version of the cache file, to avoid it referencing a bunch of old objects that no longer exist and to limit the number of cache files it references.
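
The incremental update itself, continuing with the made-up types from the previous sketch:

```rust
// More made-up helpers; not real API.
fn merge_object_maps(_streams: &[FsVerityDigest]) -> ObjectIdMap {
    todo!("combine the object id maps already referenced by these splitstreams")
}
fn extend_bloom(_old: BloomFilter, _map: &ObjectIdMap) -> BloomFilter {
    todo!("the old filter plus bits for every object id in the new map")
}

/// Incremental cache update: old cache files are reused as-is, one new map
/// covers the splitstreams the previous cache version didn't index, and only
/// the cache's splitstream ref is replaced at the end.
fn update_cache(old: OstreeCache, all_streams: &[FsVerityDigest]) -> OstreeCache {
    let new_streams: Vec<FsVerityDigest> = all_streams
        .iter()
        .filter(|s| !old.indexed.contains(*s))
        .copied()
        .collect();
    if new_streams.is_empty() {
        return old; // nothing new to index
    }
    // No file contents need to be re-read for this; we only merge the
    // pre-existing per-commit object maps.
    let new_map = merge_object_maps(&new_streams);
    let bloom = extend_bloom(old.bloom, &new_map);

    let mut maps = old.maps;
    maps.push(new_map);
    let mut indexed = old.indexed;
    indexed.extend(new_streams);

    // Nothing existing was mutated; writing the new ref last means a
    // concurrent update races benignly (last writer wins).
    OstreeCache { indexed, bloom, maps }
}
```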

Object map format

Header:

[64bit offset to bucket] * 256

Each bucket, at some offset in the file referenced by the header, looks like:

[64bit bucket_count] 
  {
    [32byte object_id]
    [32byte fs_verity_hash]
    [64bit offset to extra_data]
    [64bit size of extra_data]
  } * bucket_count

The extra_data chunks would just be the ostree file metadata gvariant.

The buckets are split by the first byte of object_id, and each bucket is sorted by object_id so we can binary search it.
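
As a sanity check that this is easy to use, a lookup over the mmap'd file would be roughly the following (I'm assuming little-endian fields and file-relative extra_data offsets, which are not nailed down above, and skipping bounds checks for brevity):

```rust
use std::cmp::Ordering;

const ENTRY_SIZE: usize = 32 + 32 + 8 + 8; // object_id, fs_verity, offset, size

fn u64_le(buf: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(buf[off..off + 8].try_into().unwrap())
}

/// Look up an ostree file object id in the mmap'd map, returning the
/// fs-verity digest and the extra_data (file metadata gvariant) slice.
fn lookup<'a>(map: &'a [u8], object_id: &[u8; 32]) -> Option<([u8; 32], &'a [u8])> {
    // Header: 256 * 64-bit bucket offsets, indexed by the id's first byte.
    let bucket_off = u64_le(map, object_id[0] as usize * 8) as usize;
    let count = u64_le(map, bucket_off) as usize;
    let entries = &map[bucket_off + 8..bucket_off + 8 + count * ENTRY_SIZE];

    // The bucket is sorted by object_id, so binary search it.
    let (mut lo, mut hi) = (0, count);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let entry = &entries[mid * ENTRY_SIZE..(mid + 1) * ENTRY_SIZE];
        match entry[..32].cmp(&object_id[..]) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => {
                let fs_verity: [u8; 32] = entry[32..64].try_into().unwrap();
                let off = u64_le(entry, 64) as usize;
                let size = u64_le(entry, 72) as usize;
                return Some((fs_verity, &map[off..off + size]));
            }
        }
    }
    None
}
```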
