@derrickstolee

@derrickstolee derrickstolee commented Oct 29, 2024

Here is a full submission of the --path-walk feature for 'git pack-objects' and 'git repack'. It's been discussed in an RFC [1], as a future application for the path walk API [2], and is updated now that --name-hash-version=2 exists (as a replacement for the --full-name-hash option from the RFC) [3].

[1] https://lore.kernel.org/git/[email protected]/

[2] https://lore.kernel.org/git/[email protected]

[3] https://lore.kernel.org/git/[email protected]

This patch series does the following:

  1. Add a new '--path-walk' option to 'git pack-objects' that uses the path-walk API instead of the revision API to collect objects for delta compression.

  2. Add a new '--path-walk' option to 'git repack' to pass this option along to 'git pack-objects'.

  3. Add a new 'pack.usePathWalk' config option to opt into this option implicitly, such as in 'git push'.

  4. Optimize the '--path-walk' option using threading so it better competes with the existing multi-threaded delta compression mechanism.

  5. Update the path-walk API with a new 'edge_aggressive' option that closely mirrors the --objects-edge-aggressive option in the revision API. This is useful when creating thin packs inside shallow clones.

This feature works by using the path-walk API to emit groups of objects that appear at the same path. These groups are tracked so that their objects can be tested for delta compression against each other. After those groups are tested, a second pass using the name-hash attempts to find better (or first-time) deltas across path boundaries. This second pass is much faster than a fresh pass because the existing deltas limit the size of any potential new delta, short-circuiting the checks as soon as the delta size exceeds the current best.
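As an illustration only (this is not code from the series), the two passes described above can be sketched in Python. Everything here is a toy assumption: `Obj`, `toy_name_key`, `delta_size`, and `find_deltas` are hypothetical names, the "delta cost" is a simple count of differing bytes rather than a real delta encoding, and the largest-first ranking is only there to keep the toy delta chains acyclic.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Obj(NamedTuple):
    oid: str
    path: str
    data: bytes


def toy_name_key(path: str) -> str:
    # Hypothetical stand-in for a name hash: only the basename matters.
    return path.rsplit("/", 1)[-1]


def delta_size(base: bytes, target: bytes, limit: int) -> Optional[int]:
    # Toy delta cost: number of target bytes that differ from base at the
    # same offset. Gives up as soon as the cost reaches `limit`, which is
    # the short-circuit against the current best.
    if limit <= 0:
        return None
    cost = 0
    for i, byte in enumerate(target):
        if i >= len(base) or base[i] != byte:
            cost += 1
            if cost >= limit:
                return None
    return cost


def find_deltas(objects, window=10):
    # Global order (largest first): a base must outrank its target, so the
    # toy delta chains cannot form cycles.
    ordered = sorted(objects, key=lambda o: (-len(o.data), o.oid))
    rank = {o.oid: i for i, o in enumerate(ordered)}
    # oid -> (base oid or None, cost); start from "store the whole object".
    best = {o.oid: (None, len(o.data)) for o in objects}

    def scan(seq):
        for i, target in enumerate(seq):
            for base in seq[max(0, i - window):i]:
                if rank[base.oid] >= rank[target.oid]:
                    continue
                size = delta_size(base.data, target.data, best[target.oid][1])
                if size is not None:
                    best[target.oid] = (base.oid, size)

    # Pass 1: only compare objects that appear at the same path.
    groups = defaultdict(list)
    for o in ordered:
        groups[o.path].append(o)
    for group in groups.values():
        scan(group)

    # Pass 2: sort by the name key and look across path boundaries,
    # using each object's current best cost as the limit.
    scan(sorted(ordered, key=lambda o: (toy_name_key(o.path), rank[o.oid])))
    return best


objects = [
    Obj("o1", "pkg/a/CHANGELOG.md", b"alpha 1.0\nalpha 1.1\n"),
    Obj("o2", "pkg/a/CHANGELOG.md", b"alpha 1.0\n"),
    Obj("o3", "pkg/b/CHANGELOG.md", b"bravo 1.0\nbravo 1.1\n"),
    Obj("o4", "pkg/b/CHANGELOG.md", b"bravo 1.0\n"),
]
best = find_deltas(objects)
```

In this toy run, the first pass pairs `o2` with `o1` and `o4` with `o3` at zero cost. The second pass then finds a first-time delta for `o3` against `o1` across the path boundary, while every candidate for `o2` and `o4` is rejected immediately because nothing can beat their existing deltas.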

The benefits of the --path-walk feature first come into play when the name-hash function has many collisions, so that sorting by name-hash value leads to unhelpful groupings of objects. Much of this is mitigated by --name-hash-version=2, but collisions still exist with any hash-based approach. There are also performance benefits in some cases due to the isolation of delta compression testing within path groups.
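To make the collision problem concrete, here is a small Python sketch modeled on my understanding of the version 1 name hash, in which effectively only the last sixteen non-whitespace characters of a path determine the value. The function name `name_hash_v1` and the example paths are hypothetical, and the exact formula should be treated as an assumption rather than a copy of Git's source.

```python
def name_hash_v1(path: str) -> int:
    # Assumed shape of the version 1 hash: each non-whitespace byte is
    # added into the top bits while older bytes are shifted out two bits
    # at a time, so the end of the path dominates the result.
    h = 0
    for c in path.encode():
        if c in b" \t\n\v\f\r":
            continue
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


# Two different files whose paths share a long common suffix.
path_a = "packages/a/src/components/index.ts"
path_b = "packages/b/src/components/index.ts"

same_hash = name_hash_v1(path_a) == name_hash_v1(path_b)
same_path = path_a == path_b
```

Here `same_hash` is true while `same_path` is false: sorting by this hash places the two unrelated `index.ts` histories next to each other, whereas grouping by full path keeps them apart.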

All of the benefits of the --path-walk feature are less dramatic when compared to --name-hash-version=2, but they can still exist in many cases. I have also seen some cases where --name-hash-version=2 compresses better than --path-walk with --name-hash-version=1, but these options can be combined to get the best of both worlds.

Detailed statistics are provided within patch messages, but a few are highlighted here:

The microsoft/fluentui repository is a public JavaScript repo that suffers from many of the same name-hash collisions as internal repositories I've worked with. Here is a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      28.16s

Less dramatic, but perhaps more conventionally structured, is the nodejs/node repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk              698.0M      75.10s

Even the Linux kernel repository gains some benefit, although its number of hash collisions is relatively low due to a preference for short filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk                2.2G     559.00s

The drawback of the --path-walk feature is that it will be harder to integrate with bitmap features, specifically delta islands. This is not insurmountable, but would require more work, such as a revision walk to paint objects with reachability information before using that information during delta computations.

However, there should still be significant benefits to Git clients trying to save space and improve local performance.

This feature was shipped with similar features in microsoft/git as of v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo that had significant repository growth due to constructing a batch of beachball [5] CHANGELOG.[md|json] files and pushing them to a release branch. These pushes were frequently 70-200 MB due to poor delta compression. Using the 'pack.usePathWalk=true' config, these pushes dropped in size by 100x while improving performance. Since these CI machines were working with a shallow clone, the 'edge_aggressive' changes were required to enable the path-walk option.

[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3

[5] https://github.com/microsoft/beachball

Updates in v2

  • Re-added a dropped comment when moving code in patch 1.
  • Updated documentation to include interaction with --use-bitmap-index.
  • An UNUSED parameter is now used, reducing the use of global variables slightly.

Updates in v3

Thanks for the review, Taylor. Sorry for my delay in getting back to your feedback.

  • Documentation has been edited slightly for simplicity.
  • is_oid_interesting() was swapped to is_oid_uninteresting().
  • sub_list_size was renamed to sub_list_nr.
  • Several uint32_t and uint64_t variables were converted to size_t.
  • Several 'unsigned int' variables were required to stay as-is, for now, until a refactor can be done.
  • An unnecessary update of tag_objects was removed.
  • The logic and error message around incompatible options is simpler.
  • Tests are expanded, especially around config options.
  • Fixed commit message typos.
  • Extra care around ALLOC_ARRAY() to avoid a zero- or negative-length array.

Thanks,
-Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]

@derrickstolee derrickstolee self-assigned this Oct 29, 2024
@derrickstolee derrickstolee force-pushed the api-upstream branch 3 times, most recently from 781b2ea to ef54342 Compare December 18, 2024 16:13
@derrickstolee derrickstolee changed the base branch from api-upstream to master March 3, 2025 19:40
@derrickstolee derrickstolee force-pushed the path-walk-upstream branch 3 times, most recently from 26e1afb to 2eb9250 Compare March 9, 2025 21:55
@derrickstolee

/submit

@gitgitgadget

gitgitgadget bot commented Mar 10, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1819/derrickstolee/path-walk-upstream-v1

To fetch this version to local tag pr-1819/derrickstolee/path-walk-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1819/derrickstolee/path-walk-upstream-v1

@gitgitgadget

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.

Very nice.

> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M      87.24s
> Hash v2             161.7M      21.51s
> Path Walk           142.5M      28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                739.9M      71.18s
> Hash v2                764.6M      67.82s
> Path Walk              698.0M      75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                  2.5G     554.41s
> Hash v2                  2.5G     549.62s
> Path Walk                2.2G     559.00s

This third one, v2 not performing much better than v1, is quite
surprising.

> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.

Sure.  More experiments and more approaches will eventually give us
overall improvement.  I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).

> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.

Nice, thanks.

@gitgitgadget

gitgitgadget bot commented Mar 10, 2025

This patch series was integrated into seen via git@e51880c.

@gitgitgadget gitgitgadget bot added the seen label Mar 10, 2025
@gitgitgadget

gitgitgadget bot commented Mar 11, 2025

This branch is now known as ds/path-walk-2.

@gitgitgadget

gitgitgadget bot commented Mar 11, 2025

This patch series was integrated into seen via git@28416f0.

@gitgitgadget

gitgitgadget bot commented Mar 11, 2025

This patch series was integrated into seen via git@4fc875f.

@gitgitgadget

gitgitgadget bot commented May 30, 2025

This patch series was integrated into seen via git@fe28f74.

@gitgitgadget

gitgitgadget bot commented May 31, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Comments?
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 2, 2025

This patch series was integrated into seen via git@ed40d39.

@gitgitgadget

gitgitgadget bot commented Jun 2, 2025

This patch series was integrated into seen via git@e78edc7.

@gitgitgadget

gitgitgadget bot commented Jun 3, 2025

This patch series was integrated into seen via git@e24b3f8.

@gitgitgadget

gitgitgadget bot commented Jun 4, 2025

This patch series was integrated into seen via git@4aae12c.

@gitgitgadget

gitgitgadget bot commented Jun 5, 2025

This patch series was integrated into seen via git@1c6c6c0.

@gitgitgadget

gitgitgadget bot commented Jun 5, 2025

This patch series was integrated into next via git@e59d4b1.

@gitgitgadget gitgitgadget bot added the next label Jun 5, 2025
@gitgitgadget

gitgitgadget bot commented Jun 5, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 6, 2025

This patch series was integrated into seen via git@0481447.

@gitgitgadget

gitgitgadget bot commented Jun 7, 2025

This patch series was integrated into seen via git@20b0ce2.

@gitgitgadget

gitgitgadget bot commented Jun 7, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 8, 2025

This patch series was integrated into seen via git@258d7b6.

@gitgitgadget

gitgitgadget bot commented Jun 9, 2025

This patch series was integrated into seen via git@1476d75.

@gitgitgadget

gitgitgadget bot commented Jun 9, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 10, 2025

This patch series was integrated into seen via git@4864b2c.

@gitgitgadget

gitgitgadget bot commented Jun 10, 2025

This patch series was integrated into seen via git@0654674.

@gitgitgadget

gitgitgadget bot commented Jun 11, 2025

This patch series was integrated into seen via git@b4ef194.

@gitgitgadget

gitgitgadget bot commented Jun 12, 2025

This patch series was integrated into seen via git@3683f76.

@gitgitgadget

gitgitgadget bot commented Jun 12, 2025

This patch series was integrated into seen via git@d189351.

@gitgitgadget

gitgitgadget bot commented Jun 13, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 13, 2025

This patch series was integrated into seen via git@bd70b9d.

@gitgitgadget

gitgitgadget bot commented Jun 16, 2025

This patch series was integrated into seen via git@a210b57.

@gitgitgadget

gitgitgadget bot commented Jun 16, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

@gitgitgadget

gitgitgadget bot commented Jun 17, 2025

This patch series was integrated into seen via git@88134a8.

@gitgitgadget

gitgitgadget bot commented Jun 17, 2025

This patch series was integrated into master via git@88134a8.

@gitgitgadget gitgitgadget bot added the master label Jun 17, 2025
@gitgitgadget gitgitgadget bot closed this Jun 17, 2025
@gitgitgadget

gitgitgadget bot commented Jun 17, 2025

Closed via 88134a8.
