@Kobzol Kobzol commented Dec 12, 2023

This commit adds a cache that remembers whether a given path is a file or a directory, based on the results of std::fs::read_dir. This reduces the number of executed syscalls and improves the performance of the library.
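To illustrate the idea, here is a minimal sketch of such a cache. It is not the code from this commit: `CachedPath`, `from_dir_entry` and `list_dir_cached` are hypothetical names chosen for the example. It relies on `DirEntry::file_type`, which on most platforms can answer from the data `read_dir` already returned, so a separate stat call is only needed for symlinks.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical wrapper: a path plus the cached answer to "is this a directory?".
#[derive(Debug)]
struct CachedPath {
    path: PathBuf,
    is_directory: bool,
}

impl CachedPath {
    /// Build the wrapper from a `read_dir` entry, reusing the entry's file type.
    fn from_dir_entry(entry: &fs::DirEntry) -> CachedPath {
        let path = entry.path();
        let is_directory = match entry.file_type() {
            // A symlink has to be followed, so it still costs a metadata lookup.
            Ok(ft) if ft.is_symlink() => {
                fs::metadata(&path).map(|m| m.is_dir()).unwrap_or(false)
            }
            // Regular files and directories are answered by the entry itself.
            Ok(ft) => ft.is_dir(),
            // Assumption for this sketch: treat unreadable entries as non-directories.
            Err(_) => false,
        };
        CachedPath { path, is_directory }
    }
}

/// List a directory once and remember the kind of every entry.
fn list_dir_cached(dir: &Path) -> io::Result<Vec<CachedPath>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        out.push(CachedPath::from_dir_entry(&entry?));
    }
    Ok(out)
}
```

Later matching steps can then check `is_directory` on the wrapper instead of asking the filesystem again for every path.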

Here is a simple benchmark that uses glob to count the Rust files in the tests directory of a rustc checkout.

```rust
fn main() {
    let count = glob::glob("<rustc-root>/tests/**/*.rs")
        .unwrap()
        .count();
    println!("File count: {count}");
}
```

Results on my PC (approximately 19k Rust files are in that directory):

| Version | Syscall count | statx syscall count | Time |
|---------|---------------|---------------------|--------|
| Before | 41586 | 34468 | ~130ms |
| After | 7131 | 11 | ~70ms |

Syscalls were counted with `strace <program> 2> out.txt && cat out.txt | wc -l`, and time was measured using hyperfine.

Fixes: #79

This pull request was created in cooperation with students of the Rust course at VSB-TUO university.

@the8472 the8472 merged commit 4172399 into rust-lang:master Jan 2, 2024
@Kobzol Kobzol deleted the cache-dir branch January 2, 2024 22:35
osiewicz added a commit to osiewicz/glob that referenced this pull request Apr 27, 2024
Background:
While working with cargo, I noticed that `cargo clean -p` takes ~30s with a large enough target directory (~200GB).
Profiling showed that most of the time was spent retrieving the paths for removal in https://github.com/rust-lang/cargo/blob/eee4ea2f5a5fa1ae184a44675315548ec932a15c/src/cargo/ops/cargo_clean.rs#L319 (and not actually removing the files).

Change description:
In the call to `.sort_by`, we repeatedly parse the paths to obtain file names for comparison. This commit caches the file names in the `PathWrapper` object, akin to rust-lang#135, which did the same for dir info.

For my use case, a cargo build using that branch takes ~14s to clean files instead of the previous ~30s (I've measured against main branch of this directory, to account for changes made since 0.3.1). Still not ideal, but hey, we're shaving roughly 50% off the time in exchange for slightly heavier memory use.
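
As a rough illustration of that change (a sketch only, not the code from the commit), the file name can be extracted once when the wrapper is built, so the comparator passed to `.sort_by` only compares stored values. `NamedPath` and `sort_by_file_name` are hypothetical names used for this example.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical wrapper that stores the file name next to the path.
struct NamedPath {
    path: PathBuf,
    /// Parsed once at construction; `None` for paths without a final component.
    file_name: Option<OsString>,
}

impl NamedPath {
    fn new(path: PathBuf) -> NamedPath {
        // `Path::file_name` walks the path components, so do it a single time here.
        let file_name = path.file_name().map(|name| name.to_os_string());
        NamedPath { path, file_name }
    }
}

/// Sort by the cached file names; the comparator no longer re-parses any path.
fn sort_by_file_name(paths: &mut [NamedPath]) {
    paths.sort_by(|a, b| a.file_name.cmp(&b.file_name));
}
```

The trade-off is the one mentioned above: each wrapper holds an extra owned `OsString`, which costs memory but removes the repeated parsing from every comparison.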
osiewicz added a commit to osiewicz/glob that referenced this pull request Dec 30, 2024
Development

Successfully merging this pull request may close these issues.

Excessive stat syscalls on linux
