Conversation

@BlakeOrth
Contributor

Which issue does this PR close?

POC for:

Rationale for this change

This is a POC for initial feedback and is not intended for merge at this time.

What changes are included in this PR?

  • Implements a POC version of a default ListFilesCache
  • Refactors the existing ListFilesCache to mirror the MetadataCache by defining a new trait instead of a fixed type wrapping a trait
  • Bounds the size of the cache based on number of entries
  • Expires entries in the cache after a default timeout duration

Are these changes tested?

This code is functional and some tests clearly show a reduction in Object Store requests! However, existing tests are broken around INSERT commands, which is a key point of discussion that needs to be covered.

Are there any user-facing changes?

Yes, this work will likely break the existing ListFilesCache public API.

Additional Context

This PR is a basic functional implementation that heavily mirrors the existing MetadataCache and its semantics. One key omission that needs to be discussed is how INSERT statements are handled. On the surface there appear to be two options:

  1. Invalidate the cache key(s) associated with the table that corresponds to the INSERT statement
  2. Intercept the object(s) that correspond to an INSERT statement and add them to the cache

The first option seems much easier, but the second seems preferable since a user is likely to issue a query against newly inserted data. Any input here, or other strategies for handling inserts that I haven't thought of, would be great!
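
To make option 1 more concrete, here is a rough sketch of what invalidation could look like; the `ListFilesCacheSketch` trait and its `remove` method below are placeholders for illustration, not the trait defined in this PR:

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectMeta};

// Hypothetical minimal surface for illustration only; the real trait in this
// PR has a richer interface (get/put/remove, TTL, limits, etc.).
trait ListFilesCacheSketch {
    fn remove(&self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>>;
}

// Option 1: after an INSERT has written its objects, drop any cached listing
// for the table's path(s) so the next query performs a fresh LIST.
fn invalidate_on_insert(cache: &dyn ListFilesCacheSketch, table_paths: &[Path]) {
    for path in table_paths {
        cache.remove(path);
    }
}
```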

I will also leave some inline comments around some TODO items that I think should be discussed.

cc @alamb @alchemist51

@github-actions bot added the core (Core DataFusion crate) and execution (Related to the execution crate) labels Nov 21, 2025
Comment on lines +164 to +166
// TODO: config
512 * 1024,
Duration::new(600, 0),
Contributor Author

This POC doesn't implement any of the user configuration. This seems like a good opportunity to divide the work on this effort! We could get the base DefaultListFilesCache approved for merge without user configuration and leave it disabled; user configuration could then be added by anyone who wants to contribute.
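
For reference, the knobs in question might eventually look something like the sketch below; the struct name, fields, and optional TTL are assumptions, not the actual DataFusion config options:

```rust
use std::time::Duration;

// Placeholder shape for the user-facing configuration discussed here; names
// and semantics are assumptions, not the eventual DataFusion options.
pub struct ListFilesCacheConfig {
    /// Maximum number of cached listings (see the bytes-vs-entries thread below).
    pub limit: usize,
    /// How long an entry is considered fresh; `None` could mean "never expires".
    pub ttl: Option<Duration>,
}

impl Default for ListFilesCacheConfig {
    fn default() -> Self {
        Self {
            limit: 128 * 1024,                   // mirrors DEFAULT_LIST_FILES_CACHE_LIMIT
            ttl: Some(Duration::from_secs(600)), // mirrors DEFAULT_LIST_FILES_CACHE_TTL
        }
    }
}
```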

Contributor

I agree

Another thing we should do is add some way to introspect the cache as part of datafusion-cli.

Contributor Author

Yes, I think we could also address some of the more complex desires for this cache, like being "prefix aware" in some way, as follow-on issues once this initial scaffolding exists.

Comment on lines +145 to +146
pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects
pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min
Contributor Author

The actual default values here probably need to be discussed. These seemed relatively sane to me, but any input here on what values these should have to best accommodate a variety of workflows would be useful feedback.

Contributor

My personal opinion is that we should set TTL much longer (maybe even infinite) by default. Users who know their files are changing are likely going to have to crank it down from a default anyways, so we might as well make the default behavior more deterministic

Contributor Author

Hmmm, my initial thought here was that there are certainly going to be users who don't read the release notes and whose expected workflow of new objects being picked up on every query will be broken. With a relatively short TTL their data will at least appear eventually, which may be preferable to not at all. However, I really like the strategy of using the most deterministic solution as the default (infinite), especially as the long-term solution.

Contributor

their expected workflow of new objects being picked up on every query will be broken.

I agree with this

On the other hand, I am not sure what your patience level is but it is very unlikely I would wait 10 minutes... If it doesn't work within a few seconds, I would probably go start looking for problems.


pub struct DefaultListFilesCacheState {
lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>,
capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"?
Contributor Author

I didn't feel like limiting this cache by "bytes" really made sense because the data stored in the cache is generally very uniform in size, perhaps aside from the path. I felt that it was probably small enough that simply limiting it by the number of entries should suffice, and "entries" seems like it would be easier for users to configure.

Contributor

I think that bytes makes more sense, mostly because that is the resource that is actually used. A count of items is a proxy for resource usage.
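
If we did account in bytes, a rough per-entry estimate could look like the sketch below (an approximation for illustration only; it ignores e_tag/version strings and allocator overhead, and is not code from this PR):

```rust
use std::mem::size_of;
use std::sync::Arc;

use object_store::ObjectMeta;

// Approximate the heap footprint of one cached listing: the metadata structs
// plus the path strings they own.
fn approx_entry_bytes(entry: &Arc<Vec<ObjectMeta>>) -> usize {
    entry
        .iter()
        .map(|meta| {
            let path_str: &str = meta.location.as_ref();
            size_of::<ObjectMeta>() + path_str.len()
        })
        .sum()
}
```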

Comment on lines +22 to 23
// TODO: driveby-cleanup
/// The cache accessor, users usually working on this interface while manipulating caches.
Contributor Author

I noticed this doc comment could be edited for additional clarity, so I figured while we were in this area of code we could improve this!

@BlakeOrth marked this pull request as draft November 21, 2025 01:01
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
Contributor

I think one of the things I would like to do in the Cache Manager caches is to segregate the cache eviction policy. Personally I think the user should be given an option for the eviction behaviour they want. wdyt @alamb @BlakeOrth? I can work on getting a draft out this weekend.
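
As a very rough illustration of the idea, the policy could be expressed as its own small trait, roughly like this (the trait name and methods are hypothetical, not part of this PR):

```rust
use object_store::path::Path;

// Hypothetical sketch of a pluggable eviction policy, separated from the
// cache's storage.
trait EvictionPolicy: Send + Sync {
    /// Record that `key` was inserted or accessed so the policy can update
    /// its bookkeeping (e.g. recency for LRU, counts for LFU).
    fn on_access(&mut self, key: &Path);

    /// Choose the next key to evict when the cache exceeds its limit.
    fn select_victim(&mut self) -> Option<Path>;
}
```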

Contributor Author

I think this is a reasonable suggestion, and it looks like there may be some ongoing discussion around this topic here:

While that work would undoubtedly impact this (and other) default cache implementations, I think it should probably be discussed separately from this effort.

Contributor

I am personally in favor of keeping what is in DataFusion as simple to use / understand as possible (which i think this and the metadata cache do)

In terms of customizable eviction strategy, as we have mentioned elsewhere that is already possible today, but it requires effectively copying/forking the entire cache implementation which adds to the maintenance burden of downstream projects

However, adding more APIs to DataFusion increases the maintenance burden on the core project

So I see customizable eviction strategies as a tradeoff. If there are multiple users who are likely to use a customizable eviction strategy, I agree it makes sense to put it in the core repo. If there are not that many, I think it would be better to keep DataFusion simpler and move the burden downstream for those users who need it.

Contributor Author

@alamb After reading your thoughts on this I'm wondering if a more complex cache infrastructure would be a good project to start in https://github.com/datafusion-contrib and if it gains enough traction perhaps it could be promoted to the core repo?

Contributor
@alamb Nov 21, 2025

Yes, that sounds like a great idea to me -- I think there is a wide, wonderful world of complex caching strategies that people might want.

table_files_statistics_cache: Default::default(),
list_files_cache: Default::default(),
list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT,
list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL,
Contributor

Some use cases don't need a TTL; we should provide a way to keep it disabled as well.

Contributor

Well, thinking about it more, I understand why you have kept it. I feel it diverges from the metadata cache and could confuse end users somewhat.

Contributor Author

I actually agree with your first assessment that TTL can, and likely should, be optional if a user has a use case where they know the underlying objects are immutable once written (my personal use of DataFusion falls into this category). In either case, we may have to accept some differences between this cache and the Metadata cache. Unlike the Metadata cache, which can issue HEAD requests against objects to detect modification, there's no mechanism in an object store to inquire whether or not paths/sub-paths have been changed, so this cache will need to make some concessions around that limitation.

Contributor

I do agree having a "infinite TTL" is an important usecase

I don't have a strong opinion on how that is expressed in the config options
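
One possible shape (purely an assumption about the config, not this PR's code) is to make the TTL optional, with `None` meaning entries never expire:

```rust
use std::time::{Duration, Instant};

// Sketch: an optional TTL where `None` disables expiry entirely.
fn is_expired(inserted_at: Instant, ttl: Option<Duration>) -> bool {
    match ttl {
        Some(ttl) => inserted_at.elapsed() > ttl,
        None => false, // no TTL configured: entries never expire
    }
}
```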

Contributor
@alamb left a comment

This looks great to me, FWIW

Thank you @BlakeOrth and @alchemist51

// Returns the cache's object ttl.
fn cache_ttl(&self) -> Duration;

// Updates the cache with a new boject limit.
Contributor

Suggested change
// Updates the cache with a new boject limit.
// Updates the cache with a new object limit.

RequestCountingObjectStore()
Total Requests: 4
- LIST prefix=data
Total Requests: 3
Contributor

it is pretty neat to see this in action

RequestCountingObjectStore()
Total Requests: 2
- LIST prefix=data
Total Requests: 1
Contributor

Could we also add a test that shows that once cache eviction happens, a subsequent query does actually make a new LIST request?

Contributor Author

Yes, I think that's a good test case; admittedly, tests have more or less been ignored thus far. The only tests I actually focused on initially were these, since they both show (and helped me validate) that this code was actually performing internal caching as expected.
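
As a standalone sketch of the timing logic such a test would rely on (the real test would go through RequestCountingObjectStore and assert that the LIST count increases after expiry, so the assertions below are only placeholders):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// A fresh entry should be served from the cache; once the TTL has elapsed,
// the next access should miss and trigger a new LIST in the real test.
#[test]
fn entry_expires_after_ttl() {
    let ttl = Duration::from_millis(50);
    let inserted_at = Instant::now();
    assert!(inserted_at.elapsed() <= ttl); // fresh entry: cache hit, no new LIST
    sleep(Duration::from_millis(60));
    assert!(inserted_at.elapsed() > ttl); // expired entry: next access re-LISTs
}
```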

/// command on the local filesystem. This operation can be expensive,
/// especially when done over remote object stores.
///
/// See [`crate::runtime_env::RuntimeEnv`] for more details
Contributor

This is technically a breaking API change (a good one in my mind). I am just pointing it out

}
}

fn get(&mut self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
Contributor

I think we just changed some of these APIs to take &self rather than &mut self -- I am not sure if we want to do the same thing here.
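
If we do, the usual shape would be interior mutability, roughly like this sketch (placeholder names and a plain Mutex/HashMap rather than this PR's LruQueue):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use object_store::{path::Path, ObjectMeta};

// Sketch: interior mutability lets callers use `&self` while the cache still
// mutates its internal state (LRU bookkeeping, expiry, etc.) under the lock.
struct SketchCache {
    state: Mutex<HashMap<Path, Arc<Vec<ObjectMeta>>>>,
}

impl SketchCache {
    fn get(&self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
        self.state.lock().unwrap().get(key).cloned()
    }
}
```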

Contributor Author

Ah, thanks for pointing this out. I ran into and resolved one merge conflict related to that PR and didn't realize it may have had a subtly wider impact than just the method that caused a conflict on my rebase. I will make sure to revisit this.

_k: &Path,
_e: &Self::Extra,
) -> Option<Arc<Vec<ObjectMeta>>> {
panic!("Not supported DefaultListFilesCache get_with_extra")
Contributor

I can't remember what the rationale for this panic was. It seems to have come in via

Maybe @Ted-Jiang or @suremarc have some thoughts here

@BlakeOrth
Contributor Author

@alamb Thanks for the initial review and comments! I think there's one key decision that needs to be made before moving forward in earnest here, which is what action we should take with the cache for INSERT statements.

One key omission that needs to be discussed is how INSERT statements are handled. On the surface there appear to be two options:

  1. Invalidate the cache key(s) associated with the table that corresponds to the INSERT statement
  2. Intercept the object(s) that correspond to an INSERT statement and add them to the cache

The first option seems much easier, but the second seems preferable since a user is likely to issue a query against newly inserted data.

Do you have any thoughts or preferences on this?
