Conversation

@BlakeOrth
Contributor

Which issue does this PR close?

POC for:

Rationale for this change

This is a POC for initial feedback and is not intended for merge at this time.

What changes are included in this PR?

  • Implements a POC version of a default ListFilesCache
  • Refactors the existing ListFilesCache to mirror the MetadataCache by defining a new trait instead of a fixed type wrapping a trait
  • Bounds the size of the cache based on number of entries
  • Expires entries in the cache after a default timeout duration

Are these changes tested?

This code is functional and some tests clearly show a reduction in Object Store requests! However, existing tests are broken around INSERT commands, which is a key point of discussion that needs to be covered.

Are there any user-facing changes?

Yes, this work will likely break the existing ListFilesCache public API.

Additional Context

This PR is a basic functional implementation that heavily mirrors the existing MetadataCache and its semantics. One key omission that needs to be discussed is how INSERT statements are handled. On the surface there appear to be two options:

  1. Invalidate the cache key(s) associated with the table that corresponds to the INSERT statement
  2. Intercept the object(s) that correspond to an INSERT statement and add them to the cache

The first option seems much easier, but the second seems preferable since a user is likely to issue a query against newly inserted data. Any input here, or other strategies for handling inserts that I haven't thought of, would be great!
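
To make option 1 more concrete, here is a rough sketch of what invalidation could look like; the `ListFilesCacheSketch` trait and its `remove` method below are placeholders for illustration, not the trait defined in this PR:

```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectMeta};

// Hypothetical minimal surface for illustration only; the real trait in this
// PR has a richer interface (get/put/remove, TTL, limits, etc.).
trait ListFilesCacheSketch {
    fn remove(&self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>>;
}

// Option 1: after an INSERT has written its objects, drop any cached listing
// for the table's path(s) so the next query performs a fresh LIST.
fn invalidate_on_insert(cache: &dyn ListFilesCacheSketch, table_paths: &[Path]) {
    for path in table_paths {
        cache.remove(path);
    }
}
```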

I will also leave some inline comments around some TODO items that I think should be discussed.

cc @alamb @alchemist51

@github-actions bot added the core (Core DataFusion crate) and execution (Related to the execution crate) labels Nov 21, 2025
Comment on lines +164 to +166
// TODO: config
512 * 1024,
Duration::new(600, 0),
Contributor Author

This POC doesn't implement any of the user configuration. This seems like a good opportunity to divide the work on this effort! We could get the base DefaultListFilesCache approved for merge without user configuration and leave it disabled; user configuration could then be added by anyone who wants to contribute.
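
For reference, the knobs in question might eventually look something like the sketch below; the struct name, fields, and optional TTL are assumptions, not the actual DataFusion config options:

```rust
use std::time::Duration;

// Placeholder shape for the user-facing configuration discussed here; names
// and semantics are assumptions, not the eventual DataFusion options.
pub struct ListFilesCacheConfig {
    /// Maximum number of cached listings (see the bytes-vs-entries thread below).
    pub limit: usize,
    /// How long an entry is considered fresh; `None` could mean "never expires".
    pub ttl: Option<Duration>,
}

impl Default for ListFilesCacheConfig {
    fn default() -> Self {
        Self {
            limit: 128 * 1024,                   // mirrors DEFAULT_LIST_FILES_CACHE_LIMIT
            ttl: Some(Duration::from_secs(600)), // mirrors DEFAULT_LIST_FILES_CACHE_TTL
        }
    }
}
```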

Contributor

I agree

Another thing we should do is add some way to introspect the cache as part of datafusion-cli.

Contributor Author

Yes, I think we could also address some of the more complex desires for this cache, like being "prefix aware" in some way, as follow-on issues once this initial scaffolding exists.

Comment on lines +145 to +146
pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects
pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min
Contributor Author

The actual default values here probably need to be discussed. These seemed relatively sane to me, but any input here on what values these should have to best accommodate a variety of workflows would be useful feedback.

Contributor

My personal opinion is that we should set TTL much longer (maybe even infinite) by default. Users who know their files are changing are likely going to have to crank it down from a default anyways, so we might as well make the default behavior more deterministic

Contributor Author

Hmmm, my initial thought here was that there are certainly going to be users who don't read the release notes and whose expected workflow of new objects being picked up on every query will be broken. With a relatively short TTL their data will at least appear eventually, which may be preferable to not at all. However, I really like the strategy of using the most deterministic solution as the default (infinite), especially as the long-term solution.

Contributor

their expected workflow of new objects being picked up on every query will be broken.

I agree with this

On the other hand, I am not sure what your patience level is but it is very unlikely I would wait 10 minutes... If it doesn't work within a few seconds, I would probably go start looking for problems.


pub struct DefaultListFilesCacheState {
lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>,
capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"?
Contributor Author

I didn't feel like limiting this cache by "bytes" really made sense because the data stored in the cache is generally very uniform in size, perhaps aside from the path. I felt that it was probably small enough that simply limiting it by the number of entries should suffice, and "entries" seems like it would be easier for users to configure.

Contributor

I think that bytes makes more sense, mostly because that is the resource that is actually used. A count of items is a proxy for resource usage.
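
If we did account in bytes, a rough per-entry estimate could look like the sketch below (an approximation for illustration only; it ignores e_tag/version strings and allocator overhead, and is not code from this PR):

```rust
use std::mem::size_of;
use std::sync::Arc;

use object_store::ObjectMeta;

// Approximate the heap footprint of one cached listing: the metadata structs
// plus the path strings they own.
fn approx_entry_bytes(entry: &Arc<Vec<ObjectMeta>>) -> usize {
    entry
        .iter()
        .map(|meta| {
            let path_str: &str = meta.location.as_ref();
            size_of::<ObjectMeta>() + path_str.len()
        })
        .sum()
}
```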

Comment on lines +22 to 23
// TODO: driveby-cleanup
/// The cache accessor, users usually working on this interface while manipulating caches.
Contributor Author

I noticed this doc comment could be edited for additional clarity, so I figured while we were in this area of code we could improve this!

@BlakeOrth marked this pull request as draft November 21, 2025 01:01
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
Contributor

I think one of the things I would like to do in the Cache Manager caches is to segregate the cache eviction policy. Personally I think the user should be given an option for the eviction behaviour they want. wdyt @alamb @BlakeOrth? I can work on getting a draft out this weekend.
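
As a very rough illustration of the idea, the policy could be expressed as its own small trait, roughly like this (the trait name and methods are hypothetical, not part of this PR):

```rust
use object_store::path::Path;

// Hypothetical sketch of a pluggable eviction policy, separated from the
// cache's storage.
trait EvictionPolicy: Send + Sync {
    /// Record that `key` was inserted or accessed so the policy can update
    /// its bookkeeping (e.g. recency for LRU, counts for LFU).
    fn on_access(&mut self, key: &Path);

    /// Choose the next key to evict when the cache exceeds its limit.
    fn select_victim(&mut self) -> Option<Path>;
}
```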

Contributor Author

I think this is a reasonable suggestion, and it looks like there may be some ongoing discussion around this topic here:

While that work would undoubtedly impact this (and other) default cache implementations, I think it should probably be discussed separately from this effort.

Contributor

I am personally in favor of keeping what is in DataFusion as simple to use / understand as possible (which i think this and the metadata cache do)

In terms of customizable eviction strategy, as we have mentioned elsewhere that is already possible today, but it requires effectively copying/forking the entire cache implementation which adds to the maintenance burden of downstream projects

However, adding more APIs to DataFusion increases the maintenance burden on the core project

So I see customizable eviction strategies as a tradeoff. If there are multiple users who are likely to use a customizable eviction strategy, I agree it makes sense to put it in the core repo. If there are not that many, I think it would be better to keep DataFusion simpler and move the burden downstream for those users who need it.

Contributor Author

@alamb After reading your thoughts on this I'm wondering if a more complex cache infrastructure would be a good project to start in https://github.com/datafusion-contrib and if it gains enough traction perhaps it could be promoted to the core repo?

Contributor
@alamb Nov 21, 2025

Yes, that sounds like a great idea to me -- I think there is a wide, wonderful world of complex caching strategies that people might want.

table_files_statistics_cache: Default::default(),
list_files_cache: Default::default(),
list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT,
list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL,
Contributor

Some use cases don't need a TTL; we should provide a way to keep it disabled as well.

Contributor

Well, thinking about it more, I understand why you have kept it. I feel it diverges from the metadata cache and could confuse end users somewhat.

Contributor Author

I actually agree with your first assessment that TTL can, and likely should, be optional if a user has a use case where they know the underlying objects are immutable once written (my personal use of DataFusion falls into this category). In either case, we may have to accept some differences between this cache and the Metadata cache. Unlike the Metadata cache, which can issue HEAD requests against objects to detect modification, there's no mechanism in an object store to inquire whether or not paths/sub-paths have been changed, so this cache will need to make some concessions around that limitation.

Contributor

I do agree having a "infinite TTL" is an important usecase

I don't have a strong opinion on how that is expressed in the config options
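
One possible shape (purely an assumption about the config, not this PR's code) is to make the TTL optional, with `None` meaning entries never expire:

```rust
use std::time::{Duration, Instant};

// Sketch: an optional TTL where `None` disables expiry entirely.
fn is_expired(inserted_at: Instant, ttl: Option<Duration>) -> bool {
    match ttl {
        Some(ttl) => inserted_at.elapsed() > ttl,
        None => false, // no TTL configured: entries never expire
    }
}
```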

Contributor
@alamb left a comment

This looks great to me, FWIW

Thank you @BlakeOrth and @alchemist51

// Returns the cache's object ttl.
fn cache_ttl(&self) -> Duration;

// Updates the cache with a new boject limit.
Contributor

Suggested change
// Updates the cache with a new boject limit.
// Updates the cache with a new object limit.

RequestCountingObjectStore()
Total Requests: 4
- LIST prefix=data
Total Requests: 3
Contributor

it is pretty neat to see this in action

RequestCountingObjectStore()
Total Requests: 2
- LIST prefix=data
Total Requests: 1
Contributor

Could we also add a test that shows that once cache eviction happens, a subsequent query does actually make a new LIST request?

Contributor Author

Yes, I think that's a good test case; admittedly, tests have more or less been ignored thus far. The only tests I actually focused on initially were these, since they both show (and helped me validate) that this code was actually performing internal caching as expected.
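
As a standalone sketch of the timing logic such a test would rely on (the real test would go through RequestCountingObjectStore and assert that the LIST count increases after expiry, so the assertions below are only placeholders):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// A fresh entry should be served from the cache; once the TTL has elapsed,
// the next access should miss and trigger a new LIST in the real test.
#[test]
fn entry_expires_after_ttl() {
    let ttl = Duration::from_millis(50);
    let inserted_at = Instant::now();
    assert!(inserted_at.elapsed() <= ttl); // fresh entry: cache hit, no new LIST
    sleep(Duration::from_millis(60));
    assert!(inserted_at.elapsed() > ttl); // expired entry: next access re-LISTs
}
```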

/// command on the local filesystem. This operation can be expensive,
/// especially when done over remote object stores.
///
/// See [`crate::runtime_env::RuntimeEnv`] for more details
Contributor

This is technically a breaking API change (a good one in my mind). I am just pointing it out

}
}

fn get(&mut self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
Contributor

I think we just changed some of these APIs to take &self rather than &mut self -- I am not sure if we want to do the same thing here.
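
If we do, the usual shape would be interior mutability, roughly like this sketch (placeholder names and a plain Mutex/HashMap rather than this PR's LruQueue):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

use object_store::{path::Path, ObjectMeta};

// Sketch: interior mutability lets callers use `&self` while the cache still
// mutates its internal state (LRU bookkeeping, expiry, etc.) under the lock.
struct SketchCache {
    state: Mutex<HashMap<Path, Arc<Vec<ObjectMeta>>>>,
}

impl SketchCache {
    fn get(&self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
        self.state.lock().unwrap().get(key).cloned()
    }
}
```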

Contributor Author

Ah, thanks for pointing this out. I ran into and resolved one merge conflict related to that PR and didn't realize it may have had a subtly wider impact than just the method that caused a conflict on my rebase. I will make sure to revisit this.

_k: &Path,
_e: &Self::Extra,
) -> Option<Arc<Vec<ObjectMeta>>> {
panic!("Not supported DefaultListFilesCache get_with_extra")
Contributor

I can't remember what the rationale for this panic was. It seems to have come in via

Maybe @Ted-Jiang or @suremarc have some thoughts here

@BlakeOrth
Contributor Author

@alamb Thanks for the initial review and comments! I think there's one key decision that needs to be made before moving forward in earnest here, which is what action we should take with the cache for INSERT statements.

One key omission that needs to be discussed is how INSERT statements are handled. On the surface there appear to be two options:

  1. Invalidate the cache key(s) associated with the table that corresponds to the INSERT statement
  2. Intercept the object(s) that correspond to an INSERT statement and add them to the cache

The first option seems much easier, but the second seems preferable since a user is likely to issue a query against newly inserted data.

Do you have any thoughts or preferences on this?
