POC for DefaultListFilesCache #18855
base: main
Conversation
- Implements a POC version of a default `ListFilesCache`
- Refactors the existing `ListFilesCache` to mirror the `MetadataCache` by defining a new trait instead of a fixed type wrapping a trait
- Bounds the size of the cache based on the number of entries
- Expires entries in the cache after a default timeout duration
```rust
// TODO: config
512 * 1024,
Duration::new(600, 0),
```
This POC doesn't implement any of the user configuration. This seems like a good opportunity to divide the work on this effort! We could get the base DefaultListFilesCache approved for merge without user configuration (leaving it disabled), and user configuration could be added by anyone who wants to contribute.
I agree
Another thing we should do is add some way to introspect the cache as part of datafusion-cli.
Yes, I think we could also do some of the more complex desires for this cache, like being "prefix aware" in some way, as follow-on issues once this initial scaffolding exists.
```rust
pub(super) const DEFAULT_LIST_FILES_CACHE_LIMIT: usize = 128 * 1024; // ~130k objects
pub(super) const DEFAULT_LIST_FILES_CACHE_TTL: Duration = Duration::new(600, 0); // 10min
```
The actual default values here probably need to be discussed. These seemed relatively sane to me, but any input here on what values these should have to best accommodate a variety of workflows would be useful feedback.
My personal opinion is that we should set TTL much longer (maybe even infinite) by default. Users who know their files are changing are likely going to have to crank it down from a default anyways, so we might as well make the default behavior more deterministic
Hmmm, my initial thought here was that there are certainly going to be users who don't read the release notes, and their expected workflow of new objects being picked up on every query will be broken. With a relatively short TTL their data will at least appear eventually, which may be preferable to not at all. However, I really like the strategy of using the most deterministic solution as the default (infinite), especially as the long-term solution.
> their expected workflow of new objects being picked up on every query will be broken.
I agree with this
On the other hand, I am not sure what your patience level is, but it is very unlikely I would wait 10 minutes... If it doesn't work within a few seconds, I would probably go start looking for problems.
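To make the TTL behavior being debated concrete, here is a minimal sketch of lazy per-entry expiry using only std types. `TtlCache` and its string-keyed map are hypothetical stand-ins for illustration, not the PR's `DefaultListFilesCache`:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical stand-in: maps a path string to (file listing, insertion time).
struct TtlCache {
    entries: HashMap<String, (Vec<String>, Instant)>,
    ttl: Duration,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { entries: HashMap::new(), ttl }
    }

    fn put(&mut self, key: &str, files: Vec<String>) {
        self.entries.insert(key.to_string(), (files, Instant::now()));
    }

    // An entry older than `ttl` is evicted lazily on read and reported as a miss.
    fn get(&mut self, key: &str) -> Option<Vec<String>> {
        let expired = match self.entries.get(key) {
            Some((_, inserted)) => inserted.elapsed() > self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(files, _)| files.clone())
    }
}
```

A `get` on an expired entry evicts it and reports a miss, so a longer (or infinite) default TTL only changes how long stale listings are served; the lookup logic stays the same.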
```rust
pub struct DefaultListFilesCacheState {
    lru_queue: LruQueue<Path, (Arc<Vec<ObjectMeta>>, Instant)>,
    capacity: usize, // TODO: do "bytes" matter here, or should we stick with "entries"?
```
I didn't feel like limiting this cache by "bytes" really made sense because the data stored in the cache is generally very uniform in size, perhaps aside from the path. I felt that it was probably small enough that simply limiting it by the number of entries should suffice, and "entries" seems like it would be easier for users to configure.
I think that bytes makes more sense, mostly because that is the resource that is actually used; a count of items is only a proxy for resource usage.
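For comparison, an entry-count bound is simple to sketch. `EntryLru` below (with its linear-scan lookup) is an illustrative stand-in, not the PR's `LruQueue`; a bytes-based bound would replace the `len()` check with a running per-entry size estimate:

```rust
use std::collections::VecDeque;

// Hypothetical entry-count-bounded LRU; front of the queue = most recently used.
struct EntryLru {
    queue: VecDeque<(String, Vec<String>)>,
    capacity: usize,
}

impl EntryLru {
    fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity }
    }

    fn put(&mut self, key: String, files: Vec<String>) {
        // Replace any existing entry for this key, then insert at the front.
        self.queue.retain(|(k, _)| k != &key);
        self.queue.push_front((key, files));
        if self.queue.len() > self.capacity {
            self.queue.pop_back(); // evict the least recently used entry
        }
    }

    fn get(&mut self, key: &str) -> Option<Vec<String>> {
        let pos = self.queue.iter().position(|(k, _)| k == key)?;
        let entry = self.queue.remove(pos)?;
        let files = entry.1.clone();
        self.queue.push_front(entry); // a hit moves the entry to the front
        Some(files)
    }
}
```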
```rust
// TODO: driveby-cleanup
/// The cache accessor, users usually working on this interface while manipulating caches.
```
I noticed this doc comment could be edited for additional clarity, so I figured while we were in this area of code we could improve this!
```rust
/// See [`crate::runtime_env::RuntimeEnv`] for more details
pub type ListFilesCache =
    Arc<dyn CacheAccessor<Path, Arc<Vec<ObjectMeta>>, Extra = ObjectMeta>>;
pub trait ListFilesCache:
```
I think one of the things I would like to do in the Cache Manager caches is to segregate the cache eviction policy. Personally, I think the user should be given an option for which eviction behaviour they want. wdyt @alamb @BlakeOrth? I can work on getting a draft out this weekend on it.
I think this is a reasonable suggestion, and it looks like there may be some ongoing discussion around this topic here:
While that work would undoubtedly impact this (and other) default cache implementations, I think it should probably be discussed separately from this effort.
I am personally in favor of keeping what is in DataFusion as simple to use / understand as possible (which I think this and the metadata cache do)
In terms of customizable eviction strategy, as we have mentioned elsewhere that is already possible today, but it requires effectively copying/forking the entire cache implementation which adds to the maintenance burden of downstream projects
However, adding more APIs to DataFusion increases the maintenance burden on the core project
So I see customizable eviction strategies as a tradeoff. If there are multiple users who are likely to use a customizable eviction strategy, I agree it makes sense to put it in the core repo. If there are not that many, I think it would be better to keep DataFusion simpler and move the burden downstream for those users who need it.
@alamb After reading your thoughts on this I'm wondering if a more complex cache infrastructure would be a good project to start in https://github.com/datafusion-contrib and if it gains enough traction perhaps it could be promoted to the core repo?
Yes, that sounds like a great idea to me -- I think there is a wide wonderful world of complex caching strategies that people might want.
```rust
table_files_statistics_cache: Default::default(),
list_files_cache: Default::default(),
list_files_cache_limit: DEFAULT_LIST_FILES_CACHE_LIMIT,
list_files_cache_ttl: DEFAULT_LIST_FILES_CACHE_TTL,
```
Some use cases don't need a TTL; we should provide a way to keep it disabled as well.
Well, thinking about it more, I understand why you have kept it. Still, I feel it diverges from the metadata cache and could confuse end users somewhat.
I actually agree with your first assessment that TTL can, and likely should, be optional if a user has a use case where they know the underlying objects are immutable once written (my personal use of DataFusion falls into this category). In either case, we may have to accept some differences between this cache and the Metadata cache. Unlike the Metadata cache, which can issue HEAD requests against objects to detect modification, there's no mechanism in an object store to inquire whether or not paths/sub-paths have been changed, so this cache will need to make some concessions around that limitation.
I do agree having a "infinite TTL" is an important usecase
I don't have a strong opinion on how that is expressed in the config options
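One possible way to express an optional ("infinite") TTL in code -- the `Option<Duration>` encoding here is an assumption for illustration, not what this PR implements:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: `None` means "no TTL", so entries never expire.
fn is_expired(inserted: Instant, ttl: Option<Duration>) -> bool {
    match ttl {
        Some(ttl) => inserted.elapsed() > ttl,
        None => false, // infinite TTL: the entry is never considered stale
    }
}
```

At the config layer this could surface as an unset option, or as a sentinel value meaning "disabled"; that choice is exactly the open question in this thread.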
alamb left a comment:
This looks great to me, FWIW
Thank you @BlakeOrth and @alchemist51
```rust
// Returns the cache's object ttl.
fn cache_ttl(&self) -> Duration;

// Updates the cache with a new boject limit.
```
```diff
- // Updates the cache with a new boject limit.
+ // Updates the cache with a new object limit.
```
```
RequestCountingObjectStore()
Total Requests: 4
- LIST prefix=data
Total Requests: 3
```
it is pretty neat to see this in action
```
RequestCountingObjectStore()
Total Requests: 2
- LIST prefix=data
Total Requests: 1
```
Could we also add a test showing that once cache eviction happens, a subsequent query does actually make a new LIST request?
Yes, I think that's a good test case. Admittedly, tests have more or less been ignored thus far; the only ones I focused on initially were these, since they both show (and helped me validate) that this code actually performs internal caching as expected.
```rust
/// command on the local filesystem. This operation can be expensive,
/// especially when done over remote object stores.
///
/// See [`crate::runtime_env::RuntimeEnv`] for more details
```
This is technically a breaking API change (a good one in my mind). I am just pointing it out
```rust
    }
}

fn get(&mut self, key: &Path) -> Option<Arc<Vec<ObjectMeta>>> {
```
I think we just changed some of these APIs to take &self rather than &mut self -- I am not sure if we want to do the same thing here.
Ah, thanks for pointing this out. I ran into and resolved one merge conflict related to that PR and didn't realize it may have had a subtly wider impact than just the method that caused a conflict on my rebase. I will make sure to revisit this.
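For reference, the usual way to expose `get(&self)` over mutable LRU state is interior mutability. Below is a minimal sketch with simplified stand-in types; the real cache would likely want finer-grained locking and the actual `LruQueue` state:

```rust
use std::sync::Mutex;

// Hypothetical: wrapping the mutable state in a Mutex lets the public API take
// `&self` even though lookups reorder the LRU state internally.
struct SharedCache {
    state: Mutex<Vec<(String, u64)>>, // stand-in for the LRU queue state
}

impl SharedCache {
    fn new() -> Self {
        Self { state: Mutex::new(Vec::new()) }
    }

    fn put(&self, key: &str, value: u64) {
        self.state.lock().unwrap().push((key.to_string(), value));
    }

    fn get(&self, key: &str) -> Option<u64> {
        self.state
            .lock()
            .unwrap()
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| *v)
    }
}
```

Because `Mutex` makes the type `Sync`, this also allows sharing the cache across threads behind an `Arc`, which matches how the cache manager hands these out.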
```rust
    _k: &Path,
    _e: &Self::Extra,
) -> Option<Arc<Vec<ObjectMeta>>> {
    panic!("Not supported DefaultListFilesCache get_with_extra")
```
I can't remember what the rationale for this panic was. It seems to have come in via
Maybe @Ted-Jiang or @suremarc have some thoughts here
@alamb Thanks for the initial review and comments! I think there's one key decision that needs to be made before moving forward in earnest here, which is what action we should take with the cache for `INSERT` statements.

Do you have any thoughts or preferences on this?
Which issue does this PR close?

POC for:

- `ListFilesCache` implementation for the `ListingTable` #18827

Rationale for this change

This is a POC for initial feedback and is not intended for merge at this time.

What changes are included in this PR?

Are these changes tested?

This code is functional and some tests clearly show a reduction in Object Store requests! However, existing tests are broken around `INSERT` commands, which is a key point of discussion that needs to be covered.

Are there any user-facing changes?

Yes, this work will likely break the existing `ListFilesCache` public API.

Additional Context

This PR is a basic functional implementation which heavily mirrors the existing `MetadataCache` and its semantics. One very key omission here that needs to be discussed is how `INSERT` statements are handled. On the surface it seems like there are two options:

1. Invalidate a table's cached entries when it receives an `INSERT` statement
2. Track the objects written by an `INSERT` statement and add them to the cache

The first option here seems much easier, but the second option seems more ideal since a user is likely to issue a query against newly inserted data. Any input here, or other strategies I haven't thought of to handle inserts, would be great!
I will also leave some inline comments around some `TODO` items that I think should be discussed.

cc @alamb @alchemist51