-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[df] Change default Snapshot compression settings #20030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[df] Change default Snapshot compression settings #20030
Conversation
cbda4a1
to
81c2ac0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Test Results 22 files 22 suites 3d 22h 59m 32s ⏱️ Results for commit 7c5c04f. ♻️ This comment has been updated with latest results. |
81c2ac0
to
7c5c04f
Compare
@pcanal I am now slightly in doubt about this current approach. For the information message to be actually visible, a user would have to explicitly activate the "info" logging in their application by writing #include <ROOT/RLogger.hxx>
// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::ELogLevel::kInfo); Whereas I believe our goal here is to show the message once per application anytime the user is writing at least one Snapshot call in their program. By turning the "info" to a "warning", we would get that behaviour. |
I think a Warning is fine, considering that you are warning the user about the change in behavior |
Thanks for adding it to the RN, which do not reach all users, but at least make the change officially documented in writing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I agree with the suggestion to turn this into a warning. I added one suggestion to the RN to add some additional context, but it's also okay for me to leave them as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favour of changing the default, and also think that it's good to notify users, but I think the way how the message gets out could be improved:
- An entry in the release notes is warranted
- The
Info
message should show as info, so we might have to move toTError.h
'sInfo
. (A warning would be too high in my opinion). - And I think the message should be silenceable using an entry in TEnv.
How about:
The default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1) to 505 (ZSTD with compression level 5).
See https://root.cern/
Silence this message addingRDF_SNAPSHOT_INFO=0
to the process environment or .rootrc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Vincenzo!
The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.
7c5c04f
to
09afe35
Compare
I have found a way to show The remaining discussion point is about how long this info message should stay around. I believe it should only stay in 6.38 as leaving it in an LTS release has the concrete risk of sticking around in user code for many years. |
The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.
As discussed during the ROOT I/O meeting