-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20084][Core] Remove internal.metrics.updatedBlockStatuses from history files. #17412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20084][Core] Remove internal.metrics.updatedBlockStatuses from history files. #17412
Conversation
|
Test build #75173 has finished for PR 17412 at commit
|
|
I'm fine with this, assuming there's no need for this information anywhere. (Basically, the question I asked in #16714). |
|
@vanzin, I looked through the listeners in the Spark UI for uses of it and didn't find any, plus I haven't seen anything in the UI that is obviously based on it. As I said on the issue, the updates duplicate data in metrics so it would be odd for a component to go to accumulators to get this info. I'm deploying this internally today, maybe it makes sense to give it a couple days to see if we get complaints about something not working. |
|
@vanzin, we've been running this in production for a few days now and haven't had any problems. I think it should be safe to merge. Could you take another look? Thanks! |
|
LGTM. Merging to master / 2.1. |
… history files. ## What changes were proposed in this pull request? Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks. ## How was this patch tested? Current History UI tests cover use of the history file. Author: Ryan Blue <[email protected]> Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info. (cherry picked from commit c4c03ee) Signed-off-by: Marcelo Vanzin <[email protected]>
|
@rdblue Hi, why not the blockstatusupdates are not filtering out in executorMetricsUpdate? This line https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L245 |
…ecutorMetricsUpdate and TaskEndReason ### What changes were proposed in this pull request? This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events. This exclusion list was originally added in #17412 and originally only applied to StageInfo and TaskInfo logging. ### Why are the changes needed? Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators. This gap for ExecutorMetricsUpdate was noted in a comment (#17412 (comment)); I recently rediscovered it in my own Spark usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields. Authored-by: Josh Rosen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ecutorMetricsUpdate and TaskEndReason ### What changes were proposed in this pull request? This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events. This exclusion list was originally added in #17412 and originally only applied to StageInfo and TaskInfo logging. ### Why are the changes needed? Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators. This gap for ExecutorMetricsUpdate was noted in a comment (#17412 (comment)); I recently rediscovered it in my own Spark usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields. Authored-by: Josh Rosen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 62f7f14) Signed-off-by: Hyukjin Kwon <[email protected]>
…ecutorMetricsUpdate and TaskEndReason ### What changes were proposed in this pull request? This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events. This exclusion list was originally added in apache#17412 and originally only applied to StageInfo and TaskInfo logging. ### Why are the changes needed? Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators. This gap for ExecutorMetricsUpdate was noted in a comment (apache#17412 (comment)); I recently rediscovered it in my own Spark usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields. Authored-by: Josh Rosen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…ecutorMetricsUpdate and TaskEndReason ### What changes were proposed in this pull request? This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events. This exclusion list was originally added in apache#17412 and originally only applied to StageInfo and TaskInfo logging. ### Why are the changes needed? Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators. This gap for ExecutorMetricsUpdate was noted in a comment (apache#17412 (comment)); I recently rediscovered it in my own Spark usage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields. Authored-by: Josh Rosen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.
How was this patch tested?
Current History UI tests cover use of the history file.