Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Mar 24, 2017

What changes were proposed in this pull request?

Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.

How was this patch tested?

Current History UI tests cover use of the history file.

@SparkQA
Copy link

SparkQA commented Mar 24, 2017

Test build #75173 has finished for PR 17412 at commit 386846c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Mar 24, 2017

I'm fine with this, assuming there's no need for this information anywhere. (Basically, the question I asked in #16714).

@rdblue
Copy link
Contributor Author

rdblue commented Mar 27, 2017

@vanzin, I looked through the listeners in the Spark UI for uses of it and didn't find any, plus I haven't seen anything in the UI that is obviously based on it. As I said on the issue, the updates duplicate data in metrics so it would be odd for a component to go to accumulators to get this info.

I'm deploying this internally today, maybe it makes sense to give it a couple days to see if we get complaints about something not working.

@rdblue
Copy link
Contributor Author

rdblue commented Mar 30, 2017

@vanzin, we've been running this in production for a few days now and haven't had any problems. I think it should be safe to merge. Could you take another look? Thanks!

@vanzin
Copy link
Contributor

vanzin commented Mar 31, 2017

LGTM. Merging to master / 2.1.

asfgit pushed a commit that referenced this pull request Mar 31, 2017
… history files.

## What changes were proposed in this pull request?

Remove accumulator updates for internal.metrics.updatedBlockStatuses from SparkListenerTaskEnd entries in the history file. These can cause history files to grow to hundreds of GB because the value of the accumulator contains all tracked blocks.

## How was this patch tested?

Current History UI tests cover use of the history file.

Author: Ryan Blue <[email protected]>

Closes #17412 from rdblue/SPARK-20084-remove-block-accumulator-info.

(cherry picked from commit c4c03ee)
Signed-off-by: Marcelo Vanzin <[email protected]>
@asfgit asfgit closed this in c4c03ee Mar 31, 2017
@zhouyejoe
Copy link
Contributor

@rdblue Hi, why not the blockstatusupdates are not filtering out in executorMetricsUpdate? This line https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L245
While I am working on SPARK-21961, I filtered those blockstatusupdates while reading from logs in Spark History Server, but it causing some unit test failure.
Should it not be filtered out in both executorMetricsUpdateFromJson and executorMetricsUpdateToJson?
Thanks.

HyukjinKwon pushed a commit that referenced this pull request Mar 4, 2025
…ecutorMetricsUpdate and TaskEndReason

### What changes were proposed in this pull request?

This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events.

This exclusion list was originally added in #17412 and originally only applied to StageInfo and TaskInfo logging.

### Why are the changes needed?

Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators.

This gap for ExecutorMetricsUpdate was noted in a comment (#17412 (comment)); I recently rediscovered it in my own Spark usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Mar 4, 2025
…ecutorMetricsUpdate and TaskEndReason

### What changes were proposed in this pull request?

This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events.

This exclusion list was originally added in #17412 and originally only applied to StageInfo and TaskInfo logging.

### Why are the changes needed?

Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators.

This gap for ExecutorMetricsUpdate was noted in a comment (#17412 (comment)); I recently rediscovered it in my own Spark usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 62f7f14)
Signed-off-by: Hyukjin Kwon <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…ecutorMetricsUpdate and TaskEndReason

### What changes were proposed in this pull request?

This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events.

This exclusion list was originally added in apache#17412 and originally only applied to StageInfo and TaskInfo logging.

### Why are the changes needed?

Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators.

This gap for ExecutorMetricsUpdate was noted in a comment (apache#17412 (comment)); I recently rediscovered it in my own Spark usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025
…ecutorMetricsUpdate and TaskEndReason

### What changes were proposed in this pull request?

This PR updates `JsonProtocol` so that its `accumulableExcludeList` is considered when logging accumulable in TaskEndReasons and ExecutorMetricUpdate events.

This exclusion list was originally added in apache#17412 and originally only applied to StageInfo and TaskInfo logging.

### Why are the changes needed?

Reduce event log size and improve event logger throughput by avoiding the need to process possibly large / expensive accumulators.

This gap for ExecutorMetricsUpdate was noted in a comment (apache#17412 (comment)); I recently rediscovered it in my own Spark usage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50141 from JoshRosen/apply-accumulableExcludeList-to-other-event-log-fields.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants