[SPARK-27394][WebUI]Flush LiveEntity if necessary when receiving SparkListenerExecutorMetricsUpdate #24303
Conversation
// here to ensure the staleness of Spark UI doesn't last more that the executor heartbeat
// interval.
if (now - lastFlushTimeNs > liveUpdatePeriodNs) {
  flush(maybeUpdate(_, now))
Hmm... in the bug you mention that job-level data is not being updated. Is that the only case? Because if that's it, then this looks like overkill. You could e.g. update the jobs in the code that handles event.accumUpdates above, or even just flush jobs specifically, instead of everything.
Doing a full flush here seems like overkill and a little expensive when you think about many heartbeats arriving in a short period (even when considering lastFlushTimeNs).
Hmm... in the bug you mention that job-level data is not being updated. Is that the only case?
I also noticed that executor active tasks could sometimes be wrong. That's why I decided to flush everything, to make sure we don't miss any places. It's also hard to maintain if we need to manually flush in every place.
Ideally, we should flush periodically so that it doesn't depend on receiving a Spark event. But then I would need to add a new event type and post it to the listener bus, which seems like overkill.
when you think about many heartbeats arriving in a short period
There will be at least 100ms between flushes. As long as we process heartbeats quickly, most of them won't trigger the flush.
If the goal is to use the heartbeats as a trigger for flushing, how about using some ratio of the heartbeat period instead of liveUpdatePeriodNs to control whether to flush everything?
Really large apps can get a little backed up when processing heartbeats from lots and lots of busy executors, and this would make it a little worse.
The update only happens in the live UI, which should be fine in general. For really large apps, would it help to set LIVE_ENTITY_UPDATE_PERIOD to a larger value? Using a ratio of the heartbeat period seems a bit complex.
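For reference, LIVE_ENTITY_UPDATE_PERIOD is backed by the `spark.ui.liveUpdate.period` key, so a larger value could be set through the application's SparkConf. A minimal sketch (the 500ms value is just an illustration, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Sketch: raise the live-entity write period so the in-memory status store is
// written less often for very busy applications (500ms is an arbitrary example).
val conf = new SparkConf()
  .setAppName("live-ui-update-period-example")
  .set("spark.ui.liveUpdate.period", "500ms")
```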
only happens in live UI
The "don't write to the store all the time" thing was added specifically to speed up live UIs, because copying + writing the data (even to the memory store) becomes really expensive when you have event storms (think thousands of tasks starting and stopping in a very short period).
setting LIVE_ENTITY_UPDATE_PERIOD to a larger value
We should avoid requiring configuration tweaks for things not to break, when possible.
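For context, the rate limiting that the "don't write to the store all the time" change introduced looks roughly like the sketch below. This is a simplified illustration, not the exact Spark source: the names mirror AppStatusListener, but the types are stand-ins.

```scala
// Simplified sketch of how AppStatusListener rate-limits writes: an entity is
// only re-written to the status store when liveUpdatePeriodNs has elapsed since
// its last write, so event storms don't pay the copy-and-write cost per event.
abstract class LiveEntitySketch {
  var lastWriteTime: Long = -1L
  def doUpdate(now: Long): Unit // stand-in for writing the entity to the store
}

class ListenerSketch(live: Boolean, liveUpdatePeriodNs: Long) {
  def update(entity: LiveEntitySketch, now: Long): Unit = {
    entity.doUpdate(now)
    entity.lastWriteTime = now
  }

  // Only write if enough time has passed since the entity's last write.
  def maybeUpdate(entity: LiveEntitySketch, now: Long): Unit = {
    if (live && liveUpdatePeriodNs >= 0 && now - entity.lastWriteTime > liveUpdatePeriodNs) {
      update(entity, now)
    }
  }
}
```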
Test build #104308 has finished for PR 24303 at commit
Test build #104309 has finished for PR 24303 at commit
}
}
// Flush updates if necessary. Executor heartbeat is an event that happens periodically. Flush
// here to ensure the staleness of Spark UI doesn't last more that the executor heartbeat
nit: more than?
// Flush updates if necessary. Executor heartbeat is an event that happens periodically. Flush
// here to ensure the staleness of Spark UI doesn't last more that the executor heartbeat
// interval.
if (now - lastFlushTimeNs > liveUpdatePeriodNs) {
I'm also worried about the case when flush() takes a few milliseconds to finish, and you end up always updating all live entities for each ExecutorMetricsUpdate event.
Is it possible to introduce a new config that specifies the live update period for ExecutorMetricsUpdate only? The default value can be the same as liveUpdatePeriodNs, while users can change it to a bigger value if flush() becomes an issue when processing the event.
@vanzin I added a new separate config for this. It's weird to use a ratio of the heartbeat period, since the heartbeat is an implementation detail and we may use a different approach in the future.
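The shape of the change is roughly the following sketch: on an executor heartbeat, flush all live entities, but rate-limited by a dedicated period that is separate from the regular per-entity update period. The class and names below are illustrative stand-ins, not the actual AppStatusListener code (the real diff is quoted later in this thread):

```scala
// Illustrative sketch: flush everything on executor heartbeats, but no more
// often than once per flushPeriodNs, regardless of how fast heartbeats arrive.
class HeartbeatFlushSketch(flushPeriodNs: Long) {
  private var lastFlushTimeNs: Long = System.nanoTime()

  // `flushAll` stands in for AppStatusListener flushing every LiveEntity.
  def onExecutorHeartbeat(flushAll: () => Unit): Unit = {
    val now = System.nanoTime()
    if (now - lastFlushTimeNs > flushPeriodNs) {
      flushAll()
      lastFlushTimeNs = now
    }
  }
}
```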
Test build #104328 has finished for PR 24303 at commit
Running this test, the thread audit detects two possible thread leaks (Executor task launch worker for task 0, Executor task launch worker for task 1). It makes me wonder whether they are just red herrings that get killed later, after the test suite is stopped (in a separate thread, like by TaskReaper), or whether we should take care of them:
===== POSSIBLE THREAD LEAK IN SUITE o.a.s.ui.UISeleniumSuite, thread names: Keep-Alive-Timer, Executor task launch worker for task 0, Executor task launch worker for task 1 =====
val f = sc.parallelize(1 to 1000, 1000).foreachAsync { _ =>
  // Make the task never finish so there won't be any task start/end events after the first 2
  // tasks start.
  Thread.sleep(300000)
Nit: what about a sleep of less than 5 minutes here, something comparable with the eventually timeout, like:
Thread.sleep(20.seconds.toMillis)
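(For that form to compile, the duration DSL needs to be in scope, assuming the suite doesn't already import it:)

```scala
import scala.concurrent.duration._ // brings 20.seconds into scope
```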
I turned on SPARK_JOB_INTERRUPT_ON_CANCEL, so there's no need to change the sleep time.
I have checked and the thread leaks are gone.
Good catch. Forgot to set a flag...
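For readers following along, the flag in question enables interrupt-on-cancel, so the sleeping "Executor task launch worker" threads get interrupted when the job is cancelled instead of leaking past the end of the suite. One way to enable it from a test is the job group API; this is a sketch assuming `sc` is the suite's SparkContext, not necessarily the exact code used in this PR, and the group name is arbitrary:

```scala
// Run the never-finishing job in a job group with task interruption enabled,
// then cancel the group at the end of the test so the sleeping tasks die.
sc.setJobGroup("ui-staleness-test", "job that never finishes", interruptOnCancel = true)
val f = sc.parallelize(1 to 1000, 1000).foreachAsync { _ =>
  Thread.sleep(300000)
}
// ... assertions against the live UI go here ...
sc.cancelJobGroup("ui-staleness-test")
```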
Test build #104324 has finished for PR 24303 at commit
Some minor comments.
val LIVE_ENTITY_UPDATE_STALENESS_LIMIT = ConfigBuilder("spark.ui.liveUpdate.stalenessLimit")
  .internal()
  .doc(
    """A time limit before we force to flush all live entities. When the last flush doesn't past
Grammar: "doesn't past this limit"?
I think this would be easier to explain if you named the config "minFlushPeriod" or something. e.g. "Minimum time elapsed before stale UI data is flushed."
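Something along the lines of that suggestion could look like the sketch below. This is only an illustration of the proposed rename and wording, not necessarily what was finally merged; the default value shown is arbitrary, and the imports assume the entry sits next to the other UI config entries inside Spark itself:

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Sketch of the suggested shape: a name and doc string that describe a minimum
// flush period rather than a "staleness limit". The "1s" default is illustrative.
val LIVE_ENTITY_UPDATE_MIN_FLUSH_PERIOD =
  ConfigBuilder("spark.ui.liveUpdate.minFlushPeriod")
    .doc("Minimum time elapsed before stale UI data is flushed. This avoids UI staleness " +
      "when incoming task events are not frequent.")
    .timeConf(TimeUnit.NANOSECONDS)
    .createWithDefaultString("1s")
```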
private val liveUpdatePeriodNs = if (live) conf.get(LIVE_ENTITY_UPDATE_PERIOD) else -1L

/**
 * A time limit before we force to flush all live entities. When the last flush doesn't past
Same grammar issue. Can't really parse what you wrote.
.createWithDefaultString("100ms")

val LIVE_ENTITY_UPDATE_STALENESS_LIMIT = ConfigBuilder("spark.ui.liveUpdate.stalenessLimit")
  .internal()
Why internal? Spark doesn't set it itself. If anyone is going to change it, it will be users.
Test build #104329 has finished for PR 24303 at commit
Test build #104332 has finished for PR 24303 at commit
Looks good pending tests.
Test build #104408 has finished for PR 24303 at commit
LGTM
There is a typo in the description: "last more that" => "last more than"; otherwise LGTM.
Merging to master / 2.4.
Doesn't merge cleanly to 2.4, so gave up. Open a new PR blah blah blah...
…rkListenerExecutorMetricsUpdate This PR updates `AppStatusListener` to flush `LiveEntity` if necessary when receiving `SparkListenerExecutorMetricsUpdate`. This will ensure the staleness of Spark UI doesn't last more than the executor heartbeat interval. The new unit test. Closes apache#24303 from zsxwing/SPARK-27394. Authored-by: Shixiong Zhu <[email protected]> Signed-off-by: Marcelo Vanzin <[email protected]>
…rkListenerExecutorMetricsUpdate (backport 2.4) ## What changes were proposed in this pull request? This PR backports #24303 to 2.4. ## How was this patch tested? Jenkins Closes #24328 from zsxwing/SPARK-27394-2.4. Authored-by: Shixiong Zhu <[email protected]> Signed-off-by: Shixiong Zhu <[email protected]>
What changes were proposed in this pull request?
This PR updates `AppStatusListener` to flush `LiveEntity` if necessary when receiving `SparkListenerExecutorMetricsUpdate`. This will ensure the staleness of Spark UI doesn't last more than the executor heartbeat interval.
How was this patch tested?
The new unit test.