[SPARK-52737][CORE] Pushdown predicate and number of apps to FsHistoryProvider when listing applications #51428

shardulm94 · 2025-07-09T21:17:33Z

What changes were proposed in this pull request?

SPARK-38896 changed how applications are listed from the KVStore so that the KVStore iterator is closed eagerly (https://github.com/apache/spark/pull/36237/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R328). As a result, FsHistoryProvider.getListing now walks every application in the KVStore before returning an iterator to the caller. In a couple of places where FsHistoryProvider.getListing is used, this is very detrimental. For example, in HistoryPage (https://github.com/apache/spark/blame/589e93a02725939c266f9ee97f96fdc6d3db33cd/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala#L112), .exists() previously only needed to go through a handful of applications before the condition was satisfied. This causes a significant perf regression for the SHS homepage in our environment, which contains ~10000 Spark apps in a single history server.

To fix the issue while preserving the original intent of closing the iterator early, this PR proposes pushing down the filter predicate and the number of applications required to FsHistoryProvider.
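The effect of the pushdown can be illustrated with a minimal sketch. It is not the code in this PR: `AppInfo`, `ListingStore`, `listAll`, `list` and the `visited` counter are hypothetical names made up for the illustration, and a plain `Seq` stands in for the KVStore. The only point is to show why applying the predicate and the limit while the iterator is still open avoids reading every entry.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark's classes.
final case class AppInfo(id: String, completed: Boolean)

final class ListingStore(apps: Seq[AppInfo]) {
  // Number of entries read from the underlying "store" so far.
  var visited: Int = 0

  // Runs `f` against a lazy iterator that is only meant to be used inside this
  // call, mimicking an iterator that has to be closed before returning.
  private def withStoreIterator[T](f: Iterator[AppInfo] => T): T = {
    val it = apps.iterator.map { app => visited += 1; app }
    f(it) // a real store would close the iterator in a finally block here
  }

  // Eager listing: every entry is copied out before the iterator is released,
  // so a later .exists() on the result cannot save any work.
  def listAll(): Iterator[AppInfo] =
    withStoreIterator(_.toList).iterator

  // Pushed-down listing: the predicate and the limit are applied while the
  // iterator is open, so reading stops once `max` matches have been found.
  def list(max: Int)(predicate: AppInfo => Boolean): Iterator[AppInfo] =
    withStoreIterator(_.filter(predicate).take(max).toList).iterator
}

object PushdownDemo {
  def main(args: Array[String]): Unit = {
    val apps = Seq.tabulate(10000)(i => AppInfo(s"app-$i", completed = true))

    val eager = new ListingStore(apps)
    val foundEager = eager.listAll().exists(_.completed)
    println(s"eager: found=$foundEager, visited=${eager.visited}")

    val pushed = new ListingStore(apps)
    val foundPushed = pushed.list(1)(_.completed).nonEmpty
    println(s"pushed down: found=$foundPushed, visited=${pushed.visited}")
  }
}
```

In this sketch both calls answer the same question, but the eager variant reads all 10000 entries while the pushed-down variant stops after the first match. The iterator is still fully consumed or abandoned inside `withStoreIterator`, which is the property the eager close was introduced to guarantee.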

Why are the changes needed?

To fix a perf regression in SHS due to SPARK-38896

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests for HistoryPage and ApplicationListResource

Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:

smahadik@localhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s

After:

smahadik@localhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s


smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s

Was this patch authored or co-authored using generative AI tooling?

No

shardulm94 · 2025-07-09T21:18:44Z

cc: @LuciferYang @mridulm @thejdeep

LuciferYang

The code changes look OK to me, but could you add a before/after comparison to the PR description? For example, a comparison of the access latency in the bad-case scenarios before and after the change.

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala

mridulm

LGTM

shardulm94 · 2025-07-16T18:12:25Z

Just as an update here: I am having trouble setting the scenario up locally at a scale where we can reproduce the issue. We originally identified the issue using flamegraphs from our production instance. I am trying to figure out a good way to scale test this.

shardulm94 · 2025-07-22T19:11:33Z

@LuciferYang @mridulm Sorry it took a while, but I was able to scale test the change. I have added the performance numbers in the PR description.

mridulm · 2025-07-22T19:37:02Z

The numbers look good to me. I will let @LuciferYang review and merge the PR.
Thanks, Shardul!

dongjoon-hyun · 2025-07-23T14:16:43Z

core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala

@@ -109,7 +109,7 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
  }

  def shouldDisplayApplications(requestedIncomplete: Boolean): Boolean = {
-    parent.getApplicationList().exists(isApplicationCompleted(_) != requestedIncomplete)
+    parent.getApplicationInfoList(1)(isApplicationCompleted(_) != requestedIncomplete).nonEmpty


~~Why we check only one here, @shardulm94 and @mridulm ?~~

Ah, got it. The next one was the predicate.

dongjoon-hyun

+1, LGTM (except the existing unaddressed comment). Thank you. It looks like a nice improvement.

LuciferYang · 2025-07-23T17:08:38Z

I will merge this later.

thejdeep · 2025-07-25T18:07:38Z

@LuciferYang Can this be merged? Thank you!

dongjoon-hyun · 2025-07-25T18:21:32Z

cc @peter-toth

…yProvider when listing applications

### What changes were proposed in this pull request?

SPARK-38896 modified how applications are listed from the KVStore to close the KVStore iterator eagerly [Link](https://github.com/apache/spark/pull/36237/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R328). This meant that `FsHistoryProvider.getListing` now eagerly goes through every application in the KVStore before returning an iterator to the caller. In a couple of contexts where `FsHistoryProvider.getListing` is used, this is very detrimental. e.g. [here](https://github.com/apache/spark/blame/589e93a02725939c266f9ee97f96fdc6d3db33cd/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala#L112), due to `.exists()` we would previously only need to go through a handful of applications before the condition is satisfied. This causes significant perf regression for the SHS homepage in our environment which contains ~10000 Spark apps in a single history server.

To fix the issue, while preserving the original intent of closing the iterator early, this PR proposes pushing down filter predicates and number of applications required to FsHistoryProvider.

### Why are the changes needed?

To fix a perf regression in SHS due to SPARK-38896

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests for `HistoryPage` and `ApplicationListResource`

Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s
```

After:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s
```

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #51428 from shardulm94/smahadik/shs-slow.

Authored-by: Shardul Mahadik <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
(cherry picked from commit aeae9ff)
Signed-off-by: yangjie01 <[email protected]>

LuciferYang · 2025-07-26T11:25:58Z

Merged into master/branch-4.0/branch-3.5. Thanks @shardulm94 @mridulm @dongjoon-hyun and @peter-toth

github-actions bot added the WEB UI and CORE labels Jul 9, 2025

SPARK-52737: Pushdown predicate and number of apps to FsHistoryProvider when listing applications

b98ee31

shardulm94 force-pushed the smahadik/shs-slow branch from 9041c6e to b98ee31 July 11, 2025 18:39

LuciferYang approved these changes Jul 14, 2025

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

mridulm reviewed Jul 14, 2025

Address comments

9daf6fb

LuciferYang reviewed Jul 15, 2025

core/src/main/scala/org/apache/spark/deploy/history/HistoryServer.scala (Outdated)

Fix whitespace

19d4acd

mridulm approved these changes Jul 16, 2025

dongjoon-hyun reviewed Jul 23, 2025

dongjoon-hyun approved these changes Jul 23, 2025

peter-toth approved these changes Jul 26, 2025

LuciferYang closed this in aeae9ff Jul 26, 2025
