Skip to content

[SPARK-52737][CORE] Pushdown predicate and number of apps to FsHistoryProvider when listing applications #51428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

shardulm94
Copy link
Contributor

@shardulm94 shardulm94 commented Jul 9, 2025

What changes were proposed in this pull request?

SPARK-38896 modified how applications are listed from the KVStore to close the KVStore iterator eagerly Link. This meant that FsHistoryProvider.getListing now eagerly goes through every application in the KVStore before returning an iterator to the caller. In a couple of contexts where FsHistoryProvider.getListing is used, this is very detrimental. e.g. here, due to .exists() we would previously only need to go through a handful of applications before the condition is satisfied. This causes significant perf regression for the SHS homepage in our environment which contains ~10000 Spark apps in a single history server.

To fix the issue, while preserving the original intent of closing the iterator early, this PR proposes pushing down filter predicates and number of applications required to FsHistoryProvider.

Why are the changes needed?

To fix a perf regression in SHS due to SPARK-38896

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests for HistoryPage and ApplicationListResource

Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:

smahadik@localhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s

After:

smahadik@localhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077

smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s


smahadik@localhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s

Was this patch authored or co-authored using generative AI tooling?

No

@shardulm94
Copy link
Contributor Author

cc: @LuciferYang @mridulm @thejdeep

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code changes are ok with me, but could you supplement the pr description with a comparison of the benefits? For example, a comparison of the access latency in scenarios where bad cases occur before and after the modification.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shardulm94
Copy link
Contributor Author

shardulm94 commented Jul 16, 2025

Just as an update here, I am having trouble setting the scenario up locally at a scale where we can reproduce the issue. We originally identified the issue using flamegraphs from our production instance. I am trying to see whats a good way to scale test this.

@shardulm94
Copy link
Contributor Author

@LuciferYang @mridulm Sorry it took a while, but I was able to scale test the change. I have added the performance numbers in the PR description.

@mridulm
Copy link
Contributor

mridulm commented Jul 22, 2025

The numbers look good to me, I will let @LuciferYang review it/merge the PR.
Thanks Shardul !

@@ -109,7 +109,7 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
}

def shouldDisplayApplications(requestedIncomplete: Boolean): Boolean = {
parent.getApplicationList().exists(isApplicationCompleted(_) != requestedIncomplete)
parent.getApplicationInfoList(1)(isApplicationCompleted(_) != requestedIncomplete).nonEmpty
Copy link
Member

@dongjoon-hyun dongjoon-hyun Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we check only one here, @shardulm94 and @mridulm ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. The next one was the predicate.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (except the existing unaddressed comment). Thank you. It looks like a nice improvement.

@LuciferYang
Copy link
Contributor

LuciferYang commented Jul 23, 2025

I will merge this later.

@thejdeep
Copy link
Contributor

@LuciferYang Can this be merged ? Thank you!

@dongjoon-hyun
Copy link
Member

cc @peter-toth

LuciferYang pushed a commit that referenced this pull request Jul 26, 2025
…yProvider when listing applications

### What changes were proposed in this pull request?
SPARK-38896 modified how applications are listed from the KVStore to close the KVStore iterator eagerly [Link](https://github.com/apache/spark/pull/36237/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R328). This meant that `FsHistoryProvider.getListing` now eagerly goes through every application in the KVStore before returning an iterator to the caller. In a couple of contexts where `FsHistoryProvider.getListing` is used, this is very detrimental. e.g. [here](https://github.com/apache/spark/blame/589e93a02725939c266f9ee97f96fdc6d3db33cd/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala#L112), due to `.exists()` we would previously only need to go through a handful of applications before the condition is satisfied. This causes significant perf regression for the SHS homepage in our environment which contains ~10000 Spark apps in a single history server.

To fix the issue, while preserving the original intent of closing the iterator early, this PR proposes pushing down filter predicates and number of applications required to FsHistoryProvider.

### Why are the changes needed?
To fix a perf regression in SHS due to SPARK-38896

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests for `HistoryPage` and `ApplicationListResource`

Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s
```

After:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51428 from shardulm94/smahadik/shs-slow.

Authored-by: Shardul Mahadik <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
(cherry picked from commit aeae9ff)
Signed-off-by: yangjie01 <[email protected]>
LuciferYang pushed a commit that referenced this pull request Jul 26, 2025
…yProvider when listing applications

### What changes were proposed in this pull request?
SPARK-38896 modified how applications are listed from the KVStore to close the KVStore iterator eagerly [Link](https://github.com/apache/spark/pull/36237/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R328). This meant that `FsHistoryProvider.getListing` now eagerly goes through every application in the KVStore before returning an iterator to the caller. In a couple of contexts where `FsHistoryProvider.getListing` is used, this is very detrimental. e.g. [here](https://github.com/apache/spark/blame/589e93a02725939c266f9ee97f96fdc6d3db33cd/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala#L112), due to `.exists()` we would previously only need to go through a handful of applications before the condition is satisfied. This causes significant perf regression for the SHS homepage in our environment which contains ~10000 Spark apps in a single history server.

To fix the issue, while preserving the original intent of closing the iterator early, this PR proposes pushing down filter predicates and number of applications required to FsHistoryProvider.

### Why are the changes needed?
To fix a perf regression in SHS due to SPARK-38896

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests for `HistoryPage` and `ApplicationListResource`

Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s
```

After:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s

smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51428 from shardulm94/smahadik/shs-slow.

Authored-by: Shardul Mahadik <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
(cherry picked from commit aeae9ff)
Signed-off-by: yangjie01 <[email protected]>
@LuciferYang
Copy link
Contributor

Merged into master/branch-4.0/branch-3.5. Thanks @shardulm94 @mridulm @dongjoon-hyun and @peter-toth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants