[SPARK-52737][CORE] Pushdown predicate and number of apps to FsHistoryProvider when listing applications #51428
Conversation
The code changes are OK with me, but could you supplement the PR description with a comparison of the benefits? For example, a comparison of the access latency in the bad-case scenarios before and after the modification.
LGTM
Just as an update here, I am having trouble setting the scenario up locally at a scale where we can reproduce the issue. We originally identified the issue using flamegraphs from our production instance. I am trying to see what's a good way to scale-test this.
@LuciferYang @mridulm Sorry it took a while, but I was able to scale test the change. I have added the performance numbers in the PR description.
The numbers look good to me, I will let @LuciferYang review it/merge the PR.
core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala
```diff
@@ -109,7 +109,7 @@ private[history] class HistoryPage(parent: HistoryServer) extends WebUIPage("")
   }
 
   def shouldDisplayApplications(requestedIncomplete: Boolean): Boolean = {
-    parent.getApplicationList().exists(isApplicationCompleted(_) != requestedIncomplete)
+    parent.getApplicationInfoList(1)(isApplicationCompleted(_) != requestedIncomplete).nonEmpty
```
Why do we check only one here, @shardulm94 and @mridulm?
Ah, got it. The next one was the predicate.
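In other words, the page only needs to know whether at least one matching application exists, so requesting at most one application with the predicate pushed down and checking `.nonEmpty` replaces the old `exists` scan. A minimal standalone sketch of that equivalence (plain Scala over made-up data, not the Spark code itself):

```scala
// Plain-Scala illustration, not Spark code: a limit of 1 plus a pushed-down
// predicate short-circuits like `exists`, without materializing the listing.
object LimitOnePredicateDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for the KVStore listing: lazily produced (appId, completed) pairs.
    val apps = Iterator.tabulate(1000000)(i => (s"app-$i", i % 2 == 0))
    val requestedIncomplete = true
    // Take at most one matching app and check non-emptiness.
    val anyToDisplay = apps
      .filter { case (_, completed) => completed != requestedIncomplete }
      .take(1)
      .nonEmpty
    println(anyToDisplay) // true, after inspecting only the first couple of entries
  }
}
```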
+1, LGTM (except the existing unaddressed comment). Thank you. It looks like a nice improvement.
I will merge this later.
@LuciferYang Can this be merged? Thank you!
cc @peter-toth
Merged into master/branch-4.0/branch-3.5. Thanks @shardulm94, @mridulm, @dongjoon-hyun and @peter-toth
What changes were proposed in this pull request?
SPARK-38896 modified how applications are listed from the KVStore to close the KVStore iterator eagerly ([link](https://github.com/apache/spark/pull/36237/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R328)). This meant that `FsHistoryProvider.getListing` now eagerly goes through every application in the KVStore before returning an iterator to the caller. In a couple of contexts where `FsHistoryProvider.getListing` is used, this is very detrimental, e.g. [here](https://github.com/apache/spark/blame/589e93a02725939c266f9ee97f96fdc6d3db33cd/core/src/main/scala/org/apache/spark/deploy/history/HistoryPage.scala#L112): due to `.exists()`, we previously only needed to go through a handful of applications before the condition was satisfied. This causes a significant performance regression for the SHS homepage in our environment, which contains ~10000 Spark apps in a single history server.

To fix the issue, while preserving the original intent of closing the iterator early, this PR proposes pushing down the filter predicate and the number of applications required to `FsHistoryProvider`.
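For illustration only, here is a rough sketch of the pushdown idea using simplified, hypothetical types and method names (the real `FsHistoryProvider` works against a KVStore and its signatures differ):

```scala
// Hypothetical, simplified sketch of the pushdown idea; not the actual Spark API.
case class AppInfo(id: String, completed: Boolean)

class SketchProvider(store: Seq[AppInfo]) {
  // Old shape of the problem: the whole listing is materialized before the
  // caller gets a chance to filter or short-circuit.
  def getListingEager(): Iterator[AppInfo] = store.toList.iterator

  // New shape: the caller pushes down the predicate and how many apps it needs,
  // so iteration stops (and the underlying store iterator can be closed) early.
  def getListing(max: Int)(predicate: AppInfo => Boolean): Iterator[AppInfo] =
    store.iterator.filter(predicate).take(max).toList.iterator
}

object SketchProviderDemo {
  def main(args: Array[String]): Unit = {
    val provider = new SketchProvider(
      Seq.tabulate(100000)(i => AppInfo(s"app-$i", completed = i % 2 == 0)))
    // Only scans until the first incomplete app instead of walking all 100k entries.
    val hasIncomplete = provider.getListing(1)(app => !app.completed).nonEmpty
    println(hasIncomplete)
  }
}
```

The point of the design is that the limit and predicate travel with the request, so the provider decides how far to iterate instead of the caller filtering an already materialized list.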
Why are the changes needed?
To fix a perf regression in SHS due to SPARK-38896
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests for
HistoryPage
andApplicationListResource
Tested performance on local SHS with a large number of apps (~75k) consistent with production.
Before:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
75061
smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 3.607995s
Total time: 3.564875s
Total time: 3.095895s
Total time: 3.153576s
Total time: 3.157186s
Total time: 3.251107s
Total time: 3.681727s
Total time: 4.622074s
Total time: 6.866931s
Total time: 3.523224s
smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 3.340698s
Total time: 3.206455s
Total time: 3.140326s
Total time: 4.704944s
Total time: 3.982831s
Total time: 7.375094s
Total time: 3.328329s
Total time: 3.264700s
Total time: 3.283851s
Total time: 3.456416s
```
After:
```
smahadiklocalhost [ ~ ]$ curl http://localhost:18080/api/v1/applications | jq 'length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 36.7M    0 36.7M    0     0  7662k      0 --:--:--  0:00:04 --:--:-- 7663k
75077
smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080; done
Total time: 0.224714s
Total time: 0.012205s
Total time: 0.014709s
Total time: 0.008092s
Total time: 0.007284s
Total time: 0.006350s
Total time: 0.005414s
Total time: 0.006391s
Total time: 0.005668s
Total time: 0.004738s
smahadiklocalhost [ ~ ]$ for i in {1..10}; do curl -s -w "\nTotal time: %{time_total}s\n" -o /dev/null http://localhost:18080/api/v1/applications?limit=10; done
Total time: 1.439507s
Total time: 0.015126s
Total time: 0.009085s
Total time: 0.007620s
Total time: 0.007692s
Total time: 0.007420s
Total time: 0.007152s
Total time: 0.010515s
Total time: 0.011493s
Total time: 0.007564s
```
Was this patch authored or co-authored using generative AI tooling?
No