-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-38640][CORE] Fix NPE with memory-only cache blocks and RDD fetching #35959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38640][CORE] Fix NPE with memory-only cache blocks and RDD fetching #35959
Conversation
|
@attilapiros found this while testing #35085 |
|
I wasn't exactly sure what the "right" fix was here that avoided weird edge cases where you change a block from memory only to memory and disk or vice versa. The alternative is adding a null check in |
|
Can one of the admins verify this patch? |
|
Why not simply move the (Note, all invocations are within a BlockManagerMasterEndpoint event loop - so no MT issues : else all uses of |
|
Okay good to know, I'll just change it to that then. Was just overcomplicating things hah |
8f4d2b9 to
7c79636
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we test with both true and false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if you want to make sure just having the shuffle service enabled doesn't affect this. I expect there's other tests testing all the default configs with all the caching behaviors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing both enabled and disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this disk persisted blocks correct for this test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make it more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the test name
attilapiros
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
mridulm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @Kimahriman !
795b6e8 to
bad0931
Compare
|
Can you rebase and force push? something's up with the PR builder |
|
That's what I tried to do this morning, was there a fix since then? |
|
Oh, I just saw the email about some problem with PR builders. Let's sit tight, seems external |
|
Here's the link to my successful action: https://github.com/Kimahriman/spark/actions/runs/2154365947 |
…ching ### What changes were proposed in this pull request? Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist. ### Why are the changes needed? In #33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New and updated UT Closes #35959 from Kimahriman/fetch-rdd-memory-only-unpersist. Authored-by: Adam Binford <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit e0939f0) Signed-off-by: Sean Owen <[email protected]>
|
Merged to master/3.3 |
…ching ### What changes were proposed in this pull request? Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist. ### Why are the changes needed? In apache#33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New and updated UT Closes apache#35959 from Kimahriman/fetch-rdd-memory-only-unpersist. Authored-by: Adam Binford <[email protected]> Signed-off-by: Sean Owen <[email protected]> (cherry picked from commit e0939f0) Signed-off-by: Sean Owen <[email protected]>
What changes were proposed in this pull request?
Fixes a bug where if
spark.shuffle.service.fetch.rdd.enabled=true, memory-only cached blocks will fail to unpersist.Why are the changes needed?
In #33020, when all RDD blocks are removed from
externalShuffleServiceBlockStatus, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it toexternalShuffleServiceBlockStatus, but when removing them there is no check, so a memory-only cache block will keepexternalShuffleServiceBlockStatusnull, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added toexternalShuffleServiceBlockStatusin the first place.Does this PR introduce any user-facing change?
No
How was this patch tested?
New and updated UT