Skip to content

Conversation

@Kimahriman
Copy link
Contributor

What changes were proposed in this pull request?

Fixes a bug where if spark.shuffle.service.fetch.rdd.enabled=true, memory-only cached blocks will fail to unpersist.

Why are the changes needed?

In #33020, when all RDD blocks are removed from externalShuffleServiceBlockStatus, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to externalShuffleServiceBlockStatus, but when removing them there is no check, so a memory-only cache block will keep externalShuffleServiceBlockStatus null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to externalShuffleServiceBlockStatus in the first place.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New and updated UT

@Kimahriman
Copy link
Contributor Author

@attilapiros found this while testing #35085

@github-actions github-actions bot added the CORE label Mar 24, 2022
@Kimahriman Kimahriman changed the title [SPARK-38640][CORE] Only remove shuffle service tracked RDD [SPARK-38640][CORE] Fix NPE with memory-only cache blocks and RDD fetching Mar 24, 2022
@Kimahriman
Copy link
Contributor Author

I wasn't exactly sure what the "right" fix was here that avoided weird edge cases where you change a block from memory only to memory and disk or vice versa. The alternative is adding a null check in BlockStatusPerBlockId.remove but I wasn't sure if that was a race condition waiting to happen or if that will only ever be called from one thread. But I guess the get would also be a race condition if that was the case, so maybe that's the better alternative?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mridulm
Copy link
Contributor

mridulm commented Mar 25, 2022

Why not simply move the blocks.remove(blockId) in BlockStatusPerBlockId.remove within a if (null != blocks) check ?

(Note, all invocations are within a BlockManagerMasterEndpoint event loop - so no MT issues : else all uses of BlockStatusPerBlockId.blocks would have been broken :-) ).

@Kimahriman
Copy link
Contributor Author

Okay good to know, I'll just change it to that then. Was just overcomplicating things hah

@Kimahriman Kimahriman force-pushed the fetch-rdd-memory-only-unpersist branch 2 times, most recently from 8f4d2b9 to 7c79636 Compare March 25, 2022 22:11
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we test with both true and false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if you want to make sure just having the shuffle service enabled doesn't affect this. I expect there's other tests testing all the default configs with all the caching behaviors

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing both enabled and disabled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this disk persisted blocks correct for this test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make it more descriptive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the test name

Copy link
Contributor

@attilapiros attilapiros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @Kimahriman !

@Kimahriman Kimahriman force-pushed the fetch-rdd-memory-only-unpersist branch from 795b6e8 to bad0931 Compare April 12, 2022 10:40
@srowen
Copy link
Member

srowen commented Apr 12, 2022

Can you rebase and force push? something's up with the PR builder

@Kimahriman
Copy link
Contributor Author

That's what I tried to do this morning, was there a fix since then?

@srowen
Copy link
Member

srowen commented Apr 12, 2022

Oh, I just saw the email about some problem with PR builders. Let's sit tight, seems external

@Kimahriman
Copy link
Contributor Author

Here's the link to my successful action: https://github.com/Kimahriman/spark/actions/runs/2154365947

@srowen srowen closed this in e0939f0 Apr 17, 2022
srowen pushed a commit that referenced this pull request Apr 17, 2022
…ching

### What changes were proposed in this pull request?

Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist.

### Why are the changes needed?

In #33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and updated UT

Closes #35959 from Kimahriman/fetch-rdd-memory-only-unpersist.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit e0939f0)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Apr 17, 2022

Merged to master/3.3

Kimahriman added a commit to Kimahriman/spark that referenced this pull request Apr 10, 2023
…ching

### What changes were proposed in this pull request?

Fixes a bug where if `spark.shuffle.service.fetch.rdd.enabled=true`, memory-only cached blocks will fail to unpersist.

### Why are the changes needed?

In apache#33020, when all RDD blocks are removed from `externalShuffleServiceBlockStatus`, the underlying Map is nulled to reduce memory. When persisting blocks we check if it's using disk before adding it to `externalShuffleServiceBlockStatus`, but when removing them there is no check, so a memory-only cache block will keep `externalShuffleServiceBlockStatus` null, and when unpersisting it throw an NPE because it tries to remove from the null Map. This adds checks to the removal as well to only remove if the block is on disk, and therefore should have been added to `externalShuffleServiceBlockStatus` in the first place.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and updated UT

Closes apache#35959 from Kimahriman/fetch-rdd-memory-only-unpersist.

Authored-by: Adam Binford <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit e0939f0)
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants