[SPARK-10796][CORE] Resubmit stage while lost task in Zombie and removed TaskSets Attempt #8927
Conversation
I will run a test job on the latest code to confirm whether the problem still exists...

Reproduced the problem, so re-opening this.
Test build #43057 has finished for PR 8927 at commit
jenkins retest this please
Test build #43059 has finished for PR 8927 at commit
Test build #43060 has finished for PR 8927 at commit
Force-pushed from ce83c9b to ff9ae61
jenkins retest this please
Test build #56344 has finished for PR 8927 at commit
Test build #56345 has finished for PR 8927 at commit
Force-pushed from fb478bb to 70af484
outputCommitCoordinator.stageEnd(stage.id)
listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
taskScheduler.zombieTasks(stage.id)
Once the stage has finished, it should make the previous TaskSets zombie.
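For context, here is a minimal sketch of what such a hook on the scheduler could look like. The `zombieTasks` name comes from the diff above, but the body and the `taskSetsByStageIdAndAttempt` lookup are assumptions for illustration, not the PR's literal change:

```scala
// Hypothetical sketch (not the PR's exact diff): when the DAGScheduler marks a
// stage as finished, ask the TaskScheduler to turn every TaskSetManager of that
// stage into a zombie so its remaining pending tasks are never scheduled again.
def zombieTasks(stageId: Int): Unit = synchronized {
  taskSetsByStageIdAndAttempt.get(stageId).foreach { attempts =>
    attempts.values.foreach { tsm => tsm.isZombie = true }
  }
}
```

The intent, per the comment above, is that finishing a stage explicitly retires its earlier attempts instead of leaving their unscheduled tasks behind.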
Test build #56691 has finished for PR 8927 at commit
Success,
makeMapStatus("hostA", reduceRdd.partitions.size)))
assert(shuffleStage.numAvailableOutputs === 2)
assert(mapOutputTracker.getMapSizesByExecutorId(shuffleId, 0).map(_._1).toSet ===
For a running stage, losing an executor will no longer unregister its outputLocs with this PR.
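For reference, this contrasts with the usual executor-loss handling, where the DAGScheduler drops the lost executor's map outputs for registered shuffle stages. A simplified sketch of that pre-existing path is below (not the exact source; the PR skips it for running stages, which is why the assertion above can expect both outputs to survive):

```scala
// Simplified sketch of the pre-existing behavior: MapStatus entries that live on
// the lost executor are unregistered from every shuffle map stage, which is what
// the test above verifies no longer happens for a stage that is still running.
private def handleExecutorLost(execId: String): Unit = {
  for ((shuffleId, stage) <- shuffleToMapStage) {
    stage.removeOutputsOnExecutor(execId)
    mapOutputTracker.registerMapOutputs(
      shuffleId,
      stage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
      changeEpoch = true)
  }
}
```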
Test build #56692 has finished for PR 8927 at commit
Test build #58235 has finished for PR 8927 at commit
Test build #58236 has finished for PR 8927 at commit
@suyanNone, I think the conflicts should be resolved at least once (to reach a mergeable state). Would you be able to resolve them?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
We hit this problem in Spark 1.3.0, and I have also reproduced it on the latest version.

desc:

We know a running `ShuffleMapStage` can have multiple `TaskSet`s: one active TaskSet, multiple zombie TaskSets, and multiple removed TaskSets.

A running `ShuffleMapStage` succeeds only when all of its partitions have been processed successfully, i.e. every task's `MapStatus` has been added into `outputLocs`.

The `MapStatus` entries of a running `ShuffleMapStage` may have been produced by RemovedTaskSet1 / Zombie TaskSet1 / Zombie TaskSet2 / ... / Active TaskSetN, so some outputs may be held only by a removed or zombie TaskSet. When an executor is lost, it can therefore happen that the lost `MapStatus` entries were produced only by a zombie TaskSet.

In the current logic, the fix for a lost `MapStatus` is that each TaskSet re-runs the tasks that succeeded on the lost executor: those tasks are re-added into the TaskSet's `pendingTasks`, and their partitions are re-added into the `Stage`'s `pendingPartitions` (see the sketch below). But this is useless when the lost `MapStatus` belongs only to a zombie/removed TaskSet: that TaskSet is zombie, so its `pendingTasks` will never be scheduled.
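The re-add behavior referred to here lives in the TaskSetManager's executor-loss handling, roughly along these lines (a simplified sketch of the existing logic, not a verbatim copy):

```scala
// Simplified sketch of the existing re-run logic: map tasks that had succeeded on
// the lost executor are marked unfinished again and put back into pendingTasks,
// and the DAGScheduler is told they were Resubmitted so the stage's
// pendingPartitions grows accordingly. A zombie TaskSet does this too, but its
// pendingTasks are never offered resources again, which is the gap this PR targets.
override def executorLost(execId: String, host: String): Unit = {
  for ((tid, info) <- taskInfos if info.executorId == execId && successful(info.index)) {
    successful(info.index) = false
    tasksSuccessful -= 1
    addPendingTask(info.index)
    sched.dagScheduler.taskEnded(tasks(info.index), Resubmitted, null, Seq.empty, info, null)
  }
}
```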
The only condition for resubmitting a stage is that some task throws a `FetchFailedException`, but the lost executor may not hold any `MapStatus` of the parent stages of the currently running stages (so no fetch failure is ever triggered), while the running `Stage` itself lost a `MapStatus` that belonged only to a zombie task or removed TaskSet.

So once all zombie TaskSets have finished their running tasks and the active TaskSet has finished its pending tasks, they are all removed by `TaskSchedulerImpl`, yet the running Stage's pending partitions are still non-empty. The job then hangs.

TestCase to show problem:
main changes:

The `DAGScheduler` only receives task resubmit events from active TaskSets, so it can compare `pendingPartitions` with the `ShuffleMapStage`'s missing outputs to tell whether some partition can no longer be computed by the current TaskSets, and decide whether the `ShuffleMapStage` needs to be resubmitted.

other changes:
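A rough illustration of the decision described under "main changes". The helper name `shouldResubmitStage` and the use of `findMissingPartitions`/`pendingPartitions` are my own framing of the idea, not the PR's literal diff:

```scala
// Illustrative sketch: when the active TaskSet reports a task to resubmit, compare
// what the ShuffleMapStage is still missing with what the current TaskSets will
// still run. A partition that is missing but no longer pending can only have
// belonged to a zombie/removed TaskSet, so the whole stage must be resubmitted
// rather than waiting for a FetchFailedException that may never come.
private def shouldResubmitStage(stage: ShuffleMapStage): Boolean = {
  val missing = stage.findMissingPartitions().toSet  // partitions with no MapStatus yet
  val stillPending = stage.pendingPartitions.toSet   // partitions current TaskSets will run
  (missing -- stillPending).nonEmpty
}
```

If that check fires, the scheduler would mark the stage as failed and feed it back through the normal stage-submission path, which is the resubmission the description argues can otherwise never be triggered.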