[SPARK-25921][Follow Up][PySpark] Fix barrier task run without BarrierTaskContext while python worker reuse #23435
Conversation
cc @HyukjinKwon and @cloud-fan.
Test build #100674 has finished for PR 23435 at commit
Looks fine if the tests pass
@HyukjinKwon The newly added UT passes in python2.7 and pypy, but fails in python3. It seems that worker reuse doesn't take effect in python3. I'm looking into this; not sure whether it's a bug or not.
Test build #100676 has finished for PR 23435 at commit
It's a bug: worker reuse loses efficacy because of an unexpected return in the end-of-stream checking logic in the Python worker. I'll open another JIRA and PR tomorrow to fix it; this PR will continue after that fix.
Let's fix #23470 first.
… parallelize lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happened because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates data from a lazy iterable range and doesn't need to use the passed-in iterator. But this breaks the end-of-stream check in the Python worker and ultimately causes worker reuse to take no effect. See more details in the [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description. We fix this by forcing use of the passed-in iterator.

## How was this patch tested?

New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
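For illustration only, here is a minimal sketch (not the actual UT in test_worker.py) of how worker reuse can be observed for an xrange/range-backed RDD: run the same job twice and compare the PIDs of the Python workers that served each run. The app name and assertion below are assumptions made for this sketch.

```python
import os
from pyspark import SparkConf, SparkContext

# Sketch only: check that Python workers are reused across jobs for a
# range-backed RDD. Assumes spark.python.worker.reuse (default: true).
conf = SparkConf().set("spark.python.worker.reuse", "true")
sc = SparkContext("local[1]", "worker-reuse-sketch", conf=conf)

def worker_pid(_):
    # Report the PID of the Python worker that ran this record.
    return os.getpid()

# xrange exists only in Python 2; fall back to range on Python 3.
try:
    data = xrange(10)
except NameError:
    data = range(10)
rdd = sc.parallelize(data, 2)

first_pids = set(rdd.map(worker_pid).collect())
second_pids = set(rdd.map(worker_pid).collect())

# With reuse working, the second run should land on the same workers.
assert second_pids <= first_pids, "Python workers were not reused"

sc.stop()
```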
Test build #101012 has finished for PR 23435 at commit
Merged to master.
This followup was not merged into branch-2.4 due to conflicts, although the main PR went into branch-2.4. Since these are rather stylistic changes, I think it's okay not to backport this followup. To reduce the diff between master and branch-2.4, we can backport it too if anyone thinks we should.
Yea, agree. |
…rTaskContext while python worker reuse

## What changes were proposed in this pull request?

It's the follow-up PR for apache#22962 and contains the following work:
- Remove `__init__` in TaskContext and BarrierTaskContext.
- Add more comments to explain the fix.
- Rewrite the UT in a new class.

## How was this patch tested?

New UT in test_taskcontext.py.

Closes apache#23435 from xuanyuanking/SPARK-25921-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?

It's the follow-up PR for #22962 and contains the following work:
- Remove `__init__` in TaskContext and BarrierTaskContext.
- Add more comments to explain the fix.
- Rewrite the UT in a new class.

How was this patch tested?

New UT in test_taskcontext.py
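As a rough sketch of the scenario the rewritten UT exercises (illustrative only, not the exact code from test_taskcontext.py): with worker reuse enabled, a barrier stage should see a BarrierTaskContext, while a later normal stage that runs in the same reused Python worker should get back a plain TaskContext. The app name and assertions below are assumptions.

```python
from pyspark import SparkContext
from pyspark.taskcontext import TaskContext, BarrierTaskContext

# Sketch only: the barrier stage runs first, then a normal stage that
# may reuse the same Python workers; each task reports its context type.
sc = SparkContext("local[2]", "barrier-context-sketch")
rdd = sc.parallelize(range(4), 2)

barrier_types = rdd.barrier().mapPartitions(
    lambda it: [type(BarrierTaskContext.get()).__name__]
).collect()

normal_types = rdd.map(
    lambda x: type(TaskContext.get()).__name__
).collect()

# After the fix, a reused worker must not leak BarrierTaskContext
# into a non-barrier task.
assert all(t == "BarrierTaskContext" for t in barrier_types)
assert all(t == "TaskContext" for t in normal_types)

sc.stop()
```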