Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

This is the follow-up PR for #22962; it contains the following changes:

  • Remove __init__ in TaskContext and BarrierTaskContext.
  • Add more comments to explain the fix.
  • Rewrite UT in a new class.
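The motivation for removing `__init__` can be sketched in plain Python: when `__new__` returns a cached per-process singleton, Python still runs `__init__` on the returned object, so an `__init__` that resets attributes would wipe the state a reused worker depends on. A minimal sketch of the pattern (class and attribute names are illustrative, not Spark's exact implementation):

```python
class TaskContext:
    """Per-worker-process singleton; names here are illustrative."""
    _taskContext = None  # cached instance for the current worker process

    def __new__(cls):
        # Reuse the existing instance if one was already created in this
        # worker; otherwise create and cache a fresh one.
        if cls._taskContext is None:
            cls._taskContext = super().__new__(cls)
            cls._taskContext.attemptNumber = None
        return cls._taskContext

    # Note: no __init__ here. If one existed and reset attributes, Python
    # would run it on every TaskContext() call -- including calls that
    # return the cached singleton -- and clobber state already set for the
    # current task in a reused worker.


ctx = TaskContext()
ctx.attemptNumber = 3          # state set when a task starts
assert TaskContext() is ctx    # later lookups return the same object
assert TaskContext().attemptNumber == 3  # ...with its state intact
```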

How was this patch tested?

New UT in test_taskcontext.py

@xuanyuanking
Member Author

cc @HyukjinKwon and @cloud-fan.
Sorry for the delay on this follow-up; please have a look.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100674 has finished for PR 23435 at commit 0cf822f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon left a comment
Member


Looks fine if the tests pass

@xuanyuanking
Member Author

@HyukjinKwon The newly added UT passes on Python 2.7 and PyPy but fails on Python 3. It seems that worker reuse doesn't take effect on Python 3; I'm looking into this, not sure whether it's a bug or not.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100676 has finished for PR 23435 at commit a5c20db.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

It seems that worker reuse didn't take effect on Python 3; I'm looking into this, not sure whether it's a bug or not.

It turns out to be a bug: worker reuse stops working because the end-of-stream check in the Python worker returns unexpectedly. I'll open a JIRA and another PR tomorrow to fix it; this PR will continue after that fix lands.
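The reuse check being described works roughly like this: after a task finishes, the worker expects the very next value on its input stream to be an end-of-stream marker; if upstream code left data unconsumed, the check fails and the worker cannot be reused. A simplified stand-in using in-memory streams (the marker value and helper names below are made up for illustration, not Spark's actual constants):

```python
import io
import struct

END_OF_STREAM = -4  # illustrative marker value, not Spark's real constant


def read_int(stream):
    return struct.unpack(">i", stream.read(4))[0]


def write_int(value, stream):
    stream.write(struct.pack(">i", value))


def can_reuse_worker(infile):
    # After the task's data section, the next int must be the end-of-stream
    # marker; any leftover, unread data makes this check fail.
    return read_int(infile) == END_OF_STREAM


# Well-behaved task: all input consumed, marker comes next -> reusable.
buf = io.BytesIO()
write_int(END_OF_STREAM, buf)
buf.seek(0)
assert can_reuse_worker(buf)

# Misbehaving task: leftover payload still in the stream -> not reusable.
buf = io.BytesIO()
write_int(12345, buf)          # unconsumed data left behind
write_int(END_OF_STREAM, buf)
buf.seek(0)
assert not can_reuse_worker(buf)
```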

@HyukjinKwon
Member

Let's fix #23470 first.

asfgit pushed a commit that referenced this pull request Jan 9, 2019
… parallelize lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happens because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates data from a lazy iterable range and does not consume the passed-in iterator. That breaks the end-of-stream check in the Python worker and ultimately prevents worker reuse. See more details in the [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description.

We fix this by forcing consumption of the passed-in iterator.

## How was this patch tested?
New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
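The fix in #23470 boils down to draining the iterator the framework hands in, even when the function can generate its output lazily without it. A self-contained sketch of the idea (function and parameter names are illustrative, not the actual PySpark code):

```python
def make_range_func(start, step, per_split):
    # Each partition's data is generated lazily from the range parameters,
    # so the incoming iterator is logically unused -- but we must still
    # exhaust it so the framework sees the input stream fully consumed
    # (otherwise the end-of-stream check fails and worker reuse breaks).
    def f(split, iterator):
        for _ in iterator:  # force-consume the passed-in iterator
            pass
        begin = start + split * per_split * step
        return range(begin, begin + per_split * step, step)
    return f


f = make_range_func(start=0, step=2, per_split=3)
leftovers = iter([None])           # simulated framework input
assert list(f(0, leftovers)) == [0, 2, 4]
# The input iterator was drained as a side effect:
assert next(leftovers, "empty") == "empty"
```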
@SparkQA

SparkQA commented Jan 10, 2019

Test build #101012 has finished for PR 23435 at commit eedd445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

This follow-up was not merged into branch-2.4 due to conflicts, although the main PR went into branch-2.4. Since these are rather stylistic changes, I think it's okay not to backport this follow-up.

To reduce the diff between master and branch-2.4, we can backport it too if anyone thinks we should.

@asfgit closed this in 98e831d Jan 11, 2019
@xuanyuanking
Member Author

 I think it's okay not to backport this followup.

Yea, agree.
Thanks Hyukjin and Felix for your review.

@xuanyuanking deleted the SPARK-25921-follow branch January 11, 2019 09:03
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… parallelize lazy iterable range
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…rTaskContext while python worker reuse