Conversation

@xuanyuanking
Member

What changes were proposed in this pull request?

This is the follow-up PR for #22962; it contains the following changes:

  • Remove __init__ in TaskContext and BarrierTaskContext.
  • Add more comments to explain the fix.
  • Rewrite UT in a new class.
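The motivation for removing `__init__` can be sketched in plain Python: when `__new__` returns a cached per-process singleton, Python still runs `__init__` on the returned object, so an `__init__` that resets attributes would wipe the state a reused worker depends on. A minimal sketch of the pattern (class and attribute names are illustrative, not Spark's exact implementation):

```python
class TaskContext:
    """Per-worker-process singleton; names here are illustrative."""
    _taskContext = None  # cached instance for the current worker process

    def __new__(cls):
        # Reuse the existing instance if one was already created in this
        # worker; otherwise create and cache a fresh one.
        if cls._taskContext is None:
            cls._taskContext = super().__new__(cls)
            cls._taskContext.attemptNumber = None
        return cls._taskContext

    # Note: no __init__ here. If one existed and reset attributes, Python
    # would run it on every TaskContext() call -- including calls that
    # return the cached singleton -- and clobber state already set for the
    # current task in a reused worker.


ctx = TaskContext()
ctx.attemptNumber = 3          # state set when a task starts
assert TaskContext() is ctx    # later lookups return the same object
assert TaskContext().attemptNumber == 3  # ...with its state intact
```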

How was this patch tested?

New UT in test_taskcontext.py

@xuanyuanking
Member Author

cc @HyukjinKwon and @cloud-fan.
Sorry for the delay on this follow-up; please have a look.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100674 has finished for PR 23435 at commit 0cf822f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon left a comment
Member


Looks fine if the tests pass

@xuanyuanking
Member Author

@HyukjinKwon The newly added UT passes on Python 2.7 and PyPy but fails on Python 3. It seems that worker reuse doesn't take effect on Python 3; I'm looking into this, not sure whether it's a bug or not.

@SparkQA

SparkQA commented Jan 3, 2019

Test build #100676 has finished for PR 23435 at commit a5c20db.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Member Author

It seems that worker reuse didn't take effect on Python 3; I'm looking into this, not sure whether it's a bug or not.

It turns out to be a bug: worker reuse stops working because the end-of-stream check in the Python worker returns unexpectedly. I'll open a JIRA and another PR tomorrow to fix it; this PR will continue after that fix lands.
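The reuse check being described works roughly like this: after a task finishes, the worker expects the very next value on its input stream to be an end-of-stream marker; if upstream code left data unconsumed, the check fails and the worker cannot be reused. A simplified stand-in using in-memory streams (the marker value and helper names below are made up for illustration, not Spark's actual constants):

```python
import io
import struct

END_OF_STREAM = -4  # illustrative marker value, not Spark's real constant


def read_int(stream):
    return struct.unpack(">i", stream.read(4))[0]


def write_int(value, stream):
    stream.write(struct.pack(">i", value))


def can_reuse_worker(infile):
    # After the task's data section, the next int must be the end-of-stream
    # marker; any leftover, unread data makes this check fail.
    return read_int(infile) == END_OF_STREAM


# Well-behaved task: all input consumed, marker comes next -> reusable.
buf = io.BytesIO()
write_int(END_OF_STREAM, buf)
buf.seek(0)
assert can_reuse_worker(buf)

# Misbehaving task: leftover payload still in the stream -> not reusable.
buf = io.BytesIO()
write_int(12345, buf)          # unconsumed data left behind
write_int(END_OF_STREAM, buf)
buf.seek(0)
assert not can_reuse_worker(buf)
```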

@HyukjinKwon
Member

Let's fix #23470 first.

asfgit pushed a commit that referenced this pull request Jan 9, 2019
… parallelize lazy iterable range

## What changes were proposed in this pull request?

During the follow-up work (#23435) for the PySpark worker reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happens because the specialized rdd.parallelize logic for xrange (introduced in #3264) generates data from a lazy iterable range and does not consume the passed-in iterator. That breaks the end-of-stream check in the Python worker and ultimately prevents worker reuse. See more details in the [SPARK-26549](https://issues.apache.org/jira/browse/SPARK-26549) description.

We fix this by forcing consumption of the passed-in iterator.

## How was this patch tested?
New UT in test_worker.py.

Closes #23470 from xuanyuanking/SPARK-26549.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
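The fix in #23470 boils down to draining the iterator the framework hands in, even when the function can generate its output lazily without it. A self-contained sketch of the idea (function and parameter names are illustrative, not the actual PySpark code):

```python
def make_range_func(start, step, per_split):
    # Each partition's data is generated lazily from the range parameters,
    # so the incoming iterator is logically unused -- but we must still
    # exhaust it so the framework sees the input stream fully consumed
    # (otherwise the end-of-stream check fails and worker reuse breaks).
    def f(split, iterator):
        for _ in iterator:  # force-consume the passed-in iterator
            pass
        begin = start + split * per_split * step
        return range(begin, begin + per_split * step, step)
    return f


f = make_range_func(start=0, step=2, per_split=3)
leftovers = iter([None])           # simulated framework input
assert list(f(0, leftovers)) == [0, 2, 4]
# The input iterator was drained as a side effect:
assert next(leftovers, "empty") == "empty"
```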
@SparkQA

SparkQA commented Jan 10, 2019

Test build #101012 has finished for PR 23435 at commit eedd445.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

This follow-up was not merged into branch-2.4 due to conflicts, although the main PR went into branch-2.4. Since these are rather stylistic changes, I think it's okay not to backport this follow-up.

To reduce the diff between master and branch-2.4, we can backport it too if anyone thinks we should.

@asfgit closed this in 98e831d Jan 11, 2019
@xuanyuanking
Member Author

 I think it's okay not to backport this followup.

Yea, agree.
Thanks Hyukjin and Felix for your review.

@xuanyuanking deleted the SPARK-25921-follow branch January 11, 2019 09:03
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… parallelize lazy iterable range
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…rTaskContext while python worker reuse