
Conversation

gsemet
Contributor

@gsemet gsemet commented Sep 5, 2016

Use a virtualenv for isolation and easy installation of pep8 and pylint in the lint-python linter script.

This involves two changes to the script:

  • it creates a virtualenv if we are not already in one. This lets us use pip to install the development environment's dependencies (which differ from the "prod" environment; for instance, we don't want to install pep8 or pylint in prod)
  • dependencies are described in requirements.txt, which "freezes" the versions of the tools used. We don't want HEAD to start breaking because a new version of pylint published on PyPI adds new checks. The sphinx version is also frozen.
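As a sketch, such a pinned requirements.txt could look like the following (the version numbers here are illustrative, not necessarily the ones this PR pins):

```text
# dev/requirements.txt (hypothetical excerpt): pin the lint/doc toolchain so a
# new release on PyPI cannot silently change the set of reported errors.
pep8==1.7.0
pylint==1.6.4
sphinx==1.4.8
```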

This basically reverts 85a50a6

Might have been a solution to SPARK-9385 (see #7704)

Many of the new checks introduced by the newer version of pylint have been disabled. I propose to fix the issues in separate pull requests (obviously most 'import' order errors should be fixed by my other pull requests, such as #14830 for documentation examples, which is part of the code style effort described in #14567). Each subsequent pull request will fix one or more errors and re-enable the corresponding pylint check.

List of newly disabled checks:

  • bad-super-call
  • consider-iterating-dictionary
  • consider-using-enumerate
  • eval-used
  • exec-used
  • invalid-length-returned
  • misplaced-comparison-constant
  • raising-bad-type
  • redefined-variable-type
  • trailing-newlines
  • trailing-whitespace
  • ungrouped-imports
  • unnecessary-pass
  • unneeded-not
  • wrong-import-order
  • wrong-import-position
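In pylintrc terms, disabling these checks could look roughly like this (a hypothetical excerpt; the actual python/pylintrc in this PR may group or comment them differently):

```ini
[MESSAGES CONTROL]
# Temporarily disabled; each will be re-enabled by a follow-up PR that fixes
# the corresponding errors.
disable=bad-super-call,consider-iterating-dictionary,consider-using-enumerate,
        eval-used,exec-used,invalid-length-returned,misplaced-comparison-constant,
        raising-bad-type,redefined-variable-type,trailing-newlines,
        trailing-whitespace,ungrouped-imports,unnecessary-pass,unneeded-not,
        wrong-import-order,wrong-import-position
```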

How was this patch tested?

Manual execution of lint-python on my machine.

python/pylintrc Outdated
Contributor Author

@gsemet gsemet Sep 5, 2016

I have actually added a lot of exclusions here to make the code pass:

  • bad-super-call
  • consider-iterating-dictionary
  • consider-using-enumerate
  • eval-used
  • exec-used
  • invalid-length-returned
  • misplaced-comparison-constant
  • raising-bad-type
  • redefined-variable-type
  • trailing-newlines
  • trailing-whitespace
  • ungrouped-imports
  • unnecessary-pass
  • unneeded-not
  • wrong-import-order
  • wrong-import-position

dev/lint-python Outdated
Contributor Author

Note: here I deactivate the current environment's virtualenv. Tell me which solution you prefer:

  • jump into a custom venv (this proposal)
  • fail if already in a virtualenv
  • only warn the user if already in a virtualenv
  • no warning at all; the user knows what they are doing

Contributor

I think if we do jump to a custom virtualenv for syntax checking, we should at least restore the user's existing virtualenv after.

Contributor Author

What do you mean by "after"? For the Sphinx build? I think we can do both in this virtualenv; this would allow building the docs without having to run sudo pip install sphinx.
And we can control precisely the version of sphinx in this requirements.txt.

In any case, this virtualenv is deactivated automatically at the end of the script's execution.

Contributor

Using the virtualenv for the sphinx build sounds reasonable if we are already doing it for pep8. I was thinking that, since we explicitly leave the current virtualenv if the user is in one, we should try to restore it after the script has run; but if that already happens automatically, that's great.

Contributor Author

Actually, I'm thinking it may be better not to leave the virtualenv if we are already inside one, so we don't have to duplicate the code for deactivating it.
What do you think?

@SparkQA

SparkQA commented Sep 5, 2016

Test build #64949 has finished for PR 14963 at commit e0ad40a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 5, 2016

Test build #64950 has finished for PR 14963 at commit bd82d8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dev/lint-python Outdated
Contributor

@holdenk holdenk Sep 9, 2016

Wait, even simpler(ish): do we need to deactivate the old env, or will activating the new env be enough (provided we store the old VIRTUAL_ENV path to restore to after we deactivate)?
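A minimal sketch of that idea in shell (the paths and the "run the checks" step are illustrative, not the actual dev/lint-python code):

```shell
# Simulate a caller that was inside a virtualenv (illustrative path).
VIRTUAL_ENV="/tmp/user-venv"

# Remember the caller's virtualenv before switching to the lint one.
SAVED_VIRTUAL_ENV="$VIRTUAL_ENV"

# ... here lint-python would activate its own venv and run pep8/pylint,
# which clobbers $VIRTUAL_ENV ...
VIRTUAL_ENV="/tmp/lint-venv"

# Afterwards, restore the caller's environment from the saved path
# (the real script would source "$SAVED_VIRTUAL_ENV/bin/activate").
if [ -n "$SAVED_VIRTUAL_ENV" ]; then
    VIRTUAL_ENV="$SAVED_VIRTUAL_ENV"
fi
echo "$VIRTUAL_ENV"
```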

Contributor Author

I am open to suggestions. If we are already inside a venv, we can assume the user knows what they are doing and not deactivate it.

@holdenk
Contributor

holdenk commented Sep 9, 2016

Thanks for working to re-enable pylint (super important IMHO). Perhaps @davies could take a look at this approach?
I like that building up the venv seems to help set up for your other work improving pep8 support, although it is maybe a bit bigger a change than needed locally.

Contributor Author

Need to add the sphinx version here

@holdenk
Contributor

holdenk commented Sep 9, 2016

Oh wait, I've been a little confused by the PR title: pylint is already re-enabled and was only temporarily disabled a while back. So this is maybe less important, but it would make sense in the context of the other PRs you've put together. Could you maybe give this PR a clearer name like "Use virtualenv for pep8" rather than "Reenable Pylint"?

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Hmm, I don't see how it was re-enabled... Where is it called? And I had many errors to ignore once I re-enabled it in the execution of lint-python.

I'll update the title. At least, being inside a venv allows us to precisely control the versions of pep8 and pylint, which is kind of critical since the reported errors change from one version to another.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] Reenable Pylint [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 Sep 9, 2016
@holdenk
Contributor

holdenk commented Sep 9, 2016

I mean we already have the pep8 version locked, since we download a specific version with curl rather than depending on the system version.

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Yes, but for pylint you have many dependencies to update as well (astroid, ...). At least with a virtualenv, pip does it for us :)
And every time I see a hard-coded external URL, I am afraid it might break sooner or later.

@holdenk
Contributor

holdenk commented Sep 9, 2016

I also think virtualenv or similar is probably the right way to go for PySpark developer dependencies in the long term :)

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Great! By the way, I am also working on virtualenv and wheel support for PySpark job deployment (see #14180).

@holdenk
Contributor

holdenk commented Sep 9, 2016

Yeah, I've been semi-following #14180 & #13599 but haven't had much time to look at them closely. I think something like what you and @zjffdu are working on could make a real improvement to the PySpark experience for users with custom libraries.

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Great!

@gsemet gsemet force-pushed the venv-pep8-pylint branch 2 times, most recently from 994bdfd to 3f7db49 Compare September 9, 2016 09:00
@SparkQA

SparkQA commented Sep 9, 2016

Test build #65145 has finished for PR 14963 at commit 994bdfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65146 has finished for PR 14963 at commit 3f7db49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python Sep 9, 2016
@holdenk
Contributor

holdenk commented Sep 14, 2016

cc @davies ?

@gsemet
Contributor Author

gsemet commented Sep 16, 2016

Rebased + updated the PR summary.

Hello, sorry to bother you, but if this patch gets merged, I can work on the pylint errors and submit new PRs for the checks I had to add to pylint's ignore list.

If I re-enable most of them, here is an extract of the errors I get:

************* Module pyspark.shell
W: 78, 8: Use of exec (exec-used)
************* Module pyspark.cloudpickle
R:767,16: Redefinition of v type from property to staticmethod (redefined-variable-type)
************* Module pyspark.tests
R:782,12: Redefinition of a type from pyspark.rdd.RDD to pyspark.rdd.PipelinedRDD (redefined-variable-type)
R:783,12: Redefinition of b type from pyspark.rdd.RDD to pyspark.rdd.PipelinedRDD (redefined-variable-type)
************* Module pyspark.rdd
R:1467,12: Redefinition of ser type from pyspark.serializers.AutoBatchedSerializer to pyspark.serializers.BatchedSerializer (redefined-variable-type)
************* Module pyspark.context
R:130,12: Redefinition of self.serializer type from pyspark.serializers.AutoBatchedSerializer to pyspark.serializers.BatchedSerializer (redefined-variable-type)
************* Module pyspark.worker
W:104,17: Use of eval (eval-used)
************* Module pyspark.shuffle
C:732,32: Consider iterating the dictionary directly instead of calling .keys() (consider-iterating-dictionary)
************* Module pyspark.ml.tests
R:1459, 8: Redefinition of a type from pyspark.ml.linalg.DenseVector to pyspark.ml.linalg.SparseVector (redefined-variable-type)
************* Module pyspark.ml.linalg
E:683, 4: __len__ does not return non-negative integer (invalid-length-returned)
R:1079, 8: Redefinition of colPtrs type from list to str (redefined-variable-type)
R:1078, 8: Redefinition of rowIndices type from list to str (redefined-variable-type)
************* Module pyspark.streaming.kafka
R:132,12: Redefinition of ser type from pyspark.serializers.PairDeserializer to pyspark.serializers.AutoBatchedSerializer (redefined-variable-type)
************* Module pyspark.streaming.tests
R:419, 8: Redefinition of initial type from list to pyspark.rdd.RDD (redefined-variable-type)
E:506,12: Bad first argument 'StreamingListener' given to super() (bad-super-call)
************* Module pyspark.streaming.context
C:298,12: Consider using enumerate instead of iterating with range and len (consider-using-enumerate)

These are not big errors; it would be quite easy to fix them and have pylint running on each pull request.
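For instance, the consider-using-enumerate warnings above call for a one-line change; a generic illustration (not actual PySpark code):

```python
items = ["a", "b", "c"]

# Flagged style: pylint's consider-using-enumerate fires on range(len(...)).
flagged = []
for i in range(len(items)):
    flagged.append((i, items[i]))

# Preferred style: enumerate yields the same (index, value) pairs directly.
preferred = list(enumerate(items))

assert flagged == preferred  # both are [(0, 'a'), (1, 'b'), (2, 'c')]
```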

Thanks

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65481 has finished for PR 14963 at commit dff7b23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Sep 21, 2016

So even though this passes tests, it's not showing up correctly in the PR dashboard for me. It might be easier to get reviewers to take a look if it was listed correctly; if you merge in the latest master the tests will re-run, or we can explicitly ask Jenkins to retest this.

@gsemet
Contributor Author

gsemet commented Sep 21, 2016

What is the PR dashboard?

I usually rebase this patch once or twice a week; I'll do it tomorrow.

@holdenk
Contributor

holdenk commented Sep 21, 2016

Oh, actually it seems like the PR dashboard updated itself (it lives at https://spark-prs.appspot.com/ and is sometimes used by reviewers to decide which PRs to look at).

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

You and I agree, actually:

  • PySpark can run inside Anaconda, and indeed this is greatly valuable. It makes all the packages provided by Anaconda available to the "driver" (in client mode)
  • Making the same environment available to the executors is a bit messy for the moment. Here we have an NFS share that every executor uses, but having it deployed automatically would be so cool. That's why the other PR :)

For development, PySpark will still use the Anaconda environment if the user wishes. Just for the lint-python execution, we fall back to a virtualenv, but only for the duration of the script.

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

I really think using Spark with Anaconda is a must-have. Deploying jobs that run inside a conda environment is so fast and efficient. I really want to push for pull request #14180, which also adds Anaconda support for executors (thanks to Jeff's work). Actually, before this PR I had never tried Anaconda, and now I use it every day in Jupyter. Anaconda on the driver and executors makes total sense!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66835 has finished for PR 14963 at commit 087d1a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66838 has finished for PR 14963 at commit 93395fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

Oh, I get these pylint errors on my Ubuntu machine! I probably did not rebase correctly.

Fixed with new check ignores:

  • deprecated-method
  • unsubscriptable-object
  • used-before-assignment

Of course, once merged, I can work on fixing these pylint errors!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66836 has finished for PR 14963 at commit 8a6d659.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66837 has finished for PR 14963 at commit 278ff07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66848 has finished for PR 14963 at commit 132fec8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2016

Test build #66846 has finished for PR 14963 at commit cd98471.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Oct 13, 2016

@stibbons however, the problem is that the virtualenv that gets created when a user is in a conda environment is partially broken.

@holdenk
Contributor

holdenk commented Oct 16, 2016

Also, if it would be useful to have a quick chat offline and then circle back to the PR with the result, let me know. I'd really like to see us get something like this in so we can have better/more consistent linting.

@gsemet
Contributor Author

gsemet commented Oct 20, 2016

I agree, just email me :)

Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lots of new checks disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
errors and re-enable the corresponding pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
@SparkQA

SparkQA commented Dec 22, 2016

Test build #70518 has finished for PR 14963 at commit 4e27af2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Jan 9, 2017

Any hope this patch might be integrated?

@holdenk
Contributor

holdenk commented Feb 24, 2017

That's a reasonable question; I'll try to look at this some next week, but if you've got a chance to look at/address the current comments, that would be great.

@SparkQA

SparkQA commented Feb 25, 2017

Test build #73451 has finished for PR 14963 at commit 215b7b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Mar 1, 2017

So under the current Jenkins setup this isn't running the Python style checks in Jenkins, so it would be good to make sure these work there (can you update a Python file?).

@SparkQA

SparkQA commented May 7, 2017

Test build #76550 has finished for PR 14963 at commit 215b7b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

@SparkQA

SparkQA commented Jul 5, 2017

Test build #79181 has finished for PR 14963 at commit 215b7b3.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
