
Conversation

gsemet
Contributor

@gsemet gsemet commented Sep 5, 2016

Use a virtualenv for isolation and easy installation of pep8 and pylint in the lint-python linter script.

This involves two changes to the script:

  • it creates a virtualenv if we are not already in one. This lets us use pip to install the development environment's dependencies (which differ from the "prod" environment; for instance, we don't want to install pep8 or pylint in prod)
  • dependencies are described in requirements.txt, which "freezes" the versions of the tools used. We don't want HEAD to start breaking because a new version of pylint published on PyPI adds new checks. The sphinx version is also frozen.
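As a sketch, such a pinned requirements.txt could look like the following (the version numbers here are illustrative, not necessarily the ones this PR pins):

```text
# dev/requirements.txt (hypothetical excerpt): pin the lint/doc toolchain so a
# new release on PyPI cannot silently change the set of reported errors.
pep8==1.7.0
pylint==1.6.4
sphinx==1.4.8
```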

This basically reverts 85a50a6

Might have been a solution to SPARK-9385 (see #7704)

Many of the new checks introduced by the newer version of pylint have been disabled. I propose to fix the issues in separate pull requests (obviously most 'import' order errors should be fixed by my other pull requests, such as #14830 for documentation examples, which is part of the code style effort described in #14567). Each subsequent pull request will fix one or more errors and re-enable the corresponding pylint check.

List of newly disabled checks:

  • bad-super-call
  • consider-iterating-dictionary
  • consider-using-enumerate
  • eval-used
  • exec-used
  • invalid-length-returned
  • misplaced-comparison-constant
  • raising-bad-type
  • redefined-variable-type
  • trailing-newlines
  • trailing-whitespace
  • ungrouped-imports
  • unnecessary-pass
  • unneeded-not
  • wrong-import-order
  • wrong-import-position
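In pylintrc terms, disabling these checks could look roughly like this (a hypothetical excerpt; the actual python/pylintrc in this PR may group or comment them differently):

```ini
[MESSAGES CONTROL]
# Temporarily disabled; each will be re-enabled by a follow-up PR that fixes
# the corresponding errors.
disable=bad-super-call,consider-iterating-dictionary,consider-using-enumerate,
        eval-used,exec-used,invalid-length-returned,misplaced-comparison-constant,
        raising-bad-type,redefined-variable-type,trailing-newlines,
        trailing-whitespace,ungrouped-imports,unnecessary-pass,unneeded-not,
        wrong-import-order,wrong-import-position
```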

How was this patch tested?

Manual execution of lint-python on my machine.

python/pylintrc Outdated
Contributor Author

@gsemet gsemet Sep 5, 2016

I have actually added a lot of exclusions here to make the code pass:

  • bad-super-call
  • consider-iterating-dictionary
  • consider-using-enumerate
  • eval-used
  • exec-used
  • invalid-length-returned
  • misplaced-comparison-constant
  • raising-bad-type
  • redefined-variable-type
  • trailing-newlines
  • trailing-whitespace
  • ungrouped-imports
  • unnecessary-pass
  • unneeded-not
  • wrong-import-order
  • wrong-import-position

dev/lint-python Outdated
Contributor Author

Note: here I deactivate the current environment's virtualenv. Tell me which solution you prefer:

  • jump into a custom venv (this proposal)
  • fail if already in a virtualenv
  • only warn the user if already in a virtualenv
  • no warning at all; the user knows what they are doing

Contributor

I think if we do jump to a custom virtualenv for syntax checking, we should at least restore the user's existing virtualenv after.

Contributor Author

What do you mean by "after"? For the Sphinx build? I think we can do both in this virtualenv; this would allow building the docs without having to run sudo pip install sphinx.
And we can control precisely the version of sphinx in this requirements.txt.

In any case, this virtualenv is deactivated automatically at the end of the script's execution.

Contributor

Using the virtualenv for the sphinx build sounds reasonable if we are already doing it for pep8. I was thinking that, since we explicitly leave the current virtualenv if the user is in one, we should try to restore it after the script has run; but if that already happens automatically, that's great.

Contributor Author

Actually, I'm thinking it may be better not to leave the virtualenv if we are already inside one, so we don't have to duplicate the code for deactivating it.
What do you think?

@SparkQA

SparkQA commented Sep 5, 2016

Test build #64949 has finished for PR 14963 at commit e0ad40a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 5, 2016

Test build #64950 has finished for PR 14963 at commit bd82d8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dev/lint-python Outdated
Contributor

@holdenk holdenk Sep 9, 2016

Wait, even simpler(ish): do we need to deactivate the old env, or will activating the new env be enough (provided we store the old VIRTUAL_ENV path to restore to after we deactivate)?
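A minimal sketch of that idea in shell (the paths and the "run the checks" step are illustrative, not the actual dev/lint-python code):

```shell
# Simulate a caller that was inside a virtualenv (illustrative path).
VIRTUAL_ENV="/tmp/user-venv"

# Remember the caller's virtualenv before switching to the lint one.
SAVED_VIRTUAL_ENV="$VIRTUAL_ENV"

# ... here lint-python would activate its own venv and run pep8/pylint,
# which clobbers $VIRTUAL_ENV ...
VIRTUAL_ENV="/tmp/lint-venv"

# Afterwards, restore the caller's environment from the saved path
# (the real script would source "$SAVED_VIRTUAL_ENV/bin/activate").
if [ -n "$SAVED_VIRTUAL_ENV" ]; then
    VIRTUAL_ENV="$SAVED_VIRTUAL_ENV"
fi
echo "$VIRTUAL_ENV"
```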

Contributor Author

I am open to suggestions. If we are already inside a venv, we can assume the user knows what they are doing and not deactivate it.

@holdenk
Contributor

holdenk commented Sep 9, 2016

Thanks for working to re-enable pylint (super important IMHO). Perhaps @davies could take a look at this approach?
I like that building up the venv seems to help set up for your other work improving pep8 support, although it is maybe a bit bigger a change than needed locally.

Contributor Author

Need to add the sphinx version here

@holdenk
Contributor

holdenk commented Sep 9, 2016

Oh wait, I've been a little confused by the PR title: pylint is already re-enabled and was only temporarily disabled a while back. So this is maybe less important, but it would make sense in the context of the other PRs you've put together. Could you maybe give this PR a clearer name like "Use virtualenv for pep8" rather than "Reenable Pylint"?

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Hmm, I don't see how it was re-enabled... Where is it called? And I had many errors to ignore once I re-enabled it in the execution of lint-python.

I'll update the title. At least, being inside a venv allows us to precisely control the versions of pep8 and pylint, which is kind of critical since the reported errors change from one version to another.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] Reenable Pylint [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 Sep 9, 2016
@holdenk
Contributor

holdenk commented Sep 9, 2016

I mean we already have the pep8 version locked, since we download a specific version with curl rather than depending on the system version.

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Yes, but for pylint you have many dependencies to update as well (astroid, ...). At least with a virtualenv, pip does it for us :)
And every time I see a hard-coded external URL, I am afraid it might break sooner or later.

@holdenk
Contributor

holdenk commented Sep 9, 2016

I also think virtualenv or similar is probably the right way to go for PySpark developer dependencies in the long term :)

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Great! By the way, I am also working on virtualenv and wheel support for PySpark job deployment (see #14180).

@holdenk
Contributor

holdenk commented Sep 9, 2016

Yeah, I've been semi-following #14180 & #13599 but haven't had much time to look at them closely. I think something like what you and @zjffdu are working on could make a real improvement to the PySpark experience for users with custom libraries.

@gsemet
Contributor Author

gsemet commented Sep 9, 2016

Great!

@gsemet gsemet force-pushed the venv-pep8-pylint branch 2 times, most recently from 994bdfd to 3f7db49 Compare September 9, 2016 09:00
@SparkQA

SparkQA commented Sep 9, 2016

Test build #65145 has finished for PR 14963 at commit 994bdfd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65146 has finished for PR 14963 at commit 3f7db49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet changed the title [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 [SPARK-16992][PYSPARK] Virtualenv for Pylint and pep8 in lint-python Sep 9, 2016
@holdenk
Contributor

holdenk commented Sep 14, 2016

cc @davies ?

@gsemet
Contributor Author

gsemet commented Sep 16, 2016

Rebased + updated the PR summary.

Hello, sorry to bother you, but if this patch gets merged, I can work on the pylint errors and submit new PRs for the checks I had to add to pylint's ignore list.

If I re-enable most of them, here is an extract of the errors I get:

************* Module pyspark.shell
W: 78, 8: Use of exec (exec-used)
************* Module pyspark.cloudpickle
R:767,16: Redefinition of v type from property to staticmethod (redefined-variable-type)
************* Module pyspark.tests
R:782,12: Redefinition of a type from pyspark.rdd.RDD to pyspark.rdd.PipelinedRDD (redefined-variable-type)
R:783,12: Redefinition of b type from pyspark.rdd.RDD to pyspark.rdd.PipelinedRDD (redefined-variable-type)
************* Module pyspark.rdd
R:1467,12: Redefinition of ser type from pyspark.serializers.AutoBatchedSerializer to pyspark.serializers.BatchedSerializer (redefined-variable-type)
************* Module pyspark.context
R:130,12: Redefinition of self.serializer type from pyspark.serializers.AutoBatchedSerializer to pyspark.serializers.BatchedSerializer (redefined-variable-type)
************* Module pyspark.worker
W:104,17: Use of eval (eval-used)
************* Module pyspark.shuffle
C:732,32: Consider iterating the dictionary directly instead of calling .keys() (consider-iterating-dictionary)
************* Module pyspark.ml.tests
R:1459, 8: Redefinition of a type from pyspark.ml.linalg.DenseVector to pyspark.ml.linalg.SparseVector (redefined-variable-type)
************* Module pyspark.ml.linalg
E:683, 4: __len__ does not return non-negative integer (invalid-length-returned)
R:1079, 8: Redefinition of colPtrs type from list to str (redefined-variable-type)
R:1078, 8: Redefinition of rowIndices type from list to str (redefined-variable-type)
************* Module pyspark.streaming.kafka
R:132,12: Redefinition of ser type from pyspark.serializers.PairDeserializer to pyspark.serializers.AutoBatchedSerializer (redefined-variable-type)
************* Module pyspark.streaming.tests
R:419, 8: Redefinition of initial type from list to pyspark.rdd.RDD (redefined-variable-type)
E:506,12: Bad first argument 'StreamingListener' given to super() (bad-super-call)
************* Module pyspark.streaming.context
C:298,12: Consider using enumerate instead of iterating with range and len (consider-using-enumerate)

These are not big errors; it would be quite easy to fix them and have pylint running on each pull request.
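For instance, the consider-using-enumerate warnings above call for a one-line change; a generic illustration (not actual PySpark code):

```python
items = ["a", "b", "c"]

# Flagged style: pylint's consider-using-enumerate fires on range(len(...)).
flagged = []
for i in range(len(items)):
    flagged.append((i, items[i]))

# Preferred style: enumerate yields the same (index, value) pairs directly.
preferred = list(enumerate(items))

assert flagged == preferred  # both are [(0, 'a'), (1, 'b'), (2, 'c')]
```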

Thanks

@SparkQA

SparkQA commented Sep 16, 2016

Test build #65481 has finished for PR 14963 at commit dff7b23.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Sep 21, 2016

So even though this passes tests, it's not showing up correctly in the PR dashboard for me. It might be easier to get reviewers to take a look if it was listed correctly; if you merge in the latest master the tests will re-run, or we can explicitly ask Jenkins to retest this.

@gsemet
Contributor Author

gsemet commented Sep 21, 2016

What is the PR dashboard?

I usually rebase this patch once or twice a week; I'll do it tomorrow.

@holdenk
Contributor

holdenk commented Sep 21, 2016

Oh, actually it seems like the PR dashboard updated itself (it lives at https://spark-prs.appspot.com/ and is sometimes used by reviewers to decide which PRs to look at).

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

You and I agree, actually:

  • PySpark can run inside Anaconda, and indeed this is greatly valuable. It makes all the packages provided by Anaconda available to the "driver" (in client mode)
  • Making the same environment available to the executors is a bit messy for the moment. Here we have an NFS share that every executor uses, but having it deployed automatically would be so cool. That's why the other PR :)

For development, PySpark will still use the Anaconda environment if the user wishes. Just for the lint-python execution, we fall back to a virtualenv, but only for the duration of the script.

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

I really think using Spark with Anaconda is a must-have. Deploying jobs that run inside a conda environment is so fast and efficient. I really want to push for pull request #14180, which also adds Anaconda support for executors (thanks to Jeff's work). Actually, before this PR I had never tried Anaconda, and now I use it every day in Jupyter. Anaconda on the driver and executors makes total sense!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66835 has finished for PR 14963 at commit 087d1a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66838 has finished for PR 14963 at commit 93395fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Oct 12, 2016

Oh, I get these pylint errors on my Ubuntu machine! I probably did not rebase correctly.

Fixed with new check ignores:

  • deprecated-method
  • unsubscriptable-object
  • used-before-assignment

Of course, once merged, I can work on fixing these pylint errors!

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66836 has finished for PR 14963 at commit 8a6d659.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66837 has finished for PR 14963 at commit 278ff07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66848 has finished for PR 14963 at commit 132fec8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2016

Test build #66846 has finished for PR 14963 at commit cd98471.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Oct 13, 2016

@stibbons however, the problem is that the virtualenv that gets created when a user is in a conda environment is partially broken.

@holdenk
Contributor

holdenk commented Oct 16, 2016

Also, if it would be useful to have a quick chat offline and then circle back to the PR with the result, let me know. I'd really like to see us get something like this in so we can have better/more consistent linting.

@gsemet
Contributor Author

gsemet commented Oct 20, 2016

I agree, just email me :)

Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lots of new checks disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
errors and re-enable the corresponding pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
@SparkQA

SparkQA commented Dec 22, 2016

Test build #70518 has finished for PR 14963 at commit 4e27af2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet
Contributor Author

gsemet commented Jan 9, 2017

Any hope this patch might be integrated?

@holdenk
Contributor

holdenk commented Feb 24, 2017

That's a reasonable question; I'll try to look at this some next week, but if you've got a chance to look at/address the current comments, that would be great.

@SparkQA

SparkQA commented Feb 25, 2017

Test build #73451 has finished for PR 14963 at commit 215b7b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Mar 1, 2017

So under the current Jenkins setup this isn't running the Python style checks in Jenkins, so it would be good to make sure these work there (can you update a Python file?).

@SparkQA

SparkQA commented May 7, 2017

Test build #76550 has finished for PR 14963 at commit 215b7b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

@SparkQA

SparkQA commented Jul 5, 2017

Test build #79181 has finished for PR 14963 at commit 215b7b3.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
