Conversation

@nchammas
Contributor

@nchammas nchammas commented Feb 10, 2020

What changes were proposed in this pull request?

This PR makes the following refinements to the workflow for building docs:

  • Install Python and Ruby consistently using pyenv and rbenv across both the docs README and the release Dockerfile.
  • Pin the Python and Ruby versions we use.
  • Pin all direct Python and Ruby dependency versions.
  • Eliminate any use of sudo pip, which the Python community discourages, or sudo gem.
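Concretely, the setup described above might look like the following shell session. The version numbers here are illustrative placeholders, not the exact pins from this PR:

```shell
# Illustrative sketch: install pinned interpreters with pyenv/rbenv,
# then pin direct dependencies. All versions below are placeholders.
pyenv install 3.7.6 && pyenv global 3.7.6
rbenv install 2.5.7 && rbenv global 2.5.7

# No `sudo pip` or `sudo gem`: installs land in the pyenv/rbenv-managed trees.
pip install "sphinx==2.3.1" "mkdocs==1.0.4"
gem install jekyll -v 4.0.0
```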

Why are the changes needed?

This PR should increase the consistency and reproducibility of the doc-building process by managing Python and Ruby in a more consistent way, and by eliminating unused or outdated code.

Here's a possible example of an issue building the docs that would be addressed by the changes in this PR: #27459 (comment)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual tests:

  • I was able to build the Docker image successfully, minus the final part about RUN useradd.
    • I am unable to run do-release-docker.sh because I am not a committer and don't have the required GPG key.
  • I built the docs locally and viewed them in the browser.

I think I need a committer to more fully test out these changes.


# Image for building Spark releases. Based on Ubuntu 18.04.
#
# Includes:
Member

@dongjoon-hyun and @wangyum might be better placed to review.

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118187 has finished for PR 27534 at commit bc15b86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

nchammas commented Feb 13, 2020

Should I break up this PR to make it easier to review? (Edit: Done.)

e.g.

  1. Makefile trim (split to [SPARK-30880][DOCS] Delete Sphinx Makefile cruft #27625)
  2. pages -> nav (split to [SPARK-30731] Update deprecated Mkdocs option #27626)
  3. pyenv and rbenv + dependency pinning

Asking just in case the silence from other reviewers is because this PR is too big.

If y'all just haven't gotten around to taking a look, no worries. I'm being proactive.

@nchammas
Contributor Author

I've split off the Makefile and Mkdocs changes and updated my previous comment with links to the new PRs.

@SparkQA

SparkQA commented Feb 18, 2020

Test build #118652 has finished for PR 27534 at commit 25fdaef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas
Contributor Author

Retest this please.

@nchammas
Contributor Author

Is it possible for contributors to re-trigger the GitHub workflows for a certain commit? I'm not seeing how to do that.

@srowen
Member

srowen commented Feb 26, 2020

FWIW, I don't think Affected Version really matters much for an improvement. You can set it to say it would be a valid improvement as of a version -- maybe it improves a thing that's only present from 2.4.0 onwards, or something. But in most cases it probably 'affects' (could be applied to) lots of versions, and that isn't as interesting as in the case of a bug. It's not required unless it's really important to clarify, and it doesn't hurt in most cases either.

@srowen
Member

srowen commented Feb 27, 2020

Just to be clear, any objections to merging?

@nchammas
Contributor Author

I don't think any committers have tried working with the proposed build instructions and/or Docker definition. Perhaps a committer could try things out to make sure it works for them?

@srowen
Member

srowen commented Mar 5, 2020

Jenkins retest this please

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119425 has finished for PR 27534 at commit 2b1b42f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

docs/README.md (outdated suggestion)

- You need to have [Ruby](https://www.ruby-lang.org/en/documentation/installation/) and [Python](https://docs.python.org/2/using/unix.html#getting-and-installing-the-latest-version-of-python) installed. Also install the following libraries:
+ You need to have Ruby 2 and Python 3 installed. A handy way to install and manage various versions of Ruby and Python is with [`rbenv`] and [`pyenv`].
Member

Can we revert these doc changes in the documentation? Otherwise, I am good with other changes.

Member

Or alternatively, we can have one section describing the use of rbenv and pyenv separately... that works for me too.

Contributor Author

I moved them to their own sub-section. If we think it's distracting / not helpful, I can remove it.

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119478 has finished for PR 27534 at commit ebba668.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in 7892f88 Mar 7, 2020
@nchammas nchammas deleted the SPARK-30731-building-docs branch March 9, 2020 18:44
pandoc pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev

ENV PATH "$PATH:/root/.pyenv/bin:/root/.pyenv/shims"
RUN curl -L https://github.com/pyenv/pyenv-installer/raw/dd3f7d0914c5b4a416ca71ffabdf2954f2021596/bin/pyenv-installer | bash
Member

@HyukjinKwon HyukjinKwon Mar 31, 2020

Okay, I suspect it was not tested due to the limitation described in the PR description:

Generating SQL API Markdown files.
20/03/31 06:41:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/opt/spark-rm/output/spark/sql/gen-sql-api-docs.py", line 21, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/opt/spark-rm/output/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 51, in <module>
  File "/opt/spark-rm/output/spark/python/lib/pyspark.zip/pyspark/context.py", line 22, in <module>
ImportError: No module named threading
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Seems the installed Python is weird. threading is a standard Python library that has existed from Python 2 through Python 3, but it seems to be missing from the Python installed here.
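A cheap way to verify a suspect interpreter is a standard-library import probe. This helper is purely illustrative, not part of the release scripts:

```python
# Sanity-check an interpreter's standard library. 'threading' has shipped
# with CPython since the 1.x days, so a failed import signals a broken or
# misconfigured build like the one in the log above.
import importlib


def stdlib_missing(modules=("threading", "json", "zlib")):
    """Return the subset of the given stdlib modules that fail to import."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


print(stdlib_missing())  # a healthy interpreter prints []
```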

Let me revert this to make RC preparation easier.

Contributor Author

Hmm, that's strange, because I was able to build enough of the Dockerfile to get past this point, and I did not hit this problem.

What I couldn't test was to run everything via do-release-docker.sh.

@cloud-fan
Contributor

cloud-fan commented Mar 31, 2020

We can manually test it by changing the ENTRYPOINT in Dockerfile to /bin/bash and check if the python and ruby are installed correctly.

The release script doesn't work anymore after this patch because we use a non-standard way to install python/ruby (download a bash script and run it):

  1. The actual work is done by user spark-rm, which can't access /root/.pyenv
  2. Even if we fix the permission issue, the installed python doesn't work well with the system libraries, and throws errors like ImportError: No module named threading
  3. It's wrong to append the PATH, and we should prepend it. Otherwise python still points to the system default which is 2.7.
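Point 3 can be illustrated without Docker: shell command lookup walks PATH left to right and stops at the first match, so an appended shims directory can never shadow the system interpreter. A self-contained sketch (the `whichpy` stub is hypothetical, standing in for `python`):

```shell
# Minimal illustration of PATH precedence: lookup stops at the first match.
demo=$(mktemp -d)
mkdir -p "$demo/sys" "$demo/pyenv"
printf '#!/bin/sh\necho system\n' > "$demo/sys/whichpy"
printf '#!/bin/sh\necho pyenv\n'  > "$demo/pyenv/whichpy"
chmod +x "$demo/sys/whichpy" "$demo/pyenv/whichpy"

# Appending the shims dir (as the reverted Dockerfile did): system copy wins.
appended=$(  PATH="$demo/sys:$demo/pyenv"; whichpy )
# Prepending the shims dir: the pyenv shim wins.
prepended=$( PATH="$demo/pyenv:$demo/sys"; whichpy )
echo "$appended $prepended"  # prints: system pyenv
```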

@HyukjinKwon
Member

Reverted at 4d4c3e7.

@dongjoon-hyun
Member

BTW, are we using the release script in master branch?
We had better use the release script in branch-3.0 to avoid the Apache Spark 2.4.2 situation.

cc @rxin since he is a release manager.

@cloud-fan
Contributor

Yea we should use the script in branch-3.0 so this doesn't block the RC. But we should still revert this PR as it doesn't work.

@dongjoon-hyun
Member

Sure~ Of course for reverting this PR.

@HyukjinKwon
Member

I think my wording caused some miscommunication here:

Let me revert this to make RC preparation easier.

I simply meant future RCs :-).

@nchammas
Contributor Author

We can manually test it by changing the ENTRYPOINT in Dockerfile to /bin/bash and check if the python and ruby are installed correctly.

The release script doesn't work anymore after this patch because we use a non-standard way to install python/ruby (download a bash script and run it):

  1. The actual work is done by user spark-rm, which can't access /root/.pyenv
  2. Even if we fix the permission issue, the installed python doesn't work well with the system libraries, and throws errors like ImportError: No module named threading
  3. It's wrong to append the PATH, and we should prepend it. Otherwise python still points to the system default which is 2.7.

This is strange since I tested this successfully via docker build . from within the folder containing the Dockerfile.

What I couldn't do was test building the image via do-release-docker.sh, which I called out in the PR description and again in the comments.

I think the reason my test didn't see these issues is because the do-release-docker.sh script builds the image with some options that change the executing user from root.

run_silent "Building spark-rm image with tag $IMGTAG..." "docker-build.log" \
  docker build --no-cache -t "spark-rm:$IMGTAG" --build-arg UID=$UID "$SELF/spark-rm"
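For context, the `--build-arg UID=$UID` above feeds a Dockerfile `ARG` that typically drives a `useradd` and `USER` switch. A minimal sketch of that pattern (the exact contents of the spark-rm Dockerfile are assumed, not reproduced):

```dockerfile
# Sketch only: how a build arg changes the effective user mid-build.
ARG UID
# Anything installed before this point (e.g. pyenv under /root) is owned by root.
RUN useradd -m -s /bin/bash -u $UID spark-rm
USER spark-rm
# From here on, commands run as spark-rm, which cannot read /root/.pyenv.
```

A plain `docker build .` never exercises the `UID` build arg, which is why the problem only surfaced under do-release-docker.sh.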

I guess the moral of the story is that I couldn't test do-release-docker.sh directly, and my substitute test of docker build . was not a good test since it didn't account for the root vs. spark-rm user change.

I should have more carefully tried to duplicate the docker build command as it appeared in do-release-docker.sh, and insisted more forcefully that we not merge this PR in until a committer had had a chance to try out do-release-docker.sh directly.

Sorry.

@nchammas
Contributor Author

I can try taking another stab at this change for the future, now that I have a better idea of how to test it.

However, given that this change was met with some resistance to begin with, I won't try to fix it unless there is explicit committer interest.

@cloud-fan
Contributor

This PR does make a good point about pinning the dependency versions so that the script is more robust. I'm happy to see a working version of it.

@HyukjinKwon
Member

HyukjinKwon commented Apr 1, 2020

We can pin the versions for release-related dependencies; however, I doubt we should do that for others, e.g., CI and documentation. It's being discussed at #27928
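For the release-related dependencies, pinning could be expressed as an exact-version constraint file along these lines (file name and versions are illustrative; the Ruby side would use a Gemfile with similarly exact `gem 'jekyll', 'x.y.z'` entries):

```
# requirements.txt -- exact pins for Python doc tooling (versions illustrative)
sphinx==2.3.1
mkdocs==1.0.4
```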

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Split from apache#27534.

### What changes were proposed in this pull request?

This PR updates a deprecated Mkdocs option to use the new name.

### Why are the changes needed?

This change will prevent the docs from failing to build when we update to a version of Mkdocs that no longer supports the deprecated option.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I built the docs locally and reviewed them in my browser.

Closes apache#27626 from nchammas/SPARK-30731-mkdocs-dep-opt.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
Split off from apache#27534.

### What changes were proposed in this pull request?

This PR deletes some unused cruft from the Sphinx Makefiles.

### Why are the changes needed?

To remove dead code and the associated maintenance burden.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Tested locally by building the Python docs and reviewing them in my browser.

Closes apache#27625 from nchammas/SPARK-30731-makefile-cleanup.

Lead-authored-by: Nicholas Chammas <[email protected]>
Co-authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?

This PR makes the following refinements to the workflow for building docs:
* Install Python and Ruby consistently using pyenv and rbenv across both the docs README and the release Dockerfile.
* Pin the Python and Ruby versions we use.
* Pin all direct Python and Ruby dependency versions.
* Eliminate any use of `sudo pip`, which the Python community discourages, or `sudo gem`.

### Why are the changes needed?

This PR should increase the consistency and reproducibility of the doc-building process by managing Python and Ruby in a more consistent way, and by eliminating unused or outdated code.

Here's a possible example of an issue building the docs that would be addressed by the changes in this PR: apache#27459 (comment)

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual tests:
* I was able to build the Docker image successfully, minus the final part about `RUN useradd`.
    * I am unable to run `do-release-docker.sh` because I am not a committer and don't have the required GPG key.
* I built the docs locally and viewed them in the browser.

I think I need a committer to more fully test out these changes.

Closes apache#27534 from nchammas/SPARK-30731-building-docs.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: Sean Owen <[email protected]>