
Conversation

Contributor

@nchammas nchammas commented Dec 11, 2023

What changes were proposed in this pull request?

Summary:

  • Eliminate the need to manually maintain HTML tables for Spark configurations.
  • Using the approach introduced here, we also:
    • Ensure that internal configs are not accidentally documented publicly.
    • Ensure that configs are documented publicly just as they are documented in the code.
  • Reorganize the sections in the SQL tuning page slightly so that related tuning techniques are clearer.
  • Make it possible for users to hyperlink directly to config entries in our documentation.

This change enables us to assign the various SQL configs to groups that can then be referenced automatically in our documentation. This will replace the large number of manually maintained HTML tables with references to generated tables.

For example, the SQL performance tuning page has a section on data caching that references two configs. They are in an HTML table that has to be manually maintained.

With this new system, we group the configs in SQLConf.scala according to how we'd like to display them in our documentation. For example, we can assign some configs to the sql-tuning-caching-data group:

  val COMPRESS_CACHED = buildConf("spark.sql.inMemoryColumnarStorage.compressed")
    ...
    .withTag("sql-tuning-caching-data")

Then we can reference this sql-tuning-caching-data config group from within sql-performance-tuning.md as follows:

{% include_api_gen _generated/config_tables/sql-tuning-caching-data.html %}

This pulls in an automatically generated HTML table that includes all the configs tagged sql-tuning-caching-data.
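To make the mechanics concrete, here is a rough sketch of how such a table could be generated. This is a hypothetical Python illustration, not Spark's actual doc-generation tooling; the record shape and function name are assumptions.

```python
# Hypothetical sketch: render all configs carrying a given tag into an
# HTML table. Spark's real build tooling may differ; this only illustrates
# the idea of tag-driven table generation.
from html import escape

def render_config_table(configs, tag):
    """Build an HTML table for the configs tagged with `tag`."""
    rows = []
    for conf in configs:
        if conf["internal"] or tag not in conf["tags"]:
            continue  # internal configs never reach the public docs
        # An id on each row lets readers deep-link to a specific config.
        rows.append(
            f'<tr id="{escape(conf["name"])}">'
            f'<td><code>{escape(conf["name"])}</code></td>'
            f'<td>{escape(conf["default"])}</td>'
            f'<td>{escape(conf["doc"])}</td>'
            f"</tr>"
        )
    header = "<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>"
    return f"<table>{header}{''.join(rows)}</table>"

configs = [
    {"name": "spark.sql.inMemoryColumnarStorage.compressed",
     "default": "true",
     "doc": "When set to true, Spark SQL will automatically select a "
            "compression codec for each column based on statistics of the data.",
     "tags": ["sql-tuning-caching-data"], "internal": False},
    # Internal configs are filtered out, mirroring the PR's behavior.
    {"name": "spark.sql.files.openCostInBytes", "default": "4194304",
     "doc": "(internal)", "tags": ["sql-tuning-caching-data"], "internal": True},
]

html = render_config_table(configs, "sql-tuning-caching-data")
```

Under this scheme the Jekyll include simply drops the pre-rendered table into the page, and the row `id`s double as link anchors.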

In addition to eliminating the need to manually maintain HTML tables of configs, this PR also fixes some problems with the current documentation:

  1. It ensures that internal configs are not accidentally published in the user-facing documentation, because internal configs do not get exported to documentation. For example, spark.sql.files.openCostInBytes is an internal config, but it is currently documented publicly in the SQL tuning guide.
  2. It ensures that the public documentation of a config matches exactly what is written in the code. For example, the public documentation for spark.sql.autoBroadcastJoinThreshold is missing the following line that is in the code:

    and file-based data source tables where the statistics are computed directly on the files of data.

This PR also adds anchors to each config in the generated HTML tables, so that people can link directly to specific configurations in the documentation.

This PR builds on the work done in #27459, and is related to the work done in #28274.

Why are the changes needed?

This makes Spark configuration documentation much easier to maintain and more useful to users.

Does this PR introduce any user-facing change?

Yes, it alters some of the user-facing documentation.

How was this patch tested?

Manually built the documentation and viewed it in my browser.

Here's a screenshot of the generated docs: sql-performance-tuning.html 📸

I also reviewed the generated documentation for configuration.html but didn't screenshot it because it's too long.

Was this patch authored or co-authored using generative AI tooling?

No.

Contributor Author

@nchammas nchammas left a comment


@HyukjinKwon @cloud-fan @gatorsmile - What do you think? Tagging you because you last reviewed the documentation machinery I am modifying here.

</tr>

</table>
{% include_relative generated-sql-config-table-caching-data.html %}
Contributor Author

@nchammas nchammas Dec 11, 2023


This diff demonstrates the main benefit of this PR. Instead of needing to copy and maintain the full HTML table of some configs, we tag the ones we want to group together in SQLConf.scala and then reference that group's table here.

this
}

def withTag(tag: String): ConfigBuilder = {
Contributor Author


I am on the fence regarding this name. Maybe something more explicit like withDocumentationGroup would be better?
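For readers skimming the diff, the builder pattern at play here can be sketched in a few lines. This is a hypothetical Python analogue, not the actual Scala ConfigBuilder; the method and field names are assumptions for illustration.

```python
# Hypothetical Python analogue of a fluent config builder with tags.
# Spark's real ConfigBuilder is Scala; this only illustrates how a
# withTag-style method could accumulate documentation groups.
class ConfigBuilder:
    def __init__(self, key):
        self.key = key
        self.tags = []
        self.doc_text = ""

    def doc(self, text):
        self.doc_text = text
        return self  # returning self enables method chaining

    def with_tag(self, tag):
        # Each call records one documentation group for this config.
        self.tags.append(tag)
        return self

conf = (ConfigBuilder("spark.sql.inMemoryColumnarStorage.compressed")
        .doc("When set to true, Spark SQL will automatically select a codec.")
        .with_tag("sql-tuning-caching-data"))
```

Whatever the method is ultimately named, the key property is that tags live next to the config definition, so the docs can never drift out of sync with the code.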

Contributor Author


We could also do away with custom group names and instead select certain prefixes for use as documentation groups, like spark.sql.cbo, spark.sql.statistics, etc.

However, this won't work for groupings that don't align with config name prefixes, like the breakdown of runtime vs. static configurations that was added in #28274.
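To make that trade-off concrete, here is a small sketch (Python, purely illustrative) of prefix-derived grouping and where it falls short:

```python
# Sketch of deriving documentation groups from config-name prefixes.
# This works for prefix-aligned families like spark.sql.cbo, but a
# cross-cutting split such as runtime vs. static configs cannot be
# recovered from names alone, which is why explicit tags are needed.
def prefix_group(name, depth=3):
    """Group a config by its first `depth` dot-separated components."""
    return ".".join(name.split(".")[:depth])

group_a = prefix_group("spark.sql.cbo.enabled")
group_b = prefix_group("spark.sql.statistics.histogram.enabled")
```

Here `group_a` is `"spark.sql.cbo"` and `group_b` is `"spark.sql.statistics"`, but nothing in either name says whether the config is runtime or static.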

@nchammas nchammas changed the title Assign SQL configs to groups for use in documentation [SPARK-46395] Assign SQL configs to groups for use in documentation Dec 13, 2023
@nchammas nchammas changed the title [SPARK-46395] Assign SQL configs to groups for use in documentation [SPARK-46395][SQL] Assign SQL configs to groups for use in documentation Dec 13, 2023
@nchammas nchammas marked this pull request as ready for review December 13, 2023 21:00
@nchammas
Contributor Author

There are about 55 HTML tables of Spark configs (both SQL and non-SQL) across our documentation. They probably span a few thousand lines of HTML. I believe they can all be replaced with automatically generated HTML tables.

I intend to do that if this PR gets buy-in from committers.

@nchammas nchammas changed the title [SPARK-46395][SQL] Assign SQL configs to groups for use in documentation [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation Dec 17, 2023
@HyukjinKwon
Member

(will try to take a look around next week)

@nchammas
Contributor Author

Thank you, Hyukjin. (I hope when I edited this comment it didn't re-ping you. That wasn't my intention.)

@nchammas
Contributor Author

Btw @HyukjinKwon if there is anything I can do to make this PR easier to review, I'm all ears. The key change is the introduction of withTag in ConfigBuilder.scala, and the rest of the PR flows from that.

@nchammas
Contributor Author

nchammas commented Jan 11, 2024

@HyukjinKwon - How are you feeling about this PR? Shall I recruit another committer to take a look instead? No worries if you need more time, or if you disapprove of the approach. I just want to check in since I have the time and motivation to migrate all of our config documentation to use this new system.

As I noted in a previous comment, there are around 55 config tables across our documentation that span several thousand lines of manually maintained HTML. They can all be eliminated with this approach. I think it would be a big win for the maintainability of our documentation.

@nchammas
Contributor Author

Silence on a PR usually means there is something wrong with it, so I am proactively trying to figure out how to move this idea forward.

The possibilities I am considering are:

  1. Committers disapprove of the abstract idea of automating the generation of config tables.
  2. Committers approve of the abstract idea but disapprove of the implementation proposed in this PR.
  3. Committers approve of the abstract idea and may approve of the implementation proposed here, but find it difficult to review.
  4. Committers are constrained (due to time, energy, other priorities, etc.) and cannot participate at this time.

In the case of 2, I have proposed a completely different approach in #44756 that does not touch Spark core.

In the case of 3, I pushed #44755, a stripped-down version of this PR that should be easier to review.

In the case of 1, I need some feedback on why this is a bad idea so I can drop this effort. As you can see, I believe it's a good idea and am eager to push it forward.

In the case of 4, please forgive me for repeatedly pinging this PR and being annoying. 😅

@nchammas nchammas closed this Jan 16, 2024
@bjornjorgensen
Contributor

@nchammas there seem to be a number of PRs that have been approved but not yet merged to master, so I suspect it has something to do with the holidays.

This is not the first PR aimed at fixing the docs. Would an email to @dev be a good idea here?
