
Conversation

Contributor

@nchammas nchammas commented Dec 11, 2023

What changes were proposed in this pull request?

Summary:

  • Eliminate the need to manually maintain HTML tables for Spark configurations.
  • Using the approach introduced here, we also:
    • Ensure that internal configs are not accidentally documented publicly.
    • Ensure that configs are documented publicly just as they are documented in the code.
  • Reorganize the sections in the SQL tuning page slightly so that related tuning techniques are clearer.
  • Make it possible for users to hyperlink directly to config entries in our documentation.

This change enables us to assign the various SQL configs to groups that can then be referenced automatically in our documentation. This will replace the large number of manually maintained HTML tables with references to generated tables.

For example, the SQL performance tuning page has a section on data caching that references two configs. They are in an HTML table that has to be manually maintained.

With this new system, we group the configs in SQLConf.scala according to how we'd like to display them in our documentation. For example, we can assign some configs to the sql-tuning-caching-data group:

  val COMPRESS_CACHED = buildConf("spark.sql.inMemoryColumnarStorage.compressed")
    ...
    .withTag("sql-tuning-caching-data")

Then we can reference this sql-tuning-caching-data config group from within sql-performance-tuning.md as follows:

{% include_api_gen _generated/config_tables/sql-tuning-caching-data.html %}

This pulls in an automatically generated HTML table that includes all the configs tagged sql-tuning-caching-data.
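To make the mechanics concrete, here is a rough sketch of how such a table could be generated. This is a hypothetical Python illustration, not Spark's actual doc-generation tooling; the record shape and function name are assumptions.

```python
# Hypothetical sketch: render all configs carrying a given tag into an
# HTML table. Spark's real build tooling may differ; this only illustrates
# the idea of tag-driven table generation.
from html import escape

def render_config_table(configs, tag):
    """Build an HTML table for the configs tagged with `tag`."""
    rows = []
    for conf in configs:
        if conf["internal"] or tag not in conf["tags"]:
            continue  # internal configs never reach the public docs
        # An id on each row lets readers deep-link to a specific config.
        rows.append(
            f'<tr id="{escape(conf["name"])}">'
            f'<td><code>{escape(conf["name"])}</code></td>'
            f'<td>{escape(conf["default"])}</td>'
            f'<td>{escape(conf["doc"])}</td>'
            f"</tr>"
        )
    header = "<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>"
    return f"<table>{header}{''.join(rows)}</table>"

configs = [
    {"name": "spark.sql.inMemoryColumnarStorage.compressed",
     "default": "true",
     "doc": "When set to true, Spark SQL will automatically select a "
            "compression codec for each column based on statistics of the data.",
     "tags": ["sql-tuning-caching-data"], "internal": False},
    # Internal configs are filtered out, mirroring the PR's behavior.
    {"name": "spark.sql.files.openCostInBytes", "default": "4194304",
     "doc": "(internal)", "tags": ["sql-tuning-caching-data"], "internal": True},
]

html = render_config_table(configs, "sql-tuning-caching-data")
```

Under this scheme the Jekyll include simply drops the pre-rendered table into the page, and the row `id`s double as link anchors.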

In addition to eliminating the need to manually maintain HTML tables of configs, this PR also fixes some problems with the current documentation:

  1. It ensures that internal configs are not accidentally published in the user-facing documentation, because internal configs do not get exported to documentation. For example, spark.sql.files.openCostInBytes is an internal config, but it is currently documented publicly in the SQL tuning guide.
  2. It ensures that the public documentation of a config matches exactly what is written in the code. For example, the public documentation for spark.sql.autoBroadcastJoinThreshold is missing the following line that is in the code:

    and file-based data source tables where the statistics are computed directly on the files of data.

This PR also adds anchors to each config in the generated HTML tables, so that people can link directly to specific configurations in the documentation.

This PR builds on the work done in #27459, and is related to the work done in #28274.

Why are the changes needed?

This makes Spark configuration documentation much easier to maintain and more useful to users.

Does this PR introduce any user-facing change?

Yes, it alters some of the user-facing documentation.

How was this patch tested?

Manually built the documentation and viewed it in my browser.

Here's a screenshot of the generated docs: sql-performance-tuning.html 📸

I also reviewed the generated documentation for configuration.html but didn't screenshot it because it's too long.

Was this patch authored or co-authored using generative AI tooling?

No.

Contributor Author

@nchammas nchammas left a comment


@HyukjinKwon @cloud-fan @gatorsmile - What do you think? Tagging you because you last reviewed the documentation machinery I am modifying here.

</tr>

</table>
{% include_relative generated-sql-config-table-caching-data.html %}
Contributor Author

@nchammas nchammas Dec 11, 2023


This diff demonstrates the main benefit of this PR. Instead of needing to copy and maintain the full HTML table of some configs, we tag the ones we want to group together in SQLConf.scala and then reference that group's table here.

this
}

def withTag(tag: String): ConfigBuilder = {
Contributor Author


I am on the fence regarding this name. Maybe something more explicit like withDocumentationGroup would be better?
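For readers skimming the diff, the builder pattern at play here can be sketched in a few lines. This is a hypothetical Python analogue, not the actual Scala ConfigBuilder; the method and field names are assumptions for illustration.

```python
# Hypothetical Python analogue of a fluent config builder with tags.
# Spark's real ConfigBuilder is Scala; this only illustrates how a
# withTag-style method could accumulate documentation groups.
class ConfigBuilder:
    def __init__(self, key):
        self.key = key
        self.tags = []
        self.doc_text = ""

    def doc(self, text):
        self.doc_text = text
        return self  # returning self enables method chaining

    def with_tag(self, tag):
        # Each call records one documentation group for this config.
        self.tags.append(tag)
        return self

conf = (ConfigBuilder("spark.sql.inMemoryColumnarStorage.compressed")
        .doc("When set to true, Spark SQL will automatically select a codec.")
        .with_tag("sql-tuning-caching-data"))
```

Whatever the method is ultimately named, the key property is that tags live next to the config definition, so the docs can never drift out of sync with the code.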

Contributor Author


We could also do away with custom group names and instead select certain prefixes for use as documentation groups, like spark.sql.cbo, spark.sql.statistics, etc.

However, this won't work for groupings that don't align with config name prefixes, like the breakdown of runtime vs. static configurations that was added in #28274.
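To make that trade-off concrete, here is a small sketch (Python, purely illustrative) of prefix-derived grouping and where it falls short:

```python
# Sketch of deriving documentation groups from config-name prefixes.
# This works for prefix-aligned families like spark.sql.cbo, but a
# cross-cutting split such as runtime vs. static configs cannot be
# recovered from names alone, which is why explicit tags are needed.
def prefix_group(name, depth=3):
    """Group a config by its first `depth` dot-separated components."""
    return ".".join(name.split(".")[:depth])

group_a = prefix_group("spark.sql.cbo.enabled")
group_b = prefix_group("spark.sql.statistics.histogram.enabled")
```

Here `group_a` is `"spark.sql.cbo"` and `group_b` is `"spark.sql.statistics"`, but nothing in either name says whether the config is runtime or static.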

@nchammas nchammas changed the title Assign SQL configs to groups for use in documentation [SPARK-46395] Assign SQL configs to groups for use in documentation Dec 13, 2023
@nchammas nchammas changed the title [SPARK-46395] Assign SQL configs to groups for use in documentation [SPARK-46395][SQL] Assign SQL configs to groups for use in documentation Dec 13, 2023
@nchammas nchammas marked this pull request as ready for review December 13, 2023 21:00
@nchammas
Contributor Author

There are about 55 HTML tables of Spark configs (both SQL and non-SQL) across our documentation. They probably span a few thousand lines of HTML. I believe they can all be replaced with automatically generated HTML tables.

I intend to do that if this PR gets buy-in from committers.

@nchammas nchammas changed the title [SPARK-46395][SQL] Assign SQL configs to groups for use in documentation [SPARK-46395][DOCS] Assign Spark configs to groups for use in documentation Dec 17, 2023
@HyukjinKwon
Member

(will try to take a look around next week)

@nchammas
Contributor Author

Thank you, Hyukjin. (I hope when I edited this comment it didn't re-ping you. That wasn't my intention.)

@nchammas
Contributor Author

Btw @HyukjinKwon if there is anything I can do to make this PR easier to review, I'm all ears. The key change is the introduction of withTag in ConfigBuilder.scala, and the rest of the PR flows from that.

@nchammas
Contributor Author

nchammas commented Jan 11, 2024

@HyukjinKwon - How are you feeling about this PR? Shall I recruit another committer to take a look instead? No worries if you need more time, or if you disapprove of the approach. I just want to check in since I have the time and motivation to migrate all of our config documentation to use this new system.

As I noted in a previous comment, there are around 55 config tables across our documentation that span several thousand lines of manually maintained HTML. They can all be eliminated with this approach. I think it would be a big win for the maintainability of our documentation.

@nchammas
Contributor Author

Silence on a PR usually means there is something wrong with it, so I am proactively trying to figure out how to move this idea forward.

The possibilities I am considering are:

  1. Committers disapprove of the abstract idea of automating the generation of config tables.
  2. Committers approve of the abstract idea but disapprove of the implementation proposed in this PR.
  3. Committers approve of the abstract idea and may approve of the implementation proposed here, but find it difficult to review.
  4. Committers are constrained (due to time, energy, other priorities, etc.) and cannot participate at this time.

In the case of 2, I have proposed a completely different approach in #44756 that does not touch Spark core.

In the case of 3, I pushed #44755, a stripped-down version of this PR that should be easier to review.

In the case of 1, I need some feedback on why this is a bad idea so I can drop this effort. As you can see, I believe it's a good idea and am eager to push it forward.

In the case of 4, please forgive me for repeatedly pinging this PR and being annoying. 😅

@nchammas nchammas closed this Jan 16, 2024
@bjornjorgensen
Contributor

@nchammas there seem to be a number of PRs that have been approved but not yet merged to master, so I suspect it has something to do with the holidays.

This is not the first PR aimed at fixing the docs. Would an email to @dev be a good idea here?
