@aDotInTheVoid
Member

@aDotInTheVoid aDotInTheVoid commented Jun 17, 2025

r? @ghost

What

rustdoc --output-format=postcard is like rustdoc-json, but using https://postcard.rs/ / https://docs.rs/postcard/1.1.1/ instead of JSON.

Why

JSON size and speed aren't great. People want more speed and smaller docs. There are proposals to make the JSON smaller (and therefore faster) by making field names shorter and omitting fields whose value is the default. But

How good is it?

In a very unscientific benchmark for aws-sdk-ec2, it's ~3.7x smaller (255 MiB vs 69 MiB) and ~1.8x faster to deserialize (1.6273 s vs 914.05 ms).

What's the metaformat

  • 22 bytes of magic numbers
  • varint(u32) format version
  • Crate as usual

This way, users can look at the magic number to check it's a rustdoc-json-postcard file, then read the version number to know if they can decode it. Only then can they deserialize the Crate itself. I plan to write a library that does this, so it's easy to do well.
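
The read order described above (magic, then version, then Crate) can be sketched as follows. This is only an illustration and not the planned library: `split_header` and `HeaderError` are hypothetical names, the magic bytes are copied from the review snippet in this PR, and the varint decoder assumes the LEB128-style encoding that postcard documents for integers (7 bits per byte, least significant group first).

```rust
/// Magic bytes as quoted in this PR's review snippet.
pub const MAGIC: [u8; 22] = *b"\x00\xFFRustdocJsonPostcard\xFF";

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    /// The input does not start with the magic bytes.
    BadMagic,
    /// The varint ended early or does not fit in a u32.
    BadVersion,
}

/// Decode a LEB128-style varint(u32); returns the value and bytes consumed.
fn read_varint_u32(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        let chunk = (byte & 0x7F) as u32;
        // The fifth byte may only contribute the top 4 bits of a u32.
        if i == 4 && chunk > 0x0F {
            return None;
        }
        value |= chunk << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    // Ran out of input, or more than five bytes had the continuation bit.
    None
}

/// Split a file into (format version, remaining bytes holding the Crate).
pub fn split_header(file: &[u8]) -> Result<(u32, &[u8]), HeaderError> {
    let rest = file.strip_prefix(&MAGIC[..]).ok_or(HeaderError::BadMagic)?;
    let (version, used) = read_varint_u32(rest).ok_or(HeaderError::BadVersion)?;
    Ok((version, &rest[used..]))
}

fn main() {
    let mut file = MAGIC.to_vec();
    file.extend_from_slice(&[0xAC, 0x02]); // varint encoding of 300
    file.extend_from_slice(b"crate bytes");
    let (version, body) = split_header(&file).unwrap();
    println!("version {}, {} body bytes", version, body.len());
}
```

A consumer would only hand the remaining bytes to the postcard deserializer after checking that the returned version is one it understands.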

Why is this a draft

@rustbot rustbot added A-compiletest Area: The compiletest test runner A-rustdoc-json Area: Rustdoc JSON backend A-testsuite Area: The testsuite used to check the correctness of rustc T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Jun 17, 2025
@aDotInTheVoid
Member Author

pub mod postcard {
    /// Magic bytes at the very start of the file.
    pub type Magic = [u8; 22];
    pub const MAGIC: Magic = *b"\x00\xFFRustdocJsonPostcard\xFF";
}
Member Author

A friend points out https://hackers.town/@zwol/114155807716413069, with advice on how to design a magic number.

Contributor

Having 'Json' in there seems perverse :)

Contributor

Sadly, that link is now dead: they moved the server to masto.hackers.town, and the post either no longer has the same ID or no longer exists.

Member Author

It got archived: https://archive.is/0QDhX. Repeated here for posterity.

another day, another binary file format with a badly designed magic number

not gonna call it out specifically but here are some RFC2119 MUSTs for magic number design:

  • MUST be the very first N bytes in the file
  • MUST be at least four bytes long, eight is better
  • MUST include at least one byte with the high bit set
  • MUST include a byte sequence that is invalid UTF-8
  • SHOULD include a zero byte, but you can usually get away with having that be part of the overall version number that immediately follows the magic number (did I mention that you really SHOULD put an overall version number right after the magic number, unless you know and have documented exactly why it's not necessary, e.g. PNG?)

good examples: PNG, ELF

bad examples: GIF PE PDF

Here is a template. If you follow this template for your binary file format's magic number, you will be doing it better than a depressingly large number of senior software engineers.

First eight bytes of the file:

0xDC 0xDF X X x x (0x01 0x00 | 0x00 0x01)

0xDC 0xDF are bytes with the high bit set. Together with the next two bytes, they form a four-byte sequence that cannot appear in any valid ASCII, UTF-8, Corrected UTF-8, or UTF-16 (regardless of endianness) text document. This is not a perfectly bulletproof declaration that the file does not contain text, but it should be strong enough except maybe for formats like PDF that can't decide if they're structured text or binary.
X X x x: Four ASCII alphanumeric characters naming your file format. Make them clearly related to your recommended file name extension. I'm giving you four characters because we're running out of three-letter acronyms. If you don't need four characters, pad at the end with 0x1A (aka ^Z).

The first two of these (the uppercase Xes) must not have their high bits set, lest the "this is not text" declaration be weakened. For the other two (lowercase xes), use of ASCII alphanumerics is just a strong recommendation.
0x01 0x00 or 0x00 0x01: This is to be understood as a 16-bit unsigned integer in your choice of little- or big-endian order. It serves three functions. In descending order of importance:

  • It includes a zero byte, reinforcing the declaration that this is not a text file.
  • It demonstrates which byte ordering will be used throughout the file. It does not matter which order you choose, but you need to consciously choose either big- or little-endian and then use that byte order consistently throughout the file. Yes, I have seen cases where people didn't do that.
  • It's an escape hatch. If one day you discover that you need to alter the structure of the rest of the file in a totally incompatible way, and yet it is still meaningfully the same format, so you don't want to change the name characters, you can change the 0x01 to 0x02. We both hope that day will never come, but we both know it might.
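
As a rough check of the rules quoted above against the magic number in this PR, here is a small sketch. `MagicReport` and `check_magic` are hypothetical names made up for illustration; the "first N bytes in the file" rule is about placement and is not something this function can test.

```rust
/// Magic bytes as quoted in this PR's review snippet.
const MAGIC: [u8; 22] = *b"\x00\xFFRustdocJsonPostcard\xFF";

/// Which of the quoted rules a candidate magic number satisfies.
#[derive(Debug, PartialEq)]
struct MagicReport {
    long_enough: bool,   // MUST be at least four bytes long
    has_high_bit: bool,  // MUST include a byte with the high bit set
    invalid_utf8: bool,  // MUST include a sequence that is invalid UTF-8
    has_zero_byte: bool, // SHOULD include a zero byte
}

fn check_magic(magic: &[u8]) -> MagicReport {
    MagicReport {
        long_enough: magic.len() >= 4,
        has_high_bit: magic.iter().any(|&b| b & 0x80 != 0),
        invalid_utf8: std::str::from_utf8(magic).is_err(),
        has_zero_byte: magic.contains(&0),
    }
}

fn main() {
    println!("rustdoc: {:?}", check_magic(&MAGIC));
    println!("PNG:     {:?}", check_magic(b"\x89PNG\r\n\x1a\n"));
    println!("GIF:     {:?}", check_magic(b"GIF89a"));
}
```

Under these checks the 22-byte magic satisfies all four properties, while `GIF89a` only passes the length rule, which matches its place in the "bad examples" list.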

@aDotInTheVoid
Member Author

Something @jamesmunns pointed out is that this means that reordering fields or enum variants in rustdoc-json-types will now require a FORMAT_VERSION bump. This could probably be detected in CI using postcard-schema.

More broadly, we should think about where (if at all) postcard-schema fits into this.
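
To see why reordering matters, here is a simplified model of the hazard. It does not use the postcard crate; it only mimics the fact that postcard writes an enum variant as its index rather than its name (for indices below 128 the varint is a single byte). `KindV1`, `KindV2`, `encode_v1` and `decode_v2` are hypothetical and are not types from rustdoc-json-types.

```rust
/// Variant order in one hypothetical release.
#[derive(Debug, PartialEq, Clone, Copy)]
enum KindV1 {
    Module,
    Struct,
    Enum,
}

/// The same variants after a source-level reorder, with no version bump.
#[derive(Debug, PartialEq, Clone, Copy)]
enum KindV2 {
    Module,
    Enum,
    Struct,
}

/// Index-based encoding: the variant name never reaches the wire.
fn encode_v1(kind: KindV1) -> u8 {
    kind as u8
}

/// A reader built from the reordered definition.
fn decode_v2(byte: u8) -> Option<KindV2> {
    match byte {
        0 => Some(KindV2::Module),
        1 => Some(KindV2::Enum),
        2 => Some(KindV2::Struct),
        _ => None,
    }
}

fn main() {
    let wire = encode_v1(KindV1::Struct);
    // Decodes without error, but as the wrong variant.
    println!("wrote {:?}, read {:?}", KindV1::Struct, decode_v2(wire));
}
```

With JSON the variant name is written out, so the same reorder would be harmless; with an index-based format the mismatch is silent, which is why it needs either a version bump or a CI check.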

@jamesmunns
Member

As a note, I'm working on iterating on postcard and postcard-schema right now. postcard-schema-ng was just released, and it is in a form that might be releasable as a 1.0 soon, but I'd need to finish up the items at jamesmunns/postcard#241 to see if any additional iteration is required.

postcard-schema gets you two interesting pieces of data:

  1. postcard-schema::Key can be used to generate an 8-byte hash of the schema and the string of your choice, as a const. This can be snapshotted to detect wire changes in CI.
  2. postcard-schema::NamedType/postcard-schema-ng::DataModelType is the full reflection-style schema of the data type, which is also serializable as postcard data. This can be useful if you want the data to be archival: storing the schema inside the file itself, so you could still decode it even if the schema changes (using the postcard-dyn crate, giving you a serde_json::Value-like view of the data).
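
The snapshot idea in point 1 can be sketched without the real crate. Everything below is a stand-in: `schema_key` hashes a hand-written schema description with FNV-1a, which is not the algorithm or API that postcard-schema uses, and `wire_changed` is a hypothetical helper. It only shows the shape of a CI check that compares a committed 8-byte key with a freshly computed one.

```rust
/// 64-bit FNV-1a; a stand-in hash, NOT what postcard-schema computes.
const fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    let mut i = 0;
    while i < bytes.len() {
        hash ^= bytes[i] as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
        i += 1;
    }
    hash
}

/// Hypothetical 8-byte key derived from a textual schema description.
const fn schema_key(schema: &str) -> [u8; 8] {
    fnv1a_64(schema.as_bytes()).to_le_bytes()
}

/// Made-up schema descriptions; the second one swaps two fields.
const SCHEMA_V1: &str = "struct Item { id: u32, name: String }";
const SCHEMA_REORDERED: &str = "struct Item { name: String, id: u32 }";

/// The key that would be committed to the repository as a snapshot.
const KEY_V1: [u8; 8] = schema_key(SCHEMA_V1);

/// A CI job would fail when this returns true without a version bump.
fn wire_changed(snapshot: [u8; 8], current: [u8; 8]) -> bool {
    snapshot != current
}

fn main() {
    println!("unchanged: {}", wire_changed(KEY_V1, schema_key(SCHEMA_V1)));
    println!("reordered: {}", wire_changed(KEY_V1, schema_key(SCHEMA_REORDERED)));
}
```

Because the key is computed in a `const` context, it could be compared at compile time or in a unit test, whichever fits the CI setup better.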

postcard is also getting a 2.0 soon, but it's important to note that the wire format is NOT changing. You will be able to use library versions v1.0 and v2.0 interchangeably with respect to serialization/deserialization (it's a breaking change because I'm removing some external crates that are now out of date from my public API; it's likely your code won't need to change at all).

@jamesmunns
Member

A possibly useful form for the file format could be:

struct PostcardFile<T> {
    key: Key,
    schema: Option<Schema>,
    data: T,
}

I've considered "standardizing" this format a bit, maybe with a trailing CRC32.
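
For the trailing CRC32 idea, a minimal sketch could look like this. The framing is an assumption made for illustration (checksum over all preceding bytes, stored as four little-endian bytes at the end), and `seal`/`unseal` are hypothetical names; nothing here is a format that postcard defines.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Append the checksum of the payload (assumed little-endian framing).
fn seal(mut payload: Vec<u8>) -> Vec<u8> {
    let sum = crc32(&payload);
    payload.extend_from_slice(&sum.to_le_bytes());
    payload
}

/// Return the payload if the trailing checksum matches, else None.
fn unseal(file: &[u8]) -> Option<&[u8]> {
    let split = file.len().checked_sub(4)?;
    let (payload, tail) = file.split_at(split);
    let stored = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]);
    if crc32(payload) == stored {
        Some(payload)
    } else {
        None
    }
}

fn main() {
    let file = seal(b"key + schema + data".to_vec());
    println!("intact: {}", unseal(&file).is_some());

    let mut damaged = file.clone();
    damaged[0] ^= 0x01; // flip one bit in the payload
    println!("damaged: {}", unseal(&damaged).is_some());
}
```

A table-driven CRC would be faster for files in the hundreds of megabytes; the bitwise loop is used here only to keep the sketch short.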

@aDotInTheVoid
Member Author

postcard is also getting a 2.0 soon, but it's important to note that the wire format is NOT changing. [...]

Awesome! It'd be great to not have cobs and embedded-io in Cargo.lock (and the same goes for all the users who care about performance).

A possibly useful form for the file format could be:

I think we definitely want to keep the magic number, so that consumers can tell if this file is rustdoc output at all, and a linear format version, so they can tell whether rustdoc is too old or too new for them when the schema has changed (vs a schema hash, which only tells you that it has changed). Embedding the schema into the output itself is an interesting idea; I'll need to look more at it. But as long as both of these come after the magic number and linear format version, we should be fine to change them after the fact.
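
The difference between a linear version and a hash can be sketched like this. `SUPPORTED_VERSION` is a made-up number and `Compat`/`check_version` are hypothetical names; the point is only that an ordered version lets a consumer say which side needs updating, which a hash comparison cannot.

```rust
use std::cmp::Ordering;

/// Format version this hypothetical consumer was built against.
const SUPPORTED_VERSION: u32 = 45;

#[derive(Debug, PartialEq)]
enum Compat {
    Supported,
    /// The file came from an older rustdoc than the consumer expects.
    ProducerTooOld,
    /// The file came from a newer rustdoc; the consumer needs an update.
    ProducerTooNew,
}

fn check_version(found: u32) -> Compat {
    match found.cmp(&SUPPORTED_VERSION) {
        Ordering::Equal => Compat::Supported,
        Ordering::Less => Compat::ProducerTooOld,
        Ordering::Greater => Compat::ProducerTooNew,
    }
}

fn main() {
    for found in [44, 45, 46] {
        println!("version {} -> {:?}", found, check_version(found));
    }
}
```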

@bors
Collaborator

bors commented Jun 29, 2025

☔ The latest upstream changes (presumably #143173) made this pull request unmergeable. Please resolve the merge conflicts.

@bors bors added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jun 29, 2025
bors added a commit that referenced this pull request Oct 26, 2025
…oundwork, r=jieyouxu

compiletest: pass rustdoc mode as param, rather than implicitly

Spun out of #142642

In the future, I want the rustdoc-json test suite to invoke rustdoc twice, once with `--output-format=json`, and once with the (not yet implemented) `--output-format=postcard` flag.

Doing that requires being able to explicitly tell the `.document()` function which format to use, rather than implicitly using JSON in the rustdoc-json suite and HTML in all others.

r? `@jieyouxu`

CC `@jalil-salame`
github-actions bot pushed a commit to rust-lang/rustc-dev-guide that referenced this pull request Oct 27, 2025
Labels

A-compiletest Area: The compiletest test runner A-rustdoc-json Area: Rustdoc JSON backend A-testsuite Area: The testsuite used to check the correctness of rustc S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

6 participants