rustdoc-json: Postcard output #142642
pub mod postcard {

    pub type Magic = [u8; 22];
    pub const MAGIC: Magic = *b"\x00\xFFRustdocJsonPostcard\xFF";
A friend points out https://hackers.town/@zwol/114155807716413069, with advice on how to design a magic number.
Having 'Json' in there seems perverse :)
Sadly that link is now dead: they moved the server to masto.hackers.town, and the post either no longer has the same ID or no longer exists.
It got archived: https://archive.is/0QDhX. Repeated here for posterity.
another day, another binary file format with a badly designed magic number
not gonna call it out specifically but here are some RFC2119 MUSTs for magic number design:
- MUST be the very first N bytes in the file
- MUST be at least four bytes long, eight is better
- MUST include at least one byte with the high bit set
- MUST include a byte sequence that is invalid UTF-8
- SHOULD include a zero byte, but you can usually get away with having that be part of the overall version number that immediately follows the magic number (did I mention that you really SHOULD put an overall version number right after the magic number, unless you know and have documented exactly why it's not necessary, e.g. PNG?)
good examples: PNG, ELF
bad examples: GIF PE PDF
Here is a template. If you follow this template for your binary file format's magic number, you will be doing it better than a depressingly large number of senior software engineers.
First eight bytes of the file:
0xDC 0xDF X X x x (0x01 0x00 | 0x00 0x01)
0xDC 0xDF are bytes with the high bit set. Together with the next two bytes, they form a four-byte sequence that cannot appear in any valid ASCII, UTF-8, Corrected UTF-8, or UTF-16 (regardless of endianness) text document. This is not a perfectly bulletproof declaration that the file does not contain text, but it should be strong enough except maybe for formats like PDF that can't decide if they're structured text or binary.
X X x x: Four ASCII alphanumeric characters naming your file format. Make them clearly related to your recommended file name extension. I'm giving you four characters because we're running out of three-letter acronyms. If you don't need four characters, pad at the end with 0x1A (aka ^Z). The first two of these (the uppercase Xes) must not have their high bits set, lest the "this is not text" declaration be weakened. For the other two (lowercase xes), use of ASCII alphanumerics is just a strong recommendation.
0x01 0x00 or 0x00 0x01: This is to be understood as a 16-bit unsigned integer in your choice of little- or big-endian order. It serves three functions. In descending order of importance:
- It includes a zero byte, reinforcing the declaration that this is not a text file.
- It demonstrates which byte ordering will be used throughout the file. It does not matter which order you choose, but you need to consciously choose either big- or little-endian and then use that byte order consistently throughout the file. Yes, I have seen cases where people didn't do that.
- It's an escape hatch. If one day you discover that you need to alter the structure of the rest of the file in a totally incompatible way, and yet it is still meaningfully the same format, so you don't want to change the name characters, you can change the 0x01 to 0x02. We both hope that day will never come, but we both know it might.
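For anyone who wants to test a candidate against the rules quoted above, here is a minimal sketch using only the standard library. `MagicReport`, `check_magic`, `template_magic`, and the name `RDPC` are made up for illustration and are not part of this PR. It only covers the checks that can be made on the magic bytes in isolation.

```rust
/// Outcome of checking a candidate magic number against the quoted rules.
/// (Hypothetical helper type, not part of the PR.)
#[derive(Debug, PartialEq, Eq)]
struct MagicReport {
    long_enough: bool,   // MUST be at least four bytes long
    has_high_bit: bool,  // MUST include a byte with the high bit set
    invalid_utf8: bool,  // MUST include a byte sequence that is invalid UTF-8
    has_zero_byte: bool, // SHOULD include a zero byte
}

fn check_magic(magic: &[u8]) -> MagicReport {
    // `error_len()` is `Some` only for a genuinely invalid sequence; it is
    // `None` when the input merely ends in the middle of a character.
    let invalid_utf8 = match std::str::from_utf8(magic) {
        Ok(_) => false,
        Err(e) => e.error_len().is_some(),
    };
    MagicReport {
        long_enough: magic.len() >= 4,
        has_high_bit: magic.iter().any(|b| b & 0x80 != 0),
        invalid_utf8,
        has_zero_byte: magic.contains(&0),
    }
}

/// Builds the eight-byte template: 0xDC 0xDF, four name bytes, then the
/// 16-bit value 1 in the chosen byte order.
fn template_magic(name: [u8; 4], little_endian: bool) -> [u8; 8] {
    let one: u16 = 1;
    let v = if little_endian { one.to_le_bytes() } else { one.to_be_bytes() };
    [0xDC, 0xDF, name[0], name[1], name[2], name[3], v[0], v[1]]
}

fn main() {
    // The constant proposed in this PR.
    const MAGIC: [u8; 22] = *b"\x00\xFFRustdocJsonPostcard\xFF";
    println!("PR magic: {:?}", check_magic(&MAGIC));

    // "RDPC" is an invented name, used only to exercise the template.
    let t = template_magic(*b"RDPC", true);
    println!("template: {:02X?} {:?}", t, check_magic(&t));
}
```

Checked this way, the `MAGIC` constant in this PR satisfies all four byte-level rules. Whether it is followed by a version number is a separate question that this sketch does not cover.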
Something @jamesmunns pointed out is that this means that reordering fields or enum variants in
More broadly, we should think about where (if at all) postcard-schema fits into this.
As a note, I'm working on iterating on
A possibly useful form for the file format could be:

    struct PostcardFile<T> {
        key: Key,
        schema: Option<Schema>,
        data: T,
    }

I've considered "standardizing" this format a bit, maybe with a trailing CRC32.
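To make the trailing-CRC32 idea concrete, here is a rough sketch that appends a checksum to an already-serialized body and verifies it on read. The comment above does not say which CRC-32 variant or byte order is intended, so the choices here are assumptions: the reflected polynomial 0xEDB88320 (the one zlib and PNG use) and a little-endian trailer. `seal` and `open` are invented names.

```rust
/// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
/// Which CRC variant the format would use is an assumption here.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in bytes {
        crc ^= u32::from(b);
        for _ in 0..8 {
            // `mask` is all ones when the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Appends the checksum of `body` as four little-endian bytes (assumed order).
fn seal(mut body: Vec<u8>) -> Vec<u8> {
    let crc = crc32(&body);
    body.extend_from_slice(&crc.to_le_bytes());
    body
}

/// Returns the body if the trailing checksum matches, `None` otherwise.
fn open(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (body, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32(body) == stored {
        Some(body)
    } else {
        None
    }
}

fn main() {
    let sealed = seal(b"serialized PostcardFile bytes".to_vec());
    assert!(open(&sealed).is_some());

    // Flipping a single bit in the body makes verification fail.
    let mut corrupted = sealed.clone();
    corrupted[0] ^= 0x01;
    assert!(open(&corrupted).is_none());
}
```

A trailer like this catches accidental corruption or truncation, but it says nothing about whether the schema matches, so it would complement the key and schema fields rather than replace them.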
Awesome! It'd be great to not have
I think we definitely want to keep the magic number, so that consumers can tell if this file is rustdoc output at all, and a linear format version so they can tell if rustdoc is too old or too new for them if the schema's changed (vs a schema hash that only tells you that it's changed). Embedding the schema into the output itself is an interesting idea, I'll need to look more at it. But as long as both of these come after the magic number and linear format version, we should be fine to change them after the fact.
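As a sketch of what that buys a consumer, the check could look roughly like the following. The thread does not specify how the version is encoded, so this assumes a 4-byte little-endian integer directly after the magic number purely for illustration; the real encoding may well differ. `Header`, `read_header`, and `SUPPORTED_VERSION` are invented names.

```rust
const MAGIC: [u8; 22] = *b"\x00\xFFRustdocJsonPostcard\xFF";

/// Hypothetical: the one format version this consumer understands.
const SUPPORTED_VERSION: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
enum Header<'a> {
    /// The file does not start with the magic number.
    NotRustdoc,
    /// The magic number matched but the version field is cut off.
    Truncated,
    /// Written by a rustdoc older than this consumer supports.
    TooOld(u32),
    /// Written by a rustdoc newer than this consumer supports.
    TooNew(u32),
    /// Version matches; the remaining bytes are the payload to deserialize.
    Supported(&'a [u8]),
}

fn read_header<'a>(file: &'a [u8]) -> Header<'a> {
    if !file.starts_with(&MAGIC) {
        return Header::NotRustdoc;
    }
    let rest = &file[MAGIC.len()..];
    if rest.len() < 4 {
        return Header::Truncated;
    }
    // Assumed encoding: fixed-width little-endian u32.
    let version = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]);
    if version < SUPPORTED_VERSION {
        Header::TooOld(version)
    } else if version > SUPPORTED_VERSION {
        Header::TooNew(version)
    } else {
        Header::Supported(&rest[4..])
    }
}

fn main() {
    let mut file = MAGIC.to_vec();
    file.extend_from_slice(&2u32.to_le_bytes());
    // A newer producer: the consumer can report a clear error
    // instead of a confusing deserialization failure.
    println!("{:?}", read_header(&file));
}
```

The point of the linear version is visible in the `TooOld` and `TooNew` arms: a schema hash could only ever produce a single "different" outcome.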
☔ The latest upstream changes (presumably #143173) made this pull request unmergeable. Please resolve the merge conflicts.
…oundwork, r=jieyouxu
compiletest: pass rustdoc mode as param, rather than implicitly
Spun out of #142642
In the future, I want the rustdoc-json test suite to invoke rustdoc twice, once with `--output-format=json`, and once with the (not yet implemented) `--output-format=postcard` flag. Doing that requires being able to explicitly tell the `.document()` function which format to use, rather than implicitly using JSON in the rustdoc-json suite and HTML in all others.
r? `@jieyouxu` CC `@jalil-salame`
r? @ghost
What
`rustdoc --output-format=postcard` is like rustdoc-json, but using https://postcard.rs/ / https://docs.rs/postcard/1.1.1/ instead of JSON.
Why
JSON size and speed aren't great. People want more speed, and smaller docs. There are proposals to make the JSON smaller (and therefore faster) by making field names shorter, and omitting them when the value is the default. But
How good is it?
In a very unscientific benchmark for aws-sdk-ec2, it's ~3.6x smaller (255 MiB vs 69 MiB) and ~1.8x faster to deserialize (1.6273 s vs 914.05 ms).
What's the metaformat
`Crate` as usual
This way, users can look at the magic number to check it's a rustdoc-json-postcard file, then read the version number to know if they can decode it. Only then can they deserialize the `Crate` itself. I plan to write a library that does this, so it's easy to do well.
Why is this a draft
- `HtmlRenderer` and `JsonRenderer` are configured from the same options, we should change this
- `RenderOptions` to `DocContext` #147832
- `.is_json()` instead of the current hacks