[float] document e2e training -> inference flow #2190

danielvegamyhre · 2025-05-09T16:14:50Z

Summary

Document the E2E training => inference flow with examples.

pytorch-bot · 2025-05-09T16:14:54Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2190

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 723088a with merge base cdced21:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2025-05-09T16:21:48Z

cc @andrewor14

andrewor14

Tested this out locally too, works for me. Thanks!

andrewor14 · 2025-05-09T18:51:57Z

(Might need to add a distributed checkpoint section but we can do that in a separate PR)

danielvegamyhre · 2025-05-09T19:34:22Z

torchao/float8/README.md

+
+# save the model
+torch.save({
+    'model': m,


In practice, the model would be defined in a modeling file that the training code and the inference code each import separately. That avoids having to deserialize the Python model definition with torch.load(..., weights_only=False), which has some security risks.

However, I was aiming to make these copy/paste-able, runnable standalone examples, which seemed to require this bad practice. Thoughts @andrewor14 @vkuzo?

vkuzo · May 9, 2025

I think a good way to do it is as follows:

1. create a reproducible model definition
2. create a new instance of (1), train it, save weights to checkpoint
3. create a new instance of (1), load weights from checkpoint, finetune it or do inference

With this flow, there is no need to save the model definition with torch.save.
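The three steps above can be sketched roughly as follows. This is a minimal illustration in plain PyTorch on CPU, not the example from the README: it assumes a hypothetical `build_model` helper, uses an in-memory buffer in place of a checkpoint file, and leaves out the float8 conversion and any compilation step.

```python
import io

import torch
import torch.nn as nn


def build_model() -> nn.Module:
    """Step 1: a reproducible model definition shared by training and inference."""
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))


# Step 2: create an instance, train it briefly, and save only the weights.
torch.manual_seed(0)
trained = build_model()
optimizer = torch.optim.SGD(trained.parameters(), lr=1e-2)
x, target = torch.randn(32, 8), torch.randn(32, 4)
for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(trained(x), target)
    loss.backward()
    optimizer.step()

checkpoint = io.BytesIO()  # stands in for a checkpoint file on disk
torch.save(trained.state_dict(), checkpoint)

# Step 3: create a fresh instance and load the weights for inference.
checkpoint.seek(0)
restored = build_model()
restored.load_state_dict(torch.load(checkpoint, weights_only=True))
restored.eval()

with torch.no_grad():
    outputs_match = torch.equal(trained(x), restored(x))
```

Because only the state dict is saved, the checkpoint can be loaded with `weights_only=True`, so no Python model definition has to be unpickled.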

danielvegamyhre · May 9, 2025

That's what I originally tried, actually, but it doesn't work: the weights in the checkpointed model from step (2) are registered under different names (prefixed with _orig_mod) than those of the freshly initialized model in step (3).

I solved this by saving the converted model definition directly with torch.save and loading the model state dict into that, but it's not ideal imo. I'm also curious how torchtitan/torchtune do this.
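One possible workaround for the key mismatch, which is not the approach taken in this PR, is to rename the checkpoint keys before loading them into a fresh model. The sketch below assumes the only difference is a leading `_orig_mod.` prefix (the kind a wrapper module such as the one returned by `torch.compile` adds); the `strip_prefix` helper is hypothetical, and placeholder strings stand in for tensors so the snippet runs without PyTorch.

```python
from collections import OrderedDict
from typing import Any, Mapping

PREFIX = "_orig_mod."  # assumed prefix, taken from the comment above


def strip_prefix(state_dict: Mapping[str, Any], prefix: str = PREFIX) -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only a leading prefix is removed; keys without it are kept unchanged.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder values stand in for the tensors of a real checkpoint.
saved = OrderedDict([("_orig_mod.0.weight", "w0"), ("_orig_mod.0.bias", "b0")])
cleaned = strip_prefix(saved)
```

With a real checkpoint, the cleaned dict would then be passed to `load_state_dict` on the freshly created model. Whether this is enough for a float8-converted model has not been checked here.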

danielvegamyhre · May 13, 2025

OK, going to merge this for now. We can discuss alternatives async if you want.

* document e2e training -> inference flow
* add save/load checkpoint
* update to how we load checkpoint
* remove debugging
* add more detail
* remove unused import
* lower lr to prevent large optimizer step into weight territory which produces inf
* use actual loss function

danielvegamyhre added the "topic: documentation" label May 9, 2025

facebook-github-bot added the "CLA Signed" label May 9, 2025

danielvegamyhre force-pushed the be branch from c3be38b to 101b1c7 Compare May 9, 2025 16:15

danielvegamyhre requested review from drisspg and vkuzo May 9, 2025 16:16

danielvegamyhre force-pushed the be branch from 101b1c7 to 399f4fc Compare May 9, 2025 16:20

document e2e training -> inference flow

d70f7f7

danielvegamyhre force-pushed the be branch from 399f4fc to d70f7f7 Compare May 9, 2025 16:26

add save/load checkpoint

31572b8

danielvegamyhre force-pushed the be branch from 9182018 to 31572b8 Compare May 9, 2025 16:31

danielvegamyhre marked this pull request as draft May 9, 2025 16:40

danielvegamyhre changed the title ~~[float8] document e2e training -> inference flow~~ [WIP] document e2e training -> inference flow May 9, 2025

danielvegamyhre added 6 commits May 9, 2025 09:53

update to how we load checkpoint

e49fa3b

remove debugging

d9b958c

add more detail

6ca6cc8

remove unused import

e065ced

lower lr to prevent large optimizer step into weight territory which produces inf

4604daa

use actual loss function

723088a

danielvegamyhre changed the title ~~[WIP] document e2e training -> inference flow~~ [float] document e2e training -> inference flow May 9, 2025

danielvegamyhre marked this pull request as ready for review May 9, 2025 17:23

andrewor14 approved these changes May 9, 2025

danielvegamyhre commented May 9, 2025

danielvegamyhre merged commit a0a0969 into main May 13, 2025
18 checks passed
