fegin (Contributor) commented Feb 19, 2025

If we don't wait for the first quorum, the trainer will continue to run forward and may use incorrect weights if the trainer is healing.
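To illustrate the race described above, here is a minimal toy sketch. It is not the project's actual code: `ToyManager`, `peer_replies`, `wait_quorum`, and `first_step` are hypothetical names invented for this example, and the "quorum" is simulated with a background thread that only completes once a simulated peer replies. The assumption is that a healing replica receives up-to-date weights as part of its first quorum, so reading the weights before that quorum finishes yields stale values.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyManager:
    """Hypothetical stand-in for a fault-tolerance manager (illustration only)."""

    def __init__(self, healing):
        self._healing = healing
        # A healing replica starts with out-of-date weights.
        self.weights = "stale" if healing else "fresh"
        self._peer_reply = threading.Event()  # set when the simulated peer answers
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._quorum = None

    def _run_quorum(self):
        # Block until the simulated peer answers, then apply recovered weights.
        self._peer_reply.wait()
        if self._healing:
            self.weights = "fresh"
            self._healing = False

    def start_quorum(self):
        # The quorum runs asynchronously, so the caller is not blocked.
        self._quorum = self._executor.submit(self._run_quorum)

    def peer_replies(self):
        self._peer_reply.set()

    def wait_quorum(self):
        self._quorum.result()

    def shutdown(self):
        self._executor.shutdown()


def first_step(wait_for_first_quorum):
    """Return the weights that the first forward pass would read."""
    manager = ToyManager(healing=True)
    manager.start_quorum()
    if wait_for_first_quorum:
        # Block the trainer until the first quorum (and healing) has finished.
        manager.peer_replies()
        manager.wait_quorum()
    used = manager.weights  # what the forward pass would see
    # Let the quorum finish in either case so the worker thread exits.
    manager.peer_replies()
    manager.wait_quorum()
    manager.shutdown()
    return used


print(first_step(wait_for_first_quorum=True))
print(first_step(wait_for_first_quorum=False))
```

In this toy model, blocking on the first quorum is what guarantees the forward pass sees the recovered weights. Whether later steps need the same blocking is a separate question that the sketch does not address.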

@facebook-github-bot added the CLA Signed label Feb 19, 2025
d4l3k (Member) commented Feb 19, 2025

@fegin do you have more details on where this is being triggered? We can recover in non-start cases, so we should figure out how to resolve this.

Are we not zeroing grads correctly during recovery?
