Pull requests: NVIDIA/Megatron-LM
- #1705 Fix a typo on README git checkout [module: documentation] · opened Jul 24, 2025 by GindaChen
- #1703 BugFix: FP8 Communication Mismatch with --first-last-layers-bf16 in tp-comm-overlap [bug, module: transformer engine] · opened Jul 24, 2025 by xiaomin-D
- #1692 Align import to existing module [module: data pipeline] · opened Jul 15, 2025 by AlexanderLavelle
- #1684 fix(mtp logging): Correctly accumulate MTP loss for logging when log_interval > 1 [module: moe] · opened Jul 11, 2025 by Luowaterbi
- #1682 Update pretrain_mamba.py [bug, module: documentation] · opened Jul 11, 2025 by vignesh1507
- #1681 [feat, moe] Add support for global aux loss [module: moe] · opened Jul 11, 2025 by Victarry
- #1673 Issue 1672 fix: initializing the current pointed with int64 to avoid … [bug] · opened Jul 7, 2025 by sharanmayank
- #1662 Speed up model parallel initialization [module: distributed] · opened Jul 2, 2025 by alexqdh
- #1654 bug fixed: wandb artifact requires the tracker file [module: debugging] · opened Jun 27, 2025 by yezhengmao1
- #1651 Apply roll operation to position_ids in MTP [module: moe] · opened Jun 26, 2025 by iansheng
- #1645 fix twice allgather in moe distrib optimizer [module: moe] · opened Jun 23, 2025 by irobot2013-why
- #1631 Fix log-timer-to-tensorboard on logging [module: debugging] · opened Jun 13, 2025 by wplf
- #1626 Fix typos: vritual → virtual and decoeder → decoder [module: documentation] · opened Jun 11, 2025 by EricLabile
- #1624 Fix: Apply q_layernorm consistently in MLA LoRA path [module: fine-tuning] · opened Jun 11, 2025 by Flink-ddd
- #1622 fix: when using moe parallel folding feature and set etp > 1 && ep == 1, the grad sync is incorrect and the loss curve is bad [bug, module: moe] · opened Jun 10, 2025 by Louis-J
- #1610 use a cpu set to cache cuda tensor finished_request_ids [module: inference] · opened Jun 5, 2025 by ladyrick
- #1605 Add DistTrain, Allow Encoder to Have Different DP Size [module: multimodal] · opened May 30, 2025 by zidanehuang001
- #1604 add node_rank argument for example scripts [module: training] · opened May 30, 2025 by xylllllllll