                
Feature/ddp #49

Merged

FrankLeeeee merged 11 commits into hpcaitech:develop/experiments from ver217:feature/ddp on Dec 4, 2021
            Conversation
  
    
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"

  This reverts commit 2e0b0b7.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: ver217 <[email protected]>
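The first items above concern gradient accumulation and an FP16 optimizer that cooperates with torch AMP. As a rough, generic illustration of how those two techniques combine in plain PyTorch (this is not the code introduced by this PR; the model, optimizer, data, and `accumulation_steps` value are placeholders), a training loop might look like:

```python
import torch

# Placeholder model, optimizer, and data; the PR itself wires these features
# into the Colossal-AI engine/trainer rather than a bare loop like this.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

accumulation_steps = 4  # hypothetical value

optimizer.zero_grad()
for i in range(16):
    data = torch.randn(8, 1024, device="cuda")
    target = torch.randint(0, 10, (8,), device="cuda")

    # Forward pass in mixed precision.
    with torch.cuda.amp.autocast():
        loss = criterion(model(data), target) / accumulation_steps

    # Scale the loss so FP16 gradients do not underflow, then accumulate.
    scaler.scale(loss).backward()

    # Step the optimizer only every `accumulation_steps` micro-batches.
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```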
              
FrankLeeeee approved these changes on Dec 4, 2021
    
FrankLeeeee added a commit that referenced this pull request on Dec 9, 2021
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"

  This reverts commit 2e0b0b7.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: ver217 <[email protected]>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* support torch ddp
* fix loss accumulation
* add log for ddp
* change seed
* modify timing hook

Co-authored-by: Frank Lee <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: binmakeswell <[email protected]>
    
FrankLeeeee added a commit that referenced this pull request on Dec 9, 2021
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"

  This reverts commit 2e0b0b7.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <[email protected]>

* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp, Fix convergence in cifar10, Imagenet1000
* Integrate 1d tensor parallel in Colossal-AI (#39)
* fixed 1D and 2D convergence (#38)
* optimized 2D operations
* fixed 1D ViT convergence problem
* Feature/ddp (#49)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"

  This reverts commit 2e0b0b7.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: ver217 <[email protected]>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* support torch ddp
* fix loss accumulation
* add log for ddp
* change seed
* modify timing hook

Co-authored-by: Frank Lee <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: binmakeswell <[email protected]>

* Feature/pipeline (#40)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"

  This reverts commit 2e0b0b7.

* improved consistency between trainer, engine and schedule (#23)

Co-authored-by: 1SAA <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: ver217 <[email protected]>

* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* optimize communication of pipeline parallel
* fix grad clip for pipeline

Co-authored-by: Frank Lee <[email protected]>
Co-authored-by: 1SAA <[email protected]>
Co-authored-by: binmakeswell <[email protected]>

* optimized 3d layer to fix slow computation; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)
* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset
* update api for better usability (#58)

  update api for better usability

Co-authored-by: 1SAA <[email protected]>
Co-authored-by: ver217 <[email protected]>
Co-authored-by: puck_WCR <[email protected]>
Co-authored-by: binmakeswell <[email protected]>
Co-authored-by: アマデウス <[email protected]>
Co-authored-by: BoxiangW <[email protected]>
  
Torch DDP is now supported and is used as the default DDP strategy.
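For readers unfamiliar with what adopting torch DDP as the default data-parallel strategy implies, below is a minimal native PyTorch DDP script. It is a generic sketch assuming a single-node `torchrun` launch, not the Colossal-AI engine code; the model, data, and hyperparameters are placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun / torch.distributed.launch set RANK, LOCAL_RANK and WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in the repo's examples this would be e.g. a ViT.
    model = torch.nn.Linear(1024, 10).cuda()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(10):
        data = torch.randn(32, 1024, device="cuda")
        target = torch.randint(0, 10, (32,), device="cuda")

        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()  # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=4 this_script.py`, each process owns one GPU and gradients are averaged across processes during `backward()`.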