🐛 Describe the bug
After successfully saving several (1-3) checkpoints, the following error appears:
2023-10-19 11:32:04: Start saving model checkpoint with running states
2023-10-19 11:32:04: Traceback (most recent call last):
2023-10-19 11:32:04: File "train.py", line 406, in
2023-10-19 11:32:04: File "train.py", line 369, in main
2023-10-19 11:32:04: save_dir=args.save_dir,
2023-10-19 11:32:04: File "/mnt/cache/wangke/ColossalAI/applications/Colossal-LLaMA-2/colossal_llama2/utils/ckpt_io.py", line 56, in save_checkpoint
2023-10-19 11:32:04: booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/colossalai/booster/booster.py", line 250, in save_optimizer
2023-10-19 11:32:04: self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/colossalai/checkpoint_io/checkpoint_io_base.py", line 192, in save_optimizer
2023-10-19 11:32:04: self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/colossalai/booster/plugin/low_level_zero_plugin.py", line 132, in save_sharded_optimizer
2023-10-19 11:32:04: save_param_groups(state_dict, group_file_path)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/colossalai/checkpoint_io/utils.py", line 342, in save_param_groups
2023-10-19 11:32:04: torch.save(param_groups, group_file_path)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/torch/serialization.py", line 376, in save
2023-10-19 11:32:04: with _open_file_like(f, 'wb') as opened_file:
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
2023-10-19 11:32:04: return _open_file(name_or_buffer, mode)
2023-10-19 11:32:04: File "/mnt/cache/wangke/envs/colossalai/lib/python3.8/site-packages/torch/serialization.py", line 211, in init
2023-10-19 11:32:04: super(_open_file, self).init(open(name, mode))
2023-10-19 11:32:04: FileExistsError: [Errno 17] File exists: '/mnt/cache/wangke/ColossalAI/applications/Colossal-LLaMA-2/outs/debug-2023-10-19-11:24:28/epoch-0_step-40/optimizer/pytorch_optim_group.bin'
....................................
2023-10-19 11:37:19: train.py FAILED
2023-10-19 11:37:19: ------------------------------------------------------------
2023-10-19 11:37:19: Failures:
2023-10-19 11:37:19: <NO_OTHER_FAILURES>
2023-10-19 11:37:19: ------------------------------------------------------------
2023-10-19 11:37:19: Root Cause (first observed failure):
2023-10-19 11:37:19: [0]:
2023-10-19 11:37:19: time : 2023-10-19_11:32:06
2023-10-19 11:37:19: host : pt-7b1gm1vf-master-0.pt-7b1gm1vf.ns-operations-a5acdc67.svc.cluster.local
2023-10-19 11:37:19: rank : 1 (local_rank: 1)
2023-10-19 11:37:19: exitcode : 1 (pid: 106)
2023-10-19 11:37:19: error_file: <N/A>
2023-10-19 11:37:19: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2023-10-19 11:37:19: ============================================================
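Since the first few checkpoints save fine and the failing rank is 1 (not 0), this looks like a race: on the shared mount, more than one rank apparently tries to create the same pytorch_optim_group.bin at the same time, and one of the opens fails with FileExistsError. Below is a minimal workaround sketch, assuming that diagnosis is correct; save_param_groups_rank0 is a hypothetical helper, not part of ColossalAI, and only restricts the param-group write to a single rank before synchronizing.

```python
# Hypothetical workaround sketch (not ColossalAI code): write the shared
# param-group file from one rank only, assuming the FileExistsError comes
# from several ranks racing to create the same file on the shared mount.
import torch
import torch.distributed as dist


def save_param_groups_rank0(param_groups, group_file_path: str) -> None:
    # Only the coordinator rank writes; everyone else waits at the barrier
    # so the file exists before any rank moves on.
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(param_groups, group_file_path)
    if dist.is_initialized():
        dist.barrier()
```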
Environment
ubuntu20.04-py3.8-cuda11.3-cudnn8-torch1.12
8×A800-80GB (GPUs: 8, vCPU: 96 cores, memory: 960 GiB) × 2