Race condition when preparing the pretrained model in distributed training #44

Hi,

I launched two processes per node to run run_classifier.py in distributed mode. However, I occasionally get the error below:

```
11/20/2018 09:31:48 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmpa25_y4es to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba

 93%|█████████▎| 381028352/407873900 [00:11<00:01, 14366075.22B/s]
 94%|█████████▍| 383812608/407873900 [00:11<00:01, 16210783.00B/s]
 95%|█████████▍| 386455552/407873900 [00:11<00:01, 16205260.89B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   creating metadata file for /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   removing temp file /tmp/tmpa25_y4es

 95%|█████████▌| 388946944/407873900 [00:11<00:01, 18097539.03B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpvxvnr8_1

 97%|█████████▋| 393660416/407873900 [00:11<00:00, 22199883.93B/s]
 98%|█████████▊| 399411200/407873900 [00:11<00:00, 27211860.00B/s]
 99%|█████████▉| 405128192/407873900 [00:11<00:00, 32287252.94B/s]
100%|██████████| 407873900/407873900 [00:11<00:00, 34098120.40B/s]
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmp5fcm4v8x to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
Traceback (most recent call last):
  File "examples/run_classifier.py", line 629, in <module>
    main()
  File "examples/run_classifier.py", line 485, in main
    model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/site-packages/pytorch_pretrained_bert-0.1.2-py3.6.egg/pytorch_pretrained_bert/modeling.py", line 495, in from_pretrained
    archive.extractall(tempdir)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2007, in extractall
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2049, in extract
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2119, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2168, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 248, in copyfileobj
    buf = src.read(bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
```
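This looks like a race condition: the two processes write the model file to `/root/.pytorch_pretrained_bert/` at the same time. Judging by the log, one process starts extracting the cached archive while the other is still copying its own download over the same cache path, so the reader hits a truncated file.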
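
One workaround that might help, as a minimal sketch only: serialize the model loading across the processes on a node with an exclusive file lock, so that only one process populates the cache at a time. `cache_lock` and `load_serialized` are hypothetical helpers, not part of pytorch_pretrained_bert, and the lock file name is made up. The sketch assumes a Unix system (`fcntl`), a local filesystem where `flock` works, and that the library skips the download once the cache file exists.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def cache_lock(cache_dir, lock_name=".download.lock"):
    """Hold an exclusive lock on a file inside cache_dir (hypothetical helper)."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, lock_name)
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process on this node holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield lock_path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def load_serialized(loader, cache_dir):
    """Run loader() while holding the lock, so loads happen one at a time."""
    with cache_lock(cache_dir):
        return loader()
```

In run_classifier.py, the `BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))` call could then be wrapped in a lambda and passed to `load_serialized` together with the cache directory shown in the log. The first process to take the lock would download and extract the archive, and the other would wait and, presumably, find the finished file in the cache.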

Could you advise on a workaround? Thanks!
