Commit f0d447a

Added Troubleshooting Info to README, and more expected Times
1 parent ed4dcea commit f0d447a

2 files changed (+44, -3 lines)


.gitignore

Lines changed: 2 additions & 1 deletion

@@ -3,4 +3,5 @@
 model
 dataset
 models
-compose_output
+compose_output
+saved_*

README.md

Lines changed: 42 additions & 2 deletions
@@ -66,22 +66,33 @@ The original python files from microsoft follow (different) licences, and any ch
 ## Limitations / Hardware Requirement
 
 For the container to run properly, it needs 15 to 25 gigabytes of memory.
-On our servers, one epoch on the java data takes ~30h.
+On our servers, one CPU epoch on the java data takes ~30h.
 The container starts ~20 threads for training, and your server should have >20 cores.
 
 In comparison, training on an RTX 1070 took 7h per epoch.
+Training on a 3080 Ti took 6h per epoch.
+Training on an A40 took ~3h per epoch. In general, the GPU tries to allocate around 12 GB of memory.
+
+In general, despite being a good first step, the GPU containers turned out to be quite fragile.
+We have seen multiple problems with combinations of framework versions, hardware, and OS.
+If you intend to reproduce the experiments, we recommend the CPU way for slow but steady progress,
+and to figure out your own GPU configuration if you intend to make actual changes.
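A quick way to check whether a machine meets these numbers before committing to a multi-day run; all three commands are standard Linux tools rather than anything from this repository, and `nvidia-smi` only matters for the GPU path:

```
nproc        # CPU path: should report more than 20 cores
free -h      # should show roughly 15 to 25 GB of memory available for the container
nvidia-smi   # GPU path: expect around 12 GB of GPU memory to be claimed during training
```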

 ## Known issues
 
 The *preprocess.py* in the Dataset.zip sometimes fails to unzip all of the data.
 If this error occurs, some of the .jsonls will still be gzipped.
 To fix this, simply run `gunzip` on the remaining files and re-run preprocess.py.
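As a sketch of that fix (the `dataset/` directory is an assumption; point it at wherever your Dataset.zip was unpacked):

```
# decompress whatever preprocess.py left gzipped, then run the preprocessing again
find dataset/ -name "*.jsonl.gz" -exec gunzip {} +
python preprocess.py
```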

+====================================================================================
+
 Due to file-locks, it is not possible to use a model OR a dataset in two experiments at the same time.
 If you want to run two experiments at once using the same model, you need to make a copy.
 A short test from me showed that they give the same results when all parameters are equal.
 
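For example, a minimal sketch of the copy step; the directory names under `models/` are placeholders for whatever layout your compose files actually mount:

```
# give the second experiment its own copy, so it does not hit the file-lock of the first one
cp -r models/codebert models/codebert-copy
# then point the second experiment's volume mount / model path at models/codebert-copy
```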

+====================================================================================
+
 Another issue is from windows, at least default windows.
 If you get an error like
 ```
@@ -96,10 +107,39 @@ This is due to windows changing the line-breaks / file encodings. Thanks windows
 **Easy Solution**: run `dos2unix entrypoint.sh` and rebuild the container.
 It might be easier/faster to pull the image from this repository, or you have to [edit the entrypoint to be compatible with windows](https://askubuntu.com/questions/966488/how-do-i-fix-r-command-not-found-errors-running-bash-scripts-in-wsl).
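The easy solution spelled out; `docker-compose build` is an assumption about how the image gets rebuilt here, so adjust it to however you built the container in the first place:

```
dos2unix entrypoint.sh   # convert the CRLF line endings back to LF
docker-compose build     # rebuild so the fixed entrypoint.sh ends up in the image
```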

+
+====================================================================================
+
 ```
 xxx | RuntimeError: CUDA out of memory. Tried to allocate 62.00 MiB (GPU 0; 12.00 GiB total capacity; 10.57 GiB already allocated; 0 bytes free; 10.71 GiB reserved in total by PyTorch)
 ```
 
 This happens in old Pytorch versions.
 
-Reduce batch size. To the best of my knowledge, nothing else can be done about this in old pytorch versions.
+Reduce batch size. To the best of my knowledge, nothing else can be done about this in old pytorch versions.
+
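A hedged sketch of that fix, assuming the container runs the upstream CodeBERT `run.py`, which accepts `--train_batch_size` and `--eval_batch_size`; check your entrypoint.sh for the actual script and flag names used here:

```
# halving the batch size (e.g. 8 -> 4) is usually enough; keep lowering it until the error disappears
python run.py --train_batch_size 4 --eval_batch_size 4   # keep all other arguments as before
```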
+====================================================================================
+
+Another thing that can happen is that the container stops after printing "starting epoch", like this:
+
+```
+[...]
+python-codebert-training_1 | 03/17/2022 08:00:18 - INFO - __main__ - ***** Running training *****
+python-codebert-training_1 | 03/17/2022 08:00:18 - INFO - __main__ - Num examples = 164923
+python-codebert-training_1 | 03/17/2022 08:00:18 - INFO - __main__ - Batch size = 8
+python-codebert-training_1 | 03/17/2022 08:00:18 - INFO - __main__ - Num epoch = 10
+[Nothing here, but you have already been waiting for quite a while]
+```
+
+This can be expected, as the logging / printing halts here until evaluation.
+However, it can also be blocked by some pytorch logic.
+This happened on our servers, where we had multiple GPUs that were all mounted in the docker-compose.
+You can narrow down whether this is your problem by checking that
+
+1. Nvidia SMI shows *n* python processes (where n is the number of your GPUs),
+2. all related python processes have a low memory use,
+3. the general load on the GPUs is low, and
+4. how long you have been seeing the above message is suspiciously different from the epoch times reported above.
+
+To address this, just mount **one** GPU in (see the sketch below).
+Only one GPU should be picked up, printed as such at the beginning of the container logs.
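One way to mount only a single GPU without rewriting the compose file is to restrict what the container is allowed to use; a sketch, assuming the service name `python-codebert-training` taken from the log prefix above (adjust it to the name in your docker-compose.yml):

```
# CUDA_VISIBLE_DEVICES=0 makes PyTorch inside the container use only the first GPU,
# which avoids the multi-GPU stall described above
docker-compose run -e CUDA_VISIBLE_DEVICES=0 python-codebert-training
```

Alternatively, trim the device list for that service in the docker-compose file itself, so that only one GPU stays mounted permanently.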
