README.md: 15 additions & 8 deletions
@@ -42,19 +42,19 @@ docker-compose -f docker-compose-minimal.yml up
When shutting down the process before completion, make sure to **clean up your containers**!
-Otherwise, using docker run it might just restart the stopped container.
+Otherwise, using `docker run` might just restart the stopped container.

Also, if you run the experiment multiple times, **extract the outputs** beforehand.
Otherwise, the output will be overwritten.

-I have also tested this to work with podman and podman-compose on debian 10.
+Version 1.2 was also tested to work with podman and podman-compose on Debian 10.
For running with podman, make sure to have the output folder created first.

## Requirements / Non Docker

-In older versions (1.0) this contained an `environment.yml`and instructions how to run this without docker on your own machine.
-In theory, this is still possible, but the requirements is (intentionally) reduced to work flawless with the pre-existing dependencies in the container.
-To work, you should be good starting from Python 3.6 and installing Pytorch 1.4.
+The contained `environment.yml` is a starting point for running this *without docker* on your own machine.
+The provided `requirements.txt` is meant for docker only, as important parts (PyTorch) are missing to align with the pre-existing dependencies in the container.
+To work, you should be good starting from Python 3.6 and creating a fresh conda env from the `environment.yml`.

## Licence
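The hunk above covers four practical steps: cleaning up containers, extracting outputs, preparing podman, and setting up a conda env for the non-docker route. A minimal shell sketch of that workflow follows; the container name `replication` and the in-container path `/app/output` are assumptions for illustration, not names taken from the repository.

```sh
# Tear down the containers and network created by the compose file, so a later
# `docker run` cannot quietly restart the old, stopped container.
docker-compose -f docker-compose-minimal.yml down

# Extract the outputs before the next run, otherwise they get overwritten.
# Container name and path are placeholders; adjust to your setup.
docker cp replication:/app/output ./output-run1

# For podman / podman-compose, create the output folder first.
mkdir -p output
podman-compose -f docker-compose-minimal.yml up

# Non-docker route: fresh conda env from the provided environment.yml
# (requirements.txt is docker-only and deliberately omits PyTorch).
conda env create -f environment.yml
```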
@@ -67,11 +67,11 @@ The original python files from microsoft follow (different) licences, and any ch
For the container to run properly, it needs 15 to 25 gigabytes of memory.
On our servers, one CPU epoch on the Java data takes ~30h.
-The Containers starts ~20 threads for training and your server should have >20 cores.
+The CPU container starts ~20 threads for training and your server should have >20 cores.

In comparison, training on an RTX 1070 took 7h per epoch.
Training on a 3080ti took 6h per epoch.
-Training on an A40 took ~3h per epoch. In general, GPU tries to allocate around 12gb of memory.
+Training on an A40 took ~4h per epoch. In general, the GPU tries to allocate around 12 GB of memory.

In general, despite being a good first step, GPU containers turned out to be quite fragile.
We have seen multiple problems with framework versions, hardware, and OS combinations.
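Given the ~20 training threads and the 15 to 25 GB memory footprint mentioned above, it can be worth pinning those resources explicitly when starting the container by hand. A sketch with plain `docker run`; the image name `cpu-replication` and the volume path are placeholders, not names from the project:

```sh
# Reserve 20 CPUs and cap memory at 25 GB so the ~20 training threads
# and the reported memory usage fit; image name and paths are placeholders.
docker run --rm \
  --cpus=20 \
  --memory=25g \
  -v "$(pwd)/output:/app/output" \
  cpu-replication
```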
@@ -142,4 +142,11 @@ You can narrow down whether this is your problem by
4. the time that you see the above message is suspiciously different from the numbers reported above

To address this, just mount **one** GPU in.
-Only one GPU should be picked up, printed as such at the beginning of the container logs.
+Only one GPU should be picked up and printed as such at the beginning of the container logs.
+
+## Version History
+
+- 1.0 was the first version with everything hardcoded
+- 1.1 had some elements hardcoded, others configurable
+- 1.2 was fully configurable but hardcoded to **CPU only**
+- 1.3 changed the base image and allows **GPU usage**
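For the single-GPU workaround in the last hunk, one common way to mount exactly one GPU is the `--gpus` flag of `docker run`. This assumes the NVIDIA container toolkit is installed on the host; the image name and volume path are placeholders:

```sh
# Expose only GPU 0 to the container; the container log should then
# report exactly one device at startup.
docker run --rm \
  --gpus device=0 \
  -v "$(pwd)/output:/app/output" \
  gpu-replication
```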