Hello,
I am trying to figure out whether it is possible to share loaded shared libraries between containers. The rationale is that if I build an image twice (without using the cache, i.e. with --no-cache), and provided I take care to create reproducible layer FS diffs, then the instantiated containers should not use twice the amount of memory.
To make this more explicit, and without talking about shared libraries yet, let's have a look at the following image:
% cat Dockerfile
FROM ubuntu
RUN apt update && \
    apt install --yes vmtouch
RUN touch /output.dat && \
    dd if=/dev/zero of=/output.dat bs=1M count=24 && \
    touch -d @0 /output.dat
CMD vmtouch -l /output.dat
% podman build --no-cache -t img1 .
% podman run --rm img1
Running a container from this image, I can see that output.dat is mapped once in memory. The Proportional Set Size matches the size of the file mapped in memory (see the Pss column).
% ps -eF | grep vmtouch | grep -v grep | grep -v -e'/bin/sh'
ubuntu 61371 61369 0 6778 25116 3 12:50 ? 00:00:00 vmtouch -l /output.dat
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371: vmtouch -l /output.dat
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
7f6b9a31b000 r--s 00000000 00:39 184907 24576 24576 24576 24576 0 0 0 0 0 0 0 0 24576 0 output.dat
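As a quick cross-check (the PID is the one from the run above; smaps_rollup is available on recent kernels such as the 5.11 used here), /proc exposes the aggregate PSS directly, without going through pmap:
% grep -e '^Pss' /proc/61371/smaps_rollup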
Now running the same image a second time:
% podman run --rm img1
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371: vmtouch -l /output.dat
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
7f6b9a31b000 r--s 00000000 00:39 184907 24576 24576 12288 24576 0 0 0 0 0 0 0 0 12288 0 output.dat
% ps -eF | grep vmtouch | grep -v grep | grep -v -e'/bin/sh'
ubuntu 61371 61369 0 6778 25116 3 12:50 ? 00:00:00 vmtouch -l /output.dat
ubuntu 61665 61663 0 6778 25224 5 13:50 ? 00:00:00 vmtouch -l /output.dat
% pmap -X 61665 | grep -e "output.dat" -e "Pss"
61665: vmtouch -l /output.dat
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
7f1d99fe5000 r--s 00000000 00:48 184907 24576 24576 12288 24576 0 0 0 0 0 0 0 0 12288 0 output.dat
The PSS is now halved, meaning the file is mapped once and shared between the two containers. This is expected because I use the "overlay" storage driver (it would not have worked with "vfs" backed by "ext4", for instance, because of the lack of reflink support).
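Note that the Device column differs between the two mappings (00:39 vs. 00:48, one per overlay mount), yet the PSS is still shared; as far as I understand, overlayfs forwards the mmap to the underlying lower-layer file, so both mappings end up on the same XFS inode. This can be read out of /proc directly (PIDs from the runs above):
% awk '/output.dat/ {print FILENAME, $4, $5}' /proc/61371/maps /proc/61665/maps
# Same inode number for both processes; only the anonymous overlay device differs.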
Now let's build another image, again without using the cache. Because both the content of output.dat and its timestamp are fixed (that is what the touch -d @0 is for), the build produces the same layer diff, as we can see below:
% podman build --no-cache -t img2 .
% diff -u <(podman inspect --format="{{json .RootFS}}" img1 | jq .) <(podman inspect --format="{{json .RootFS}}" img2 | jq .)
--- /proc/self/fd/17 2021-08-16 13:47:40.588000000 +0000
+++ /proc/self/fd/20 2021-08-16 13:47:40.592000000 +0000
@@ -2,7 +2,7 @@
"Type": "layers",
"Layers": [
"sha256:7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144",
- "sha256:31121a0cb3f5e2a4dd0d68d7d6b6de617d8d937b8b41e5ae5a13c5304c3dfe28",
+ "sha256:0a075e0d3129290f16273f6d6b7c56ae0b282cee8365d2aaa28b327fcc6825d0",
"sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f"
]
}
The last layer sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f is the same.
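To see where the duplication comes from, one can look for the copies of output.dat in the layer store (a sketch, assuming the usual overlay store layout under my graphroot):
% find /mnt/my-xfs/podman-user-root/overlay -path '*/diff/output.dat' -exec stat -c '%d:%i %n' {} +
# Two hits with distinct inodes: the identical layer diffs were extracted into
# two independent files. filefrag -v on both would likewise show unshared
# extents, i.e. one copy is not a reflink of the other.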
If I run a container from this latest image, I can see that the PSS of output.dat for the first container I ran does not decrease, because the file does not have the same device ID/inode.
% podman run --rm img2
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371: vmtouch -l /output.dat
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
7f6b9a31b000 r--s 00000000 00:39 184907 24576 24576 12288 24576 0 0 0 0 0 0 0 0 12288 0 output.dat
% ps -eF | grep vmtouch | grep -v grep | grep -v -e'/bin/sh'
ubuntu 61371 61369 0 6778 25116 3 12:50 ? 00:00:00 vmtouch -l /output.dat
ubuntu 61665 61663 0 6778 25224 5 13:50 ? 00:00:00 vmtouch -l /output.dat
ubuntu 61695 61693 0 6778 25260 5 13:51 ? 00:00:00 vmtouch -l /output.dat
% pmap -X 61695 | grep -e "output.dat" -e "Pss"
61695: vmtouch -l /output.dat
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
7ff800850000 r--s 00000000 00:56 134512622 24576 24576 24576 24576 0 0 0 0 0 0 0 0 24576 0 output.dat
This makes sense to me when the backing filesystem does not support reflink, like ext4. So I tried this on XFS with reflink enabled, but I get the same result. I would have expected this to be improvable, because the mmap'ed file is in fact the same data if one is a reflink of the other.
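A way to test whether this is a podman limitation or a kernel one is to repeat the experiment outside containers (a sketch on the same XFS mount; a.dat/b.dat are throwaway names):
% dd if=/dev/zero of=/mnt/my-xfs/a.dat bs=1M count=24
% cp --reflink=always /mnt/my-xfs/a.dat /mnt/my-xfs/b.dat
% vmtouch -l /mnt/my-xfs/a.dat &
% vmtouch -l /mnt/my-xfs/b.dat &
% pmap -X $(pgrep -f 'vmtouch -l /mnt/my-xfs') | grep -e '\.dat' -e Pss
# If the page cache is indexed per inode, as I believe it is, each mapping will
# show its full 24 MiB as private PSS even though the two copies share extents
# on disk.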
So my questions:
- Do you think such an improvement is feasible?
- Is there a limitation in the kernel, for instance?
- If it is feasible, is it worth working on such an optimisation?
System information:
% uname -a
Linux lstacul-vm 5.11.0-25-generic #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
% podman version
Version: 3.2.1
API Version: 3.2.1
Go Version: go1.16.2
Built: Thu Jan 1 00:00:00 1970
OS/Arch: linux/amd64
% podman info
...
store:
configFile: /home/ubuntu/.config/containers/storage.conf
containerStore:
number: 3
paused: 0
running: 3
stopped: 0
graphDriverName: overlay
graphOptions: {}
graphRoot: /mnt/my-xfs/podman-user-root
graphStatus:
Backing Filesystem: xfs
Native Overlay Diff: "false"
Supports d_type: "true"
Using metacopy: "false"
imageStore:
number: 7
runRoot: /mnt/my-xfs/podman-user-root
volumePath: /mnt/my-xfs/podman-user-root/volumes
...
% cat ${HOME}/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/mnt/my-xfs/podman-user-root"
runroot = "/mnt/my-xfs/podman-user-root"
% xfs_info /mnt/my-xfs
meta-data=/dev/vdb isize=512 agcount=4, agsize=13107200 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=1 bigtime=0
data = bsize=4096 blocks=52428800, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=25600, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0