
Sharing memory between containers #996

@stac47

Description


Hello,
I am trying to figure out whether it is possible to share loaded shared libraries between containers. The rationale behind this is that if I build an image twice (without using the cache, i.e. with --no-cache), and provided I take care of creating reproducible layer FS diffs, then the instantiated containers will not use twice the amount of memory.
To make this more explicit, and without talking about shared libraries, let's have a look at the following image:

% cat Dockerfile
FROM ubuntu
RUN apt update && \
    apt install --yes vmtouch
RUN touch /output.dat && \
    dd if=/dev/zero of=/output.dat  bs=1M  count=24 && \
    touch -d @0 /output.dat
CMD vmtouch -l /output.dat
% podman build --no-cache -t img1 .
% podman run --rm img1

Running a container from the described image, I can see that the output.dat file is mapped once in memory. The Proportional Set Size matches the size of the file mapped in memory (see the Pss column).

% ps -eF | grep vmtouch | grep -v grep  | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 24576      24576         0        0              0             0              0               0    0       0  24576           0 output.dat
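
As a cross-check that does not rely on pmap, the same Rss/Pss figures can be read directly from /proc/<pid>/smaps. A minimal sketch, using the PID 61371 reported by ps above:

% # print the mapping header for output.dat plus its Rss/Pss counters
% grep -A 15 output.dat /proc/61371/smaps | grep -E -e 'output.dat' -e '^(Rss|Pss):'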

Now running the same image a second time:

% podman run --rm img1
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat
% ps -eF | grep vmtouch | grep -v grep | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61665   61663  0  6778 25224   5 13:50 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61665 | grep -e "output.dat" -e "Pss"
61665:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f1d99fe5000 r--s 00000000  00:48    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat

The PSS is now divided by two, meaning the file is mapped once and shared by the two containers. This is expected because I use the "overlay" storage driver (it would not have worked with "vfs" backed by "ext4", for instance, because of the lack of reflink support).
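
One way to see the file identity behind the sharing, without pmap, is to extract the device and inode columns from /proc/<pid>/maps for both processes (PIDs taken from the ps output above). The device column differs, presumably because each container has its own overlay mount, but the inode of the backing file is the same, as the pmap outputs above already show:

% # columns 4 and 5 of /proc/<pid>/maps are the device and inode of the mapped file
% for pid in 61371 61665; do grep output.dat /proc/$pid/maps | awk '{print $4, $5}'; done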

Now let's build another image without using the cache. Because output.dat is filled from /dev/zero and its timestamp is normalised with touch -d @0, the way it is created results in the same layer diff, as we can see below:

% podman build --no-cache -t img2 .
% diff -u <(podman inspect --format="{{json .RootFS}}" img1 | jq .) <(podman inspect --format="{{json .RootFS}}" img2 | jq .)
--- /proc/self/fd/17    2021-08-16 13:47:40.588000000 +0000
+++ /proc/self/fd/20    2021-08-16 13:47:40.592000000 +0000
@@ -2,7 +2,7 @@
   "Type": "layers",
   "Layers": [
     "sha256:7555a8182c42c7737a384cfe03a3c7329f646a3bf389c4bcd75379fc85e6c144",
-    "sha256:31121a0cb3f5e2a4dd0d68d7d6b6de617d8d937b8b41e5ae5a13c5304c3dfe28",
+    "sha256:0a075e0d3129290f16273f6d6b7c56ae0b282cee8365d2aaa28b327fcc6825d0",
     "sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f"
   ]
 }

The last layer sha256:62b02674a316aa00ca3e17fe18af907dffc03a4d082b749e15c159418be1ed8f is the same in both images.
If I run a container from this latest image, I can see that the PSS of output.dat in the first container I ran does not decrease, because the file does not have the same device ID/inode.

% podman run --rm img2
% pmap -X 61371 | grep -e "output.dat" -e "Pss"
61371:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7f6b9a31b000 r--s 00000000  00:39    184907 24576 24576 12288      24576         0        0              0             0              0               0    0       0  12288           0 output.dat
% ps -eF | grep vmtouch | grep -v grep  | grep -v -e'/bin/sh'
ubuntu     61371   61369  0  6778 25116   3 12:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61665   61663  0  6778 25224   5 13:50 ?        00:00:00 vmtouch -l /output.dat
ubuntu     61695   61693  0  6778 25260   5 13:51 ?        00:00:00 vmtouch -l /output.dat
% pmap -X 61695 | grep -e "output.dat" -e "Pss"
61695:   vmtouch -l /output.dat
         Address Perm   Offset Device     Inode  Size   Rss   Pss Referenced Anonymous LazyFree ShmemPmdMapped FilePmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked THPeligible Mapping
    7ff800850000 r--s 00000000  00:56 134512622 24576 24576 24576      24576         0        0              0             0              0               0    0       0  24576           0 output.dat
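
For completeness, the two images really do store two distinct copies of output.dat on disk. Assuming the default on-disk layout of the overlay driver (layer diffs under <graphRoot>/overlay/<layer>/diff/, with the graphRoot reported by podman info below), this can be checked with:

% # print device and inode of every copy of output.dat stored under the graph root
% find /mnt/my-xfs/podman-user-root/overlay -name output.dat -exec stat -c '%d %i %n' {} \;

Each copy has its own inode, which is consistent with the different inode numbers seen in the mappings above, hence the separate PSS.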

This makes sense to me when the backing filesystem is something that does not support reflinks, like ext4. So I tried the same experiment on XFS with reflink enabled, but I get the same result. I would have expected this to be improvable, because the mmap'ed file content is in fact the same when one file is a reflink of the other.
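
The reflink behaviour itself can be reproduced outside of podman. A minimal sketch on the same XFS mount (the file names a.dat and b.dat are made up for the illustration): cp --reflink=always creates a copy that shares extents with the original, yet stat still reports two distinct inodes, and filefrag should report the extents as shared.

% cd /mnt/my-xfs
% dd if=/dev/zero of=a.dat bs=1M count=24     # original 24 MiB file
% cp --reflink=always a.dat b.dat             # copy-on-write clone sharing the same extents
% stat -c '%d %i %n' a.dat b.dat              # same device, two different inodes
% filefrag -v a.dat b.dat                     # extent flags show whether the blocks are shared

Mapping a.dat and b.dat from two processes then corresponds to the situation described above: identical on-disk data, but two inodes.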
So my questions:

  • Do you think such an improvement is feasible?
  • Is there a limitation in the kernel, for instance?
  • If it is feasible, is it worth working on such an optimisation?

System information:

% uname -a
Linux lstacul-vm 5.11.0-25-generic #27-Ubuntu SMP Fri Jul 9 23:06:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
% podman version
Version:      3.2.1
API Version:  3.2.1
Go Version:   go1.16.2
Built:        Thu Jan  1 00:00:00 1970
OS/Arch:      linux/amd64
% podman info
...
store:
  configFile: /home/ubuntu/.config/containers/storage.conf
  containerStore:
    number: 3
    paused: 0
    running: 3
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /mnt/my-xfs/podman-user-root
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 7
  runRoot: /mnt/my-xfs/podman-user-root
  volumePath: /mnt/my-xfs/podman-user-root/volumes
...
% cat ${HOME}/.config/containers/storage.conf
[storage]
driver = "overlay"
graphroot = "/mnt/my-xfs/podman-user-root"
runroot = "/mnt/my-xfs/podman-user-root"
% xfs_info /mnt/my-xfs
meta-data=/dev/vdb               isize=512    agcount=4, agsize=13107200 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0
data     =                       bsize=4096   blocks=52428800, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=25600, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
