Avoiding inode exhaustion by making 13f745092f2685877ec13f0f984d89b3096d494b configurable #26205
-
If you are running as root, I would assume recent podman and kernels all default to idmapped mounts instead of making copies. If I try that locally I see no such leak happening. That doesn't change the fact that there is still a leak in cases where idmapped mounts are not supported (e.g. rootless containers). I would be against adding such an option to podman run; there is one thing I wonder about, though. cc @giuseppe
-
Running kernel 6.12 acts the same.
-
Thanks for your help, it is now much clearer what is going on. I've opened a PR: containers/storage#2346
-
As a temporary workaround, you could try to force contiguous mappings (so that there are no holes in the container IDs).
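For anyone trying this, a contiguous mapping along those lines might look like the following; the mapping values are purely illustrative:

```
# A single contiguous UID/GID range with no holes in the container IDs
# (the host range 100000-165535 is only an example).
podman run --rm \
  --uidmap 0:100000:65536 \
  --gidmap 0:100000:65536 \
  docker.io/library/almalinux:8 id
```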
-
Terrific! Thanks for bearing with us.
-
Problem
We have a business need that requires the use of dynamically created UID/GID mappings when running containers in a rootful environment. Unfortunately, this results in the exhaustion of inodes in the filesystem hosting the image store. What follows is a detailed look at the problem; some things we tried without resorting to changing podman/containers code; and a look at an experimental solution we've come up with.
Background
Apologies for teaching your grandmother to suck eggs by going into detail, but it helped me better understand the problem.
The commit 13f745092f2685877ec13f0f984d89b3096d494b added code to check the UID and GID mapping specifications of a layer when looking for candidate layers to start a container. As indicated in the commit summary, this was done for performance reasons.
The code in question is found in containers/storage/store.go. What this means is that when a layer is required to start a container, the image store is checked to see whether there is a matching layer; candidate layers are then checked to see whether their UID and GID maps match those specified on the podman run command.
For example, if I pull the image docker.io/library/almalinux:8 it will be placed in the image store /var/lib/containers/storage. Of specific interest is the location /var/lib/containers/storage/overlay, where you would find a directory for the image's layer. That directory contains a number of subdirectories, notably diff and merged. This represents a layer of an image that is ready to be instantiated in a running container; the layer contains no UID/GID mapping information.
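To make that concrete, inspecting the store looks roughly like this; the layer ID below is a placeholder and the exact directory contents can vary between versions:

```
# Pull the image and look at the per-layer directories in the overlay store.
podman pull docker.io/library/almalinux:8
ls /var/lib/containers/storage/overlay

# Each layer directory holds (among other things) the diff and merged
# subdirectories discussed above.
ls /var/lib/containers/storage/overlay/<layer-id>
```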
A subsequent podman run will cause podman to search for candidate layers to be used to run a container. podman will find this "template" (for want of a better term) and use it to build a runnable container. As we have no UID/GID mapping specification set, it will select this layer to be used in the container. Once the container is up and running, look at the storage area again.
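A minimal way to observe this (the container name is just an example):

```
# Start a container with no UID/GID mapping options and keep it around
# long enough to look at the store.
podman run -d --name demo docker.io/library/almalinux:8 sleep 60

# The overlay directory now shows the template layer plus the running
# container's layer.
ls /var/lib/containers/storage/overlay

podman rm -f demo
```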
Note that there are now two entries: (1) the template and (2) the layer in use (the "running" layer). The running layer's diff and merged subdirectories are populated for the running container. Once the container has finished running, this running layer is cleaned up and we end up with /var/lib/containers/storage/overlay containing just the original template. Using this same image, we now run it again, this time with different UID and GID mappings.
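Such a run might look like this; the mapping values are illustrative rather than the ones from our environment:

```
# Run the same image, this time with an explicit UID/GID mapping.
podman run --rm \
  --uidmap 0:200000:65536 \
  --gidmap 0:200000:65536 \
  docker.io/library/almalinux:8 true
```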
This time when podman is looking for candidate layers to use it will note that our original “template” doesn’t have the UID/GID mapping we want. So, believing this won’t be the only time this layer will be used with the same mapping specification, it will prepare a second “template” and then instantiate a running container using that template.
When the container completes running and is torn down, we end up with the new template in the store, ready to be used again. However, if I now run with yet another, different mapping, the same thing happens: while that container runs the storage area holds the existing templates plus the running layer, and once it completes we are left with one template per distinct mapping specification. And thus we grow and consume inodes. Repeat this process hundreds or thousands of times and we exhaust the inodes of the filesystem.
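One way to watch the growth, sketched under the assumption that each iteration uses a mapping podman has not seen before:

```
# Each iteration uses a different host UID/GID range, so each run leaves
# behind a new remapped "template" layer in the image store.
for i in $(seq 1 100); do
  base=$(( 100000 + i * 65536 ))
  podman run --rm \
    --uidmap 0:${base}:65536 \
    --gidmap 0:${base}:65536 \
    docker.io/library/almalinux:8 true
done

# Inode usage on the filesystem hosting the image store keeps climbing.
df -i /var/lib/containers/storage
```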
Circumventions
In this section I describe a couple of approaches we looked at to circumvent this behaviour.
Additional Store Area
This is a feature of containers/storage, configured in /etc/containers/storage.conf, that can be used to hold images.
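The relevant setting is the additionalimagestores list; a minimal sketch of the storage.conf fragment, with an illustrative path:

```
# Fragment of /etc/containers/storage.conf; the path is an example only
# and should be merged into any existing [storage.options] section.
[storage.options]
additionalimagestores = [
  "/var/lib/shared-images",
]
```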
If an image is pulled into the additional storage area, then when it is run only the running layers end up in /var/lib/containers/storage/overlay, and they are removed once the container completes running. This happens for whatever UID/GID mapping specification is on the podman run command line, so we see no growth in inodes. A podman images command shows slightly different output, with the addition of an R/O column.
Drawbacks
There are a number of negatives to this approach, one of them involving the --output option on the podman command: if there is anything in the Dockerfile that requires modification of a layer (notably removing a file), then podman build objects and the action is ignored. This should come as no surprise, as within the code and the configuration file this area is referred to as "read only".
Separate Image Store
containers/storage also allows for the specification of an imagestore area. This separates image layers "at rest" from those instantiated at run time (which containers/storage refers to as the graphroot, its primary location of container storage).
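For context, this is roughly the kind of storage.conf change involved; the imagestore key and its placement are my reading of containers-storage.conf(5) and may differ between releases, so treat this as a sketch rather than a verified configuration:

```
# Sketch of a storage.conf using a separate image store; the "imagestore"
# key and its placement under [storage] are assumptions and may vary.
[storage]
driver = "overlay"
graphroot = "/var/lib/containers/storage"
imagestore = "/var/lib/containers/imagestore"
```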
This also looked promising, as this separation might prevent inode growth. However, I was never able to get things running to test the effects of UID/GID mapping, as a podman run fails with an error.
An strace shows the chown failing with -EROFS. Again, although the comments in the storage.conf file and the man page don't state it, the code also refers to this area as read-only.
Options
So, as you can see, we tried some things that were faulty in concept or application. We therefore looked at new approaches.
According to the code and documentation, podman is "working as designed". However, I would argue there are good reasons to change the behaviour of podman (or rather the containers/storage component it brings in during the package build) to make the performance feature introduced with commit 13f745092f2685877ec13f0f984d89b3096d494b configurable, with a default of it being enabled.
To this end we added a new option to the podman run command, --nocache, which causes the code in store.go to stop looking for layers once a match is made, regardless of the UID/GID mappings of that layer (we are happy with an imperfect match). To achieve this we changed the definition of the type IDMappings to include a new field, NoCache. Additional code was required in cmd/podman/containers/run.go to define the new flag, and in the spec generation to use this value. If this approach is useful I can provide the entire code as a draft PR.
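For clarity, with a podman built from this experimental change (stock podman has no such flag), usage would look something like this:

```
# Assumes a podman patched with the experimental --nocache option described
# above; this flag does not exist in unmodified podman.
podman run --rm --nocache \
  --uidmap 0:300000:65536 \
  --gidmap 0:300000:65536 \
  docker.io/library/almalinux:8 true

# Expectation: no additional remapped "template" layer is left behind.
ls /var/lib/containers/storage/overlay | wc -l
```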
Findings
When we use this code and add the --nocache option to the run command we observe the desired behaviour: whether the container is re-run with --nocache and the same mapping, or run with --nocache and a different IDMapping, the image store does not accumulate additional template layers.
Conclusion
Making the behaviour configurable appears to meet the requirement of preventing boundless growth of inode use as containers are run, but the approach may well be naïve. However, it may point to a more robust solution. Among other things, such a change also means that things like podman inspect need to cater for this new field within IDMappings, and that similar logic would need to be added to the podman container create command.