Docker: Reduce the size of a container

In container environments (Docker or Kubernetes) we need to deploy quickly, and the most important thing for this is their size. We must reduce them so that the download of them from the registry and their execution is slower the larger the container is and that minimally affects the complexity of the relations between services.

For a demonstration, we going to use a PowerDNS-based solution, I find that the original PowerDNS-Admin service container (https://github.com/ngoduykhanh/PowerDNS-Admin/blob/master/docker/Production/Dockerfile) has the following features:

The developer is very active and includes python code, javascript (nodejs) and css. The images in docker.hub are obsolete with respect code.
Production Dockerfile does not generate a valid image
It is based on Debian Slim, which although deletes a large number of files, is not sufficient small.

In Docker Hub there are images, but few are sufficient recent or do not use the original repository so the result comes from an old version of the code. For example, the most downloaded image (really/powerdns-admin) does not take into account the modifications of the last year and does not use yarn for the nodejs layer.

Table of Contents

First step: Decide whether to create a new Docker image

Sometimes it is a matter of necessity or time, in this case it has been decided to create a new image taking into account the above. As minimum requirements we need a GitHub account (https://www.github.com), a Docker Hub account (https://hub.docker.com/) and basic knowledge of git as well as advanced knowledge of creating Dockerfile.

In this case, https://github.com/aescanero/docker-powerdns-admin-alpine and https://cloud.docker.com/repository/docker/aescanero/powerdns-admin is created and linked to automatically create the images when a Dockerfile is uploaded to GitHub.

Second step: Choose the base of a container

Using a base of very small size and oriented to the reduction of each of the components (executables, libraries, etc.) that are going to be used is the minimum requirement to reduce the size of a container, and the choice should always be to use Alpine (https://alpinelinux.org/).

In addition to the points already indicated, standard distributions (based on Debian or RedHat) require a large number of items for package management, base system, auxiliary libraries, etc. Alpine eliminates these dependencies and provides a simple package system that allows identifying which package groups have been installed together to be able to eliminate them together (very useful for developments as we will see later)

Using Alpine can reduce the size and deployment time of a container by up to 40%.

Another option of recent appearance to consider is Redhat UBI, especially interesting is ubi-minimal (https://www.redhat.com/en/blog/introducing-red-hat-universal-base-image).

Third step: Install minimum packages

One of Alpine’s strengths is a simple and efficient package system that in addition to reducing the size reduces the creation time of the container.

The packages that we need to install are going to be divided into two blocks, the block of packages that will be left in the container and those that will help us to develop the container:

FROM alpine:3.10
MAINTAINER "Alejandro Escanero Blanco <aescanero@disasterproject.com>"

RUN apk add --no-cache python3 curl py3-mysqlclient py3-openssl py3-cryptography py3-pyldap netcat-openbsd \
  py3-virtualenv mariadb-client xmlsec py3-lxml py3-setuptools
RUN apk add --no-cache --virtual build-dependencies libffi-dev libxslt-dev python3-dev musl-dev gcc yarn xmlsec-dev

The first line will tell Docker what is the basis to use (Alpine version 3.10), the second is information about who creates the Dockerfile and from the third we are creating the script that will generate the container.

The first RUN installs the basic python3 packages with SSL, Mysql and LDAP support with the “apk add” command and a very important feature of apk: “--no-cache” that avoids saving downloaded files as package or temporary indexes.

The second RUN line has another very useful feature of apk: “--virtual VIRTUAL_NAME_PACK“, the list of packages to be installed is assigned to a single name that will allow us to remove all the packages that are installed in this step. So this list will include all the packages that are used to create a development and then we will remove at the end.

Fourth step: Python

Python is a programming language with a huge number of modules, usually those that are required for an application must be in a file called requirements.txt and installed with a package system called pip (pip2 for Python2, pip3 for Python3).

We find another package system, which unlike the usual ones (apk, deb, rpm) will download the sources and if it requires binaries it will proceed to compile them. This forces us to have a development environment that in the previous step we have labeled in order to eliminate that environment.

One of the problems that we can find is one in which the pip package management system is really slow, but there is a package (case of py_lxml that is to be installed by pip3) or the case of some installation process from sources such as the case of python_xmlsec that we install quickly with a single line:

RUN mkdir /xmlsec && curl -sSL https://github.com/mehcode/python-xmlsec/archive/1.3.6.tar.gz | tar -xzC /xmlsec --strip 1 && cd /xmlsec && pip3 install --no-cache-dir .

Another process that accelerates the creation of a container is not to use git when this is possible, as in the previous case for xmlsec, in powerdnsadmin we will do the same, remove the packages already installed from requirements.txt and run the python package installer at same as in Alpine without saving cache of any kind with “--no-cache-dir“

RUN mkdir -p /opt/pdnsadmin/ && cd /opt/pdnsadmin \
  && curl -sSL https://github.com/ngoduykhanh/PowerDNS-Admin/archive/master.tar.gz | tar -xzC /opt/pdnsadmin --strip 1 \
  && sed -i -e '/mysqlclient/d' -e '/pyOpenSSL/d' -e '/python-ldap/d' requirements.txt \
  && pip3 install --no-cache-dir -r requirements.txt

Finally, we must undestand that python applications usually use virtualenv, we must minimize its effect by indicating that you use the system packages and do not install in the local directory with “virtualenv --system-site-packages --no-setuptools --no-pip” The container is already a virtual environment, it is not necessary to add more layers.

Fifth step: Nodejs

The last step in development is the optimization of the deployment tasks of the javascript libraries and the style sheets, for this we must check if it uses npm or yarn for the installation of packages. The use of yarn is recommended because it is more efficient in downloading packages and the ability to eliminate duplicates within the tree of downloaded packages.

As the application to be installed uses “flask assets“, whose function is to unify the javascript libraries and the style sheets that the application will use, we have to check what elements of the node_modules folder will use javascript libraries and the style sheet, in the case of powerdns-admin are bootstrap, font-awesome, icheck, ionicons and multiselect, as well as some other files that we keep in a temporary folder and then return it to its original place. The remaining installed elements as well as the yarn cache are deleted.

Sixth step: other resources

Finally, it is necessary to check what happens when the container starts the application and see what applications can be replaced by elements already installed.

In this case the entry point of the container is an entrypoint.sh file that executes a some database queries before launching the service. The first query is to know if the port is open and does it with netcat, but the requirements of netcat are so meager that its replacement is meaningless.

However, if we have a candidate with mariadb-client, since it is only used to launch two simple SQL queries. Creating a python script that does exactly the same is easy and visibly reduces (about 20MB) the size of the container.

Conclusions

The difference between a container without optimization and another optimized can be really important affecting the creation, download and deployment times. As an example (using the test environment explained at Choose between Docker or Podman for test and development environments ) in the following video we see the creation of the container without optimization:

And here optimized:

As we can see there are big differences, the first is that the image not optimized is generated in 6:51 and the one optimized in 2:35. The second and most important is the size, the first is 1.11 GB and the optimized 321 MB both uncompressed. Server storage also appreciates it.