
16 November, 2024 — Exploring the 8-year-old bottleneck in GitLab's Docker builds, its impact on CI performance, and the challenges of optimisation

The Great DinD Bottleneck on GitLab Runners

In the midst of an otherwise ordinary workday, it became apparent that my GitLab CI pipeline had been creeping along at a frustrating pace for far too long. The slow build times had become too noticeable to ignore, and optimising the pipeline was clearly the next step to get things back on track.

This pipeline, as it stands, is one I inherited when I joined the team, handed down like an heirloom, complete with its quirks and outdated configurations. Anyhow, I got to work: removing unused packages from abandoned experiments, optimising import statements, and utilising BuildKit for its parallelism and improved layer handling. These changes reduced the image size from 1GB to 300MB and shaved over a minute off the build time.
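For context, BuildKit isn't always the default builder for the docker CLI, and the usual way to opt in from a GitLab job is the DOCKER_BUILDKIT variable. A minimal sketch, with the job name and image tag as placeholders rather than the pipeline's real configuration:

```yaml
build-image:                      # placeholder job name
  variables:
    DOCKER_BUILDKIT: "1"          # tell `docker build` to use BuildKit instead of the legacy builder
  script:
    - docker build -t my-app .    # "my-app" is an illustrative tag
```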

The next thing that caught my attention, after diving into BuildKit, was the Dockerfile and the whole Docker-in-Docker (DinD) setup.

For most use cases, DinD has become the de facto standard when running CI jobs on GitLab. It provides a convenient, isolated environment for building and running Docker containers within your CI pipeline. The appeal here is that the run-once-and-throwaway nature reduces the chance of errors and ensures that your jobs always run in a consistent, standardised environment across different pipelines.
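A typical DinD job wires a throwaway Docker daemon into the pipeline via the services keyword. A minimal sketch, with image versions and names as placeholders and the TLS variable following GitLab's documented setup:

```yaml
build:
  image: docker:24.0                   # docker CLI inside the job container
  services:
    - docker:24.0-dind                 # ephemeral Docker daemon, discarded when the job ends
  variables:
    DOCKER_TLS_CERTDIR: "/certs"       # enable TLS between the CLI and the dind daemon
  script:
    - docker build -t my-app:"$CI_COMMIT_SHORT_SHA" .
```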

While it is possible to use the host machine's Docker directly, by connecting to its Docker socket over TCP or mounting docker.sock as a volume, the general recommendation is to avoid this. Especially when using public runners, DinD offers the crucial advantage of isolating your CI jobs from those of other users. This isolation is important in shared environments, as it prevents one job from interfering with another, maintaining security and stability.
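For contrast, the socket-binding variant looks roughly like this from the job's side. It assumes the runner's own configuration mounts /var/run/docker.sock into job containers, which is exactly the setup being advised against on shared runners:

```yaml
build:
  image: docker:24.0
  # No dind service: the docker CLI talks to the HOST's daemon through the
  # mounted /var/run/docker.sock, so builds (and their caches) are shared
  # with every other job on that machine.
  script:
    - docker info                      # reports the host daemon, not an ephemeral one
    - docker build -t my-app .
```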

Placebo

Because of the disposable nature of Docker-in-Docker (DinD), the layer cache from one build isn't available to future build jobs. This is where the inline cache comes into play: a mechanism that embeds cache metadata into the pushed image so later builds can reuse it, which can drastically reduce build times if your Dockerfile is well-structured and the majority of layers remain unchanged between builds.

Many resources, including forums like Stack Overflow and GitLab's official documentation, recommend enabling BuildKit with the --build-arg BUILDKIT_INLINE_CACHE=1 flag and pointing --cache-from at an existing image pulled from the GitLab registry before the build step. This is touted as a way to leverage cached layers from previous builds and speed up the process.
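Put together, the commonly recommended recipe looks roughly like the following. The registry variables are GitLab's predefined ones; the single ":latest" tag scheme is just for illustration:

```yaml
build:
  image: docker:24.0
  services:
    - docker:24.0-dind
  variables:
    DOCKER_BUILDKIT: "1"
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # Pull the previous image so its layers can, in theory, seed the cache
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    - >-
      docker build
      --build-arg BUILDKIT_INLINE_CACHE=1
      --cache-from "$CI_REGISTRY_IMAGE:latest"
      -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```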

However, after digging into this further, I stumbled upon two outstanding issues [Issue #17861 & Issue #1107] in the official gitlab-runner repository, both still unresolved. Despite the many solutions and recommendations that claim success, these issues suggest that enabling the additional flags may be more of a placebo than a genuine solution.

Bottleneck

After snapping out of the placebo effect and countless hours of troubleshooting with no pipeline improvements, the real issue finally became clear.

The issue seems to be related to DinD's caching mechanism, which, despite the various flags and optimisations, doesn't always behave as expected when changes are made, even in later stages of the build. My Dockerfile is properly optimised for caching: when no changes are made to the code, DinD successfully reuses the cache for all layers, including the base image. However, when there is any modification, even one that only affects later layers (such as a change in the application code or build arguments), the cache for the earlier, untouched layers appears to be invalidated as well.

My hypothesis is that DinD's cache invalidation might not be granular enough to distinguish between changes in later stages and the dependencies of earlier steps, so the entire build ends up being affected even though the earlier layers themselves are unchanged. Or there may be some deeper issue in the way DinD is built that prevents it from interacting with Docker's cache system the way we would normally expect.
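One way to probe this, sketched below as a hypothetical job rather than anything from the original pipeline, is to build twice against the same ephemeral daemon: the first build can only reuse layers via --cache-from, while the second also has the daemon's own local cache from moments earlier. If the second build reuses the early layers and the first doesn't, the loss is in the cross-job path rather than in Docker's layer caching itself. The image names and the touched file are placeholders.

```yaml
cache-probe:
  image: docker:24.0
  services:
    - docker:24.0-dind
  script:
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    # Build 1: only the pulled image is available as a cache source
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t probe:first .
    # Simulate a change that should only invalidate later layers
    - echo "# trivial change" >> app/main.py      # placeholder path
    # Build 2: the daemon's local cache from build 1 is now also available
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t probe:second .
```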

Stopgap

Until this long-forgotten issue is addressed (which, given its indefinite deprioritisation, might be a while), there are a few stopgap measures that can mitigate the impact:

(a) Provision a Private Runner: You could set up a private GitLab runner and connect it to its own Docker daemon. This would isolate your CI jobs and allow for more consistent caching, but it requires additional infrastructure and maintenance.

(b) Connect to the Public Host's Docker: Another option is to connect to the Docker instance on the public host machine. However, this comes with obvious security risks: since any user's job can access the host's Docker daemon, you're introducing potential vulnerabilities into the pipeline.

(c) Injecting Cache into DinD: There's also an untested approach where you inject the cache directly into the DinD container, either by mounting a volume or by leveraging GitLab's CI caching system before the job starts (a rough sketch follows below). Given that this isn't a documented method, it might require significant trial and error, and the caveat is that it could corrupt the cache if not used carefully.
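For completeness, here is a rough, untested sketch of what option (c) could look like, using GitLab's cache keyword to carry a saved image tarball between jobs and docker load to seed the fresh DinD daemon. Image names are placeholders, and since --cache-from behaviour is precisely the unreliable part, treat this as an experiment rather than a recipe:

```yaml
build:
  image: docker:24.0
  services:
    - docker:24.0-dind
  cache:
    key: docker-image-cache
    paths:
      - image-cache/                 # persisted by the GitLab runner between jobs
  script:
    - mkdir -p image-cache
    # Seed the empty DinD daemon with the image saved by a previous job, if any
    - if [ -f image-cache/app.tar ]; then docker load -i image-cache/app.tar; fi
    - docker build --cache-from app:latest -t app:latest .   # "app" is a placeholder
    # Save the freshly built image so the next job can try to reuse its layers
    - docker save app:latest -o image-cache/app.tar
```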

I hit a bit of a dead end here, though. With no immediate need to speed up the pipeline (it wasn't urgent, just a fun detour), I decided to walk away from further optimisation attempts. Embracing the fact that the build starts from scratch each time actually gave me a sense of reassurance: there's a kind of peace in knowing that each job runs in a clean, consistent, sandbox-like space, free from the risk of hidden corruption or unpredictable caching errors.

References

Interesting post on Docker-in-Docker by @jpetazzo, who used to be a developer at Docker Inc.: Using Docker-in-Docker for your CI or testing environment? Think twice.


Made by Owyong Jian Wei