Docling V2.60.0 Offline Model Loading Bug In Docker
Hey guys, this article dives into a frustrating bug in Docling v2.60.0. Specifically, we're talking about how it messes up when trying to load offline models within a Docker environment. If you're using Docling in a Docker container and need it to work offline, meaning without relying on external internet access, then pay close attention. This is a common setup for security-conscious environments, where air-gapped systems are the norm.
The Core Problem: Offline Model Loading Failure
So, here's the deal. Docling is designed to work with models that are downloaded and stored locally. You're supposed to be able to specify a path where these models live using the DOCLING_SERVE_ARTIFACTS_PATH environment variable. In theory, Docling should then use these local models to process your documents, even without an internet connection. Sounds great, right? Unfortunately, in version 2.60.0, this process breaks down inside a Docker container. Even when you've pre-loaded all the necessary models into a designated directory (let's say /opt/docling_models), and you've correctly set the DOCLING_SERVE_ARTIFACTS_PATH environment variable, the DocumentConverter seems to completely ignore it.
Instead of looking in the specified directory, Docling ends up hunting for the model files in a temporary directory, usually something like /tmp/tmpXXXXXX/. The files aren't there, since they live in /opt/docling_models, so you get a nasty FileNotFoundError. This makes it impossible to use Docling in an offline, or air-gapped, Docker environment. That's a major issue, because the entire point of specifying an artifacts path and pre-downloading models is to guarantee offline functionality. Without it, Docling can't be used in many secure deployments. The symptom points to a failure in how the code resolves the path to the required model files.
Basically, the software looks in the wrong place despite being given the correct instructions, and that blocks offline PDF processing in Docker. We've confirmed the models are present at the location specified by DOCLING_SERVE_ARTIFACTS_PATH, and that this path is passed to PdfPipelineOptions(artifacts_path=...). The symptoms strongly suggest a misconfiguration or a bug in how Docling resolves its model paths when running inside a Docker container with these environment variables.
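If you want to double-check the "models are present" claim on your own setup, here's a minimal sketch that walks the artifacts directory and lists weight files. The helper name find_weight_files is made up for this example, and the assumption that weights end in .safetensors comes only from the model.safetensors file named in the error; adjust the suffixes if your models use something else.

```python
from pathlib import Path


def find_weight_files(root, suffixes=(".safetensors",)):
    """Return sorted paths (relative to root) of files with a matching suffix.

    Hypothetical helper for illustration; it is not part of Docling.
    """
    if not root:
        # An unset or empty path would otherwise resolve to the current directory.
        return []
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(
        str(p.relative_to(root_path))
        for p in root_path.rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
```

Calling find_weight_files(os.environ.get("DOCLING_SERVE_ARTIFACTS_PATH")) inside the container should return a non-empty list if the build-time download worked. An empty list means the problem is upstream of path resolution.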
Steps to Reproduce the Bug and Detailed Explanation
To really understand what's going on, let's walk through the exact steps to reproduce this bug. I'll provide you with everything you need to see the error yourself.
- Create a Directory for the Test Case: Start by setting up a dedicated directory for our test. This keeps everything organized.
- Create the Required Files: Inside that directory, we're going to create three key files: a `Dockerfile`, a `requirements.txt`, and a `run.py` file. These files are essential for building a Docker image that reproduces the problem.
  - `Dockerfile`: This file tells Docker how to build our image. It starts with a base Python image (`python:3.11-slim`), sets up a working directory (`/app`), copies the `requirements.txt` file, and installs the necessary Docling package. The crucial part is setting the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable to `/opt/docling_models`. We also create this directory and use `docling-tools models download` to pre-download all the models into it. Finally, it copies the `run.py` script and sets the command to execute it when the container runs.

    ```dockerfile
    FROM python:3.11-slim

    WORKDIR /app

    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Create the artifacts directory and pre-download all models as per documentation
    ENV DOCLING_SERVE_ARTIFACTS_PATH=/opt/docling_models
    RUN mkdir -p "$DOCLING_SERVE_ARTIFACTS_PATH" \
        && chmod -R 777 "$DOCLING_SERVE_ARTIFACTS_PATH" \
        && docling-tools models download -o "$DOCLING_SERVE_ARTIFACTS_PATH"

    COPY run.py .

    CMD ["python", "run.py"]
    ```

  - `requirements.txt`: This file specifies the Docling version we're using, which is `docling==2.60.0`. This ensures we're testing the exact version with the bug.

    ```text
    docling==2.60.0
    ```

  - `run.py`: This Python script is the heart of our test. It imports the necessary Docling modules, defines the artifacts path from the environment variable (`DOCLING_SERVE_ARTIFACTS_PATH`), and then sets up a `DocumentConverter`. It creates a simple dummy PDF file and attempts to convert it. The script prints messages to indicate progress and, crucially, catches any exceptions to display the error message and traceback if the conversion fails. This allows us to observe the `FileNotFoundError`.

    ```python
    import os
    import tempfile
    from docling.document_converter import DocumentConverter, PdfFormatOption, InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    print("--- Starting Docling Offline PDF Conversion Test ---")

    # 1. Define the artifacts path from the environment variable
    artifacts_path = os.environ.get("DOCLING_SERVE_ARTIFACTS_PATH")
    if not artifacts_path or not os.path.exists(artifacts_path):
        print(f"Error: DOCLING_SERVE_ARTIFACTS_PATH is not set or directory does not exist.")
        exit(1)

    print(f"Using artifacts_path: {artifacts_path}")
    print("Contents of artifacts directory:")
    # Verify that models were downloaded during build
    os.system(f"ls -R {artifacts_path}")

    # 2. Configure the converter to use the offline models
    pdf_opts = PdfPipelineOptions(
        artifacts_path=artifacts_path,
        do_ocr=False  # Keep it simple for the test
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_opts),
        }
    )
    print("\nDocumentConverter initialized successfully.")

    # 3. Create a dummy PDF file to convert
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=True) as tmp_pdf:
        # A minimal, valid PDF file
        tmp_pdf.write(b"%PDF-1.4\n1 0 obj<</Type/Page>>endobj\n2 0 obj<</Type/Catalog/Pages<</Kids[1 0 R]/Count 1>>>>endobj\ntrailer<</Root 2 0 R>>")
        tmp_pdf.flush()
        tmp_pdf_path = tmp_pdf.name

        print(f"\nAttempting to convert dummy PDF: {tmp_pdf_path}")
        try:
            # This is where the error occurs
            result = converter.convert(tmp_pdf_path)
            print("\n--- SUCCESS: PDF conversion completed without error. ---")
        except Exception as e:
            print(f"\n--- FAILURE: An error occurred during conversion. ---")
            import traceback
            print(f"Exception Type: {type(e).__name__}")
            print(f"Exception Message: {e}")
            print("\nTraceback:")
            print(''.join(traceback.format_exc()))
    ```

- Build the Docker Image: Use the command `docker build -t docling-bug-test .` from your terminal. This creates a Docker image named `docling-bug-test` based on the `Dockerfile` we created.
- Run the Container: Execute the command `docker run --rm docling-bug-test`. This runs the Docker container. The `--rm` flag automatically removes the container after it finishes.
- Observe the Output: The script will print output to your terminal. It will list the contents of `/opt/docling_models`, confirming that the models were downloaded during the build process. However, it will then fail with a `FileNotFoundError`, pointing to a path within `/tmp/` where it's looking for the model files. This is the bug in action!
This sequence of steps allows you to reliably reproduce the error and see it for yourself. The detailed script provides clarity on the problem and the specific actions to trigger the bug.
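If you'd rather not eyeball the container output, the run-and-observe steps can be scripted. This is a rough sketch under a few assumptions: the image is tagged docling-bug-test as in the steps above, docker is on your PATH, and the offending path follows Python's usual tempfile naming (/tmp/tmp followed by random characters). The function names are invented for this example. It also passes --network none, a standard docker run option that disables networking, so you know the failure isn't a hidden download attempt.

```python
import re
import subprocess

# Matches paths under a tempfile-style directory such as /tmp/tmpXXXXXX/...
TMP_PATH_RE = re.compile(r"/tmp/tmp[A-Za-z0-9_]+/[^\s'\"]*")


def extract_tmp_paths(output):
    """Return the unique /tmp/tmp*/... paths mentioned in output, in order."""
    seen = []
    for match in TMP_PATH_RE.findall(output):
        if match not in seen:
            seen.append(match)
    return seen


def run_repro(image="docling-bug-test"):
    """Run the image with networking disabled and return stdout plus stderr."""
    proc = subprocess.run(
        ["docker", "run", "--rm", "--network", "none", image],
        capture_output=True,
        text=True,
        check=False,  # the script inside the container reports its own failure
    )
    return proc.stdout + proc.stderr
```

Running extract_tmp_paths(run_repro()) gives you the list of temp-directory paths the traceback mentions. If that list is non-empty while the artifacts listing shows your models, you're looking at the behavior described in this article.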
Deep Dive into the Code and Error Analysis
Let's go further to understand why this happens. I can't point to the exact lines of code where the error occurs without digging into the source, but we can deduce a lot from the error message and the observed behavior. The FileNotFoundError is the smoking gun: it means Docling is searching for a file, in this case model.safetensors, that isn't where it expects it to be.
Looking at the provided code, we see the DOCLING_SERVE_ARTIFACTS_PATH environment variable is defined in the Dockerfile and accessed within run.py. This variable is then passed to PdfPipelineOptions. This tells us that the intended mechanism for specifying the model path is being used. However, the error suggests that the actual model loading logic inside Docling ignores this path, and instead defaults to, or calculates, a different path – the temporary directory.
This behavior strongly suggests a few possibilities:
- Hardcoded Path: There might be hardcoded paths within the Docling code that override the `artifacts_path` in certain scenarios. This is bad practice and creates inflexibility.
- Path Resolution Issues: The code might be incorrectly resolving the path to the models, potentially due to incorrect use of environment variables or relative paths. The Docker environment introduces its complexities when it comes to path resolution, especially between the build context and the runtime environment.
- Incorrect Variable Usage: The code might be using a different environment variable name internally, instead of `DOCLING_SERVE_ARTIFACTS_PATH`. So, it's not picking up the value we're setting.
- Permissions Issues: While we use `chmod 777` in the Dockerfile to give all permissions, the problem could be with the user or group the Docling process is running as inside the container. Perhaps it cannot access `/opt/docling_models` for some reason, even though the directory exists.
- Library Conflict or Bug: The issue could originate in the libraries Docling uses internally to load or access files. This is less likely, but possible.
To really pinpoint the cause, we'd need access to the Docling source code. However, based on the evidence, the most likely scenarios involve path resolution problems or hardcoded paths that are not correctly configured to use the DOCLING_SERVE_ARTIFACTS_PATH environment variable.
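Two of the possibilities above, the variable name and the permissions, can be checked without touching Docling at all. Here's a small sketch; diagnose_artifacts is a made-up helper. The second variable name, DOCLING_ARTIFACTS_PATH, is included on the assumption that the core library reads an unprefixed variable while the DOCLING_SERVE_ prefix belongs to docling-serve. Treat that as something to verify against the documentation for your version, not as a confirmed cause.

```python
import os


def diagnose_artifacts(
    env,
    names=("DOCLING_SERVE_ARTIFACTS_PATH", "DOCLING_ARTIFACTS_PATH"),
):
    """Report, per variable name, whether it is set and points at a usable dir.

    Hypothetical diagnostic; the variable names are assumptions to verify.
    """
    report = {}
    for name in names:
        value = env.get(name)
        report[name] = {
            "value": value,
            "is_dir": bool(value) and os.path.isdir(value),
            # R_OK | X_OK: the current user can list and traverse the directory.
            "readable": bool(value) and os.access(value, os.R_OK | os.X_OK),
        }
    return report
```

Run diagnose_artifacts(os.environ) at the top of run.py. If the variable you set shows is_dir and readable as True and the conversion still goes to /tmp, permissions and a missing directory are effectively ruled out.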
Impact and Implications of this Bug
This bug has some serious consequences, especially for those needing to use Docling in secure or offline environments. Here's a breakdown of the impact:
- Air-Gapped Systems: The primary impact is that you cannot use Docling to process PDF files in a completely offline, air-gapped Docker setup. This is a major limitation for systems where security is paramount, and internet access is prohibited.
- Security Concerns: If you are forced to use online model downloads for a system that should be offline, this can introduce security vulnerabilities. You are then at the mercy of the availability and security of external servers and models.
- Operational Restrictions: The need to always have an internet connection to process files through Docling limits your options. It could be a show-stopper for your project or workflow, depending on the requirements.
- Limited Usability: It reduces Docling's overall flexibility and usability in various scenarios. It restricts the deployment options and prevents use in edge cases.
- Increased Complexity: Developers and operators must find workarounds or custom solutions to mitigate the bug, which increases complexity and adds maintenance overhead. This might require additional scripts, model storage, or manual intervention, making the process less automated and reliable.
In short, this bug makes Docling unusable in many of the situations where it would be most valuable. It severely hinders its offline capabilities, which are crucial for security and certain operational environments.
Version Details and Further Information
Let's clarify some key details about the specific versions and the environment in which this bug occurs. This information is crucial for pinpointing and fixing the problem.
- Docling Version: The bug is observed in Docling version 2.60.0. This is the specific version affected, so it's a good starting point for your investigations if you are encountering a similar issue.
- Python Version: The test environment uses Python 3.11.9. Note that the Dockerfile pins only python:3.11-slim, so the exact patch version you get depends on when the base image is pulled.
- Docker Environment: The bug specifically manifests inside a Docker container. This means the problem relates to interactions between Docling and Docker's file system, environment variables, and possibly user permissions.
- Reproducibility: I've provided the detailed steps to reproduce the bug. This is incredibly valuable because it enables anyone to replicate the issue and to confirm the fix.
If you are facing this issue, check whether your versions match the ones above. If you're on a different Docling version, pinning docling==2.60.0 and rebuilding will tell you whether you're hitting this same bug or a different one. You can also set up the Docker environment exactly as described in the reproduction steps.
Potential Workarounds and Solutions
While there isn't a perfect workaround without a fix from the Docling developers, here are a few potential strategies you could try to mitigate this issue:
- Manual Model Placement: Manually copy the model files into the temporary directory (`/tmp`) after the container starts, and before Docling tries to use them. This is, however, highly undesirable because it's not sustainable and complicates the build and deployment process. It also introduces potential security risks by exposing the models outside of your defined artifact path.
- Modify the Docling Code: If you have access to the Docling source code (perhaps you're using a fork), the best solution would be to examine the code to see why it isn't honoring `DOCLING_SERVE_ARTIFACTS_PATH` and correct the path resolution logic. This could involve changing how the path is constructed or checking where the model files are located. Remember that this would require building your own modified version of Docling.
- Environment Variable Workaround: Experiment by setting environment variables in ways that might influence the path resolution. This might involve setting additional environment variables that could somehow interfere with Docling's pathing logic, although it's unlikely to solve the root problem.
- Docker Volume: You can try mounting the models directory as a Docker volume at the artifacts path, so the models are mapped to the location Docling is expected to read. This is also not ideal, and there's no guarantee it changes where Docling actually looks.
- Contact Docling Developers: The most appropriate step is to report the bug to the Docling development team, providing the detailed reproduction steps. They can investigate the issue and release a patch. This is often the most reliable way to address software defects.
- Check for Updates: Keep an eye out for updates to Docling. The bug might be fixed in a later release. Always check the release notes for bug fixes or any known workarounds. Regularly update Docling to ensure that you are using the latest version of the program.
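If you do file a bug report, it helps to attach the environment details along with the reproduction steps. Here's a minimal sketch that collects them using only the standard library. The function names are invented for this example, and the /.dockerenv check is a common heuristic for "am I inside Docker", not a guarantee.

```python
import os
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_report(env=None):
    """Gather the details worth pasting into a bug report."""
    env = os.environ if env is None else env
    return {
        "docling": package_version("docling"),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "artifacts_path": env.get("DOCLING_SERVE_ARTIFACTS_PATH"),
        # Heuristic only: Docker usually creates this marker file.
        "in_docker": os.path.exists("/.dockerenv"),
    }
```

Printing collect_report() from inside the container gives the maintainers the Docling version, Python version, and artifacts path in one go, which saves a round of back-and-forth questions.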
Conclusion and Next Steps
In conclusion, Docling v2.60.0 has a significant bug preventing proper loading of offline models from a specified artifacts path when used within a Docker container. This is preventing offline functionality in many air-gapped systems.
The steps to reproduce the bug are well-defined, and the expected behavior is clearly outlined. Until the Docling developers resolve this issue, you are left with workarounds that are less than ideal. The best course of action is to report the bug to the Docling team, monitor for updates, and explore the workaround options.
I hope this detailed analysis helps you understand and troubleshoot the issue. Good luck!