Docling

Performance Configuration

Docling uses a worker pool architecture where workers are automatically split into two groups for quality of service:
  • Small file workers: Process files that are both smaller than 1MB and shorter than 100 pages
  • Large file workers: Process files that are 1MB or larger, or 100 pages or longer
This prevents large files from blocking the processing queue.
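The routing rule above can be sketched as a small function. This is only an illustration of the stated thresholds, not Docling's actual implementation: the function name `worker_group` is hypothetical, and it assumes "1MB" means 1,048,576 bytes, which the text does not specify.

```python
ONE_MB = 1024 * 1024  # assumption: "1MB" is 1 MiB; the documentation does not say
PAGE_LIMIT = 100      # page threshold taken from the description above


def worker_group(size_bytes: int, pages: int) -> str:
    """Return "small" only when the file is under BOTH limits, else "large"."""
    if size_bytes < ONE_MB and pages < PAGE_LIMIT:
        return "small"
    # Reaching either threshold is enough to count as a large file.
    return "large"
```

Note the asymmetry: a file must stay under both limits to go to the small file workers, so a 200 KB file with 150 pages is still handled by the large file workers.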

Configuration Parameters

numWorkers: Total number of workers (automatically split 50/50 between small and large file processing)
numThreads: Total threads available across all workers. Each worker receives numThreads / numWorkers threads.
Constraints
  • numWorkers >= 2 and must be even
  • numThreads >= 2
  • numThreads must be divisible by numWorkers
  • Recommended: Leave at least 20-30% of cores free for system processes
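The hard constraints above are easy to check before deploying a configuration. The following is a minimal sketch, assuming the rules are exactly as listed; `threads_per_worker` is a hypothetical helper written for this page, not part of Docling.

```python
def threads_per_worker(num_workers: int, num_threads: int) -> int:
    """Validate a numWorkers/numThreads pair and return the threads each worker gets."""
    if num_workers < 2 or num_workers % 2 != 0:
        raise ValueError("numWorkers must be an even number >= 2")
    if num_threads < 2:
        raise ValueError("numThreads must be >= 2")
    if num_threads % num_workers != 0:
        raise ValueError("numThreads must be divisible by numWorkers")
    # Each worker receives an equal share of the total thread budget.
    return num_threads // num_workers
```

For example, `threads_per_worker(4, 36)` returns 9, while `threads_per_worker(4, 30)` raises a `ValueError` because 30 is not divisible by 4. The 20-30% headroom recommendation is advisory and is deliberately not enforced here.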
Strategy
High concurrency, small files → Many workers, fewer threads per worker
  • Maximizes parallel processing of multiple small documents
  • Use when processing many PDFs under 100 pages
Low concurrency, large files → Fewer workers, more threads per worker
  • Allocates more computational power per document
  • Use when processing scanned books or technical manuals

Extra Configuration

tableMode: Controls table extraction strategy
  • accurate: Maximum precision, higher processing time (default)
  • fast: Speed-optimized with good accuracy
  • none: Disables table extraction
doCellMatching: Enables table cell matching for improved structure recognition. Recommended for complex tables (default: true)
forceFullPageOCR: Forces full-page OCR instead of selective regions. Use when standard extraction misses content. Significantly increases processing time (default: false)

Example Configurations

The following examples consider a system with 48 CPU cores. Adjust numWorkers and numThreads based on your system’s CPU count and workload characteristics.
Use case: Steady workload with moderate file volume and average file sizes
What this does: Middle ground between high throughput and heavy processing. Provides 2 workers for small files and 2 for large files with 9 threads each, balancing concurrency and processing power.
external:
  docling:
    # Performance
    numWorkers: 4   # 2 for small files, 2 for large files
    numThreads: 36  # 9 threads per worker, leaving 12 cores free
    
    # Extra configuration
    tableMode: "accurate"
    doCellMatching: true
    forceFullPageOCR: false
Use case: Processing large volumes of documents under 100 pages and under 1MB
What this does: Maximizes concurrent file processing with 8 workers for small files and 8 for large files. Reduces threads per worker to 2 to increase parallelism. Uses fast table extraction and disables cell matching, trading precision for speed.
external:
  docling:
    # Performance
    numWorkers: 16  # 8 for small files, 8 for large files
    numThreads: 32  # 2 threads per worker, leaving 16 cores free
    
    # Extra configuration
    tableMode: "fast"        # Prioritize speed
    doCellMatching: false    # Disable for faster processing
    forceFullPageOCR: false
Use case: Processing fewer but larger or more complex documents with detailed tables
What this does: Allocates more computational power per document with 8 threads per worker. Uses only 2 workers for small files and 2 for large files to focus resources on thorough processing of complex content.
external:
  docling:
    # Performance
    numWorkers: 4   # 2 for small files, 2 for large files
    numThreads: 32  # 8 threads per worker, leaving 16 cores free
    
    # Extra configuration
    tableMode: "accurate"
    doCellMatching: true
    forceFullPageOCR: false
Use case: Processing primarily scanned documents or images requiring full OCR extraction
What this does: Reduces worker count to allocate more threads per worker for intensive OCR processing. Enables full page OCR to extract text from scanned images. Uses fewer workers (2 for small, 2 for large) with 10 threads each to maximize processing power per document.
external:
  docling:
    # Performance
    numWorkers: 4   # 2 for small files, 2 for large files
    numThreads: 40  # 10 threads per worker, leaving 8 cores free
    
    # Extra configuration
    tableMode: "accurate"
    doCellMatching: true
    forceFullPageOCR: true  # Required for scanned documents
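The per-worker thread counts and free-core figures in the comments of the example configurations can be reproduced with a few lines of arithmetic. This is a sketch for checking your own numbers: the labels in `EXAMPLES` and the helper `summarize` are hypothetical names chosen for this illustration, and the 48-core total is the one assumed by the examples.

```python
TOTAL_CORES = 48  # core count assumed by the example configurations

# (numWorkers, numThreads) pairs copied from the four examples; labels are illustrative
EXAMPLES = {
    "balanced": (4, 36),
    "high_throughput": (16, 32),
    "large_documents": (4, 32),
    "full_page_ocr": (4, 40),
}


def summarize(num_workers: int, num_threads: int, total_cores: int = TOTAL_CORES):
    """Return (threads per worker, free cores, free share of all cores)."""
    free = total_cores - num_threads
    return num_threads // num_workers, free, free / total_cores


for name, (workers, threads) in EXAMPLES.items():
    per_worker, free, share = summarize(workers, threads)
    print(f"{name}: {per_worker} threads/worker, {free} cores free ({share:.0%})")
```

With these inputs the full-page OCR example leaves 8 of 48 cores free, roughly 17%, which is slightly below the 20-30% headroom recommended under Constraints. If the host also runs other services, a lower numThreads value that is still divisible by numWorkers (for instance 36) may be a safer starting point.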