Enhanced Capabilities

Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.

Available Capabilities

Capability   | Description                                   | Example Use Cases
multilingual | Enhanced support for languages beyond English | International documents, non-English content processing

Adding Capabilities

Capabilities are added to presets using a comma-separated format:

<base_preset>,<capability1>,<capability2>

Examples:
# Base preset with multilingual capability
ai:
  preset: "baseline-24g,capabilities.multilingual"

# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"

# Multiple capabilities (if more become available)
ai:
  preset: "baseline-48g,capabilities.multilingual,capabilities.feature2"
Capabilities can be stacked with any preset type including base, alternative, and experimental presets.
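The comma-separated format is simple enough to assemble programmatically. The sketch below is a hypothetical helper (not part of Zylon) that joins a base or alternative preset with capability names, normalizing them to the capabilities.<name> form shown above:

```python
def compose_preset(base: str, *capabilities: str) -> str:
    """Join a base or alternative preset with zero or more capabilities.

    Capability names are normalized to the "capabilities.<name>" form
    used in Zylon preset strings. Hypothetical helper for illustration.
    """
    parts = [base]
    for cap in capabilities:
        parts.append(cap if cap.startswith("capabilities.") else f"capabilities.{cap}")
    return ",".join(parts)

# Base preset plus the multilingual capability
print(compose_preset("baseline-24g", "multilingual"))
# baseline-24g,capabilities.multilingual
```

The same call works with alternative presets, e.g. compose_preset("alternatives.baseline-48g-context", "multilingual").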

Multi-GPU Configuration

If your system has multiple GPUs, you can combine their memory capacity to use higher-tier presets. Select the preset based on total combined VRAM across all GPUs.

Configuration Steps

  1. Calculate total VRAM: Add up the memory of all GPUs
  2. Select appropriate preset: Choose preset for the total memory
  3. Configure GPU count: Set the numGPUs parameter
Configuration Example:
ai:
  preset: "baseline-48g"
  numGPUs: 2  # Using 2 GPUs with 24GB each (48GB total)
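The three selection steps can be sketched in code. This is a hypothetical helper, not a Zylon API; the tier names come from the preset tables in this guide:

```python
# Preset tiers from this guide, keyed by minimum total VRAM in GB.
# Hypothetical helper for illustration, not part of Zylon.
PRESET_TIERS = [(96, "baseline-96g"), (48, "baseline-48g"), (24, "baseline-24g")]

def select_preset(gpu_memories_gb: list[int]) -> dict:
    """Pick a preset tier from the combined VRAM of all GPUs."""
    total = sum(gpu_memories_gb)  # step 1: calculate total VRAM
    for min_vram, preset in PRESET_TIERS:
        if total >= min_vram:  # step 2: choose preset for the total memory
            # step 3: numGPUs is simply the number of GPUs present
            return {"preset": preset, "numGPUs": len(gpu_memories_gb)}
    raise ValueError(f"{total}GB total VRAM is below the smallest tier (24GB)")

# 2x 24GB GPUs -> {'preset': 'baseline-48g', 'numGPUs': 2}
print(select_preset([24, 24]))
```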

Multi-GPU Configuration Examples

Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration
2x RTX 4090    | 24GB each             | 48GB       | baseline-48g       | numGPUs: 2
2x L4          | 24GB each             | 48GB       | baseline-48g       | numGPUs: 2
4x RTX 4090    | 24GB each             | 96GB       | baseline-96g       | numGPUs: 4
2x RTX A6000   | 48GB each             | 96GB       | baseline-96g       | numGPUs: 2

Multi-GPU Best Practices

  • Ensure all GPUs are the same model for optimal performance
  • Verify adequate PCIe bandwidth between GPUs
  • Monitor GPU utilization to ensure balanced load
  • Consider NVLink connections for better inter-GPU communication when available
Complete Multi-GPU Example:
ai:
  preset: "baseline-96g,capabilities.multilingual"
  numGPUs: 4  # 4x RTX 4090 (24GB each = 96GB total)

Shared Memory Configuration

The Triton Inference Server uses shared memory to enable zero-copy data transfer between Zylon services and the inference engine. This eliminates serialization overhead, significantly improving inference throughput and reducing latency for high-volume workloads.
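To see why shared memory avoids serialization overhead, consider this standalone Python sketch using the standard library's multiprocessing.shared_memory. It illustrates the zero-copy idea only; it is not Triton's actual implementation. The producer writes a tensor directly into a shared segment, and the consumer attaches by name and reads the same bytes without copying or serializing anything:

```python
import numpy as np
from multiprocessing import shared_memory

# Producer side: write a tensor directly into a shared memory segment.
data = np.arange(1000, dtype=np.float32)
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
producer_view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
producer_view[:] = data  # one write into shared memory; nothing is serialized

# Consumer side: attach to the same segment by name and read it zero-copy.
attached = shared_memory.SharedMemory(name=shm.name)
consumer_view = np.ndarray(data.shape, dtype=np.float32, buffer=attached.buf)
total = float(consumer_view.sum())  # operates on the producer's bytes directly

attached.close()
shm.close()
shm.unlink()
```

In a real deployment the producer and consumer are separate processes; only the segment name crosses the process boundary, not the data itself.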

Default Allocation

By default, the inference server allocates 2GB of RAM for shared memory. This is sufficient for most text-based inference workloads.

When to Increase Shared Memory

You may encounter "Shared memory allocation failed" errors in these scenarios:
  • Large request queues: Processing high volumes of concurrent requests where queued input exceeds available shared memory
  • Image-based models: Vision workloads requiring multiple megabytes per image where batches of high-resolution images quickly exhaust the default allocation
  • Large document processing: Handling very large documents or multiple documents simultaneously

Configuration

To increase the shared memory limit, update your Zylon configuration file:
triton:
  sharedMemory:
    limit: "4Gi"  # Increase from default 2Gi

Use Case                 | Recommended Limit | Reason
Text-only inference      | 2Gi (default)     | Sufficient for most text workloads
Low-volume vision tasks  | 4Gi               | Handles occasional image processing
High-volume vision tasks | 8Gi               | Supports batch image processing
Mixed heavy workloads    | 8-16Gi            | Accommodates concurrent text and vision
Important considerations when increasing shared memory:
  • Allocating excessive shared memory can cause pods to be OOMKilled
  • Be conservative with increases—start with small increments (e.g., 2Gi → 4Gi)
  • Monitor actual usage with Kubernetes metrics before further increases
  • Shared memory is reserved from system RAM, reducing available memory for other processes
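The sizing guidance above can be sanity-checked with quick arithmetic. The sketch below is a back-of-the-envelope estimate only; the batch shape, float32 assumption, and 2x headroom factor are illustrative assumptions, not Zylon defaults:

```python
def estimate_shared_memory_gib(batch, height, width, channels=3,
                               bytes_per_element=4, concurrent_requests=1,
                               headroom=2.0):
    """Back-of-the-envelope shared memory estimate for image batches.

    bytes_per_element=4 assumes float32 inputs; headroom doubles the raw
    figure to cover queued requests and output buffers. Illustrative only.
    """
    raw = batch * height * width * channels * bytes_per_element * concurrent_requests
    return raw * headroom / (1024 ** 3)

# e.g. 16 images at 1024x1024 RGB float32, 4 concurrent requests -> ~1.5 GiB
gib = estimate_shared_memory_gib(16, 1024, 1024, concurrent_requests=4)
print(f"~{gib:.1f} GiB estimated; round up to the next limit in the table above")
```

An estimate near the default 2Gi suggests staying put; one well above it suggests the next tier, confirmed by monitoring actual usage as described above.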