Documentation Index
Fetch the complete documentation index at: https://docs.zylon.ai/llms.txt
Use this file to discover all available pages before exploring further.
Enhanced Capabilities
Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.
Available Capabilities
| Capability | Description | Example Use Cases | Models |
|---|---|---|---|
| multilingual | Enhanced support for languages beyond English | International documents, non-English content processing | intfloat/multilingual-e5-large |
Adding Capabilities
Capabilities are added to presets using a comma-separated format: <base_preset>,<capability1>,<capability2>
Examples:
```yaml
# Base preset with multilingual capability
ai:
  preset: "baseline-48g,capabilities.multilingual"
```

```yaml
# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"
```

```yaml
# Multiple capabilities (if more become available)
ai:
  preset: "baseline-48g,capabilities.multilingual,capabilities.feature2"
```
Capabilities can be stacked with any preset type, including base, alternative, and experimental presets.
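The comma-separated preset string can be split apart with ordinary string handling. A minimal sketch; the `parse_preset` helper is hypothetical and not part of Zylon, which parses this format internally:

```python
def parse_preset(value: str) -> tuple[str, list[str]]:
    """Split a preset string into its base preset and capability names.

    Hypothetical helper for illustration only; Zylon performs this
    parsing internally.
    """
    base, *extras = [part.strip() for part in value.split(",")]
    # Capability entries carry a "capabilities." prefix per the format above.
    capabilities = [e.removeprefix("capabilities.") for e in extras]
    return base, capabilities

print(parse_preset("baseline-48g,capabilities.multilingual"))
# ('baseline-48g', ['multilingual'])
```

The first element is always the base (or alternative) preset; everything after it is treated as a capability.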
Multi-GPU Configuration
If your system has multiple GPUs, you can combine their memory capacity to use higher-tier presets. Select the preset based on total combined VRAM across all GPUs.
Configuration Steps
- Calculate total VRAM: Add up the memory of all GPUs
- Select appropriate preset: Choose the preset that matches the total memory
- Configure GPU count: Set the `numGPUs` parameter
Configuration Example:
```yaml
ai:
  preset: "baseline-48g"
  numGPUs: 2  # Using 2 GPUs with 24GB each (48GB total)
```
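The VRAM total from step 1 is just a sum over per-GPU memory. A quick sketch, where the sample values stand in for two 24GB cards; on a real host you would collect these from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB per GPU:

```python
# Sample per-GPU VRAM values in MiB (two 24GB GPUs); replace with the
# values reported by nvidia-smi on your own host.
gpu_vram_mib = [24576, 24576]

total_mib = sum(gpu_vram_mib)
print(f"Total VRAM: {total_mib} MiB ({total_mib // 1024} GiB)")
# Total VRAM: 49152 MiB (48 GiB)
```

A 48GB total qualifies for the baseline-48g preset with `numGPUs: 2`, as in the example above.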
Multi-GPU Configuration Examples
| Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration |
|---|---|---|---|---|
| 2x RTX 4090 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 2x L4 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 4x RTX 4090 | 24GB each | 96GB | baseline-96g | numGPUs: 4 |
| 2x RTX A6000 | 48GB each | 96GB | baseline-96g | numGPUs: 2 |
Multi-GPU Best Practices
- Ensure all GPUs are the same model for optimal performance
- Verify adequate PCIe bandwidth between GPUs
- Monitor GPU utilization to ensure balanced load
- Consider NVLink connections for better inter-GPU communication when available
Complete Multi-GPU Example:
```yaml
ai:
  preset: "baseline-96g,capabilities.multilingual"
  numGPUs: 4  # 4x RTX 4090 (24GB each = 96GB total)
```
Shared Memory Configuration
The Triton Inference Server uses shared memory for zero-copy data transfer between Zylon services and the inference engine. This eliminates serialization overhead, significantly improving inference throughput and reducing latency for high-volume workloads.
Default Allocation
By default, the inference server allocates 2GB of RAM for shared memory. This is sufficient for most text-based inference workloads.
When to Increase Shared Memory
You may encounter `Shared memory allocation failed` errors in these scenarios:
- Large request queues: Processing high volumes of concurrent requests where queued input exceeds available shared memory
- Image-based models: Vision workloads requiring multiple megabytes per image where batches of high-resolution images quickly exhaust the default allocation
- Large document processing: Handling very large documents or multiple documents simultaneously
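A rough back-of-the-envelope estimate helps gauge whether the default 2GB covers an image workload. A sketch assuming raw RGB float32 input tensors; actual shared-memory demand depends on the model's input format and how requests are batched:

```python
def batch_bytes(width: int, height: int, channels: int = 3,
                dtype_bytes: int = 4, batch: int = 1) -> int:
    """Bytes needed for a batch of raw image tensors (e.g. float32 RGB)."""
    return width * height * channels * dtype_bytes * batch

# 64 concurrent 1024x1024 RGB float32 images:
demand = batch_bytes(1024, 1024, batch=64)
print(f"{demand / 2**30:.2f} GiB")
# 0.75 GiB
```

At this rate, a handful of concurrent batches would exhaust the default allocation, which is why vision workloads warrant a higher limit.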
Configuration
To increase the shared memory limit, update your Zylon configuration file:
```yaml
triton:
  sharedMemory:
    limit: "4Gi"  # Increase from default 2Gi
```
Recommended Shared Memory by Use Case
| Use Case | Recommended Limit | Reason |
|---|---|---|
| Text-only inference | 2Gi (default) | Sufficient for most text workloads |
| Low-volume vision tasks | 4Gi | Handles occasional image processing |
| High-volume vision tasks | 8Gi | Supports batch image processing |
| Mixed heavy workloads | 8-16Gi | Accommodates concurrent text and vision |
Important considerations when increasing shared memory:
- Allocating excessive shared memory can cause pods to be OOMKilled
- Be conservative with increases: start with small increments (e.g., 2Gi → 4Gi)
- Monitor actual usage with Kubernetes metrics before further increases
- Shared memory is reserved from system RAM, reducing available memory for other processes