The Zylon AI inference engine is the core component that runs artificial intelligence models on your hardware. To ensure optimal performance and prevent startup failures, you must configure the system with the correct preset based on your available GPU (Graphics Processing Unit) memory.
1. What are AI Presets?
AI presets are pre-configured settings that optimize the AI models and memory allocation for your specific hardware setup. Each preset is carefully tuned to:
- Load the appropriate AI model size for your GPU/RAM memory
- Allocate memory efficiently to prevent crashes
- Balance performance with available resources
- Enable specific capabilities when needed
Important: Selecting an incorrect preset will prevent the inference engine from starting. The system does not automatically detect your GPU capacity, so manual configuration is required.
2. Understanding GPU Memory Requirements
Your GPU (Graphics Processing Unit) has a specific amount of VRAM (Video Random Access Memory) that determines which AI models can run effectively. AI models require substantial memory to operate, and larger models with better capabilities need more VRAM.
How to check your GPU memory:
- Run the nvidia-smi command (total VRAM appears in the Memory-Usage column)
- Refer to your hardware documentation
3. Presets
Set the AI preset in your Zylon configuration file using the ai.preset property. The default configuration uses a 24GB setup.
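For example, to make the default 24GB setup explicit in your configuration (assuming the default corresponds to the baseline-24g preset listed below):
ai:
  preset: "baseline-24g" # Default 24GB setup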
3.1 Base Presets
| Preset | Required GPU Memory | Compatible Hardware Examples |
|---|---|---|
| baseline-24g | 24GB | RTX 4090, L4, RTX 3090 Ti |
| baseline-32g | 32GB | RTX 5090 |
| baseline-48g | 48GB | RTX A6000, A40, L40, L40s |
| baseline-96g | 96GB | A100 80GB, H100, A6000 (dual) |
Select the preset whose memory requirement matches your available VRAM, or the next one below it; never select a preset that requires more VRAM than you have.
ai:
  preset: "baseline-48g" # For a system with L40s (48GB)
3.2 Alternative Presets
Zylon also provides alternative presets that offer specialized configurations trading certain capabilities for others. These are optional and should only be used when you have specific requirements that differ from the standard presets.
# For document and image processing
ai:
  preset: "alternatives.baseline-96g-vision"
# For extended context processing
ai:
  preset: "alternatives.baseline-48g-context"
Vision-Enabled Alternatives
These presets are only available in Zylon versions later than v1.44.0.
These presets include specialized computer vision capabilities in the ingestion pipeline, allowing the system to process and understand images, documents, and visual content. They are useful for document digitization, image analysis, and slide processing.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| alternatives.baseline-48g-vision | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-vision | 96GB | Smaller model (Qwen 3 14B) |
When to use vision-enabled presets:
- Processing scanned documents and slide decks
- Analyzing charts, graphs, and visual data
- Image understanding and description tasks
Using these presets may require increasing the inference server's shared memory; see Section 8, Shared memory, for instructions.
Context-Optimized Alternatives
These presets use smaller AI models to provide significantly larger context windows.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| alternatives.baseline-48g-context | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-context | 96GB | Smaller model (Qwen 3 14B) |
When to use context-optimized presets:
- Extended conversation sessions
- Complex analysis requiring large amounts of context
Important: A larger context window does not always yield better results.
3.3 Experimental Presets
Experimental presets are under active development and may not be stable. Use only in testing environments.
Experimental presets provide access to cutting-edge models and configurations that are being evaluated for future releases. These presets may have different performance characteristics or stability compared to baseline presets.
| Preset | Required GPU Memory | Model Family | Status |
|---|---|---|---|
| experimental.mistral-24g | 24GB | Mistral | Beta |
| experimental.mistral-48g | 48GB | Mistral | Beta |
| experimental.gpt-oss-24g | 24GB | GPT-OSS | Beta |
| experimental.gpt-oss-48g | 48GB | GPT-OSS | Beta |
| experimental.gemma-24g | 24GB | Gemma 3 | Alpha |
Usage Example:
ai:
  preset: "experimental.gpt-oss-24g"
Important Notes:
- Experimental presets may be removed or significantly changed between versions
- Performance and stability are not guaranteed
- Not recommended for production environments
- May require additional configuration parameters
4. Enhanced Capabilities (Optional)
Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.
| Capability | Description | Example Use Cases |
|---|---|---|
| multilingual | Enhanced support for languages beyond English | International documents, non-English content processing |
Capabilities are added to presets using a comma-separated format: <base_preset>,<capability1>,<capability2>
# Base preset with multilingual capability
ai:
  preset: "baseline-24g,capabilities.multilingual"
# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"
5. Multi-GPU Configuration (Optional)
If your system has multiple GPUs, you can combine their memory capacity. Select the preset based on total combined VRAM across all GPUs.
Multi-GPU Setup Steps
- Calculate total VRAM: Add up the memory of all GPUs
- Select the appropriate preset for the total memory
- Configure the number of GPUs in your configuration
ai:
  preset: "baseline-48g"
  numGPUs: 2 # Using 2 GPUs with 24GB each (48GB total)
Multi-GPU Examples
| Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration |
|---|---|---|---|---|
| 2x RTX 4090 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 2x L4 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 4x RTX 4090 | 24GB each | 96GB | baseline-96g | numGPUs: 4 |
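For example, the four-GPU setup in the last row would be configured as:
ai:
  preset: "baseline-96g"
  numGPUs: 4 # Using 4 GPUs with 24GB each (96GB total)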
6. Customizing the Default Presets
Custom configurations can cause system malfunction. Perform thorough testing before deployment.
Zylon supports custom model configurations for customers who need to use specialized LLM/Embeddings models. This advanced feature allows you to override the default models while maintaining system compatibility.
6.1 Supported Model Families
Zylon supports the following model families for custom configurations:
- Qwen 3: https://huggingface.co/Qwen/Qwen3-14B (default in baseline presets, or any model from the Qwen 3 family)
- Mistral Small 3: https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
- Gemma 3: https://huggingface.co/google/gemma-3-12b-it (or any model from the Gemma 3 family)
- Gemma 3n: https://huggingface.co/google/gemma-3n-E4B-it (or any model from the Gemma 3n family)
- GPT-OSS: https://huggingface.co/openai/gpt-oss-20b (or any model from the GPT-OSS family)
6.2 Custom Model Configuration
To customize the models used by a preset, modify your /etc/config/zylon-config.yaml file by adding a config section with model specifications.
Basic Structure:
ai:
  preset: "<preset>"
  config:
    models:
      - id: llm
        modelRepo: <huggingface-model-url>
Default models that can be customized:
- Primary LLM (id: llm): the main language model that handles all text generation tasks
- Embeddings model (id: embed): handles document embeddings and semantic search
Parameters
| Parameter | Description | Valid Values |
|---|---|---|
| id | Model identifier | llm, embed |
| name | Custom model name | string |
| type | Model type | llm, embedding |
| contextWindow | Maximum context length | integer |
| modelRepo | HuggingFace model path | string |
| gpuMemoryUtilization | Fraction of GPU memory to use | 0.0-1.0 |
| samplingParams | Default sampling parameters | SamplingParams |
| reasoningSamplingParams | Reasoning sampling parameters | SamplingParams |
| tokenizer | HuggingFace tokenizer path (LLMs only) | string |
| promptStyle | Prompt formatting style (LLMs only) | qwen, mistral, gemma, gpt-oss |
| supportReasoning | Enable reasoning capabilities (LLMs only) | boolean |
| supportImage | Number of supported images (LLMs only) | integer |
| supportAudio | Number of supported audio inputs (LLMs only) | integer |
| vectorDim | Vector dimensions (embeddings only) | integer |
Sampling parameters:
| Parameter | Description | Range |
|---|---|---|
| temperature | Randomness in text generation | 0.0-2.0 |
| maxTokens | Maximum tokens in response | 1-8192 |
| minP | Minimum probability threshold | 0.0-1.0 |
| topP | Nucleus sampling threshold | 0.0-1.0 |
| topK | Top-K sampling limit | 1-100 |
| repetitionPenalty | Penalty for repeated tokens | 1.0-2.0 |
| presencePenalty | Penalty for token presence | -2.0 to 2.0 |
| frequencyPenalty | Penalty for token frequency | -2.0 to 2.0 |
6.3 Configuration Examples
Example 1: Custom Fine-tuned GPT-OSS Model
ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      # Main LLM with custom fine-tuned model
      - id: llm
        modelRepo: "Jinx-org/Jinx-gpt-oss-20b-mxfp4"
        tokenizer: "openai/gpt-oss-20b"
Example 2: Custom Qwen 3 with Different Parameters
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.7
          maxTokens: 4096
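Example 3: Custom Embeddings Model
The embeddings model (id: embed) can be overridden in the same way as the LLM. The following is a minimal sketch assuming a hypothetical replacement repository; substitute your own model and set vectorDim to the dimension it actually produces:
ai:
  preset: "baseline-24g"
  config:
    models:
      # Embeddings override; repository and dimension are illustrative
      - id: embed
        type: embedding
        modelRepo: "BAAI/bge-m3"
        vectorDim: 1024 # Must match the model's output dimension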
7. Deprecated Presets (Legacy Support)
For customers who require older configurations, deprecated presets remain available but are not recommended for new installations.
| Preset Pattern | Description | Recommendation |
|---|---|---|
| deprecated.<size>g.20250710 | Pre-Qwen 3 model configurations | Upgrade to current presets when possible |
Example: deprecated.24g.20250710 for legacy 24GB configuration
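In the configuration file this uses the same ai.preset property as above:
ai:
  preset: "deprecated.24g.20250710" # Legacy 24GB configuration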
8. Shared memory
Our inference server (the Triton Inference Server) uses shared memory to enable zero-copy data transfer between Zylon services and the inference engine. This eliminates the overhead of serializing and deserializing data across process boundaries, significantly improving inference throughput and reducing latency for high-volume workloads.
By default, the inference server allocates 2GB of RAM for shared memory. This is sufficient for most text-based inference workloads, but you may encounter "Shared memory allocation failed" errors in scenarios such as:
- Large request queues: When processing high volumes of concurrent requests, the queued input can exceed available shared memory
- Image-based models: Vision workloads often require multiple megabytes per image. A batch of high-resolution images can quickly exhaust the default allocation
Updating Shared Memory Allocation
To increase the shared memory limit, update your Zylon configuration file and apply the changes.
triton:
  sharedMemory:
    limit: "4Gi" # Increase from default 2Gi
- Allocating excessive shared memory can cause other pods to be OOMKilled
- Be conservative with increases: start with small increments (e.g., 2Gi → 4Gi) and adjust based on observed metrics
9. Troubleshooting
Engine fails to start with memory error:
- Verify your actual GPU memory with nvidia-smi
- Try the next lower preset (e.g., baseline-24g instead of baseline-32g)
- Remove optional capabilities to reduce memory usage
- Check for other applications using GPU memory
- Reboot the machine
Poor performance or slow responses:
- Ensure you’re using the correct preset for your hardware
- Consider switching to a lower-tier preset
- Contact Zylon engineers to help diagnose the issue