The Zylon AI inference engine is the core component that runs artificial intelligence models on your hardware. To ensure optimal performance and prevent startup failures, you must configure the system with the correct preset for your available GPU (Graphics Processing Unit) memory.
1. What are AI Presets?
AI presets are pre-configured settings that optimize the AI models and memory allocation for your specific hardware setup. Each preset is carefully tuned to:
- Load the appropriate AI model size for your GPU memory
- Allocate memory efficiently to prevent crashes
- Balance performance with available resources
- Enable specific capabilities when needed
Important: Selecting an incorrect preset will prevent the inference engine from starting. The system does not automatically detect your GPU capacity, so manual configuration is required.
2. Understanding GPU Memory Requirements
Your GPU (Graphics Processing Unit) has a specific amount of VRAM (Video Random Access Memory) that determines which AI models can run effectively. AI models require substantial memory to operate, and larger models with better capabilities need more VRAM.
How to check your GPU memory:
- Run the `nvidia-smi` command
- Refer to your hardware documentation
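For example, `nvidia-smi` can report each GPU's total VRAM directly (the query flags below are standard `nvidia-smi` options):

```bash
# List each GPU's name and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```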
3. Presets
Set the AI preset in your Zylon configuration file using the `ai.preset` property. The default configuration uses a 24GB setup.
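For reference, the default 24GB setup corresponds to the `baseline-24g` preset described below:

```yaml
ai:
  preset: "baseline-24g"
```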
3.1 Base Presets
| Preset | Required GPU Memory | Compatible Hardware Examples |
|---|---|---|
| `baseline-24g` | 24GB | RTX 4090, L4, RTX 3090 Ti |
| `baseline-32g` | 32GB | RTX 5090 |
| `baseline-48g` | 48GB | RTX A6000, A40, L40, L40S |
| `baseline-96g` | 96GB | A100 80GB, H100, A6000 (dual) |
Choose the preset that matches your GPU memory capacity; if no preset matches exactly, select the next one below your available VRAM. Never select a preset that requires more memory than your GPU provides.
```yaml
ai:
  preset: "baseline-48g" # For a system with an L40S (48GB)
```
3.2 Alternative Presets
Zylon also provides alternative presets with specialized configurations that trade certain capabilities for others. These are optional and should only be used when you have specific requirements that the standard presets do not meet.
```yaml
# For document and image processing
ai:
  preset: "alternatives.baseline-96g-vision"

# For extended context processing
ai:
  preset: "alternatives.baseline-48g-context"
```
Vision-Enabled Alternatives
These presets are only available in Zylon versions later than v1.44.0.
These presets include specialized computer vision capabilities in the ingestion pipeline, allowing the system to process and understand images, documents, and visual content.
Useful for document digitization, image analysis, or slide processing.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| `alternatives.baseline-48g-vision` | 48GB | Smaller model (Qwen 3 14B) |
| `alternatives.baseline-96g-vision` | 96GB | Smaller model (Qwen 3 14B) |
When to use vision-enabled presets:
- Processing scanned documents and slides
- Analyzing charts, graphs, and visual data
- Image understanding and description tasks
Context-Optimized Alternatives
These presets use smaller AI models to provide significantly larger context windows.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| `alternatives.baseline-48g-context` | 48GB | Smaller model (Qwen 3 14B) |
| `alternatives.baseline-96g-context` | 96GB | Smaller model (Qwen 3 14B) |
When to use context-optimized presets:
- Extended conversation sessions
- Complex analysis requiring large amounts of context
Important: A larger context window does not always yield better results.
3.3 Experimental Presets
Experimental presets are under active development and may not be stable. Use only in testing environments.
Experimental presets provide access to cutting-edge models and configurations that are being evaluated for future releases. These presets may have different performance characteristics or stability compared to baseline presets.
| Preset | Required GPU Memory | Model Family | Status |
|---|---|---|---|
| `experimental.mistral-24g` | 24GB | Mistral | Beta |
| `experimental.mistral-48g` | 48GB | Mistral | Beta |
| `experimental.gpt-oss-24g` | 24GB | GPT-OSS | Beta |
| `experimental.gpt-oss-48g` | 48GB | GPT-OSS | Beta |
| `experimental.gemma-24g` | 24GB | Gemma 3 | Alpha |
Usage Example:
```yaml
ai:
  preset: "experimental.gpt-oss-24g"
```
Important Notes:
- Experimental presets may be removed or significantly changed between versions
- Performance and stability are not guaranteed
- Not recommended for production environments
- May require additional configuration parameters
4. Enhanced Capabilities (Optional)
Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.
| Capability | Description | Example Use Cases |
|---|---|---|
| `multilingual` | Enhanced support for languages beyond English | International documents, non-English content processing |
Capabilities are added to presets using a comma-separated format: `<base_preset>,<capability1>,<capability2>`
```yaml
# Base preset with multilingual capability
ai:
  preset: "baseline-24g,capabilities.multilingual"

# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"
```
5. Multi-GPU Configuration (Optional)
If your system has multiple GPUs, you can combine their memory capacity. Select the preset based on total combined VRAM across all GPUs.
Multi-GPU Setup Steps
1. Calculate total VRAM: add up the memory of all GPUs
2. Select the appropriate preset for the total memory
3. Configure the number of GPUs in your configuration
```yaml
ai:
  preset: "baseline-48g"
  numGPUs: 2 # Using 2 GPUs with 24GB each (48GB total)
```
Multi-GPU Examples
| Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration |
|---|---|---|---|---|
| 2x RTX 4090 | 24GB each | 48GB | `baseline-48g` | `numGPUs: 2` |
| 2x L4 | 24GB each | 48GB | `baseline-48g` | `numGPUs: 2` |
| 4x RTX 4090 | 24GB each | 96GB | `baseline-96g` | `numGPUs: 4` |
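For instance, the 4x RTX 4090 row from the table translates to the following configuration:

```yaml
ai:
  preset: "baseline-96g"
  numGPUs: 4 # Using 4 GPUs with 24GB each (96GB total)
```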
6. Customizing the Default Presets
Custom configurations can cause system malfunction. Perform thorough testing before deployment.
Zylon supports custom model configurations for customers who need to use specialized LLM/Embeddings models. This advanced feature allows you to override the default models while maintaining system compatibility.
6.1 Supported Model Families
Zylon supports the following model families for custom configurations:
- Qwen 3 - https://huggingface.co/Qwen/Qwen3-14B (default in baseline presets, or any model from the Qwen 3 family)
- Mistral Small 3 - https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
- Gemma 3 - https://huggingface.co/google/gemma-3-12b-it (or any model from the Gemma 3 family)
- Gemma 3n - https://huggingface.co/google/gemma-3n-E4B-it (or any model from the Gemma 3n family)
- GPT-OSS - https://huggingface.co/openai/gpt-oss-20b (or any model from the GPT-OSS family)
6.2 Custom Model Configuration
To customize the models used by a preset, modify your `/etc/config/zylon-config.yaml` file by adding a `config` section with model specifications.
Basic Structure:
```yaml
ai:
  preset: "<preset>"
  config:
    models:
      - id: llm
        modelRepo: <huggingface-model-url>
```
Default models that can be customized:
- Primary LLM (`id: llm`): the main language model that handles all text generation tasks
- Embeddings model (`id: embed`): handles document embeddings and semantic search
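As a hedged sketch of overriding both model slots (the embeddings repository and `vectorDim` value below are illustrative assumptions, not documented defaults):

```yaml
ai:
  preset: "baseline-48g"
  config:
    models:
      # Main LLM override, using the default Qwen 3 model family
      - id: llm
        modelRepo: "Qwen/Qwen3-14B"
      # Embeddings override; repo and vectorDim are illustrative assumptions
      - id: embed
        modelRepo: "Qwen/Qwen3-Embedding-0.6B"
        vectorDim: 1024
```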
Parameters
| Parameter | Description | Valid Values |
|---|---|---|
| `id` | Model identifier | `llm`, `embed` |
| `name` | Custom model name | string |
| `type` | Model type | `llm`, `embedding` |
| `contextWindow` | Maximum context length | integer |
| `modelRepo` | HuggingFace model path | string |
| `gpuMemoryUtilization` | Fraction of GPU memory to use | 0.0-1.0 |
| `samplingParams` | Default sampling parameters | SamplingParams |
| `reasoningSamplingParams` | Reasoning sampling parameters | SamplingParams |
| `tokenizer` | HuggingFace tokenizer path (LLMs only) | string |
| `promptStyle` | Prompt formatting style (LLMs only) | `qwen`, `mistral`, `gemma`, `gpt-oss` |
| `supportReasoning` | Enable reasoning capabilities (LLMs only) | boolean |
| `supportImage` | Number of supported images (LLMs only) | integer |
| `supportAudio` | Number of supported audio inputs (LLMs only) | integer |
| `vectorDim` | Vector dimensions (embeddings only) | integer |
Sampling parameters:
| Parameter | Description | Range |
|---|---|---|
| `temperature` | Randomness in text generation | 0.0-2.0 |
| `maxTokens` | Maximum tokens in response | 1-8192 |
| `minP` | Minimum probability threshold | 0.0-1.0 |
| `topP` | Nucleus sampling threshold | 0.0-1.0 |
| `topK` | Top-K sampling limit | 1-100 |
| `repetitionPenalty` | Penalty for repeated tokens | 1.0-2.0 |
| `presencePenalty` | Penalty for token presence | -2.0 to 2.0 |
| `frequencyPenalty` | Penalty for token frequency | -2.0 to 2.0 |
6.3 Configuration Examples
Example 1: Custom Fine-tuned GPT-OSS Model
```yaml
ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      # Main LLM with custom fine-tuned model
      - id: llm
        modelRepo: "Jinx-org/Jinx-gpt-oss-20b-mxfp4"
        tokenizer: "openai/gpt-oss-20b"
```
Example 2: Custom Qwen 3 with Different Parameters
```yaml
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.7
          maxTokens: 4096
```
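Example 3: Tuning Context Window and GPU Memory Utilization (a hedged sketch combining parameters from the tables above; the values are illustrative, not recommended defaults)

```yaml
ai:
  preset: "baseline-48g"
  config:
    models:
      - id: llm
        contextWindow: 32768       # illustrative value, not a documented default
        gpuMemoryUtilization: 0.9  # fraction of GPU memory to use (0.0-1.0)
        samplingParams:
          temperature: 0.2
          topP: 0.9
```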
7. Deprecated Presets (Legacy Support)
For customers who require older configurations, deprecated presets remain available, but they are not recommended for new installations.
| Preset Pattern | Description | Recommendation |
|---|---|---|
| `deprecated.<size>g.20250710` | Pre-Qwen 3 model configurations | Upgrade to current presets when possible |
Example: `deprecated.24g.20250710` for a legacy 24GB configuration.
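A deprecated preset is set the same way as any other preset:

```yaml
ai:
  preset: "deprecated.24g.20250710"
```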
A deprecated preset may eventually be removed entirely, so plan to migrate to a current preset.
8. Troubleshooting
Engine fails to start with memory error:
- Verify your actual GPU memory with `nvidia-smi`
- Try the next lower preset (e.g., `baseline-24g` instead of `baseline-32g`)
- Remove optional capabilities to reduce memory usage
- Check for other applications using GPU memory
- Reboot the machine
Poor performance or slow responses:
- Ensure you’re using the correct preset for your hardware
- Consider moving to a lower-tier preset
- Contact Zylon engineers to help diagnose the issue