The Zylon AI inferencing engine is the core component that runs artificial intelligence models on your hardware. To ensure optimal performance and prevent startup failures, you must configure the system with the correct preset based on your available GPU (Graphics Processing Unit) memory.

1. What are AI Presets?

AI presets are pre-configured settings that optimize the AI models and memory allocation for your specific hardware setup. Each preset is carefully tuned to:
  • Load the appropriate AI model size for your GPU memory
  • Allocate memory efficiently to prevent crashes
  • Balance performance with available resources
  • Enable specific capabilities when needed
Important: Selecting an incorrect preset will prevent the inference engine from starting. The system does not automatically detect your GPU capacity, so manual configuration is required.

2. Understanding GPU Memory Requirements

Your GPU has a specific amount of VRAM (Video Random Access Memory), which determines which AI models can run effectively. AI models require substantial memory to operate, and larger models with better capabilities need more VRAM.

How to check your GPU memory:
  • Run the nvidia-smi command
  • Refer to your hardware documentation

3. Presets

Set the AI preset in your Zylon configuration file using the ai.preset property. The default configuration assumes a 24GB setup.

3.1 Base Presets

| Preset | Required GPU Memory | Compatible Hardware Examples |
| --- | --- | --- |
| baseline-24g | 24GB | RTX 4090, L4, RTX 3090 Ti |
| baseline-32g | 32GB | RTX 5090 |
| baseline-48g | 48GB | RTX A6000, A40, L40, L40s |
| baseline-96g | 96GB | A100 80GB, H100, A6000 (dual) |
Choose the preset that matches your GPU memory capacity; never select a preset that requires more VRAM than you have available.
ai:
  preset: "baseline-48g"  # For a system with L40s (48GB)

3.2 Alternative Presets

Zylon also provides alternative presets that offer specialized configurations trading certain capabilities for others. These are optional and should only be used when you have specific requirements that differ from the standard presets.
# For document and image processing
ai:
  preset: "alternatives.baseline-96g-vision"

# For extended context processing
ai:
  preset: "alternatives.baseline-48g-context"

Vision-Enabled Alternatives

Note: These presets are only available in Zylon versions later than v1.44.0.
These presets include specialized computer vision capabilities in the ingestion pipeline, allowing the system to process and understand images, documents, and other visual content. They are useful for document digitization, image analysis, and slide processing.
| Preset | Required GPU Memory | Trade-off |
| --- | --- | --- |
| alternatives.baseline-48g-vision | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-vision | 96GB | Smaller model (Qwen 3 14B) |
When to use vision-enabled presets:
  • Processing scanned documents and slides
  • Analyzing charts, graphs, and visual data
  • Image understanding and description tasks

Context-Optimized Alternatives

These presets use smaller AI models to provide significantly larger context windows.
| Preset | Required GPU Memory | Trade-off |
| --- | --- | --- |
| alternatives.baseline-48g-context | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-context | 96GB | Smaller model (Qwen 3 14B) |
When to use context-optimized presets:
  • Extended conversation sessions
  • Complex analysis requiring large amounts of context
Important: A larger context window does not always yield better results.

3.3 Experimental Presets

Experimental presets are under active development and may not be stable. Use them only in testing environments.
Experimental presets provide access to cutting-edge models and configurations that are being evaluated for future releases. They may differ from baseline presets in performance characteristics and stability.
| Preset | Required GPU Memory | Model Family | Status |
| --- | --- | --- | --- |
| experimental.mistral-24g | 24GB | Mistral | Beta |
| experimental.mistral-48g | 48GB | Mistral | Beta |
| experimental.gpt-oss-24g | 24GB | GPT-OSS | Beta |
| experimental.gpt-oss-48g | 48GB | GPT-OSS | Beta |
| experimental.gemma-24g | 24GB | Gemma 3 | Alpha |
Usage Example:
ai:
  preset: "experimental.gpt-oss-24g"
Important Notes:
  • Experimental presets may be removed or significantly changed between versions
  • Performance and stability are not guaranteed
  • Not recommended for production environments
  • May require additional configuration parameters

4. Enhanced Capabilities (Optional)

Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.
| Capability | Description | Example Use Cases |
| --- | --- | --- |
| multilingual | Enhanced support for languages beyond English | International documents, non-English content processing |
Capabilities are added to presets using a comma-separated format: <base_preset>,<capability1>,<capability2>
# Base preset with multilingual capability
ai:
  preset: "baseline-24g,capabilities.multilingual"

# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"

5. Multi-GPU Configuration (Optional)

If your system has multiple GPUs, you can combine their memory capacity. Select the preset based on total combined VRAM across all GPUs.

Multi-GPU Setup Steps

  1. Calculate total VRAM: Add up the memory of all GPUs
  2. Select the appropriate preset for the total memory
  3. Configure the number of GPUs in your configuration
ai:
  preset: "baseline-48g"
  numGPUs: 2  # Using 2 GPUs with 24GB each (48GB total)

Multi-GPU Examples

| Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration |
| --- | --- | --- | --- | --- |
| 2x RTX 4090 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 2x L4 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 4x RTX 4090 | 24GB each | 96GB | baseline-96g | numGPUs: 4 |
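For instance, the four-GPU setup in the last row maps to this configuration:

```yaml
# 4x RTX 4090 (24GB each): 96GB total VRAM
ai:
  preset: "baseline-96g"
  numGPUs: 4
```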

6. Customizing the Default Presets

Custom configurations can cause system malfunction. Perform thorough testing before deployment.
Zylon supports custom model configurations for customers who need to use specialized LLM/Embeddings models. This advanced feature allows you to override the default models while maintaining system compatibility.

6.1 Supported Model Families

Zylon supports the following model families for custom configurations:
  1. Qwen 3 - https://huggingface.co/Qwen/Qwen3-14B (default in baseline presets, or any model from the Qwen 3 family)
  2. Mistral Small 3 - https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
  3. Gemma 3 - https://huggingface.co/google/gemma-3-12b-it (or any from the Gemma 3 family)
  4. Gemma 3n - https://huggingface.co/google/gemma-3n-E4B-it (or any from the Gemma 3n family)
  5. GPT-OSS - https://huggingface.co/openai/gpt-oss-20b (or any from the GPT-OSS family)

6.2 Custom Model Configuration

To customize the models used by a preset, modify your /etc/config/zylon-config.yaml file by adding a config section with model specifications.

Basic structure:
ai:
  preset: "<preset>"
  config:
    models:
      - id: llm
        modelRepo: <huggingface-model-url>

Customizable default models:

  1. Primary LLM (id: llm) - This is the main language model that handles all text generation tasks
  2. Embeddings Model (id: embed) - This handles document embeddings and semantic search
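Both model identifiers can be overridden in a single config section. A minimal sketch follows; the Qwen repository is from the supported list in section 6.1, while the embeddings repository shown is a hypothetical placeholder to be replaced with your own model:

```yaml
ai:
  preset: "baseline-48g"
  config:
    models:
      # Primary LLM override (Qwen 3 family, see section 6.1)
      - id: llm
        modelRepo: "Qwen/Qwen3-14B"
      # Embeddings override (hypothetical repository; replace with your model)
      - id: embed
        modelRepo: "your-org/your-embeddings-model"
```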

Parameters

| Parameter | Description | Valid Values |
| --- | --- | --- |
| id | Model identifier | llm, embed |
| name | Custom model name | string |
| type | Model type | llm, embedding |
| contextWindow | Maximum context length | integer |
| modelRepo | HuggingFace model path | string |
| gpuMemoryUtilization | Fraction of GPU memory to use | 0.0-1.0 |
| samplingParams | Default sampling parameters | SamplingParams |
| reasoningSamplingParams | Reasoning sampling parameters | SamplingParams |
| tokenizer | HuggingFace tokenizer path (LLMs only) | string |
| promptStyle | Prompt formatting style (LLMs only) | qwen, mistral, gemma, gpt-oss |
| supportReasoning | Enable reasoning capabilities (LLMs only) | boolean |
| supportImage | Number of supported images (LLMs only) | integer |
| supportAudio | Number of supported audio inputs (LLMs only) | integer |
| vectorDim | Vector dimensions (embeddings only) | integer |
Sampling parameters:
| Parameter | Description | Range |
| --- | --- | --- |
| temperature | Randomness in text generation | 0.0-2.0 |
| maxTokens | Maximum tokens in response | 1-8192 |
| minP | Minimum probability threshold | 0.0-1.0 |
| topP | Nucleus sampling threshold | 0.0-1.0 |
| topK | Top-K sampling limit | 1-100 |
| repetitionPenalty | Penalty for repeated tokens | 1.0-2.0 |
| presencePenalty | Penalty for token presence | -2.0 to 2.0 |
| frequencyPenalty | Penalty for token frequency | -2.0 to 2.0 |
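As an illustration of how several of these parameters combine in one model entry (the values shown are arbitrary examples within the documented ranges, not tuned recommendations):

```yaml
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        contextWindow: 8192        # maximum context length (integer)
        promptStyle: qwen          # LLMs only
        supportReasoning: true     # LLMs only
        samplingParams:
          temperature: 0.7         # range 0.0-2.0
          topP: 0.9                # range 0.0-1.0
          maxTokens: 2048          # range 1-8192
```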

6.3 Configuration Examples

Example 1: Custom Fine-tuned GPT-OSS Model

ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      # Main LLM with custom fine-tuned model
      - id: llm
        modelRepo: "Jinx-org/Jinx-gpt-oss-20b-mxfp4"
        tokenizer: "openai/gpt-oss-20b"

Example 2: Custom Qwen 3 with Different Parameters

ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.7
          maxTokens: 4096

7. Deprecated Presets (Legacy Support)

For customers who require older configurations, deprecated presets remain available but are not recommended for new installations.
| Preset Pattern | Description | Recommendation |
| --- | --- | --- |
| deprecated.<size>g.20250710 | Pre-Qwen 3 model configurations | Upgrade to current presets when possible |
Example: deprecated.24g.20250710 for legacy 24GB configuration
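In configuration-file form, that example would look like:

```yaml
ai:
  preset: "deprecated.24g.20250710"  # legacy 24GB configuration
```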
Be aware that a deprecated preset may eventually be removed entirely; plan to migrate before then.

8. Troubleshooting

Engine fails to start with memory error:
  1. Verify your actual GPU memory with nvidia-smi
  2. Try the next lower preset (e.g., baseline-24g instead of baseline-32g)
  3. Remove optional capabilities to reduce memory usage
  4. Check for other applications using GPU memory
  5. Reboot the machine
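As a sketch of steps 2 and 3 combined, moving down one preset tier and dropping an optional capability at the same time:

```yaml
# Before: preset: "baseline-32g,capabilities.multilingual"
ai:
  preset: "baseline-24g"  # next lower preset, optional capability removed
```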
Poor performance or slow responses:
  1. Ensure you’re using the correct preset for your hardware
  2. Consider moving to a lower-tier preset
  3. Contact Zylon engineers to understand what is happening