The Zylon AI inference engine is the core component that runs artificial intelligence models on your hardware. To ensure optimal performance and prevent startup failures, you must configure the system with the correct preset based on your available GPU (Graphics Processing Unit) memory.
1. What are AI Presets?
AI presets are pre-configured settings that optimize the AI models and memory allocation for your specific hardware setup. Each preset is carefully tuned to:
- Load the appropriate AI model size for your GPU/RAM memory
- Allocate memory efficiently to prevent crashes
- Balance performance with available resources
- Enable specific capabilities when needed
Important: Selecting an incorrect preset will prevent the inference engine from starting. The system does not automatically detect your GPU capacity, so manual configuration is required.
2. Understanding GPU Memory Requirements
Your GPU (Graphics Processing Unit) has a specific amount of VRAM (Video Random Access Memory) that determines which AI models can run effectively. AI models require substantial memory to operate, and larger models with better capabilities need more VRAM.
How to check your GPU memory:
- Run the nvidia-smi command (total VRAM appears in the Memory-Usage column)
- Refer to your hardware documentation
3. Presets
Set the AI preset in your Zylon configuration file using the ai.preset property. The default configuration uses a 24GB setup.
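For example, to make the default 24GB setup explicit in your configuration (assuming the default corresponds to the baseline-24g preset listed below):
ai:
  preset: "baseline-24g" # Default 24GB setup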
3.1 Base Presets
| Preset | Required GPU Memory | Compatible Hardware Examples |
|---|---|---|
| baseline-24g | 24GB | RTX 4090, L4, RTX 3090 Ti |
| baseline-32g | 32GB | RTX 5090 |
| baseline-48g | 48GB | RTX A6000, A40, L40, L40s |
| baseline-96g | 96GB | A100 80GB, H100, A6000 (dual) |
Select the preset whose memory requirement matches your available VRAM, or the next one below it; never select a preset that requires more VRAM than you have.
ai:
  preset: "baseline-48g" # For a system with L40s (48GB)
3.2 Alternative Presets
Zylon also provides alternative presets that offer specialized configurations trading certain capabilities for others. These are optional and should only be used when you have specific requirements that differ from the standard presets.
# For document and image processing
ai:
  preset: "alternatives.baseline-96g-vision"
# For extended context processing
ai:
  preset: "alternatives.baseline-48g-context"
Vision-Enabled Alternatives
These presets are only available in Zylon versions later than v1.44.0.
These presets include specialized computer vision capabilities in the ingestion pipeline, allowing the system to process and understand images, documents, and visual content. They are useful for document digitization, image analysis, and slide processing.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| alternatives.baseline-48g-vision | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-vision | 96GB | Smaller model (Qwen 3 14B) |
When to use vision-enabled presets:
- Processing scanned documents and slide decks
- Analyzing charts, graphs, and visual data
- Image understanding and description tasks
Using these presets may require increasing the inference server's shared memory; see Section 8, Shared memory, for instructions.
Context-Optimized Alternatives
These presets use smaller AI models to provide significantly larger context windows.
| Preset | Required GPU Memory | Trade-off |
|---|---|---|
| alternatives.baseline-48g-context | 48GB | Smaller model (Qwen 3 14B) |
| alternatives.baseline-96g-context | 96GB | Smaller model (Qwen 3 14B) |
When to use context-optimized presets:
- Extended conversation sessions
- Complex analysis requiring large amounts of context
Important: A larger context window does not always yield better results.
3.3 Experimental Presets
Experimental presets are under active development and may not be stable. Use only in testing environments.
Experimental presets provide access to cutting-edge models and configurations that are being evaluated for future releases. These presets may have different performance characteristics or stability compared to baseline presets.
| Preset | Required GPU Memory | Model Family | Status |
|---|---|---|---|
| experimental.mistral-24g | 24GB | Mistral | Beta |
| experimental.mistral-48g | 48GB | Mistral | Beta |
| experimental.gpt-oss-24g | 24GB | GPT-OSS | Beta |
| experimental.gpt-oss-48g | 48GB | GPT-OSS | Beta |
| experimental.gemma-24g | 24GB | Gemma 3 | Alpha |
Usage Example:
ai:
  preset: "experimental.gpt-oss-24g"
Important Notes:
- Experimental presets may be removed or significantly changed between versions
- Performance and stability are not guaranteed
- Not recommended for production environments
- May require additional configuration parameters
4. Enhanced Capabilities (Optional)
Zylon supports additional capabilities that can be combined with any base or alternative preset. These capabilities extend the functionality but are not enabled by default.
| Capability | Description | Example Use Cases |
|---|---|---|
| multilingual | Enhanced support for languages beyond English | International documents, non-English content processing |
Capabilities are added to presets using a comma-separated format: <base_preset>,<capability1>,<capability2>
# Base preset with multilingual capability
ai:
  preset: "baseline-24g,capabilities.multilingual"
# Alternative preset with multilingual capability
ai:
  preset: "alternatives.baseline-48g-context,capabilities.multilingual"
5. Multi-GPU Configuration (Optional)
If your system has multiple GPUs, you can combine their memory capacity. Select the preset based on total combined VRAM across all GPUs.
Multi-GPU Setup Steps
- Calculate total VRAM: Add up the memory of all GPUs
- Select the appropriate preset for the total memory
- Configure the number of GPUs in your configuration
ai:
  preset: "baseline-48g"
  numGPUs: 2 # Using 2 GPUs with 24GB each (48GB total)
Multi-GPU Examples
| Hardware Setup | Individual GPU Memory | Total VRAM | Recommended Preset | Configuration |
|---|---|---|---|---|
| 2x RTX 4090 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 2x L4 | 24GB each | 48GB | baseline-48g | numGPUs: 2 |
| 4x RTX 4090 | 24GB each | 96GB | baseline-96g | numGPUs: 4 |
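For example, the four-GPU setup in the last row would be configured as:
ai:
  preset: "baseline-96g"
  numGPUs: 4 # Using 4 GPUs with 24GB each (96GB total)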
6. Customizing the Default Presets
Custom configurations can cause system malfunction. Perform thorough testing before deployment.
Zylon supports custom model configurations for customers who need to use specialized LLM/Embeddings models. This advanced feature allows you to override the default models while maintaining system compatibility.
6.1 Supported Model Families
Zylon supports the following model families for custom configurations:
- Qwen 3: https://huggingface.co/Qwen/Qwen3-14B (default in baseline presets, or any model from the Qwen 3 family)
- Mistral Small 3: https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501
- Gemma 3: https://huggingface.co/google/gemma-3-12b-it (or any model from the Gemma 3 family)
- Gemma 3n: https://huggingface.co/google/gemma-3n-E4B-it (or any model from the Gemma 3n family)
- GPT-OSS: https://huggingface.co/openai/gpt-oss-20b (or any model from the GPT-OSS family)
6.2 Custom Model Configuration
To customize the models used by a preset, modify your /etc/config/zylon-config.yaml file by adding a config section with model specifications.
Basic Structure:
ai:
  preset: "<preset>"
  config:
    models:
      - id: llm
        modelRepo: <huggingface-model-url>
Default models that can be customized:
- Primary LLM (id: llm): the main language model that handles all text generation tasks
- Embeddings model (id: embed): handles document embeddings and semantic search
Parameters
| Parameter | Description | Valid Values |
|---|---|---|
| id | Model identifier | llm, embed |
| name | Custom model name | string |
| type | Model type | llm, embedding |
| contextWindow | Maximum context length | integer |
| modelRepo | HuggingFace model path | string |
| gpuMemoryUtilization | Fraction of GPU memory to use | 0.0-1.0 |
| samplingParams | Default sampling parameters | SamplingParams |
| reasoningSamplingParams | Reasoning sampling parameters | SamplingParams |
| tokenizer | HuggingFace tokenizer path (LLMs only) | string |
| promptStyle | Prompt formatting style (LLMs only) | qwen, mistral, gemma, gpt-oss |
| supportReasoning | Enable reasoning capabilities (LLMs only) | boolean |
| supportImage | Number of supported images (LLMs only) | integer |
| supportAudio | Number of supported audio inputs (LLMs only) | integer |
| vectorDim | Vector dimensions (embeddings only) | integer |
Sampling parameters:
| Parameter | Description | Range |
|---|---|---|
| temperature | Randomness in text generation | 0.0-2.0 |
| maxTokens | Maximum tokens in response | 1-8192 |
| minP | Minimum probability threshold | 0.0-1.0 |
| topP | Nucleus sampling threshold | 0.0-1.0 |
| topK | Top-K sampling limit | 1-100 |
| repetitionPenalty | Penalty for repeated tokens | 1.0-2.0 |
| presencePenalty | Penalty for token presence | -2.0 to 2.0 |
| frequencyPenalty | Penalty for token frequency | -2.0 to 2.0 |
6.3 Configuration Examples
Example 1: Custom Fine-tuned GPT-OSS Model
ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      # Main LLM with custom fine-tuned model
      - id: llm
        modelRepo: "Jinx-org/Jinx-gpt-oss-20b-mxfp4"
        tokenizer: "openai/gpt-oss-20b"
Example 2: Custom Qwen 3 with Different Parameters
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.7
          maxTokens: 4096
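Example 3: Custom Embeddings Model
The embeddings model (id: embed) can be overridden in the same way as the LLM. The following is a minimal sketch assuming a hypothetical replacement repository; substitute your own model and set vectorDim to the dimension it actually produces:
ai:
  preset: "baseline-24g"
  config:
    models:
      # Embeddings override; repository and dimension are illustrative
      - id: embed
        type: embedding
        modelRepo: "BAAI/bge-m3"
        vectorDim: 1024 # Must match the model's output dimension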
7. Deprecated Presets (Legacy Support)
For customers who require older configurations, deprecated presets remain available but are not recommended for new installations.
| Preset Pattern | Description | Recommendation |
|---|---|---|
| deprecated.<size>g.20250710 | Pre-Qwen 3 model configurations | Upgrade to current presets when possible |
Example: deprecated.24g.20250710 for legacy 24GB configuration
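In the configuration file this uses the same ai.preset property as above:
ai:
  preset: "deprecated.24g.20250710" # Legacy 24GB configuration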
8. Shared memory
Our inference server (the Triton Inference Server) uses shared memory to enable zero-copy data transfer between Zylon services and the inference engine. This eliminates the overhead of serializing and deserializing data across process boundaries, significantly improving inference throughput and reducing latency for high-volume workloads.
By default, the inference server allocates 2GB of RAM for shared memory. This is sufficient for most text-based inference workloads, but you may encounter "Shared memory allocation failed" errors in scenarios such as:
- Large request queues: When processing high volumes of concurrent requests, the queued input can exceed available shared memory
- Image-based models: Vision workloads often require multiple megabytes per image. A batch of high-resolution images can quickly exhaust the default allocation
Updating Shared Memory Allocation
To increase the shared memory limit, update your Zylon configuration file and apply the changes.
triton:
  sharedMemory:
    limit: "4Gi" # Increase from default 2Gi
- Allocating excessive shared memory can cause other pods to be OOMKilled
- Be conservative with increases: start with small increments (e.g., 2Gi → 4Gi) and adjust based on observed metrics
9. Troubleshooting
Engine fails to start with memory error:
- Verify your actual GPU memory with nvidia-smi
- Try the next lower preset (e.g., baseline-24g instead of baseline-32g)
- Remove optional capabilities to reduce memory usage
- Check for other applications using GPU memory
- Reboot the machine
Poor performance or slow responses:
- Ensure you’re using the correct preset for your hardware
- Consider switching to a lower-tier preset
- Contact Zylon engineers to help diagnose the issue