Custom configurations can cause system malfunction. Perform thorough testing before deployment.

When to Use Custom Configurations

Consider advanced customization when:
  • You have fine-tuned models optimized for specific domains or tasks
  • You need different sampling parameters than the preset defaults
  • You want to run multiple specialized models simultaneously (e.g., one for vision, one for text, one for reasoning)
  • You require models from alternative model families not included in standard presets
Before proceeding, ensure you:
  • Understand GPU memory management principles
  • Have access to compatible HuggingFace model repositories
  • Know your hardware limitations
  • Have a testing environment for validation

Supported Model Families

Zylon supports these model families for custom configurations:
Model Family | Example Repository | Use Cases
Qwen 3 | Qwen/Qwen3-14B | General purpose (default)
Mistral Small | mistralai/Mistral-Small-24B-Instruct-2501 | High-quality text generation
Gemma 3 | google/gemma-3-12b-it | Efficient inference
Gemma 3n | google/gemma-3n-E4B-it | Optimized small models
GPT-OSS | openai/gpt-oss-20b | Alternative architecture
Only models from these families are officially supported. Using unsupported families may result in system instability.

Understanding Configuration Structure

All custom configurations follow this pattern:
ai:
  preset: "<base-preset>"           # Start with a base preset
  numGPUs: <number>                 # Optional: for multi-GPU setups
  config:
    models:
      - id: llm                      # Mandatory: primary language model
        # ... configuration
      - id: embed                    # Mandatory: embeddings model
        # ... configuration
      - id: <custom-id>              # Optional: additional models
        # ... configuration
Key principles:
  • Every configuration must include llm and embed models
  • Each model needs a unique id
  • GPU memory must be managed manually when adding/deleting models

Use Case 1: Customizing Existing Models

Goal: Modify the preset’s default models without adding new ones. This is useful for using fine-tuned versions of existing models or adjusting inference parameters.

When to Use This Approach

  • Swapping the default model for a fine-tuned version (e.g., Qwen3-14B-Medical instead of Qwen3-14B)
  • Changing sampling parameters (temperature, max tokens, etc.) for different behavior
  • Using a different embeddings model for improved semantic search
  • Adjusting context window size based on your use case

How It Works

Since you’re not adding models, you don’t need to worry about memory reallocation. Simply specify the model changes in the config section, and the preset handles memory allocation automatically.

Configuration Schema

ai:
  preset: "<preset>"
  config:
    models:
      - id: llm | embed                    # Which model to customize
        modelRepo: string                  # HuggingFace model path
        tokenizer: string                  # Optional: tokenizer path (LLMs only)
        promptStyle: string                # Optional: qwen, mistral, gemma, gpt-oss
        contextWindow: integer             # Optional: max context length
        samplingParams:                    # Optional: inference parameters
          temperature: float (0.0-2.0)
          maxTokens: integer (1-8192)
          topP: float (0.0-1.0)
          # ... other sampling parameters

Examples

Example 1: Using a Fine-Tuned Model

Replace the default model with your domain-specific fine-tuned version:
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        modelRepo: "your-org/qwen3-14b-medical-finetuned"
        tokenizer: "Qwen/Qwen3-14B-Instruct"  # Use original tokenizer

Example 2: Adjusting Sampling Parameters

Modify inference behavior without changing the model:
ai:
  preset: "baseline-48g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.3        # More deterministic
          maxTokens: 2048         # Shorter responses
          topP: 0.85              # Focused sampling
          repetitionPenalty: 1.3  # Reduce repetition

Example 3: Using Alternative Model Family

Switch to a different model family while keeping the same memory footprint:
ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      - id: llm
        modelRepo: "your-org/gpt-oss-20b-finetuned"
        tokenizer: "openai/gpt-oss-20b"
        promptStyle: gpt-oss

Example 4: Custom Embeddings Model

Use specialized embeddings for domain-specific semantic search:
ai:
  preset: "baseline-48g"
  config:
    models:
      - id: embed
        modelRepo: "your-org/legal-embeddings-v1"
        vectorDim: 1024

Use Case 2: Adding New Models

Goal: Run multiple specialized models simultaneously. This is more complex because you must manually manage GPU memory allocation across all models.

When to Use This Approach

  • Running a vision model alongside your primary text model
  • Using different models for different tasks (e.g., reasoning model + fast response model)
  • Creating specialized pipelines that require multiple model types
  • Building multi-modal systems that process text, images, audio, and other data types

Understanding GPU Memory Management

The critical concept: GPU memory is a fixed resource that must be manually divided among all models. Each model uses a fraction of total GPU memory, controlled by gpuMemoryUtilization (a value between 0.0 and 1.0). The sum of all models' memory allocations cannot exceed 0.95, reserving 5% for system overhead.
Default allocation for baseline-24g:
llm:   0.85  (85% of 24GB = ~20.4GB)
embed: 0.10  (10% of 24GB = ~2.4GB)
─────────────
Total: 0.95  (with 0.05 reserved for system)
To add a new model, you must:
  1. Reduce existing models’ allocations to free memory
  2. Assign the freed memory to the new model
  3. Adjust context windows if memory is significantly reduced
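To make the reallocation concrete, here is a minimal Python sketch of the arithmetic described above, assuming a 24GB GPU, the default llm/embed fractions, and a hypothetical ~5GB vision model. It is only an estimate helper, not part of Zylon itself.

# Hedged sketch: make room for a new ~5GB model on a 24GB GPU by shrinking
# the primary LLM. The 0.95 budget and default fractions come from this guide;
# the 5GB figure is an assumed example.
TOTAL_GPU_GB = 24.0
BUDGET = 0.95

allocations = {"llm": 0.85, "embed": 0.10}      # preset defaults

new_model_gb = 5.0                              # assumed size of the new model
new_fraction = new_model_gb / TOTAL_GPU_GB      # 5 / 24 ≈ 0.21

# Steps 1-2: shrink the LLM so the freed memory covers the new model
allocations["llm"] = round(BUDGET - allocations["embed"] - new_fraction, 2)
allocations["llmvision"] = round(new_fraction, 2)

total = round(sum(allocations.values()), 2)
print(allocations, "sum =", total)              # sum stays at 0.95
assert total <= BUDGET, "over the 0.95 budget"
# Step 3: remember to shrink the llm contextWindow to match its new fraction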

Understanding KV Cache

To understand why memory allocation affects context windows, you need to know about KV Cache.
What is KV Cache? During inference, language models store intermediate computations (Keys and Values) for each token they process. This store is called the KV Cache, and it is what allows models to maintain context across a conversation or document without recomputing everything from scratch.
The KV Cache grows with:
  • Context length: More tokens in context = more cache storage needed
  • Model size: Larger models require more cache per token
  • Batch size: Processing multiple requests simultaneously multiplies cache requirements
Memory allocation breakdown: When you allocate GPU memory to a model, that memory is divided between:
  1. Model weights: The model parameters (fixed size, ~2 bytes per parameter for FP16)
  2. KV Cache: Storage for context tokens (grows with context length)
  3. Activation memory: Temporary computation space during inference
Example for a 14B parameter model (AWQ-quantized, ~10GB of weights):
Full allocation (0.85 on 24GB = 20.4GB):
├─ Model weights: ~10GB (4-bit quantized; FP16 would need ~28GB)
├─ KV Cache: ~5GB (supports 16k context)
└─ Activations and headroom: ~5GB

Reduced allocation (0.50 on 24GB = 12GB):
├─ Model weights: ~10GB (unchanged)
├─ KV Cache: ~2GB (now only supports roughly 8k context)
└─ Activations: minimal headroom
Why this matters: If you reduce the total memory allocation from 0.85 to 0.50, the model weights still need the same space, so there is significantly less room left for KV Cache. You must therefore reduce the contextWindow parameter proportionally to avoid out-of-memory errors during inference.
Default baseline KV Cache allocations:
Preset | Memory Allocation | Context Window | Approx. KV Cache
baseline-24g/48g | 0.85 | 16384 (16k) | ~4-6GB
baseline-96g | 0.85 | 32768 (32k) | ~8-12GB
Setting contextWindow too high for the allocated memory will cause out-of-memory errors during inference, especially during long conversations or when processing large documents. The errors typically appear as “CUDA out of memory” in the Triton logs.
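If you want to sanity-check these figures yourself, the following Python sketch applies the standard transformer KV-cache sizing formula (two tensors per layer, one entry per token per KV head). The layer count, KV-head count, and head dimension are illustrative assumptions, not the exact architecture of any preset model.

# Hedged sketch: rough KV-cache size for one request. Architecture values
# (num_layers, num_kv_heads, head_dim) are assumed for a 14B-class model
# with grouped-query attention; check your model's config for real numbers.
def kv_cache_gb(context_len, num_layers, num_kv_heads, head_dim,
                bytes_per_value=2, batch_size=1):
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K + V
    return per_token * context_len * batch_size / 1024**3

print(kv_cache_gb(16384, num_layers=48, num_kv_heads=8, head_dim=128))  # ~3.0 GB
# Batching and framework overhead push the practical figure toward the
# ~4-6GB shown in the table above.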

Step-by-Step Process

Step 1: Know Your GPU Memory

First, identify your total available GPU memory:
nvidia-smi
Common configurations:
  • 24GB: RTX 4090, L4
  • 48GB: RTX A6000, L40, L40s
  • 80-96GB: A100, H100
Reserve 5% for system overhead, leaving 95% for models:
  • 24GB → 22.8GB usable
  • 48GB → 45.6GB usable
  • 96GB → 91.2GB usable
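As a convenience, a small Python sketch can read the total from nvidia-smi and apply the 95% rule; the query flags are standard nvidia-smi options, and the GPU index may need adjusting on multi-GPU machines.

# Hedged sketch: compute the usable memory budget from nvidia-smi output.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
total_gb = int(out.splitlines()[0].strip()) / 1024   # MiB -> GB (approx.)
print(f"Total: {total_gb:.1f} GB, usable for models: {total_gb * 0.95:.1f} GB")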

Step 2: Calculate Model Memory Requirements

Model memory depends on parameter count and quantization. Use this table to estimate:
Model Size | FP16 (full precision) | FP8 | FP4/AWQ | Typical Use
3-4B | 6-8 GB | 3-4 GB | 2-3 GB | Fast inference, reasoning
7B | 14-16 GB | 7-8 GB | 4-5 GB | Vision models, specialized tasks
14B | 28-32 GB | 14-16 GB | 8-10 GB | Common usage
20B | 40-44 GB | 20-22 GB | 12-14 GB | High-quality generation
32B | 64-68 GB | 32-34 GB | 18-20 GB | Advanced reasoning
70B | 140-150 GB | 70-75 GB | 40-45 GB | Complex tasks
Quantization notes:
  • FP16: Full precision, best quality, highest memory
  • FP8: 50% memory reduction, minimal quality loss
  • FP4/AWQ: 70-75% memory reduction, slight quality degradation
  • Most HuggingFace models default to FP16 unless specified (e.g., -AWQ, -GPTQ suffix)
Example calculations for 24GB GPU (22.8GB usable):
Scenario 1: Primary + Vision
- Qwen3-14B (FP16): 28GB → Too large alone
- Qwen3-14B (FP4): 10GB → Fits
- Qwen2.5-VL-7B (AWQ): 5GB → Fits
- Embeddings: 2-3GB → Fits
- Total: 17-18GB → ✓ Fits in 22.8GB

Scenario 2: Multiple smaller models
- Gemma-3n-4B: 3GB → Fits
- Qwen2.5-VL-7B (AWQ): 5GB → Fits
- GPT-OSS-20B (FP4): 12GB → Fits
- Embeddings: 2GB → Fits
- Total: 22GB → ✓ Fits in 22.8GB
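The scenarios above can be reproduced with a short estimate: weight memory is roughly parameter count times bytes per parameter, plus some overhead. The bytes-per-parameter values in the sketch below mirror the table; real checkpoints can differ by a few GB, so treat the output as a starting point.

# Hedged sketch: estimate weight memory from parameter count and precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4/awq": 0.55}  # 4-bit + scales

def weight_gb(params_billions, precision, overhead=1.1):
    return params_billions * BYTES_PER_PARAM[precision] * overhead

for size, prec in [(14, "fp16"), (14, "fp4/awq"), (7, "fp4/awq"), (20, "fp4/awq")]:
    print(f"{size}B {prec}: ~{weight_gb(size, prec):.0f} GB")
# Consistent with the table above: 14B FP16 ≈ 31 GB, 14B AWQ ≈ 8 GB,
# 7B AWQ ≈ 4 GB, 20B AWQ ≈ 12 GB.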

Step 3: Convert GB to Memory Utilization Percentages

Once you know the GB requirements, convert them to gpuMemoryUtilization values.
Formula: gpuMemoryUtilization = Model GB / Total GPU GB
Example for 24GB GPU:
Model | Memory (GB) | Calculation | gpuMemoryUtilization
Qwen3-14B (FP4) | 10 GB | 10 / 24 = 0.417 | 0.42
Qwen2.5-VL-7B | 5 GB | 5 / 24 = 0.208 | 0.21
GPT-OSS-20B (FP4) | 12 GB | 12 / 24 = 0.500 | 0.50
Embeddings | 2.5 GB | 2.5 / 24 = 0.104 | 0.10
Total | 29.5 GB | 29.5 / 24 | 1.23 (over the 0.95 budget ✗)
Example for 48GB GPU:
Model | Memory (GB) | Calculation | gpuMemoryUtilization
Qwen3-14B (FP16) | 30 GB | 30 / 48 = 0.625 | 0.63
Qwen2.5-VL-7B | 5 GB | 5 / 48 = 0.104 | 0.10
Gemma-3n-4B | 3 GB | 3 / 48 = 0.063 | 0.07
Embeddings | 5 GB | 5 / 48 = 0.104 | 0.10
Total | 43 GB | 43 / 48 | 0.90 (within the 0.95 budget ✓)
Round to two decimals and leave headroom: if the calculation gives 0.417, use 0.42, or round down to 0.40 for extra margin.
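A quick Python sketch of this conversion and the budget check, using the illustrative 48GB plan from the table above (the model ids and sizes are examples, not requirements):

# Hedged sketch: convert GB plans to gpuMemoryUtilization and check the budget.
TOTAL_GPU_GB = 48
BUDGET = 0.95

plan_gb = {"llm": 30, "llmvision": 5, "llmfast": 3, "embed": 5}

for name, gb in plan_gb.items():
    print(f"{name}: {gb} GB -> {gb / TOTAL_GPU_GB:.3f}")

# The rounded values you actually put in the YAML:
chosen = {"llm": 0.63, "llmvision": 0.10, "llmfast": 0.07, "embed": 0.10}
total = round(sum(chosen.values()), 2)
assert total <= BUDGET, f"over the 0.95 budget: {total}"
print("total =", total)                                   # 0.9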

Step 4: Adjust Context Windows Based on Memory

When you reduce a model's memory allocation, you must also reduce its contextWindow, because there is less space available for KV Cache.
Rule of thumb: the context window scales roughly linearly with the memory allocation.
Examples:
Scenario 1: 24GB GPU with vision model
- Primary LLM: 0.85 → 0.50
  Context: 16384 → 8192
- Vision: 0.25
  Context: 2048

Scenario 2: 48GB GPU with multiple models  
- Primary LLM: 0.85 → 0.60
  Context: 16384 → 12288
- Reasoning: 0.15
  Context: 4096
- Vision: 0.10
  Context: 2048
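A minimal sketch of the rule of thumb, assuming the 16384-token default at the 0.85 allocation; it is only an estimate, so still verify long-context behavior under load.

# Hedged sketch: scale contextWindow roughly in proportion to the allocation,
# then round down to a multiple of 1024 for headroom.
def scaled_context(new_allocation, default_allocation=0.85,
                   default_context=16384, granularity=1024):
    tokens = int(default_context * new_allocation / default_allocation)
    return max((tokens // granularity) * granularity, granularity)

print(scaled_context(0.50))   # 9216 -> the scenario above rounds further, to 8192
print(scaled_context(0.60))   # 11264 -> comparable to the 12288 used above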

Step 5: Write Complete Configuration

Now combine all models with their calculated allocations:
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        gpuMemoryUtilization: 0.42  # 10GB for Qwen3-14B FP4
        contextWindow: 4096          # Reduced from the 16384 default to match the smaller allocation
        
      - id: llmvision
        gpuMemoryUtilization: 0.21  # 5GB for vision model
        contextWindow: 1024          # Small for vision tasks
        
      - id: llmfast
        gpuMemoryUtilization: 0.12  # 3GB for fast model
        contextWindow: 2048
        
      - id: embed
        gpuMemoryUtilization: 0.10  # 2.5GB for embeddings
Quantized models on HuggingFace: Look for suffixes like -AWQ or -GPTQ in the model name. If there is no suffix, assume FP16. Examples:
  • Qwen/Qwen3-14B-Instruct → FP16 (28-32GB)
  • Qwen/Qwen3-14B-Instruct-AWQ → FP4 (8-10GB)
  • mistralai/Mistral-Small-24B-Instruct-2501 → FP16 (44-48GB)

Configuration Schema

ai:
  preset: "<base-preset>"
  numGPUs: integer                         # Optional: for multi-GPU
  config:
    models:
      - id: string                         # Required: unique identifier
        name: string                       # Optional: display name
        type: llm | embedding              # Required: model type
        modelRepo: string                  # Required: HuggingFace path
        tokenizer: string                  # Optional: tokenizer path (LLMs)
        promptStyle: string                # Optional: qwen, mistral, gemma, gpt-oss
        contextWindow: integer             # Optional: max context length
        gpuMemoryUtilization: float        # Required when adding models (0.0-1.0)
        supportReasoning: boolean          # Optional: enable reasoning (LLMs)
        multimodal:                        # Optional: multimodal support (LLMs)
          images:
            enabled: boolean
            maxNumber: integer
        samplingParams:                    # Optional: inference parameters
          temperature: float (0.0-2.0)
          maxTokens: integer (1-8192)
          minP: float (0.0-1.0)
          topP: float (0.0-1.0)
          topK: integer (1-100)
          repetitionPenalty: float (1.0-2.0)
          presencePenalty: float (-2.0-2.0)
          frequencyPenalty: float (-2.0-2.0)
Critical Rules:
  • Sum of all gpuMemoryUtilization must not exceed 0.95
  • Each id must be unique
  • llm and embed are mandatory and cannot be removed
  • Reducing memory allocation requires reducing contextWindow proportionally

Complete Example: Multi-Model Setup

This example demonstrates adding vision and audio models to handle different workload types.
Scenario: You want three models:
  1. Primary LLM for general text tasks
  2. Vision LLM for image understanding
  3. Fast Audio Model for transcription tasks
Memory allocation strategy:
Primary LLM:   0.50  (reduced from 0.85)
Vision model:  0.25  (new)
Audio model:   0.10  (new)
Embeddings:    0.10  (unchanged)
───────────────────────────────
Total:         0.95
Complete configuration:
ai:
  preset: "baseline-24g"
  numGPUs: 1
  config:
    models:
      # Primary LLM - Handles general text generation
      - id: llm
        name: qwen-3-14b-awq
        type: llm
        contextWindow: 9600              # Reduced from 16384 (~41%) to match memory allocation
        promptStyle: qwen
        gpuMemoryUtilization: 0.50       # Reduced from default 0.85
        supportReasoning: true
        samplingParams:
          temperature: 0.7
          maxTokens: 4096
          topP: 0.9

      # Embeddings - Mandatory for document processing (unchanged)
      - id: embed
        gpuMemoryUtilization: 0.10

      # Vision LLM - Handles image understanding tasks
      - id: llmvision
        name: qwen-2-5-vl-7b-awq
        type: llm
        contextWindow: 1024              # Smaller for image tasks
        promptStyle: qwen
        gpuMemoryUtilization: 0.25       # New allocation
        multimodal:
          images:
            enabled: true
            maxNumber: 1
        supportReasoning: false
        samplingParams:
          temperature: 0.1               # More deterministic for vision
          maxTokens: 2048
          topP: 0.85

      # Fast audio model - Handles transcription tasks
      - id: llmaudio
        name: gemma-3n-e4b
        type: llm
        contextWindow: 2048
        modelRepo: "<repo>/gemma-3n-4b-it-audio"  # Fine-tuned for audio
        promptStyle: gemma
        gpuMemoryUtilization: 0.10                # New allocation
        supportReasoning: true
        samplingParams:
          temperature: 0.5
          maxTokens: 1024
          topP: 0.9

# Memory verification: 0.50 + 0.25 + 0.10 + 0.10 = 0.95 ✓

Best Practices for Multi-Model Setups

  1. Start minimal: Begin with smallest viable allocations, increase based on actual usage
  2. Monitor continuously: Use nvidia-smi to track real memory consumption
  3. Test individually: Validate that each model works on its own before combining them; this isolates issues instead of forcing you to debug several models at once
  4. Plan for headroom: Don’t allocate the full memory. Leave some buffer for memory spikes
  5. Stress test: Simulate peak workloads to ensure stability under load
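To support practice 2 (continuous monitoring), here is a minimal polling sketch; the nvidia-smi query flags are standard, while the 5-second interval and 92% warning threshold are arbitrary choices you should tune.

# Hedged sketch: poll GPU memory usage and flag high utilization.
import subprocess, time

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

while True:
    for line in subprocess.check_output(QUERY, text=True).splitlines():
        idx, used, total = (int(x) for x in line.split(", "))
        pct = used / total
        flag = "  <-- high usage" if pct > 0.92 else ""
        print(f"GPU {idx}: {used}/{total} MiB ({pct:.0%}){flag}")
    time.sleep(5)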

Configuration Parameter Reference

Complete reference for all available parameters.

Core Parameters (All Models)

Parameter | Type | Required | Description
id | string | Yes | Unique model identifier
type | string | Yes | Model type: llm or embedding
modelRepo | string | Yes | HuggingFace model path
name | string | No | Custom display name
gpuMemoryUtilization | float | No | GPU memory fraction (0.0-1.0)

LLM-Specific Parameters

Parameter | Type | Description
contextWindow | integer | Maximum context length
tokenizer | string | HuggingFace tokenizer path
promptStyle | string | Format: qwen, mistral, gemma, gpt-oss
supportReasoning | boolean | Enable reasoning capabilities
supportImage | integer | Number of supported images
supportAudio | integer | Number of supported audio inputs
samplingParams | object | Default sampling configuration
reasoningSamplingParams | object | Sampling for reasoning mode

Embedding-Specific Parameters

Parameter | Type | Description
vectorDim | integer | Output vector dimensions

Sampling Parameters

Parameter | Type | Range | Description
temperature | float | 0.0-2.0 | Randomness in generation
maxTokens | int | 1-8192 | Maximum response tokens
minP | float | 0.0-1.0 | Minimum probability threshold
topP | float | 0.0-1.0 | Nucleus sampling threshold
topK | int | 1-100 | Top-K sampling limit
repetitionPenalty | float | 1.0-2.0 | Penalty for repeated tokens
presencePenalty | float | -2.0 to 2.0 | Penalty for token presence
frequencyPenalty | float | -2.0 to 2.0 | Penalty for token frequency

Multimodal Parameters (LLMs)

multimodal:
  images:
    enabled: boolean      # Enable image input
    maxNumber: integer    # Max images per request
  audio:
    enabled: boolean      # Enable audio input
    maxNumber: integer    # Max audio files per request

Validation Checklist

Before deploying custom configurations:
  • All id values are unique
  • llm and embed models are present
  • Sum of gpuMemoryUtilization ≤ 0.95
  • promptStyle matches model family
  • contextWindow appropriate for memory allocation
  • tokenizer matches or is compatible with model
  • Configuration tested in staging environment
  • Monitoring in place for memory usage
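Several of these checks can be automated. The sketch below assumes the YAML layout shown in this guide (ai.config.models), a hypothetical file name, and PyYAML installed; adapt it to your deployment.

# Hedged sketch: basic structural checks against a custom configuration file.
import yaml

with open("zylon-config.yaml") as f:            # hypothetical file name
    models = yaml.safe_load(f)["ai"]["config"]["models"]

ids = [m["id"] for m in models]
assert len(ids) == len(set(ids)), "duplicate model ids"
assert {"llm", "embed"} <= set(ids), "llm and embed are mandatory"

total = sum(m.get("gpuMemoryUtilization", 0) for m in models)
assert total <= 0.95 + 1e-9, f"gpuMemoryUtilization sum {total:.2f} exceeds 0.95"

print(f"{len(models)} models, memory sum {total:.2f} -> basic checks passed")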

Common Pitfalls

  1. Exceeding the 0.95 memory budget: Always verify your math
  2. Not reducing context windows: Large contexts need more memory, adjust accordingly
  3. Mismatched tokenizers: Use compatible tokenizers for each model
  4. Wrong prompt style: Each model family requires specific formatting
  5. No testing: Always validate in non-production first