Custom configurations can cause system malfunction. Perform thorough testing before deployment.

When to Use Custom Configurations

Consider advanced customization when:
  • You have fine-tuned models optimized for specific domains or tasks
  • You need different sampling parameters than the preset defaults
  • You want to run multiple specialized models simultaneously (e.g., one for vision, one for text, one for reasoning)
  • You require models from alternative model families not included in standard presets
Before proceeding, ensure you:
  • Understand GPU memory management principles
  • Have access to compatible HuggingFace model repositories
  • Know your hardware limitations
  • Have a testing environment for validation

Supported Model Families

Zylon supports these model families for custom configurations:
Model Family | Example Repository | Use Cases
Qwen 3 | Qwen/Qwen3-14B | General purpose (default)
Mistral Small | mistralai/Mistral-Small-24B-Instruct-2501 | High-quality text generation
Gemma 3 | google/gemma-3-12b-it | Efficient inference
Gemma 3n | google/gemma-3n-E4B-it | Optimized small models
GPT-OSS | openai/gpt-oss-20b | Alternative architecture
Only models from these families are officially supported. Using unsupported families may result in system instability.

Understanding Configuration Structure

All custom configurations follow this pattern:
ai:
  preset: "<base-preset>"           # Start with a base preset
  numGPUs: <number>                 # Optional: for multi-GPU setups
  config:
    models:
      - id: llm                      # Mandatory: primary language model
        # ... configuration
      - id: embed                    # Mandatory: embeddings model
        # ... configuration
      - id: <custom-id>              # Optional: additional models
        # ... configuration
Key principles:
  • Every configuration must include llm and embed models
  • Each model needs a unique id
  • GPU memory must be managed manually when adding/deleting models

Use Case 1: Customizing Existing Models

Goal: Modify the preset’s default models without adding new ones. This is useful for using fine-tuned versions of existing models or adjusting inference parameters.

When to Use This Approach

  • Swapping the default model for a fine-tuned version (e.g., Qwen3-14B-Medical instead of Qwen3-14B)
  • Changing sampling parameters (temperature, max tokens, etc.) for different behavior
  • Using a different embeddings model for improved semantic search
  • Adjusting context window size based on your use case

How It Works

Since you’re not adding models, you don’t need to worry about memory reallocation. Simply specify the model changes in the config section, and the preset handles memory allocation automatically.

Configuration Schema

ai:
  preset: "<preset>"
  config:
    models:
      - id: llm | embed                    # Which model to customize
        modelRepo: string                  # HuggingFace model path
        tokenizer: string                  # Optional: tokenizer path (LLMs only)
        promptStyle: string                # Optional: qwen, mistral, gemma, gpt-oss
        contextWindow: integer             # Optional: max context length
        samplingParams:                    # Optional: inference parameters
          temperature: float (0.0-2.0)
          maxTokens: integer (1-8192)
          topP: float (0.0-1.0)
          # ... other sampling parameters

Examples

Example 1: Using a Fine-Tuned Model

Replace the default model with your domain-specific fine-tuned version:
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        modelRepo: "your-org/qwen3-14b-medical-finetuned"
        tokenizer: "Qwen/Qwen3-14B-Instruct"  # Use original tokenizer

Example 2: Adjusting Sampling Parameters

Modify inference behavior without changing the model:
ai:
  preset: "baseline-48g"
  config:
    models:
      - id: llm
        samplingParams:
          temperature: 0.3        # More deterministic
          maxTokens: 2048         # Shorter responses
          topP: 0.85              # Focused sampling
          repetitionPenalty: 1.3  # Reduce repetition

Example 3: Using Alternative Model Family

Switch to a different model family while keeping the same memory footprint:
ai:
  preset: "experimental.gpt-oss-24g"
  config:
    models:
      - id: llm
        modelRepo: "your-org/gpt-oss-20b-finetuned"
        tokenizer: "openai/gpt-oss-20b"
        promptStyle: gpt-oss

Example 4: Custom Embeddings Model

Use specialized embeddings for domain-specific semantic search:
ai:
  preset: "baseline-48g"
  config:
    models:
      - id: embed
        modelRepo: "your-org/legal-embeddings-v1"
        vectorDim: 1024

Use Case 2: Adding New Models

Goal: Run multiple specialized models simultaneously. This is more complex because you must manually manage GPU memory allocation across all models.

When to Use This Approach

  • Running a vision model alongside your primary text model
  • Using different models for different tasks (e.g., reasoning model + fast response model)
  • Creating specialized pipelines that require multiple model types
  • Building multi-modal systems that process text, images, audio, and other data types

Understanding GPU Memory Management

The critical concept: GPU memory is a fixed resource that must be manually divided among all models. Each model uses a fraction of total GPU memory, controlled by gpuMemoryUtilization (a value between 0.0 and 1.0). The sum of all models' memory allocations cannot exceed 0.95, reserving 5% for system overhead.
Default allocation for baseline-24g:
llm:   0.85  (85% of 24GB = ~20.4GB)
embed: 0.10  (10% of 24GB = ~2.4GB)
─────────────
Total: 0.95  (with 0.05 reserved for system)
To add a new model, you must:
  1. Reduce existing models’ allocations to free memory
  2. Assign the freed memory to the new model
  3. Adjust context windows if memory is significantly reduced
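To make the reallocation concrete, here is a minimal Python sketch of the arithmetic described above, assuming a 24GB GPU, the default llm/embed fractions, and a hypothetical ~5GB vision model. It is only an estimate helper, not part of Zylon itself.

# Hedged sketch: make room for a new ~5GB model on a 24GB GPU by shrinking
# the primary LLM. The 0.95 budget and default fractions come from this guide;
# the 5GB figure is an assumed example.
TOTAL_GPU_GB = 24.0
BUDGET = 0.95

allocations = {"llm": 0.85, "embed": 0.10}      # preset defaults

new_model_gb = 5.0                              # assumed size of the new model
new_fraction = new_model_gb / TOTAL_GPU_GB      # 5 / 24 ≈ 0.21

# Steps 1-2: shrink the LLM so the freed memory covers the new model
allocations["llm"] = round(BUDGET - allocations["embed"] - new_fraction, 2)
allocations["llmvision"] = round(new_fraction, 2)

total = round(sum(allocations.values()), 2)
print(allocations, "sum =", total)              # sum stays at 0.95
assert total <= BUDGET, "over the 0.95 budget"
# Step 3: remember to shrink the llm contextWindow to match its new fraction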

Understanding KV Cache

To understand why memory allocation affects context windows, you need to know about KV Cache.
What is KV Cache? During inference, language models store intermediate computations (Keys and Values) for each token they process. This store is called the KV Cache, and it is what allows models to maintain context across a conversation or document without recomputing everything from scratch.
The KV Cache grows with:
  • Context length: More tokens in context = more cache storage needed
  • Model size: Larger models require more cache per token
  • Batch size: Processing multiple requests simultaneously multiplies cache requirements
Memory allocation breakdown: When you allocate GPU memory to a model, that memory is divided between:
  1. Model weights: The model parameters (fixed size, ~2 bytes per parameter for FP16)
  2. KV Cache: Storage for context tokens (grows with context length)
  3. Activation memory: Temporary computation space during inference
Example for a 14B parameter model (AWQ-quantized, ~10GB of weights):
Full allocation (0.85 on 24GB = 20.4GB):
├─ Model weights: ~10GB (4-bit quantized; FP16 would need ~28GB)
├─ KV Cache: ~5GB (supports 16k context)
└─ Activations and headroom: ~5GB

Reduced allocation (0.50 on 24GB = 12GB):
├─ Model weights: ~10GB (unchanged)
├─ KV Cache: ~2GB (now only supports roughly 8k context)
└─ Activations: minimal headroom
Why this matters: If you reduce the total memory allocation from 0.85 to 0.50, the model weights still need the same space, so there is significantly less room left for KV Cache. You must therefore reduce the contextWindow parameter proportionally to avoid out-of-memory errors during inference.
Default baseline KV Cache allocations:
Preset | Memory Allocation | Context Window | Approx. KV Cache
baseline-24g/48g | 0.85 | 16384 (16k) | ~4-6GB
baseline-96g | 0.85 | 32768 (32k) | ~8-12GB
Setting contextWindow too high for the allocated memory will cause out-of-memory errors during inference, especially during long conversations or when processing large documents. The errors typically appear as “CUDA out of memory” in the Triton logs.
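If you want to sanity-check these figures yourself, the following Python sketch applies the standard transformer KV-cache sizing formula (two tensors per layer, one entry per token per KV head). The layer count, KV-head count, and head dimension are illustrative assumptions, not the exact architecture of any preset model.

# Hedged sketch: rough KV-cache size for one request. Architecture values
# (num_layers, num_kv_heads, head_dim) are assumed for a 14B-class model
# with grouped-query attention; check your model's config for real numbers.
def kv_cache_gb(context_len, num_layers, num_kv_heads, head_dim,
                bytes_per_value=2, batch_size=1):
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # K + V
    return per_token * context_len * batch_size / 1024**3

print(kv_cache_gb(16384, num_layers=48, num_kv_heads=8, head_dim=128))  # ~3.0 GB
# Batching and framework overhead push the practical figure toward the
# ~4-6GB shown in the table above.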

Step-by-Step Process

Step 1: Know Your GPU Memory

First, identify your total available GPU memory:
nvidia-smi
Common configurations:
  • 24GB: RTX 4090, L4
  • 48GB: RTX A6000, L40, L40s
  • 80-96GB: A100, H100
Reserve 5% for system overhead, leaving 95% for models:
  • 24GB → 22.8GB usable
  • 48GB → 45.6GB usable
  • 96GB → 91.2GB usable
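As a convenience, a small Python sketch can read the total from nvidia-smi and apply the 95% rule; the query flags are standard nvidia-smi options, and the GPU index may need adjusting on multi-GPU machines.

# Hedged sketch: compute the usable memory budget from nvidia-smi output.
import subprocess

out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
total_gb = int(out.splitlines()[0].strip()) / 1024   # MiB -> GB (approx.)
print(f"Total: {total_gb:.1f} GB, usable for models: {total_gb * 0.95:.1f} GB")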

Step 2: Calculate Model Memory Requirements

Model memory depends on parameter count and quantization. Use this table to estimate:
Model Size | FP16 (full precision) | FP8 | FP4/AWQ | Typical Use
3-4B | 6-8 GB | 3-4 GB | 2-3 GB | Fast inference, reasoning
7B | 14-16 GB | 7-8 GB | 4-5 GB | Vision models, specialized tasks
14B | 28-32 GB | 14-16 GB | 8-10 GB | Common usage
20B | 40-44 GB | 20-22 GB | 12-14 GB | High-quality generation
32B | 64-68 GB | 32-34 GB | 18-20 GB | Advanced reasoning
70B | 140-150 GB | 70-75 GB | 40-45 GB | Complex tasks
Quantization notes:
  • FP16: Full precision, best quality, highest memory
  • FP8: 50% memory reduction, minimal quality loss
  • FP4/AWQ: 70-75% memory reduction, slight quality degradation
  • Most HuggingFace models default to FP16 unless specified (e.g., -AWQ, -GPTQ suffix)
Example calculations for 24GB GPU (22.8GB usable):
Scenario 1: Primary + Vision
- Qwen3-14B (FP16): 28GB → Too large alone
- Qwen3-14B (FP4): 10GB → Fits
- Qwen2.5-VL-7B (AWQ): 5GB → Fits
- Embeddings: 2-3GB → Fits
- Total: 17-18GB → ✓ Fits in 22.8GB

Scenario 2: Multiple smaller models
- Gemma-3n-4B: 3GB → Fits
- Qwen2.5-VL-7B (AWQ): 5GB → Fits
- GPT-OSS-20B (FP4): 12GB → Fits
- Embeddings: 2GB → Fits
- Total: 22GB → ✓ Fits in 22.8GB
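The scenarios above can be reproduced with a short estimate: weight memory is roughly parameter count times bytes per parameter, plus some overhead. The bytes-per-parameter values in the sketch below mirror the table; real checkpoints can differ by a few GB, so treat the output as a starting point.

# Hedged sketch: estimate weight memory from parameter count and precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4/awq": 0.55}  # 4-bit + scales

def weight_gb(params_billions, precision, overhead=1.1):
    return params_billions * BYTES_PER_PARAM[precision] * overhead

for size, prec in [(14, "fp16"), (14, "fp4/awq"), (7, "fp4/awq"), (20, "fp4/awq")]:
    print(f"{size}B {prec}: ~{weight_gb(size, prec):.0f} GB")
# Consistent with the table above: 14B FP16 ≈ 31 GB, 14B AWQ ≈ 8 GB,
# 7B AWQ ≈ 4 GB, 20B AWQ ≈ 12 GB.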

Step 3: Convert GB to Memory Utilization Percentages

Once you know the GB requirements, convert them to gpuMemoryUtilization values.
Formula: gpuMemoryUtilization = Model GB / Total GPU GB
Example for 24GB GPU:
Model | Memory (GB) | Calculation | gpuMemoryUtilization
Qwen3-14B (FP4) | 10 GB | 10 / 24 = 0.417 | 0.42
Qwen2.5-VL-7B | 5 GB | 5 / 24 = 0.208 | 0.21
GPT-OSS-20B (FP4) | 12 GB | 12 / 24 = 0.500 | 0.50
Embeddings | 2.5 GB | 2.5 / 24 = 0.104 | 0.10
Total | 29.5 GB | 29.5 / 24 | 1.23 (over the 0.95 budget ✗)
Example for 48GB GPU:
Model | Memory (GB) | Calculation | gpuMemoryUtilization
Qwen3-14B (FP16) | 30 GB | 30 / 48 = 0.625 | 0.63
Qwen2.5-VL-7B | 5 GB | 5 / 48 = 0.104 | 0.10
Gemma-3n-4B | 3 GB | 3 / 48 = 0.063 | 0.07
Embeddings | 5 GB | 5 / 48 = 0.104 | 0.10
Total | 43 GB | 43 / 48 | 0.90 (within the 0.95 budget ✓)
Round to two decimals and leave headroom: if the calculation gives 0.417, use 0.42, or round down to 0.40 for extra margin.
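A quick Python sketch of this conversion and the budget check, using the illustrative 48GB plan from the table above (the model ids and sizes are examples, not requirements):

# Hedged sketch: convert GB plans to gpuMemoryUtilization and check the budget.
TOTAL_GPU_GB = 48
BUDGET = 0.95

plan_gb = {"llm": 30, "llmvision": 5, "llmfast": 3, "embed": 5}

for name, gb in plan_gb.items():
    print(f"{name}: {gb} GB -> {gb / TOTAL_GPU_GB:.3f}")

# The rounded values you actually put in the YAML:
chosen = {"llm": 0.63, "llmvision": 0.10, "llmfast": 0.07, "embed": 0.10}
total = round(sum(chosen.values()), 2)
assert total <= BUDGET, f"over the 0.95 budget: {total}"
print("total =", total)                                   # 0.9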

Step 4: Adjust Context Windows Based on Memory

When you reduce a model's memory allocation, you must also reduce its contextWindow, because there is less space available for KV Cache.
Rule of thumb: the context window scales roughly linearly with the memory allocation.
Examples:
Scenario 1: 24GB GPU with vision model
- Primary LLM: 0.85 → 0.50
  Context: 16384 → 8192
- Vision: 0.25
  Context: 2048

Scenario 2: 48GB GPU with multiple models  
- Primary LLM: 0.85 → 0.60
  Context: 16384 → 12288
- Reasoning: 0.15
  Context: 4096
- Vision: 0.10
  Context: 2048
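A minimal sketch of the rule of thumb, assuming the 16384-token default at the 0.85 allocation; it is only an estimate, so still verify long-context behavior under load.

# Hedged sketch: scale contextWindow roughly in proportion to the allocation,
# then round down to a multiple of 1024 for headroom.
def scaled_context(new_allocation, default_allocation=0.85,
                   default_context=16384, granularity=1024):
    tokens = int(default_context * new_allocation / default_allocation)
    return max((tokens // granularity) * granularity, granularity)

print(scaled_context(0.50))   # 9216 -> the scenario above rounds further, to 8192
print(scaled_context(0.60))   # 11264 -> comparable to the 12288 used above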

Step 5: Write Complete Configuration

Now combine all models with their calculated allocations:
ai:
  preset: "baseline-24g"
  config:
    models:
      - id: llm
        gpuMemoryUtilization: 0.42  # 10GB for Qwen3-14B FP4
        contextWindow: 4096          # Reduced from the 16384 default to match the smaller allocation
        
      - id: llmvision
        gpuMemoryUtilization: 0.21  # 5GB for vision model
        contextWindow: 1024          # Small for vision tasks
        
      - id: llmfast
        gpuMemoryUtilization: 0.12  # 3GB for fast model
        contextWindow: 2048
        
      - id: embed
        gpuMemoryUtilization: 0.10  # 2.5GB for embeddings
Quantized models on HuggingFace: Look for suffixes like -AWQ or -GPTQ in the model name. If there is no suffix, assume FP16. Examples:
  • Qwen/Qwen3-14B-Instruct → FP16 (28-32GB)
  • Qwen/Qwen3-14B-Instruct-AWQ → FP4 (8-10GB)
  • mistralai/Mistral-Small-24B-Instruct-2501 → FP16 (44-48GB)

Configuration Schema

ai:
  preset: "<base-preset>"
  numGPUs: integer                         # Optional: for multi-GPU
  config:
    models:
      - id: string                         # Required: unique identifier
        name: string                       # Optional: display name
        type: llm | embedding              # Required: model type
        modelRepo: string                  # Required: HuggingFace path
        tokenizer: string                  # Optional: tokenizer path (LLMs)
        promptStyle: string                # Optional: qwen, mistral, gemma, gpt-oss
        contextWindow: integer             # Optional: max context length
        gpuMemoryUtilization: float        # Required when adding models (0.0-1.0)
        supportReasoning: boolean          # Optional: enable reasoning (LLMs)
        multimodal:                        # Optional: multimodal support (LLMs)
          images:
            enabled: boolean
            maxNumber: integer
        samplingParams:                    # Optional: inference parameters
          temperature: float (0.0-2.0)
          maxTokens: integer (1-8192)
          minP: float (0.0-1.0)
          topP: float (0.0-1.0)
          topK: integer (1-100)
          repetitionPenalty: float (1.0-2.0)
          presencePenalty: float (-2.0-2.0)
          frequencyPenalty: float (-2.0-2.0)
Critical Rules:
  • Sum of all gpuMemoryUtilization must not exceed 0.95
  • Each id must be unique
  • llm and embed are mandatory and cannot be removed
  • Reducing memory allocation requires reducing contextWindow proportionally

Complete Example: Multi-Model Setup

This example demonstrates adding vision and audio models to handle different workload types.
Scenario: You want three models:
  1. Primary LLM for general text tasks
  2. Vision LLM for image understanding
  3. Fast Audio Model for transcription tasks
Memory allocation strategy:
Primary LLM:   0.50  (reduced from 0.85)
Vision model:  0.25  (new)
Audio model:   0.10  (new)
Embeddings:    0.10  (unchanged)
───────────────────────────────
Total:         0.95
Complete configuration:
ai:
  preset: "baseline-24g"
  numGPUs: 1
  config:
    models:
      # Primary LLM - Handles general text generation
      - id: llm
        name: qwen-3-14b-awq
        type: llm
        contextWindow: 9600              # Reduced from 16384 (~41%) to match memory allocation
        promptStyle: qwen
        gpuMemoryUtilization: 0.50       # Reduced from default 0.85
        supportReasoning: true
        samplingParams:
          temperature: 0.7
          maxTokens: 4096
          topP: 0.9

      # Embeddings - Mandatory for document processing (unchanged)
      - id: embed
        gpuMemoryUtilization: 0.10

      # Vision LLM - Handles image understanding tasks
      - id: llmvision
        name: qwen-2-5-vl-7b-awq
        type: llm
        contextWindow: 1024              # Smaller for image tasks
        promptStyle: qwen
        gpuMemoryUtilization: 0.25       # New allocation
        multimodal:
          images:
            enabled: true
            maxNumber: 1
        supportReasoning: false
        samplingParams:
          temperature: 0.1               # More deterministic for vision
          maxTokens: 2048
          topP: 0.85

      # Fast audio model - Handles transcription tasks
      - id: llmaudio
        name: gemma-3n-e4b
        type: llm
        contextWindow: 2048
        modelRepo: "<repo>/gemma-3n-4b-it-audio"  # Fine-tuned for audio
        promptStyle: gemma
        gpuMemoryUtilization: 0.10                # New allocation
        supportReasoning: true
        samplingParams:
          temperature: 0.5
          maxTokens: 1024
          topP: 0.9

# Memory verification: 0.50 + 0.25 + 0.10 + 0.10 = 0.95 ✓

Best Practices for Multi-Model Setups

  1. Start minimal: Begin with smallest viable allocations, increase based on actual usage
  2. Monitor continuously: Use nvidia-smi to track real memory consumption
  3. Test individually: Validate that each model works on its own before combining them; this isolates issues instead of forcing you to debug several models at once
  4. Plan for headroom: Don’t allocate the full memory. Leave some buffer for memory spikes
  5. Stress test: Simulate peak workloads to ensure stability under load
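To support practice 2 (continuous monitoring), here is a minimal polling sketch; the nvidia-smi query flags are standard, while the 5-second interval and 92% warning threshold are arbitrary choices you should tune.

# Hedged sketch: poll GPU memory usage and flag high utilization.
import subprocess, time

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

while True:
    for line in subprocess.check_output(QUERY, text=True).splitlines():
        idx, used, total = (int(x) for x in line.split(", "))
        pct = used / total
        flag = "  <-- high usage" if pct > 0.92 else ""
        print(f"GPU {idx}: {used}/{total} MiB ({pct:.0%}){flag}")
    time.sleep(5)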

Configuration Parameter Reference

Complete reference for all available parameters.

Core Parameters (All Models)

Parameter | Type | Required | Description
id | string | Yes | Unique model identifier
type | string | Yes | Model type: llm or embedding
modelRepo | string | Yes | HuggingFace model path
name | string | No | Custom display name
gpuMemoryUtilization | float | No | GPU memory fraction (0.0-1.0)

LLM-Specific Parameters

Parameter | Type | Description
contextWindow | integer | Maximum context length
tokenizer | string | HuggingFace tokenizer path
promptStyle | string | Format: qwen, mistral, gemma, gpt-oss
supportReasoning | boolean | Enable reasoning capabilities
supportImage | integer | Number of supported images
supportAudio | integer | Number of supported audio inputs
samplingParams | object | Default sampling configuration
reasoningSamplingParams | object | Sampling for reasoning mode

Embedding-Specific Parameters

Parameter | Type | Description
vectorDim | integer | Output vector dimensions

Sampling Parameters

Parameter | Type | Range | Description
temperature | float | 0.0-2.0 | Randomness in generation
maxTokens | int | 1-8192 | Maximum response tokens
minP | float | 0.0-1.0 | Minimum probability threshold
topP | float | 0.0-1.0 | Nucleus sampling threshold
topK | int | 1-100 | Top-K sampling limit
repetitionPenalty | float | 1.0-2.0 | Penalty for repeated tokens
presencePenalty | float | -2.0 to 2.0 | Penalty for token presence
frequencyPenalty | float | -2.0 to 2.0 | Penalty for token frequency

Multimodal Parameters (LLMs)

multimodal:
  images:
    enabled: boolean      # Enable image input
    maxNumber: integer    # Max images per request
  audio:
    enabled: boolean      # Enable audio input
    maxNumber: integer    # Max audio files per request

Validation Checklist

Before deploying custom configurations:
  • All id values are unique
  • llm and embed models are present
  • Sum of gpuMemoryUtilization ≤ 0.95
  • promptStyle matches model family
  • contextWindow appropriate for memory allocation
  • tokenizer matches or is compatible with model
  • Configuration tested in staging environment
  • Monitoring in place for memory usage
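Several of these checks can be automated. The sketch below assumes the YAML layout shown in this guide (ai.config.models), a hypothetical file name, and PyYAML installed; adapt it to your deployment.

# Hedged sketch: basic structural checks against a custom configuration file.
import yaml

with open("zylon-config.yaml") as f:            # hypothetical file name
    models = yaml.safe_load(f)["ai"]["config"]["models"]

ids = [m["id"] for m in models]
assert len(ids) == len(set(ids)), "duplicate model ids"
assert {"llm", "embed"} <= set(ids), "llm and embed are mandatory"

total = sum(m.get("gpuMemoryUtilization", 0) for m in models)
assert total <= 0.95 + 1e-9, f"gpuMemoryUtilization sum {total:.2f} exceeds 0.95"

print(f"{len(models)} models, memory sum {total:.2f} -> basic checks passed")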

Common Pitfalls

  1. Exceeding the 0.95 memory budget: Always verify your math
  2. Not reducing context windows: Large contexts need more memory, adjust accordingly
  3. Mismatched tokenizers: Use compatible tokenizers for each model
  4. Wrong prompt style: Each model family requires specific formatting
  5. No testing: Always validate in non-production first