
Platform metrics are the operational metrics generated by the inference stack running in your cluster. They include:
  • Triton metrics such as request rate, failures, queue depth, and latency
  • vLLM metrics such as scheduler state, KV cache pressure, and token throughput
  • GPU metrics from the DCGM exporter
  • Node metrics from node_exporter
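All four sources expose metrics in the Prometheus text exposition format, which Prometheus scrapes from each exporter's endpoint. As an illustration only (the sample lines below are invented, not real output from your cluster), a minimal parser for that format might look like:

```python
import re

# Minimal parser for the Prometheus text exposition format.
# Illustrative sketch only; in a real deployment Prometheus scrapes
# these endpoints rather than anything parsing them by hand.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="llama",version="1"} 42
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.83
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)$')

def parse_metrics(text):
    """Return {metric_name: [(labels, value), ...]} from exposition text."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE metadata
        m = LINE.match(line)
        if m:
            name, labels, value = m.groups()
            out.setdefault(name, []).append((labels or '', float(value)))
    return out

metrics = parse_metrics(SAMPLE)
print(metrics['nv_inference_request_success'][0][1])  # → 42.0
```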

Before you enable them

Platform metrics require the monitoring stack:
observability:
  monitoring: true

Enable platform metrics

observability:
  monitoring: true
  platformMetrics:
    enabled: true
    generationIntervalMs: 2000
    gpu:
      enabled: true
    inference:
      counterLatencies: true
      histogramLatencies: true
      summaryLatencies: true
      summaryQuantiles: ""

Configuration options

Flag | Default | What it controls
platformMetrics.enabled | false | Turns platform metric collection on
platformMetrics.generationIntervalMs | 2000 | Triton metric generation interval (ms)
platformMetrics.gpu.enabled | true | Includes GPU metrics
platformMetrics.inference.counterLatencies | true | Enables cumulative latency counters
platformMetrics.inference.histogramLatencies | true | Enables latency histograms
platformMetrics.inference.summaryLatencies | true | Enables sliding-window latency summaries
platformMetrics.inference.summaryQuantiles | "" | Overrides Triton’s default summary quantiles
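Assuming summaryQuantiles is forwarded verbatim to Triton's summary_quantiles setting (an assumption; check the Triton metrics reference before relying on it), the upstream format is a comma-separated list of quantile:error pairs. A hypothetical override tracking p50, p90, and p99:

```yaml
observability:
  platformMetrics:
    inference:
      summaryLatencies: true
      # Hypothetical example: quantile:error pairs in Triton's
      # summary_quantiles format. Not a recommended default.
      summaryQuantiles: "0.5:0.05,0.9:0.01,0.99:0.001"
```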

What you get

Triton

Main metric family: nv_*

Examples:
  • nv_inference_request_success
  • nv_inference_request_failure
  • nv_inference_pending_request_count
  • nv_inference_request_duration_us
  • nv_inference_compute_infer_duration_us
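Because nv_inference_request_duration_us and nv_inference_request_success are cumulative counters, the average per-request latency over a window is the ratio of their deltas between two scrapes. A minimal sketch (the scrape values here are invented for illustration):

```python
# Average Triton request latency between two scrapes, computed from the
# cumulative counters nv_inference_request_duration_us (microseconds spent)
# and nv_inference_request_success (completed request count).

def avg_latency_ms(dur_us_prev, dur_us_now, ok_prev, ok_now):
    """Average per-request latency in ms over the scrape interval."""
    requests = ok_now - ok_prev
    if requests <= 0:
        return 0.0  # no requests completed in the window
    return (dur_us_now - dur_us_prev) / requests / 1000.0

# e.g. 50 new requests that together spent 2,500,000 us in the server:
print(avg_latency_ms(1_000_000, 3_500_000, 100, 150))  # → 50.0
```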

vLLM

Main metric families:
  • vllm_llms_v1:*
  • vllm_embeddings_v1:*
Examples:
  • vllm_llms_v1:num_requests_running
  • vllm_llms_v1:kv_cache_usage_perc
  • vllm_llms_v1:time_to_first_token_seconds_bucket
  • vllm_llms_v1:generation_tokens_total
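The _bucket series are cumulative histogram counters, so quantiles such as p95 time-to-first-token are estimated the way PromQL's histogram_quantile does: locate the bucket containing the target rank and interpolate linearly within it. A simplified sketch (the bucket bounds and counts below are made up, not real vllm_llms_v1:time_to_first_token_seconds_bucket output):

```python
# Estimate a quantile from cumulative Prometheus histogram buckets,
# mimicking PromQL's histogram_quantile (linear interpolation within
# the bucket that contains the target rank).

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within this bucket.
            width = upper_bound - lower_bound
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + width * frac
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# Invented TTFT buckets: 10 requests under 50 ms, 40 under 100 ms, ...
ttft_buckets = [(0.05, 10), (0.1, 40), (0.25, 90), (0.5, 100)]
print(histogram_quantile(0.95, ttft_buckets))  # → 0.375
```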

GPU

Examples:
  • nv_gpu_utilization
  • nv_gpu_memory_used_bytes
  • nv_gpu_power_usage

Node

Main metric family: node_*

Examples:
  • node_cpu_seconds_total
  • node_memory_MemAvailable_bytes
  • node_filesystem_avail_bytes
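node_cpu_seconds_total is a per-CPU, per-mode cumulative counter, so CPU utilization over a window is one minus the idle share of the delta between two scrapes. A sketch with invented scrape values:

```python
# CPU utilization from two scrapes of node_cpu_seconds_total, with the
# per-CPU series already summed into {mode: cumulative seconds}.
# Numbers are invented for illustration.

def cpu_utilization(prev, now):
    """Return the busy fraction of CPU time over the scrape interval."""
    deltas = {mode: now[mode] - prev[mode] for mode in now}
    total = sum(deltas.values())
    if total == 0:
        return 0.0  # no elapsed CPU time recorded
    return 1.0 - deltas.get("idle", 0.0) / total

prev = {"idle": 1000.0, "user": 200.0, "system": 100.0}
now = {"idle": 1060.0, "user": 230.0, "system": 110.0}
print(cpu_utilization(prev, now))  # 40 busy seconds out of 100 → 0.4
```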

Next step

After platform metrics are enabled, the collected series are available through your monitoring stack. For the complete upstream metric lists, see the Triton metrics reference and the vLLM metrics reference.