How to configure observability

Observability helps you answer three questions:

Is Zylon healthy?
What is it doing?
Where should its metrics go?

Zylon observability has five parts:

Crash reporting tells Zylon when the platform fails, so support can diagnose the problem.
Usage metrics send anonymous product telemetry to Zylon.
Monitoring installs the local monitoring stack inside your cluster.
Platform metrics are the operational metrics emitted by Triton, vLLM, GPUs, and nodes.
Destinations send those metrics to your own monitoring backend.

Getting started

For most setups, think about observability in this order:

Enable monitoring if you want metrics at all.
Enable platformMetrics if you want Triton, vLLM, GPU, and node metrics.
Add destinations if you want to send those metrics to your own backend.
Keep or disable crashReporting and usageMetrics depending on whether you want Zylon telemetry.

Minimal example:

observability:
  monitoring: true
  platformMetrics:
    enabled: true

That gives you local metrics in the in-cluster monitoring stack.
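
Putting the four steps together, a fuller configuration might look like the sketch below. It uses only keys that appear elsewhere on this page. The destination name my-prometheus and its URL are placeholders, and the telemetry values shown are one possible choice, not a recommendation.

```yaml
# Sketch only: combines the keys documented on this page.
observability:
  monitoring: true          # step 1: install the in-cluster monitoring stack
  platformMetrics:
    enabled: true           # step 2: Triton, vLLM, GPU, and node metrics
  crashReporting: true      # step 4: keep crash diagnostics
  usageMetrics: false       # step 4: opt out of anonymous usage telemetry

k8s-monitoring:
  extraDestinations:        # step 3: forward metrics to your own backend
    my-prometheus:          # placeholder destination name
      type: prometheus
      url: https://prometheus.example.com/api/v1/write   # placeholder URL
```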

Crash reporting

observability:
  crashReporting: true

observability.crashReporting controls whether Zylon sends crash diagnostics to Sentry. Enable it if you want Zylon support to have failure information when the platform breaks. Disable it if you do not want any crash diagnostics sent to Zylon.

Usage metrics

observability:
  usageMetrics: true

observability.usageMetrics controls whether Zylon sends anonymous product telemetry to Zylon-managed observability services. This is product-level telemetry, not the detailed Triton or vLLM metrics you use for operating the cluster. Disable it if you do not want to send usage telemetry to Zylon.
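
If you want local metrics but no telemetry sent to Zylon, the two telemetry flags can be turned off together. A minimal sketch, assuming both flags accept false in the same way they accept true:

```yaml
observability:
  monitoring: true        # local monitoring stack stays available
  crashReporting: false   # no crash diagnostics sent
  usageMetrics: false     # no anonymous usage telemetry sent
```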

Monitoring

Monitoring must be enabled if you want local metrics or external metric forwarding.

observability:
  monitoring: true

observability.monitoring installs the in-cluster monitoring stack, including Prometheus, Grafana, and k8s-monitoring. This is the base for everything else related to metrics. If monitoring is disabled, you cannot inspect platform metrics locally and you cannot forward them to your own destinations.

Platform metrics

Platform metrics require monitoring:

observability:
  monitoring: true
  platformMetrics:
    enabled: true

observability.platformMetrics.enabled turns on the operational metrics generated by the inference stack. These are the metrics you use to understand request rate, failures, latency, queue depth, scheduler pressure, GPU usage, and host health. They come from Triton, vLLM, the GPU exporter, and node_exporter. For the full metric configuration, see Platform Metrics.

External destinations

External destinations also require monitoring:

observability:
  monitoring: true

k8s-monitoring:
  extraDestinations:
    my-prometheus:
      type: prometheus
      url: https://prometheus.example.com/api/v1/write

k8s-monitoring.extraDestinations forwards the metrics collected in your cluster to your own monitoring backend. Use it only when you want to send metrics somewhere outside the built-in monitoring stack, for example to Prometheus, Grafana Cloud, or an OTLP collector. For destination setup, see Metrics Destinations.
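
Because extraDestinations is keyed by name, it should be possible to declare more than one destination. The sketch below assumes that. The second entry's name and URL are hypothetical, and authentication settings are omitted because this page does not document them.

```yaml
observability:
  monitoring: true

k8s-monitoring:
  extraDestinations:
    my-prometheus:
      type: prometheus
      url: https://prometheus.example.com/api/v1/write
    long-term-store:        # hypothetical second destination
      type: prometheus
      url: https://metrics.example.org/api/v1/write   # hypothetical URL
```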

If your cluster restricts outbound traffic, telemetry and external destinations may require domain or endpoint allowlisting. If you disable usageMetrics, Zylon’s telemetry domains are not needed.

Next pages

For the core Zylon configuration:

Platform Metrics: enable Triton, vLLM, GPU, and node metrics
Metrics Destinations: send metrics to Prometheus, Grafana-compatible backends, or OTLP

If you use your own Grafana instance, the dashboard is a separate optional step:

External Grafana Dashboard: import the reference dashboard into your own Grafana instance
