AI Reliability Engineering
eBPF kernel-level telemetry, NVLink + NCCL topology, token-path tracing, chaos simulation, and vendor-agnostic natural language querying for AI/GPU training infrastructure.
14.1 Overview
Target Audience
This chapter is written for:
- AI Infrastructure Engineers managing GPU cluster networking (InfiniBand fabrics, NVLink topologies, NCCL collective operations)
- MLOps Engineers responsible for training job reliability, SLO tracking, and post-mortem root cause analysis
- Site Reliability Engineers supporting LLM inference serving infrastructure
Prerequisites
| Requirement | Notes |
|---|---|
| Enterprise license | All features in this chapter require the Enterprise tier |
| InfluxDB 2.x | Required for time-series metrics (eBPF latency histograms, NVLink throughput, NCCL hang detection). Set INFLUXDB_URL, INFLUXDB_TOKEN, INFLUXDB_ORG, INFLUXDB_BUCKET. |
| Enterprise container image | Build with --target enterprise; see §13.1 |
| LLM provider (optional) | Required only for NL querying (§14.6) and token-path correlation (§14.4). See §14.7 for provider setup. |
Architecture Overview
The AI Reliability Engineering stack layers on top of the core MeshOptixIQ graph:
- eBPF Collector — reads kernel TCP/network stats from /proc/net/snmp (or a demo generator) and writes p50/p99/p999 latency histograms to InfluxDB
- NVLink + NCCL Engine — ingests NVLink edge data and NCCL AllReduce operation spans; detects hangs by comparing current duration against a rolling average
- Tracing Store — accepts OTEL-compatible span batches via POST /tracing/spans, evaluates SLOs, and correlates high-latency spans to network path anomalies
- Chaos Engine — performs BFS graph traversal to compute device/job blast radius for a simulated failure; results are stored asynchronously and polled via a result ID
- AI Query Service — maintains a 10-turn conversation session, extracts query parameters from natural language, dispatches registered queries, and returns structured results
Demo mode — set MESHOPTIXIQ_DEMO_MODE=true and GRAPH_BACKEND=inmemory. No external services (InfluxDB, GPU hardware, real OTEL spans) are required to evaluate the feature set.
14.2 eBPF Telemetry
Pro+ License gate: ebpf_telemetry
EBPFCollector
The EBPFCollector (in collectors/ebpf_collector.py — conceptual; implemented via the API store) samples kernel-level networking statistics and computes latency percentiles for each host:
- /proc/net/snmp mode — reads TcpRetransSegs, TcpInErrs, and related counters from the Linux kernel's /proc/net/snmp pseudo-file. Available on any Linux host without BPF privileges.
- Demo mode — generates synthetic per-host metrics seeded from a deterministic RNG; no kernel access required.
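As a rough illustration of the /proc/net/snmp mode, the helper below parses the kernel's header/value line pairs into a counter dict. This is a sketch, not the shipped collector (which the text notes is conceptual); the field names come from the kernel's `Tcp:` rows, where the counters appear as `RetransSegs` and `InErrs`.

```python
def parse_tcp_counters(snmp_text):
    """Parse TCP counters from /proc/net/snmp-formatted text.

    The file holds header/value line pairs, e.g.
        Tcp: RtoAlgorithm ... RetransSegs InErrs
        Tcp: 1 ... 42 3
    Illustrative sketch only; not the EBPFCollector implementation.
    """
    lines = snmp_text.strip().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("Tcp:"):
            continue
        names = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        return dict(zip(names, nums))
    return {}

# On a live host you would feed it the pseudo-file directly:
#   counters = parse_tcp_counters(open("/proc/net/snmp").read())
```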
Deploying the eBPF Agent
The eBPF agent is a lightweight Python process that runs directly on each GPU host or network node. It reads kernel TCP statistics, computes latency percentiles, and forwards them to the MeshOptixIQ API on a configurable interval. No BPF privileges or kernel modules are required — the agent uses the /proc/net/snmp pseudo-file on Linux and netstat -s on macOS.
Installation
Install the package that includes the agent on each host you want to monitor:
pip install 'meshoptixiq-network-discovery[ebpf]'
Or, if the package is already installed (e.g. on the same host running the API), the agent command is available immediately — no additional extras are needed.
Linux (systemd)
On any Linux host with Python 3.10+, the agent reads /proc/net/snmp — available on every kernel since 2.2, with no elevated privileges required.
Run manually (foreground, useful for testing):
meshq-ebpf-agent \
--api-url http://meshoptixiq.internal:8000 \
--api-key <your-api-key> \
--host "$(hostname -f)" \
--interval 30
Install as a systemd service:
# /etc/systemd/system/meshoptixiq-ebpf.service
[Unit]
Description=MeshOptixIQ eBPF Telemetry Agent
After=network.target
[Service]
Type=simple
Environment="MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000"
Environment="MESHOPTIXIQ_API_KEY=<your-api-key>"
ExecStart=/usr/local/bin/meshq-ebpf-agent --host %H --interval 30
Restart=on-failure
RestartSec=15
User=nobody
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now meshoptixiq-ebpf
Run in Docker (alongside your workload container):
docker run -d \
--name meshoptixiq-ebpf \
--pid=host \
--volume /proc/net/snmp:/proc/net/snmp:ro \
-e MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000 \
-e MESHOPTIXIQ_API_KEY=<your-api-key> \
ghcr.io/niccolus/meshoptixiq:latest \
meshq-ebpf-agent --host "$(hostname -f)" --interval 30
On InfiniBand hosts, add --ib-interface mlx5_0 (adjust to your IB device) to enable per-port retransmit tracking alongside the TCP counters.
macOS
macOS does not expose /proc/net/snmp. The agent automatically falls back to netstat -s -p tcp, which provides equivalent TCP retransmit and error counters. No configuration change is needed — the agent detects the OS at startup.
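To make the fallback concrete, here is a hypothetical helper that pulls the retransmit counter out of `netstat -s -p tcp` output. The agent's actual parsing is not documented here; the regex below targets the "data packets (... bytes) retransmitted" line that macOS prints.

```python
import re

def parse_netstat_retransmits(netstat_output):
    """Extract the TCP retransmit count from `netstat -s -p tcp` text.

    Illustrative sketch of the macOS fallback, not the shipped parser.
    """
    # macOS prints a line like "1234 data packets (56789 bytes) retransmitted"
    m = re.search(r"(\d+) data packets \(\d+ bytes\) retransmitted", netstat_output)
    return int(m.group(1)) if m else 0
```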
Run manually:
meshq-ebpf-agent \
--api-url http://meshoptixiq.internal:8000 \
--api-key <your-api-key> \
--host "$(hostname -f)" \
--interval 30
Install as a launchd agent (starts at login and is kept alive; use a LaunchDaemon under /Library/LaunchDaemons if you need it to run at boot):
# ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.meshoptixiq.ebpf-agent</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/meshq-ebpf-agent</string>
<string>--api-url</string>
<string>http://meshoptixiq.internal:8000</string>
<string>--api-key</string>
<string>YOUR_API_KEY</string>
<string>--interval</string>
<string>30</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/meshoptixiq-ebpf.log</string>
<key>StandardErrorPath</key>
<string>/tmp/meshoptixiq-ebpf-err.log</string>
</dict>
</plist>
launchctl load ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
launchctl start com.meshoptixiq.ebpf-agent
Agent CLI Reference
| Flag | Env var | Default | Description |
|---|---|---|---|
--api-url | MESHOPTIXIQ_API_URL | required | Base URL of the MeshOptixIQ API |
--api-key | MESHOPTIXIQ_API_KEY | required | API key (X-API-Key header) |
--host | MESHOPTIXIQ_HOST | $(hostname) | Identifier used to associate metrics with a device in the graph |
--interval | EBPF_POLL_INTERVAL_SEC | 30 | Seconds between polls |
--ib-interface | — | — | InfiniBand device name (e.g. mlx5_0) for per-port counters |
--batch-size | — | 100 | Max metric records per POST /ebpf/ingest request |
--tls-verify | MESHOPTIXIQ_TLS_VERIFY | true | Set false to skip TLS verification (dev only) |
--host is matched against device hostnames already ingested in the graph. Use the same hostname or FQDN that appears in your meshq ingest output. If the host is not yet in the graph, metrics are stored but not correlated until a device record is added.
Ingest API
| Method | Endpoint | Description |
|---|---|---|
POST | /ebpf/ingest | Push a batch of eBPF metric records (host, timestamp, retransmits, latency values) |
GET | /ebpf/metrics | Retrieve latest metrics per host; optional ?device= filter |
GET | /ebpf/events | Retrieve anomaly events (microburst, high retransmit rate) per host |
Latency Histograms
Each metric record includes three latency percentiles (in microseconds):
- p50_us — median round-trip latency
- p99_us — 99th percentile latency (key SLO indicator)
- p999_us — 99.9th percentile latency (tail latency for training collective ops)
Microburst Detection
A microburst is flagged when the p99 latency for a host exceeds 3× the rolling 5-minute average. The event is stored with a severity field (warning or critical) and surfaced via GET /ebpf/events.
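The rule above can be sketched as a small rolling-window detector. The window length (10 samples ≈ 5 minutes at the default 30-second poll interval) and the warning/critical split are assumptions for illustration; only the 3× multiplier comes from the text.

```python
from collections import deque

class MicroburstDetector:
    """Flag a microburst when p99 exceeds `multiplier` x the rolling average.

    Sketch of the documented rule; window sizing and the severity
    split are assumptions, not the shipped implementation.
    """
    def __init__(self, multiplier=3.0, window=10):  # ~5 min at 30 s polls
        self.multiplier = multiplier
        self.samples = deque(maxlen=window)

    def observe(self, p99_us):
        """Record a sample; return a severity string if a burst is flagged."""
        avg = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(p99_us)
        if avg is not None and p99_us > self.multiplier * avg:
            # Assumed split: 2x the trigger threshold escalates to critical.
            return "critical" if p99_us > 2 * self.multiplier * avg else "warning"
        return None
```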
Environment Variables
No additional environment variables are required in demo mode. For production:
| Variable | Default | Description |
|---|---|---|
EBPF_POLL_INTERVAL_SEC | 30 | How often to sample /proc/net/snmp |
EBPF_MICROBURST_MULTIPLIER | 3.0 | Multiplier over rolling average to flag a microburst event |
EBPF_RETENTION_HOURS | 24 | In-memory event retention window |
14.3 NVLink & NCCL Topology
Enterprise License gate: nccl_silicon_mapping
NVLinkEdgeModel Fields
Each NVLink edge in the graph carries the following fields:
| Field | Type | Description |
|---|---|---|
source_device | str | Hostname of the source GPU server |
source_gpu_id | int | GPU index on the source device (0-based) |
dest_device | str | Hostname of the destination GPU server |
dest_gpu_id | int | GPU index on the destination device (0-based) |
link_id | str | Unique identifier for the NVLink connection |
bandwidth_gbps | float | Rated bidirectional bandwidth |
link_state | str | One of active, inactive, or error |
p99_latency_us | float or None | p99 latency from InfluxDB (populated when InfluxDB is configured) |
NCCL Hang Detection
An NCCL operation is marked as hanging when its current elapsed duration exceeds 3× the rolling average duration for operations of the same type (e.g., AllReduce, AllGather, ReduceScatter). The rolling average is computed over the last 20 completed operations per type.
Hanging operations appear with hanging: true in the GET /nccl/operations/active response.
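The hang rule can be sketched directly from the numbers above: a per-type rolling window of the last 20 completed durations, with a 3× threshold on elapsed time. The class below is illustrative; the storage layout is an assumption.

```python
from collections import deque
import time

class HangDetector:
    """Mark an op as hanging when elapsed > 3x the rolling average of the
    last 20 completed ops of the same type (rule from the text; internal
    layout is an illustrative assumption)."""
    WINDOW = 20
    MULTIPLIER = 3.0

    def __init__(self):
        self.completed = {}  # op_type -> deque of recent durations (seconds)

    def record_completion(self, op_type, duration_s):
        self.completed.setdefault(
            op_type, deque(maxlen=self.WINDOW)).append(duration_s)

    def is_hanging(self, op_type, started_at, now=None):
        history = self.completed.get(op_type)
        if not history:
            return False  # no baseline yet; cannot judge
        avg = sum(history) / len(history)
        elapsed = (now or time.time()) - started_at
        return elapsed > self.MULTIPLIER * avg
```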
Merged Topology API
| Method | Endpoint | Description |
|---|---|---|
GET | /nccl/topology/full | Returns NVLink edges merged with IB topology from the main graph. Includes p99 latency values from InfluxDB when available. |
GET | /nccl/operations/active | Returns currently active NCCL operations with hang status |
POST | /nccl/operations | Ingest a new NCCL operation record |
GET | /nvlink/edges | Returns raw NVLink edges (no IB merge) |
InfluxDB p99 Latency Queries
When INFLUXDB_URL is set, the topology endpoint enriches each NVLink edge with a p99 latency value queried from InfluxDB using the Flux query language:
from(bucket: "network_metrics")
|> range(start: -5m)
|> filter(fn: (r) => r._measurement == "nvlink_latency")
|> filter(fn: (r) => r.link_id == "<link_id>")
|> filter(fn: (r) => r._field == "p99_us")
|> last()
14.4 Token-Path Tracing
Enterprise License gate: token_path_tracing
OTEL Span Ingestion
Submit batches of OTEL-compatible span objects via POST /tracing/spans:
{
"spans": [
{
"trace_id": "abc123",
"span_id": "def456",
"service": "llm-inference",
"operation": "generate",
"start_time_ms": 1711234567000,
"end_time_ms": 1711234569500,
"duration_ms": 2500,
"network_path": ["gpu-srv-01", "ib-sw-01", "gpu-srv-02"],
"attributes": {
"model": "llama-3-70b",
"request_tokens": 1024,
"response_tokens": 256
}
}
]
}
SLO Definition
SLOs are defined per (service, operation) pair. The default SLO threshold is 2000 ms (p99). Configure per-service SLOs via environment variable:
TRACING_SLO_llm_inference_generate=1500 # 1500 ms p99 for llm-inference/generate
p99 Filtering
The GET /tracing/slo-violations endpoint returns spans where duration_ms exceeds the p99 threshold for that service/operation pair, computed over the trailing 1-hour window.
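For reference, a trailing-window p99 can be computed with a nearest-rank percentile as sketched below. The exact percentile method used by the endpoint is not documented, so treat this as an assumption.

```python
import math

def p99(durations_ms):
    """Nearest-rank 99th percentile over a window of span durations.

    Illustrative only; the service's percentile method is an assumption.
    """
    if not durations_ms:
        return None
    ordered = sorted(durations_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank: 1-based index
    return ordered[rank - 1]
```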
Correlation Algorithm
When an SLO violation is detected, MeshOptixIQ correlates it to network anomalies on the span's declared network_path:
- For each device in network_path, look up active eBPF events and NVLink hang flags at the time of the span
- Count the number of devices with a matching anomaly (matched_anomaly_count)
- Compute confidence: min(1.0, matched_anomaly_count / len(network_path))
- Return correlations with confidence >= 0.25
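The steps above translate almost line-for-line into code. The span shape matches the §14.4 ingestion example; the anomalies_by_device mapping is an assumed input shape for this sketch.

```python
def correlate(span, anomalies_by_device, min_confidence=0.25):
    """Correlate an SLO-violating span to network anomalies on its path.

    Confidence = matched devices / path length, per the documented
    algorithm. `anomalies_by_device` (device -> list of active anomalies)
    is an assumed shape for illustration.
    """
    path = span.get("network_path", [])
    if not path:
        return None
    matched = [d for d in path if anomalies_by_device.get(d)]
    confidence = min(1.0, len(matched) / len(path))
    if confidence >= min_confidence:
        return {"trace_id": span["trace_id"],
                "matched_devices": matched,
                "confidence": confidence}
    return None
```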
Correlation Retrieval
| Method | Endpoint | Description |
|---|---|---|
GET | /tracing/correlations | List all SLO violations with network correlation data |
GET | /tracing/correlations?trace_id=<id> | Correlations for a specific trace ID |
GET | /tracing/slo-violations | Spans that violated their SLO threshold |
14.5 Chaos Simulation
Enterprise License gate: chaos_engineering
BFS Traversal
The chaos engine performs a breadth-first search from the simulated failure node across the network graph to enumerate all affected devices. The search respects VLAN boundaries and directed link state.
Impact Score Formula
For each simulation, an impact score is computed from two weighted factors:
impact_score = 0.4 * (affected_devices / total_devices)
+ 0.6 * (affected_jobs / total_active_jobs)
A score of 0.0 indicates no impact; 1.0 indicates total network failure affecting all devices and training jobs. The 0.6 weight on jobs reflects that training job disruption is typically more costly than raw device count.
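The formula is a one-liner; the helper below just adds the division-by-zero guards an implementation needs (the guards are an assumption, the weights come from the formula above).

```python
def impact_score(affected_devices, total_devices,
                 affected_jobs, total_active_jobs):
    """Weighted blast-radius score: 0.4 on device fraction, 0.6 on jobs.

    Zero-total guards are an assumption of this sketch.
    """
    device_term = affected_devices / total_devices if total_devices else 0.0
    job_term = affected_jobs / total_active_jobs if total_active_jobs else 0.0
    return 0.4 * device_term + 0.6 * job_term
```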
Simulation Lifecycle
Simulations are computed asynchronously. The lifecycle follows an async polling pattern:
- POST /graph/chaos-simulate — submit simulation; returns {"simulation_id": "...", "status": "running"} with HTTP 202
- Poll GET /graph/chaos-results/{simulation_id} — returns HTTP 202 while running, HTTP 200 when complete
- On completion, the 200 response includes impact_score, affected_devices, affected_jobs, and the full BFS traversal path
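A submit-then-poll client for this lifecycle can be sketched as follows. The HTTP transport is injected as plain callables so the polling logic stands alone; the function name and injection style are assumptions of this sketch, not a shipped client.

```python
import time

def run_chaos_simulation(post, get, device, failure_type="device_failure",
                         poll_interval=2.0, timeout=120.0):
    """Submit a chaos simulation and poll until HTTP 200 or timeout.

    `post(path, payload)` and `get(path)` are injected callables returning
    (status_code, json_body) -- an assumption of this sketch so it can be
    tested without a live API.
    """
    status, body = post("/graph/chaos-simulate",
                        {"device": device, "failure_type": failure_type,
                         "include_jobs": True})
    if status != 202:
        raise RuntimeError(f"submit failed: HTTP {status}")
    sim_id = body["simulation_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, body = get(f"/graph/chaos-results/{sim_id}")
        if status == 200:          # complete: body carries impact_score etc.
            return body
        time.sleep(poll_interval)  # HTTP 202: still running
    raise TimeoutError(f"simulation {sim_id} did not finish in {timeout}s")
```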
Rate Limiting
Chaos simulations are rate-limited to 5 requests per minute per API key to prevent graph traversal abuse. Exceeding this limit returns HTTP 429.
Request Format
POST /graph/chaos-simulate
{
"device": "core-sw-01",
"failure_type": "power_failure",
"include_jobs": true
}
Supported Failure Types
| failure_type | Description |
|---|---|
device_failure | Single device goes offline (default) |
power_failure | Power domain failure — all devices sharing a PDU are removed |
rack_failure | All devices in the same rack are removed |
switch_failure | Switch failure — all downstream endpoints lose connectivity |
14.6 Natural Language Querying
Enterprise License gate: nl_conversation
Conversation Endpoint
POST /ai/query/conversation
Authorization: Bearer <token>
Content-Type: application/json
{
"message": "Which GPU servers have the most NCCL hang events in the last hour?",
"session_id": "optional-session-uuid"
}
Session Management
Each conversation session maintains a 10-turn sliding window. When the window is exceeded, the oldest turn is evicted. Sessions expire after 30 minutes of inactivity. If session_id is omitted, a new session is created and returned in the response.
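The session rules (10-turn sliding window, 30-minute inactivity expiry, server-generated IDs) map naturally onto a bounded deque, as sketched below. The class layout is an illustrative assumption; the constants come from the text.

```python
from collections import deque
import time
import uuid

class ConversationSession:
    """10-turn sliding window with 30-minute inactivity expiry.

    Constants are from the documented behavior; the internal layout
    is an assumption of this sketch.
    """
    MAX_TURNS = 10
    TTL_SEC = 30 * 60

    def __init__(self):
        self.session_id = str(uuid.uuid4())   # returned when none supplied
        self.turns = deque(maxlen=self.MAX_TURNS)  # oldest auto-evicted
        self.last_active = time.monotonic()

    def add_turn(self, user_message, assistant_reply):
        self.turns.append({"user": user_message, "assistant": assistant_reply})
        self.last_active = time.monotonic()

    def expired(self, now=None):
        return ((now or time.monotonic()) - self.last_active) > self.TTL_SEC
```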
Parameter Extraction
The AI Query Service uses a two-phase approach to extract query parameters from natural language:
- Regex phase — extracts common patterns (device names, IP addresses, VLAN IDs, time ranges) deterministically using compiled regex patterns
- LLM phase (optional) — when a configured LLM provider is available, the service sends the message and extracted parameters to the LLM for refinement and disambiguation
The extracted parameters are matched against the query registry to identify the best-fit registered query. Only query names that appear in the registry YAML are eligible for dispatch — the LLM cannot invoke arbitrary queries.
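The deterministic regex phase can be sketched as a small pattern table. The specific patterns and parameter names below are illustrative, not the shipped extractor.

```python
import re

# Illustrative pattern set for the regex phase; the real extractor's
# patterns and parameter names are not documented here.
PATTERNS = {
    "ip_address": re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b"),
    "vlan_id": re.compile(r"\bvlan\s*(\d+)\b", re.IGNORECASE),
    "time_range": re.compile(r"\blast\s+(\d+)\s+(minute|hour|day)s?\b",
                             re.IGNORECASE),
}

def extract_parameters(message):
    """Phase 1: deterministic extraction before any LLM refinement."""
    params = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(message)
        if not m:
            continue
        if name == "time_range":
            params[name] = f"{m.group(1)}{m.group(2)[0].lower()}"  # e.g. "2h"
        else:
            params[name] = m.group(1)
    return params
```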
Parameter Preview Endpoint
To preview parameter extraction without executing a query, use:
GET /ai/query/parameters?message=Which+devices+have+BGP+peers+down
Returns the extracted parameters and the matched query name without dispatching.
14.7 Vendor-Agnostic LLM Configuration
MeshOptixIQ does not lock you into a single LLM vendor. The provider registry auto-detects the active provider from environment variables in priority order: Anthropic → OpenAI → Ollama → OpenAI-compatible.
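The detection order can be sketched as a simple priority chain over environment variables (function name and return shape are assumptions; the ordering and alias table come from this section):

```python
import os

def detect_provider(env=None):
    """Resolve the active LLM provider in documented priority order:
    explicit LLM_PROVIDER, then Anthropic -> OpenAI -> Ollama ->
    OpenAI-compatible. Sketch only; not the shipped registry.
    """
    env = os.environ if env is None else env
    explicit = env.get("LLM_PROVIDER")
    if explicit:
        aliases = {"vllm": "openai_compatible",
                   "lmstudio": "openai_compatible",
                   "llama_cpp": "openai_compatible"}
        return aliases.get(explicit, explicit)
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("OLLAMA_URL"):
        return "ollama"
    if env.get("LLM_BASE_URL"):
        return "openai_compatible"
    return None  # NL features stay disabled
```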
LLM_PROVIDER Values
| LLM_PROVIDER value | Aliases | Required vars |
|---|---|---|
anthropic | — | ANTHROPIC_API_KEY |
openai | — | OPENAI_API_KEY |
ollama | — | OLLAMA_URL (default: http://localhost:11434) |
openai_compatible | vllm, lmstudio, llama_cpp | LLM_BASE_URL, LLM_API_KEY=local |
Common Environment Variables
| Variable | Description |
|---|---|
LLM_PROVIDER | Override auto-detection with an explicit provider name |
LLM_MODEL | Model identifier (e.g., claude-3-5-sonnet-20241022, gpt-4o, llama3.2) |
LLM_API_KEY | API key (use local for local models) |
LLM_BASE_URL | Base URL for OpenAI-compatible endpoints |
ANTHROPIC_API_KEY | Anthropic API key (auto-selects Anthropic provider) |
OPENAI_API_KEY | OpenAI API key (auto-selects OpenAI provider) |
OPENAI_BASE_URL | Custom base URL for OpenAI-compatible API |
OLLAMA_URL | Ollama server URL (auto-selects Ollama provider) |
Installing AI Extras
# For Anthropic + OpenAI support
pip install 'meshoptixiq-network-discovery[ai]'
# Ollama uses stdlib urllib — no extra install needed
# Just set OLLAMA_URL
Ollama Example
# Start Ollama with a local model
ollama pull llama3.2
# Configure MeshOptixIQ to use it
export OLLAMA_URL=http://localhost:11434
export LLM_MODEL=llama3.2
# Verify the provider is detected
curl http://localhost:8000/chat/status
# {"active": true, "provider": "ollama", "model": "llama3.2"}
When LLM_BASE_URL points to a locally-hosted model, no query data leaves your network. MeshOptixIQ does not send telemetry to any external service.
14.8 MeshQL and NL Query Catalog
MeshQL — Structured Query DSL
For automation pipelines and NOC scripts where determinism matters more than natural language flexibility, MeshQL provides a compact SQL-like syntax that compiles directly to named queries:
POST /queries/meshql
Content-Type: application/json
X-API-Key: $API_KEY
{
"query": "SHOW NEIGHBORS OF DEVICE \"core-switch-01\"",
"execute": true
}
Returns {"query_name": "device_neighbors", "params": {"device_name": "core-switch-01"}, "results": [...]}. Parse errors return HTTP 422.
Supported forms:
- SHOW NEIGHBORS OF DEVICE "name" → device_neighbors
- SHOW IMPACT IF DEVICE "name" DOWN → blast_radius_device
- SHOW LOCATION OF IP "10.0.0.1" → locate_endpoint_by_ip
- SHOW FIREWALL RULES FOR DEVICE "name" → firewall_rules_by_device
- SHOW BGP PEERS FOR DEVICE "name" → bgp_peers
- SHOW SUBNET "10.0.0.0/24" → ips_in_subnet
- SHOW DEVICES [WHERE field = "value"] → all_devices
- SHOW SUMMARY → summary_stats
Set execute: false to parse only (returns query_name + params without hitting the graph backend). Requires Pro+ (api_access).
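To illustrate how a MeshQL form compiles to a named query, here is a toy parser covering three of the forms above. It is not the server's grammar; the parameter name for the IP form is an assumption.

```python
import re

# Illustrative rules for three MeshQL forms; the real parser covers the
# full grammar listed above and returns HTTP 422 on parse errors.
RULES = [
    (re.compile(r'^SHOW NEIGHBORS OF DEVICE "([^"]+)"$'),
     "device_neighbors", "device_name"),
    (re.compile(r'^SHOW IMPACT IF DEVICE "([^"]+)" DOWN$'),
     "blast_radius_device", "device_name"),
    (re.compile(r'^SHOW LOCATION OF IP "([^"]+)"$'),
     "locate_endpoint_by_ip", "ip"),  # param name is an assumption
]

def parse_meshql(query):
    """Map a MeshQL string to {"query_name": ..., "params": {...}}."""
    for pattern, query_name, param in RULES:
        m = pattern.match(query.strip())
        if m:
            return {"query_name": query_name, "params": {param: m.group(1)}}
    raise ValueError("unrecognized MeshQL form")  # API responds HTTP 422
```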
NL Query Catalog
GET /ai/query/catalog returns all 109 registered queries sorted by category — no authentication required. Use it to populate autocomplete dropdowns or discover available queries:
curl https://<host>/ai/query/catalog | jq '.[0:3]'
Each entry includes name, description, category, and parameters list. The NL router uses the same catalog internally for keyword-based fast-path routing before falling back to LLM classification.