Chapter 14

AI Reliability Engineering

eBPF kernel-level telemetry, NVLink + NCCL topology, token-path tracing, chaos simulation, and vendor-agnostic natural language querying for AI/GPU training infrastructure.

14.1 Overview

Target Audience

This chapter is written for:

  • AI Infrastructure Engineers managing GPU cluster networking (InfiniBand fabrics, NVLink topologies, NCCL collective operations)
  • MLOps Engineers responsible for training job reliability, SLO tracking, and post-mortem root cause analysis
  • Site Reliability Engineers supporting LLM inference serving infrastructure

Prerequisites

| Requirement | Notes |
|---|---|
| Enterprise license | All features in this chapter require the Enterprise tier |
| InfluxDB 2.x | Required for time-series metrics (eBPF latency histograms, NVLink throughput, NCCL hang detection). Set INFLUXDB_URL, INFLUXDB_TOKEN, INFLUXDB_ORG, INFLUXDB_BUCKET. |
| Enterprise container image | Build with --target enterprise; see §13.1 |
| LLM provider (optional) | Required only for NL querying (§14.6) and token-path correlation (§14.4). See §14.7 for provider setup. |

Architecture Overview

The AI Reliability Engineering stack layers on top of the core MeshOptixIQ graph:

  1. eBPF Collector — reads kernel TCP/network stats from /proc/net/snmp (or a demo generator) and writes p50/p99/p999 latency histograms to InfluxDB
  2. NVLink + NCCL Engine — ingests NVLink edge data and NCCL AllReduce operation spans; detects hangs by comparing current duration against a rolling average
  3. Tracing Store — accepts OTEL-compatible span batches via POST /tracing/spans, evaluates SLOs, and correlates high-latency spans to network path anomalies
  4. Chaos Engine — performs BFS graph traversal to compute device/job blast radius for a simulated failure; results are stored asynchronously and polled via a result ID
  5. AI Query Service — maintains a 10-turn conversation session, extracts query parameters from natural language, dispatches registered queries, and returns structured results

Demo mode: All features in this chapter operate in demo mode when MESHOPTIXIQ_DEMO_MODE=true and GRAPH_BACKEND=inmemory. No external services (InfluxDB, GPU hardware, real OTEL spans) are required to evaluate the feature set.

14.2 eBPF Telemetry

Pro+ License gate: ebpf_telemetry

EBPFCollector

The EBPFCollector (in collectors/ebpf_collector.py — conceptual; implemented via the API store) samples kernel-level networking statistics and computes latency percentiles for each host:

  • /proc/net/snmp mode — reads TcpRetransSegs, TcpInErrs, and related counters from the Linux kernel's /proc/net/snmp pseudo-file. Available on any Linux host without BPF privileges.
  • Demo mode — generates synthetic per-host metrics seeded from a deterministic RNG; no kernel access required.
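The /proc/net/snmp format is simple enough that the counter-parsing step can be sketched in a few lines of Python. This is an illustrative stand-alone parser, not the shipped collector; the Tcp field names (RetransSegs, InErrs) follow the standard kernel layout.

```python
def parse_proc_net_snmp(text: str) -> dict:
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}.

    The file consists of header/value line pairs sharing a "Proto:" prefix.
    """
    counters: dict = {}
    lines = text.strip().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        proto, fields = header.split(":", 1)
        _, nums = values.split(":", 1)
        counters[proto] = dict(zip(fields.split(), (int(n) for n in nums.split())))
    return counters

# Truncated sample of the Tcp header/value line pair:
sample = (
    "Tcp: RtoAlgorithm RtoMin RtoMax RetransSegs InErrs\n"
    "Tcp: 1 200 120000 42 3\n"
)
tcp = parse_proc_net_snmp(sample)["Tcp"]
print(tcp["RetransSegs"], tcp["InErrs"])  # 42 3
```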

Deploying the eBPF Agent

The eBPF agent is a lightweight Python process that runs directly on each GPU host or network node. It reads kernel TCP statistics, computes latency percentiles, and forwards them to the MeshOptixIQ API on a configurable interval. No BPF privileges or kernel modules are required — the agent uses the /proc/net/snmp pseudo-file on Linux and netstat -s on macOS.

Installation

Install the package that includes the agent on each host you want to monitor:

pip install 'meshoptixiq-network-discovery[ebpf]'

Or, if the package is already installed (e.g. on the same host running the API), the agent command is available immediately — no additional extras are needed.

Linux (systemd)

On any Linux host with Python 3.10+, the agent reads /proc/net/snmp — available on every kernel since 2.2, with no elevated privileges required.

Run manually (foreground, useful for testing):

meshq-ebpf-agent \
  --api-url  http://meshoptixiq.internal:8000 \
  --api-key  <your-api-key> \
  --host     "$(hostname -f)" \
  --interval 30

Install as a systemd service:

# /etc/systemd/system/meshoptixiq-ebpf.service
[Unit]
Description=MeshOptixIQ eBPF Telemetry Agent
After=network.target

[Service]
Type=simple
Environment="MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000"
Environment="MESHOPTIXIQ_API_KEY=<your-api-key>"
# systemd requires an absolute ExecStart path; adjust to where pip
# installed the entry point (check with `which meshq-ebpf-agent`).
ExecStart=/usr/local/bin/meshq-ebpf-agent --host %H --interval 30
Restart=on-failure
RestartSec=15
User=nobody

[Install]
WantedBy=multi-user.target

Then reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now meshoptixiq-ebpf

Run in Docker (alongside your workload container):

docker run -d \
  --name meshoptixiq-ebpf \
  --pid=host \
  --volume /proc/net/snmp:/proc/net/snmp:ro \
  -e MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000 \
  -e MESHOPTIXIQ_API_KEY=<your-api-key> \
  ghcr.io/niccolus/meshoptixiq:latest \
  meshq-ebpf-agent --host "$(hostname -f)" --interval 30

InfiniBand hosts: On DGX / HGX nodes with InfiniBand, also pass --ib-interface mlx5_0 (adjust to your IB device) to enable per-port retransmit tracking alongside the TCP counters.

macOS

macOS does not expose /proc/net/snmp. The agent automatically falls back to netstat -s -p tcp, which provides equivalent TCP retransmit and error counters. No configuration change is needed — the agent detects the OS at startup.

Run manually:

meshq-ebpf-agent \
  --api-url  http://meshoptixiq.internal:8000 \
  --api-key  <your-api-key> \
  --host     "$(hostname -f)" \
  --interval 30

Install as a launchd agent (starts at login and restarts on exit; to run at boot instead, install the plist under /Library/LaunchDaemons):

# ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.meshoptixiq.ebpf-agent</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/meshq-ebpf-agent</string>
    <string>--api-url</string>
    <string>http://meshoptixiq.internal:8000</string>
    <string>--api-key</string>
    <string>YOUR_API_KEY</string>
    <string>--interval</string>
    <string>30</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/meshoptixiq-ebpf.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/meshoptixiq-ebpf-err.log</string>
</dict>
</plist>

Then load and start the agent:

launchctl load ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
launchctl start com.meshoptixiq.ebpf-agent

Agent CLI Reference

| Flag | Env var | Default | Description |
|---|---|---|---|
| --api-url | MESHOPTIXIQ_API_URL | required | Base URL of the MeshOptixIQ API |
| --api-key | MESHOPTIXIQ_API_KEY | required | API key (X-API-Key header) |
| --host | MESHOPTIXIQ_HOST | $(hostname) | Identifier used to associate metrics with a device in the graph |
| --interval | EBPF_POLL_INTERVAL_SEC | 30 | Seconds between polls |
| --ib-interface | | | InfiniBand device name (e.g. mlx5_0) for per-port counters |
| --batch-size | | 100 | Max metric records per POST /ebpf/ingest request |
| --tls-verify | MESHOPTIXIQ_TLS_VERIFY | true | Set false to skip TLS verification (dev only) |

Matching to graph devices: The value passed to --host is matched against device hostnames already ingested in the graph. Use the same hostname or FQDN that appears in your meshq ingest output. If the host is not yet in the graph, metrics are stored but not correlated until a device record is added.

Ingest API

| Method | Endpoint | Description |
|---|---|---|
| POST | /ebpf/ingest | Push a batch of eBPF metric records (host, timestamp, retransmits, latency values) |
| GET | /ebpf/metrics | Retrieve latest metrics per host; optional ?device= filter |
| GET | /ebpf/events | Retrieve anomaly events (microburst, high retransmit rate) per host |

Latency Histograms

Each metric record includes three latency percentiles (in microseconds):

  • p50_us — median round-trip latency
  • p99_us — 99th percentile latency (key SLO indicator)
  • p999_us — 99.9th percentile latency (tail latency for training collective ops)
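As a reference for how these three values relate to raw samples, here is a minimal nearest-rank percentile computation. The shipped collector may aggregate via histogram buckets instead; this sketch only illustrates the semantics of the three fields.

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of raw latency samples (microseconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# 100 samples: a flat baseline plus three tail outliers.
latencies_us = [120.0] * 97 + [900.0, 4000.0, 9500.0]
record = {
    "p50_us": percentile(latencies_us, 50),
    "p99_us": percentile(latencies_us, 99),
    "p999_us": percentile(latencies_us, 99.9),
}
print(record)  # {'p50_us': 120.0, 'p99_us': 4000.0, 'p999_us': 9500.0}
```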

Microburst Detection

A microburst is flagged when the p99 latency for a host exceeds 3× the rolling 5-minute average. The event is stored with a severity field (warning or critical) and surfaced via GET /ebpf/events.
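The detection rule can be sketched as follows. The sample-count window and the warning/critical split are assumptions made for illustration; the documented behavior is a 5-minute rolling average compared against the EBPF_MICROBURST_MULTIPLIER (default 3.0).

```python
from collections import deque

class MicroburstDetector:
    """Flag a microburst when p99 exceeds multiplier x the rolling average."""

    def __init__(self, multiplier: float = 3.0, window: int = 10):
        self.multiplier = multiplier
        # Sample-count window stands in for the 5-minute time window.
        self.history = deque(maxlen=window)

    def observe(self, p99_us: float):
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(p99_us)
        if baseline is None or p99_us <= self.multiplier * baseline:
            return None
        # Severity split is illustrative: 10x the baseline escalates to critical.
        return "critical" if p99_us > 10 * baseline else "warning"

det = MicroburstDetector()
events = [det.observe(v) for v in [100, 110, 105, 400, 2000]]
print(events)  # [None, None, None, 'warning', 'critical']
```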

Environment Variables

No additional environment variables are required in demo mode. For production:

| Variable | Default | Description |
|---|---|---|
| EBPF_POLL_INTERVAL_SEC | 30 | How often to sample /proc/net/snmp |
| EBPF_MICROBURST_MULTIPLIER | 3.0 | Multiplier over rolling average to flag a microburst event |
| EBPF_RETENTION_HOURS | 24 | In-memory event retention window |

14.4 Token-Path Tracing

Enterprise License gate: token_path_tracing

OTEL Span Ingestion

Submit batches of OTEL-compatible span objects via POST /tracing/spans:

{
  "spans": [
    {
      "trace_id": "abc123",
      "span_id": "def456",
      "service": "llm-inference",
      "operation": "generate",
      "start_time_ms": 1711234567000,
      "end_time_ms": 1711234569500,
      "duration_ms": 2500,
      "network_path": ["gpu-srv-01", "ib-sw-01", "gpu-srv-02"],
      "attributes": {
        "model": "llama-3-70b",
        "request_tokens": 1024,
        "response_tokens": 256
      }
    }
  ]
}

SLO Definition

SLOs are defined per (service, operation) pair. The default SLO threshold is 2000 ms (p99). Configure per-service SLOs via environment variables named TRACING_SLO_<service>_<operation>, with hyphens in service names replaced by underscores:

TRACING_SLO_llm_inference_generate=1500   # 1500 ms p99 for llm-inference/generate

p99 Filtering

The GET /tracing/slo-violations endpoint returns spans where duration_ms exceeds the p99 threshold for that service/operation pair, computed over the trailing 1-hour window.

Correlation Algorithm

When an SLO violation is detected, MeshOptixIQ correlates it to network anomalies on the span's declared network_path:

  1. For each device in network_path, look up active eBPF events and NVLink hang flags at the time of the span
  2. Count the number of devices with a matching anomaly (matched_anomaly_count)
  3. Compute confidence: min(1.0, matched_anomaly_count / len(network_path))
  4. Return correlations with confidence >= 0.25
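A minimal sketch of steps 1–4, with the anomaly lookup stubbed as a dict (the real service queries the eBPF event store and NVLink hang flags at span time):

```python
def correlate(network_path: list[str], anomalous: dict[str, bool]):
    """Count path devices with an active anomaly; keep confidence >= 0.25."""
    matched = [d for d in network_path if anomalous.get(d)]
    confidence = min(1.0, len(matched) / len(network_path))
    if confidence < 0.25:
        return None
    return {
        "matched_anomaly_count": len(matched),
        "matched_devices": matched,
        "confidence": confidence,
    }

active = {"ib-sw-01": True}  # one anomalous hop at span time
result = correlate(["gpu-srv-01", "ib-sw-01", "gpu-srv-02"], active)
print(result["confidence"])  # 1/3, above the 0.25 cutoff
```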

Correlation Retrieval

| Method | Endpoint | Description |
|---|---|---|
| GET | /tracing/correlations | List all SLO violations with network correlation data |
| GET | /tracing/correlations?trace_id=<id> | Correlations for a specific trace ID |
| GET | /tracing/slo-violations | Spans that violated their SLO threshold |

14.5 Chaos Simulation

Enterprise License gate: chaos_engineering

BFS Traversal

The chaos engine performs a breadth-first search from the simulated failure node across the network graph to enumerate all affected devices. The search respects VLAN boundaries and directed link state.
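The traversal itself is a textbook BFS. The sketch below omits the VLAN-boundary and link-state checks for brevity and treats the fabric as a plain adjacency map:

```python
from collections import deque

def blast_radius(adjacency: dict[str, list[str]], failed: str) -> set[str]:
    """Enumerate every device reachable from the failed node."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

fabric = {
    "core-sw-01": ["ib-sw-01", "ib-sw-02"],
    "ib-sw-01": ["gpu-srv-01", "gpu-srv-02"],
    "ib-sw-02": ["gpu-srv-03"],
}
print(sorted(blast_radius(fabric, "core-sw-01")))
```

Failing the core switch reaches every device; failing a leaf switch reaches only its downstream endpoints.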

Impact Score Formula

For each simulation, an impact score is computed from two weighted factors:

impact_score = 0.4 * (affected_devices / total_devices)
             + 0.6 * (affected_jobs / total_active_jobs)

A score of 0.0 indicates no impact; 1.0 indicates total network failure affecting all devices and training jobs. The 0.6 weight on jobs reflects that training job disruption is typically more costly than raw device count.
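The formula in code form, with a worked example (the guard clauses for empty clusters are an assumption, not documented behavior):

```python
def impact_score(affected_devices: int, total_devices: int,
                 affected_jobs: int, total_active_jobs: int) -> float:
    """0.4-weighted device share plus 0.6-weighted training-job share."""
    device_share = affected_devices / total_devices if total_devices else 0.0
    job_share = affected_jobs / total_active_jobs if total_active_jobs else 0.0
    return 0.4 * device_share + 0.6 * job_share

# 12 of 48 devices but 3 of 4 active jobs affected: the job term dominates.
print(round(impact_score(12, 48, 3, 4), 2))  # 0.55
```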

Simulation Lifecycle

Simulations are computed asynchronously. The lifecycle follows an async polling pattern:

  1. POST /graph/chaos-simulate — submit simulation; returns {"simulation_id": "...", "status": "running"} with HTTP 202
  2. Poll GET /graph/chaos-results/{simulation_id} — returns HTTP 202 while running, HTTP 200 when complete
  3. On completion, the 200 response includes impact_score, affected_devices, affected_jobs, and the full BFS traversal path
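The client-side polling pattern looks like this. `fetch` is a hypothetical callable returning (status_code, body), for example a thin wrapper around requests.get, so the logic can be shown without a live API:

```python
import time

def poll_chaos_result(fetch, simulation_id: str,
                      interval_s: float = 1.0, timeout_s: float = 60.0) -> dict:
    """Poll GET /graph/chaos-results/{id}: 202 means still running, 200 is done."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status, body = fetch(f"/graph/chaos-results/{simulation_id}")
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"unexpected status {status}: {body}")
        time.sleep(interval_s)
    raise TimeoutError(f"simulation {simulation_id} still running after {timeout_s}s")

# Stubbed fetch: still running twice, then complete.
responses = iter([(202, {}), (202, {}), (200, {"impact_score": 0.55})])
result = poll_chaos_result(lambda path: next(responses), "sim-123", interval_s=0.0)
print(result)  # {'impact_score': 0.55}
```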

Rate Limiting

Chaos simulations are rate-limited to 5 requests per minute per API key to prevent graph traversal abuse. Exceeding this limit returns HTTP 429.

Request Format

POST /graph/chaos-simulate
{
  "device": "core-sw-01",
  "failure_type": "power_failure",
  "include_jobs": true
}

Supported Failure Types

| failure_type | Description |
|---|---|
| device_failure | Single device goes offline (default) |
| power_failure | Power domain failure — all devices sharing a PDU are removed |
| rack_failure | All devices in the same rack are removed |
| switch_failure | Switch failure — all downstream endpoints lose connectivity |

14.6 Natural Language Querying

Enterprise License gate: nl_conversation

Conversation Endpoint

POST /ai/query/conversation
Authorization: Bearer <token>
Content-Type: application/json

{
  "message": "Which GPU servers have the most NCCL hang events in the last hour?",
  "session_id": "optional-session-uuid"
}

Session Management

Each conversation session maintains a 10-turn sliding window. When the window is exceeded, the oldest turn is evicted. Sessions expire after 30 minutes of inactivity. If session_id is omitted, a new session is created and returned in the response.

Parameter Extraction

The AI Query Service uses a two-phase approach to extract query parameters from natural language:

  1. Regex phase — extracts common patterns (device names, IP addresses, VLAN IDs, time ranges) deterministically using compiled regex patterns
  2. LLM phase (optional) — when a configured LLM provider is available, the service sends the message and extracted parameters to the LLM for refinement and disambiguation

The extracted parameters are matched against the query registry to identify the best-fit registered query. Only query names that appear in the registry YAML are eligible for dispatch — the LLM cannot invoke arbitrary queries.
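The regex phase can be illustrated with a few patterns. The three below are examples for this sketch, not the full compiled set the service ships with:

```python
import re

PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "vlan_id": re.compile(r"\bvlan\s*(\d+)\b", re.IGNORECASE),
    "time_range": re.compile(r"\blast\s+(\d+)\s*(minutes?|hours?|days?)\b", re.IGNORECASE),
}

def extract_params(message: str) -> dict:
    """Deterministic regex phase: no LLM call, same result every time."""
    params: dict = {}
    if m := PATTERNS["ip"].search(message):
        params["ip"] = m.group(0)
    if m := PATTERNS["vlan_id"].search(message):
        params["vlan_id"] = int(m.group(1))
    if m := PATTERNS["time_range"].search(message):
        params["time_range"] = f"{m.group(1)} {m.group(2).lower()}"
    return params

print(extract_params("Which devices on VLAN 120 saw drops to 10.0.0.1 in the last 2 hours?"))
# {'ip': '10.0.0.1', 'vlan_id': 120, 'time_range': '2 hours'}
```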

Parameter Preview Endpoint

To preview parameter extraction without executing a query, use:

GET /ai/query/parameters?message=Which+devices+have+BGP+peers+down

Returns the extracted parameters and the matched query name without dispatching.

14.7 Vendor-Agnostic LLM Configuration

MeshOptixIQ does not lock you into a single LLM vendor. The provider registry auto-detects the active provider from environment variables in priority order: Anthropic → OpenAI → Ollama → OpenAI-compatible.

LLM_PROVIDER Values

| LLM_PROVIDER value | Aliases | Required vars |
|---|---|---|
| anthropic | | ANTHROPIC_API_KEY |
| openai | | OPENAI_API_KEY |
| ollama | | OLLAMA_URL (default: http://localhost:11434) |
| openai_compatible | vllm, lmstudio, llama_cpp | LLM_BASE_URL, LLM_API_KEY=local |

Common Environment Variables

| Variable | Description |
|---|---|
| LLM_PROVIDER | Override auto-detection with an explicit provider name |
| LLM_MODEL | Model identifier (e.g., claude-3-5-sonnet-20241022, gpt-4o, llama3.2) |
| LLM_API_KEY | API key (use local for local models) |
| LLM_BASE_URL | Base URL for OpenAI-compatible endpoints |
| ANTHROPIC_API_KEY | Anthropic API key (auto-selects Anthropic provider) |
| OPENAI_API_KEY | OpenAI API key (auto-selects OpenAI provider) |
| OPENAI_BASE_URL | Custom base URL for OpenAI-compatible API |
| OLLAMA_URL | Ollama server URL (auto-selects Ollama provider) |

Installing AI Extras

# For Anthropic + OpenAI support
pip install 'meshoptixiq-network-discovery[ai]'

# Ollama uses stdlib urllib — no extra install needed
# Just set OLLAMA_URL

Ollama Example

# Start Ollama with a local model
ollama pull llama3.2

# Configure MeshOptixIQ to use it
export OLLAMA_URL=http://localhost:11434
export LLM_MODEL=llama3.2

# Verify the provider is detected
curl http://localhost:8000/chat/status
# {"active": true, "provider": "ollama", "model": "llama3.2"}

Local model performance: For natural language query parameter extraction, a 7B–13B parameter model (e.g., Llama 3.2, Mistral 7B) is sufficient. Larger models improve accuracy for ambiguous queries but add latency. The regex pre-pass handles most common query patterns without any LLM call.

Privacy note: When LLM_BASE_URL points to a locally-hosted model, no query data leaves your network. MeshOptixIQ does not send telemetry to any external service.

14.8 MeshQL and NL Query Catalog

MeshQL — Structured Query DSL

For automation pipelines and NOC scripts where determinism matters more than natural language flexibility, MeshQL provides a compact SQL-like syntax that compiles directly to named queries:

POST /queries/meshql
Content-Type: application/json
X-API-Key: $API_KEY

{
  "query": "SHOW NEIGHBORS OF DEVICE \"core-switch-01\"",
  "execute": true
}

Returns {"query_name": "device_neighbors", "params": {"device_name": "core-switch-01"}, "results": [...]}. Parse errors return HTTP 422.

Supported forms:

  • SHOW NEIGHBORS OF DEVICE "name" → device_neighbors
  • SHOW IMPACT IF DEVICE "name" DOWN → blast_radius_device
  • SHOW LOCATION OF IP "10.0.0.1" → locate_endpoint_by_ip
  • SHOW FIREWALL RULES FOR DEVICE "name" → firewall_rules_by_device
  • SHOW BGP PEERS FOR DEVICE "name" → bgp_peers
  • SHOW SUBNET "10.0.0.0/24" → ips_in_subnet
  • SHOW DEVICES [WHERE field = "value"] → all_devices
  • SHOW SUMMARY → summary_stats

Set execute: false to parse only (returns query_name + params without hitting the graph backend). Requires Pro+ (api_access).
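The compilation step is a straightforward grammar-to-registry mapping. Below is a two-rule sketch of the idea; the actual parser covers all eight forms, and the "subnet" parameter name here is an assumption (only device_name is confirmed above):

```python
import re

# Two of the eight supported forms, as (pattern, query_name, param) rules.
RULES = [
    (re.compile(r'^SHOW NEIGHBORS OF DEVICE "([^"]+)"$', re.IGNORECASE),
     "device_neighbors", "device_name"),
    (re.compile(r'^SHOW SUBNET "([^"]+)"$', re.IGNORECASE),
     "ips_in_subnet", "subnet"),
]

def parse_meshql(query: str) -> dict:
    """Compile a MeshQL statement to a named query; a parse error maps to HTTP 422."""
    for pattern, query_name, param in RULES:
        if m := pattern.match(query.strip()):
            return {"query_name": query_name, "params": {param: m.group(1)}}
    raise ValueError("unparseable MeshQL statement")

print(parse_meshql('SHOW NEIGHBORS OF DEVICE "core-switch-01"'))
# {'query_name': 'device_neighbors', 'params': {'device_name': 'core-switch-01'}}
```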

NL Query Catalog

GET /ai/query/catalog returns all 109 registered queries sorted by category — no authentication required. Use it to populate autocomplete dropdowns or discover available queries:

curl https://<host>/ai/query/catalog | jq '.[0:3]'

Each entry includes name, description, category, and parameters list. The NL router uses the same catalog internally for keyword-based fast-path routing before falling back to LLM classification.