AI Reliability Engineering
eBPF kernel-level telemetry, NVLink + NCCL topology, token-path tracing, chaos simulation, and vendor-agnostic natural language querying for AI/GPU training infrastructure.
14.1 Overview
Target Audience
This chapter is written for:
- AI Infrastructure Engineers managing GPU cluster networking (InfiniBand fabrics, NVLink topologies, NCCL collective operations)
- MLOps Engineers responsible for training job reliability, SLO tracking, and post-mortem root cause analysis
- Site Reliability Engineers supporting LLM inference serving infrastructure
Prerequisites
| Requirement | Notes |
|---|---|
| Enterprise license | All features in this chapter require the Enterprise tier |
| InfluxDB 2.x | Required for time-series metrics (eBPF latency histograms, NVLink throughput, NCCL hang detection). Set INFLUXDB_URL, INFLUXDB_TOKEN, INFLUXDB_ORG, INFLUXDB_BUCKET. |
| Enterprise container image | Build with --target enterprise; see §13.1 |
| LLM provider (optional) | Required only for NL querying (§14.6) and token-path correlation (§14.4). See §14.7 for provider setup. |
Architecture Overview
The AI Reliability Engineering stack layers on top of the core MeshOptixIQ graph:
- eBPF Collector — reads kernel TCP/network stats from /proc/net/snmp (or a demo generator) and writes p50/p99/p999 latency histograms to InfluxDB
- NVLink + NCCL Engine — ingests NVLink edge data and NCCL AllReduce operation spans; detects hangs by comparing current duration against a rolling average
- Tracing Store — accepts OTEL-compatible span batches via POST /tracing/spans, evaluates SLOs, and correlates high-latency spans to network path anomalies
- Chaos Engine — performs BFS graph traversal to compute device/job blast radius for a simulated failure; results are stored asynchronously and polled via a result ID
- AI Query Service — maintains a 10-turn conversation session, extracts query parameters from natural language, dispatches registered queries, and returns structured results
Demo mode — set MESHOPTIXIQ_DEMO_MODE=true and GRAPH_BACKEND=inmemory. No external services (InfluxDB, GPU hardware, real OTEL spans) are required to evaluate the feature set.
14.2 eBPF Telemetry
Pro+ License gate: ebpf_telemetry
EBPFCollector
The EBPFCollector (in collectors/ebpf_collector.py — conceptual; implemented via the API store) samples kernel-level networking statistics and computes latency percentiles for each host:
- /proc/net/snmp mode — reads TcpRetransSegs, TcpInErrs, and related counters from the Linux kernel's /proc/net/snmp pseudo-file. Available on any Linux host without BPF privileges.
- Demo mode — generates synthetic per-host metrics seeded from a deterministic RNG; no kernel access required.
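As a rough illustration of the /proc/net/snmp mode, the helper below parses the kernel's header/value line pairs into a counter dict. This is a sketch, not the shipped collector (which the text notes is conceptual); the field names come from the kernel's `Tcp:` rows, where the counters appear as `RetransSegs` and `InErrs`.

```python
def parse_tcp_counters(snmp_text):
    """Parse TCP counters from /proc/net/snmp-formatted text.

    The file holds header/value line pairs, e.g.
        Tcp: RtoAlgorithm ... RetransSegs InErrs
        Tcp: 1 ... 42 3
    Illustrative sketch only; not the EBPFCollector implementation.
    """
    lines = snmp_text.strip().splitlines()
    for header, values in zip(lines[::2], lines[1::2]):
        if not header.startswith("Tcp:"):
            continue
        names = header.split()[1:]
        nums = [int(v) for v in values.split()[1:]]
        return dict(zip(names, nums))
    return {}

# On a live host you would feed it the pseudo-file directly:
#   counters = parse_tcp_counters(open("/proc/net/snmp").read())
```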
Deploying the eBPF Agent
The eBPF agent is a lightweight Python process that runs directly on each GPU host or network node. It reads kernel TCP statistics, computes latency percentiles, and forwards them to the MeshOptixIQ API on a configurable interval. No BPF privileges or kernel modules are required — the agent uses the /proc/net/snmp pseudo-file on Linux and netstat -s on macOS.
Installation
Install the package that includes the agent on each host you want to monitor:
pip install 'meshoptixiq-network-discovery[ebpf]'
Or, if the package is already installed (e.g. on the same host running the API), the agent command is available immediately — no additional extras are needed.
Linux (systemd)
On any Linux host with Python 3.10+, the agent reads /proc/net/snmp — available on every kernel since 2.2, with no elevated privileges required.
Run manually (foreground, useful for testing):
meshq-ebpf-agent \
--api-url http://meshoptixiq.internal:8000 \
--api-key <your-api-key> \
--host "$(hostname -f)" \
--interval 30
Install as a systemd service:
# /etc/systemd/system/meshoptixiq-ebpf.service
[Unit]
Description=MeshOptixIQ eBPF Telemetry Agent
After=network.target
[Service]
Type=simple
Environment="MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000"
Environment="MESHOPTIXIQ_API_KEY=<your-api-key>"
ExecStart=/usr/local/bin/meshq-ebpf-agent --host %H --interval 30
Restart=on-failure
RestartSec=15
User=nobody
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now meshoptixiq-ebpf
Run in Docker (alongside your workload container):
docker run -d \
--name meshoptixiq-ebpf \
--pid=host \
--volume /proc/net/snmp:/proc/net/snmp:ro \
-e MESHOPTIXIQ_API_URL=http://meshoptixiq.internal:8000 \
-e MESHOPTIXIQ_API_KEY=<your-api-key> \
ghcr.io/niccolus/meshoptixiq:latest \
meshq-ebpf-agent --host "$(hostname -f)" --interval 30
On InfiniBand hosts, add --ib-interface mlx5_0 (adjust to your IB device) to enable per-port retransmit tracking alongside the TCP counters.
macOS
macOS does not expose /proc/net/snmp. The agent automatically falls back to netstat -s -p tcp, which provides equivalent TCP retransmit and error counters. No configuration change is needed — the agent detects the OS at startup.
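To make the fallback concrete, here is a hypothetical helper that pulls the retransmit counter out of `netstat -s -p tcp` output. The agent's actual parsing is not documented here; the regex below targets the "data packets (... bytes) retransmitted" line that macOS prints.

```python
import re

def parse_netstat_retransmits(netstat_output):
    """Extract the TCP retransmit count from `netstat -s -p tcp` text.

    Illustrative sketch of the macOS fallback, not the shipped parser.
    """
    # macOS prints a line like "1234 data packets (56789 bytes) retransmitted"
    m = re.search(r"(\d+) data packets \(\d+ bytes\) retransmitted", netstat_output)
    return int(m.group(1)) if m else 0
```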
Run manually:
meshq-ebpf-agent \
--api-url http://meshoptixiq.internal:8000 \
--api-key <your-api-key> \
--host "$(hostname -f)" \
--interval 30
Install as a launchd agent (starts at login and is kept alive; use a LaunchDaemon under /Library/LaunchDaemons if you need it to run at boot):
# ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.meshoptixiq.ebpf-agent</string>
<key>ProgramArguments</key>
<array>
<string>/usr/local/bin/meshq-ebpf-agent</string>
<string>--api-url</string>
<string>http://meshoptixiq.internal:8000</string>
<string>--api-key</string>
<string>YOUR_API_KEY</string>
<string>--interval</string>
<string>30</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/meshoptixiq-ebpf.log</string>
<key>StandardErrorPath</key>
<string>/tmp/meshoptixiq-ebpf-err.log</string>
</dict>
</plist>
launchctl load ~/Library/LaunchAgents/com.meshoptixiq.ebpf-agent.plist
launchctl start com.meshoptixiq.ebpf-agent
Agent CLI Reference
| Flag | Env var | Default | Description |
|---|---|---|---|
--api-url | MESHOPTIXIQ_API_URL | required | Base URL of the MeshOptixIQ API |
--api-key | MESHOPTIXIQ_API_KEY | required | API key (X-API-Key header) |
--host | MESHOPTIXIQ_HOST | $(hostname) | Identifier used to associate metrics with a device in the graph |
--interval | EBPF_POLL_INTERVAL_SEC | 30 | Seconds between polls |
--ib-interface | — | — | InfiniBand device name (e.g. mlx5_0) for per-port counters |
--batch-size | — | 100 | Max metric records per POST /ebpf/ingest request |
--tls-verify | MESHOPTIXIQ_TLS_VERIFY | true | Set false to skip TLS verification (dev only) |
--host is matched against device hostnames already ingested in the graph. Use the same hostname or FQDN that appears in your meshq ingest output. If the host is not yet in the graph, metrics are stored but not correlated until a device record is added.
Ingest API
| Method | Endpoint | Description |
|---|---|---|
POST | /ebpf/ingest | Push a batch of eBPF metric records (host, timestamp, retransmits, latency values) |
GET | /ebpf/metrics | Retrieve latest metrics per host; optional ?device= filter |
GET | /ebpf/events | Retrieve anomaly events (microburst, high retransmit rate) per host |
Latency Histograms
Each metric record includes three latency percentiles (in microseconds):
- p50_us — median round-trip latency
- p99_us — 99th percentile latency (key SLO indicator)
- p999_us — 99.9th percentile latency (tail latency for training collective ops)
Microburst Detection
A microburst is flagged when the p99 latency for a host exceeds 3× the rolling 5-minute average. The event is stored with a severity field (warning or critical) and surfaced via GET /ebpf/events.
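The rule above can be sketched as a small rolling-window detector. The window length (10 samples ≈ 5 minutes at the default 30-second poll interval) and the warning/critical split are assumptions for illustration; only the 3× multiplier comes from the text.

```python
from collections import deque

class MicroburstDetector:
    """Flag a microburst when p99 exceeds `multiplier` x the rolling average.

    Sketch of the documented rule; window sizing and the severity
    split are assumptions, not the shipped implementation.
    """
    def __init__(self, multiplier=3.0, window=10):  # ~5 min at 30 s polls
        self.multiplier = multiplier
        self.samples = deque(maxlen=window)

    def observe(self, p99_us):
        """Record a sample; return a severity string if a burst is flagged."""
        avg = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(p99_us)
        if avg is not None and p99_us > self.multiplier * avg:
            # Assumed split: 2x the trigger threshold escalates to critical.
            return "critical" if p99_us > 2 * self.multiplier * avg else "warning"
        return None
```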
Environment Variables
No additional environment variables are required in demo mode. For production:
| Variable | Default | Description |
|---|---|---|
EBPF_POLL_INTERVAL_SEC | 30 | How often to sample /proc/net/snmp |
EBPF_MICROBURST_MULTIPLIER | 3.0 | Multiplier over rolling average to flag a microburst event |
EBPF_RETENTION_HOURS | 24 | In-memory event retention window |
14.3 NVLink & NCCL Topology
Enterprise License gate: nccl_silicon_mapping
NVLinkEdgeModel Fields
Each NVLink edge in the graph carries the following fields:
| Field | Type | Description |
|---|---|---|
source_device | str | Hostname of the source GPU server |
source_gpu_id | int | GPU index on the source device (0-based) |
dest_device | str | Hostname of the destination GPU server |
dest_gpu_id | int | GPU index on the destination device (0-based) |
link_id | str | Unique identifier for the NVLink connection |
bandwidth_gbps | float | Rated bidirectional bandwidth |
link_state | str | One of active, inactive, or error |
p99_latency_us | float or None | p99 latency from InfluxDB (populated when InfluxDB is configured) |
NCCL Hang Detection
An NCCL operation is marked as hanging when its current elapsed duration exceeds 3× the rolling average duration for operations of the same type (e.g., AllReduce, AllGather, ReduceScatter). The rolling average is computed over the last 20 completed operations per type.
Hanging operations appear with hanging: true in the GET /nccl/operations/active response.
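The hang rule can be sketched directly from the numbers above: a per-type rolling window of the last 20 completed durations, with a 3× threshold on elapsed time. The class below is illustrative; the storage layout is an assumption.

```python
from collections import deque
import time

class HangDetector:
    """Mark an op as hanging when elapsed > 3x the rolling average of the
    last 20 completed ops of the same type (rule from the text; internal
    layout is an illustrative assumption)."""
    WINDOW = 20
    MULTIPLIER = 3.0

    def __init__(self):
        self.completed = {}  # op_type -> deque of recent durations (seconds)

    def record_completion(self, op_type, duration_s):
        self.completed.setdefault(
            op_type, deque(maxlen=self.WINDOW)).append(duration_s)

    def is_hanging(self, op_type, started_at, now=None):
        history = self.completed.get(op_type)
        if not history:
            return False  # no baseline yet; cannot judge
        avg = sum(history) / len(history)
        elapsed = (now or time.time()) - started_at
        return elapsed > self.MULTIPLIER * avg
```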
Merged Topology API
| Method | Endpoint | Description |
|---|---|---|
GET | /nccl/topology/full | Returns NVLink edges merged with IB topology from the main graph. Includes p99 latency values from InfluxDB when available. |
GET | /nccl/operations/active | Returns currently active NCCL operations with hang status |
POST | /nccl/operations | Ingest a new NCCL operation record |
GET | /nvlink/edges | Returns raw NVLink edges (no IB merge) |
InfluxDB p99 Latency Queries
When INFLUXDB_URL is set, the topology endpoint enriches each NVLink edge with a p99 latency value queried from InfluxDB using the Flux query language:
from(bucket: "network_metrics")
|> range(start: -5m)
|> filter(fn: (r) => r._measurement == "nvlink_latency")
|> filter(fn: (r) => r.link_id == "<link_id>")
|> filter(fn: (r) => r._field == "p99_us")
|> last()
14.4 Token-Path Tracing
Enterprise License gate: token_path_tracing
OTEL Span Ingestion
Submit batches of OTEL-compatible span objects via POST /tracing/spans:
{
"spans": [
{
"trace_id": "abc123",
"span_id": "def456",
"service": "llm-inference",
"operation": "generate",
"start_time_ms": 1711234567000,
"end_time_ms": 1711234569500,
"duration_ms": 2500,
"network_path": ["gpu-srv-01", "ib-sw-01", "gpu-srv-02"],
"attributes": {
"model": "llama-3-70b",
"request_tokens": 1024,
"response_tokens": 256
}
}
]
}
SLO Definition
SLOs are defined per (service, operation) pair. The default SLO threshold is 2000 ms (p99). Configure per-service SLOs via environment variable:
TRACING_SLO_llm_inference_generate=1500 # 1500 ms p99 for llm-inference/generate
p99 Filtering
The GET /tracing/slo-violations endpoint returns spans where duration_ms exceeds the p99 threshold for that service/operation pair, computed over the trailing 1-hour window.
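For reference, a trailing-window p99 can be computed with a nearest-rank percentile as sketched below. The exact percentile method used by the endpoint is not documented, so treat this as an assumption.

```python
import math

def p99(durations_ms):
    """Nearest-rank 99th percentile over a window of span durations.

    Illustrative only; the service's percentile method is an assumption.
    """
    if not durations_ms:
        return None
    ordered = sorted(durations_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank: 1-based index
    return ordered[rank - 1]
```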
Correlation Algorithm
When an SLO violation is detected, MeshOptixIQ correlates it to network anomalies on the span's declared network_path:
- For each device in network_path, look up active eBPF events and NVLink hang flags at the time of the span
- Count the number of devices with a matching anomaly (matched_anomaly_count)
- Compute confidence: min(1.0, matched_anomaly_count / len(network_path))
- Return correlations with confidence >= 0.25
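The steps above translate almost line-for-line into code. The span shape matches the §14.4 ingestion example; the anomalies_by_device mapping is an assumed input shape for this sketch.

```python
def correlate(span, anomalies_by_device, min_confidence=0.25):
    """Correlate an SLO-violating span to network anomalies on its path.

    Confidence = matched devices / path length, per the documented
    algorithm. `anomalies_by_device` (device -> list of active anomalies)
    is an assumed shape for illustration.
    """
    path = span.get("network_path", [])
    if not path:
        return None
    matched = [d for d in path if anomalies_by_device.get(d)]
    confidence = min(1.0, len(matched) / len(path))
    if confidence >= min_confidence:
        return {"trace_id": span["trace_id"],
                "matched_devices": matched,
                "confidence": confidence}
    return None
```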
Correlation Retrieval
| Method | Endpoint | Description |
|---|---|---|
GET | /tracing/correlations | List all SLO violations with network correlation data |
GET | /tracing/correlations?trace_id=<id> | Correlations for a specific trace ID |
GET | /tracing/slo-violations | Spans that violated their SLO threshold |
14.5 Chaos Simulation
Enterprise License gate: chaos_engineering
BFS Traversal
The chaos engine performs a breadth-first search from the simulated failure node across the network graph to enumerate all affected devices. The search respects VLAN boundaries and directed link state.
Impact Score Formula
For each simulation, an impact score is computed from two weighted factors:
impact_score = 0.4 * (affected_devices / total_devices)
+ 0.6 * (affected_jobs / total_active_jobs)
A score of 0.0 indicates no impact; 1.0 indicates total network failure affecting all devices and training jobs. The 0.6 weight on jobs reflects that training job disruption is typically more costly than raw device count.
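The formula is a one-liner; the helper below just adds the division-by-zero guards an implementation needs (the guards are an assumption, the weights come from the formula above).

```python
def impact_score(affected_devices, total_devices,
                 affected_jobs, total_active_jobs):
    """Weighted blast-radius score: 0.4 on device fraction, 0.6 on jobs.

    Zero-total guards are an assumption of this sketch.
    """
    device_term = affected_devices / total_devices if total_devices else 0.0
    job_term = affected_jobs / total_active_jobs if total_active_jobs else 0.0
    return 0.4 * device_term + 0.6 * job_term
```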
Simulation Lifecycle
Simulations are computed asynchronously. The lifecycle follows an async polling pattern:
- POST /graph/chaos-simulate — submit simulation; returns {"simulation_id": "...", "status": "running"} with HTTP 202
- Poll GET /graph/chaos-results/{simulation_id} — returns HTTP 202 while running, HTTP 200 when complete
- On completion, the 200 response includes impact_score, affected_devices, affected_jobs, and the full BFS traversal path
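A submit-then-poll client for this lifecycle can be sketched as follows. The HTTP transport is injected as plain callables so the polling logic stands alone; the function name and injection style are assumptions of this sketch, not a shipped client.

```python
import time

def run_chaos_simulation(post, get, device, failure_type="device_failure",
                         poll_interval=2.0, timeout=120.0):
    """Submit a chaos simulation and poll until HTTP 200 or timeout.

    `post(path, payload)` and `get(path)` are injected callables returning
    (status_code, json_body) -- an assumption of this sketch so it can be
    tested without a live API.
    """
    status, body = post("/graph/chaos-simulate",
                        {"device": device, "failure_type": failure_type,
                         "include_jobs": True})
    if status != 202:
        raise RuntimeError(f"submit failed: HTTP {status}")
    sim_id = body["simulation_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, body = get(f"/graph/chaos-results/{sim_id}")
        if status == 200:          # complete: body carries impact_score etc.
            return body
        time.sleep(poll_interval)  # HTTP 202: still running
    raise TimeoutError(f"simulation {sim_id} did not finish in {timeout}s")
```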
Rate Limiting
Chaos simulations are rate-limited to 5 requests per minute per API key to prevent graph traversal abuse. Exceeding this limit returns HTTP 429.
Request Format
POST /graph/chaos-simulate
{
"device": "core-sw-01",
"failure_type": "power_failure",
"include_jobs": true
}
Supported Failure Types
| failure_type | Description |
|---|---|
device_failure | Single device goes offline (default) |
power_failure | Power domain failure — all devices sharing a PDU are removed |
rack_failure | All devices in the same rack are removed |
switch_failure | Switch failure — all downstream endpoints lose connectivity |
14.6 Natural Language Querying
Enterprise License gate: nl_conversation
Conversation Endpoint
POST /ai/query/conversation
Authorization: Bearer <token>
Content-Type: application/json
{
"message": "Which GPU servers have the most NCCL hang events in the last hour?",
"session_id": "optional-session-uuid"
}
Session Management
Each conversation session maintains a 10-turn sliding window. When the window is exceeded, the oldest turn is evicted. Sessions expire after 30 minutes of inactivity. If session_id is omitted, a new session is created and returned in the response.
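The session rules (10-turn sliding window, 30-minute inactivity expiry, server-generated IDs) map naturally onto a bounded deque, as sketched below. The class layout is an illustrative assumption; the constants come from the text.

```python
from collections import deque
import time
import uuid

class ConversationSession:
    """10-turn sliding window with 30-minute inactivity expiry.

    Constants are from the documented behavior; the internal layout
    is an assumption of this sketch.
    """
    MAX_TURNS = 10
    TTL_SEC = 30 * 60

    def __init__(self):
        self.session_id = str(uuid.uuid4())   # returned when none supplied
        self.turns = deque(maxlen=self.MAX_TURNS)  # oldest auto-evicted
        self.last_active = time.monotonic()

    def add_turn(self, user_message, assistant_reply):
        self.turns.append({"user": user_message, "assistant": assistant_reply})
        self.last_active = time.monotonic()

    def expired(self, now=None):
        return ((now or time.monotonic()) - self.last_active) > self.TTL_SEC
```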
Parameter Extraction
The AI Query Service uses a two-phase approach to extract query parameters from natural language:
- Regex phase — extracts common patterns (device names, IP addresses, VLAN IDs, time ranges) deterministically using compiled regex patterns
- LLM phase (optional) — when a configured LLM provider is available, the service sends the message and extracted parameters to the LLM for refinement and disambiguation
The extracted parameters are matched against the query registry to identify the best-fit registered query. Only query names that appear in the registry YAML are eligible for dispatch — the LLM cannot invoke arbitrary queries.
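The deterministic regex phase can be sketched as a small pattern table. The specific patterns and parameter names below are illustrative, not the shipped extractor.

```python
import re

# Illustrative pattern set for the regex phase; the real extractor's
# patterns and parameter names are not documented here.
PATTERNS = {
    "ip_address": re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b"),
    "vlan_id": re.compile(r"\bvlan\s*(\d+)\b", re.IGNORECASE),
    "time_range": re.compile(r"\blast\s+(\d+)\s+(minute|hour|day)s?\b",
                             re.IGNORECASE),
}

def extract_parameters(message):
    """Phase 1: deterministic extraction before any LLM refinement."""
    params = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(message)
        if not m:
            continue
        if name == "time_range":
            params[name] = f"{m.group(1)}{m.group(2)[0].lower()}"  # e.g. "2h"
        else:
            params[name] = m.group(1)
    return params
```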
Parameter Preview Endpoint
To preview parameter extraction without executing a query, use:
GET /ai/query/parameters?message=Which+devices+have+BGP+peers+down
Returns the extracted parameters and the matched query name without dispatching.
14.7 Vendor-Agnostic LLM Configuration
MeshOptixIQ does not lock you into a single LLM vendor. The provider registry auto-detects the active provider from environment variables in priority order: Anthropic → OpenAI → Ollama → OpenAI-compatible.
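The detection order can be sketched as a simple priority chain over environment variables (function name and return shape are assumptions; the ordering and alias table come from this section):

```python
import os

def detect_provider(env=None):
    """Resolve the active LLM provider in documented priority order:
    explicit LLM_PROVIDER, then Anthropic -> OpenAI -> Ollama ->
    OpenAI-compatible. Sketch only; not the shipped registry.
    """
    env = os.environ if env is None else env
    explicit = env.get("LLM_PROVIDER")
    if explicit:
        aliases = {"vllm": "openai_compatible",
                   "lmstudio": "openai_compatible",
                   "llama_cpp": "openai_compatible"}
        return aliases.get(explicit, explicit)
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("OLLAMA_URL"):
        return "ollama"
    if env.get("LLM_BASE_URL"):
        return "openai_compatible"
    return None  # NL features stay disabled
```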
LLM_PROVIDER Values
| LLM_PROVIDER value | Aliases | Required vars |
|---|---|---|
anthropic | — | ANTHROPIC_API_KEY |
openai | — | OPENAI_API_KEY |
ollama | — | OLLAMA_URL (default: http://localhost:11434) |
openai_compatible | vllm, lmstudio, llama_cpp | LLM_BASE_URL, LLM_API_KEY=local |
Common Environment Variables
| Variable | Description |
|---|---|
LLM_PROVIDER | Override auto-detection with an explicit provider name |
LLM_MODEL | Model identifier (e.g., claude-3-5-sonnet-20241022, gpt-4o, llama3.2) |
LLM_API_KEY | API key (use local for local models) |
LLM_BASE_URL | Base URL for OpenAI-compatible endpoints |
ANTHROPIC_API_KEY | Anthropic API key (auto-selects Anthropic provider) |
OPENAI_API_KEY | OpenAI API key (auto-selects OpenAI provider) |
OPENAI_BASE_URL | Custom base URL for OpenAI-compatible API |
OLLAMA_URL | Ollama server URL (auto-selects Ollama provider) |
Installing AI Extras
# For Anthropic + OpenAI support
pip install 'meshoptixiq-network-discovery[ai]'
# Ollama uses stdlib urllib — no extra install needed
# Just set OLLAMA_URL
Ollama Example
# Start Ollama with a local model
ollama pull llama3.2
# Configure MeshOptixIQ to use it
export OLLAMA_URL=http://localhost:11434
export LLM_MODEL=llama3.2
# Verify the provider is detected
curl http://localhost:8000/chat/status
# {"active": true, "provider": "ollama", "model": "llama3.2"}
When LLM_BASE_URL points to a locally-hosted model, no query data leaves your network. MeshOptixIQ does not send telemetry to any external service.
14.8 MeshQL and NL Query Catalog
MeshQL — Structured Query DSL
For automation pipelines and NOC scripts where determinism matters more than natural language flexibility, MeshQL provides a compact SQL-like syntax that compiles directly to named queries:
POST /queries/meshql
Content-Type: application/json
X-API-Key: $API_KEY
{
"query": "SHOW NEIGHBORS OF DEVICE \"core-switch-01\"",
"execute": true
}
Returns {"query_name": "device_neighbors", "params": {"device_name": "core-switch-01"}, "results": [...]}. Parse errors return HTTP 422.
Supported forms:
- SHOW NEIGHBORS OF DEVICE "name" → device_neighbors
- SHOW IMPACT IF DEVICE "name" DOWN → blast_radius_device
- SHOW LOCATION OF IP "10.0.0.1" → locate_endpoint_by_ip
- SHOW FIREWALL RULES FOR DEVICE "name" → firewall_rules_by_device
- SHOW BGP PEERS FOR DEVICE "name" → bgp_peers
- SHOW SUBNET "10.0.0.0/24" → ips_in_subnet
- SHOW DEVICES [WHERE field = "value"] → all_devices
- SHOW SUMMARY → summary_stats
Set execute: false to parse only (returns query_name + params without hitting the graph backend). Requires Pro+ (api_access).
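To illustrate how a MeshQL form compiles to a named query, here is a toy parser covering three of the forms above. It is not the server's grammar; the parameter name for the IP form is an assumption.

```python
import re

# Illustrative rules for three MeshQL forms; the real parser covers the
# full grammar listed above and returns HTTP 422 on parse errors.
RULES = [
    (re.compile(r'^SHOW NEIGHBORS OF DEVICE "([^"]+)"$'),
     "device_neighbors", "device_name"),
    (re.compile(r'^SHOW IMPACT IF DEVICE "([^"]+)" DOWN$'),
     "blast_radius_device", "device_name"),
    (re.compile(r'^SHOW LOCATION OF IP "([^"]+)"$'),
     "locate_endpoint_by_ip", "ip"),  # param name is an assumption
]

def parse_meshql(query):
    """Map a MeshQL string to {"query_name": ..., "params": {...}}."""
    for pattern, query_name, param in RULES:
        m = pattern.match(query.strip())
        if m:
            return {"query_name": query_name, "params": {param: m.group(1)}}
    raise ValueError("unrecognized MeshQL form")  # API responds HTTP 422
```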
NL Query Catalog
GET /ai/query/catalog returns all 109 registered queries sorted by category — no authentication required. Use it to populate autocomplete dropdowns or discover available queries:
curl https://<host>/ai/query/catalog | jq '.[0:3]'
Each entry includes name, description, category, and parameters list. The NL router uses the same catalog internally for keyword-based fast-path routing before falling back to LLM classification.