Production monitoring, alerting, and operational procedures for MeshOptixIQ.
Health Check Endpoints
Shallow Health Check
Fast health check that verifies the API server is responding:
GET /health
Response: 200 OK
{
  "status": "ok",
  "version": "1.2.0",
  "timestamp": "2026-02-16T10:30:00Z"
}
Use this endpoint for load balancer health checks: it responds quickly and does not query the database.
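A minimal sketch of a client-side probe for this endpoint, using only the Python standard library. The function names `parse_health` and `probe`, the `base_url` parameter, and the 2-second timeout are illustrative assumptions, not part of MeshOptixIQ. Only the `/health` path and the `"status": "ok"` field come from the sample response above.

```python
import json
import urllib.error
import urllib.request


def parse_health(status_code: int, body: bytes) -> bool:
    """Return True when the reply matches the documented healthy response."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe(base_url: str, timeout: float = 2.0) -> bool:
    """GET {base_url}/health; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

Keeping the parsing separate from the network call lets the decision logic be checked without a running server. A call such as `probe("http://localhost:8000")` would exercise the full path; the host and port there are placeholders.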
Deep Health Check (Readiness)
Comprehensive health check including database connectivity:
GET /health/ready
Response: 200 OK
{
  "status": "ok",
  "checks": {
    "database": "connected",
    "license": "valid"
  },
  "timestamp": "2026-02-16T10:30:00Z"
}
Use this endpoint for Kubernetes readiness probes and operational monitoring.
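The readiness payload can be reduced to a list of failing checks, which is convenient for alert messages. The sketch below assumes that `"connected"` and `"valid"` (the values in the sample response) are the only healthy states and that any other value, or a missing check, is a failure. That assumption and the name `failing_checks` are illustrative, not documented behaviour.

```python
# Healthy values copied from the sample response. Treating every other
# value as a failure is an assumption, not documented behaviour.
EXPECTED = {"database": "connected", "license": "valid"}


def failing_checks(payload: dict) -> list[str]:
    """Return the names of readiness checks that are missing or not healthy."""
    checks = payload.get("checks", {})
    return sorted(name for name, want in EXPECTED.items() if checks.get(name) != want)
```

An empty list means the instance looks ready. A non-empty list names the dependency to investigate first.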
License Status
Returns plan tier, expiry date, and days remaining. No authentication required — safe to call from monitoring scripts and ops dashboards.
GET /health/license
Response: 200 OK
{
  "plan": "pro",
  "expires": "2026-12-31",
  "days_remaining": 310,
  "demo_mode": false
}
Use for Prometheus alerting (meshoptixiq_license_days_remaining), ops dashboards, and automated renewal reminders. days_remaining is null when no expiry date is set (perpetual license).
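As a sketch of how a monitoring script might interpret this response, the function below classifies the payload and handles the `null` case described above. The state names (`"demo"`, `"perpetual"`, `"expiring"`, `"ok"`) and the function name are hypothetical. The 30-day default mirrors the "License Expiring Soon" threshold listed under Warning Alerts.

```python
WARN_DAYS = 30  # mirrors the "License Expiring Soon" warning threshold


def license_state(payload: dict, warn_days: int = WARN_DAYS) -> str:
    """Classify a /health/license payload for a renewal reminder."""
    if payload.get("demo_mode"):
        return "demo"
    days = payload.get("days_remaining")
    if days is None:  # null means no expiry date is set (perpetual license)
        return "perpetual"
    if days < warn_days:
        return "expiring"
    return "ok"
```

Checking `demo_mode` first is a design choice made for this sketch, so that a demo instance never triggers a renewal reminder.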
Key Metrics to Monitor
API Performance
Request latency: p50, p95, p99
Error rate: 4xx and 5xx responses
Request throughput: Requests/second
Concurrent connections: Active sessions
License Health
Validation time: License check latency
Grace period: Activation count
Device usage: N/M devices registered
Expiration: Days until expiry
Database Performance
Query time: Average query duration
Connection pool: Active/idle connections
Storage: Database size and growth rate
Memory: Heap usage (Neo4j)
Discovery Operations
Collection success rate: % devices collected
Collection duration: Time per run
Parser failures: Parse error count
SSH timeouts: Connection failure rate
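For the API Performance metrics, the sketch below shows one way to derive p50/p95/p99 latency and the 4xx/5xx error rates from raw samples. It uses the nearest-rank percentile definition, which is a choice made for this example. A metrics backend may interpolate differently, so values can differ slightly from dashboard figures. Both function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rates(status_codes):
    """Return (4xx fraction, 5xx fraction) for a batch of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return 0.0, 0.0
    client = sum(1 for code in status_codes if 400 <= code < 500)
    server = sum(1 for code in status_codes if 500 <= code < 600)
    return client / total, server / total
```

For example, with latency samples of 1 through 100 ms, `percentile(samples, 95)` selects the 95th smallest sample.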
Alert Definitions
Recommended alert thresholds based on baseline metrics:
Critical Alerts (Immediate Response)
| Alert | Metric | Threshold | Window |
| --- | --- | --- | --- |
| API Error Rate High | 5xx errors | > 5% | 5 minutes |
| API Latency Critical | p99 response time | > 2000ms | 5 minutes |
| Health Check Failing | /health/ready | 3 consecutive failures | 3 minutes |
| Database Unavailable | Connection status | Disconnected | 1 minute |
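The "Health Check Failing" rule depends on consecutive failures rather than a rate, so it needs a small amount of state. A minimal sketch, assuming the caller probes `/health/ready` on a fixed schedule and passes in each result. The class name and interface are hypothetical.

```python
class ConsecutiveFailureAlert:
    """Fires once `threshold` probes in a row have failed."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True while the alert should fire."""
        # Any healthy probe resets the streak.
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold
```

With the default threshold of 3 and one probe per minute, the alert fires on the third failed probe in a row, which matches the 3-minute window in the table.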
Warning Alerts (Investigation Required)
| Alert | Metric | Threshold | Window |
| --- | --- | --- | --- |
| API Latency Elevated | p95 response time | > 1000ms | 10 minutes |
| License Validation Slow | License check time | > 1000ms | 5 minutes |
| Grace Period Active | Grace period rate | > 10% | 1 hour |
| Collection Failure Rate | Failed devices | > 20% | Per run |
| License Expiring Soon | Days until expiry | < 30 days | Daily |
| GPU Temperature Critical | DCGM temp per GPU | > 80°C | 1 minute |
| InfiniBand Ports Down | IB ports not Active | > 0 | 2 minutes |
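The warning thresholds above can be expressed as data and evaluated against a snapshot of current values. This is a simplified sketch: it compares single values and ignores the evaluation windows, which an alerting system would normally handle. The metric keys (`p95_latency_ms`, `gpu_temp_c`, and so on) are hypothetical names chosen for the example. Only the alert names and thresholds come from the table.

```python
import operator

# alert name -> (hypothetical metric key, comparison, threshold from the table)
WARNING_RULES = {
    "API Latency Elevated": ("p95_latency_ms", operator.gt, 1000),
    "License Validation Slow": ("license_check_ms", operator.gt, 1000),
    "Grace Period Active": ("grace_period_pct", operator.gt, 10),
    "Collection Failure Rate": ("failed_devices_pct", operator.gt, 20),
    "License Expiring Soon": ("license_days_remaining", operator.lt, 30),
    "GPU Temperature Critical": ("gpu_temp_c", operator.gt, 80),
    "InfiniBand Ports Down": ("ib_ports_not_active", operator.gt, 0),
}


def firing(metrics: dict) -> list[str]:
    """Return the warning alerts whose threshold is crossed in `metrics`."""
    alerts = []
    for name, (key, compare, threshold) in WARNING_RULES.items():
        value = metrics.get(key)
        # Missing metrics are skipped rather than treated as breaches.
        if value is not None and compare(value, threshold):
            alerts.append(name)
    return alerts
```

The comparisons are strict, as in the table: a collection run with exactly 20% failed devices does not fire.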
CloudWatch Integration
Setup CloudWatch Alarms
Configure CloudWatch alarms for production monitoring:
The [metrics] extras group is pre-installed in all Docker images. For bare-metal Python installs, install the [metrics] extras group after the main install. The /metrics endpoint requires no authentication.
The endpoint exposes the following gauges and counters:
| Metric | Type | Description |
| --- | --- | --- |
| meshoptixiq_http_requests_total | Counter | HTTP requests labelled by method, path, status. Query paths normalised to /queries/{name}/execute. |
| meshoptixiq_query_duration_seconds | Histogram | Query execution latency labelled by query_name. Use p95/p99 buckets to identify slow queries. |
| meshoptixiq_license_days_remaining | Gauge | Days until license expiry labelled by plan. -1 = no expiry; -2 = demo mode. |
| meshoptixiq_device_count | Gauge | Total number of discovered network devices in the graph. |
| meshoptixiq_flow_store_total | Gauge | Current number of flows in the ring buffer (Enterprise only). Max 100K. |
| meshoptixiq_alert_rules_total | Gauge | Number of configured alert rules (Pro+ only). |
| meshoptixiq_bgp_peers_down | Gauge | Count of BGP peers NOT in Established state. Alert when > 0 (Pro+ only). |
| meshoptixiq_collection_run_duration_seconds | Histogram | Duration of each collection run. Labelled by result (success/partial/failure). |
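For scripts that read `/metrics` directly rather than through Prometheus, the sketch below pulls gauge values out of the text exposition and decodes the license sentinel values described above. It assumes the standard Prometheus text format and handles only simple sample lines. The helper names are illustrative, and a full client library would be more robust.

```python
def gauge_values(exposition: str, name: str) -> list[float]:
    """Collect every sample value for `name` from Prometheus text output."""
    values = []
    for line in exposition.splitlines():
        if not line.startswith(name):
            continue  # also skips "# HELP" and "# TYPE" comment lines
        rest = line[len(name):]
        if rest.startswith("{"):
            end = rest.rfind("}")
            if end == -1:
                continue  # malformed label set
            rest = rest[end + 1:]
        elif not rest.startswith(" "):
            continue  # a different metric that shares the prefix
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


def describe_license(days: float) -> str:
    """Decode meshoptixiq_license_days_remaining, including its sentinels."""
    if days == -1:
        return "no expiry"
    if days == -2:
        return "demo mode"
    return f"{int(days)} days remaining"
```

A check for `meshoptixiq_bgp_peers_down` could then alert when any value returned by `gauge_values` is greater than 0, in line with the table.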