Agent Health Monitoring
| Field | Value |
|---|---|
| Document ID | ASCEND-AGENT-001 |
| Version | 1.0.0 |
| Last Updated | December 19, 2025 |
| Author | Ascend Engineering Team |
| Publisher | OW-KAI Technologies Inc. |
| Classification | Enterprise Client Documentation |
| Compliance | SOC 2 CC6.1/CC6.2, PCI-DSS 7.1/8.3, HIPAA 164.312, NIST 800-53 AC-2/SI-4 |
Reading Time: 10 minutes | Skill Level: Intermediate
Overview
ASCEND provides Datadog-style health monitoring for all registered agents. Continuous monitoring enables early detection of issues and automatic incident response.
Architecture
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ HEALTH MONITORING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ │
│ SDK Agent ASCEND Platform Dashboard │
│ │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Heartbeat │─────────────▶│ Health Service │─────────────▶│ Health │ │
│ │ Every 60s │ │ │ │ Summary │ │
│ │ │ │ • Process HB │ │ │ │
│ │ • agent_id │ │ • Update status │ │ • Online │ │
│ │ • metrics │ │ • Check health │ │ • Degraded │ │
│ │ • sdk_ver │ │ • Detect anom. │ │ • Offline │ │
│ └─────────────┘ └────────┬────────┘ └─────────────┘ │
│ │ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Auto-Actions │ │
│ │ │ │
│ │ • Auto-suspend │ │
│ │ • Alert notify │ │
│ │ • Webhook call │ │
│ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────────┘
Health Status
Status Definitions
| Status | Description | Heartbeat | Action |
|---|---|---|---|
| online | Operating normally | Recent | Normal operation |
| degraded | Missed 1-2 heartbeats | Delayed | Warning alert |
| offline | Missed 3+ heartbeats | None | Critical alert |
| unknown | Never received heartbeat | Never | Check configuration |
Status Calculation
# Source: services/agent_health_service.py
# Health status is calculated based on missed heartbeats
from datetime import datetime, UTC

def calculate_health_status(agent):
    """Calculate agent health status."""
    if not agent.last_heartbeat:
        return "unknown"

    now = datetime.now(UTC)
    expected_interval = agent.heartbeat_interval_seconds  # default: 60
    elapsed = (now - agent.last_heartbeat).total_seconds()
    missed = int(elapsed / expected_interval)

    if missed == 0:
        return "online"
    elif missed <= 2:
        return "degraded"
    else:
        return "offline"
Heartbeat API
Send Heartbeat
import requests
import time

def send_heartbeat(api_key: str, agent_id: str, metrics: dict = None):
    """Send heartbeat to ASCEND."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "agent_id": agent_id,
            "metrics": metrics,
            "sdk_version": "1.0.0"
        }
    )
    return response.json()

# Usage
while True:
    result = send_heartbeat(
        api_key="owkai_...",
        agent_id="my-agent-001",
        metrics={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    print(f"Health status: {result.get('health_status')}")
    time.sleep(60)  # Every 60 seconds
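A transient network failure on the client side should not turn into a missed heartbeat. Below is a minimal retry sketch around the send_heartbeat helper above; the retry count, backoff values, and the exception type caught are illustrative assumptions, not SDK behavior.
# Sketch: retry a failed heartbeat with exponential backoff (illustrative values)
import time
import requests

def send_heartbeat_with_retry(api_key: str, agent_id: str, metrics: dict = None,
                              retries: int = 3, base_delay: float = 2.0):
    """Attempt a heartbeat up to `retries` times before giving up."""
    for attempt in range(retries):
        try:
            return send_heartbeat(api_key, agent_id, metrics)
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # surface the error after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...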
Heartbeat Request
# Source: routes/agent_health_routes.py:36
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field

class HeartbeatRequest(BaseModel):
    """Heartbeat payload from agent SDK."""
    agent_id: str = Field(..., description="Unique agent identifier")
    metrics: Optional[Dict[str, Any]] = Field(
        default=None,
        description="Optional performance metrics",
        example={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    sdk_version: Optional[str] = Field(
        default=None,
        description="SDK version for compatibility tracking"
    )
Heartbeat Response
{
  "success": true,
  "agent_id": "my-agent-001",
  "health_status": "online",
  "next_heartbeat_expected_at": "2025-12-15T10:31:00Z",
  "heartbeat_interval_seconds": 60
}
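Because the response echoes heartbeat_interval_seconds, a client can honor interval changes made on the platform without being redeployed. A minimal sketch using the send_heartbeat helper above; the fallback of 60 seconds is an assumption.
# Sketch: let the server-configured interval drive the heartbeat loop
import time

interval = 60  # assumed fallback if the server omits the field
while True:
    result = send_heartbeat(api_key="owkai_...", agent_id="my-agent-001")
    interval = result.get("heartbeat_interval_seconds", interval)
    time.sleep(interval)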
Batch Heartbeat
Send heartbeats for multiple agents:
curl -X POST "https://pilot.owkai.app/api/agents/health/heartbeat/batch" \
-H "Authorization: Bearer owkai_..." \
-H "Content-Type: application/json" \
-d '[
{
"agent_id": "agent-001",
"metrics": {"response_time_ms": 45.2}
},
{
"agent_id": "agent-002",
"metrics": {"response_time_ms": 32.1}
}
]'
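The batch endpoint can also be called from Python. A minimal sketch using requests, assuming the endpoint accepts the same bearer-token authentication and payload shape as the curl example above.
# Sketch: batch heartbeat for multiple agents (same auth as the single-agent endpoint)
import requests

def send_batch_heartbeat(api_key: str, heartbeats: list[dict]) -> dict:
    """POST a list of per-agent heartbeat payloads in one request."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat/batch",
        headers={"Authorization": f"Bearer {api_key}"},
        json=heartbeats,
    )
    return response.json()

send_batch_heartbeat("owkai_...", [
    {"agent_id": "agent-001", "metrics": {"response_time_ms": 45.2}},
    {"agent_id": "agent-002", "metrics": {"response_time_ms": 32.1}},
])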
Health Dashboard
Get Health Summary
curl "https://pilot.owkai.app/api/agents/health/summary" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "summary": {
    "total_agents": 15,
    "online": 12,
    "degraded": 2,
    "offline": 1,
    "unknown": 0,
    "health_score": 87
  },
  "metrics": {
    "avg_response_time_ms": 42.5,
    "total_requests_24h": 125847,
    "avg_error_rate": 0.3
  },
  "problem_agents": [
    {
      "agent_id": "data-processor-003",
      "status": "offline",
      "last_heartbeat": "2025-12-15T09:15:00Z",
      "minutes_offline": 45
    },
    {
      "agent_id": "api-gateway-002",
      "status": "degraded",
      "last_heartbeat": "2025-12-15T10:28:00Z",
      "error_rate": 5.2
    }
  ],
  "recent_changes": [
    {
      "agent_id": "finance-bot-001",
      "previous_status": "online",
      "new_status": "degraded",
      "changed_at": "2025-12-15T10:25:00Z"
    }
  ],
  "last_check": "2025-12-15T10:30:00Z"
}
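The summary is convenient for a lightweight fleet check, for example paging when any agent goes offline. A minimal sketch; the health-score threshold of 90 is an arbitrary assumption, and how you route the alert is up to you.
# Sketch: poll the summary and flag problem agents (threshold is illustrative)
import requests

def check_fleet_health(api_key: str, min_health_score: int = 90) -> None:
    response = requests.get(
        "https://pilot.owkai.app/api/agents/health/summary",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    data = response.json()
    summary = data["summary"]
    if summary["health_score"] < min_health_score or summary["offline"] > 0:
        for agent in data["problem_agents"]:
            print(f"{agent['agent_id']}: {agent['status']}")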
Get Agent Health Detail
curl "https://pilot.owkai.app/api/agents/health/my-agent-001" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "agent_id": "my-agent-001",
  "display_name": "Data Processing Agent",
  "agent_type": "supervised",
  "status": "online",
  "health": {
    "status": "online",
    "last_heartbeat": "2025-12-15T10:29:45Z",
    "next_expected": "2025-12-15T10:30:45Z",
    "heartbeat_interval_seconds": 60,
    "consecutive_missed": 0
  },
  "metrics": {
    "avg_response_time_ms": 45.2,
    "error_rate_percent": 0.5,
    "total_requests_24h": 8547,
    "sdk_version": "1.0.0"
  },
  "errors": {
    "last_error": null,
    "last_error_at": null,
    "error_count_24h": 42
  },
  "recent_history": [
    {
      "timestamp": "2025-12-15T10:29:45Z",
      "status": "online",
      "response_time_ms": 45.2
    },
    {
      "timestamp": "2025-12-15T10:28:45Z",
      "status": "online",
      "response_time_ms": 43.8
    }
  ]
}
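Per-agent detail is useful when drilling into a single agent, for example to alert as soon as heartbeats start slipping. A minimal sketch; treating any consecutive missed heartbeat as "slipping" is an assumption you may want to relax.
# Sketch: flag an agent once it starts missing heartbeats
import requests

def agent_is_slipping(api_key: str, agent_id: str) -> bool:
    response = requests.get(
        f"https://pilot.owkai.app/api/agents/health/{agent_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    health = response.json()["health"]
    return health["status"] != "online" or health["consecutive_missed"] >= 1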
Performance Metrics
Tracked Metrics
| Metric | Type | Description |
|---|---|---|
| avg_response_time_ms | float | Average action response time |
| error_rate_percent | float | Error rate over 24 hours |
| total_requests_24h | int | Total actions in last 24 hours |
| last_error | string | Most recent error message |
| last_error_at | datetime | Timestamp of last error |
Reporting Metrics
# Include metrics in heartbeat
client.heartbeat(
    metrics={
        "response_time_ms": measure_response_time(),
        "error_rate": calculate_error_rate(),
        "requests_count": get_request_count(),
        "memory_mb": get_memory_usage(),
        "cpu_percent": get_cpu_usage()
    }
)
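The helper functions above (measure_response_time, calculate_error_rate, and so on) are placeholders rather than SDK calls. One minimal way to back them with in-process counters, shown here purely as an illustrative sketch:
# Sketch: simple in-process counters behind the metric helpers (illustrative)
import statistics

_latencies_ms: list[float] = []
_requests = 0
_errors = 0

def record_request(latency_ms: float, failed: bool = False) -> None:
    """Call once per handled request."""
    global _requests, _errors
    _latencies_ms.append(latency_ms)
    _requests += 1
    if failed:
        _errors += 1

def measure_response_time() -> float:
    return statistics.fmean(_latencies_ms) if _latencies_ms else 0.0

def calculate_error_rate() -> float:
    return (_errors / _requests) * 100 if _requests else 0.0

def get_request_count() -> int:
    return _requests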
Anomaly Detection
Configuration
# Source: models_agent_registry.py:173
# Anomaly detection settings
{
    "anomaly_detection_enabled": true,
    "baseline_actions_per_hour": 100.0,   # Normal action rate
    "baseline_error_rate": 0.5,           # Normal error rate (%)
    "baseline_avg_risk_score": 35.0,      # Normal risk score
    "anomaly_threshold_percent": 50.0     # Alert if 50% deviation
}
Anomaly Types
| Anomaly | Detection | Severity |
|---|---|---|
| Action Rate | Current rate > baseline + 50% | Medium to Critical |
| Error Rate | Current rate > baseline + 50% | High |
| Risk Score | Average risk > baseline + 50% | High |
Detection Logic
# Source: services/agent_registry_service.py:396
def detect_anomalies(db, agent, current_action_rate, current_error_rate, current_risk_score):
    """Compare current behavior against baseline."""
    if not agent.anomaly_detection_enabled:
        return {"has_anomaly": False}

    anomalies = []
    threshold = agent.anomaly_threshold_percent or 50.0

    # Check action rate anomaly
    if agent.baseline_actions_per_hour and current_action_rate:
        deviation = abs(current_action_rate - agent.baseline_actions_per_hour)
        deviation_percent = (deviation / agent.baseline_actions_per_hour) * 100
        if deviation_percent > threshold:
            anomalies.append({
                "type": "action_rate",
                "baseline": agent.baseline_actions_per_hour,
                "current": current_action_rate,
                "deviation_percent": deviation_percent
            })

    # Determine severity based on max deviation (None when no anomalies found)
    severity = None
    if anomalies:
        max_deviation = max(a["deviation_percent"] for a in anomalies)
        if max_deviation > threshold * 2:
            severity = "critical"
        elif max_deviation > threshold * 1.5:
            severity = "high"
        else:
            severity = "medium"

    return {
        "has_anomaly": len(anomalies) > 0,
        "anomalies": anomalies,
        "severity": severity
    }
Anomaly Response
{
  "has_anomaly": true,
  "anomalies": [
    {
      "type": "action_rate",
      "baseline": 100.0,
      "current": 250.0,
      "deviation_percent": 150.0,
      "threshold_percent": 50.0
    }
  ],
  "severity": "critical",
  "anomaly_count_24h": 3
}
Auto-Suspension
Trigger Configuration
# Source: models_agent_registry.py:163
{
    "auto_suspend_enabled": true,
    "auto_suspend_on_error_rate": 0.10,       # 10% error rate
    "auto_suspend_on_offline_minutes": 30,    # 30 minutes offline
    "auto_suspend_on_budget_exceeded": true,
    "auto_suspend_on_rate_exceeded": false
}
Auto-Suspend Check
# Source: services/agent_registry_service.py:522
from datetime import datetime, UTC

def check_auto_suspend_triggers(db, agent):
    """Check if any auto-suspend conditions are met."""
    if not agent.auto_suspend_enabled:
        return {"should_suspend": False}

    now = datetime.now(UTC)

    # Error rate trigger
    if agent.auto_suspend_on_error_rate:
        if agent.error_rate_percent >= agent.auto_suspend_on_error_rate * 100:
            return {
                "should_suspend": True,
                "trigger": "error_rate",
                "reason": f"Error rate {agent.error_rate_percent:.1f}% exceeds {agent.auto_suspend_on_error_rate * 100:.1f}%"
            }

    # Offline duration trigger
    if agent.auto_suspend_on_offline_minutes and agent.last_heartbeat:
        minutes_offline = (now - agent.last_heartbeat).total_seconds() / 60
        if minutes_offline > agent.auto_suspend_on_offline_minutes:
            return {
                "should_suspend": True,
                "trigger": "offline_duration",
                "reason": f"Agent offline for {minutes_offline:.0f} minutes"
            }

    # Budget exceeded trigger
    if agent.auto_suspend_on_budget_exceeded and agent.max_daily_budget_usd:
        if agent.current_daily_spend_usd >= agent.max_daily_budget_usd:
            return {
                "should_suspend": True,
                "trigger": "budget_exceeded",
                "reason": f"Budget exceeded: ${agent.current_daily_spend_usd:.2f}"
            }

    return {"should_suspend": False}
Heartbeat Configuration
Update Interval
curl -X PUT "https://pilot.owkai.app/api/agents/health/my-agent-001/interval" \
-H "Authorization: Bearer owkai_..." \
-H "Content-Type: application/json" \
-d '{
"interval_seconds": 30
}'
Interval Guidelines
| Environment | Interval | Rationale |
|---|---|---|
| Production Critical | 30 seconds | Fast issue detection |
| Production Standard | 60 seconds | Balance monitoring/overhead |
| Staging | 120 seconds | Less critical |
| Development | 300 seconds | Minimal overhead |
Manual Health Check
Trigger immediate health check:
curl -X POST "https://pilot.owkai.app/api/agents/health/check" \
-H "Authorization: Bearer owkai_..."
Response:
{
  "checked_by": "admin@company.com",
  "status_changes": [
    {
      "agent_id": "api-gateway-001",
      "previous_status": "online",
      "new_status": "degraded",
      "reason": "Missed heartbeat"
    }
  ],
  "changes_count": 1
}
SDK Integration
Python SDK
from ascend import AscendClient

client = AscendClient(
    api_key="owkai_...",
    agent_id="my-agent-001",
    heartbeat_interval=60  # seconds
)

# Heartbeat runs automatically in a background thread
# Or send one manually:
client.send_heartbeat(metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5
})
TypeScript SDK
import { AscendClient } from '@ascend/sdk';

const client = new AscendClient({
  apiKey: process.env.ASCEND_API_KEY,
  agentId: 'my-agent-001',
  heartbeatInterval: 60000 // milliseconds
});

// Heartbeat runs automatically
// Or manually:
await client.sendHeartbeat({
  metrics: {
    responseTimeMs: 45.2,
    errorRate: 0.5
  }
});
Best Practices
1. Always Send Heartbeats
# Start heartbeat immediately after initialization
client = AscendClient(...)
client.start_heartbeat() # Background thread
2. Include Meaningful Metrics
# Good - actionable metrics
metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5,
    "queue_depth": 150,
    "memory_percent": 75
}

# Bad - no useful information
metrics={}
3. Set Appropriate Intervals
# Production: 60 seconds or less
# Development: Can be longer
heartbeat_interval = 60 if is_production else 300
4. Configure Auto-Suspend Carefully
# Enable for autonomous agents
{
    "auto_suspend_enabled": True,
    "auto_suspend_on_error_rate": 0.10,  # 10% - not too aggressive
    "auto_suspend_on_offline_minutes": 30
}
Next Steps
- Kill-Switch — Emergency procedures
- Smart Rules — Health-based rules
- Notifications — Alert configuration
Document Version: 1.0.0 | Last Updated: December 2025