
Agent Health Monitoring

Field          | Value
Document ID    | ASCEND-AGENT-001
Version        | 1.0.0
Last Updated   | December 19, 2025
Author         | Ascend Engineering Team
Publisher      | OW-KAI Technologies Inc.
Classification | Enterprise Client Documentation
Compliance     | SOC 2 CC6.1/CC6.2, PCI-DSS 7.1/8.3, HIPAA 164.312, NIST 800-53 AC-2/SI-4

Reading Time: 10 minutes | Skill Level: Intermediate

Overview

ASCEND provides Datadog-style health monitoring for all registered agents. Continuous monitoring enables early detection of issues and automatic incident response.

Architecture

┌────────────────────────────────────────────────────────────────────┐
│                   HEALTH MONITORING ARCHITECTURE                   │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│   SDK Agent             ASCEND Platform          Dashboard         │
│                                                                    │
│  ┌─────────────┐      ┌─────────────────┐      ┌─────────────┐     │
│  │ Heartbeat   │─────▶│ Health Service  │─────▶│ Health      │     │
│  │ Every 60s   │      │                 │      │ Summary     │     │
│  │             │      │ • Process HB    │      │             │     │
│  │ • agent_id  │      │ • Update status │      │ • Online    │     │
│  │ • metrics   │      │ • Check health  │      │ • Degraded  │     │
│  │ • sdk_ver   │      │ • Detect anom.  │      │ • Offline   │     │
│  └─────────────┘      └────────┬────────┘      └─────────────┘     │
│                                │                                   │
│                                ▼                                   │
│                       ┌─────────────────┐                          │
│                       │ Auto-Actions    │                          │
│                       │                 │                          │
│                       │ • Auto-suspend  │                          │
│                       │ • Alert notify  │                          │
│                       │ • Webhook call  │                          │
│                       └─────────────────┘                          │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Health Status

Status Definitions

Status   | Description              | Heartbeat | Action
online   | Operating normally       | Recent    | Normal operation
degraded | Missed 1-2 heartbeats    | Delayed   | Warning alert
offline  | Missed 3+ heartbeats     | None      | Critical alert
unknown  | Never received heartbeat | Never     | Check configuration

Status Calculation

# Source: services/agent_health_service.py
# Health status is calculated based on missed heartbeats

from datetime import UTC, datetime

def calculate_health_status(agent):
    """Calculate agent health status from missed heartbeats."""
    if not agent.last_heartbeat:
        return "unknown"

    now = datetime.now(UTC)
    expected_interval = agent.heartbeat_interval_seconds  # default: 60

    elapsed = (now - agent.last_heartbeat).total_seconds()
    missed = int(elapsed / expected_interval)

    if missed == 0:
        return "online"
    elif missed <= 2:
        return "degraded"
    else:
        return "offline"
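As a sanity check of the tiers above, the missed-count arithmetic can be exercised directly with a 60-second interval (a standalone sketch; the helper names are illustrative, not part of the platform code):

```python
def missed_heartbeats(elapsed_seconds: float, interval_seconds: int = 60) -> int:
    """Missed-heartbeat count, as in the status calculation above."""
    return int(elapsed_seconds / interval_seconds)

def status_for(missed: int) -> str:
    """Map a missed count onto the status tiers from the table."""
    if missed == 0:
        return "online"
    return "degraded" if missed <= 2 else "offline"

# 30s elapsed -> 0 missed, 90s -> 1 missed, 200s -> 3 missed
for elapsed in (30, 90, 200):
    print(elapsed, status_for(missed_heartbeats(elapsed)))
```

Note the boundary behavior: an agent 90 seconds silent has missed exactly one 60-second heartbeat and is already degraded.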

Heartbeat API

Send Heartbeat

import time

import requests

def send_heartbeat(api_key: str, agent_id: str, metrics: dict | None = None):
    """Send a heartbeat to ASCEND."""
    response = requests.post(
        "https://pilot.owkai.app/api/agents/health/heartbeat",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "agent_id": agent_id,
            "metrics": metrics,
            "sdk_version": "1.0.0"
        }
    )
    return response.json()

# Usage
while True:
    result = send_heartbeat(
        api_key="owkai_...",
        agent_id="my-agent-001",
        metrics={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    print(f"Health status: {result.get('health_status')}")
    time.sleep(60)  # Every 60 seconds
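The loop above exits on the first network error. A small retry wrapper around the sender keeps transient failures from killing the heartbeat loop (a sketch, not part of the official SDK; the retry count and backoff schedule are assumptions):

```python
import time

import requests

def send_heartbeat_with_retry(send_fn, retries: int = 3, backoff_seconds: float = 2.0):
    """Call a heartbeat sender, retrying transient network errors with
    exponential backoff. Returns None when every attempt fails, so the
    outer loop can keep running and try again at the next interval."""
    for attempt in range(retries):
        try:
            return send_fn()
        except requests.RequestException:
            if attempt < retries - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    return None
```

With this wrapper, a failed delivery simply becomes a missed heartbeat: the platform marks the agent degraded instead of the agent process dying.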

Heartbeat Request

# Source: routes/agent_health_routes.py:36
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field

class HeartbeatRequest(BaseModel):
    """Heartbeat payload from the agent SDK."""
    agent_id: str = Field(..., description="Unique agent identifier")
    metrics: Optional[Dict[str, Any]] = Field(
        default=None,
        description="Optional performance metrics",
        example={
            "response_time_ms": 45.2,
            "error_rate": 0.5,
            "requests_count": 1247,
            "last_error": None
        }
    )
    sdk_version: Optional[str] = Field(
        default=None,
        description="SDK version for compatibility tracking"
    )

Heartbeat Response

{
  "success": true,
  "agent_id": "my-agent-001",
  "health_status": "online",
  "next_heartbeat_expected_at": "2025-12-15T10:31:00Z",
  "heartbeat_interval_seconds": 60
}

Batch Heartbeat

Send heartbeats for multiple agents:

curl -X POST "https://pilot.owkai.app/api/agents/health/heartbeat/batch" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '[
    {
      "agent_id": "agent-001",
      "metrics": {"response_time_ms": 45.2}
    },
    {
      "agent_id": "agent-002",
      "metrics": {"response_time_ms": 32.1}
    }
  ]'

Health Dashboard

Get Health Summary

curl "https://pilot.owkai.app/api/agents/health/summary" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "summary": {
    "total_agents": 15,
    "online": 12,
    "degraded": 2,
    "offline": 1,
    "unknown": 0,
    "health_score": 87
  },
  "metrics": {
    "avg_response_time_ms": 42.5,
    "total_requests_24h": 125847,
    "avg_error_rate": 0.3
  },
  "problem_agents": [
    {
      "agent_id": "data-processor-003",
      "status": "offline",
      "last_heartbeat": "2025-12-15T09:15:00Z",
      "minutes_offline": 45
    },
    {
      "agent_id": "api-gateway-002",
      "status": "degraded",
      "last_heartbeat": "2025-12-15T10:28:00Z",
      "error_rate": 5.2
    }
  ],
  "recent_changes": [
    {
      "agent_id": "finance-bot-001",
      "previous_status": "online",
      "new_status": "degraded",
      "changed_at": "2025-12-15T10:25:00Z"
    }
  ],
  "last_check": "2025-12-15T10:30:00Z"
}
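The formula behind health_score is not documented here; one plausible weighting that reproduces the example payload (an assumption, not the platform's actual algorithm) counts online agents at full weight and degraded agents at half:

```python
def health_score(online: int, degraded: int, offline: int, unknown: int) -> int:
    """Candidate health score: online agents count fully, degraded
    agents count half, offline/unknown count zero, scaled to 0-100.
    This weighting is an assumption that matches the example above."""
    total = online + degraded + offline + unknown
    if total == 0:
        return 100
    return round((online + 0.5 * degraded) / total * 100)

print(health_score(12, 2, 1, 0))  # 87, matching the summary payload
```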

Get Agent Health Detail

curl "https://pilot.owkai.app/api/agents/health/my-agent-001" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "agent_id": "my-agent-001",
  "display_name": "Data Processing Agent",
  "agent_type": "supervised",
  "status": "online",
  "health": {
    "status": "online",
    "last_heartbeat": "2025-12-15T10:29:45Z",
    "next_expected": "2025-12-15T10:30:45Z",
    "heartbeat_interval_seconds": 60,
    "consecutive_missed": 0
  },
  "metrics": {
    "avg_response_time_ms": 45.2,
    "error_rate_percent": 0.5,
    "total_requests_24h": 8547,
    "sdk_version": "1.0.0"
  },
  "errors": {
    "last_error": null,
    "last_error_at": null,
    "error_count_24h": 42
  },
  "recent_history": [
    {
      "timestamp": "2025-12-15T10:29:45Z",
      "status": "online",
      "response_time_ms": 45.2
    },
    {
      "timestamp": "2025-12-15T10:28:45Z",
      "status": "online",
      "response_time_ms": 43.8
    }
  ]
}

Performance Metrics

Tracked Metrics

Metric               | Type     | Description
avg_response_time_ms | float    | Average action response time
error_rate_percent   | float    | Error rate over the last 24 hours
total_requests_24h   | int      | Total actions in the last 24 hours
last_error           | string   | Most recent error message
last_error_at        | datetime | Timestamp of the last error

Reporting Metrics

# Include metrics in the heartbeat
client.heartbeat(
    metrics={
        "response_time_ms": measure_response_time(),
        "error_rate": calculate_error_rate(),
        "requests_count": get_request_count(),
        "memory_mb": get_memory_usage(),
        "cpu_percent": get_cpu_usage()
    }
)
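The helper functions above (measure_response_time and friends) are placeholders left to the agent author. One way to back them is a small in-process collector that aggregates per-request outcomes between heartbeats (a sketch; the class and field names are illustrative, not part of the SDK):

```python
class MetricsCollector:
    """Aggregates per-request outcomes between heartbeats and emits a
    metrics dict shaped like the heartbeat payload shown earlier."""

    def __init__(self):
        self._latencies_ms = []
        self._errors = 0

    def record(self, latency_ms: float, error: bool = False):
        """Call once per handled request."""
        self._latencies_ms.append(latency_ms)
        if error:
            self._errors += 1

    def snapshot(self) -> dict:
        """Return metrics for the interval and reset the counters."""
        count = len(self._latencies_ms)
        avg = sum(self._latencies_ms) / count if count else 0.0
        rate = (self._errors / count * 100) if count else 0.0
        # Reset so each heartbeat reports one interval's worth of data
        self._latencies_ms.clear()
        self._errors = 0
        return {
            "response_time_ms": round(avg, 1),
            "error_rate": round(rate, 1),
            "requests_count": count,
        }
```

Calling snapshot() once per heartbeat keeps each report scoped to a single interval rather than a lifetime average.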

Anomaly Detection

Configuration

# Source: models_agent_registry.py:173
# Anomaly detection settings
{
  "anomaly_detection_enabled": true,
  "baseline_actions_per_hour": 100.0,   # Normal action rate
  "baseline_error_rate": 0.5,           # Normal error rate (%)
  "baseline_avg_risk_score": 35.0,      # Normal risk score
  "anomaly_threshold_percent": 50.0     # Alert if deviation exceeds 50%
}

Anomaly Types

Anomaly     | Detection                     | Severity
Action Rate | Current rate > baseline + 50% | Medium to Critical
Error Rate  | Current rate > baseline + 50% | High
Risk Score  | Average risk > baseline + 50% | High

Detection Logic

# Source: services/agent_registry_service.py:396
def detect_anomalies(db, agent, current_action_rate, current_error_rate, current_risk_score):
    """Compare current behavior against baseline."""
    if not agent.anomaly_detection_enabled:
        return {"has_anomaly": False}

    anomalies = []
    threshold = agent.anomaly_threshold_percent or 50.0

    # Check action rate anomaly
    # (the error-rate and risk-score checks follow the same pattern)
    if agent.baseline_actions_per_hour and current_action_rate:
        deviation = abs(current_action_rate - agent.baseline_actions_per_hour)
        deviation_percent = (deviation / agent.baseline_actions_per_hour) * 100

        if deviation_percent > threshold:
            anomalies.append({
                "type": "action_rate",
                "baseline": agent.baseline_actions_per_hour,
                "current": current_action_rate,
                "deviation_percent": deviation_percent
            })

    # Determine severity based on max deviation
    severity = None
    if anomalies:
        max_deviation = max(a["deviation_percent"] for a in anomalies)
        if max_deviation > threshold * 2:
            severity = "critical"
        elif max_deviation > threshold * 1.5:
            severity = "high"
        else:
            severity = "medium"

    return {
        "has_anomaly": len(anomalies) > 0,
        "anomalies": anomalies,
        "severity": severity
    }

Anomaly Response

{
  "has_anomaly": true,
  "anomalies": [
    {
      "type": "action_rate",
      "baseline": 100.0,
      "current": 250.0,
      "deviation_percent": 150.0,
      "threshold_percent": 50.0
    }
  ],
  "severity": "critical",
  "anomaly_count_24h": 3
}
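Tracing the example payload through the detection logic: a baseline of 100 and a current rate of 250 gives a 150% deviation, which exceeds twice the 50% threshold, hence "critical". The tier boundaries can be checked in isolation (the helper name is illustrative):

```python
def severity_for(deviation_percent: float, threshold: float = 50.0) -> str:
    """Severity tiers used by the detection logic: above 2x the
    threshold is critical, above 1.5x is high, otherwise medium."""
    if deviation_percent > threshold * 2:
        return "critical"
    if deviation_percent > threshold * 1.5:
        return "high"
    return "medium"

# The payload above: baseline 100.0, current 250.0
deviation_percent = abs(250.0 - 100.0) / 100.0 * 100  # 150.0
print(severity_for(deviation_percent))  # critical
```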

Auto-Suspension

Trigger Configuration

# Source: models_agent_registry.py:163
{
  "auto_suspend_enabled": true,
  "auto_suspend_on_error_rate": 0.10,       # 10% error rate
  "auto_suspend_on_offline_minutes": 30,    # 30 minutes offline
  "auto_suspend_on_budget_exceeded": true,
  "auto_suspend_on_rate_exceeded": false
}

Auto-Suspend Check

# Source: services/agent_registry_service.py:522
from datetime import UTC, datetime

def check_auto_suspend_triggers(db, agent):
    """Check if any auto-suspend conditions are met."""
    if not agent.auto_suspend_enabled:
        return {"should_suspend": False}

    now = datetime.now(UTC)

    # Error rate trigger
    if agent.auto_suspend_on_error_rate:
        if agent.error_rate_percent >= agent.auto_suspend_on_error_rate * 100:
            return {
                "should_suspend": True,
                "trigger": "error_rate",
                "reason": f"Error rate {agent.error_rate_percent:.1f}% exceeds {agent.auto_suspend_on_error_rate * 100:.1f}%"
            }

    # Offline duration trigger
    if agent.auto_suspend_on_offline_minutes and agent.last_heartbeat:
        minutes_offline = (now - agent.last_heartbeat).total_seconds() / 60
        if minutes_offline > agent.auto_suspend_on_offline_minutes:
            return {
                "should_suspend": True,
                "trigger": "offline_duration",
                "reason": f"Agent offline for {minutes_offline:.0f} minutes"
            }

    # Budget exceeded trigger
    if agent.auto_suspend_on_budget_exceeded and agent.max_daily_budget_usd:
        if agent.current_daily_spend_usd >= agent.max_daily_budget_usd:
            return {
                "should_suspend": True,
                "trigger": "budget_exceeded",
                "reason": f"Budget exceeded: ${agent.current_daily_spend_usd:.2f}"
            }

    return {"should_suspend": False}

Heartbeat Configuration

Update Interval

curl -X PUT "https://pilot.owkai.app/api/agents/health/my-agent-001/interval" \
  -H "Authorization: Bearer owkai_..." \
  -H "Content-Type: application/json" \
  -d '{
    "interval_seconds": 30
  }'

Interval Guidelines

Environment         | Interval    | Rationale
Production Critical | 30 seconds  | Fast issue detection
Production Standard | 60 seconds  | Balance between monitoring and overhead
Staging             | 120 seconds | Less critical
Development         | 300 seconds | Minimal overhead
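When agents run in several environments, these tiers can live in configuration instead of scattered constants (a sketch; the environment keys mirror the table and are otherwise arbitrary, not an ASCEND convention):

```python
# Heartbeat interval tiers from the table above, in seconds.
HEARTBEAT_INTERVALS = {
    "production-critical": 30,
    "production": 60,
    "staging": 120,
    "development": 300,
}

def heartbeat_interval(environment: str) -> int:
    """Interval for an environment, defaulting to the 60s production tier."""
    return HEARTBEAT_INTERVALS.get(environment, 60)
```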

Manual Health Check

Trigger immediate health check:

curl -X POST "https://pilot.owkai.app/api/agents/health/check" \
  -H "Authorization: Bearer owkai_..."

Response:

{
  "checked_by": "admin@company.com",
  "status_changes": [
    {
      "agent_id": "api-gateway-001",
      "previous_status": "online",
      "new_status": "degraded",
      "reason": "Missed heartbeat"
    }
  ],
  "changes_count": 1
}

SDK Integration

Python SDK

from ascend import AscendClient

client = AscendClient(
    api_key="owkai_...",
    agent_id="my-agent-001",
    heartbeat_interval=60  # seconds
)

# Heartbeat runs automatically in a background thread
# Or manually:
client.send_heartbeat(metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5
})
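If the SDK's built-in background thread is unavailable in your setup, the same behavior can be approximated with the standard library (a sketch of the pattern, not the SDK's internals):

```python
import threading

class HeartbeatLoop:
    """Runs a heartbeat callable on a fixed interval in a daemon
    thread, with an Event for prompt, clean shutdown."""

    def __init__(self, send_fn, interval_seconds: float = 60):
        self._send_fn = send_fn
        self._interval = interval_seconds
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep
        while not self._stop.wait(self._interval):
            try:
                self._send_fn()
            except Exception:
                pass  # a failed send just shows up as a missed heartbeat

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Using Event.wait rather than time.sleep means stop() interrupts the loop immediately instead of waiting out the current interval.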

TypeScript SDK

import { AscendClient } from '@ascend/sdk';

const client = new AscendClient({
  apiKey: process.env.ASCEND_API_KEY,
  agentId: 'my-agent-001',
  heartbeatInterval: 60000 // milliseconds
});

// Heartbeat runs automatically
// Or manually:
await client.sendHeartbeat({
  metrics: {
    responseTimeMs: 45.2,
    errorRate: 0.5
  }
});

Best Practices

1. Always Send Heartbeats

# Start heartbeat immediately after initialization
client = AscendClient(...)
client.start_heartbeat() # Background thread

2. Include Meaningful Metrics

# Good - actionable metrics
metrics={
    "response_time_ms": 45.2,
    "error_rate": 0.5,
    "queue_depth": 150,
    "memory_percent": 75
}

# Bad - no useful information
metrics={}

3. Set Appropriate Intervals

# Production: 60 seconds or less
# Development: Can be longer
heartbeat_interval = 60 if is_production else 300

4. Configure Auto-Suspend Carefully

# Enable for autonomous agents
{
    "auto_suspend_enabled": True,
    "auto_suspend_on_error_rate": 0.10,  # 10% - not too aggressive
    "auto_suspend_on_offline_minutes": 30
}

