Overview

Risk Legion includes health monitoring to support system reliability, performance visibility, and fast issue detection. Health checks are available at several levels: API, database, cache (Redis), and application.

Health Check Endpoints

Primary Health Endpoint

GET /health
Returns overall system health status:
{
  "status": "healthy",
  "timestamp": "2026-01-16T10:30:00Z",
  "version": "1.0.0",
  "components": {
    "api": "healthy",
    "database": "healthy",
    "redis": "healthy"
  },
  "uptime_seconds": 86400
}
| Status | Description |
|--------|-------------|
| healthy | All systems operational |
| degraded | Some components impaired |
| unhealthy | Critical components failing |
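The three status values can be derived from per-component results. The following is a minimal sketch, assuming that the API and the database count as critical components while Redis does not; the `CRITICAL_COMPONENTS` set and the `overall_status` function are illustrative names, not part of the Risk Legion codebase.

```python
from typing import Mapping

# Assumption: which components count as "critical" is a deployment decision.
CRITICAL_COMPONENTS = frozenset({"api", "database"})


def overall_status(components: Mapping[str, str]) -> str:
    """Collapse per-component statuses into healthy / degraded / unhealthy."""
    # Any critical component that is unhealthy makes the whole system unhealthy.
    if any(components.get(name) == "unhealthy" for name in CRITICAL_COMPONENTS):
        return "unhealthy"
    # Any other impairment only degrades the system.
    if any(status != "healthy" for status in components.values()):
        return "degraded"
    return "healthy"
```

With this rule, a Redis outage reports `degraded`, while a database outage reports `unhealthy`.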

Component Health

Database Health

GET /health/database
{
  "status": "healthy",
  "latency_ms": 12,
  "connection_pool": {
    "active": 5,
    "idle": 15,
    "max": 20
  }
}

Redis Health

GET /health/redis
{
  "status": "healthy",
  "latency_ms": 2,
  "memory_used_mb": 45,
  "memory_max_mb": 256
}

Implementation

FastAPI Health Endpoint

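# backend/app/routers/health.py

import time
from datetime import datetime, timezone

from fastapi import APIRouter, Response, status

# NOTE: `db`, `redis`, and `settings` are assumed to be provided elsewhere in
# the application (async database client, async Redis client, app settings).

router = APIRouter()
start_time = time.time()

@router.get("/health")
async def health_check(response: Response):
    components = {}
    overall_status = "healthy"

    # Check database: a failure marks the whole service as unhealthy
    try:
        db_start = time.perf_counter()
        await db.execute("SELECT 1")
        db_latency = (time.perf_counter() - db_start) * 1000
        components["database"] = {
            "status": "healthy",
            "latency_ms": round(db_latency, 2)
        }
    except Exception as e:
        components["database"] = {
            "status": "unhealthy",
            "error": str(e)
        }
        overall_status = "unhealthy"

    # Check Redis: a failure only degrades the service
    try:
        redis_start = time.perf_counter()
        await redis.ping()
        redis_latency = (time.perf_counter() - redis_start) * 1000
        components["redis"] = {
            "status": "healthy",
            "latency_ms": round(redis_latency, 2)
        }
    except Exception as e:
        components["redis"] = {
            "status": "degraded",
            "error": str(e)
        }
        if overall_status == "healthy":
            overall_status = "degraded"

    # Return 503 when unhealthy so that status-code based probes
    # (curl -f, the deployment check) can detect the failure
    if overall_status == "unhealthy":
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE

    return {
        "status": overall_status,
        # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z"),
        "version": settings.APP_VERSION,
        "components": components,
        "uptime_seconds": int(time.time() - start_time)
    }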

Docker Health Check

# Dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8000/health || exit 1

Docker Compose Health Check

# docker-compose.yml
services:
  backend:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  redis:
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
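Both health checks above rely on `curl` being installed in the image. Where it is not, a small Python probe can play the same role. This is a sketch under assumptions: the function names and the default URL are illustrative, and it treats any 2xx response as healthy, mirroring what `curl -f` does for success codes.

```python
import urllib.error
import urllib.request
from typing import Sequence


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (4xx/5xx), refused connections, and timeouts all count as failure.
        return False


def main(argv: Sequence[str]) -> int:
    """Exit code for a container health check: 0 = healthy, 1 = unhealthy."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8000/health"
    return 0 if probe(url) else 1

# In a standalone script, finish with: raise SystemExit(main(sys.argv))
```

Saved under a hypothetical name such as `healthcheck.py`, the script could replace the `curl` command in the `test:` entry of the health check.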

Monitoring Stack

Metrics Collection

Risk Legion exposes Prometheus-compatible metrics:
GET /metrics
Available Metrics:
| Metric | Type | Description |
|--------|------|-------------|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_progress | Gauge | Current active requests |
| db_query_duration_seconds | Histogram | Database query latency |
| cache_hits_total | Counter | Redis cache hits |
| cache_misses_total | Counter | Redis cache misses |
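The two cache counters are most useful in combination. As a small illustration, the hit ratio over a time window can be computed from the increase of `cache_hits_total` and `cache_misses_total` during that window; the function name below is hypothetical.

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Fraction of cache lookups served from Redis, or None if there were none."""
    lookups = hits + misses
    if lookups == 0:
        return None  # no traffic in the window, so the ratio is undefined
    return hits / lookups
```

For example, 90 hits and 10 misses in a window give a ratio of 0.9.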

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'risk-legion-api'
    static_configs:
      - targets: ['api:8000']
    metrics_path: /metrics
    scrape_interval: 15s

Logging

Structured Logging

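import time

import structlog
from fastapi import FastAPI, Request

app = FastAPI()  # in the real application, reuse the existing app instance
logger = structlog.get_logger()

@app.middleware("http")
async def log_requests(request: Request, call_next):
    # Measure wall-clock time spent handling the request
    start_time = time.perf_counter()

    response = await call_next(request)

    duration = time.perf_counter() - start_time

    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=round(duration * 1000, 2),
        # user_id is only present if an earlier middleware set it on request.state
        user_id=getattr(request.state, "user_id", None)
    )

    return response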

Log Format

{
  "timestamp": "2026-01-16T10:30:00.123Z",
  "level": "info",
  "event": "http_request",
  "method": "GET",
  "path": "/api/v1/bras",
  "status_code": 200,
  "duration_ms": 45.23,
  "user_id": "user-uuid",
  "request_id": "req-uuid"
}
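To make the log format concrete, here is a self-contained sketch that produces a line of the same shape using only the standard library `logging` module rather than structlog. The `JsonFormatter` class, the logger name, and the `fields` convention are assumptions made for this example.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured fields are passed via logging's `extra` mechanism.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


# Write to an in-memory stream here; a service would log to stdout instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("risk_legion.http")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("http_request", extra={"fields": {
    "method": "GET", "path": "/api/v1/bras", "status_code": 200,
    "duration_ms": 45.23, "user_id": "user-uuid", "request_id": "req-uuid",
}})
line = stream.getvalue().strip()
```

After this runs, `line` holds a single JSON object with the same keys as the sample above.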

Log Levels

| Level | Usage |
|-------|-------|
| DEBUG | Detailed debugging information |
| INFO | General operational events |
| WARNING | Unexpected but handled situations |
| ERROR | Errors requiring attention |
| CRITICAL | System-level failures |

Alerting

Alert Configuration

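# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname']

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@risklegion.com'
    slack_configs:
      - api_url: 'https://hooks.slack.com/...'
        channel: '#alerts'

Alert rules are evaluated by Prometheus, not Alertmanager, so they belong in a separate rules file that prometheus.yml references under rule_files. The file name and group name below are placeholders:

# alert_rules.yml
groups:
  - name: risk-legion
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected

      - alert: HighLatency
        # histogram_quantile needs the per-bucket rates, aggregated by the le label
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: API latency above 2s

      - alert: DatabaseDown
        # Requires a scrape job named "database" (for example a database exporter)
        expr: up{job="database"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Database connection lost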
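The latency alert depends on estimating a percentile from histogram buckets. The sketch below illustrates the idea with linear interpolation inside the bucket that contains the requested rank. It is a simplified illustration of that calculation, not Prometheus's actual implementation, and the bucket values in the usage note are invented for the example.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for index, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan
```

With cumulative counts of 80, 90, 98, and 100 for the bounds 0.5 s, 1 s, 2 s, and +Inf, the estimated P95 is about 1.625 s, which stays below the 2 s threshold.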

Dashboard Metrics

Application Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| Request Rate | Requests per second | N/A (informational) |
| Error Rate | 5xx errors per second | > 1% for 5 min |
| Latency P95 | 95th percentile response time | > 2 seconds |
| Active Users | Concurrent authenticated users | N/A (informational) |

Infrastructure Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| CPU Usage | Container CPU utilization | > 80% for 5 min |
| Memory Usage | Container memory utilization | > 85% for 5 min |
| Disk Usage | Volume utilization | > 80% |
| Network I/O | Bytes in/out | N/A (informational) |

Database Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| Connection Pool | Active/idle connections | Active > 80% of max |
| Query Latency | Average query duration | > 500ms |
| Query Errors | Failed queries per second | > 0.1/s |
| Table Size | Database table sizes | N/A (informational) |
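The connection pool threshold can be checked directly against the `connection_pool` object returned by `GET /health/database`. A minimal sketch, with hypothetical function names:

```python
from typing import Mapping


def pool_utilization(pool: Mapping[str, int]) -> float:
    """Share of the pool's maximum connections that are currently active."""
    return pool["active"] / pool["max"]


def pool_alert(pool: Mapping[str, int], threshold: float = 0.8) -> bool:
    """True when active connections exceed the threshold share of max."""
    return pool_utilization(pool) > threshold
```

The sample response shown earlier (5 active out of 20) yields a utilization of 0.25, well under the 80% threshold.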

Error Tracking

Sentry Integration

# backend/app/main.py

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.ENVIRONMENT,
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.1,
    profiles_sample_rate=0.1,
)

Error Categorization

| Category | Examples |
|----------|----------|
| Authentication | Invalid tokens, session expired |
| Authorization | Permission denied, role mismatch |
| Validation | Invalid input, missing fields |
| Database | Connection errors, constraint violations |
| External | Third-party service failures |
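One way to apply these categories is to map exception types to a category label before reporting the error, so that events can be filtered by category. The exception classes below are hypothetical stand-ins for whatever the application actually raises, and the whole block is an illustrative sketch rather than the project's implementation.

```python
from typing import Optional


# Hypothetical application exceptions, defined here only for the example.
class AuthenticationError(Exception): ...
class AuthorizationError(Exception): ...
class ValidationError(Exception): ...
class DatabaseError(Exception): ...
class ExternalServiceError(Exception): ...


# Checked in order, so more specific classes should come first.
CATEGORY_BY_TYPE = [
    (AuthenticationError, "authentication"),
    (AuthorizationError, "authorization"),
    (ValidationError, "validation"),
    (DatabaseError, "database"),
    (ExternalServiceError, "external"),
]


def categorize(exc: BaseException) -> Optional[str]:
    """Return the error category for an exception, or None if it is unknown."""
    for exc_type, category in CATEGORY_BY_TYPE:
        if isinstance(exc, exc_type):
            return category
    return None
```

The resulting label could then be attached to the error report as a tag.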

Deployment Health

GitHub Actions Health

The CI/CD pipeline includes health verification:
# Health check after deployment
- name: Verify Deployment
  run: |
    for i in {1..10}; do
      response=$(curl -s -o /dev/null -w "%{http_code}" https://api.risklegion.com/health)
      if [ "$response" = "200" ]; then
        echo "Health check passed"
        exit 0
      fi
      echo "Attempt $i: Health check returned $response"
      sleep 5
    done
    echo "Health check failed after 10 attempts"
    exit 1
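The same retry logic can be expressed in Python, which makes it easier to unit test. In this sketch the `check` callable stands in for the HTTP request and returns a status code; the function name and the injectable `sleep` parameter are assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_healthy(
    check: Callable[[], int],
    attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns 200 or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        status = check()
        if status == 200:
            print("Health check passed")
            return True
        print(f"Attempt {attempt}: Health check returned {status}")
        if attempt < attempts:
            sleep(delay)  # wait before the next attempt
    print(f"Health check failed after {attempts} attempts")
    return False
```

Passing a no-op `sleep` keeps tests fast, while the defaults match the 10 attempts and 5 second pause used in the workflow step.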

Rollback Triggers

Automatic rollback is triggered when:
  • Health check fails for 3 consecutive checks
  • Error rate exceeds 5% for 5 minutes
  • Critical alerts remain unresolved
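The conditions above can be sketched as a single decision function. This assumes health check results arrive as booleans in chronological order, that error rates are sampled once per minute as fractions, and that any unresolved critical alert is enough to trigger a rollback; the function name, parameter names, and these conventions are illustrative.

```python
from typing import Sequence


def should_roll_back(
    health_checks: Sequence[bool],
    error_rates_per_minute: Sequence[float],
    unresolved_critical_alerts: int = 0,
) -> bool:
    """Evaluate the rollback triggers against recent observations."""
    # Trigger 1: the three most recent health checks all failed.
    recent_checks = list(health_checks)[-3:]
    checks_failed = len(recent_checks) == 3 and not any(recent_checks)

    # Trigger 2: error rate above 5% in each of the last five minutes.
    recent_rates = list(error_rates_per_minute)[-5:]
    errors_high = len(recent_rates) == 5 and all(rate > 0.05 for rate in recent_rates)

    # Trigger 3: at least one critical alert is still open.
    return checks_failed or errors_high or unresolved_critical_alerts > 0
```

A single failed check or a short error spike does not trigger a rollback under these rules.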

Runbooks

Database Connection Issues

1. Check Connection Pool: query active connections with SELECT count(*) FROM pg_stat_activity.
2. Review Recent Changes: check deployment history and recent code changes.
3. Restart Connection Pool: restart the application to reset the connection pool.
4. Scale if Needed: increase max connections if the pool is consistently at capacity.

High Latency

1. Check Slow Queries: review query performance using EXPLAIN ANALYZE.
2. Check Resource Usage: monitor CPU, memory, and I/O metrics.
3. Review Cache Hit Rate: check Redis cache effectiveness.
4. Scale Resources: increase instance size or add replicas.