Overview
Risk Legion includes comprehensive health monitoring to ensure system reliability, performance visibility, and quick issue detection. Health checks are available at multiple levels: API, database, cache, and application.
Health Check Endpoints
Primary Health Endpoint
| Status | Description |
|---|---|
| healthy | All systems operational |
| degraded | Some components impaired |
| unhealthy | Critical components failing |
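How per-component results roll up into one of these three statuses is not spelled out here. A minimal Python sketch of one possible rule, assuming the database is the only critical component and the cache is optional (the `CRITICAL_COMPONENTS` set and the function name are illustrative, not taken from Risk Legion):

```python
from typing import Dict

# Assumption: only the database counts as critical; the real list may differ.
CRITICAL_COMPONENTS = {"database"}


def overall_status(components: Dict[str, bool]) -> str:
    """Collapse per-component results (True = passing) into one status."""
    failing = {name for name, ok in components.items() if not ok}
    if not failing:
        return "healthy"
    if failing & CRITICAL_COMPONENTS:
        return "unhealthy"
    # Something is failing, but nothing critical.
    return "degraded"
```

Under this rule, a Redis outage alone would report `degraded`, while a database outage would report `unhealthy`.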
Component Health
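Component checks usually report both a pass/fail result and how long the probe took. The sketch below shows one way such a check might be wrapped. It assumes a probe is any callable that raises on failure; `run_check` and the result keys are hypothetical names.

```python
import time
from typing import Callable, Dict


def run_check(probe: Callable[[], None]) -> Dict[str, object]:
    """Run one probe and report whether it passed and how long it took."""
    started = time.perf_counter()
    result: Dict[str, object]
    try:
        probe()  # a probe signals failure by raising an exception
        result = {"status": "healthy"}
    except Exception as exc:
        result = {"status": "unhealthy", "error": str(exc)}
    result["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
    return result
```

For the database, the probe might execute `SELECT 1`; for Redis, it might send `PING`. Both would be passed in as small functions.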
Database Health
Redis Health
Implementation
FastAPI Health Endpoint
Docker Health Check
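Docker treats a health check command that exits with 0 as healthy and one that exits with 1 as unhealthy. One option is to have the container run a small Python probe against the application's health endpoint. The helper below is a sketch using only the standard library; the function name and the default timeout are assumptions.

```python
import urllib.error
import urllib.request


def health_exit_code(url: str, timeout: float = 3.0) -> int:
    """Return 0 if the endpoint answers with HTTP 200, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 1
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error responses.
        return 1
```

A probe script could end with `sys.exit(health_exit_code("http://localhost:8000/health"))`, where the host, port, and path are placeholders for the real endpoint.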
Docker Compose Health Check
Monitoring Stack
Metrics Collection
Risk Legion exposes Prometheus-compatible metrics:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests |
| http_request_duration_seconds | Histogram | Request latency |
| http_requests_in_progress | Gauge | Current active requests |
| db_query_duration_seconds | Histogram | Database query latency |
| cache_hits_total | Counter | Redis cache hits |
| cache_misses_total | Counter | Redis cache misses |
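The two cache counters are most useful when combined into a hit ratio. A small sketch of the arithmetic (the function name is illustrative):

```python
def cache_hit_ratio(hits: float, misses: float) -> float:
    """Fraction of cache lookups that were hits; 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0
```

Because counters only ever increase, the inputs would normally be the increase in `cache_hits_total` and `cache_misses_total` over a time window rather than their lifetime totals.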
Prometheus Configuration
Logging
Structured Logging
Log Format
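A common way to produce structured logs in Python is to emit one JSON object per line. The sketch below uses only the standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`) and the logger name `risk_legion` are assumptions and may not match the format Risk Legion actually uses.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback text when the record carries an exception.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("risk_legion")  # logger name is an assumption
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("health check passed")
```

One object per line keeps the output easy to parse for log aggregators, which can then filter on `level` or `logger` without regular expressions.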
Log Levels
| Level | Usage |
|---|---|
| DEBUG | Detailed debugging information |
| INFO | General operational events |
| WARNING | Unexpected but handled situations |
| ERROR | Errors requiring attention |
| CRITICAL | System-level failures |
Alerting
Alert Configuration
Dashboard Metrics
Application Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Request Rate | Requests per second | N/A (informational) |
| Error Rate | Share of requests returning 5xx errors | > 1% for 5 min |
| Latency P95 | 95th percentile response time | > 2 seconds |
| Active Users | Concurrent authenticated users | N/A (informational) |
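Thresholds such as "> 1% for 5 min" are read here as "the value stayed above the limit continuously for the whole window", which is an interpretation rather than something the table states. Under that assumption, the evaluation logic can be sketched as follows (the class and method names are hypothetical):

```python
from typing import Optional


class SustainedThreshold:
    """Fire only after a value has stayed above the limit for a full window."""

    def __init__(self, limit: float, window_seconds: float) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._breach_started: Optional[float] = None

    def observe(self, value: float, now: float) -> bool:
        """Record a sample taken at time `now` (seconds); return True if the alert should fire."""
        if value <= self.limit:
            self._breach_started = None  # any recovery resets the clock
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.window_seconds
```

For the error-rate row, this would be `SustainedThreshold(limit=0.01, window_seconds=300)`. A brief spike that recovers within the window never fires, which is the point of the duration clause.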
Infrastructure Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| CPU Usage | Container CPU utilization | > 80% for 5 min |
| Memory Usage | Container memory utilization | > 85% for 5 min |
| Disk Usage | Volume utilization | > 80% |
| Network I/O | Bytes in/out | N/A (informational) |
Database Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| Connection Pool | Active/idle connections | Active > 80% of max |
| Query Latency | Average query duration | > 500ms |
| Query Errors | Failed queries per second | > 0.1/s |
| Table Size | Database table sizes | N/A (informational) |
Error Tracking
Sentry Integration
Error Categorization
| Category | Examples |
|---|---|
| Authentication | Invalid tokens, session expired |
| Authorization | Permission denied, role mismatch |
| Validation | Invalid input, missing fields |
| Database | Connection errors, constraint violations |
| External | Third-party service failures |
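One way to apply these categories in code is to map exception types to category names before an error is reported. The exception classes below are hypothetical stand-ins for the application's own error types, and the `Uncategorized` fallback is an assumption not listed in the table.

```python
class AuthenticationError(Exception): ...
class AuthorizationError(Exception): ...
class ValidationError(Exception): ...
class DatabaseError(Exception): ...
class ExternalServiceError(Exception): ...


# Hypothetical mapping from exception type to the categories in the table.
CATEGORIES = {
    AuthenticationError: "Authentication",
    AuthorizationError: "Authorization",
    ValidationError: "Validation",
    DatabaseError: "Database",
    ExternalServiceError: "External",
}


def categorize(exc: BaseException) -> str:
    """Return the category for an exception, matching subclasses as well."""
    for exc_type, category in CATEGORIES.items():
        if isinstance(exc, exc_type):
            return category
    return "Uncategorized"  # assumption: fallback for unmapped errors
```

The resulting category could then be attached as a tag when the error is sent to a tracker such as Sentry, so that issues can be filtered by category.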
Deployment Health
GitHub Actions Health
The CI/CD pipeline includes health verification.
Rollback Triggers
Automatic rollback is triggered when:
- Health check fails for 3 consecutive checks
- Error rate exceeds 5% for 5 minutes
- Critical alerts remain unresolved
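The three triggers above can be expressed as a single decision function. This is a sketch under stated assumptions: the caller supplies the lowest error rate seen during the last 5 minutes (so a value above 5% means the rate exceeded 5% throughout), and the function and parameter names are illustrative.

```python
from typing import Sequence


def should_roll_back(
    recent_checks: Sequence[bool],
    min_error_rate_5m: float,
    unresolved_critical_alerts: int,
) -> bool:
    """Decide whether to roll back.

    recent_checks: health check results, oldest first, True = passed.
    min_error_rate_5m: lowest error rate over the last 5 minutes (0.05 = 5%).
    """
    last_three = list(recent_checks)[-3:]
    three_failures_in_a_row = len(last_three) == 3 and not any(last_three)
    return (
        three_failures_in_a_row
        or min_error_rate_5m > 0.05
        or unresolved_critical_alerts > 0
    )
```

For example, `should_roll_back([True, False, False, False], 0.0, 0)` returns `True` because the three most recent checks all failed.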
Runbooks
Database Connection Issues
1. Check Connection Pool
   Query active connections: `SELECT count(*) FROM pg_stat_activity;`
2. Review Recent Changes
   Check deployment history and recent code changes.
3. Restart Connection Pool
   Restart the application to reset the connection pool.
4. Scale if Needed
   Increase max connections if the pool is consistently at capacity.
High Latency
1. Check Slow Queries
   Review query performance using `EXPLAIN ANALYZE`.
2. Check Resource Usage
   Monitor CPU, memory, and I/O metrics.
3. Review Cache Hit Rate
   Check Redis cache effectiveness.
4. Scale Resources
   Increase instance size or add replicas.
Related Documentation
- Deployment - Deployment procedures
- Architecture Overview - System architecture