Failure handling
Virtual MCP Server (vMCP) implements failure handling patterns to prevent cascading failures and provide graceful degradation when backends become unavailable. This guide covers circuit breaker configuration and partial failure modes.
For backend health status monitoring and the /status endpoint, see
Backend discovery modes.
Overview
When backends fail due to crashes, network issues, or rate limiting, vMCP provides circuit breaker and partial failure modes to handle failures gracefully:
- Circuit breaker: Prevents cascading failures by immediately rejecting requests to failing backends instead of waiting for timeouts
- Partial failure modes: Choose whether to fail entire requests or continue with available backends
- Automatic recovery: Backends are automatically restored when they recover
Enable the circuit breaker in production environments where backends may experience temporary failures (deployments, restarts, rate limits). For highly stable backends, health checks alone may be sufficient.
Circuit breaker
The circuit breaker tracks backend failures and transitions through three states:
- Closed (normal operation): Requests pass through to the backend. Failures are counted.
- Open (failing state): After exceeding the failure threshold, the circuit opens. Requests fail immediately without contacting the backend.
- Half-open (recovery testing): After a timeout period, the circuit allows exactly one test request through. While this request is in progress, all other requests are rejected (circuit remains half-open). If the test succeeds, the circuit closes immediately and normal operation resumes. If it fails, the circuit reopens for another timeout period.
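The three states above can be sketched as a small state machine. This is a minimal illustration, not vMCP's actual implementation: the class name, method names, and injectable clock are hypothetical, and it assumes the circuit opens once the count of consecutive failures reaches `failure_threshold`.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not vMCP's code)."""

    def __init__(self, failure_threshold=5, timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # seconds to stay open before testing recovery
        self.clock = clock  # injectable so the timeout can be simulated
        self.state = State.CLOSED
        self.failures = 0  # consecutive failures
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if a request may be sent to the backend."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.timeout:
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        if self.state is State.CLOSED:
            return True
        if self.state is State.HALF_OPEN and not self.probe_in_flight:
            self.probe_in_flight = True  # exactly one test request
            return True
        return False  # open, or a test request is already in progress

    def record_success(self):
        """A success closes the circuit and resets the failure count."""
        self.state = State.CLOSED
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self):
        """A failed test request, or reaching the threshold, opens the circuit."""
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

A caller would check `allow_request()` before contacting the backend, then report the outcome with `record_success()` or `record_failure()`.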
Enable circuit breaker
Configure circuit breaker in the VirtualMCPServer resource:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
  namespace: toolhive-system
spec:
  config:
    groupRef: my-group
    operational:
      failureHandling:
        healthCheckInterval: 30s
        unhealthyThreshold: 3
        circuitBreaker:
          enabled: true
          failureThreshold: 5
          timeout: 60s
  incomingAuth:
    type: anonymous
Configuration options
| Field | Description | Default |
|---|---|---|
| healthCheckInterval | Time between health checks for each backend | 30s |
| unhealthyThreshold | Consecutive failures before marking backend unhealthy | 3 |
| healthCheckTimeout | Maximum duration for a single health check | 10s |
| statusReportingInterval | Interval for reporting status to Kubernetes | 30s |
| Circuit breaker | | |
| enabled | Enable circuit breaker | false |
| failureThreshold | Number of failures before opening the circuit | 5 |
| timeout | Duration to wait before testing recovery | 60s |
Circuit breaker is disabled by default. Health checks run independently of the
circuit breaker and mark backends as healthy/unhealthy based on
unhealthyThreshold.
vMCP uses two thresholds:
- unhealthyThreshold (default: 3): Consecutive health check failures before marking backend unhealthy
- failureThreshold (default: 5): Consecutive request failures before opening circuit breaker
Health checks detect failures during idle periods (max detection time:
healthCheckInterval × unhealthyThreshold). Circuit breaker provides fast
failure protection during active traffic.
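As a worked example of the maximum detection time, the arithmetic can be sketched as follows. The helper names are hypothetical, and the duration parser is deliberately simplified: it assumes a single number followed by one unit suffix (`s`, `m`, or `h`).

```python
def parse_seconds(value):
    """Parse a simple duration such as '30s' or '2m' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return float(value[:-1]) * units[value[-1]]


def max_detection_seconds(health_check_interval, unhealthy_threshold):
    """Worst-case time for health checks to mark an idle backend unhealthy."""
    return parse_seconds(health_check_interval) * unhealthy_threshold


print(max_detection_seconds("30s", 3))  # defaults: 90.0
print(max_detection_seconds("10s", 2))  # 20.0
```

With the defaults, an idle backend can take up to 90 seconds to be marked unhealthy. Lowering either value shortens that window at the cost of more frequent checks or less tolerance for transient errors.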
Partial failure modes
Configure how vMCP behaves when some backends are unavailable:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
  namespace: toolhive-system
spec:
  config:
    groupRef: my-group
    operational:
      failureHandling:
        partialFailureMode: best_effort
  incomingAuth:
    type: anonymous
Modes
- fail (default): Entire request fails if any required backend is unavailable. Use when all backends must be operational.
- best_effort: Return results from healthy backends even if some fail. Tools from failed backends are omitted from responses. Use for graceful degradation.
Example: Best effort mode
With partialFailureMode: best_effort, if the GitHub backend is down but Fetch
is healthy, the tools/list response only includes tools from healthy backends:
{
  "jsonrpc": "2.0",
  "result": {
    "tools": [{ "name": "fetch_url", "description": "Fetch URL content" }]
  },
  "id": 1
}
GitHub tools are omitted from the response because the circuit breaker is open. The client doesn't see unavailable backend tools, preventing timeout errors when attempting to call them.
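The difference between the two modes can be sketched as a small aggregation function. This is a conceptual illustration only, not how vMCP aggregates tools internally: the function, the exception, the backend dictionaries, and the `create_issue` tool name are all hypothetical.

```python
class BackendUnavailableError(Exception):
    """Raised in 'fail' mode when any backend is unavailable."""


def list_tools(backends, mode="fail"):
    """Aggregate tool names from backends according to the partial failure mode.

    backends: list of dicts with 'name', 'healthy', and 'tools' keys.
    """
    if mode not in ("fail", "best_effort"):
        raise ValueError(f"unknown partial failure mode: {mode}")
    tools = []
    for backend in backends:
        if backend["healthy"]:
            tools.extend(backend["tools"])
        elif mode == "fail":
            # fail: one unavailable backend fails the entire request
            raise BackendUnavailableError(backend["name"])
        # best_effort: silently omit tools from the unavailable backend
    return tools


backends = [
    {"name": "github-mcp", "healthy": False, "tools": ["create_issue"]},
    {"name": "fetch-mcp", "healthy": True, "tools": ["fetch_url"]},
]
print(list_tools(backends, mode="best_effort"))  # ['fetch_url']
```

Calling `list_tools(backends, mode="fail")` with the same input raises `BackendUnavailableError` instead of returning a partial list.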
Monitor circuit breaker status
Check backend health and circuit state:
kubectl get virtualmcpserver my-vmcp -n toolhive-system -o yaml
Status includes health information and circuit breaker state:
status:
  phase: Degraded # Ready|Degraded if some backends unhealthy
  backendCount: 2 # Only counts ready backends (fetch-mcp, jira-mcp)
  discoveredBackends:
    - name: github-mcp
      status: unavailable
      lastHealthCheck: '2025-02-09T10:29:45Z'
      message: 'connection timeout'
      circuitBreakerState: open # Circuit breaker state: closed|open|half-open
      circuitLastChanged: '2025-02-09T10:28:30Z' # When circuit opened
      consecutiveFailures: 8 # Current failure count
    - name: fetch-mcp
      status: ready
      lastHealthCheck: '2025-02-09T10:30:05Z'
      circuitBreakerState: closed
      consecutiveFailures: 0
    - name: jira-mcp
      status: ready
      lastHealthCheck: '2025-02-09T10:30:03Z'
      circuitBreakerState: half-open # Testing recovery
      circuitLastChanged: '2025-02-09T10:30:00Z'
      consecutiveFailures: 2 # Reduced after partial recovery
Status fields:
- status: Backend health (ready, degraded, unavailable, unknown)
- circuitBreakerState: Circuit state (closed, open, half-open) - empty if circuit breaker disabled
- circuitLastChanged: When the circuit breaker state last changed
- consecutiveFailures: Count of consecutive health check failures
- message: Additional information about backend status or errors
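For scripted monitoring, the same status fields can be read from the resource's JSON form. The following is a minimal sketch: the function name is hypothetical, and the embedded sample stands in for the output of `kubectl get virtualmcpserver my-vmcp -n toolhive-system -o json`, trimmed to the fields used here.

```python
import json

# Sample shaped like the status fields described above (trimmed).
SAMPLE = """
{
  "status": {
    "phase": "Degraded",
    "discoveredBackends": [
      {"name": "github-mcp", "status": "unavailable",
       "circuitBreakerState": "open", "consecutiveFailures": 8},
      {"name": "fetch-mcp", "status": "ready",
       "circuitBreakerState": "closed", "consecutiveFailures": 0},
      {"name": "jira-mcp", "status": "ready",
       "circuitBreakerState": "half-open", "consecutiveFailures": 2}
    ]
  }
}
"""


def backends_needing_attention(resource):
    """Return (name, circuit state, failures) for circuits that are not closed."""
    backends = resource.get("status", {}).get("discoveredBackends", [])
    return [
        (b["name"], b["circuitBreakerState"], b.get("consecutiveFailures", 0))
        for b in backends
        # An empty or missing state means the circuit breaker is disabled.
        if b.get("circuitBreakerState") not in (None, "", "closed")
    ]


for name, state, failures in backends_needing_attention(json.loads(SAMPLE)):
    print(f"{name}: circuit {state}, {failures} consecutive failures")
```

To run this against a live cluster, replace `json.loads(SAMPLE)` with `json.load(sys.stdin)` and pipe the `kubectl` output into the script.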
The /status HTTP endpoint provides a simplified view:
curl http://localhost:4483/status
{
  "backends": [
    {
      "name": "github-mcp",
      "health": "unhealthy",
      "transport": "sse",
      "auth_type": "token_exchange"
    },
    {
      "name": "fetch-mcp",
      "health": "healthy",
      "transport": "streamable-http",
      "auth_type": "unauthenticated"
    }
  ],
  "healthy": false,
  "version": "v1.2.3",
  "group_ref": "my-group"
}
The /status endpoint provides basic health information but does not include
circuit breaker state. For detailed circuit breaker information
(circuitBreakerState, consecutiveFailures, circuitLastChanged), use the
Kubernetes status shown above. See
Backend discovery modes for
more details on the /status endpoint.
Example configurations
Production with aggressive failure detection
Detect failures quickly and fail fast:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: production-vmcp
  namespace: toolhive-system
spec:
  config:
    groupRef: production-backends
    operational:
      failureHandling:
        # Check every 10 seconds
        healthCheckInterval: 10s
        # Mark unhealthy after 2 failures (20 seconds)
        unhealthyThreshold: 2
        healthCheckTimeout: 5s
        # Open circuit after 3 failures
        circuitBreaker:
          enabled: true
          failureThreshold: 3
          timeout: 30s
        # Fail requests if any backend down
        partialFailureMode: fail
  incomingAuth:
    type: oidc
    oidc:
      issuerRef:
        name: my-issuer
Development with best effort
Continue with available backends:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: dev-vmcp
  namespace: toolhive-system
spec:
  config:
    groupRef: dev-backends
    operational:
      failureHandling:
        healthCheckInterval: 30s
        unhealthyThreshold: 3
        circuitBreaker:
          enabled: true
          failureThreshold: 5
          timeout: 60s
        # Continue with healthy backends
        partialFailureMode: best_effort
  incomingAuth:
    type: anonymous
Troubleshooting
Circuit breaker opens too frequently
If the circuit breaker is too sensitive:
Increase failure threshold:
operational:
  failureHandling:
    circuitBreaker:
      failureThreshold: 10 # Require more failures before opening
Increase timeout:
operational:
  failureHandling:
    circuitBreaker:
      timeout: 120s # Give backends more time to recover
Backends not recovering automatically
If backends stay unhealthy after recovering:
- Test backend connectivity
  Verify the backend MCP server is accessible from vMCP:
  kubectl exec -n toolhive-system deployment/vmcp-my-vmcp -- \
    curl -v http://my-backend:8080/mcp
  The backend should respond with MCP protocol headers.
- Increase circuit breaker timeout
  operational:
    failureHandling:
      circuitBreaker:
        timeout: 90s # Allow more time for full recovery
- Review vMCP logs
  kubectl logs -n toolhive-system deployment/vmcp-my-vmcp
  Look for circuit breaker state transitions:
  WARN Circuit breaker for backend github-mcp OPENED (threshold exceeded)
  INFO Circuit breaker for backend github-mcp CLOSED (recovery successful)
Healthy backends marked unhealthy
If backends are incorrectly marked unhealthy:
Increase health check timeout:
operational:
  failureHandling:
    healthCheckTimeout: 20s # Allow slower responses
Increase unhealthy threshold:
operational:
  failureHandling:
    unhealthyThreshold: 5 # Allow more failures before marking unhealthy
Related information
- Backend discovery modes - Backend health status and /status endpoint
- Configuration guide
- VirtualMCPServer CRD specification