Circuit Breakers and Resilience
MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py
Lesson 15: Circuit Breakers And Failure Resilience
Lesson Goal
By the end of this lesson, you should be able to:
- Explain the three circuit breaker states and when each applies.
- Configure CircuitBreakerConfig with appropriate thresholds.
- Distinguish between failure retries and timeout actions.
- Build a gracefully degrading pipeline using step_timeout_action='skip'.
- Explain why circuit breakers prevent cascading failures.
Estimated time: 45 to 60 minutes.
1. The Problem: Cascading Failures
Imagine a five-agent pipeline. Agent 3 calls a slow external API. The API starts responding slowly. Agent 3 backs up. The pipeline stalls. Agents 4 and 5 wait. The run times out. This happens again. And again. Eventually the whole pipeline is blocked on one flaky dependency.
This is a cascading failure. One unreliable component stops the entire system. The circuit breaker pattern prevents this by stopping calls to a failing component automatically, giving it time to recover, then carefully retrying.
2. The Three Circuit Breaker States
CLOSED ──(failure threshold exceeded)──▶ OPEN ──(half_open_after_s elapsed)──▶ HALF-OPEN
  ▲                                                                                │
  └─────────────────────────────(probe call succeeds)──────────────────────────────┘
CLOSED (normal operation)
Calls pass through. Failures are counted in a sliding window. As long as the failure count stays below failure_threshold, the breaker stays closed.
OPEN (too many failures)
The breaker trips when the number of failures in the window reaches failure_threshold. All calls are then rejected immediately, without being attempted. The error is "circuit open — N failures in W seconds (threshold=T)". The rejection is instant and cheap: the caller does not have to wait for a timeout.
HALF-OPEN (recovery probe)
After half_open_after_s seconds, the breaker allows one probe call through. If the probe succeeds, the breaker resets to CLOSED and normal operation resumes. If the probe fails, the breaker returns to OPEN.
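The three states can be sketched as a small state machine. This is an illustrative, single-threaded sketch, not MeshFlow's implementation: the class name `SimpleCircuitBreaker`, the `CircuitOpenError` exception, and the injectable `clock` argument are assumptions made for the example, and it assumes the breaker opens when the failure count reaches the threshold.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is OPEN."""


class SimpleCircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not MeshFlow's class)."""

    def __init__(self, failure_threshold=5, failure_window_s=60.0,
                 half_open_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_s = failure_window_s
        self.half_open_after_s = half_open_after_s
        self._clock = clock
        self._failures = deque()   # timestamps of recent failures
        self._opened_at = None     # None while CLOSED
        self.state = "CLOSED"

    def call(self, fn, *args, **kwargs):
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at < self.half_open_after_s:
                # Rejected without calling fn at all.
                raise CircuitOpenError("circuit open")
            self.state = "HALF-OPEN"   # cool-down elapsed: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "HALF-OPEN":
            # Probe succeeded: reset to normal operation.
            self._failures.clear()
            self._opened_at = None
            self.state = "CLOSED"
        return result

    def _record_failure(self, now):
        if self.state == "HALF-OPEN":
            self._trip(now)            # failed probe: back to OPEN
            return
        self._failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while now - self._failures[0] > self.failure_window_s:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = "OPEN"
        self._opened_at = now
```

Passing a fake clock (for example `clock=lambda: t[0]` with a mutable list `t`) lets you step through CLOSED, OPEN, and HALF-OPEN without waiting in real time. A production breaker would also need a lock so that only one concurrent caller becomes the probe.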
3. CircuitBreakerConfig
from meshflow.core.schemas import CircuitBreakerConfig, Policy
cb = CircuitBreakerConfig(
    max_retries=3,           # retry a failed step up to 3 times
    failure_window_s=60.0,   # count failures within a 60-second window
    failure_threshold=5,     # open the breaker after 5 failures in the window
    half_open_after_s=30.0,  # wait 30s before allowing a probe call
)
policy = Policy(budget_usd=1.0, circuit_breaker=cb)
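max_retries controls how many times a single failing step is retried before it is counted as a failure. With max_retries=3, a failing step is retried up to three times, for at most four attempts in total. If any attempt succeeds, the failure counter is not incremented; only a step that exhausts its retries counts toward failure_threshold.
failure_window_s sets the sliding window for counting failures. Failures older than this window are not counted. A short window (5s) lets failures age out quickly, so the breaker trips only on a dense burst; a long window (300s) keeps failures on the count for longer, so a slow but steady trickle of failures can also reach the threshold.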
failure_threshold sets how many failures in the window trigger the OPEN state. Lower is more sensitive; higher tolerates more intermittent failures.
half_open_after_s sets how long the breaker waits before allowing a probe call. Too short and you hammer a recovering service; too long and you delay recovery unnecessarily.
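To make the max_retries rule concrete, here is a minimal sketch of "retry first, count a failure only when retries are exhausted". The helpers `call_with_retries` and `make_flaky` are hypothetical names for this example and are not part of MeshFlow.

```python
def call_with_retries(step, max_retries):
    """Run step(); retry on exception up to max_retries times.

    Returns (ok, value_or_error, attempts).
    """
    attempts = 0
    last_error = None
    for _ in range(max_retries + 1):   # one initial attempt plus the retries
        attempts += 1
        try:
            return True, step(), attempts
        except Exception as exc:
            last_error = exc
    return False, last_error, attempts


def make_flaky(failures_before_success):
    """Hypothetical step that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def step():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("transient failure")
        return "ok"

    return step


breaker_failures = 0   # what would count toward failure_threshold
for fails in (2, 5):
    ok, value, attempts = call_with_retries(make_flaky(fails), max_retries=3)
    if not ok:
        breaker_failures += 1
    print(f"fails_first={fails} ok={ok} attempts={attempts}")
print("failures counted toward threshold:", breaker_failures)
```

The step that fails twice succeeds on its third attempt and adds nothing to the counter; the step that fails five times uses all four attempts and counts as one failure.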
4. Step Timeouts And Timeout Actions
Independent of the circuit breaker, you can set per-step time limits:
policy = Policy(
    step_timeout_s=5.0,          # each step gets 5 seconds
    step_timeout_action="fail",  # default: fail the step on timeout
)
Three timeout actions:
| Action | Behavior |
|---|---|
| "fail" | Step fails; error recorded; pipeline stops (default) |
| "skip" | Step is silently skipped; pipeline continues with empty output for this step |
| "retry" | Step is retried once after the timeout; fails if second attempt also times out |
When to use skip: optional enrichment steps that improve but do not block output. If the enrichment API is slow, skip it and continue with what you have.
When to use retry: steps that are occasionally slow but usually succeed on a second attempt.
When to use fail: steps that are required — if they fail, the pipeline should not continue.
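The three actions can be sketched with the standard library alone. This is an assumption-laden illustration of the behavior in the table, not how MeshFlow enforces timeouts: `run_step` and its return shape are hypothetical, and because a Python thread cannot be killed, the timed-out call keeps running in the background here.

```python
import concurrent.futures
import time


def run_step(step, timeout_s, action="fail"):
    """Run step() under a time limit and apply a timeout action.

    Returns (status, output). Hypothetical helper for illustration only.
    """
    if action not in ("fail", "skip", "retry"):
        raise ValueError(f"unknown action: {action!r}")
    attempts = 2 if action == "retry" else 1   # "retry" means one extra attempt
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=attempts)
    try:
        for _ in range(attempts):
            future = pool.submit(step)
            try:
                return "completed", future.result(timeout=timeout_s)
            except concurrent.futures.TimeoutError:
                continue   # timed out; try again if attempts remain
    finally:
        pool.shutdown(wait=False)   # do not block on a still-running step
    if action == "skip":
        return "skipped", None      # pipeline continues with empty output
    return "failed", None           # "fail", or "retry" that timed out twice


def slow_step():
    time.sleep(0.3)   # stands in for a slow external API
    return "late"


for action in ("fail", "skip", "retry"):
    print(action, run_step(slow_step, timeout_s=0.05, action=action))
```

Exceptions raised by the step itself are not caught here; they propagate to the caller, which keeps timeout handling separate from failure handling.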
5. Combining Circuit Breaker And Timeout
The circuit breaker and timeout work at different levels:
- step_timeout_s limits how long a single call can take.
- max_retries determines how many times a step can be retried before it counts toward the circuit breaker threshold.
- failure_threshold determines when the whole component is shut off.
A typical production configuration:
cb = CircuitBreakerConfig(
    max_retries=2,           # retry twice before counting as failure
    failure_window_s=30.0,   # short window: only failures from the last 30s count
    failure_threshold=3,     # trip after 3 failures in 30s
    half_open_after_s=10.0,  # try again after 10s
)
policy = Policy(
    step_timeout_s=10.0,
    step_timeout_action="retry",
    circuit_breaker=cb,
)
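When choosing these values together, a quick back-of-the-envelope check can help. The sketch below rests on assumptions that the lesson does not confirm: every attempt runs into the timeout, attempts run back to back from a single caller, and a timed-out step is retried max_retries times before it counts as a failure. Both function names are hypothetical.

```python
def seconds_per_counted_failure(step_timeout_s, max_retries):
    """Worst-case time for one step to exhaust its attempts by timing out."""
    return step_timeout_s * (max_retries + 1)


def fits_in_window(step_timeout_s, max_retries, failure_threshold, failure_window_s):
    """True if failure_threshold back-to-back counted failures fit in one window."""
    spacing = seconds_per_counted_failure(step_timeout_s, max_retries)
    # Counted failures land at spacing, 2*spacing, ..., so the span from the
    # first to the last of them is (failure_threshold - 1) * spacing.
    span = (failure_threshold - 1) * spacing
    return span <= failure_window_s


print(seconds_per_counted_failure(step_timeout_s=10.0, max_retries=2))
print(fits_in_window(10.0, 2, failure_threshold=3, failure_window_s=30.0))
print(fits_in_window(10.0, 2, failure_threshold=3, failure_window_s=60.0))
```

Under these assumptions, steps that fail only by timing out are spaced too far apart for a 30-second window to hold three of them, while errors that fail fast, or several concurrent runs, fill the window much sooner. Treat the result as a prompt to test your configuration, not as a statement about MeshFlow's internals.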
6. Graceful Degradation
A well-designed pipeline degrades gracefully when non-critical steps fail. Use step_timeout_action='skip' on optional steps:
# optional_enrichment is a nice-to-have, not required for correctness
policy = Policy(
    step_timeout_s=2.0,
    step_timeout_action="skip",
)
The main processor receives an empty execution_result if the enrichment step was skipped, and falls back to producing basic output. The pipeline completes successfully even when the enrichment API is unavailable.
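The fallback inside the main processor can be as simple as a truthiness check. In this sketch the function name `build_summary`, the record fields, and the dict shape of `execution_result` are assumptions for illustration; only the idea of an empty `execution_result` comes from the lesson.

```python
def build_summary(record, execution_result):
    """Hypothetical main processor: use enrichment if present, else degrade."""
    summary = {"id": record["id"], "title": record["title"]}
    if execution_result:
        # The enrichment step produced output: use it.
        summary["tags"] = execution_result.get("tags", [])
        summary["degraded"] = False
    else:
        # The enrichment step was skipped: empty result, basic output only.
        summary["tags"] = []
        summary["degraded"] = True
    return summary


record = {"id": 7, "title": "Quarterly report"}
print(build_summary(record, {"tags": ["finance", "q3"]}))
print(build_summary(record, {}))   # enrichment skipped
```

Marking the output as degraded, rather than silently omitting fields, lets downstream consumers and dashboards tell a complete result from a fallback one.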
7. When NOT To Use Circuit Breakers
Circuit breakers protect pipelines from unreliable dependencies. They are not the right tool for:
- Expected failures (input validation errors; these should be caught before reaching the circuit breaker)
- Critical required steps (use step_timeout_action='fail' and let the circuit breaker trip to protect downstream steps)
- Logic errors in your own code (fix the bug, do not mask it with retries)
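For the first point, one way to keep expected failures away from the breaker is to validate input before the protected call is attempted. The names below (`ValidationError`, `validate`, `guarded_step`, `fake_dependency`) are hypothetical and only illustrate the ordering.

```python
class ValidationError(ValueError):
    """Bad input: the caller's fault, not the dependency's."""


def validate(payload):
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValidationError("query must be a non-empty string")
    return payload


def guarded_step(payload, protected_call):
    """Validate first so bad input never reaches the breaker-protected call."""
    validate(payload)                 # may raise ValidationError
    return protected_call(payload)    # only valid input gets this far


calls = []


def fake_dependency(payload):         # stands in for the protected agent
    calls.append(payload)
    return {"echo": payload["query"]}


try:
    guarded_step({"query": ""}, fake_dependency)
except ValidationError as exc:
    print("rejected before the dependency was called:", exc)
print(guarded_step({"query": "status"}, fake_dependency), "calls:", len(calls))
```

Because the invalid payload is rejected up front, it cannot add to the failure count of a dependency that is in fact healthy.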
8. Hands-On Lab
MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py
Observe:
- Demo 1: a flaky agent failing without retries (status=failed, calls=1)
- Demo 2: the same agent succeeding on the third call with max_retries=3
- Demo 3: the circuit breaker tripping after 2 failures (calls 3 and 4 show "circuit open" error instead of the actual agent error)
- Demo 4: step_timeout_action='retry' allowing a timed-out step to succeed on the second attempt (calls=2, status=completed)
- Demo 5: step_timeout_action='skip' letting a slow optional step be bypassed (status=completed, output contains "enrichment skipped")
9. Summary
Circuit breakers prevent cascading failures by stopping calls to repeatedly failing components. The three states — CLOSED, OPEN, HALF-OPEN — govern when calls are allowed, blocked, and tested for recovery. max_retries handles transient failures; failure_threshold handles persistent ones. Step timeouts with 'skip' or 'retry' actions let pipelines degrade gracefully rather than failing hard on every slow dependency.
Exercises
Exercise 1: Run the Script and Observe All Three Circuit States
Goal: Watch the CLOSED → OPEN → HALF-OPEN → CLOSED transition happen in real output.
Instructions:
- Run the hands-on script:
MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py
- The script simulates a flaky agent that fails on a configurable percentage of calls. Read the output and find the lines that indicate state transitions (look for labels like [CIRCUIT: OPEN], [CIRCUIT: HALF-OPEN], [CIRCUIT: CLOSED]).
- Record:
  - At which call number did the circuit first open?
  - How long (in simulated seconds) did the circuit stay OPEN before entering HALF-OPEN?
  - Did the probe call in HALF-OPEN succeed or fail? What was the resulting state?
  - At which call number did the circuit return to CLOSED (if it did)?
- Draw a simple timeline (pen and paper or in your notes) marking each state transition with a timestamp.
Expected output: A sequence of labeled output lines showing successful calls (CLOSED), the threshold breach (OPEN), the wait period (OPEN), the probe attempt (HALF-OPEN), and recovery (CLOSED or re-OPEN if the probe failed).
Exercise 2: Tune the Circuit Breaker Parameters
Goal: Understand how parameter changes affect circuit breaker sensitivity.
Instructions:
- Open hands_on/15_circuit_breaker.py and find the CircuitBreakerConfig instantiation. Note the default values.
- Make the circuit breaker more sensitive by changing:
  - failure_threshold from its default down to 2
  - failure_window_s from its default up to 120 (wider window catches more failures)
  - half_open_after_s down to 5 (recover faster)
  Run the script and note how quickly the circuit opens and how fast it recovers.
- Make the circuit breaker less sensitive by changing:
  - failure_threshold up to 10
  - failure_window_s down to 10 (very short window: failures age out quickly)
  - half_open_after_s up to 60 (slow recovery)
  Run the script and note whether the circuit ever opens with the same failure pattern.
- Find the parameter combination that gives you "just right" behavior for the simulated failure rate: the circuit should open when there is a real sustained problem but not on a single transient failure.
- Write two to three sentences explaining the trade-off between a sensitive circuit breaker (opens fast) and a tolerant one (opens slowly).
Expected output: Three different behavioral profiles from the same underlying failure simulation, demonstrating that parameters directly control sensitivity.
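Before editing the script, it can help to predict the outcome on paper. The sketch below replays a made-up failure pattern (one failure every 8 seconds) against different threshold and window pairs. `first_trip_time` is a hypothetical helper, and it assumes the breaker opens when the count in the window reaches the threshold.

```python
from collections import deque


def first_trip_time(failure_times, failure_threshold, failure_window_s):
    """Return the timestamp at which the breaker would open, or None."""
    window = deque()
    for t in failure_times:                # timestamps in ascending order
        window.append(t)
        # Drop failures that have aged out of the sliding window.
        while t - window[0] > failure_window_s:
            window.popleft()
        if len(window) >= failure_threshold:
            return t
    return None


failures = [0, 8, 16, 24, 32, 40, 48, 56, 64, 72]   # made-up pattern
profiles = {
    "sensitive": (2, 120),   # low threshold, wide window
    "tolerant": (10, 10),    # high threshold, narrow window
    "middle": (3, 30),
}
for name, (threshold, window_s) in profiles.items():
    print(name, first_trip_time(failures, threshold, window_s))
```

With this pattern the sensitive profile opens at the second failure (t=8), the middle profile at the third (t=16), and the tolerant profile never opens, because a 10-second window never holds more than two of these failures at once.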
Exercise 3: Test step_timeout_action Values
Goal: See how each step_timeout_action handles a slow agent.
Instructions:
- The hands-on script should include a "slow agent" demo: an agent step that takes longer than step_timeout_s. If not, create one:
import time
from meshflow import MeshFlow, MeshNode, CircuitBreakerConfig

def slow_agent(input):
    time.sleep(10)  # always exceeds the 2-second timeout
    return {"result": "done"}

# Section 4 sets step_timeout_s and step_timeout_action on Policy; if
# CircuitBreakerConfig does not accept them here, pass them via Policy instead.
cb = CircuitBreakerConfig(step_timeout_s=2.0, step_timeout_action="fail")
app = MeshFlow(circuit_breaker=cb)
node = MeshNode.from_callable(slow_agent, name="slow_agent")
app.add_node(node)
- Run with step_timeout_action="fail". Record the exception type and message.
- Change to step_timeout_action="skip". Run again. What value does the next step receive as its input? Is the skipped step recorded in the ledger?
- Change to step_timeout_action="retry". Run again with max_retries=2. How many times does the agent attempt to run before the final action is taken? What is the total elapsed time?
- Build a table in your notes: one row per step_timeout_action value, columns for "behavior", "ledger entry?", "next step receives", and "total elapsed time".
Expected output: Three distinct behaviors from the same slow agent, demonstrating the three timeout action modes. Elapsed times should reflect retry attempts in the "retry" mode.
Exercise 4: Implement Graceful Degradation with a Fallback
Goal: Build a pipeline that serves a degraded result when the primary agent is unavailable.
Instructions:
- Design a two-path pipeline:
  - Primary path: An agent that calls an external API (simulate with a function that randomly raises an exception 80% of the time to simulate a down service).
  - Fallback path: A simpler agent that returns a cached or default response.
- Configure a circuit breaker that opens after 2 failures in 30 seconds:
from meshflow import MeshFlow, MeshNode, CircuitBreakerConfig
cb = CircuitBreakerConfig(
    failure_threshold=2,
    failure_window_s=30,
    half_open_after_s=10,
    step_timeout_action="skip",
)
app = MeshFlow(circuit_breaker=cb)
- Configure a fallback so that when the primary agent's circuit is OPEN, the workflow automatically uses the fallback agent:
primary_node = MeshNode.from_callable(
    primary_agent,
    name="primary",
    fallback=MeshNode.from_callable(fallback_agent, name="fallback"),
)
- Run the pipeline 10 times in a loop and observe:
  - How many times did the primary succeed?
  - How many times did the fallback activate?
  - Did the workflow ever return an error to the caller (versus gracefully degrading)?
- Inspect the ledger records. How are fallback activations recorded? Is it clear from the ledger that a degraded response was returned?
Expected output: The majority of calls should succeed (either primary or fallback). The workflow should never raise an unhandled exception. The ledger should record both primary failures and fallback activations.
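If you need starting points for the two agents, the following sketch simulates them without MeshFlow. Everything here is an assumption for the exercise: `make_primary`, the payload shape, and the plain `call_with_fallback` wrapper, which stands in for whatever the `fallback=` argument does inside the library.

```python
import random


def make_primary(failure_rate=0.8, seed=42):
    """Build a primary agent that raises on roughly failure_rate of calls."""
    rng = random.Random(seed)   # seeded so repeated runs behave the same

    def primary_agent(payload):
        if rng.random() < failure_rate:
            raise ConnectionError("simulated outage")
        return {"source": "primary", "answer": f"live result for {payload['query']}"}

    return primary_agent


def fallback_agent(payload):
    """Simpler agent: always returns a cached or default response."""
    return {"source": "fallback", "answer": "cached default response"}


def call_with_fallback(primary, fallback, payload):
    """Serve the fallback response whenever the primary raises."""
    try:
        return primary(payload)
    except Exception:
        return fallback(payload)


primary_agent = make_primary()
counts = {"primary": 0, "fallback": 0}
for i in range(10):
    result = call_with_fallback(primary_agent, fallback_agent, {"query": f"q{i}"})
    counts[result["source"]] += 1
print(counts)
```

The wrapper never lets an exception reach the caller, which is the behavior the exercise asks you to confirm in the real pipeline. It does not model the OPEN state, so compare its counts with what you see once the circuit breaker starts rejecting calls to the primary.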
Exercise 5: Simulate a Cascading Failure and Recovery
Goal: Observe how circuit breakers prevent cascading failures in a multi-node pipeline.
Instructions:
- Build a four-node pipeline where Node 3 is the unstable one:
  - Node 1: Always succeeds (data preparation)
  - Node 2: Always succeeds (enrichment)
  - Node 3: Fails 90% of the time (simulates a down service)
  - Node 4: Always succeeds (formatting)
- Configure separate circuit breakers per node (if MeshFlow supports per-node circuit breakers) or use the global circuit breaker.
- Run the pipeline 20 times and observe:
  - After how many runs does Node 3's circuit open?
  - While Node 3's circuit is OPEN, do Nodes 1 and 2 still execute (wasted work) or does the pipeline short-circuit earlier?
  - How does the failure of Node 3 affect the output of Node 4? (Does Node 4 receive a None, a default, or does it not execute at all?)
- After the circuit opens, wait for half_open_after_s and run one more call. Record whether the probe succeeds or fails and what the new state is.
- Write a short paragraph (4–5 sentences) on how circuit breakers prevent cascading failures from propagating through a multi-stage pipeline and why this matters for system reliability.
Expected output: Clear evidence that Node 3's circuit opens after the threshold is reached, the pipeline handles the OPEN state without crashing, and the recovery probe (HALF-OPEN) is visible in the output.
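To form a hypothesis before running the real pipeline, you can simulate the four nodes in plain Python. This sketch is deliberately simplified and is not MeshFlow's behavior: `NodeBreaker` counts failures without a sliding window, time is passed in as a number, Node 3 is down until t=12 rather than failing at random, and Nodes 1 and 2 always run.

```python
class NodeBreaker:
    """Count-based breaker with a cool-down (hypothetical, no sliding window)."""

    def __init__(self, failure_threshold, half_open_after_s):
        self.failure_threshold = failure_threshold
        self.half_open_after_s = half_open_after_s
        self.failures = 0
        self.opened_at = None   # None while CLOSED

    def allows(self, now):
        """True while CLOSED, or once the cool-down has elapsed (probe call)."""
        return self.opened_at is None or now - self.opened_at >= self.half_open_after_s

    def record(self, ok, now):
        if ok:
            self.failures, self.opened_at = 0, None   # success: back to CLOSED
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now   # open, or re-open after a failed probe


def run_pipeline(now, breaker, node3_healthy, log):
    log.append("node1")                    # Node 1: data preparation
    log.append("node2")                    # Node 2: enrichment
    if not breaker.allows(now):
        log.append("node3:rejected")       # circuit OPEN: Node 3 is not called
        node3_out = None
    else:
        log.append("node3:called")
        breaker.record(node3_healthy, now)
        node3_out = {"score": 1} if node3_healthy else None
    log.append("node4")                    # Node 4: formatting, tolerates None
    score = node3_out["score"] if node3_out else "n/a"
    return {"score": score, "degraded": node3_out is None}


breaker = NodeBreaker(failure_threshold=3, half_open_after_s=10)
log = []
results = [run_pipeline(t, breaker, node3_healthy=(t >= 12), log=log)
           for t in range(20)]
print("node3 called:", log.count("node3:called"),
      "rejected:", log.count("node3:rejected"))
print("degraded runs:", sum(r["degraded"] for r in results))
```

In this simulation Node 3 is called three times, rejected nine times while the circuit is OPEN, and called again from the successful probe at t=12 onward. Compare that with the real pipeline, in particular whether MeshFlow still executes Nodes 1 and 2 while Node 3's circuit is open.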