
Circuit Breakers and Resilience

Ch 15 · Production Engineering 55 min
CLOSED / OPEN / HALF-OPEN · max_retries · skip / retry / fail
Hands-on: MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py

Lesson 15: Circuit Breakers And Failure Resilience

Lesson Goal

By the end of this lesson, you should be able to:

  • Explain the three circuit breaker states and when each applies.
  • Configure CircuitBreakerConfig with appropriate thresholds.
  • Distinguish between failure retries and timeout actions.
  • Build a gracefully degrading pipeline using step_timeout_action='skip'.
  • Explain why circuit breakers prevent cascading failures.

Estimated time: 45 to 60 minutes.

1. The Problem: Cascading Failures

Imagine a five-agent pipeline. Agent 3 calls a slow external API. The API starts responding slowly. Agent 3 backs up. The pipeline stalls. Agents 4 and 5 wait. The run times out. This happens again. And again. Eventually the whole pipeline is blocked on one flaky dependency.

This is a cascading failure. One unreliable component stops the entire system. The circuit breaker pattern prevents this by stopping calls to a failing component automatically, giving it time to recover, then carefully retrying.

2. The Three Circuit Breaker States

CLOSED ──(failure threshold reached)───▶ OPEN ──(half_open_after_s elapsed)──▶ HALF-OPEN
  ▲                                                                                  │
  └──────────────────────(probe call succeeds)──────────────────────────────────────┘

CLOSED (normal operation)

Calls pass through. Failures are counted in a sliding window. As long as the failure count stays below failure_threshold, the breaker stays closed.

OPEN (too many failures)

The breaker trips when the number of failures in the window reaches failure_threshold. While it is open, every call is rejected immediately without being attempted. The error is "circuit open — N failures in W seconds (threshold=T)". The rejection is instant and cheap: no call is made, so nothing has to time out.

HALF-OPEN (recovery probe)

After half_open_after_s seconds, the breaker allows one probe call through. If the probe succeeds, the breaker resets to CLOSED and normal operation resumes. If the probe fails, the breaker returns to OPEN.
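The three states can be sketched as a small state machine. This is a minimal illustration, not MeshFlow's implementation: the names SketchBreaker and CircuitOpenError, the injectable clock, and the rule that the breaker trips once the failure count reaches the threshold are all assumptions made for the example. It is single-threaded, so "one probe call" falls out naturally.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is OPEN."""


class SketchBreaker:
    """Illustrative three-state breaker (hypothetical, not MeshFlow's code)."""

    def __init__(self, failure_threshold=5, failure_window_s=60.0,
                 half_open_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_s = failure_window_s
        self.half_open_after_s = half_open_after_s
        self.clock = clock
        self.failures = deque()   # timestamps of recent failures
        self.opened_at = None     # set when the breaker trips
        self.state = "CLOSED"

    def call(self, fn):
        now = self.clock()
        if self.state == "OPEN":
            if now - self.opened_at < self.half_open_after_s:
                raise CircuitOpenError("circuit open")  # fn never runs
            self.state = "HALF-OPEN"  # wait elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        if self.state == "HALF-OPEN":
            # Probe succeeded: reset to normal operation.
            self.failures.clear()
            self.state = "CLOSED"
        return result

    def _record_failure(self, now):
        if self.state == "HALF-OPEN":
            # Probe failed: go straight back to OPEN.
            self.state, self.opened_at = "OPEN", now
            return
        self.failures.append(now)
        # Drop failures that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.failure_window_s:
            self.failures.popleft()
        if len(self.failures) >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", now
```

Passing a fake clock (any zero-argument function returning seconds) makes the transitions easy to step through without waiting for real time to pass.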

3. CircuitBreakerConfig

from meshflow.core.schemas import CircuitBreakerConfig, Policy

cb = CircuitBreakerConfig(
    max_retries=3,           # retry a failed step up to 3 times
    failure_window_s=60.0,   # count failures within a 60-second window
    failure_threshold=5,     # open the breaker after 5 failures in the window
    half_open_after_s=30.0,  # wait 30s before allowing a probe call
)

policy = Policy(budget_usd=1.0, circuit_breaker=cb)

max_retries controls how many times a single failing step is retried before it is counted as a failure. With max_retries=3, a step that fails is retried three times. If it succeeds on any retry, the failure counter is not incremented.

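failure_window_s sets the sliding window for counting failures. Failures older than this window are not counted. A short window (5s) forgets failures quickly, so the breaker trips only when failures arrive in a dense burst; a long window (300s) lets intermittent failures accumulate, so the breaker also trips on a slower trickle.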

failure_threshold sets how many failures in the window trigger the OPEN state. Lower is more sensitive; higher tolerates more intermittent failures.

half_open_after_s sets how long the breaker waits before allowing a probe call. Too short and you hammer a recovering service; too long and you delay recovery unnecessarily.
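To see how failure_window_s and failure_threshold interact, the sketch below replays a fixed list of failure timestamps and reports when a breaker would open. The function first_trip_time is a hypothetical helper, and it assumes the breaker trips once the count inside the window reaches the threshold.

```python
from collections import deque


def first_trip_time(failure_times, failure_window_s, failure_threshold):
    """Return the timestamp at which the breaker would open, or None.

    Assumption: the breaker trips once the number of failures inside the
    sliding window reaches failure_threshold.
    """
    window = deque()
    for t in sorted(failure_times):
        window.append(t)
        # Forget failures older than the window.
        while t - window[0] > failure_window_s:
            window.popleft()
        if len(window) >= failure_threshold:
            return t
    return None


# One failure every 20 seconds: a slow, steady trickle.
trickle = [0, 20, 40, 60, 80, 100]

print(first_trip_time(trickle, failure_window_s=60.0, failure_threshold=5))   # None
print(first_trip_time(trickle, failure_window_s=300.0, failure_threshold=5))  # 80
```

With the 60-second window, at most four failures are ever in view, so the breaker stays closed; with the 300-second window, the fifth failure at t=80 trips it.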

4. Step Timeouts And Timeout Actions

Independent of the circuit breaker, you can set per-step time limits:

policy = Policy(
    step_timeout_s=5.0,           # each step gets 5 seconds
    step_timeout_action="fail",   # default: fail the step on timeout
)

Three timeout actions:

Action    Behavior
"fail"    Step fails; error recorded; pipeline stops (default)
"skip"    Step is silently skipped; pipeline continues with empty output for this step
"retry"   Step is retried once after the timeout; fails if the second attempt also times out

When to use skip: optional enrichment steps that improve but do not block output. If the enrichment API is slow, skip it and continue with what you have.

When to use retry: steps that are occasionally slow but usually succeed on a second attempt.

When to use fail: steps that are required — if they fail, the pipeline should not continue.
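The three actions can be illustrated with a plain-Python wrapper around a thread pool. This is only a sketch of the behavior described in the table: run_step and its (status, output) return value are hypothetical, and MeshFlow's real timeout handling may differ. A timed-out thread keeps running in the background, because Python threads cannot be cancelled.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as StepTimeout


def run_step(fn, timeout_s, action="fail"):
    """Run fn with a time limit and apply a timeout action.

    Returns (status, output). Hypothetical helper for illustration only.
    """
    attempts = 2 if action == "retry" else 1  # "retry" means one extra attempt
    pool = ThreadPoolExecutor(max_workers=attempts)
    try:
        for _ in range(attempts):
            future = pool.submit(fn)
            try:
                return "completed", future.result(timeout=timeout_s)
            except StepTimeout:
                continue  # out of time; try again if attempts remain
        if action == "skip":
            return "skipped", None   # pipeline continues with empty output
        return "failed", None        # "fail", or "retry" that timed out twice
    finally:
        pool.shutdown(wait=False)    # do not block on the abandoned thread


calls = []

def slow_once():
    """Slow on the first call, fast afterwards."""
    calls.append(1)
    if len(calls) == 1:
        time.sleep(0.3)
    return "ok"

print(run_step(slow_once, timeout_s=0.1, action="retry"))  # ('completed', 'ok')
```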

5. Combining Circuit Breaker And Timeout

The circuit breaker and timeout work at different levels:

  • step_timeout_s limits how long a single call can take.
  • max_retries determines how many times a step can be retried before it counts toward the circuit breaker threshold.
  • failure_threshold determines when the whole component is shut off.

A typical production configuration:

cb = CircuitBreakerConfig(
    max_retries=2,           # retry twice before counting as failure
    failure_window_s=30.0,   # short window: only the last 30s of failures count
    failure_threshold=3,     # trip after 3 failures in 30s
    half_open_after_s=10.0,  # try again after 10s
)
policy = Policy(
    step_timeout_s=10.0,
    step_timeout_action="retry",
    circuit_breaker=cb,
)
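The layering between retries and the breaker can be sketched in plain Python: a step is retried first, and only a step that exhausts its retries adds one failure toward the threshold. The helpers call_with_retries and make_flaky are hypothetical, and the sketch assumes max_retries counts retries only, so a step runs at most max_retries + 1 times.

```python
def call_with_retries(fn, max_retries):
    """Call fn, retrying up to max_retries times after the first attempt.

    Returns (ok, attempts). Assumption: max_retries excludes the first try.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        try:
            fn()
            return True, attempts
        except Exception:
            continue
    return False, attempts


def make_flaky(failures_before_success):
    """Build a hypothetical agent that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def agent():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient error")
        return "ok"

    return agent


breaker_failures = 0
for failures in (2, 10):
    ok, attempts = call_with_retries(make_flaky(failures), max_retries=3)
    if not ok:
        breaker_failures += 1  # only an exhausted step counts toward the threshold
    print(failures, ok, attempts)   # 2 True 3, then 10 False 4
print(breaker_failures)             # 1
```

The agent that fails twice recovers on its third call and never touches the failure counter; the agent that keeps failing uses all four attempts and contributes a single failure.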

6. Graceful Degradation

A well-designed pipeline degrades gracefully when non-critical steps fail. Use step_timeout_action='skip' on optional steps:

# optional_enrichment is a nice-to-have, not required for correctness
policy = Policy(
    step_timeout_s=2.0,
    step_timeout_action="skip",
)

The main processor receives an empty execution_result if the enrichment step was skipped, and falls back to producing basic output. The pipeline completes successfully even when the enrichment API is unavailable.
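A downstream step written for this pattern checks for the empty result before using it. The sketch below is hypothetical: the function name main_processor, the document argument, and the "tags" field are invented for illustration, and it assumes an empty execution_result arrives as None or an empty dict.

```python
def main_processor(document, execution_result):
    """Hypothetical downstream step that tolerates a skipped enrichment step.

    execution_result is whatever the enrichment step produced; it is assumed
    to be empty (None or {}) when that step was skipped on timeout.
    """
    summary = {"title": document["title"], "enriched": False}
    if execution_result:  # truthy only when enrichment actually ran
        summary["tags"] = execution_result.get("tags", [])
        summary["enriched"] = True
    return summary


doc = {"title": "Q3 report"}
print(main_processor(doc, None))                    # basic output
print(main_processor(doc, {"tags": ["finance"]}))   # enriched output
```

Recording an explicit flag such as "enriched" makes it visible downstream that a degraded result was produced.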

7. When NOT To Use Circuit Breakers

Circuit breakers protect pipelines from unreliable dependencies. They are not the right tool for:

  • Expected failures (input validation errors; these should be caught before reaching the circuit breaker)
  • Critical required steps (use step_timeout_action='fail' and let the circuit breaker trip to protect downstream steps)
  • Logic errors in your own code (fix the bug, do not mask it with retries)
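The first point can be sketched as validating input before the protected call, so that bad requests never count as dependency failures. Everything here is hypothetical: ValidationError, validate, handle, and the guarded_call parameter stand in for whatever breaker-protected call your pipeline makes.

```python
class ValidationError(ValueError):
    """Bad input from the caller; retrying will not help."""


def validate(request):
    """Reject requests whose 'query' is missing, not a string, or blank."""
    if not isinstance(request.get("query"), str) or not request["query"].strip():
        raise ValidationError("query must be a non-empty string")
    return request


def handle(request, guarded_call):
    """Validate first, then hand the request to the breaker-protected call."""
    validate(request)          # raises before the breaker sees anything
    return guarded_call(request)
```

Because validate runs first, an invalid request raises ValidationError without invoking guarded_call, so it neither consumes retries nor moves the breaker toward OPEN.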

8. Hands-On Lab

MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py

Observe:

  • Demo 1: a flaky agent failing without retries (status=failed, calls=1)
  • Demo 2: the same agent succeeding on the third call with max_retries=3
  • Demo 3: the circuit breaker tripping after 2 failures (calls 3 and 4 show "circuit open" error instead of the actual agent error)
  • Demo 4: step_timeout_action='retry' allowing a timed-out step to succeed on the second attempt (calls=2, status=completed)
  • Demo 5: step_timeout_action='skip' letting a slow optional step be bypassed (status=completed, output contains "enrichment skipped")

9. Summary

Circuit breakers prevent cascading failures by stopping calls to repeatedly failing components. The three states — CLOSED, OPEN, HALF-OPEN — govern when calls are allowed, blocked, and tested for recovery. max_retries handles transient failures; failure_threshold handles persistent ones. Step timeouts with 'skip' or 'retry' actions let pipelines degrade gracefully rather than failing hard on every slow dependency.


Exercises


Exercise 1: Run the Script and Observe All Three Circuit States

Goal: Watch the CLOSED → OPEN → HALF-OPEN → CLOSED transition happen in real output.

Instructions:

  1. Run the hands-on script (in mock mode, as in the lab above):
   MESHFLOW_MOCK=1 python3 hands_on/15_circuit_breaker.py
  2. The script simulates a flaky agent that fails on a configurable percentage of calls. Read the output and find the lines that indicate state transitions (look for labels like [CIRCUIT: OPEN], [CIRCUIT: HALF-OPEN], [CIRCUIT: CLOSED]).
  3. Record:
     - At which call number did the circuit first open?
     - How long (in simulated seconds) did the circuit stay OPEN before entering HALF-OPEN?
     - Did the probe call in HALF-OPEN succeed or fail? What was the resulting state?
     - At which call number did the circuit return to CLOSED (if it did)?
  4. Draw a simple timeline (pen and paper or in your notes) marking each state transition with a timestamp.

Expected output: A sequence of labeled output lines showing successful calls (CLOSED), the threshold breach (OPEN), the wait period (OPEN), the probe attempt (HALF-OPEN), and recovery (CLOSED or re-OPEN if the probe failed).


Exercise 2: Tune the Circuit Breaker Parameters

Goal: Understand how parameter changes affect circuit breaker sensitivity.

Instructions:

  1. Open hands_on/15_circuit_breaker.py and find the CircuitBreakerConfig instantiation. Note the default values.
  2. Make the circuit breaker more sensitive by changing:
     - failure_threshold from its default down to 2
     - failure_window_s from its default up to 120 (wider window catches more failures)
     - half_open_after_s down to 5 (recover faster)
     Run the script and note how quickly the circuit opens and how fast it recovers.
  3. Make the circuit breaker less sensitive by changing:
     - failure_threshold up to 10
     - failure_window_s down to 10 (very short window, so failures age out quickly)
     - half_open_after_s up to 60 (slow recovery)
     Run the script and note whether the circuit ever opens with the same failure pattern.
  4. Find the parameter combination that gives you "just right" behavior for the simulated failure rate: the circuit should open when there is a real sustained problem but not on a single transient failure.
  5. Write two to three sentences explaining the trade-off between a sensitive circuit breaker (opens fast) and a tolerant one (opens slowly).

Expected output: Three different behavioral profiles from the same underlying failure simulation, demonstrating that parameters directly control sensitivity.


Exercise 3: Test step_timeout_action Values

Goal: See how each step_timeout_action handles a slow agent.

Instructions:

  1. The hands-on script should include a "slow agent" demo: an agent step that takes longer than step_timeout_s. If not, create one along these lines:
   import time
   from meshflow import MeshFlow, MeshNode
   from meshflow.core.schemas import Policy

   def slow_agent(payload):
       time.sleep(10)  # always exceeds the 2-second timeout
       return {"result": "done"}

   # Timeout settings are Policy fields (see Section 4), not CircuitBreakerConfig fields.
   policy = Policy(step_timeout_s=2.0, step_timeout_action="fail")
   # Assumption: MeshFlow accepts the policy as a keyword argument;
   # check the hands-on script for the exact wiring in your version.
   app = MeshFlow(policy=policy)
   node = MeshNode.from_callable(slow_agent, name="slow_agent")
   app.add_node(node)
  2. Run with step_timeout_action="fail". Record the exception type and message.
  3. Change to step_timeout_action="skip". Run again. What value does the next step receive as its input? Is the skipped step recorded in the ledger?
  4. Change to step_timeout_action="retry" and attach a CircuitBreakerConfig with max_retries=2 to the policy. Run again. How many times does the agent attempt to run before the final action is taken? What is the total elapsed time?
  5. Build a table in your notes: one row per step_timeout_action value, columns for "behavior", "ledger entry?", "next step receives", and "total elapsed time".

Expected output: Three distinct behaviors from the same slow agent, demonstrating the three timeout action modes. Elapsed times should reflect retry attempts in the "retry" mode.


Exercise 4: Implement Graceful Degradation with a Fallback

Goal: Build a pipeline that serves a degraded result when the primary agent is unavailable.

Instructions:

  1. Design a two-path pipeline:
     - Primary path: An agent that calls an external API (simulate a down service with a function that randomly raises an exception 80% of the time).
     - Fallback path: A simpler agent that returns a cached or default response.
  2. Configure a circuit breaker that opens after 2 failures in 30 seconds:
   from meshflow import MeshFlow, MeshNode
   from meshflow.core.schemas import CircuitBreakerConfig, Policy

   cb = CircuitBreakerConfig(
       failure_threshold=2,
       failure_window_s=30.0,
       half_open_after_s=10.0,
   )
   # step_timeout_action is a Policy field (see Section 4).
   policy = Policy(step_timeout_action="skip", circuit_breaker=cb)
   # Assumption: MeshFlow accepts the policy as a keyword argument;
   # check the hands-on script for the exact wiring in your version.
   app = MeshFlow(policy=policy)
  3. Configure a fallback so that when the primary agent's circuit is OPEN, the workflow automatically uses the fallback agent:
   primary_node = MeshNode.from_callable(
       primary_agent,
       name="primary",
       fallback=MeshNode.from_callable(fallback_agent, name="fallback")
   )
  4. Run the pipeline 10 times in a loop and observe:
     - How many times did the primary succeed?
     - How many times did the fallback activate?
     - Did the workflow ever return an error to the caller (versus gracefully degrading)?
  5. Inspect the ledger records. How are fallback activations recorded? Is it clear from the ledger that a degraded response was returned?

Expected output: The majority of calls should succeed (either primary or fallback). The workflow should never raise an unhandled exception. The ledger should record both primary failures and fallback activations.
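If you need a starting point for the two agents, the sketch below simulates the flaky primary and the static fallback in plain Python. The names make_primary_agent, fallback_agent, and call_with_fallback are hypothetical, the single-argument agent signature is an assumption, and call_with_fallback only mimics what a framework-level fallback would do.

```python
import random


def make_primary_agent(failure_rate=0.8, rng=None):
    """Build a simulated API-backed agent that fails failure_rate of the time."""
    rng = rng or random.Random()

    def primary_agent(payload):
        if rng.random() < failure_rate:
            raise ConnectionError("simulated outage")
        return {"source": "primary", "answer": f"live result for {payload!r}"}

    return primary_agent


def fallback_agent(payload):
    """Return a cached or default response, flagged as degraded."""
    return {"source": "fallback", "answer": "cached default", "degraded": True}


def call_with_fallback(primary, fallback, payload):
    """Plain-Python stand-in for a framework fallback: try primary, then fallback."""
    try:
        return primary(payload)
    except ConnectionError:
        return fallback(payload)


primary = make_primary_agent(failure_rate=0.8, rng=random.Random(15))
sources = [call_with_fallback(primary, fallback_agent, "q")["source"] for _ in range(10)]
print(sources.count("primary"), sources.count("fallback"))
```

Passing a seeded random.Random makes a run repeatable, which helps when you compare ledger entries across parameter changes.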


Exercise 5: Simulate a Cascading Failure and Recovery

Goal: Observe how circuit breakers prevent cascading failures in a multi-node pipeline.

Instructions:

  1. Build a four-node pipeline where Node 3 is the unstable one:
     - Node 1: Always succeeds (data preparation)
     - Node 2: Always succeeds (enrichment)
     - Node 3: Fails 90% of the time (simulates a down service)
     - Node 4: Always succeeds (formatting)
  2. Configure separate circuit breakers per node (if MeshFlow supports per-node circuit breakers) or use the global circuit breaker.
  3. Run the pipeline 20 times and observe:
     - After how many runs does Node 3's circuit open?
     - While Node 3's circuit is OPEN, do Nodes 1 and 2 still execute (wasted work) or does the pipeline short-circuit earlier?
     - How does the failure of Node 3 affect the output of Node 4? (Does Node 4 receive a None, a default, or does it not execute at all?)
  4. After the circuit opens, wait for half_open_after_s and run one more call. Record whether the probe succeeds or fails and what the new state is.
  5. Write a short paragraph (4–5 sentences) on how circuit breakers prevent cascading failures from propagating through a multi-stage pipeline and why this matters for system reliability.

Expected output: Clear evidence that Node 3's circuit opens after the threshold is reached, the pipeline handles the OPEN state without crashing, and the recovery probe (HALF-OPEN) is visible in the output.