Resilience Patterns

Circuit breakers, exponential backoff, and the five-layer safety stack that keeps autonomous agents from cascading into failure.

Executive Takeaway

If you only implement three resilience controls first, implement these:

Circuit breakers — stop failing skills from cascading into system-wide instability.
Retry discipline — exponential backoff with jitter for transient failures, fail fast for permanent errors.
Escalation visibility — quarantine unstable skills and notify humans with actionable context.

The Five-Layer Safety Stack

Chapter 22 introduced this stack as a production pattern. This chapter is the operations playbook: concrete failure handling behavior, implementation details, and incident-time decision rules.

FlowPilot implements resilience as a stack of five layers. Each layer handles a different class of failure:

┌─────────────────────────────────────────────────────────────┐
│  Layer 1: PREVENTION                                         │
│  Circuit breakers stop cascading failures before they start  │
├─────────────────────────────────────────────────────────────┤
│  Layer 2: RECOVERY                                          │
│  Self-repair retries with exponential backoff               │
├─────────────────────────────────────────────────────────────┤
│  Layer 3: ESCALATION                                        │
│  Auto-disable unstable skills after threshold               │
├─────────────────────────────────────────────────────────────┤
│  Layer 4: EVALUATION                                        │
│  Hard gates for technical errors vs. soft failures          │
├─────────────────────────────────────────────────────────────┤
│  Layer 5: BACKOFF                                           │
│  Exponential backoff for heartbeat on repeated failures     │
└─────────────────────────────────────────────────────────────┘

Layer 1: Circuit Breakers

A circuit breaker prevents a failing skill from taking down the whole agent. The pattern comes from electrical engineering: when too much current flows, a breaker trips and protects the circuit.

In agentic systems, the “current” is repeated failures. The breaker opens when a skill fails too many times in a row.

States

CLOSED (normal)    → Skill executes normally
       │
       ├── 3+ consecutive failures in 1h window
       ▼
OPEN (tripped)     → Skill is blocked, returns error immediately
       │
       ├── After 30 min cooldown
       ▼
HALF-OPEN (testing)→ One test request allowed
       │
       ├── Success → back to CLOSED
       └── Failure → back to OPEN

Implementation

async function executeWithCircuitBreaker(
  supabase: SupabaseClient,
  skillName: string,
  execute: () => Promise<SkillResult>
): Promise<SkillResult> {
  const breaker = await getCircuitBreakerState(supabase, skillName);

  if (breaker.state === 'OPEN') {
    if (Date.now() < breaker.resetAt) {
      return { error: 'CIRCUIT_OPEN', message: `Skill '${skillName}' is temporarily disabled` };
    }
    // Transition to HALF-OPEN for test
    await setCircuitBreakerState(supabase, skillName, 'HALF-OPEN');
  }

  try {
    const result = await execute();
    await resetCircuitBreaker(supabase, skillName); // Success → CLOSED
    return result;
  } catch (err) {
    await recordFailure(supabase, skillName);

    const failures = await getRecentFailureCount(supabase, skillName, '1 hour');
    if (failures >= 3) {
      await openCircuitBreaker(supabase, skillName, 30 * 60 * 1000); // 30 min
    }
    throw err;
  }
}

Why Not Just Disable?

Disabling is permanent until a human re-enables. A circuit breaker is temporary and self-healing. The distinction matters for transient failures:

API rate limit hit → circuit trips for 30 min → tries again → succeeds → resets automatically
Skill logic bug → circuit trips → tries again → fails again → escalates to human

The circuit breaker handles the first case without human intervention. The escalation layer handles the second.

Layer 2: Exponential Backoff

When a skill fails, the agent doesn’t retry immediately. It waits, and each retry waits longer.

Attempt 1: Fail → Wait 1s
Attempt 2: Fail → Wait 2s
Attempt 3: Fail → Wait 4s
Attempt 4: Fail → Wait 8s
Attempt 5: Fail → Give up, log, escalate

With Jitter

Pure exponential backoff causes “thundering herd” problems when multiple agents retry simultaneously. Jitter spreads retries across time:

function backoffDelay(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const exponential = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  const jitter = Math.random() * 0.3 * exponential; // ±30% randomness
  return exponential + jitter;
}

// Usage
for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
  try {
    return await executeSkill(skillName, params);
  } catch (err) {
    if (attempt === MAX_RETRIES - 1) throw err;
    await sleep(backoffDelay(attempt));
  }
}

When to Retry vs. When to Fail Fast

Not every error should be retried:

Error Type	Retry?	Reason
Network timeout	Yes	Transient
Rate limit (429)	Yes (with delay)	Quota resets
Invalid params (400)	No	Retrying won’t help
Auth failure (401)	No	Credential issue
Not found (404)	No	Resource doesn’t exist
Server error (500)	Yes (limited)	May be transient

function shouldRetry(error: SkillError): boolean {
  if (error.status === 429) return true;  // Rate limit
  if (error.status >= 500) return true;   // Server errors
  if (error.type === 'NETWORK_TIMEOUT') return true;
  return false; // Don't retry client errors
}

Layer 3: Skill Escalation

After 3+ consecutive failures, a skill is automatically quarantined. This happens in the SELF-HEAL phase of every heartbeat:

// Self-heal: find unstable skills
const unstableSkills = await supabase
  .from('agent_activity')
  .select('skill_name, status')
  .eq('status', 'error')
  .gte('created_at', new Date(Date.now() - 3 * 24 * 60 * 60 * 1000).toISOString())
  .order('created_at', { ascending: false });

// Group and count consecutive failures
const streaks = getConsecutiveFailureStreaks(unstableSkills);

for (const [skillName, streak] of streaks) {
  if (streak >= 3) {
    // Quarantine the skill
    await supabase.from('agent_skills')
      .update({ enabled: false, quarantine_reason: `${streak} consecutive failures` })
      .eq('name', skillName);

    // Disable dependent automations
    await disableAutomationsUsingSkill(supabase, skillName);

    // Notify admin
    await createActivityEntry(supabase, 'skill_quarantined', {
      skill_name: skillName,
      failure_streak: streak
    });
  }
}

The admin sees quarantined skills in the Activity Feed and can investigate before re-enabling.

Layer 4: Hard Gates vs. Soft Failures

Not all errors are equal. FlowPilot distinguishes:

Type	Examples	Behavior
Hard failure	Auth error, schema violation, missing required field	Abort immediately, log with full context
Soft failure	Partial result, empty response, low-confidence output	Continue with warning, log for review
Expected failure	”No leads to qualify”, “No content to publish”	Treat as success, no escalation

function classifyFailure(error: SkillError): FailureClass {
  if (error.status === 401 || error.status === 403) return 'HARD';
  if (error.type === 'SCHEMA_VIOLATION') return 'HARD';
  if (error.type === 'EMPTY_RESULT') return 'EXPECTED';
  if (error.confidence < 0.5) return 'SOFT';
  return 'SOFT';
}

Hard failures abort the current operation immediately. Soft failures are logged but don’t stop the heartbeat from completing other steps. Expected failures are effectively no-ops.

Without this distinction, a single “no leads to qualify today” result would abort the entire heartbeat, skip the blog planning, and miss the newsletter review. That’s incorrect behavior — the agent should be resilient to empty queues.

Layer 5: Heartbeat Backoff

When the heartbeat itself fails repeatedly, the system backs off the schedule:

Normal schedule:   Every 12 hours
1st failure:       Next run in 12h (unchanged)
2nd failure:       Next run in 24h
3rd failure:       Next run in 48h + admin notification
4th+ failure:      Heartbeat paused, admin must manually resume

This prevents a broken heartbeat from hammering the infrastructure:

async function scheduleNextHeartbeat(
  supabase: SupabaseClient,
  lastResult: HeartbeatResult
): Promise<void> {
  const consecutiveFailures = await getConsecutiveHeartbeatFailures(supabase);

  const baseInterval = 12 * 60 * 60 * 1000; // 12 hours
  const multiplier = Math.min(Math.pow(2, consecutiveFailures - 1), 4); // Max 4x
  const nextInterval = baseInterval * multiplier;

  if (consecutiveFailures >= 4) {
    await pauseHeartbeat(supabase, 'Too many consecutive failures');
    await notifyAdmin(supabase, 'Heartbeat paused — manual review required');
    return;
  }

  await scheduleAt(supabase, Date.now() + nextInterval);
}

The Anti-Patterns

Anti-Pattern	Consequence	Fix
No circuit breaker	One flaky API takes down all skills	Circuit breaker with OPEN/HALF-OPEN states
Immediate retry	Thundering herd, rate limit death spiral	Exponential backoff with jitter
No failure classification	Empty queue = abort everything	Hard/soft/expected failure types
Permanent disable on failure	Human required for every transient error	Circuit breaker auto-resets after cooldown
No heartbeat backoff	Broken heartbeat hammers infrastructure	Exponential backoff on heartbeat schedule
No admin notification	Silent failures accumulate unnoticed	Escalation creates activity log entries

Putting It Together

A robust agent run looks like this when everything goes right:

Heartbeat starts → Circuit breakers all CLOSED
→ Skills execute → Some succeed, one rate-limited
→ Rate-limited skill: retry with backoff → succeeds on attempt 2
→ Heartbeat completes → Schedule next run in 12h

When things go wrong:

Heartbeat starts → One skill: circuit OPEN (3 failures yesterday)
→ Skip that skill, log as quarantined
→ Other skills execute normally
→ One skill: hard failure (auth error)
→ Abort that skill immediately, log full context
→ Other skills continue
→ Heartbeat completes (partial success)
→ Admin sees quarantine + auth error in Activity Feed
→ Schedule next run in 12h (no failure increment, partial success counts)

The agent doesn’t give up. It doesn’t fail silently. It does what it can, logs what it couldn’t, and gives the admin the information to fix what needs fixing.

Resilience is not about preventing failure. It’s about making failure cheap, visible, and recoverable. An agent that handles failure gracefully is more trustworthy than an agent that never fails — because you know exactly what will happen when things go wrong.

Operational Health Monitoring

Beyond circuit breakers and recovery, production agents need continuous health signals — not just “did the last heartbeat succeed?” but “is the system operating within expected parameters?”

Two signals matter most in practice:

SLA health — Are the agent’s objectives being completed within expected timeframes? An agent that runs correctly but produces no meaningful outcomes is still failing. SLA monitoring tracks whether the autonomous loop is delivering, not just executing.

Instance health — Is the overall system configuration valid? Skills enabled, modules active, integrations reachable, memory accessible? A healthy heartbeat on a degraded instance produces subtly wrong behavior that is harder to diagnose than a hard failure.

These checks belong in the observability layer, not just in error handling. They give the team a signal before things break rather than after.

The operational principle: instrument for drift, not just failure. Most production issues in autonomous agents are not crashes — they are slow degradations that accumulate over days until someone notices the outputs no longer match expectations.

Next: the failure mode nobody talks about — recovering from hallucinated tool calls. Tool Hallucination Recovery →