The Token Economy
Context budget management — how to run 100+ skills without hitting the context ceiling. Prompt compilation, lazy loading, and graceful degradation.
The Problem No One Talks About
Every tutorial about AI agents focuses on what the agent can do. Nobody talks about the math that determines what it will do in practice.
Here’s the uncomfortable truth: context windows are not as large as they seem. A 128K token window sounds enormous until you stack up everything an autonomous agent needs:
System prompt + grounding rules ~3,000 tokens
Soul + Identity + Agents ~2,000 tokens
Domain context (CMS schema) ~2,000 tokens
Skill metadata (100 skills) ~10,000 tokens
Skill instructions (if loaded) ~50,000 tokens ← THIS IS THE KILLER
Working memory (30 entries) ~3,000 tokens
Objectives + progress ~2,000 tokens
Conversation history ~5-15,000 tokens
────────────────────────────────────────────────────
Total without instructions: ~27,000 tokens (21% of budget)
Total WITH all instructions: ~77,000 tokens (60% of budget)
If you naively load everything, you’ve consumed 60% of your context before the agent even starts reasoning. The remaining 40% must handle the actual conversation, tool call results, and the agent’s chain of thought.
This is why the Token Economy exists: the art of spending context tokens where they create the most value.
The Prompt Compiler
An autonomous agent doesn’t have a “prompt.” It has a prompt compiler — a system that assembles the optimal prompt for each reasoning cycle based on current state, budget, and intent.
FlowPilot’s prompt compiler builds a 6-layer stack:
┌──────────────────────────────────────────┐
│ Layer 1: GROUNDING_RULES │ ~800 tokens
│ Response format, safety constraints, │ Static, always present
│ tool-use conventions │
├──────────────────────────────────────────┤
│ Layer 2: MODE IDENTITY │ ~400 tokens
│ "You are FlowPilot, operating in │ Varies by surface
│ [heartbeat|operate|chat] mode" │
├──────────────────────────────────────────┤
│ Layer 3: WORKSPACE FILES │ ~2,000 tokens
│ Soul, Identity, Agents memory keys │ Loaded from agent_memory
│ (personality, boundaries, rules) │
├──────────────────────────────────────────┤
│ Layer 4: DOMAIN CONTEXT │ ~2,000 tokens
│ CMS page schema, active modules, │ Dynamic, filtered by
│ site configuration awareness │ relevance
├──────────────────────────────────────────┤
│ Layer 5: MEMORIES & OBJECTIVES │ ~3,000 tokens
│ Working memory (top 30), active │ Semantic search +
│ objectives with progress │ recency sort
├──────────────────────────────────────────┤
│ Layer 6: REPLY DIRECTIVES │ ~300 tokens
│ Mode-specific output instructions │ Static per surface
└──────────────────────────────────────────┘
Total system prompt: ~8,500 tokens (6.6% of 128K)
The key insight: the system prompt is only 6.6% of the budget. This leaves 93% for skills, conversation, and reasoning. The prompt compiler achieves this through aggressive filtering and lazy loading.
Lazy Instruction Loading (Law 3 in Practice)
This is the single most impactful optimization in the entire system. It’s the difference between a 77K token prompt and a 27K token prompt.
The Problem
Each skill has two parts:
- Metadata — name, description, JSON schema (~100 tokens)
- Instructions — detailed usage guide, edge cases, decision tables (~500-2,000 tokens)
With 100+ skills, loading all instructions costs 50K+ tokens. That’s nearly half the context window consumed by information the agent might need.
The Solution
Phase 1 — STARTUP (always):
Load metadata only for all eligible skills
Cost: ~100 tokens × N skills ≈ 8,000 tokens
Phase 2 — ON-CALL (on demand):
When the LLM selects a skill, fetch its full instructions
Cost: ~500-2,000 tokens per skill actually used
Typical: 2-4 skills per turn = 1,000-8,000 tokens
Phase 3 — BUDGET PRESSURE (when needed):
Compress metadata, drop low-priority skills
See: Skill Budget Degradation below
Implementation
// Phase 1: Load lightweight tool definitions
const tools = await loadSkillTools(supabase, scope, budget);
// Returns: { name, description, parameters } — NO instructions
// Phase 2: After LLM picks a skill
const instructions = await fetchSkillInstructions(supabase, skillName);
// Returns: full instructions text, injected as system message
// before re-entering the reasoning loop
The agent sees 80 skill summaries (8K tokens) and only loads the 2-4 skill manuals it actually needs (2-8K tokens). Total skill cost: 10-16K tokens instead of 58K.
Skill Budget Degradation
When token usage climbs, the system progressively reduces skill richness:
Budget usage Action Savings
─────────────────────────────────────────────────────────────
0-50% Full metadata (name + description Baseline
+ full JSON schema)
50-75% Compact metadata (name + ~40% reduction
one-line description, simplified in skill tokens
schema)
75-90% Drop low-priority skills entirely. ~60% reduction
Keep only skills matching current
intent category.
90%+ Emergency: flush working state, Graceful exit
save progress to memory, end turn
with summary.
Intent-Based Filtering
Before degradation even kicks in, the system filters skills by relevance. Not every skill is offered to every request:
// agent-operate analyzes the user's message
const intent = classifyIntent(userMessage);
// Returns: ['content', 'crm'] or ['booking'] or ['accounting']
// Only skills matching the intent categories are loaded
const relevantSkills = allSkills.filter(s =>
intent.includes(s.category) || s.category === 'core'
);
// 100+ skills → typically 30-50 per request
This means the agent never sees all skills simultaneously. It sees 30-50 relevant skills plus core utilities, keeping the metadata cost around 5-8K tokens.
The Token Budget Object
Every reasoning cycle carries a budget tracker:
interface TokenBudget {
limit: number; // Max tokens for this run (e.g., 80,000)
used: number; // Accumulated across all turns
remaining: number; // limit - used
turnCount: number; // Number of reasoning turns so far
maxTurns: number; // Hard limit on reasoning iterations
}
The budget serves two purposes:
-
Cost control — An autonomous heartbeat that runs 48 times/day must not burn unlimited API credits. Each run has a token ceiling.
-
Graceful degradation — When the budget runs low, the agent saves its progress and exits cleanly rather than crashing mid-task.
// In the reasoning loop
if (isOverBudget(usage, budget.limit)) {
// Save partial progress
await saveProgressToMemory(supabase, currentState);
// Exit with summary of what was accomplished
return { status: 'budget_exhausted', completed: stepsDone, remaining: stepsLeft };
}
Cost Tiers: Free First, Paid When Necessary
Not all reasoning requires the same model. FlowPilot uses a tiered approach:
| Tier | Model | Cost | Use Case |
|---|---|---|---|
fast | gpt-4.1-mini | ~$0.40/M tokens | Default for most operations: tool selection, simple Q&A, data lookups |
reasoning | gpt-4.1 / gemini-2.5-pro | ~$10/M tokens | Complex planning, multi-step reasoning, content generation |
The default is always fast. Skills can specify a preferred_provider to override:
{
"name": "plan_quarterly_strategy",
"preferred_provider": "reasoning",
"instructions": "This skill requires deep analysis..."
}
The Math
A heartbeat running 48 times/day with the fast tier:
- ~10K tokens per run × 48 runs = 480K tokens/day
- Cost: ~$0.19/day = $5.70/month
The same heartbeat with reasoning for everything:
- Cost: ~$4.80/day = $144/month
That’s a 25x cost difference. The tier system isn’t optional — it’s existential for sustainable autonomy.
Memory as Context Extension
When information doesn’t fit in the context window, it lives in memory and gets retrieved on demand:
Context Window (fast, expensive):
└── Working memory: top 30 entries, always loaded
└── Conversation history: recent messages
Memory Tiers (slow, cheap):
└── L3 Long-term: full-text search via pg_trgm
└── L4 Semantic: vector similarity via pgvector
The agent doesn’t need everything in context. It needs to know that it can find things. The skill instructions tell it when to search memory:
"When a user asks about past blog performance, use memory_read
to search for 'blog_engagement' in the 'fact' category before
making recommendations."
This pattern — pointers in context, data in memory — is how you scale beyond the context window without losing capability.
The Context Stack — Where Every Token Goes
Understanding the total context cost requires seeing the full stack. Here is a real breakdown from a FlowPilot instance with 100+ skills:
Layer Tokens % of 128K
────────────────────────────────────────────────────────────
System prompt + GROUNDING_RULES ~3,000 2.3%
Soul + Agents + Identity ~2,000 1.6%
CMS Schema awareness ~2,000 1.6%
Skill metadata (80 filtered skills) ~8,000 6.3%
Memories (semantic search results) ~2,000 1.6%
Objectives + progress ~2,000 1.6%
Conversation history ~5-15K 4-12%
────────────────────────────────────────────────────────────
Total: ~25-35K ~20-27%
The key insight: with 100+ skills, the system uses only ~25% of the context window. This leaves 75% for the model’s reasoning chain and tool call responses.
Scaling Thresholds
This headroom is not infinite. Here is where the architecture faces pressure:
| Skill Count | Context Cost | Status | Required Action |
|---|---|---|---|
| 50-100 | ~5-8K tokens | ✅ Comfortable | Current filtering works |
| 100-200 | ~8-16K tokens | ⚠️ Manageable | Intent scoring must be aggressive |
| 200-500 | ~16-40K tokens | 🔴 Critical | Need hierarchical skill registries — category → sub-skill lookup |
| 500+ | ~40K+ tokens | 🚫 Architectural limit | Must move to multi-agent delegation or external skill index |
The 200-skill threshold is the most important planning milestone. Beyond it, the current flat-list-with-filtering approach starts competing with conversation history for context space.
The Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Loading all skill instructions | 50K+ tokens before first message | Lazy instruction loading |
| No intent filtering | All skills in every prompt | Category-based filtering |
| Single model tier | $144/month for heartbeats | Fast/reasoning tiers |
| No budget tracking | Runaway API costs | TokenBudget object |
| Everything in context | Context overflow, truncation | Memory tiers + search |
| No graceful degradation | Hard crashes at token limits | Progressive skill compression |
Monitoring: Know Your Spend
Every agent activity log includes token usage:
{
"skill_name": "generate_blog_post",
"token_usage": {
"prompt_tokens": 12450,
"completion_tokens": 2830,
"total_tokens": 15280
},
"duration_ms": 3200,
"model": "gpt-4.1-mini"
}
This data feeds the Engine Room dashboard, where operators can see:
- Token spend per skill (which skills are expensive?)
- Token spend per heartbeat cycle (is autonomy sustainable?)
- Budget utilization over time (are we trending up?)
Without monitoring, the token economy is theoretical. With it, it’s a managed resource.
Cost Modeling Worksheet
Before deploying an autonomous agent, estimate your monthly cost. The variables are predictable:
Monthly Cost = (heartbeat_runs/day × 30)
× avg_tokens_per_run
× model_cost_per_token
+ (operate_sessions/day × 30)
× avg_tokens_per_session
× model_cost_per_token
The Variables
| Variable | How to Estimate | Typical Range |
|---|---|---|
heartbeat_runs/day | Admin-configured schedule | 2 (twice daily) |
avg_tokens_per_run | From activity logs after first week | 8,000–15,000 |
operate_sessions/day | How often admin interacts | 2–10 |
avg_tokens_per_session | From activity logs | 3,000–8,000 |
model_cost_per_token | Provider pricing page | See table below |
Reference Pricing (2026)
| Model | Input $/M tokens | Output $/M tokens | Best for |
|---|---|---|---|
| gpt-4.1-mini | $0.40 | $1.60 | Heartbeat default, most operations |
| gpt-4.1 | $2.00 | $8.00 | Complex planning, content generation |
| gemini-2.5-pro | $1.25 | $10.00 | Long context, multimodal |
| claude-3-5-haiku | $0.80 | $4.00 | Fast, capable, good tool use |
Prices change frequently — verify against current provider pricing.
Example: Small B2B Site
Heartbeat: 2/day × 30 = 60 runs/month
@ 10,000 tokens avg × $0.40/M = $0.24/month
Operate: 5/day × 30 = 150 sessions/month
@ 5,000 tokens avg × $0.40/M = $0.30/month
Reasoning tier (10% of runs for complex tasks):
6 runs × 20,000 tokens × $2.00/M = $0.24/month
Total: ~$0.78/month on fast model + occasional reasoning
Example: Active Marketing Agency
Heartbeat: 4/day × 30 = 120 runs/month
@ 15,000 tokens avg × $0.40/M = $0.72/month
Operate: 20/day × 30 = 600 sessions/month
@ 8,000 tokens avg × $0.40/M = $1.92/month
Reasoning tier (30% of runs):
36 runs × 25,000 tokens × $2.00/M = $1.80/month
Total: ~$4.44/month
The Ceiling Check
Before going live, calculate your worst-case scenario:
Worst case = max_heartbeat_frequency
× max_tokens_per_run (128K limit)
× most_expensive_model
× 30 days
If the worst case is acceptable, deploy. If not, set lower budget.limit per run or reduce heartbeat frequency.
The numbers are almost always surprisingly small. The agent is not expensive — it’s the predictability that matters. A $5/month agent that can’t explain its costs is worse than a $50/month agent where every token is accounted for.
The token economy is not about limits — it’s about allocation. Every token spent on skill metadata is a token not available for reasoning. Every reasoning token spent on the wrong model is money wasted. The discipline is spending each token where it creates the most value.
Part III begins here. You’ve seen how the engine works: heartbeats, skills, memory, tokens. The next chapters shift to operating that engine — feedback loops, drift detection, governance, and the production patterns that keep an autonomous agent reliable at scale.
Next: how agents grow smarter over time — feedback loops, reflection, and compound learning. Feedback Loops →