Cost & ROI Analysis

Economics

Cost optimization ranking, cache economics, model role savings, success metrics, and resource requirements. The business case for every requirement.

Cost Optimization Ranking

Ranked by ROI—implement in this order for maximum cost impact.

Prerequisite: All session logs report cost: {total: 0}. No cost optimization can be measured until instrumentation is fixed. No current requirement covers cost telemetry. See Session Insights.
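As a starting point for closing that gap, a per-turn cost accumulator might look like the sketch below. The pricing table, model names, and field layout are illustrative assumptions, not the project's actual log schema; only the Anthropic-style rates ($15/MTok input, 90% discount on cache reads) come from this document.

```python
# Minimal cost-telemetry sketch so session logs report a real total
# instead of cost: {total: 0}. Pricing and field names are assumptions.

# $/MTok rates; cached input reads billed at a 90% discount.
PRICING = {
    "opus": {"input": 15.00, "cached_input": 1.50, "output": 75.00},
    "smol": {"input": 0.25, "cached_input": 0.025, "output": 1.25},
}

def turn_cost(model: str, uncached_in: int, cached_in: int, out: int) -> float:
    """Cost of one turn in dollars, given token counts."""
    p = PRICING[model]
    return (uncached_in * p["input"]
            + cached_in * p["cached_input"]
            + out * p["output"]) / 1_000_000

class SessionCost:
    """Accumulates per-turn costs for the session log."""
    def __init__(self) -> None:
        self.total = 0.0

    def record(self, model: str, uncached_in: int, cached_in: int, out: int) -> None:
        self.total += turn_cost(model, uncached_in, cached_in, out)

session = SessionCost()
session.record("opus", 2_000, 40_000, 800)   # mostly cache-hit turn
session.record("smol", 1_000, 0, 200)        # cheap exploration turn
print(round(session.total, 4))
```

With instrumentation of this shape in place, every row in the ranking below becomes measurable instead of estimated.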
| # | Feature | Savings | Effort | Phase |
|---|---------|---------|--------|-------|
| 1 | R8: Numeric Length Anchors | 5–10% | Trivial | P1 |
| 2 | R1: Prompt Caching | 10–15% | Low | P1 |
| 3 | R16: Model Roles | 30% | Medium | P3 |
| 4 | R7: Model Variants | 3–5% | Low | P4 |
| 5 | R6: Verification Agent | Negative | Medium | P6a |
R6 adds cost but prevents costly mistakes. Verification doubles the token spend on checked work, but catching a bug during verification is orders of magnitude cheaper than catching it in production.

Cache Economics

Prefix-based caching across all three major providers. Universal rule: maximize identical byte prefix length.

- $0.50–$0.60: saved per 10-turn Opus session
- 90%: cache read discount (Anthropic)
- 93.9%: cost savings vs Sonnet 4.6 (DeepSeek case study)

| Provider / Factor | Detail |
|---|---|
| Anthropic | 5-minute TTL, $15/MTok input, 90% discount on cache reads |
| OpenAI | prefix-based, automatic, ~50% discount on cache reads |
| DeepSeek | prefix-based, ~85% hit rate achievable (Reasonix case study) |
| Cache busters | dynamic dates, nondeterministic tool schemas, per-repo paths |
| Fix | sort tool schemas alphabetically by name |
| Fix | deterministic serialization across requests |
| Fix | stable content first, append-only history |
| Variants | ~100 uncached tokens/turn beats 4 separate cached prefixes |
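The three fixes can be sketched as a cache-stable request builder. `build_request` and the schema shapes here are hypothetical; only the ordering rules (sorted tool schemas, deterministic serialization, stable content first) come from the list above.

```python
import json

def canonical_tools(tools: list[dict]) -> list[dict]:
    """Sort tool schemas alphabetically by name so the serialized
    byte prefix is identical across requests."""
    return sorted(tools, key=lambda t: t["name"])

def build_request(system: str, tools: list[dict], history: list[dict]) -> str:
    """Stable content first (system prompt, tools), then append-only
    history. sort_keys gives deterministic serialization regardless of
    dict insertion order; no dynamic dates or per-repo paths up front."""
    payload = {
        "system": system,                  # static: no "Today is ..." lines
        "tools": canonical_tools(tools),
        "messages": history,               # append-only; earlier turns never mutate
    }
    return json.dumps(payload, sort_keys=True)

# Same logical request, tools registered in a different order:
a = build_request("You are an agent.", [{"name": "read"}, {"name": "grep"}], [])
b = build_request("You are an agent.", [{"name": "grep"}, {"name": "read"}], [])
print(a == b)  # identical bytes, so the provider's prefix cache can hit
```

The point of the round-trip comparison is that any byte-level difference, however semantically harmless, truncates the cacheable prefix at that point.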

Model Role Savings

Route read-only operations to cheaper models: with ~60% of turns (exploration work) routed to smol, total session cost drops by roughly 30%.

≥30%: session cost reduction with model roles

| Role | Operations | Cost Tier | % of Turns |
|------|------------|-----------|------------|
| smol | grep, find, read, ls, context_pack | Low | ~60% |
| default | edit, write, bash, implementation | Standard | ~30% |
| slow | architecture, complex debugging | Premium | ~5% |
| commit | commit messages, changelogs | Low | ~5% |
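A role router matching the table might look like the following sketch. The operation-to-role mapping mirrors the table; the relative cost weights (default tier = 1.0) are illustrative assumptions, used only to show how a ~60% smol turn mix lands above the 30% saving target.

```python
# Role routing sketch. Operation names come from the table above;
# the cost weights are assumptions, normalized to default = 1.0.
ROLE_OF_OP = {
    "grep": "smol", "find": "smol", "read": "smol", "ls": "smol",
    "context_pack": "smol",
    "edit": "default", "write": "default", "bash": "default",
    "architecture": "slow", "debugging": "slow",
    "commit": "commit", "changelog": "commit",
}

COST_WEIGHT = {"smol": 0.33, "default": 1.0, "slow": 3.0, "commit": 0.33}

def route(operation: str) -> str:
    # Unknown operations fall back to the standard tier.
    return ROLE_OF_OP.get(operation, "default")

# Rough saving estimate from the turn mix in the table:
# 60% smol, 30% default, 5% slow, 5% commit vs. everything on default.
mix = {"smol": 0.60, "default": 0.30, "slow": 0.05, "commit": 0.05}
routed = sum(share * COST_WEIGHT[role] for role, share in mix.items())
print(f"relative cost {routed:.2f} -> saving {1 - routed:.0%}")
```

Under these weights the blended session costs about two-thirds of an all-default session, which is consistent with the ≥30% target even if the true premium/discount ratios differ.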

Success Metrics

13 measurable targets across all 6 phases. Each metric has a defined measurement method.

| ID | Metric | Target | Phase | Measurement |
|----|--------|--------|-------|-------------|
| S1 | Prompt cache hit rate (5+ turn sessions) | >60% | P1 | OpenRouter response headers |
| S2 | Hook-based policy enforcement without forking | Works | P2 | Integration test |
| S3 | Parallel subagent research while chatting | Works | P3 | E2E test |
| S4 | Risk self-assessment without deny-lists | Works | P4 | Behavioral audit |
| S5 | Session cold-start context reduction | ≥40% | P5 | Context-gathering tool calls in first 3 turns |
| S6 | Independent verification for non-trivial work | Works | P6a | Verification pass rate audit |
| S7 | Output token spend reduction | ≥5% | P4 | Token logging in telemetry |
| S8 | TTSR false-positive rate | <5% | P2 | Rule trigger audit log |
| S9 | Memory extraction accuracy | ≥3/10 | P5 | Manual audit of extracted facts |
| S10 | Session fork/resume round-trip | Works | P5 | Fork + resume integration test |
| S11 | Model role cost reduction (exploration) | ≥30% | P3 | Cost tracking per session |
| S12 | Cross-agent rule discovery formats | ≥3 | P2 | Format coverage test |
| S13 | Parallel isolated tasks with correct merge | ≥5 | P3 | Worktree + merge integration test |

Resource Requirements

~10 weeks total with 1–2 developers. Phase 6 is parallelizable across 2 developers.

Phase 1 — 1 week

1 developer. Prompt refactoring (R1), feature flag config (R12), length anchors (R8). Low risk, immediate value.

Phase 2 — 2 weeks

1–2 developers. Hook system (R2/R18) and TTSR (R13) require streaming expertise. Cross-agent rules (R17) parallelizable.

Phase 3 — 2 weeks

1–2 developers. Subagent orchestration (R3/R19) is the most complex phase. Model roles (R16) parallelizable.

Phase 4 — 1 week

1 developer. Prompt-only changes: risk taxonomy (R4), model variants (R7). A/B testing methodology needed.

Phase 5 — 2 weeks

1–2 developers. Memory pipeline (R5/R14) + session tree (R15). LLM extraction quality requires iteration.

Phase 6 — 2 weeks

2 developers. Split into 6a (R6+R10, verification+budget) and 6b (R9+R11+R20, MCP+autonomy+commit). Fully parallel.