Corpus Overview
25 sessions across April 18–20, 2026. 4.8 MB total. Two distinct usage modes surface different failure patterns.
SWE-bench Evaluation
Automated Bug Fixing
The two largest sessions in the corpus (1.2 MB and 394 KB) belong to this mode. Provider "subquadratic", model "latest", thinking "off". All costs report $0.00. Up to 208 assistant messages per session.
Interactive Development
Wedding Platform Sprint
~40 minute autonomous sprint execution. Plan mode → PLAN.md → exit plan mode → execute. Database migrations, tRPC mutations, frontend pages, webhook handlers, BullMQ processors.
Top 5 Observations
System Reminder Infinite Loop
After sprint completion, the agent got trapped in 10+ consecutive system-reminder cycles (task-tracking, plan-drift, read-before-edit) with no user input between them. Each cycle consumed ~83K input tokens. The reminders fire because the agent hasn’t performed a tracked action, but the agent has nothing to do because it’s waiting for user input. Pure waste loop.
/sprint Command Not Registered
Three times per session, the agent attempted /sprint as a bash command (/bin/bash: /sprint: No such file or directory). The CLAUDE.md references a skill that isn’t registered as a tool. Agent recovers by falling back to manual file operations, but burns 3 round-trips per occurrence.
Edit-Without-Read Failures
The agent constructed edits from memory rather than reading current file state. A large whole-file edit on app-sidebar.tsx failed with “Could not find the exact text”—a single field difference (/reviews vs /brochures). The harness enforces read-before-edit via reminders, but reminders fire after the wasted edit attempt.
SWE-bench Verbose Exploration
60%+ of tool calls in evaluation sessions are find / grep / read for code discovery. One mypy session ran 27 minutes with 208 assistant messages, 124 bash calls, and 65 reads—exploring the entire type inference pipeline across 4 files. Systematic but token-expensive.
Autonomous Sprint Execution
Session ec6b0293 autonomously executed a full 8-point sprint: database migration, 6 tRPC mutations, 3 frontend pages, webhook handler, BullMQ processor, post-sprint protocol (doc sync, INDEX.md regen, sprint file move). Used ask_user to surface architectural tradeoffs before coding. The plan-then-execute flow worked cleanly.
Pain Points → Roadmap
Each observed pain point maps directly to one or more Beyond Parity requirements.
/sprint tool resolution failures
Recommendations
Five actionable fixes, ordered by impact. The first two are prerequisites for meaningful cost optimization work.
1. Kill the Reminder Loop
Add a guard: don’t fire system reminders if the last N messages contained no user input. This single fix eliminates the most expensive bug in the corpus—~83K tokens per cycle with no productive output.
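The guard can be expressed as a single predicate in the reminder path. A minimal sketch, assuming a hypothetical message shape where each message carries a `role` field; the function name and threshold are illustrative, not the harness's actual API:

```python
def should_fire_reminder(messages, n=3):
    """Fire system reminders only if the last n messages contain
    user input. When the agent is simply idle awaiting the user,
    task-tracking / plan-drift reminders are pure waste."""
    recent = messages[-n:]
    return any(m.get("role") == "user" for m in recent)
```

With this check in front of every reminder type, the 10+ cycle loop observed after sprint completion would stop at the first suppressed reminder.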
2. Block Edit Without Read
Enforce read-before-edit at the tool level, not via post-hoc reminders. The current approach lets the model waste tokens constructing the edit, fail, then waste more tokens on the reminder. Block the call entirely if the file hasn’t been read in the current turn.
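One way to enforce this at the tool layer is a per-turn read set consulted before any edit call is dispatched. A sketch under assumed harness hooks (`on_read`, `on_turn_start`, `check_edit` are hypothetical names):

```python
class EditGuard:
    """Reject edit tool calls for files not read in the current turn,
    so the model never burns tokens constructing a doomed edit."""

    def __init__(self):
        self.read_this_turn = set()

    def on_turn_start(self):
        # Reset at each turn boundary so stale reads don't count.
        self.read_this_turn.clear()

    def on_read(self, path):
        self.read_this_turn.add(path)

    def check_edit(self, path):
        if path not in self.read_this_turn:
            raise PermissionError(
                f"Refusing edit: {path} has not been read this turn")
```

The app-sidebar.tsx failure would then surface as an immediate, cheap rejection instead of a failed string match followed by a reminder.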
3. Code Navigation Index
SWE-bench sessions spend 60%+ of tool calls on find + grep for discovery. A symbol index, call graph, or semantic search would collapse multi-step exploration chains into single lookups. SubQ’s FFF search already leads here—extend it.
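As a toy stand-in for such an index (not SubQ's FFF search, whose internals aren't described here), a symbol table built once per repo collapses the find/grep chain into a dictionary lookup:

```python
import ast
from pathlib import Path

def build_symbol_index(root):
    """Map symbol name -> list of (file, line) for Python
    function and class definitions under root. A real index
    would also cover references and a call graph."""
    index = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            continue  # skip unparseable files rather than abort
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef,
                                 ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                index.setdefault(node.name, []).append(
                    (str(path), node.lineno))
    return index
```

A single `index["infer_type"]` lookup replaces the grep-then-read-then-grep-again chains seen in the mypy session.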
4. Register Skills as Tools
The /sprint failures happen because skills exist in CLAUDE.md documentation but not in the tool registry. Skills referenced in project docs must be resolvable by the harness—either as registered tools or with clear error messages.
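A registry-first dispatch path illustrates the fix. This is a hypothetical sketch of the resolution order, not the harness's real command router:

```python
SKILL_REGISTRY = {}  # name -> callable

def register_skill(name):
    """Decorator: make a documented skill resolvable by name."""
    def deco(fn):
        SKILL_REGISTRY[name] = fn
        return fn
    return deco

def dispatch(command):
    """Resolve /commands against the registry BEFORE falling back
    to bash, so an unregistered skill fails with a clear error
    instead of '/bin/bash: /sprint: No such file or directory'."""
    if command.startswith("/"):
        name = command[1:].split()[0]
        skill = SKILL_REGISTRY.get(name)
        if skill is None:
            raise LookupError(
                f"/{name} is documented but not registered as a tool")
        return skill()
    return None  # not a slash command; fall through to bash
```

Either outcome beats the observed behavior: the skill runs, or the agent gets an actionable error on the first attempt instead of burning three round-trips.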
5. Fix Cost Instrumentation
Every session reports cost: {total: 0}. Token counts are captured but no real cost data is computed. Cost optimization work (R1, R8, R16) cannot be measured until this is fixed.
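Since token counts are already captured, cost is a fold over the event stream. A minimal sketch; the per-token prices below are placeholders, and the event/usage field names are assumptions about the log schema:

```python
# Placeholder pricing in USD per token; real rates must come
# from the provider's price sheet, keyed by model.
PRICES = {"input": 3.0e-6, "output": 15.0e-6}

def session_cost(events):
    """Compute a real total from captured token counts instead
    of reporting cost: {total: 0}."""
    total = 0.0
    for e in events:
        usage = e.get("usage", {})
        total += usage.get("input_tokens", 0) * PRICES["input"]
        total += usage.get("output_tokens", 0) * PRICES["output"]
    return round(total, 6)
```

Once this lands, the ~83K-token reminder cycles and the 60%+ exploration overhead become measurable in dollars, which is what R1, R8, and R16 need.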
Corpus Metrics
corpus: 25 sessions | 4.8 MB total | Apr 18–20 2026
modes: SWE-bench automated (23) | interactive dev (2)
largest: 572ab58b → 1.2 MB | mypy type inference | 27 min
longest: ec6b0293 → ~40 min | wedding platform sprint
waste: ~83K input tokens per reminder-loop cycle
failures: 3× /sprint resolution | 4× edit-without-read
positive: full 8-point sprint executed autonomously
positive: methodical reproduce → trace → fix pattern in SWE-bench
cost: $0.00 reported (instrumentation non-functional)
provider: "subquadratic" | model: "latest" | thinking: "off"
Source: /Users/tomdimino/Downloads/session logs SubQ.zip. JSONL format, one event per line. Analysis focused on the 3–4 largest sessions for pattern identification; smaller sessions confirmed the same patterns but added no new observations.
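For reproducibility, the per-line event format can be consumed with a small tolerant reader. A sketch, assuming only what the source states (one JSON event per line); the truncation handling is a defensive assumption, not an observed property of these logs:

```python
import json

def iter_events(path):
    """Stream one JSON event per line from a session log,
    skipping blank lines and tolerating a truncated tail."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partially written final line
```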