Corpus Overview
25 sessions across April 18–20, 2026. 4.8 MB total. Two distinct usage modes surface different failure patterns.
SWE-bench Evaluation
Automated Bug Fixing
The two largest sessions in the corpus (1.2 MB and 394 KB) belong to this mode. Provider "subquadratic", model "latest", thinking "off". All costs report $0.00. Up to 208 assistant messages per session.
Interactive Development
Wedding Platform Sprint
~40 minute autonomous sprint execution. Plan mode → PLAN.md → exit plan mode → execute. Database migrations, tRPC mutations, frontend pages, webhook handlers, BullMQ processors.
Top 5 Observations
System Reminder Infinite Loop
After sprint completion, the agent got trapped in 10+ consecutive system-reminder cycles (task-tracking, plan-drift, read-before-edit) with no user input between them. Each cycle consumed ~83K input tokens. The reminders fire because the agent hasn’t performed a tracked action, but the agent has nothing to do because it’s waiting for user input. Pure waste loop.
/sprint Command Not Registered
Three times per session, the agent attempted /sprint as a bash command (/bin/bash: /sprint: No such file or directory). The CLAUDE.md references a skill that isn’t registered as a tool. Agent recovers by falling back to manual file operations, but burns 3 round-trips per occurrence.
Edit-Without-Read Failures
The agent constructed edits from memory rather than reading current file state. A large whole-file edit on app-sidebar.tsx failed with “Could not find the exact text”—a single field difference (/reviews vs /brochures). The harness enforces read-before-edit via reminders, but reminders fire after the wasted edit attempt.
SWE-bench Verbose Exploration
60%+ of tool calls in evaluation sessions are find / grep / read for code discovery. One mypy session ran 27 minutes with 208 assistant messages, 124 bash calls, and 65 reads—exploring the entire type inference pipeline across 4 files. Systematic but token-expensive.
Autonomous Sprint Execution
Session ec6b0293 autonomously executed a full 8-point sprint: database migration, 6 tRPC mutations, 3 frontend pages, webhook handler, BullMQ processor, post-sprint protocol (doc sync, INDEX.md regen, sprint file move). Used ask_user to surface architectural tradeoffs before coding. The plan-then-execute flow worked cleanly.
Pain Points → Roadmap
Each observed pain point maps directly to one or more Beyond Parity requirements.
/sprint tool resolution failures
Recommendations
Five actionable fixes, ordered by impact. The first two are prerequisites for meaningful cost optimization work.
1. Kill the Reminder Loop
Add a guard: don’t fire system reminders if the last N messages contained no user input. This single fix eliminates the most expensive bug in the corpus—~83K tokens per cycle with no productive output.
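The guard can be expressed as a single predicate in the reminder path. A minimal sketch, assuming a hypothetical message shape where each message carries a `role` field; the function name and threshold are illustrative, not the harness's actual API:

```python
def should_fire_reminder(messages, n=3):
    """Fire system reminders only if the last n messages contain
    user input. When the agent is simply idle awaiting the user,
    task-tracking / plan-drift reminders are pure waste."""
    recent = messages[-n:]
    return any(m.get("role") == "user" for m in recent)
```

With this check in front of every reminder type, the 10+ cycle loop observed after sprint completion would stop at the first suppressed reminder.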
2. Block Edit Without Read
Enforce read-before-edit at the tool level, not via post-hoc reminders. The current approach lets the model waste tokens constructing the edit, fail, then waste more tokens on the reminder. Block the call entirely if the file hasn’t been read in the current turn.
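One way to enforce this at the tool layer is a per-turn read set consulted before any edit call is dispatched. A sketch under assumed harness hooks (`on_read`, `on_turn_start`, `check_edit` are hypothetical names):

```python
class EditGuard:
    """Reject edit tool calls for files not read in the current turn,
    so the model never burns tokens constructing a doomed edit."""

    def __init__(self):
        self.read_this_turn = set()

    def on_turn_start(self):
        # Reset at each turn boundary so stale reads don't count.
        self.read_this_turn.clear()

    def on_read(self, path):
        self.read_this_turn.add(path)

    def check_edit(self, path):
        if path not in self.read_this_turn:
            raise PermissionError(
                f"Refusing edit: {path} has not been read this turn")
```

The app-sidebar.tsx failure would then surface as an immediate, cheap rejection instead of a failed string match followed by a reminder.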
3. Code Navigation Index
SWE-bench sessions spend 60%+ of tool calls on find + grep for discovery. A symbol index, call graph, or semantic search would collapse multi-step exploration chains into single lookups. SubQ’s FFF search already leads here—extend it.
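As a toy stand-in for such an index (not SubQ's FFF search, whose internals aren't described here), a symbol table built once per repo collapses the find/grep chain into a dictionary lookup:

```python
import ast
from pathlib import Path

def build_symbol_index(root):
    """Map symbol name -> list of (file, line) for Python
    function and class definitions under root. A real index
    would also cover references and a call graph."""
    index = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except SyntaxError:
            continue  # skip unparseable files rather than abort
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef,
                                 ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                index.setdefault(node.name, []).append(
                    (str(path), node.lineno))
    return index
```

A single `index["infer_type"]` lookup replaces the grep-then-read-then-grep-again chains seen in the mypy session.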
4. Register Skills as Tools
The /sprint failures happen because skills exist in CLAUDE.md documentation but not in the tool registry. Skills referenced in project docs must be resolvable by the harness—either as registered tools or with clear error messages.
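A registry-first dispatch path illustrates the fix. This is a hypothetical sketch of the resolution order, not the harness's real command router:

```python
SKILL_REGISTRY = {}  # name -> callable

def register_skill(name):
    """Decorator: make a documented skill resolvable by name."""
    def deco(fn):
        SKILL_REGISTRY[name] = fn
        return fn
    return deco

def dispatch(command):
    """Resolve /commands against the registry BEFORE falling back
    to bash, so an unregistered skill fails with a clear error
    instead of '/bin/bash: /sprint: No such file or directory'."""
    if command.startswith("/"):
        name = command[1:].split()[0]
        skill = SKILL_REGISTRY.get(name)
        if skill is None:
            raise LookupError(
                f"/{name} is documented but not registered as a tool")
        return skill()
    return None  # not a slash command; fall through to bash
```

Either outcome beats the observed behavior: the skill runs, or the agent gets an actionable error on the first attempt instead of burning three round-trips.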
5. Fix Cost Instrumentation
Every session reports cost: {total: 0}. Token counts are captured but no real cost data is computed. Cost optimization work (R1, R8, R16) cannot be measured until this is fixed.
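Since token counts are already captured, cost is a fold over the event stream. A minimal sketch; the per-token prices below are placeholders, and the event/usage field names are assumptions about the log schema:

```python
# Placeholder pricing in USD per token; real rates must come
# from the provider's price sheet, keyed by model.
PRICES = {"input": 3.0e-6, "output": 15.0e-6}

def session_cost(events):
    """Compute a real total from captured token counts instead
    of reporting cost: {total: 0}."""
    total = 0.0
    for e in events:
        usage = e.get("usage", {})
        total += usage.get("input_tokens", 0) * PRICES["input"]
        total += usage.get("output_tokens", 0) * PRICES["output"]
    return round(total, 6)
```

Once this lands, the ~83K-token reminder cycles and the 60%+ exploration overhead become measurable in dollars, which is what R1, R8, and R16 need.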
Corpus Metrics
corpus: 25 sessions | 4.8 MB total | Apr 18–20 2026
modes: SWE-bench automated (23) | interactive dev (2)
largest: 572ab58b → 1.2 MB | mypy type inference | 27 min
longest: ec6b0293 → ~40 min | wedding platform sprint
waste: ~83K input tokens per reminder-loop cycle
failures: 3× /sprint resolution | 4× edit-without-read
positive: full 8-point sprint executed autonomously
positive: methodical reproduce → trace → fix pattern in SWE-bench
cost: $0.00 reported (instrumentation non-functional)
provider: "subquadratic" | model: "latest" | thinking: "off"
Source: /Users/tomdimino/Downloads/session logs SubQ.zip. JSONL format, one event per line. Analysis focused on the 3–4 largest sessions for pattern identification; smaller sessions confirmed the same patterns but added no new observations.
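For reproducibility, the per-line event format can be consumed with a small tolerant reader. A sketch, assuming only what the source states (one JSON event per line); the truncation handling is a defensive assumption, not an observed property of these logs:

```python
import json

def iter_events(path):
    """Stream one JSON event per line from a session log,
    skipping blank lines and tolerating a truncated tail."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partially written final line
```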