// the pivot
From Blind Navigation to Vision-Grounded Planning
The hardcoded MAP_EXIT_HINTS approach collapsed under its own assumptions.
Cecil would spam DOWN into a wall for dozens of replans because the hint
said south was the exit — even when the actual doorway was west. The architecture shifted
entirely: replace the symbolic table with Claude Sonnet 4.6 vision calls.
Every replan, BizHawk captures a screenshot, POSTs it alongside the GSL state to Python,
which forwards the image to Claude. The model looks at the actual screen, reads the
structured state, considers action history, and returns a JSON plan.
No training. No dataset. No behavioral cloning pipeline. Just eyeballs on the game.
// the rabbit hole
The Bug That Made Everything Look Broken
First live run: pos=(0,0) story=0 on every tick. Screenshots writing fine —
confirmed 77 PNGs in C:\we10\screenshots\, each showing Cecil clearly in the
throne room. But Python read every state as uninitialized WRAM. Hours of diagnostics
followed: checking memory domains, testing read_u8 vs read_u16_le,
running ram_diag.lua to enumerate BizHawk's exposed memory. All pointed to
healthy reads. Cecil was at x=14 y=5 in the Lua console — but the server saw
zero.
Added a debug log dumping the raw HTTP body. The POST arrived URL-encoded:
payload=%40S%7Cmap%3A00%7Cx%3A14%7Cy%3A5%7C...
comm.httpPost wraps its body as application/x-www-form-urlencoded
with a payload= prefix. The GSL parser was splitting on | —
but every | was a %7C. Every state field invisible. Every
parser branch skipped. State dict empty. Screenshot path empty. Claude call skipped.
// the fix
Four Lines. Everything Changed.
Strip the payload= prefix. URL-decode. Pass to the parser. That's it.
// speedy.py
from urllib.parse import unquote_plus
if raw_body.startswith("payload="):
raw_gsl = unquote_plus(raw_body[8:])
Next run: pos=(14,5) story=173. Real coordinates. Real story flag. Screenshot
path parsed. Claude call #1 fired. Response returned. Cecil moved with intent
for the first time.
// first reasoning
The Model Saw the Throne Room
First [CLAUDE] reasoning: line in the log:
"Cecil is facing south near the top of the map with a door/entrance visible above.
Moving up toward the banner-flanked doorway and pressing A should trigger an interaction
or enter the throne room."
Banner-flanked. It saw the banners. It described the banners. A few calls later it
identified the King as "a figure to the north" and proposed walking up to talk. The
vision pipeline was unambiguously working. The agent now perceived the game, not just
the coordinates.
// still flailing
JSON Errors, No Cache, No Context
Vision worked. Reasoning was lucid. But three structural problems surfaced almost
immediately:
1. JSON parse failures. Claude kept wrapping the JSON in prose —
"Looking at the screenshot, Cecil is..." then the object. The strict parser rejected
every prose-wrapped response, which ate the round trip. 2. Caching never fired.
Every call showed cache_read=0 cache_write=0. The system prompt was under
1024 tokens — below Anthropic's minimum caching threshold. Paying full input price on
every request. 3. No walkthrough context. The walkthrough_context=""
parameter had been plumbed through but never populated. Claude was reasoning about
Baron Castle without ever being told about Baron Castle.
// full stack upgrade
Thinking, Caching, Research, RAG
One session, four layered fixes:
Extended thinking enabled on plan calls — Sonnet 4.6 gets 1500 tokens
of deliberation before emitting the JSON.
JSON extractor hardened — balanced-brace parser that finds the first
{ in the response, ignores string-literal quotes, and tolerates trailing
prose. Four test cases green.
System prompt beefed up to ~1450 tokens — past the caching threshold.
Now includes explicit FF4 opening context, coordinate conventions, and repeated
"JSON only, start with {" guards at both ends.
Walkthrough RAG wired into every plan call — planner.py
queries MAINWLKT.txt by story flag, pulls the matching section,
passes it to Claude as auxiliary context alongside the screenshot.
Session playbook built once at boot — a new research.py
module calls Claude with web search enabled (capped at 3 searches), loads the walkthrough,
and compiles a structured playbook. The output gets injected into the cached system
prompt so every subsequent plan benefits from deep research at cache-read prices.
// boot sequence
[BOOT] WalkthroughRAG loaded from MAINWLKT.txt
[BOOT] Running research phase (10-30s on first launch)...
[BOOT] Playbook compiled: 1247 chars
Claude vision: ENABLED
Walkthrough RAG: ENABLED
Session playbook: loaded (1247 chars)
// the honest result
Cecil Has Eyes. Cecil Skips the Door.
All five information channels now flow into Claude: screenshot, structured state,
action history, walkthrough RAG, cached session playbook with FF4-specific guidance.
Extended thinking enabled. Caching confirmed working on call #2. The model confidently
describes what it sees, reasons about spatial constraints, reads dialog flags correctly.
And Cecil still walks straight out the south gate, past Rosa's
chambers entirely. The walkthrough says "go west to find Rosa before leaving." The
playbook says it. The RAG returns the section. Claude gets all of it — then emits
a plan to exit the castle because exiting the castle is the locally efficient move.
This is a better failure mode than "stuck in a wall." But it's not Matt Damon smart.
The model has no concept of gated sub-objectives. Each plan is
fresh-screen + fresh-prompt → Claude forgets everything that happened 30 seconds ago.
"Rosa first, then Mist" is a thing a human remembers across twenty minutes of
gameplay. Claude is a goldfish with a screenshot.
// next
Objective Stack or Die
The next layer isn't bigger prompts or more retrieval. It's state the model
can't forget. A persistent objective stack that lives outside the prompt: a
current goal, a locked sub-goal, completion criteria Claude can only satisfy by
explicit emission of objective_complete: true. The planner refuses to
advance past stage 1 until stage 1 is marked complete.
Also queued: the human-in-the-loop takeover we designed but didn't
build. When the model spins on an objective for N replans, return ACT:HUMAN,
the Lua skips joypad.set, physical keyboard inputs pass through for 15
joypad-active ticks, then control returns to Claude with a fresh screenshot of
wherever the human parked Cecil. The model doesn't need to learn from the
intervention — it just needs the new screenshot as fresh context.
Cecil has eyes now. Next we teach him to remember what he's doing.