// philknows.com  /  dev journal

DEVJOURNAL

A running log of sessions, decisions, and breakthroughs building the SNES AI agent — one iteration at a time.

Baron Castle Mapping & WE10

// context
Where We Left Off
WE9 was the last stable agent before this session — a Python loop that captured SNES frames via OpenCV, sent them to a local Ollama LLaVA 13B model, and translated the response into button presses via Arduino Micro. The core problem: Cecil wandered without purpose. The agent had no spatial memory, no sense of where it was, and no understanding of what the game was asking it to do between frames. This session was about fixing that at the architecture level.
// artifacts built
Four Files, Ground Up
The session produced four new artifacts — a complete replacement of WE9, not a patch on top of it.
WholeEnchilada10.py
Main agent loop. Replaced Ollama with Claude Vision API. Full context injection per location.
game_literacy.json
10-location map of Baron Castle. Exits, NPCs, triggers, interactables — built from live narration.
vision_heuristics.md
Visual signatures for game states. Dialog box geometry corrected from prior design.
decision_tree.md
Human decision logic encoded as priority order: dialog → battle → trigger → NPC → explore.
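The priority order reads naturally as a first-match dispatch. A minimal Python sketch, assuming a hypothetical `GameFlags` record; the field names and mode labels are illustrative and not taken from WholeEnchilada10.py:

```python
from dataclasses import dataclass


@dataclass
class GameFlags:
    """Hypothetical per-frame observations; every field name is an assumption."""
    dialog_open: bool = False
    in_battle: bool = False
    trigger_nearby: bool = False
    npc_nearby: bool = False


# Priority order from decision_tree.md: dialog, battle, trigger, NPC, explore.
PRIORITY = [
    ("dialog", lambda f: f.dialog_open),
    ("battle", lambda f: f.in_battle),
    ("trigger", lambda f: f.trigger_nearby),
    ("npc", lambda f: f.npc_nearby),
]


def choose_mode(flags: GameFlags) -> str:
    """Return the first mode whose condition holds, falling back to explore."""
    for mode, condition in PRIORITY:
        if condition(flags):
            return mode
    return "explore"
```

Keeping the order in one list means a reshuffle of priorities is a one-line change rather than a rewrite of nested conditionals.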
// architecture
Key Changes in WE10
Claude Vision API replaces LLaVA 13B. The local model produced unreliable structured JSON — WE10 drops Ollama entirely and routes vision calls directly to claude-opus-4-6. Reliability for the action schema is non-negotiable.

OpenCV pre-filter for dialog detection. Before making any API call, the Python loop scans the top third of the screen for a white-bordered, dark-blue-filled rectangle. If found, it fires an A press immediately — no Claude call needed. Saves latency on the most common game interaction.

Location-aware context injection. game_literacy.json is loaded at startup. Every Claude prompt now includes the exits, NPCs, interactables, and story triggers for the agent's current detected location. The model is no longer flying blind.
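A rough sketch of what per-location injection could look like. The JSON keys and the sample entry below are placeholders; the real game_literacy.json schema is not shown in this journal, so treat the shape as an assumption:

```python
import json

# Placeholder excerpt with made-up values; not the real game_literacy.json.
SAMPLE_LITERACY = json.loads("""
{
  "throne_room": {
    "exits": ["example exit"],
    "npcs": ["example npc"],
    "interactables": [],
    "triggers": []
  }
}
""")


def build_location_context(literacy: dict, location_id: str) -> str:
    """Render one location's entry as plain text to prepend to the prompt."""
    entry = literacy.get(location_id)
    if entry is None:
        return f"Location: {location_id} (no map data)"
    lines = [f"Location: {location_id}"]
    for key in ("exits", "npcs", "interactables", "triggers"):
        values = entry.get(key) or ["none"]
        lines.append(f"{key}: {', '.join(values)}")
    return "\n".join(lines)
```

Returning an explicit "no map data" line for unknown locations keeps the prompt honest when room detection lands somewhere unmapped.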

Accordion summarization. Every 30 actions, the older portion of the action log compresses into a bullet summary. The full raw log would blow the context window — this keeps recent history sharp while preserving older narrative.
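One way the accordion could work, as a sketch. The 30-action interval is from the paragraph above; `KEEP_RECENT` and the placeholder summarizer are assumptions (the real agent presumably produces a richer summary):

```python
SUMMARIZE_EVERY = 30  # from the journal: compress every 30 actions
KEEP_RECENT = 10      # assumption: raw entries left uncompressed


def naive_summary(entries):
    """Placeholder summarizer: one bullet noting what was folded away."""
    return f"- {len(entries)} earlier actions, last: {entries[-1]}"


class AccordionLog:
    def __init__(self, summarize=naive_summary):
        self.summaries = []  # compressed bullets, oldest first
        self.recent = []     # raw entries, oldest first
        self.count = 0
        self._summarize = summarize

    def add(self, action):
        self.recent.append(action)
        self.count += 1
        if self.count % SUMMARIZE_EVERY == 0:
            # Fold everything except the newest KEEP_RECENT entries.
            older = self.recent[:-KEEP_RECENT]
            self.recent = self.recent[-KEEP_RECENT:]
            if older:
                self.summaries.append(self._summarize(older))

    def render(self):
        """Text for the prompt: summaries first, then the raw recent tail."""
        return "\n".join(self.summaries + self.recent)
```

The prompt cost then grows by one bullet per 30 actions instead of 30 raw lines.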

Blocked path memory. Confirmed dead ends are logged per location. The agent won't retry a wall it already found.
Claude Vision API · OpenCV · game_literacy.json · Accordion Summarization · Blocked Path Memory · NPC Hint Re-injection
// live play discoveries
What the Game Taught Us
Baron Castle was narrated in real time to build the location database. These are the findings that changed how the agent is designed:
Dialog boxes are upper-center, not bottom. The prior design assumed bottom-of-screen placement (JRPG convention). FF4 puts them in the top third. The OpenCV pre-filter was rewritten around this.
Room labels flash on entry. A small text box appears briefly when entering a room. This is the most reliable room identity signal — faster than parsing the background tileset.
Wall switches break tile pattern. Interactive switches are visually out of alignment with surrounding tiles. Press A when the pattern breaks.
Cecil can disappear behind scenery. If the map is still scrolling, he's still moving — keep pressing the direction. Don't mistake occlusion for a wall.
World map: explore all four directions first. Don't commit to a path on first exit. Narrate what's visible in each direction before choosing.
Some NPCs reveal exits after dialog. Talking to certain characters moves them aside, unlocking a previously blocked waypoint. NPC interaction is a navigation tool, not just flavor.
// game_literacy.json
Baron Castle — 10 Locations Mapped
The location database was built entirely from live gameplay narration this session. Each entry encodes exits (with visual descriptors), NPCs, interactables, story triggers, and navigation notes.
Throne Room
1F — Section A
1F — Section B
Baron Exterior East
Baron Exterior West
Gate Area
Cid's Area
NPC Room
Cecil's Bedroom
World Map
// current status
Ready to Run

WE10 is written and waiting. The agent code is complete, Baron Castle interior is fully mapped in game_literacy.json, and all four supporting artifacts are done. The Claude Vision API replaces LLaVA entirely. The OpenCV dialog pre-filter is in place.

Next session: narrate world map exploration, add waypoints and exit descriptors to game_literacy.json, then run WE10 for the first time on real hardware.

// hardware chain
Signal Path — Unchanged
SNES → OSSC 1.8 → Elgato 4K → Python / OpenCV → Claude Vision API → Arduino Micro → SNES Controller Port

First Live Run & A* Pathfinding

// context
It Actually Ran
WE10 hit real hardware for the first time this session. The goal was simple: get the agent running end-to-end, observe what broke, fix it, and add A* pathfinding on top. Three bugs surfaced immediately — all diagnosed and patched mid-session. By the end, Cecil was navigating Baron Castle autonomously on a real SNES.
// bugs fixed
Three Fixes, One Session
Title screen not advancing. The stuck override was replacing A with a directional after just 2 frozen frames — but the title screen is static by design, so no_change_count hit 2 immediately and cancelled every A press before it could register. Fix: raised the override threshold from ≥ 2 to ≥ 5.
A and X buttons swapped in Arduino firmware. The in-game menu kept opening when the agent tried to interact with NPCs — the classic tell for a button mismatch. Arduino had pins 8 and 9 mapped in the wrong order. Fixed directly in the .ino firmware, then added a software safety layer: BUTTON_REMAP = {"A": "X", "X": "A"} in the Python press() function so Claude's logical button names stay correct regardless of firmware quirks.
START leaking through. Claude would occasionally return START despite the prompt explicitly forbidding it. Fix: removed START from VALID_BUTTONS entirely — physically impossible to send now.
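The first and third fixes can be sketched in a few lines. The threshold of 5 and the absence of START are from the notes above; the exact contents of `VALID_BUTTONS` and the `FALLBACK_DIRECTION` choice are assumptions for illustration:

```python
STUCK_THRESHOLD = 5  # raised from 2, per the fix above

# START is deliberately absent, so it can never reach the Arduino.
# The rest of this set is an assumption about what the agent allows.
VALID_BUTTONS = {"A", "B", "X", "Y", "L", "R", "SELECT",
                 "UP", "DOWN", "LEFT", "RIGHT"}

FALLBACK_DIRECTION = "DOWN"  # assumption: which directional the override picks


def sanitize(button):
    """Return the normalized button if allowed, else None (press nothing)."""
    name = button.strip().upper()
    return name if name in VALID_BUTTONS else None


def apply_stuck_override(button, no_change_count):
    """Swap A for a directional only after enough unchanged frames."""
    if button == "A" and no_change_count >= STUCK_THRESHOLD:
        return FALLBACK_DIRECTION
    return button
```

Filtering at the allowlist rather than in the prompt means a stray START from the model is dropped no matter how it was phrased.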
Stuck Threshold ≥5 · Arduino Pin Remap · BUTTON_REMAP Safety Layer · START Blocked
// new system
A* Pathfinding Added
Cecil's biggest failure mode was walking directly into walls and retrying endlessly. A* pathfinding was added to WholeEnchilada10.py to give the agent spatial awareness at the tile level — so when Claude picks a direction, the path is actually clear before the button gets pressed.
detect_game_region()
Scans frame for non-black bounding box at startup. Falls back to hardcoded OSSC 4x values if detection fails. Confirmed: 1920×1080, 64px per SNES tile, 16×14 grid.
build_walkability_grid()
Per-tile brightness >60 and Laplacian edge variance <500 = walkable. Dark tiles with high edge density = walls. Cecil's tile (col 8, row 7) always forced walkable.
record_collision() / apply_collision_map()
Screen doesn't change after a directional press → tile recorded as blocked. Persists to game_state.json per location_id across sessions.
astar()
Cecil is always grid center (8,7) — FF4's camera follows him. Claude picks a direction; A* finds a clear path 4 tiles that way, or reroutes to nearest walkable. Logs [PATHFIND] on reroutes.
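For reference, a stripped-down version of the planning step described above. This is a sketch, not the astar() in WholeEnchilada10.py: it assumes a 4-connected grid, a Manhattan heuristic, and a `blocked` set of (col, row) tiles such as the collision map might supply. The `plan` helper and its reroute rule (closest reachable tile to the intended target) are one possible reading of "reroutes to nearest walkable":

```python
import heapq

COLS, ROWS = 16, 14
CECIL = (8, 7)  # (col, row); the camera keeps Cecil at grid center
STEP = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def astar(blocked, start, goal):
    """4-connected A* with a Manhattan heuristic; returns a tile list or None."""
    def h(tile):
        return abs(tile[0] - goal[0]) + abs(tile[1] - goal[1])

    frontier = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, current = heapq.heappop(frontier)
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        for dx, dy in STEP.values():
            nxt = (current[0] + dx, current[1] + dy)
            in_bounds = 0 <= nxt[0] < COLS and 0 <= nxt[1] < ROWS
            if not in_bounds or nxt in blocked:
                continue
            if g + 1 < cost.get(nxt, float("inf")):
                cost[nxt] = g + 1
                came_from[nxt] = current
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None


def plan(blocked, direction, distance=4):
    """Aim `distance` tiles the chosen way; otherwise take the closest reachable tile."""
    dx, dy = STEP[direction]
    target = (CECIL[0] + dx * distance, CECIL[1] + dy * distance)
    candidates = [(c, r) for c in range(COLS) for r in range(ROWS)
                  if (c, r) not in blocked and (c, r) != CECIL]
    candidates.sort(key=lambda t: abs(t[0] - target[0]) + abs(t[1] - target[1]))
    for goal in candidates:
        path = astar(blocked, CECIL, goal)
        if path:
            return path
    return None
```

With an empty `blocked` set, `plan(set(), "UP")` walks straight from (8, 7) to (8, 3). Trying candidates in order of distance to the target is brute force, but on a 224-tile grid it stays cheap.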
// tooling
Live Debug Visualization
An OpenCV overlay window — "WE10 Pathfinding Debug" — renders on top of the live game frame every tick. Color key: green = walkable, red = blocked, yellow = Cecil, blue = A* planned path. Grid lines show tile boundaries. Not yet observed in a live run — that's the next session goal.
// confirmed from live run
What Actually Worked
Agent passed the title screen — A confirmed correct after the threshold fix.
Room label detection working — location context loading correctly from game_literacy.json.
Agent navigated freely inside Baron Castle and attempted NPC interaction.
A/X button confusion identified and fixed mid-session without a full restart.
// current status
Running. Pathfinding Untested.

WE10 runs end-to-end on real hardware. Cecil moves, the vision loop is live, room context loads correctly, and the three launch bugs are patched. Pathfinding code is written but the debug visualization hasn't been observed in a live run yet.

Next session: run with the debug window open, tune walkability thresholds for Baron Castle, verify A* is correctly reading walls vs floors, and watch whether collision learning actually reduces wall-bumping over time.

// hardware chain
Signal Path — Confirmed Specs
SNES → OSSC Kaico 1.8 (4x) → Elgato 4K60 (1920×1080) → Python / OpenCV → Claude Vision API → Arduino Micro (COM5) → SNES Controller Port

The Teacher Awakens

// pivot
A New Architecture
WE10 grew a second brain today. Instead of relying solely on Claude Vision API to figure out FFIV from scratch on real hardware, we introduced a parallel Speedy Teacher Model — a reinforcement learning agent running on emulated hardware that teaches the Vision student how to play. The teacher has ground truth. The student has eyes. Together they close the gap.
// teacher / student
Two Models, One Game
Speedy (Teacher) runs locally on the gaming PC via BizHawk emulator. It reads raw SNES WRAM directly — HP, position, map ID, story flags, terrain adjacency — learns optimal play through self-play at machine speed, and transmits knowledge to Vision via a custom intermediate language called GSL.
Vision (Student) continues running on real 1991 SNES hardware via the existing pipeline. It receives enriched context from Speedy before each decision. Vision still owns screen interpretation. Speedy owns game knowledge.
BizHawk WRAM → GSL Message → SpeedyNet (CUDA) → ACT: Response → BizHawk Joypad
// custom language
GSL — Game State Language
Not built for humans — built for AI. Token-efficient, semantically dense, zero ambiguity. Every 30 frames BizHawk pushes a 4-line GSL block over TCP to Python. Python responds with a single action string. BizHawk executes it.
@S — State
Map ID, plane, overworld flag, X/Y position, facing, vehicle, terrain type, all 4 adjacent tiles.
#N — Navigation
Story progress, map transition flag, movement flag.
%P — Party
All active party members: name, level, HP/MP ratio, status flags.
!F — Flags
Battle, dialog, menu, cutscene — binary game state flags for fast decision branching.
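A sketch of how the Python side might parse such a block. It assumes each line is a sigil followed by pipe-separated `key:value` fields, which matches the `@S|map:00|x:14|y:5|` prefix that shows up in a later log; the `!F` field names in the sample are made up for illustration:

```python
SECTIONS = {"@S": "state", "#N": "nav", "%P": "party", "!F": "flags"}


def parse_gsl(block):
    """Split a GSL block into {section: {key: value}}; unknown lines are skipped."""
    parsed = {}
    for line in block.splitlines():
        tag, *fields = line.strip().split("|")
        name = SECTIONS.get(tag)
        if name is None:
            continue
        pairs = {}
        for field in fields:
            key, sep, value = field.partition(":")
            if sep:  # ignore fragments without a key:value shape
                pairs[key] = value
        parsed[name] = pairs
    return parsed


# The @S fields mirror the logged payload; the !F fields are hypothetical.
sample = "@S|map:00|x:14|y:5\n!F|battle:0|dialog:1"
state = parse_gsl(sample)
```

Values stay as strings here; converting them to integers is left to whichever code consumes each section.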
// neural network
SpeedyNet — 52,493 Parameters
A custom policy-gradient network running on RTX 4070 Ti via PyTorch CUDA. Lean but not thin — enough capacity to learn navigation, battle awareness, terrain reading, and story progress simultaneously.
Input (40) → Dense 256 → Dense 128 → Dense 64 → Output (12)
LeakyReLU · Policy + Value Heads · Experience Replay 50K · RTX 4070 Ti CUDA
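The 52,493 figure checks out if every layer is fully connected with biases and both heads branch off the 64-unit layer. The single-output value head is an assumption; the layer widths are the ones listed above:

```python
LAYERS = [40, 256, 128, 64]    # input and hidden widths from the journal
POLICY_OUT = 12                # action logits
VALUE_OUT = 1                  # assumption: scalar value head


def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out


trunk = sum(dense_params(a, b) for a, b in zip(LAYERS, LAYERS[1:]))
heads = dense_params(LAYERS[-1], POLICY_OUT) + dense_params(LAYERS[-1], VALUE_OUT)
total = trunk + heads
# trunk: 10,496 + 32,896 + 8,256 = 51,648
# heads: 780 + 65 = 845
# total: 52,493
```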
// reward model
Custom Reward Table
Designed from scratch — no borrowed assumptions. The reward signal reflects exactly what good FFIV play looks like.
+250: Enter a boss room (one-time)
+100: Enter any new map ID (one-time)
+10: Move toward unvisited exit
+1: Net forward movement
−1: Revisit an already-stepped tile
−5: Standing still for 3+ ticks
−50: Party wipe → auto reset to save state
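The table maps onto a lookup plus a little bookkeeping for the one-time bonuses. A sketch: `revisit_tile` and `standing_still` are names that appear in the training log, the other event names are assumptions, and keying one-time rewards by map ID is a guess at how "one-time" is tracked:

```python
REWARDS = {
    "boss_room": 250,
    "new_map": 100,
    "toward_unvisited_exit": 10,
    "forward_movement": 1,
    "revisit_tile": -1,
    "standing_still": -5,
    "party_wipe": -50,
}
ONE_TIME = {"boss_room", "new_map"}


class RewardTracker:
    def __init__(self):
        self._claimed = set()  # (event, map_id) pairs already paid out

    def score(self, events, map_id):
        """Sum the rewards for this tick, paying one-time bonuses only once."""
        total = 0
        for event in events:
            if event in ONE_TIME:
                key = (event, map_id)
                if key in self._claimed:
                    continue
                self._claimed.add(key)
            total += REWARDS[event]
        return total
```

A tick that both revisits a tile and stands still scores −1 + −5 = −6.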
// walkthrough rag
355 Sections Indexed
Just like a real player Googling "what do I do next in FFIV" when stuck, Speedy queries a RAG-indexed walkthrough when it detects reward stagnation. 355 sections parsed from General Tips through end-game — location names, party requirements, key items, boss strategies. Read-only. The walkthrough speaks for itself.
// first training run
It's Training Live
By end of session Speedy was training on the RTX 4070 Ti with Cecil walking around Baron Castle. The loss curve is active, epsilon is decaying, and the reward signal is firing correctly.

[train] step=760 | loss=82.15 | eps=0.684 | reward=−6 | revisit_tile + standing_still

Still deep in random exploration at epsilon 0.68. The reward loop is stuck on revisit and standing still penalties — Speedy hasn't discovered that moving to new tiles feels good yet. Hold frames tuned from 8 → 15 and standing still threshold tightened from 10 → 3 ticks to force more decisive movement per action.

// session screenshot — step 760, ε=0.684, baron castle
[screenshot: Speedy training session, step 760]
Step 760 · ε = 0.684 · 16 Tiles Visited · Weights Saving
// current status
Teacher is Awake. Learning to Walk.

Full pipeline is live: BizHawk → GSL → SpeedyNet (CUDA) → ACT response → joypad injection. The network trains every 10 steps, saves weights every 500, and resets automatically on party wipe.

Next session: wire in BizHawk save state control for autonomous resets, uncap emulator speed for self-play at machine speed, and begin distilling Speedy's learned knowledge into GSL teaching signals for the Vision student.

Only One Ear Was Listening

// wrong assumption
The Architecture Was Backwards
The session started with a broken assumption baked into the architecture. The original setup had Python acting as the TCP client and BizHawk as the server — which is backwards. BizHawk's Lua API exposes comm.socketServerSend() and comm.socketServerResponse(), meaning BizHawk is always the client. It connects out to whatever is listening. Getting BizHawk to open the connection at all required passing both URL flags at launch — get and post independently:
start "BizHawk" EmuHawk.exe --url-get=http://127.0.0.1:9001 --url-post=http://127.0.0.1:9001
// still broken
Empty Strings. Syntax Error. Session Stalls.
Even with the flags, comm.socketServerResponse() kept returning empty strings on every call. A protocol fix was queued in Lua — but a syntax error at line 154 in speedy_sender.lua blocked testing before it could be validated. The raw TCP socket approach was fighting us at every layer: timing races, two-port juggling, response polling that never quite landed. It was the wrong tool for what we were actually doing.
// the real fix
Throw Out the Sockets. Use HTTP.
The breakthrough was stepping back and reading the BizHawk Lua API properly. comm.httpPost(url, body) exists. It's synchronous. BizHawk POSTs the GSL payload, blocks until Python responds, gets the action back in the response body. No socket timing. No polling loop. No two-port juggling. Flask on port 9001 listening at /act — that's the whole server.
// speedy_sender.lua
local URL = "http://127.0.0.1:9001/act"
-- synchronous: BizHawk blocks until Python responds
local ok, response = pcall(comm.httpPost, URL, gsl)
// speedy.py
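from flask import Flask, request
app = Flask(__name__)
@app.route("/act", methods=["POST"])
def act():
    # GSL arrives in the POST body; the action string goes back in the response
    state = parse_gsl(request.get_data(as_text=True))
    return process_state(state), 200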
BizHawk doesn't need --socket_ip or --socket_port launch flags for this. Load the ROM, run the script. comm.httpPost handles the rest.
Flask is a lightweight Python web framework — it lets you spin up an HTTP server in a handful of lines, mapping URL routes to Python functions. Here it meant we could replace the entire custom socket protocol with a single decorated function: BizHawk POSTs to /act, Flask calls process_state(), returns the action string. No handshake logic, no buffer management, no read/write timing to get right. HTTP is a solved problem. Flask just exposes it. That's exactly why it helped — we stopped writing plumbing and got back to writing the actual model.
// it works
Cecil Is Moving. The Model Is Learning.
With speedy.py and speedy_sender.lua running against each other over HTTP, the loop closed. BizHawk reads WRAM every 30 frames, builds a GSL payload encoding position, terrain, party state, and story flags, POSTs it to Flask, gets an ACT:DIRECTION back, holds the button for 30 frames, repeats. SpeedyNet is training. Epsilon decays from 1.0 as the replay buffer fills. Loss is stabilizing. Cecil is on screen and moving under model control.

Architecture: HTTP POST/response — fully synchronous, no race conditions. ✓

Training: SpeedyNet live — replay buffer filling, loss stabilizing, epsilon decaying. ✓

Cecil: moving on screen under model control. ✓

Next session: wire in save state control for autonomous resets, uncap emulator speed for self-play at machine speed.

// proof
Cecil On Screen
This is what it looks like when it works. SpeedyNet pushing inputs, BizHawk executing them, Cecil moving. First live run under model control.

Cecil Has Eyes (But Skips the Door)

// the pivot
From Blind Navigation to Vision-Grounded Planning
The hardcoded MAP_EXIT_HINTS approach collapsed under its own assumptions. Cecil would spam DOWN into a wall for dozens of replans because the hint said south was the exit — even when the actual doorway was west. The architecture shifted entirely: replace the symbolic table with Claude Sonnet 4.6 vision calls. Every replan, BizHawk captures a screenshot, POSTs it alongside the GSL state to Python, which forwards the image to Claude. The model looks at the actual screen, reads the structured state, considers action history, and returns a JSON plan.
No training. No dataset. No behavioral cloning pipeline. Just eyeballs on the game.
// the rabbit hole
The Bug That Made Everything Look Broken
First live run: pos=(0,0) story=0 on every tick. Screenshots writing fine — confirmed 77 PNGs in C:\we10\screenshots\, each showing Cecil clearly in the throne room. But Python read every state as uninitialized WRAM. Hours of diagnostics followed: checking memory domains, testing read_u8 vs read_u16_le, running ram_diag.lua to enumerate BizHawk's exposed memory. All pointed to healthy reads. Cecil was at x=14 y=5 in the Lua console — but the server saw zero.
Added a debug log dumping the raw HTTP body. The POST arrived URL-encoded:
payload=%40S%7Cmap%3A00%7Cx%3A14%7Cy%3A5%7C...
comm.httpPost wraps its body as application/x-www-form-urlencoded with a payload= prefix. The GSL parser was splitting on | — but every | was a %7C. Every state field invisible. Every parser branch skipped. State dict empty. Screenshot path empty. Claude call skipped.
// the fix
Four Lines. Everything Changed.
Strip the payload= prefix. URL-decode. Pass to the parser. That's it.
// speedy.py
from urllib.parse import unquote_plus
if raw_body.startswith("payload="):
    raw_gsl = unquote_plus(raw_body[8:])
Next run: pos=(14,5) story=173. Real coordinates. Real story flag. Screenshot path parsed. Claude call #1 fired. Response returned. Cecil moved with intent for the first time.
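The same fix as a standalone function, run against the logged body (cut off before its elided tail). The `decode_body` name and the pass-through for bodies without the prefix are additions for illustration:

```python
from urllib.parse import unquote_plus

PREFIX = "payload="


def decode_body(raw_body):
    """Undo the form encoding on the POST body; pass plain bodies through."""
    if raw_body.startswith(PREFIX):
        return unquote_plus(raw_body[len(PREFIX):])
    return raw_body


# Logged body from the debug dump, truncated where the journal elides it.
raw = "payload=%40S%7Cmap%3A00%7Cx%3A14%7Cy%3A5"
fields = decode_body(raw).split("|")
```

After decoding, splitting on `|` yields `@S`, `map:00`, `x:14`, `y:5`, which is what the GSL parser expected all along.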
// first reasoning
The Model Saw the Throne Room
First [CLAUDE] reasoning: line in the log:
"Cecil is facing south near the top of the map with a door/entrance visible above. Moving up toward the banner-flanked doorway and pressing A should trigger an interaction or enter the throne room."
Banner-flanked. It saw the banners. It described the banners. A few calls later it identified the King as "a figure to the north" and proposed walking up to talk. The vision pipeline was unambiguously working. The agent now perceived the game, not just the coordinates.
// still flailing
JSON Errors, No Cache, No Context
Vision worked. Reasoning was lucid. But three structural problems surfaced almost immediately:
1. JSON parse failures. Claude kept wrapping the JSON in prose: "Looking at the screenshot, Cecil is..." then the object. The strict parser rejected every prose-wrapped response, which ate the round trip.
2. Caching never fired. Every call showed cache_read=0 cache_write=0. The system prompt was under 1024 tokens, below Anthropic's minimum caching threshold. Paying full input price on every request.
3. No walkthrough context. The walkthrough_context="" parameter had been plumbed through but never populated. Claude was reasoning about Baron Castle without ever being told about Baron Castle.
// full stack upgrade
Thinking, Caching, Research, RAG
One session, five layered fixes:
Extended thinking enabled on plan calls — Sonnet 4.6 gets 1500 tokens of deliberation before emitting the JSON.
JSON extractor hardened — balanced-brace parser that finds the first { in the response, ignores string-literal quotes, and tolerates trailing prose. Four test cases green.
System prompt beefed up to ~1450 tokens — past the caching threshold. Now includes explicit FF4 opening context, coordinate conventions, and repeated "JSON only, start with {" guards at both ends.
Walkthrough RAG wired into every plan call. planner.py queries MAINWLKT.txt by story flag, pulls the matching section, and passes it to Claude as auxiliary context alongside the screenshot.
Session playbook built once at boot — a new research.py module calls Claude with web search enabled (capped at 3 searches), loads the walkthrough, and compiles a structured playbook. The output gets injected into the cached system prompt so every subsequent plan benefits from deep research at cache-read prices.
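For the hardened extractor, a minimal sketch of the balanced-brace idea. It reads "ignores string-literal quotes" as "does not count braces that sit inside JSON strings", which is an interpretation; the function name and the sample keys are hypothetical:

```python
import json


def extract_json(text):
    """Return the first balanced {...} object in text as a dict, or None."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a string: only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None  # braces never balanced
```

Leading prose and trailing prose are both tolerated. A stray `{` in the prose before the object would still defeat it, which is one reason the prompt-side "start with {" guard remains worthwhile.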
// boot sequence
[BOOT] WalkthroughRAG loaded from MAINWLKT.txt
[BOOT] Running research phase (10-30s on first launch)...
[BOOT] Playbook compiled: 1247 chars
Claude vision: ENABLED
Walkthrough RAG: ENABLED
Session playbook: loaded (1247 chars)
// the honest result
Cecil Has Eyes. Cecil Skips the Door.
All five information channels now flow into Claude: screenshot, structured state, action history, walkthrough RAG, cached session playbook with FF4-specific guidance. Extended thinking enabled. Caching confirmed working on call #2. The model confidently describes what it sees, reasons about spatial constraints, reads dialog flags correctly.
And Cecil still walks straight out the south gate, past Rosa's chambers entirely. The walkthrough says "go west to find Rosa before leaving." The playbook says it. The RAG returns the section. Claude gets all of it — then emits a plan to exit the castle because exiting the castle is the locally efficient move.
This is a better failure mode than "stuck in a wall." But it's not Matt Damon smart. The model has no concept of gated sub-objectives. Each plan is fresh-screen + fresh-prompt → Claude forgets everything that happened 30 seconds ago. "Rosa first, then Mist" is a thing a human remembers across twenty minutes of gameplay. Claude is a goldfish with a screenshot.
// next
Objective Stack or Die
The next layer isn't bigger prompts or more retrieval. It's state the model can't forget. A persistent objective stack that lives outside the prompt: a current goal, a locked sub-goal, completion criteria Claude can only satisfy by explicit emission of objective_complete: true. The planner refuses to advance past stage 1 until stage 1 is marked complete.
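Nothing here is built yet, so this is only a sketch of the shape such a stack might take. The class names, the plan dict, and the stage names are hypothetical; the one rule taken from the paragraph above is that a stage advances only on an explicit `objective_complete: true`:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Objective:
    name: str
    done: bool = False


@dataclass
class ObjectiveStack:
    """Ordered goals kept outside the prompt; only explicit completion advances."""
    stages: List[Objective] = field(default_factory=list)

    def current(self) -> Optional[Objective]:
        """First unfinished stage, or None when everything is complete."""
        for stage in self.stages:
            if not stage.done:
                return stage
        return None

    def apply_plan(self, plan: dict) -> None:
        """Mark the current stage done only on objective_complete: true."""
        stage = self.current()
        if stage is not None and plan.get("objective_complete") is True:
            stage.done = True


stack = ObjectiveStack([Objective("find Rosa"), Objective("travel to Mist")])
stack.apply_plan({"action": "DOWN"})  # no completion flag, so nothing advances
```

Because the stack lives in Python rather than in the prompt, each replan can be handed `stack.current().name` regardless of what the model remembers.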
Also queued: the human-in-the-loop takeover we designed but didn't build. When the model spins on an objective for N replans, return ACT:HUMAN, the Lua skips joypad.set, physical keyboard inputs pass through for 15 joypad-active ticks, then control returns to Claude with a fresh screenshot of wherever the human parked Cecil. The model doesn't need to learn from the intervention — it just needs the new screenshot as fresh context.
Cecil has eyes now. Next we teach him to remember what he's doing.
// next entry
OBJECTIVE STACK & HUMAN HANDOFF