// dev journal — philknows.com

PhilKnows

Teaching an AI to play Final Fantasy IV on a real SNES — hardware, vision loops, controller emulation, and everything that broke along the way.

ACTIVE BUILD
WholeEnchilada v10
PLATFORM: SNES (1991 hardware)
GAME: FF4 (US "FF2")
// dev journal — active
This project is being built in iterations. Latest: WE10 — Baron Castle Mapping
Read the Journal

THE ORIGIN

The Question
Can an AI actually play a 1991 JRPG?
Not emulated. Not ROM-hacked. On real physical hardware — a Super Nintendo from 1991, a real cartridge, a real CRT signal. The idea: use Claude's vision capabilities to watch the screen, reason about the game state, and send controller inputs through hardware that emulates a real controller. No save states. No cheat codes. No APIs into the game's memory. Just pixels and buttons.
The Constraint
Real Hardware Only
The SNES outputs via composite or RGB. An Elgato capture card digitizes the signal into frames Python can read. An Arduino Micro emulates an actual SNES controller — the game has no idea it's talking to anything other than a first-party Nintendo pad. Every button press is a real electrical signal on real hardware.

THE HARDWARE

Signal Chain
SNES → OSSC → Elgato 4K → PC → Claude
The SNES outputs a native NTSC RGB signal via a Capacitor RGB NTSC cable. That feeds into an OSSC 1.8 (Kaico edition), which line-multiplies the 240p signal into clean HDMI. An Elgato 4K60 capture card ingests the stream into the Windows gaming PC running Python. OpenCV reads frames directly from the capture device — no screen recording, no emulator. Raw pixel-perfect NTSC output, handed to Claude Vision.
Controller Layer
Arduino Micro as SNES Pad
An Arduino Micro wires directly into the SNES controller port. It speaks the SNES latch/clock/data protocol natively. The PC sends button commands over USB serial; the Arduino converts them into timed electrical pulses the console reads as controller input. Early bugs: CH340 driver conflicts on Windows, wire color identification on the SNES connector, and interrupt-based timing to hit the SNES's 16ms polling window reliably.
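The real shifting happens in the Arduino firmware, but the bit order it has to honor is the documented SNES latch/clock/data protocol: after a latch pulse the console clocks out 16 bits, active low, in a fixed button order. A minimal Python sketch of that report packing (helper names here are ours, not the project's):

```python
# Pack a set of held buttons into the 16-bit report an SNES controller
# shifts out after a latch pulse. Active low: 0 = pressed, 1 = released.
# Bits 12-15 are unused and always read high.

# Button order the console clocks out, bit 0 first.
SNES_ORDER = ["B", "Y", "Select", "Start",
              "Up", "Down", "Left", "Right",
              "A", "X", "L", "R"]

def pack_report(pressed: set[str]) -> int:
    """Build the 16-bit shift-register value for a set of held buttons."""
    report = 0xFFFF  # all bits high = nothing pressed
    for i, name in enumerate(SNES_ORDER):
        if name in pressed:
            report &= ~(1 << i)  # pull that button's bit low
    return report & 0xFFFF

# e.g. holding only B clears bit 0: pack_report({"B"}) == 0xFFFE
```

Miss the latch-to-clock timing and the console samples the wrong bits — which is exactly the garbage-input failure mode described below.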
Debugging Milestones
What Broke, What We Learned
Getting reliable button presses required moving from polling to interrupt-driven timing. The SNES controller protocol is strict — miss the latch window and the console reads garbage. We also had to identify SNES controller wire pinouts by probing with a multimeter since vintage cables have no consistent color standard.
SUPER NINTENDO
1991 · NTSC
OSSC 1.8 (KAICO)
Line Multiplier
ELGATO 4K60
Capture Card
WINDOWS GAMING PC
Python · OpenCV
ARDUINO MICRO
Controller Emulator
SONY TRINITRON
13" CRT · Display Only

SIGNAL CHAIN

Every pixel Claude sees traveled this path before it became a decision.

Main chain:
SNES (1991 hardware) → 240p NTSC RGB → OSSC 1.8 (Kaico, line multiplier) → HDMI → Elgato 4K60 → USB3 → Windows gaming PC (Python / OpenCV · WholeEnchilada.py) → base64 frames → Claude Vision (Sonnet) → JSON → Arduino Micro (USB serial) → LATCH / CLOCK / DATA → SNES controller port — read as controller input.

Display branch:
HDMI→composite downscaler (HDMI in · RCA out) → composite → Sony Trinitron 13" CRT (KV-13FM12) — display only.

THE ARCHITECTURE

Vision Loop
Frame Capture → Claude Vision → Action
The core loop runs on the Windows gaming PC: capture a frame from the Elgato via OpenCV, encode it as base64, send it to Claude Vision with a system prompt describing the game and expected JSON output format, parse the response, and fire the corresponding button sequence to the Arduino over USB serial. Target loop time: under 3 seconds per decision cycle.
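That loop can be sketched as a single decision cycle. The capture, API, and serial calls below (`capture_frame`, `ask_claude`, `send_buttons`) are stand-ins for the real OpenCV, Anthropic, and pyserial calls — only the encode/parse/validate steps are shown concretely, and the JSON field names are illustrative:

```python
# One decision cycle: frame -> base64 -> model -> JSON -> button sequence.
# Collaborators are injected so the pure steps stay testable off-hardware.
import base64
import json

VALID_BUTTONS = {"Up", "Down", "Left", "Right", "A", "B", "X", "Y",
                 "L", "R", "Start", "Select"}

def encode_frame(jpeg_bytes: bytes) -> str:
    """Base64-encode a JPEG frame for the vision API payload."""
    return base64.b64encode(jpeg_bytes).decode("ascii")

def parse_action(response_text: str) -> list[str]:
    """Parse the model's JSON reply into a validated button sequence."""
    data = json.loads(response_text)
    buttons = data.get("buttons", [])
    bad = [b for b in buttons if b not in VALID_BUTTONS]
    if bad:
        raise ValueError(f"model hallucinated buttons: {bad}")
    return buttons

def decision_cycle(capture_frame, ask_claude, send_buttons) -> list[str]:
    """Run one pass of the loop and return the buttons that were fired."""
    b64 = encode_frame(capture_frame())   # OpenCV grab + JPEG in the real build
    reply = ask_claude(b64)               # vision prompt + frame -> JSON text
    buttons = parse_action(reply)
    send_buttons(buttons)                 # USB serial write to the Arduino
    return buttons
```

Rejecting unknown buttons at the parse step is what keeps a hallucinated action from ever reaching the serial line.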
RAG Layer
Walkthrough Dictionary + HTML Corpus
To give the AI actual game knowledge, we built a RAG system backed by a structured walkthrough dictionary. The AI identifies its current location from the frame, queries the dictionary for what it should do there, and uses that context to inform its action decision. In theory. In practice, this layer has been the hardest to get right.
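A toy version of that lookup, assuming the dictionary is keyed by normalized location name (the entries and key scheme here are illustrative, not the project's real data):

```python
# Walkthrough-dictionary retrieval: the model names a location from the
# frame; we normalize it and fetch the objective for that spot.
WALKTHROUGH = {
    "baron_castle_throne_room": "Speak to the King, then exit south.",
    "mist_cave_entrance": "Head north; prepare for the Mist Dragon.",
}

def lookup(location: str) -> str:
    """Return the walkthrough hint for a location, with a safe fallback."""
    key = location.strip().lower().replace(" ", "_")
    return WALKTHROUGH.get(key, "Unknown location -- explore and report exits.")
```

The fallback string matters as much as the hits: when localization fails, the model still needs an instruction rather than an empty context.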
RAG Corpus
Scraping HTML for Game Knowledge
Rather than hand-crafting all game knowledge, we've been scraping HTML from FF4 walkthrough and wiki sources to build a richer RAG corpus. Parse the pages, chunk the content, embed it — give the model something to actually retrieve against when it needs to know what's in the next room or how a boss mechanic works.
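The chunking step in the middle of that pipeline can be sketched with plain string slicing — overlapping windows so retrieval can pull room- or boss-sized passages without cutting a mechanic description in half. The sizes here are assumptions, not tuned values:

```python
# Split cleaned walkthrough text into overlapping character chunks
# ahead of embedding. Overlap keeps sentences that straddle a chunk
# boundary retrievable from at least one chunk.
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size character chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```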
Model Experiments
Llama 13B & 7B — Local Inference Trials
We've been testing local models alongside Claude — specifically Llama 13B and 7B — to evaluate whether a smaller on-device model could handle parts of the decision loop without hitting the API. Results so far: context reasoning degrades noticeably at 7B, and 13B gets closer but still struggles with structured JSON output consistency under game-state complexity.
Safety Systems
Stuck Detection & State Persistence
Cecil has a talent for getting stuck in corners and pressing the same direction forever. We built stuck detection — if the same action fires N times without a screen change, break the loop and reorient. Persistent game state tracks what's been tried, where Cecil has been, and what the current objective is across decision cycles.
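A minimal detector in the spirit of the description above: hash each frame, and if the same (action, frame) pair repeats N times in a row, signal the loop to break out and reorient. The threshold and the choice of a full-frame hash are illustrative:

```python
# Stuck detection: N identical (action, frame) observations in a row
# means Cecil is walking into a wall. Hashing the raw frame bytes is
# the cheapest "did the screen change?" check.
import hashlib

class StuckDetector:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.last = None      # (action, frame_hash) from the previous cycle
        self.repeats = 0

    def observe(self, action: str, frame_bytes: bytes) -> bool:
        """Return True when the loop should break out and reorient."""
        key = (action, hashlib.sha1(frame_bytes).hexdigest())
        if key == self.last:
            self.repeats += 1
        else:
            self.last, self.repeats = key, 1
        return self.repeats >= self.threshold
```

An exact hash is brittle against NTSC capture noise; a perceptual difference over a downscaled frame would be the more forgiving variant.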

VERSIONS

v1 — v3
Proof of concept. Basic frame capture working. Arduino sending button presses. First time Cecil moved on screen under AI control. No game knowledge, pure reaction.
v4 — v5
Structured JSON responses. Moved from freeform AI text to a strict JSON action schema. Cleaner parsing, fewer hallucinated button sequences. Still no walkthrough context.
v6
WholeEnchilada6 — RAG introduced. Walkthrough dictionary added. First attempt at location-aware decision making. Stuck detection added. State persistence across cycles.
// the name: phil likes enchiladas. a lot. "the whole enchilada" also happens to be an idiom for stuffing an absurd amount of logic into one script and hitting run. it felt appropriate. nobody is complaining.
v7 — v8
Stability and timing improvements. Interrupt-based Arduino timing. Loop reliability improvements. Vision prompt tuning. Context window management for long sessions.
v9 — current
WholeEnchilada9. Most stable build to date. Vision loop and RAG functional but Cecil still lacks fundamental game literacy — no internal model of exploration, movement, or interaction basics. This is the rewrite target.
Parallel track
Local model experiments. Testing Llama 13B and 7B as alternatives to API calls. HTML scraping pipeline built to feed a richer RAG corpus from walkthrough and wiki sources. Neither local model matches Claude's structured output reliability yet.

THE STACK

// HARDWARE
Super Nintendo (1991)
The actual computer running the game
Capacitor RGB Cable (NTSC)
Native RGB signal out of SNES MultiAV
OSSC 1.8 (Kaico)
240p line multiplier → clean HDMI
Elgato 4K60
HDMI capture → host device input
Windows Gaming PC
Orchestration host — Python, OpenCV, serial
Arduino Micro
SNES controller emulation over serial
Sony Trinitron 13"
HDMI→composite downscaler · display only
// SOFTWARE
Python
Orchestration, vision loop, RAG, serial comms
OpenCV
Frame reading & preprocessing
HTML Scraper
FF4 walkthrough/wiki corpus for RAG pipeline
Walkthrough Dict (RAG)
Location-aware game knowledge retrieval
// AI / MODELS
Claude Vision (Sonnet)
Primary brain — frame interpretation & action decisions
Llama 13B
Local inference experiment — context reasoning
Llama 7B
Local inference experiment — speed vs. quality
// THE GAME
FF4 / FF2 US (SNES)
The game. Cecil's eternal burden.

WHERE WE ARE NOW

WholeEnchilada v9 is stable. The hardware pipeline is solid — frames are being captured, the Arduino is reliable, serial comms don't drop. The problem is upstairs.

Cecil doesn't move with purpose. The vision loop fires, the RAG queries, the JSON comes back — but Cecil has no internal model of what it means to explore a room. He doesn't know that rooms have exits, that you walk in directions to find them, that you press A to interact. The AI knows the walkthrough but doesn't know how to be a player.

Next up: a ground-up rewrite of the vision prompt and action schema with explicit game-literacy reasoning baked in. Teaching Cecil the basics before asking him to clear the Mist Cave.

Hardware ✓ · Serial Comms ✓ · Vision Loop ✓ · RAG — Needs Rewrite · Game Literacy — Missing · Llama 7B/13B — Evaluating · HTML Corpus — In Progress · v10 — In Progress

WHERE WE ARE GOING

Mist Dragon — Final Fantasy IV SNES