I'll give bounties to people who suggest reasonable improvements to the criteria.
https://www.twitch.tv/claudeplayspokemon
Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking
Will any large language model become a Pokémon Master by the end of 2025? To count, it must:
Complete a regular Pokemon game (any of the base games, such as Red, Gold, Sapphire, or Black) by getting all gym badges and beating the Elite 4 + rival.
Without assistance or steering mid-game. This means no help that targets something specific it's stuck on, as opposed to general improvements. Tweaks to the system midway through are fine as long as they are in the spirit of general improvements; that is, the LLM should be able to complete the game end to end afterwards without additional changes. This is meant to allow small tweaks to Claude Plays Pokemon without invalidating the run. That being said, if they become looser with hints and unblocking, it will not count.
With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit cheating, if that helps with the spirit of this market. Something roughly twice as bad would maybe start to not count.
Fine-tuning or reinforcement learning specific to Pokemon (or video games in general) is not allowed.
Any number of attempts are allowed, as in, the model can try an infinite number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.
RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.
See also: /Sketchy/will-claude-become-a-pokemon-master-ng2zSA9ync
Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:
Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.
Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.
Update 2025-05-02 (PST) (AI summary of creator comment): Based on a discussion about a specific LLM run (Gemini 2.5 Pro beating Pokémon Blue):
The creator agreed that this specific run does not count towards market resolution.
The reason cited was that substantial mid-game changes to the system's structure, such as introducing a separate "strategist" LLM specifically to solve boulder puzzles mid-run, were considered "significantly over the boundary" of allowed mid-game tweaks.
This type of intervention is considered disallowed mid-game assistance specifically targeting blockers, rather than a general system improvement permitted by the rules.
Update 2025-06-04 (PST) (AI summary of creator comment): In response to a question about a specific type of autonomous run with pre-existing scaffolding, the creator confirmed such a run could count and provided these details:
A run using pre-existing game-specific scaffolding can count, even if there is "quite a bit" of such scaffolding.
The primary condition is that this pre-run scaffolding must not egregiously bypass the need for the LLM to drive itself through the game.
The run must be autonomous regarding this pre-existing scaffolding (e.g., no mid-game changes to prompts or tooling, contrasting with previously disallowed mid-run additions of specialized systems).
Developer interventions are only permissible if the LLM becomes "hard-stuck due to a system limitation," implying general system fixes rather than specific game hints or unblocking for game-specific challenges.
Update 2025-06-04 (PST) (AI summary of creator comment): Regarding specialized sub-systems, such as a 'boulder puzzle solver' LLM:
Such a system is acceptable if baked in from the start of the run.
This is permissible as long as it falls under game-specific prompting or scaffolding and LLMs are still making the decisions.
The critical factor is that the system is pre-existing for the run, contrasting with previous rulings where adding such a specialized system mid-run was disallowed.
Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the criterion for minimal non-LLM programmatic assistance (where assistance "roughly twice as bad" as Claude's pathfinding might not count):
The creator considered a mapping system that provides the LLM with extensive details, such as all seen tiles, the current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.
While acknowledging this is significantly more advanced than Claude's pathfinding, the creator stated they are leaning towards this specific mapping system not being "twice as bad" as Claude's pathfinding, implying it may be acceptable under this rule.
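For readers unfamiliar with what a "calculated list of reachable tiles" involves, here is a minimal sketch of one common way to compute it: a breadth-first flood fill over a walkability grid. This is an illustration only. Neither the Claude nor the Gemini scaffold is open source, so the grid representation, the function name `reachable_tiles`, and the four-direction movement rule are all assumptions and not the actual harness code.

```python
from collections import deque


def reachable_tiles(walkable, start):
    """Return the set of (row, col) tiles reachable from `start`.

    `walkable` is a hypothetical 2D list of booleans (True = can stand on it).
    Movement is assumed to be up/down/left/right only; real harnesses would
    also need game-specific rules (ledges, warps, NPCs) that are omitted here.
    """
    rows, cols = len(walkable), len(walkable[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and walkable[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


# Example map: the False column is a wall, so the right side is cut off.
grid = [
    [True, True, False, True],
    [False, True, False, True],
    [True, True, False, True],
]
tiles = reachable_tiles(grid, (0, 0))
```

The point of the sketch is that this kind of information is produced by ordinary code rather than by the LLM, which is why it falls under the "non-LLM programmatic assistance" criterion.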
Update 2025-06-04 (PST) (AI summary of creator comment): The creator has confirmed that a specific, discussed LLM run setup (referred to as the 'current setup') is included for market resolution, despite being considered 'on the edge'.
This acceptable 'current setup' includes elements such as:
A specialized sub-LLM (e.g., for boulder puzzles) that is baked in from the start of the run.
An advanced mapping system that provides the LLM with extensive details like all seen tiles, current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.
The creator stated that any further assistance beyond this current configuration will be approached with skepticism.
Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the Gemini Plays Pokemon run on Twitch (https://www.twitch.tv/gemini_plays_pokemon):
The second run, if it finishes, will count for market resolution.
This is contingent on no additional assistance being provided to this run beyond its state at the time of the creator's comment.
This specific run is considered on the edge of acceptable criteria.
Any further assistance introduced to this run will be viewed with significant skepticism.
Just to be explicit about https://www.twitch.tv/gemini_plays_pokemon
Assuming no additional assistance, the second run of Gemini Plays Pokemon will count for the resolution of this market (if it finishes). However, it's on the edge and I will treat any additional assistance with a lot of skepticism. More discussion here.
Gemini 2.5 Pro beat Pokémon Blue earlier today. However, I don't think it should count, as too many substantial changes were made to the scaffolding throughout the run, including, for example, allowing the main Gemini to call a separate "strategist" Gemini with a prompt dedicated to solving boulder puzzles in Victory Road.
In general it was deliberately a pretty loose, experimental run. Later runs may count.
@JulianBradshaw agreed. I think it was significantly over the boundary of mid-game tweaks.
I’ll have to think about if the structure itself of map labelling is too much. Probably not.
@JulianBradshaw Would the run that has just started count? In it, the previously developed scaffolding won't be changed and the model will run autonomously.
From the description of the second run:
There are no changes to prompts or tooling, so this run serves as a clean test of all the improvements made during the first run. There won't be any developer interventions unless Gemini becomes hard-stuck due to a system limitation.
Yes, I think this would count. Although it's quite a bit of game-specific scaffolding, that was never disallowed and none of the structure I see is that egregiously bypassing the need to have the LLM drive itself.
@Sketchy this comment has a good summary of the scaffolding https://old.reddit.com/r/ClaudePlaysPokemon/comments/1kdjysi/gemini_beats_pokemon/mqblfeh/
In particular see
The other Gemini subagent is the Boulder Puzzle Solver, who is prompted with some pretty specific instructions on what kind of reasoning to use to solve the puzzles in Victory Road - examining the gates and switches to figure out which unsolved puzzles need to be done, and what sequence of pushes would accomplish that.
@jack yea the boulder puzzle solver is the worst part to me. But I didn’t disallow game-specific prompting or scaffolding, and it’s still LLMs making the decisions here. Since the boulder puzzle solver is baked in from the start of the run I think it’s acceptable.
It does seem like the whole benchmark is getting goodhart’ed pretty quickly but I’m trying to stick to my initial criteria.
@Sketchy The part of the criteria that seems relevant is
With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit cheating, if that helps with the spirit of this market. Something roughly twice as bad would maybe start to not count.
Which I personally thought this exceeded.
@jack The thing is, the boulder-solver is just an instance of Gemini with different prompting, not some kind of programmatic algorithm.
Ah, I guess I (mis)read that criterion as being about assistance in general, not just programmatic assistance. Ok, in that case maybe the thing that stands out most for programmatic assistance is
Every tile it has seen in the map is recorded and the seen layout of the current map is given to it at all times. This includes warps and their destinations, and objects. Also, the list of which tiles can actually be reached or not in the current game state is calculated and told to Gemini.
@JulianBradshaw right this is where my head is at too. If it was a programmatic boulder solver that would be different.
@jack yea agreed. Does that feel twice as bad as the Claude pathfinding? I’m leaning towards no, but I do see how a map of visited places and layout is much more than Claude’s pathfinding.
@Sketchy Wanted to say, as far as I know there's only been two changes to the scaffold since the start of this Gemini run (run #3):
when the navigator paths Gem into a moving NPC, it waits for the NPC to move instead of bonking
fixed a flaw in the harness where cut trees are considered non-traversable when HM01 has been deposited in the PC
According to GeminiPlaysPokemon, the first one is a QOL fix to not waste time, 2nd change is a bugfix, so they feel it's legitimate.
Citing Discord:

@Sketchy I think the main thing about Claude's pathfinding is that it only applies to the current screen, whereas Gemini's applies to the whole map (if I understand correctly)? And Claude had tons of problems seeing the map, especially around visitability and warps, e.g. constantly trying to get through "gate-like structures", which Gemini's scaffold bypasses.
@jack Ultimately we don't know all the details of either the Claude or Gemini scaffolds (or o3 for that matter, which also started a run recently). They're not open source. But Claude also gets some information about which tiles are reachable/walkable or not per the excalidraw diagram linked on the twitch channel.
Gemini's map recording is arguably an allowed knowledge file system, since it only gets filled out as Gemini explores and sees new map tiles.
@DarklyMade I lean towards thinking this run should count. I think the main caveat here is that Gemini's harness has gotten around a lot of the typical LLM vision problems by a.) using game RAM information to specify if tiles are warps or not; b.) presenting map info in text form.
Now, in both of these cases, Claude and o3 are also getting the same info, just with (imo) less refined harnesses, so it's really a question of whether this class of harness counts at all. And I think the original market implies yes, since it's talking about Anthropic's own measurements of Pokémon progress using this class of harness.
Re: class of harnesses, here's an exchange between ClaudePlaysPokemon and GeminiPlaysPokemon from the former's stream that's enlightening:
asdfugil: wow claude has the same issue here as gemini but it's probably just a game bug
asdfugil: an unreachable warp right here
Gemini_Plays_Pokemon: yes there's an unused warp here in the game data
Gemini_Plays_Pokemon: for whatever reason
ClaudePlaysPokemon: i love how you have also had to go through knowing WAY TOO MUCH about weird quirks of pokemon red
Gemini_Plays_Pokemon: Haha yes I learned a lot of random stuff. There's also one like this at a gatehouse and an invisible non-functional warp beside the Silph Co President too
ClaudePlaysPokemon: pokemon devs y
ClaudePlaysPokemon: the one that caused me the most pain was figuring out how there is a list of tiles you can't walk between that is handled differently than collisions
Gemini_Plays_Pokemon: are you talking about moving from elevated ground to regular ground?
ClaudePlaysPokemon: yep lol
Gemini_Plays_Pokemon: yeah that's fun, I had to add special handling for that
ClaudePlaysPokemon: same, and it took me an annoying amount of reading the disassembly to figure it out
@JulianBradshaw yep, very true, but I believe Claude's reachability info is simpler and Claude can't even blindly trust that info because things like gates/doors are marked as "not reachable".
So overall I can see arguments both ways on whether this is like "twice as much programmatic assistance"
@jack So can I. I'm going to try to be definitive and say with the current setup, the run is included. But it's on the edge and I will approach any further assistance with skepticism.
@Sketchy I don't know, possibly https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-now-better-than-claude-at-pokemon and https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG ?
Or https://www.twitch.tv/gemini_plays_pokemon/about ?
Possibly worth waiting until it gets closer to finishing though, as I expect more will be written and it will be easier to decide
@Lorenzo I don't intend to do a detailed writeup on Gemini beating the game, there isn't too much new to say. Here's my quick take on it: https://www.lesswrong.com/posts/ekF2EDwKyZJNuxBTb/julian-bradshaw-s-shortform?commentId=cHqfKsCWtn5T5H4Tr
@JulianBradshaw Thanks!
@Sketchy I found this summary interesting: https://old.reddit.com/r/ClaudePlaysPokemon/comments/1kdjysi/gemini_beats_pokemon/mqblfeh/