Will an LLM become a Pokémon Master by the end of 2025? [READ DESCRIPTION]

Ṁ15k

resolved Jun 10

Resolved

YES

I'll give bounties to people who suggest reasonable improvements to the criteria.

https://www.twitch.tv/claudeplayspokemon

Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking

Will any large language model become a Pokémon Master by the end of 2025? To count, it must:

- Complete a regular (being any of the base games like red/gold/sapphire/black/etc) Pokemon game, by getting all gym badges and beating the Elite 4 + rival.
- Without assistance or steering mid-game. This means help specific to something it's stuck on that's not general. Tweaks to the system midway through are fine as long as it's in the spirit of general improvements, as in, the LLM should be able to complete the game end to end afterwards without additional changes. This description is in the spirit of small tweaks being able to be made to Claude Plays Pokemon without negating the validity of the run. That being said, if they become more loose with hints and unblocking it, it will not count.
- With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit cheating, if that helps with the spirit of this market. Something roughly twice as bad would maybe start to not count.
- Fine-tuning or reinforcement learning specific to Pokemon (or video games in general) is not allowed.

Any number of attempts are allowed, as in, the model can try an infinite number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.

RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. Anthropic has an interesting approach with Claude.

Update 2025-03-03 (PST) (AI summary of creator comment): Midgame Assistance Updates:
- Allowed: Adjustments or tweaks made during the game are permitted, provided they are not directly hinting towards or addressing specific blockers.
- Disallowed: Any midgame adjustments that serve as direct hints to overcome explicit obstacles in the game.

Update 2025-05-02 (PST) (AI summary of creator comment): Based on a discussion about a specific LLM run (Gemini 2.5 Pro beating Pokémon Blue):
- The creator agreed that this specific run does not count towards market resolution.
- The reason cited was that substantial mid-game changes to the system's structure, such as introducing a separate "strategist" LLM specifically to solve boulder puzzles mid-run, were considered "significantly over the boundary" of allowed mid-game tweaks.
- This type of intervention is considered disallowed mid-game assistance specifically targeting blockers, rather than a general system improvement permitted by the rules.

Update 2025-06-04 (PST) (AI summary of creator comment): In response to a question about a specific type of autonomous run with pre-existing scaffolding, the creator confirmed such a run could count and provided these details:
- A run using pre-existing game-specific scaffolding can count, even if there is "quite a bit" of such scaffolding.
- The primary condition is that this pre-run scaffolding must not egregiously bypass the need for the LLM to drive itself through the game.
- The run must be autonomous regarding this pre-existing scaffolding (e.g., no mid-game changes to prompts or tooling, contrasting with previously disallowed mid-run additions of specialized systems).
- Developer interventions are only permissible if the LLM becomes "hard-stuck due to a system limitation," implying general system fixes rather than specific game hints or unblocking for game-specific challenges.

Update 2025-06-04 (PST) (AI summary of creator comment): Regarding specialized sub-systems, such as a 'boulder puzzle solver' LLM:
- Such a system is acceptable if baked in from the start of the run.
- This is permissible as long as it falls under game-specific prompting or scaffolding and LLMs are still making the decisions.
- The critical factor is that the system is pre-existing for the run, contrasting with previous rulings where adding such a specialized system mid-run was disallowed.

Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the criterion for minimal non-LLM programmatic assistance (where assistance "roughly twice as bad" as Claude's pathfinding might not count):
- The creator considered a mapping system that provides the LLM with extensive details, such as all seen tiles, the current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.
- While acknowledging this is significantly more advanced than Claude's pathfinding, the creator stated they are leaning towards this specific mapping system not being "twice as bad" as Claude's pathfinding, implying it may be acceptable under this rule.
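
As a rough illustration of what a "calculated list of reachable tiles" could involve, here is a minimal Python sketch. It is not the actual harness (which is not open source); the grid encoding, the tile symbols, and the `reachable_tiles` function are all assumptions made for this example. It runs a breadth-first search over tiles that have already been seen and marked walkable, so unexplored tiles are never reported as reachable.

```python
from collections import deque

# Hypothetical tile symbols; the real harness's map format is not public.
WALKABLE = "."
WALL = "#"
UNSEEN = "?"


def reachable_tiles(grid, start):
    """Return the set of (row, col) tiles reachable from start by 4-way moves.

    Only tiles that are already seen and walkable are expanded, so the
    result grows as the player explores more of the map.
    """
    rows, cols = len(grid), len(grid[0])
    visited = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and (nr, nc) not in visited and grid[nr][nc] == WALKABLE:
                visited.add((nr, nc))
                queue.append((nr, nc))
    return visited


# A tiny made-up map: the top-right tile is walkable, but the only way to it
# passes through a tile that has not been seen yet.
grid = [
    ".#.",
    ".#?",
    "...",
]
print(sorted(reachable_tiles(grid, (0, 0))))
```

In this toy map the top-right tile is left out of the result until the unseen tile below it has been explored, which is the sense in which such a map is "only filled out as the model explores".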

Update 2025-06-04 (PST) (AI summary of creator comment): The creator has confirmed that a specific, discussed LLM run setup (referred to as the 'current setup') is included for market resolution, despite being considered 'on the edge'.

This acceptable 'current setup' includes elements such as:

- A specialized sub-LLM (e.g., for boulder puzzles) that is baked in from the start of the run.
- An advanced mapping system that provides the LLM with extensive details like all seen tiles, current map layout (including warps, destinations, objects), and a calculated list of reachable tiles.

The creator stated that any further assistance beyond this current configuration will be approached with skepticism.

Update 2025-06-04 (PST) (AI summary of creator comment): Regarding the Gemini Plays Pokemon run on Twitch (https://www.twitch.tv/gemini_plays_pokemon):
- The second run, if it finishes, will count for market resolution.
- This is contingent on no additional assistance being provided to this run beyond its state at the time of the creator's comment.
- This specific run is considered on the edge of acceptable criteria.
- Any further assistance introduced to this run will be viewed with significant skepticism.

This question is managed and resolved by Manifold.

#Technology

#Technical AI Timelines

#Gaming

#LLMs

25 Comments

79 Holders

173 Trades

So Gemini beat Pokémon Blue. As I specified earlier, despite some additional help from the harness, I believe it is sufficiently driven by Gemini to resolve YES. I can see how the map assistance might be considered too much by some, but at the end of the day, this is a large language model making its own decisions that beat Pokémon.

It seems clear that Pokémon doesn’t make a particularly good benchmark. I think any value it had in comparing models against each other has already been lost to Goodharting. But I remember when it was deeply unclear whether a language model could complete a sudoku puzzle, and despite the subjectivity, it seems clear that these models have come a long way. Very cool!

Can we resolve please? I know Gemini got quite a lot of help, but you commented explicitly that the run counts and I bet based on that. No room for second thoughts after the fact.

@AhronMaline yup, just wanted to verify what Julian said about no additional scaffold changes.

Gemini has beaten Pokémon Blue again.

No further changes to the scaffold were made. However there was an objection made on Discord towards counting this run:

MrCheeze — 12:54 AM
(...) the automatic mapping is more like 100x as much help as Claude's navigator, its not <2x as much
Julian Bradshaw — 12:21 PM
Can you explain your position more here? (...)
MrCheeze — 12:28 PM
Its main importance is for Safari Zone where wandering at semi-random will never work
You have to be able to do relatively clean movement, which none of the models are even close to being able to figure out without having the explored map given to them
Julian Bradshaw — 12:30 PM
And Claude Navigator is making essentially clean slate attempts every time, and it can't see the whole Safari zone at once, so it won't be able to navigate all the way through?
MrCheeze — 12:31 PM
Yeah no chance

Now, as before, I think Gemini's map system counts as an allowed knowledge file system, because it's only filled out as Gemini explores, but wanted to mention the objection here. (Also afaik Claude hasn't ever reached the Safari zone, it seems possible to me that Claude could beat Safari Zone just by writing down its exploration pathways. Also o3 has already beaten Safari Zone with a different but similar mapping tool.)

@JulianBradshaw can you link the discord convo?

@Sketchy dm'd.

Just to be explicit about https://www.twitch.tv/gemini_plays_pokemon

Assuming no additional assistance, the second run of Gemini Plays Pokemon will count for the resolution of this market (if it finishes). However, it's on the edge and I will treat any additional assistance with a lot of skepticism. More discussion here.

bought Ṁ1,000 YES from 79% to 85%

Gemini 2.5 Pro beat Pokémon Blue earlier today. However, I don't think it should count, as too many substantial changes were made to the scaffolding throughout the run, including, e.g., allowing the main Gemini to call a separate Gemini with a prompt dedicated to solving boulder puzzles in Victory Road (a "strategist" Gemini).

In general it was deliberately a pretty loose, experimental run. Later runs may count.

@JulianBradshaw agreed. I think it was significantly over the boundary of mid-game tweaks.

I’ll have to think about if the structure itself of map labelling is too much. Probably not.

@JulianBradshaw Would the run that's just been started where the previous scaffolding that's been developed won't be changed and the model will now run autonomously count?

@DarklyMade

From the description of the second run:

There are no changes to prompts or tooling, so this run serves as a clean test of all the improvements made during the first run. There won't be any developer interventions unless Gemini becomes hard-stuck due to a system limitation.

Yes, I think this would count. Although it's quite a bit of game-specific scaffolding, that was never disallowed and none of the structure I see is that egregiously bypassing the need to have the LLM drive itself.

@Sketchy this comment has a good summary of the scaffolding https://old.reddit.com/r/ClaudePlaysPokemon/comments/1kdjysi/gemini_beats_pokemon/mqblfeh/

In particular see

The other Gemini subagent is the Boulder Puzzle Solver, who is prompted with some pretty specific instructions on what kind of reasoning to use to solve the puzzles in Victory Road - examining the gates and switches to figure out which unsolved puzzles need to be done, and what sequence of pushes would accomplish that.

@jack yea the boulder puzzle solver is the worst part to me. But I didn’t disallow game-specific prompting or scaffolding, and it’s still LLMs making the decisions here. Since the boulder puzzle solver is baked in from the start of the run I think it’s acceptable.

It does seem like the whole benchmark is getting goodhart’ed pretty quickly but I’m trying to stick to my initial criteria.

@Sketchy The part of the criteria that seems relevant is

With minimal non-LLM programmatic assistance. I think the automatic pathfinding that Claude is using is a little bit cheating, if that helps with the spirit of this market. Something roughly twice as bad would maybe start to not count.

Which I personally thought this exceeded.

bought Ṁ500 YES

@jack The thing is, the boulder-solver is just an instance of Gemini with different prompting, not some kind of programmatic algorithm.

Ah, I guess I (mis)read that criterion as about assistance in general, not just programmatic assistance. Ok, in that case maybe the thing that stands out most for programmatic assistance is

Every tile it has seen in the map is recorded and the seen layout of the current map is given to it at all times. This includes warps and their destinations, and objects. Also, the list of which tiles can actually be reached or not in the current game state is calculated and told to Gemini.

@JulianBradshaw right this is where my head is at too. If it was a programmatic boulder solver that would be different.

@jack yea agreed. Does that feel twice as bad as the Claude pathfinding? I’m leaning towards no, but I do see how a map of visited places and layout is much more than Claude’s pathfinding.

@Sketchy Wanted to say, as far as I know there's only been two changes to the scaffold since the start of this Gemini run (run #3):

- when the navigator paths Gem into a moving NPC, it waits for the NPC to move instead of bonking
- fixed a flaw in the harness where cut trees are considered non-traversable when HM01 has been deposited in the PC

According to GeminiPlaysPokemon, the first one is a QOL fix to not waste time, 2nd change is a bugfix, so they feel it's legitimate.

Citing Discord:

@Sketchy I think the main thing about Claude's pathfinding is that it only applies to the current screen, whereas Gemini's applies to the whole map (if I understand correctly)? And Claude had tons of problems seeing the map, especially around visitability and warps, e.g. constantly trying to get through "gate-like structures", which Gemini's scaffold bypasses.

But I don't have a very detailed understanding of the scaffold and how much it's helping Gemini

@jack Ultimately we don't know all the details of either the Claude or Gemini scaffolds (or o3 for that matter, which also started a run recently). They're not open source. But Claude also gets some information about which tiles are reachable/walkable or not per the excalidraw diagram linked on the twitch channel.

Gemini's map recording is arguably an allowed knowledge file system, since it only gets filled out as Gemini explores and sees new map tiles.

@DarklyMade I lean towards thinking this run should count. I think the main caveat here is that Gemini's harness has gotten around a lot of the typical LLM vision problems by a.) using game RAM information to specify if tiles are warps or not; b.) presenting map info in text form.

Now, in both of these cases, Claude and o3 are also getting the same info, just with (imo) less refined harnesses, so it's really a question of whether this class of harness counts at all. And I think the original market implies yes, since it's talking about Anthropic's own measurements of Pokémon progress using this class of harness.

Re: class of harnesses, here's an exchange between ClaudePlaysPokemon and GeminiPlaysPokemon from the former's stream that's enlightening:

asdfugil: wow claude has the same issue here as gemini but it's probably just a game bug
asdfugil: an unreachable warp right here
Gemini_Plays_Pokemon: yes there's an unused warp here in the game data
Gemini_Plays_Pokemon: for whatever reason
ClaudePlaysPokemon: i love how you have also had to go through knowing WAY TOO MUCH about weird quirks of pokemon red
Gemini_Plays_Pokemon: Haha yes I learned a lot of random stuff. There's also one like this at a gatehouse and an invisible non-functional warp beside the Silph Co President too
ClaudePlaysPokemon: pokemon devs y
ClaudePlaysPokemon: the one that caused me the most pain was figuring out how there is a list of tiles you can't walk between that is handled differently than collisions
Gemini_Plays_Pokemon: are you talking about moving from elevated ground to regular ground?
ClaudePlaysPokemon: yep lol
Gemini_Plays_Pokemon: yeah that's fun, I had to add special handling for that
ClaudePlaysPokemon: same, and it took me an annoying amount of reading the disassembly to figure it out
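
To make the quirk in this exchange concrete, here is a small Python sketch of a movement check with two separate rules: ordinary collisions, and a list of tile pairs that cannot be walked between (such as elevated ground and regular ground). Neither harness is open source, so this is only an illustration; the tile ids, `SOLID_TILES`, `BLOCKED_PAIRS`, and `can_step` are all made up for the example.

```python
# All tile ids below are invented for illustration; they are not real game data.
SOLID_TILES = {1}  # ordinary collisions: these tiles can never be entered

# Pairs of tiles the player cannot step between directly, in either direction.
# Here 2 stands in for elevated ground and 3 for regular ground.
BLOCKED_PAIRS = {frozenset({2, 3})}


def can_step(from_tile, to_tile):
    """Return True if a single step from from_tile onto to_tile is allowed."""
    if to_tile in SOLID_TILES:
        return False  # rule 1: plain collision
    # rule 2: both tiles are walkable on their own, but not from one to the other
    return frozenset({from_tile, to_tile}) not in BLOCKED_PAIRS


print(can_step(3, 3))  # regular ground to regular ground
print(can_step(2, 3))  # elevated ground down to regular ground
```

The point of keeping the pair rule separate is that a reachability calculation based only on per-tile walkability would wrongly treat the ledge boundary as passable.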

@JulianBradshaw yep, very true, but I believe Claude's reachability info is simpler and Claude can't even blindly trust that info because things like gates/doors are marked as "not reachable".

So overall I can see arguments both ways on whether this is like "twice as much programmatic assistance"

@jack So can I. I'm going to try to be definitive and say with the current setup, the run is included. But it's on the edge and I will approach any further assistance with skepticism.
