Will a large language model beat a super grandmaster playing chess by 2028?
1.8k · Ṁ1.2m · 2029 · 60% chance

If a large language model beats a super grandmaster (classical Elo above 2,700) while playing blind chess by 2028, this market resolves to YES.

I will ignore fun games, at my discretion. (Say, a game where Hikaru loses to ChatGPT because he played the Bongcloud.)

Some clarification (28 Mar 2023): This market grew fast with an unclear description. My idea is to check whether a general intelligence can play chess without being created specifically to do so (just as humans aren't chess-playing machines). Here are some of my earlier comments.

1- To decide whether a given program is an LLM, I'll rely on the media and the nomenclature its creators give it. If they choose to call it an LLM, or some closely related term, I'll consider it one. Conversely, a model that markets itself as a chess engine (or is called one by mainstream media) is unlikely to qualify as a large language model.


2- The model can write as much as it wants to reason about the best move. But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.

I won't bet on this market, and I will refund anyone who had open bets as of 28 Mar 2023 and feels betrayed by this new description. This market will require judgement.

  • Update 2025-01-21 (PST) (AI summary of creator comment): LLM identification: A program must be recognized by reputable media outlets (e.g., The Verge) as a Large Language Model (LLM) to qualify for this market.

    • Self-designation insufficient: Simply labeling a program as an LLM without external media recognition does not qualify it as an LLM for resolution purposes.

  • Update 2025-06-14 (PST) (AI summary of creator comment): The creator has clarified their definition of "blind chess". The game must be played with the grandmaster and the LLM communicating their respective moves using standard notation.
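
As a rough illustration of the exchange described in that update (a minimal sketch, not part of the resolution criteria; it assumes the python-chess library and a hypothetical `ask_llm` helper that returns a single move in standard algebraic notation):

```python
import chess  # python-chess


def ask_llm(san_history: list[str]) -> str:
    """Hypothetical helper: send the moves played so far to the LLM and
    return its reply as one move in standard algebraic notation, e.g. "Nf6"."""
    raise NotImplementedError


def play_blind_game() -> str:
    """Relay moves between the grandmaster and the LLM in SAN only.
    Assumes, purely for illustration, that the grandmaster plays White."""
    board = chess.Board()
    history: list[str] = []
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            san = input("Grandmaster's move: ").strip()
        else:
            san = ask_llm(history)
        try:
            board.push_san(san)  # rejects malformed or illegal moves
        except ValueError:
            side = "grandmaster" if board.turn == chess.WHITE else "LLM"
            return f"forfeit: illegal move by the {side}: {san}"
        history.append(san)
    return board.result()  # "1-0", "0-1" or "1/2-1/2"
```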


The psychology of using manifold is so weird. My estimate for this question is something like 10-20%. Last week it felt like there was a bit of a head of steam for this, and I thought I might be missing an opportunity if I didn't buy NO at ~50%. But now there's a limit order at 60% that I could sink my whole balance into. On the one hand, it seems like a great deal, but on the other hand, I don't want to spend my whole balance, especially with daily loans having been nerfed.

bought Ṁ1,000 NO from 62% to 60%

@MP

Consider a hypothetical future product marketed as an LLM that has an improved version of "reasoning": it can transparently write Python code and execute it using a pre-existing Python interpreter in pursuit of more accurate answers in some scenarios (for now I'll ignore the possibility that such a feature could be added without informing anyone). This particular product does not provide any insight into its "reasoning" process, so you cannot know whether or not the pre-existing Python interpreter was used.

Would such a product count as an LLM for the purposes of this market?

My interpretation would be that it would not, as it would be using a resource not located in its weights.
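
To make the distinction in the question concrete, the hypothetical product described above would look roughly like this under the hood. This is purely illustrative: `generate_text` stands in for the model's forward pass and `exec` for the bundled interpreter; the point is only that step 2 happens outside the model's weights.

```python
def generate_text(prompt: str) -> str:
    """Stand-in for the LLM's forward pass: tokens in, tokens out.
    Everything this function 'knows' lives in the model's weights."""
    raise NotImplementedError


def answer_with_hidden_tool(question: str) -> str:
    # Step 1: the model itself only emits text -- here, Python source code.
    source = generate_text(f"Write Python that sets `answer` to: {question}")

    # Step 2: a separate, pre-existing interpreter executes that source.
    # This is the step that goes beyond the weights, even if the product
    # never surfaces it to the user.
    namespace: dict = {}
    exec(source, namespace)
    return str(namespace.get("answer"))
```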

What counts as an LLM? Do reasoning models qualify as LLMs?

@AlanTuring Seems clear that the answer is yes based on the description:

The model can write as much as it wants to reason about the best move

@SimonWestlake interesting. If the AI can write its own code to write a chess program then it wins. I really like this question.

@AlanTuring The very next sentences:

But it can't have external help beyond what is already in the weights of the model. For example, it can't access a chess engine or a chess game database.

You raise a great question. Does something like a Python or JS interpreter count as "external help"? It's certainly allowed to write code, but I don't think it would be allowed to use an external program to execute that code.

@MP Would you agree with my interpretation?

@SimonWestlake I think AI models today come with their own internal tools. If you upload an image to O3 it uses its own internal tools to analyze the image and write you a response. I know Gemini 2.5 Pro can use its own internal tools to compile Python code to check successful compilation before returning the code answer to the user. I suspect a similar thing could happen with AI models and chess. They write their own python code to play chess then use internal tools to verify the code compiles and then plays against the user. That would require long term planning and scenario modeling which is missing from current LLMs.

@AlanTuring I think you're using a different definition of "internal". The tools you're talking about are internal in the sense that they are abstracted away from being exposed to the user, but they are surely still external from the actual LLM and its weights.

From the description (emphasis mine):

But it can't have external help **beyond what is already in the weights of the model**

One big reason to bet on NO: the criteria are pretty specific. Like, say the bet was that a kangaroo will win a jousting competition against a blond haired boy between the ages of 6 and 8...

Sure, okay maybe the kangaroo'll win, I guess? But that specific event has to happen (and be publicized) before we even get to who'll beat who.

You might think that LLMs play chess more than kangaroos joust, but remember that this has to be a high-capability LLM that doesn't have access to external tools. Who makes that sort of thing, these days, and why are they entering it in chess tourneys?

Also, note the title: "Will a ..." not "Could a ...". If this event doesn't happen, it should resolve NO, not N/A.

bought Ṁ100 NO

@DanHomerick What about it is so specific? It really doesn't seem that way to me.

@SimonWestlake
1. There are currently only ~30 players in the world with >2700 rating;
2. One of them has to agree to play a machine while blindfolded;
3. The machine cannot be a chess engine or similar, but must be a pure LLM

@pietrokc Got it, thanks. Point 1 is certainly a big one.

Let me get this straight.

It is July in the year of our lord 2025. Almost 3y after GPT-3 burst onto the scene, almost 1y after "reasoning" models came out. And currently, the best models still make ILLEGAL chess moves in the midgame.

And we are saying >50% chance an LLM BEATS a >2700 player?

Do people not realize how irrelevant the blindfold is? A 2000 Elo player is better than probably anyone you've ever met. Yet here is an IM (not even a GM, let alone >2700) beating the 2000 player while blindfolded: https://www.youtube.com/watch?v=4VVGlmtfEYw

What's the thinking here? People don't know that GMs see the whole board in their heads? People think the AI messiah will arrive in the next 3y (it's always 3y out), and also the AI messiah will be an LLM?
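
For what it's worth, the illegal-move claim is straightforward to test: run any model-generated transcript through a legality filter and see how far it gets. A minimal sketch using the python-chess library (the move list here is just a placeholder, not a real model transcript):

```python
import chess  # python-chess


def first_illegal_move(moves_san: list[str]) -> int | None:
    """Return the index of the first malformed or illegal move in a SAN
    move list, or None if the whole sequence is legal."""
    board = chess.Board()
    for i, san in enumerate(moves_san):
        try:
            board.push_san(san)
        except ValueError:  # covers illegal, ambiguous and malformed SAN
            return i
    return None


# Placeholder transcript. Black's "O-O" is illegal here (the f8 bishop and
# g8 knight are still at home), so this prints 9.
print(first_illegal_move(
    ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Bxc6", "dxc6", "O-O", "O-O"]))
```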

@pietrokc As LLMs have developed over the past few years they have developed abilities non-linearly/emergently. Where previous models might have failed at a simple task 100% of the time, the next model can suddenly complete the task 99% or 100% of the time. It’s very difficult to predict what the next generation of AI models will be capable of.

I’m not saying I agree with the current market probability btw.

@DylanSlagh But this market is not about "AI". It is about LLMs, which I take to mean a big neural net that assigns probabilities to the next token.

We already have "AI" that can run on a phone and beats any human, but it's not an LLM, because an LLM is close to the worst imaginable architecture for a chess program.

Of course I agree the capabilities demonstrated by LLMs over the past 3y are very surprising, especially in fuzzy domains. But it has been pretty easy to predict their performance in mathematical / algorithmic domains -- they are uniformly very brittle, very stiff, and are no good in flexible scenarios even a little outside the training distribution.

bought Ṁ50 YES

@pietrokc To quote an earlier comment:

> It's starting to look like this market is just a countdown to whenever one of the frontier labs decides to apply reasoning post-training to chess.


https://manifold.markets/MP/will-a-large-language-models-beat-a#q26h0v3fdc

@SamuelKnoche That's just buzzwords. The fact remains that LLM is almost the worst conceivable architecture for this task. They applied a hell of a lot of "reasoning post-training" to math and they're still miserable at math.

@pietrokc LLMs have gotten a lot better on math thanks to "reasoning post-training", and I'd expect the same for chess. With gpt-3.5-turbo-instruct we've already gotten a good amateur chess playing LLM without even trying properly.
https://x.com/GrantSlatton/status/1703913578036904431
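
For context, the gpt-3.5-turbo-instruct results linked above came from framing the game as a PGN completion, so the model only has to continue a text format it has seen a lot of in training. A rough reconstruction of that prompting style (the exact prompt wording and the `complete` helper are assumptions, not taken from the linked thread):

```python
import chess  # python-chess


def pgn_prompt(moves_san: list[str]) -> str:
    """Render the game so far as bare PGN movetext, e.g. "1. e4 e5 2. Nf3 ",
    leaving a trailing space for the model to complete with its move."""
    board = chess.Board()
    parts: list[str] = []
    for i, san in enumerate(moves_san):
        if i % 2 == 0:
            parts.append(f"{i // 2 + 1}.")
        parts.append(san)
        board.push_san(san)  # validate the prefix as we build it
    return " ".join(parts) + " "


def complete(prompt: str) -> str:
    """Stand-in for a text-completion call (a few tokens, low temperature);
    the continuation is then parsed back into a SAN move."""
    raise NotImplementedError


# e.g. next_move = complete(pgn_prompt(["e4", "e5", "Nf3"])).split()[0]
```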

@SamuelKnoche Yes, they got better at math, because they started unable to add single digit integers. They're still very bad at real math that's not in the training data.

Extremely wrong about "good amateur". The best models today STILL make illegal moves in the mid game. As I've already said.

@pietrokc
1. Yes, LLMs can't do novel math yet, but they're saturating most math benchmarks and even making meaningful progress on FrontierMath. Seems clear that RL post training is very effective at whatever thing you apply it to.

2. Yes, current best models can't play chess, but gpt-3.5-turbo-instruct could play reasonably well, and that is a clear demonstration that it's just a matter of being trained on the right data mix. The only reason the best models are bad at chess now is that labs don't really care about it.

3. We already know that neural networks can get quite good at board games without any kind of explicit tree search: the original AlphaGo Zero network (a ~50M-parameter model, vs. today's >1T-parameter LLMs), without search, played Go at a very good amateur to low-ranked professional level.
https://gwern.net/doc/reinforcement-learning/model/alphago/2017-silver.pdf#page=5

@SamuelKnoche
1. Wrong. This is what they tell laypeople who can't check for themselves. They "saturate" a lot of benchmarks but as soon as a new set of problems comes out, that isn't in the training data, they do miserably at them. Right now you have a rare opportunity to check this for yourself, since the IMO was 2 days ago. Just run the problems through all LLMs you want. The claim that LLMs can do "research level math" (FrontierMath's claim) is laughable to anyone who actually knows math and tried the models.

> RL post training is very effective at whatever thing you apply it to.
This is just a bonkers thing to say, and contrary to 50+ years of computational complexity theory.

2. I strongly doubt chess evals from 2 years ago, when nobody really knew how to evaluate these models. Why on earth would a model without reasoning do well at chess? There is simply no mechanism for it. There are in fact theorems that it cannot be done. (An LLM without chain of thought is just a TC0 circuit, and it cannot keep track of long sequences of moves.) Why on earth would models become worse at chess once reasoning was added?

3. AlphaGo is not an LLM, it is not trained as an LLM, its inputs and outputs are not words, so this point is irrelevant.

@pietrokc
> I strongly doubt chess evals from 2 years ago...

It was clearly demonstrated to be so, replicated by many people. Again, it's about the data mix. If no chess reasoning examples are provided, it's not going to help much on chess, just as the best human reasoners/mathematicians will be terrible at chess if they've never played.

> AlphaGo is not an LLM, it is not trained as an LLM, its inputs and outputs are not words, so this point is irrelevant.

A single forward pass of a conv net like the one in AlphaGo Zero shouldn't be any more 'expressive' computationally than a forward pass of an LLM. The AlphaGo Zero network does have the advantage of a more direct encoding of the board, and it doesn't also have to encode how language works, but I'd expect an LLM's nearly five orders of magnitude larger parameter count to easily compensate for this.

@SamuelKnoche
> It was clearly demonstrated to be so, replicated by many people

No, a few claims were posted on Twitter, which does not constitute a "replication". I also remember a lot of posts and videos debunking this claim. Unfortunately the model is no longer accessible so we can't check, but your claim about "data mix" is simply not tenable given everything we know. (Models with more data and reasoning are bad at chess, models routinely fail simple questions outside training distribution, models without chain of thought cannot keep track of state.)
