Will Grok 3 be 'the most powerful AI in the world'?
309
Ṁ81k
Jul 31
25%
chance

Elon Musk is talking big: https://x.com/tsarnick/status/1815493761486708993. Says that Grok 3 will come out in December and 'should be' the most powerful AI in the world.

Resolves to YES if Grok 3 is, at the time of its release, plausibly the most powerful AI in the world according to my best judgment. Has to be at least as strong as all models publicly available at the time.

Resolves to NO if it is not the most powerful.

(Resolves NO if no such model is released by 7/23/25, to ensure this doesn't go on forever.)

As of 7/23/2024 Claude Sonnet 3.5 is IMO the most powerful AI, but GPT-4o would also resolve to YES based on its position at #1 on Arena and other ways in which some people prefer it. Gemini 1.5 Pro or Advanced would not qualify, but would have counted prior to Sonnet 3.5 and GPT-4o.

(I will not take clarifying questions on my criteria here, it will be my subjective take on 'is this plausibly the best LLM I can access right now.')

  • Update 2025-05-01 (PST): Reasoning models are a different class of AI and do not count for the purposes of resolving this market. (AI summary of creator comment)


I do think both interpretations are reasonable, and I could argue both sides. I understand both cases, although I would still be inclined to make the same decision again.

But I have learned that once you make a decision like this, you HAVE TO stick with it. Reversing yourself makes things go crazy, even if you decide you made the wrong initial decision, and the only thing you can do after that is turn it over to the mods or stick with what you said.

Given that an accusation that my actions are disingenuous is at 5-0 thumbs up here (and I've been outright accused of LYING, among other things, seriously WTAF), despite the market being where it was 2 days before the ruling, which I REALLY REALLY don't appreciate, honestly, I don't need this trouble. I hereby ask the mods to take over this question so I can wash my hands of it, and they can do whatever they decide is best.

Hope everyone's happy now. Enjoy.

@ZviMowshowitz @mods Adding to mod queue

sold Ṁ9 NO

The point of this market looked to be evaluating Musk’s claim but it’s now purposely excluding models that are expected to invalidate the claim. The resolution here is no longer about it being the most “powerful AI.”

It seems disingenuous to pose the question, ‘At the time of its release, is Grok 3 plausibly the most powerful AI in the world according to my best judgment?’ and then later exclude reasoning models - even though they are clearly a subset of AI in the question's spirit.

@turtle6agqe I don’t think it’s disingenuous at all. Where’s the dishonesty? I don’t see a personal benefit to clarifying the market either way.

@Bayesian "disingenuous" does not need to be for personal gain.

not candid or sincere, typically by pretending that one knows less about something than one really does.

@turtle6agqe is simply stating that it seems like @ZviMowshowitz is lying about whether he understands the objections being raised by traders on this market.

@DavidFWatson yeah but why else would you lie?

@Bayesian To be clear, I don't think he's being disingenuous (nor do I think he’s lying). I think he's just not really thinking through the objections. He's the owner of a lot of markets, trying to make quick responses to questions, and some of those responses are gonna be dumb!

What's important is continuing to iterate on feedback

@Bayesian Oh, the obvious reason why one might lie is because they have some core worldview that would be undermined by telling the truth. When MAGA people are quizzed about world events outside a political context, they often get things right that they will get wrong when polled in a political context.

Are they lying for personal gain when asked in the political context? Not really. They don't really gain anything except the satisfaction of expressing their political affiliation.

As I already stated, I don't think that's what's happening here, but it's a plausible explanation in some circumstances, and a great use of the word "disingenuous"

bought Ṁ1,250 NO

By reasoning models you should only count models with thinking time in their inference, which means in the contest you should even include DeepSeek V3, Gemini 2.0 etc.

bought Ṁ75 YES

Given we’re excluding reasoning models and Grok 3 was trained on considerably more compute than any other model, why would it not be “the most powerful”?

What am I missing here? Are we just betting on whether scaling laws are a thing & will continue to apply at Grok 3 scale?

@elf it's also a lot about quality of training data, fine tuning, RLHF etc.

@elf might not be as good as Claude 3.5 Opus or Claude 4 or Gemini 2 Pro

@ZviMowshowitz what would you think if Grok-3 was the most powerful low-latency model (e.g. better than Sonnet, Gemini 2, o3-mini on low compute) but also clearly less powerful than o1?

@JoshYou

it will be my subjective take on 'is this plausibly the best LLM I can access right now'

Seems like reasoning models don't count

@JoshYou When I asked the question I did not anticipate reasoning models. I am going to say that reasoning models are a different class of thing, and they don't count for this purpose.

@jim reasoning models are LLMs

@JoshYou my obsession with the truth over appearance of truth led me to the correct conclusion.

@ZviMowshowitz All models are reasoning models to some extent. It feels pretty artificial to exclude those with a separate chain of thought. For all we know, Grok 3 might use really long chains of thought for tough questions (burning through way more inference compute than its competitors), or it might even be a high-latency reasoning model itself.

On top of that, OpenAI doesn’t seem to care much about GPT-4o anymore. These days, it’s just a crowd-pleaser with a high LLM Arena rating but slightly weaker in benchmarks compared to its version from last May. Google also seems more focused on LLM Arena, as their newest Gemini is actually weaker than the current GPT-4o in benchmarks. It looks like all the big companies are putting their main efforts into AIs with highly variable inference compute, so Grok 3 would be competing in a race that everyone else has mostly abandoned. Calling it the most powerful AI in the world is like letting a young guy compete in the senior Olympics and then crowning him world champion.

bought Ṁ50 NO

At the moment DeepSeek looks very good. But Grok is not bad at all.

No one should vote in this market because the other question from this author about Gemini being the best at the end of 2024 was resolved yes, even though it's not the end of 2024 yet.

Sorry Zvi, but this has been annoying the heck out of me.

2 traders bought Ṁ255 NO
bought Ṁ50 NO

I tend to distrust Elon Musk's predictions, by default. His ever-changing timeline for self-driving cars betrays both a lack of rigor with predictions and bad epistemology by not updating it.

Take a look at this LLM benchmark:

https://livebench.ai/

A way better/fairer ranking than lmsys imo

bought Ṁ50 YES

Seeing how good Grok 2 is makes me think it will at least be on par with 3.5 Opus and whichever models OpenAI and GDM release before the end of the year.

Llama 3.1 was trained on 15k GPUs, so Grok 3 should have unprecedented scale

I recommend substituting your best judgement with Elo score on Chatbot Arena.
