Will the first AI to get IMO Gold integrate Lean?

Plus

Ṁ15k

Jul 31

chance

ALL

This question resolves when an AI first gets IMO Gold. It resolves YES if the system that gets gold uses as a subsystem any piece of code from the Lean community, including any code from the leanprover or leanprover-community GitHub organizations.

Update 2025-07-20 (PST) (AI summary of creator comment): In the case of multiple, near-simultaneous claims of achieving an IMO Gold:
- A best-faith effort will be made to determine which attempt was initiated first.
- If it is unclear which was initiated first, the benefit of the doubt will be given to the party that announced first.

Update 2025-07-22 (PST) (AI summary of creator comment): In the event of multiple, near-simultaneous claims of an IMO Gold, the market will resolve based on which attempt started first, not which attempt finished first.

Update 2025-07-22 (PST) (AI summary of creator comment): In the event of multiple claims of an IMO Gold, the market will resolve based on a single lab's attempt. The creator will use the following procedure to select that lab:
- First, identify all labs that have achieved gold under the rules.
- From that group, determine the set of labs that could have started their attempt first.
- From this set of potential first-starters, the creator will select the one with the earliest announcement.

Probably if you are reading these AI summaries, you should just read the thread below.

This question is managed and resolved by Manifold.

#AI

#Technical AI Timelines

#IMO Grand Challenge

#The Lean Theorem Prover

Get

1,000

and

3.00

24 Comments

33 Holders

120 Trades

Sort by:

Why are you resolving this before companies are announcing the results they got on the IMO? All announcements so far are violating the wishes from the IMO organizers to wait one week before announcing their AI performance. How do you know that none/the first one did not integrate Lean?

@BoltonBailey yeah I def think this might happen, like multiple labs tried to get gold at the same time, some of them using formal-to-formal, and if any of them got gold and haven't announced it yet it seems reasonable to call them first even if they waited respectfully on making their announcement

@FlorisvanDoorn fwiw i suspect google is respecting the wishes of the IMO organizers, but i don't know the context. if you have reason to believe otherwise i'd be curious to see it

@FlorisvanDoorn Good point, I guess I was thinking that the embargo had been lifted and that Deepmind and OpenAI were the main ones competing here (I had seen a message that the Deepmind CEO said he had permission from the IMO organizers to go public). If it turns out that another gold announcement gets made that claims to have been temporally first to run their system, I'll see if I can ask the mods to re-resolve. (I guess I am also not really expecting a concrete enough claim of that type to be made in any other announcements)

@BoltonBailey It was something of an oversight on my part to make this market with the ambiguity around what "first" would mean in the context of simultaneous announcements, so sorry to the traders about that.

@BoltonBailey Some math superintelligence research lab has said they would present their results on july 28, and it seems pretty plausible they’ll get gold, and they’re almost certainly integrating lean

It would imo be valuable to reopen the market til end of month, but ofc up to you

@BoltonBailey There is good reason to believe there will be such an announcement:
https://x.com/HarmonicMath/status/1947023450578763991

The ambiguity in the question is indeed unfortunate.

@Bayesian If it's no skin off your back, you can reopen this (and the similar markets I just resolved) now if your mod powers allow you to.

Everyone should be aware though, that unless there is a pretty clear statement from the putative other gold winner suggesting that they formalized and started their computer running on the problems fast, I will still want to resolve this NO.

@BoltonBailey Ok, reopened! You can set a different close date if you prefer some other time

@Bayesian Looks good to me! Thanks!

@BoltonBailey

Everyone should be aware though, that unless there is a pretty clear statement from the putative other gold winner suggesting that they formalized and started their computer running on the problems fast, I will still want to resolve this NO.

I thought back on it and i'd like to ask further clarification. It seems to have been indicated by ppl with second hand info (so not something i would classify as certain or wtv) that the IMO 2025 committee knew AI companies would want to compete on the IMO, they were collaborating with AIMO or whoever has their 10 million dollar bounty on an open source IMO gold performance, and they invited a bunch of AI labs to come in contact with them to get the questions at the same time as the human competitors to compete in 4.5 hours each day on the IMO problems. They were thus given 9 hours total, each, to solve all the problems. This may not have been the case for EG OpenAI, who didn't coordinate with the IMO in this way, and just solved the problems in 4.5h each day once the problems were made public. If this picture is correct, and multiple labs got IMO gold in 9 hours, does it matter to this market if any of the labs, like, got it at the 8th hour mark and stopped then, or stuff like that? It likely won't be known if anyone finished before anyone else, and the date of each lab's announcement will likely have no incidence on whether they finished early in the allotted time or not, and really given that nobody will have solved problem 6, they likely tried problem 6 until they ran out of time and knew they hadn't solved it, so idk what relevance there is in considering first whoever gave up on problem 6 instead of trying to solve it til the very end.
I apologize for the ramble, but yeah I wonder if you're basically treating as "first" any lab that did it in the 9 hours of allotted time, or not, assuming almost all of them, except perhaps openai, received the problems at the same time as other participants because they were officially coordinating with the IMO committee to receive them at that time.

@Bayesian I anticipated that there would be possibilities like different amounts of running time that that would make it very hard to determine which AI finished first if I had defined first on the basis of the finishing time. This is part of why I decided in my clarification response to Fynn two days ago that I would resolve based on which attempt started first rather than which attempt finished first. I also just think that "start time first" is a more sensible way to capture "first" here, since the point of the test is not to be a race.

The "9 hours total, each, to solve all the problems" comment is separately interesting to me though, since it leaves open the possibility that a lab could take more than 4.5 hours on the first three problems, thus making the task of the AI technically easier. But I don't think it changes anything unless Christiano wasn't aware of it / if it seems like it would change his opinion. I am trying to count as a "valid attempt" anything he would have counted.

@BoltonBailey tbc I used 9 hours total as a shorthand but obviously they would never spend more than 4.5hours per set of problems, that would be cheating. I also think if they receive the problems at the same time there's not a strong sense in which any of them starts first. so yeah, just making sure, can multiple labs "start first" if they were all sent the problem at the same time and probably automatically set their system running at the same time as that? like the latency of a lab receiving the problem 20ms later doesn't matter for this market, all simultaneous starts count as first? that's mostly my question bc i expect it to come up

@Bayesian For this market I am not going to use an interpretation where there are "multiple labs starting first" or partial resolves or anything like that, I feel it is cleaner if there is a single lab off of which the market gets resolved, and so once the month is over I'll make a decision about which exact lab that will be.

To elaborate perhaps a bit more, the decision procedure will be:

I will look at all the labs who have achieved gold within the rules.
I will gather, from those labs, social media or internet posts that I consider to be "announcements" of the gold performance.
I will look at the announcements of those labs to attempt to infer what possibilities over times the computer systems could have begun processing any of the problems.
I will then consider the set of labs for which there is some possibility they started first.
I will look at the announcements of those labs, and, on the basis of the timestamps of the posts and my own best judgement, pick the announcement among them I think was likeliest to have been posted earliest.
I will resolve this market according to my best analysis, on the basis of public information I can find, of whether the given AI used Lean.

So to elaborate more on the 20ms point: If I think that two bots probably started within 20ms or 10s or 10min of each other, I think I am likely to think that either could have possibly started before the other. But even if that happens, I still think I am likely to think that openAI will be in the set defined in step 3, and unless there is some other announcement before theirs that I didn't hear about, I'll resolve this NO.

But even if that happens, I still think I am likely to think that openAI will be in the set defined in step 3, and unless there is some other announcement before theirs that I didn't hear about, I'll resolve this NO.

I will look at the announcements of those labs to attempt to infer what possibilities over times the computer systems could have begun processing any of the problems.

I don't really understand this explanation. The announcements will likely not provide much if any evidence about which labs started first. why would it? but the possibilities will be, assuming {lab1, lab2, ..., labn} all solved the IMO gold, that

lab1 solved the imo first
lab2 solved the imo first

...

n. labn solved the imo first

for google specifically, they likely started before openai, because they were coordinating with the IMO organizers. and the formal-to-formal labs plausibly started before google, because they were also coordinating with the IMO organizers. but we may not know. My understanding, then, from point 5, is that announcement timestamp becomes the tie breaker? in which case, if i am correct that the preponderance of evidence will point toward openai having not started first, then google will likely win the tiebreaker and the market will resolve NO in favour of google first getting IMO gold. Is my understanding correct, under this assumption?

@Bayesian

I don't really understand this explanation. The announcements will likely not provide much if any evidence about which labs started first. why would it?

Indeed, that is why I said in the comment where I said you could reopen the market that "unless there is a pretty clear statement from the putative other gold winner suggesting that they formalized and started their computer running on the problems fast, I will still want to resolve this NO".

I chose to base it only on the announcements because I didn't want the resolution of this market to come down to chasing down offhand comments by various researchers, it seems to me that that kind of thing makes manifold less about prediction skill and more about being good at quickly reading twitter and rules lawyering. I only allowed a possibility of changing the resolution form openAI at all because I felt that if one of the announcements really did end up communicating that one team tried hard to be physically first (for the glory of it or whatever, IDK), it would feel unfair not to respect that in the resolution.

for google specifically, they likely started before openai, because they were coordinating with the IMO organizers.

I disagree with this logic, if anything I think it might be more likely that OpenAI may have started first because it was their plan all along to announce immediately after the closing ceremony, whereas it seems GDM and other were initially planning to release a week later, so they might not have felt the same time pressure to start the program to have enough time to interpret the results and prepare press releases. Regardless ...

but we may not know.

I indeed agree that based in the info in the GDM and OAI announcements, I am uncertain which came first, so if those are the only announcements we get, the set in step 3 will be {OAI, GDM} and the tiebreaker in step 5 will go to OAI and the market will resolve NO.

preponderance of evidence will point toward openai having not started first

This seems like a misunderstanding of my intent. What matters is not what the preponderance of evidence shows (only one model can have a preponderance of evidence of being first, so if this were the criterion, there would be no need for further tiebreaks), but whether there is a possibility in my mind that one model came first. The key phrase in my numbered list being "there is some possibility they started first" and the key phrase in my clarification to Finn being "If it is unclear..."

So while I am not betting at this point, it seems very likely to me that whether or not there is a third announcement OAI will win the tiebreak and this will resolve no. In order for this not to happen, we would have to have something like

Harmonic announcing formal gold
Also saying in their announcement something like "we got formal problems preprepared by the IMO committee and released to us the minute the exam period on day one ended".

@BoltonBailey Alright, thanks for the clarifications. I understand better now. and I'm sorry for misunderstanding your intent / asking maybe obtuse-sounding questions

bought Ṁ1,500 NO

GDM used pure natural language based on their blog post: https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/

bought Ṁ50 NO

@BoltonBailey if both OpenAI and Google (2 days later) claim Gold, but only Google integrates Lean, how will this resolve?

@Fynn I will make a best-faith effort based on the announcements of both parties to determine which attempt was initiated first. If it is unclear, I will give OpenAI the benefit of the doubt for being first on the basis of their announcement coming first.

I think it's extremely unlikely that it will integrate multiple proof assistants, so ArbBot will try to keep the sum <= 100%.

Related questions

Related questions