This market is part of the paper: A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
This market resolves based on whether, at each specified date, none of the models considered SOTA is a reasoning model.
Reasoning Model Definition
A "reasoning model" must meet all of the following criteria:
- It is a language model - The system must be able to take language as input and produce language as output. As an example of what would not count: AlphaGo.
- It has been trained to use inference-time compute - The system must have undergone significant training to use more than a single forward pass before giving its final output, with the ability to scale inference compute for better performance.
- The extra inference compute produces an artifact - The way the model uses extra inference compute must produce some artifact, such as a classic chain-of-thought or a list of neuralese activations. For example, a Coconut model counts as a reasoning model here (see the sketch after this list).
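
To make the second and third criteria concrete, here is a minimal Python sketch contrasting a single forward pass with inference-time scaling that leaves behind an artifact. The `model` callable, the function names, and the `FINAL:` convention are hypothetical illustrations, not anything specified by the paper or this market.

```python
def answer_single_pass(model, question: str) -> str:
    # Non-reasoning behaviour: one forward pass, no intermediate artifact.
    return model(question)


def answer_with_reasoning(model, question: str, max_steps: int = 16) -> tuple[str, list[str]]:
    # Reasoning behaviour: the model may use many forward passes before
    # committing to a final answer, and the intermediate steps are recorded.
    chain_of_thought: list[str] = []
    context = question
    for _ in range(max_steps):
        step = model(context)            # one more unit of inference compute
        chain_of_thought.append(step)    # the artifact this market cares about
        context += "\n" + step
        if step.startswith("FINAL:"):
            break
    final_answer = chain_of_thought[-1].removeprefix("FINAL:").strip()
    return final_answer, chain_of_thought
```

Under this sketch, only the second function would count as reasoning: it scales compute across multiple passes and produces an inspectable artifact (`chain_of_thought`), whether that artifact is human-readable text or, as in Coconut, a sequence of latent activations.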
State-of-the-Art (SOTA) Definition
A model is considered "state-of-the-art" if it meets these criteria:
- Widely recognized by AI community consensus as among the 3-5 best models 
- Among the top performers on major benchmarks 
- Deployed status: The model must be either:
  - Publicly deployed (available via API or direct access), or
  - Known to be deployed internally at AI labs for actual work (e.g., automating research, production use)
  - Models used only for testing, evaluation, or red-teaming do not qualify
 
- Assessed as having significant overall capabilities and impact