OpenAI press-released their o1 set of “models” today. For at least a few months, there has been hype about this, fueled by OpenAI, under the codename “strawberry”. This was supposed to be their Artificial Super Intelligence (ASI) model. Twitter is abuzz with OpenAI employees sharing their talking points, and a big chunk of Twitter is verbatim repeating them. So, it becomes necessary to inject context into the happenings around the “o1 models”. That’s what this post is for. This post is a bit ad hoc — it’s a busy day for me — but it should get the points across.
The results of o1 are both impressive and not impressive
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). — OpenAI blog post
But you already knew that if you opened the Twitter app anytime today. These are phenomenal results. However, this comparison makes little sense. That vast gap in performance comes from comparing a model like gpt-4o to what is an agentic system.
Yes, that’s right. The o1 “models” (which OpenAI insists on calling models) are agentic systems. To understand this, consider this illustration of how reasoning works in the o1 system.
For any input, the system takes a number of “turns” talking to itself. All of this is abstracted away in the gray “Reasoning” box. So, it is hardly a fair comparison with a model that gets only one turn to respond. This is weird, unscientific stuff, but OpenAI ditched being a scientific organization long ago; they are a PR-first company that happens to employ researchers. But we digress. So yes, o1 is an agent and not a model, and other agentic systems have already far exceeded gpt-4o’s performance on many of these benchmark datasets. Presenting the gap between gpt-4o and o1 without that context is a misrepresentation.
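To make the distinction concrete, here is a minimal sketch of what a multi-turn, self-talk loop around a single-turn model looks like. The prompts, turn budget, and stopping check are my illustrative assumptions; OpenAI has not disclosed how o1's loop actually works.

```python
# A minimal sketch of a multi-turn "self-talk" loop wrapped around a single-turn
# model. The prompts, turn budget, and "FINAL:" stopping convention are
# illustrative assumptions, not OpenAI's (undisclosed) o1 procedure.
from openai import OpenAI

client = OpenAI()

def reason_then_answer(question: str, max_turns: int = 5) -> str:
    messages = [
        {"role": "system", "content": "Reason step by step. End with 'FINAL:' and your answer."},
        {"role": "user", "content": question},
    ]
    hidden_trace = []  # the "thought" turns a caller would never see
    for _ in range(max_turns):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        turn = reply.choices[0].message.content
        hidden_trace.append(turn)
        if "FINAL:" in turn:
            return turn.split("FINAL:", 1)[1].strip()  # only the answer is exposed
        messages.append({"role": "assistant", "content": turn})
        messages.append({"role": "user", "content": "Keep going."})
    return hidden_trace[-1]
```

Even this naive loop gets several attempts at the problem before it has to commit to an answer, which is exactly the advantage a single-turn gpt-4o call does not get in OpenAI's comparison.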
Kudos to Omar Khattab for this mostly non-tongue-in-cheek tweet recognizing this. Reading the sparse details OpenAI has made available, o1 is not dspy, but I will bet that with a few tweaks, dspy (another multi-turn system) can get close to o1's results.
My Prediction: It will not take more than a couple of months for the o1 results to be bested by future publicly available models from Meta, Google, or Mistral.
o1 is only for “Advanced Reasoning”. But how do you know when to invoke it?
According to OpenAI documentation, unlike the previous GPT family of models, o1 does not replace its predecessor.
o1 models offer significant advancements in reasoning, but they are not intended to replace GPT-4o in all use-cases. (bold emphasis by OpenAI)
If you use it, you will need a model router to decide which requests to send to this model and which to gpt-4o. Even “tier 5” customers spending $50,000 or more monthly get API access at only 20 requests per minute (RPM).
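Here is a minimal sketch of what such a router could look like. The cheap classifier call and the YES/NO heuristic are my illustration, not an OpenAI recommendation; keyword rules or a small fine-tuned classifier would work just as well.

```python
# A minimal sketch of routing requests between o1-preview and gpt-4o.
# The routing heuristic (a cheap gpt-4o-mini classification call) is an
# illustrative assumption, not an official recommendation.
from openai import OpenAI

client = OpenAI()

def pick_model(prompt: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Does the following request need multi-step math, code, or "
                       f"logical reasoning? Answer YES or NO only.\n\n{prompt}",
        }],
    ).choices[0].message.content.strip().upper()
    return "o1-preview" if verdict.startswith("YES") else "gpt-4o"

def answer(prompt: str) -> str:
    model = pick_model(prompt)  # reserve the scarce 20 RPM budget for hard cases
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content
```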
No Function Calls
What’s worse is:
For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice.
Function calling (tool calling) and complex reasoning tasks go hand in hand. It makes no sense to assume that advanced reasoning problems will not require function calling.
My prediction: OpenAI will add function calling to these models. They are buying time for now.
Bonkers Pricing Strategy
OpenAI will be the first AI company to introduce “hidden fees” for model/system access:
You have no visibility into the thought tokens, yet you must pay for them. So, if you make queries like Aaron's, you will still pay through the nose even when the model throws in the towel, and you will never get to see what you are paying for.
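A back-of-the-envelope sketch of what that means, assuming o1-preview's launch pricing of roughly $15 per million input tokens and $60 per million output tokens, and assuming hidden reasoning tokens are billed at the output rate; the token counts in the example are made up.

```python
# Rough cost of a single o1-preview call, assuming launch pricing of
# $15 / 1M input tokens and $60 / 1M output tokens, and that the hidden
# reasoning tokens are billed at the output rate. Token counts are invented.
INPUT_RATE = 15 / 1_000_000
OUTPUT_RATE = 60 / 1_000_000

def estimate_cost(prompt_tokens: int, reasoning_tokens: int, visible_tokens: int) -> float:
    billed_output = reasoning_tokens + visible_tokens  # you pay for both
    return prompt_tokens * INPUT_RATE + billed_output * OUTPUT_RATE

# A 1k-token prompt that burns 20k hidden reasoning tokens before a 300-token answer:
print(f"${estimate_cost(1_000, 20_000, 300):.2f}")  # ~$1.23, mostly for tokens you never see
```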
Obviously, OpenAI is pushing the envelope and trying to see what they can get away with. My prediction: This is not going to stick.
Brittle Context Handling
OpenAI cautions:
Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.
This means that, as a developer, you have to do extra work selecting what goes into the context before the request ever reaches the model.
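A minimal sketch of what that extra selection step could look like, assuming you already have retrieved chunks in hand: keep only the top-k chunks by similarity to the query instead of stuffing everything into the prompt. The embedding model choice and the k=3 budget are illustrative assumptions.

```python
# A minimal sketch of trimming RAG context before an o1 call: rank retrieved
# chunks by cosine similarity to the query and keep only the top k.
# The embedding model and k=3 budget are illustrative choices.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def select_context(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed([query])[0]
    c = embed(chunks)
    scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]  # most relevant first
```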
Hidden Thought Tokens
The ostensible reason behind hiding thought tokens (aka reflection tokens in the literature) is “safety”. I am cynical enough to doubt that: OpenAI has played the safety card since GPT-2 to justify not disclosing anything that might jeopardize its position, first weights, then papers, and now intermediate model outputs. Revealing the traces (chains of thought) could let competitors figure out their methods. This is a weak defense, and I predict Meta, Google, and Mistral will release similar-capability models/systems with visible thought tokens, pulling OpenAI into a game-theoretic loop where they are either forced to reveal thought tokens (and charge for them) like the rest of the world, or to keep hiding their precious thought tokens (and, as a bargain, not charge for them). The latter path means more burn, a path intimately familiar to OpenAI, and also a path to being rendered irrelevant by open models.
Why now?
All this raises the question: why this release, and why now? Llama has seriously dented OpenAI's enterprise growth. Google has announced increasingly better and cheaper products around Gemini and Gemma. Mistral has been taunting them from Paris since its inception and continues to do so with recent flagship releases like Pixtral. OpenAI is raising now, and they cannot afford not to be the central character.
But I think there is a more strategic product reason: this is purely to gather data, collecting complex reasoning prompts and their traces on your dime. Calling these “preview models” creates buzz; restricting usage to 30 requests a week incentivizes users to save them for hard problems.
If I were at Google or elsewhere, I would encourage a similar strategy to collect challenging problems. I am grateful that Meta, Google, and other orgs have chosen not to go down OpenAI's closed-door path and instead release open-source models. Still, if we want to avoid creating ugly monopolies, I encourage them to build low-friction tools like ChatGPT that collect high-quality input/output pairs for training.