Julia has two sisters and one brother. How many sisters does her brother Martin have?
Solving this tiny puzzle requires a bit of thinking. You might mentally picture the family of three girls and one boy and then realize that the boy has three sisters. Or you might figure out a more general rule: Any boy in the family will have one more sister than any girl. In other words, the answer to such a puzzle isn’t something you immediately know, like the fact that Paris is the capital of France; it requires reasoning, a central feature of human intelligence, and one that large language models (LLMs) like GPT-4, for all their impressive behavior, struggle with.
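That general rule is simple enough to spell out in a few lines of code. The sketch below is purely illustrative; the helper function is a made-up name of my own, not anything a language model actually runs.

```python
# Illustrative sketch of the sibling arithmetic behind the puzzle.
def sisters_of(is_boy: bool, num_girls: int) -> int:
    """How many sisters one child has in a family with num_girls girls.

    A boy's sisters are all the girls; a girl's sisters are all the girls
    except herself -- so any boy has exactly one more sister than any girl.
    """
    return num_girls if is_boy else num_girls - 1

num_girls = 2 + 1  # Julia plus her two sisters
assert sisters_of(is_boy=False, num_girls=num_girls) == 2  # Julia's own sister count
print(sisters_of(is_boy=True, num_girls=num_girls))        # Martin's sister count -> 3
```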
I gave this puzzle several times to the most recent version of GPT-4 (4o), and it gave me the wrong answer each time, asserting that Martin has two sisters. Indeed, several systematic studies have shown that GPT-4 and similar models lack robust reasoning abilities, which limits their usefulness in many domains. Prompting tricks, such as adding “Let’s think step by step” to the end of a query to encourage better reasoning, can improve LLMs’ problem-solving abilities in some cases, but such tricks haven’t worked well enough to turn these systems into robust reasoners.
To remedy this, many of the top artificial intelligence (AI) companies have recently created new kinds of AI systems, often called large reasoning models (LRMs), that are specially trained to perform certain types of reasoning. These include OpenAI’s o1 and o3 models, Anthropic’s Claude with “extended thinking mode,” Google’s Gemini 2.0 Flash Thinking Experimental, and the Chinese company DeepSeek’s R1 model.
If you give an LRM a puzzle like the one above, it will generate a so-called “chain of thought”: natural language statements describing a plausible-sounding reasoning process in a step-by-step fashion. For example, I gave the “sisters” puzzle to OpenAI’s o1 model. Unlike GPT-4o, the o1 model gives me its chain of thought to arrive at the (correct) answer:
1. Julia has two sisters. That means there are three girls in total (Julia + two more).
2. Julia also has one brother, named Martin.
3. Altogether, there are four siblings: three girls and one boy (Martin).
4. From Martin’s perspective, his sisters are all three of the girls (Julia and her two sisters).
5. Therefore, Martin has three sisters.
This sounds very much like a human thinking out loud, and indeed o1 and other LRMs are trained on many examples of human-generated chains of thought. Being able to generate such reasoning steps has enabled LRMs to reach new performance heights on difficult science, math, and coding benchmarks. For example, OpenAI reported that its o1 model placed in the top 500 students in the US in a qualifier for the USA Math Olympiad, and exceeded the accuracy of PhD scientists on a benchmark of physics, biology, and chemistry problems. Other LRMs have attained similar benchmark performances.
Some companies are betting big on LRMs as the basis for commercially lucrative AI assistants. OpenAI, for example, has released its best LRMs and its associated “Deep Research” tool to subscribers paying $200 per month, and is said to be considering charging up to $20,000 per month for reasoning models that can carry out “PhD-level research.”
But some researchers are questioning all the excitement over LRMs, and asking if these models are, as one headline put it, “really thinking and reasoning, or just pretending to?” That is, does their chain-of-thought training enable them to reason in a general and robust way, or are they succeeding on certain narrowly defined benchmarks by merely mimicking the human reasoning they were trained on?
I’ll say more about these questions later, but first, I’ll sketch how these models work, and how they are trained.
An LRM is built on top of a pretrained “base model,” an LLM such as GPT-4o. In the case of DeepSeek, the base model was their own pretrained LLM called V3. (The naming of AI models can get very confusing.) These base models have been trained on huge amounts of human-generated text, where the training objective is to predict the next token (i.e., word or word-part) in a text sequence.
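To make that training objective concrete, here is a toy sketch of next-token prediction. The probabilities and the loss function below are illustrative stand-ins for what real pretraining pipelines compute over billions of tokens, not actual model code.

```python
import math

# Toy sketch of the next-token-prediction objective used to pretrain base models:
# given a prefix, the model should assign high probability to the token that
# actually comes next; the training loss is -log of that probability.
def next_token_loss(predicted_probs: dict, actual_next_token: str) -> float:
    return -math.log(predicted_probs.get(actual_next_token, 1e-9))

# Made-up probabilities for the token following "Paris is the capital of":
probs = {"France": 0.90, "Europe": 0.05, "fashion": 0.05}
print(next_token_loss(probs, "France"))   # small loss: the model predicted well
print(next_token_loss(probs, "Germany"))  # large loss: the true token got ~zero probability
```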
The base model is then “post-trained”—that is, further trained with a different objective: to generate chains of thought, such as the one o1 produced for the “sisters” puzzle. After this special training, when given a problem, the LRM does not simply produce an answer directly; it first generates an entire chain of thought, which can be very long. Unlike, say, GPT-4o, which generates a relatively small number of tokens when given a problem to solve, models like o1 can generate hundreds to thousands of chain-of-thought steps, sometimes totaling hundreds of thousands of generated tokens (most of which are not revealed to the user). And because customers using these models at a large scale are charged by the token, this can get quite expensive.
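A back-of-the-envelope calculation shows why. The per-token price and token counts in the sketch below are assumptions of mine for illustration, not any provider’s actual rates.

```python
# Rough cost of one query when hidden chain-of-thought tokens are billed as output.
PRICE_PER_MILLION_OUTPUT_TOKENS = 60.00  # dollars -- an assumed, illustrative rate

def query_cost(reasoning_tokens: int, answer_tokens: int) -> float:
    total_output_tokens = reasoning_tokens + answer_tokens
    return total_output_tokens / 1_000_000 * PRICE_PER_MILLION_OUTPUT_TOKENS

# A short visible answer preceded by ~100,000 hidden chain-of-thought tokens:
print(f"${query_cost(reasoning_tokens=100_000, answer_tokens=500):.2f}")  # about $6 per query
```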
Thus, an LRM does substantially more computation than an LLM to generate an answer. This computation might involve generating many different possible chains of thought, using another AI model to rate each one and returning the one with the highest rating, or doing a more sophisticated kind of search through possibilities, akin to the “lookahead” search that chess- or Go-playing programs do to figure out a good move. When using a model such as o1, these computations happen behind the scenes; the user sees only a summary of the chain-of-thought steps generated.
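The first of those strategies, often called best-of-N sampling, can be sketched in a few lines. In the sketch below, generate_chain and rate_chain are hypothetical placeholders for calls to a reasoning model and a separate scoring model; they are not real API functions.

```python
from typing import Callable, List, Tuple

def best_of_n(problem: str,
              generate_chain: Callable[[str], str],
              rate_chain: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidate chains of thought and keep the highest-rated one."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        chain = generate_chain(problem)      # one sampled chain of thought
        score = rate_chain(problem, chain)   # scored by a second model
        candidates.append((chain, score))
    return max(candidates, key=lambda pair: pair[1])
```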
To accomplish all this, the post-training of LRMs typically uses two machine-learning methods: “supervised learning” and “reinforcement learning.” Supervised learning might involve training the LRMs on human-generated reasoning steps (created by highly paid human experts), or on chains of thought generated by another AI model, where each step is graded by humans or by yet another AI model.
Reinforcement learning, by contrast, does not require that kind of step-level supervision: the LRM itself generates an entire set of reasoning steps leading to an answer, and the model is “rewarded” only for getting the right answer and for putting the reasoning steps in the right format for human readability (e.g., numbering them sequentially). The power of reinforcement learning over huge numbers of trials is that the model can learn which kinds of steps work and which do not, even though it is not given any (expensive) supervised feedback about the quality of the steps. Notably, the 2024 Turing Award—computer science’s most prestigious prize—was awarded in 2025 to two researchers who helped develop the basic reinforcement learning methods that are now used to train LRMs.
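A toy version of such a reward signal might look like the sketch below, which scores only the final answer and the numbering format, not the quality of the individual steps. Both functions are illustrative placeholders of my own, not any company’s actual training code.

```python
import re

def format_ok(chain_of_thought: str) -> bool:
    """Crude readability check: are the steps numbered sequentially (1., 2., 3., ...)?"""
    steps = [int(m) for m in re.findall(r"^(\d+)\.", chain_of_thought, re.MULTILINE)]
    return len(steps) > 0 and steps == list(range(1, len(steps) + 1))

def reward(chain_of_thought: str, model_answer: str, correct_answer: str) -> float:
    """Reward the outcome and the format -- nothing about individual reasoning steps."""
    r = 1.0 if model_answer.strip() == correct_answer.strip() else 0.0  # right answer
    if format_ok(chain_of_thought):
        r += 0.1                                                        # small format bonus
    return r
```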
Interestingly, DeepSeek showed that reinforcement learning methods, without any supervised learning at all, produced a model that did very well on many reasoning benchmarks. As the DeepSeek researchers put it, this result “underscores the power and beauty of reinforcement learning: Rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies.” The focus on reinforcement learning rather than supervised learning was one of the factors that enabled DeepSeek to create an LRM whose training and use are much cheaper than those of corresponding LRMs created by US companies.
There has been substantial debate in the AI community on whether LRMs are “genuinely reasoning” or “merely mimicking” the kinds of human reasoning found in the pretraining or post-training data.
One industry blog called o1 “the first example of a model with true general reasoning capabilities.” Others were more skeptical. Philosopher Shannon Vallor called LRMs’ chain-of-thought processes “a kind of meta-mimicry”; that is, these systems generate plausible-sounding reasoning traces that mimic the human “thinking out loud” sequences they have been trained on, but don’t necessarily perform robust, general problem-solving.
Of course, it’s not clear what “genuinely reasoning” even means. “Reasoning” is an umbrella term for many different types of cognitive problem-solving processes; humans use a multiplicity of strategies, including relying on memorized steps, specific heuristics (“rules of thumb”), analogies to past solutions, and sometimes even genuine deductive logic.
In LRMs, the term “reasoning” seems to be equated with generating plausible-sounding natural-language steps toward solving a problem, and the extent to which this provides general and interpretable problem-solving abilities is still an open question. The performance of these models on math, science, and coding benchmarks is undeniably impressive. However, the overall robustness of their performance remains largely untested, especially on reasoning tasks that, unlike those the models were tested on, don’t have clear answers or cleanly defined solution steps. That is the case for many, if not most, real-world problems, not to mention “fixing the climate, establishing a space colony, and the discovery of all of physics,” which are achievements OpenAI’s Sam Altman expects from AI in the future. And although LRMs’ chains of thought are touted for their “human interpretability,” it remains to be determined how faithfully these generated natural-language “thoughts” represent what is actually going on inside the neural network as it solves a problem.
Multiple studies (carried out before the advent of LRMs) have shown that when LLMs generate explanations for their reasoning, the explanations are not always faithful to what the model is actually doing.
Moreover, the anthropomorphic language used in these models may mislead users into trusting them too much. The problem-solving steps that LRMs generate are often referred to as “thoughts”; the models themselves tell us that they are “thinking”; some models even intersperse reasoning steps with words like “Hmmm,” “Aha!” or “Wait!” to make them sound more humanlike. According to an OpenAI spokesperson, “Users have told us that understanding how the model reasons through a response not only supports more informed decision-making but also helps build trust in its answers.” But the question is, are users building trust based mainly on these humanlike touches, when the underlying model is less than trustworthy?
More research is needed to answer these important questions of robustness, trustworthiness, and interpretability in LRMs. Such research is difficult to do on models such as those of OpenAI, Google, and Anthropic, because these companies do not release their models or many details of their workings. Refreshingly, DeepSeek has released the model weights of R1, published a detailed report on how it is trained, and enabled the system to fully show its chains of thought, which will facilitate research on its capabilities. Hopefully this will inspire other AI companies to be equally open about their creations.