LLMs Can’t Calculate: Why You Should Use Tools for Math
Large Language Models like GPT have amazed the world with their ability to write, code, and reason. But ask one to divide 51234 by 6, and you might be surprised by the answer, because it may well be wrong.
Why? Because LLMs don’t actually calculate anything. They’re not calculators or symbolic math engines. They’re probabilistic models that predict the next token in a sequence, based on patterns learned from massive amounts of text.
This blog post explores why LLMs are bad at math — and how you can use LangChain’s “PythonREPL” tool to make them great at it. I'll walk through real examples, show you how to set it up, and explain how this simple tool transforms LLMs into powerful, math-savvy agents.
Why Do LLMs Struggle with Math?
Despite their impressive abilities in language, summarization, and even basic reasoning, Large Language Models (LLMs) have a fundamental weakness: they’re not good at math.
And the reason is simple, but often misunderstood: LLMs don’t actually calculate — they predict.
At their core, LLMs are trained to do one thing: predict the next token in a sequence, based on the context of previous tokens. When you ask a question like:
“Divide 51234 by 6”
The model doesn’t “compute” this like a calculator. Instead, it generates the digits of an answer one token at a time, guessing what the answer should look like based on statistical patterns learned from similar text in its training data.
Sometimes it gets the answer right, and sometimes it does not.
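For contrast, here is what actually computing the example looks like. This is a minimal sketch in plain Python; the helper name `check_division` and the wrong guess `8529` are hypothetical and only there to illustrate how a model's claimed answer can be verified.

```python
import math

def check_division(dividend: int, divisor: int, claimed: float) -> bool:
    """Return True if `claimed` matches dividend / divisor."""
    return math.isclose(claimed, dividend / divisor)

# Exact integer division: quotient and remainder
quotient, remainder = divmod(51234, 6)
print(quotient, remainder)             # 8539 0

# Verify a claimed answer instead of trusting it
print(check_division(51234, 6, 8539))  # True
print(check_division(51234, 6, 8529))  # False (a hypothetical, plausible-looking wrong guess)
```

The interpreter gives the same result every time, which is exactly the property a token predictor cannot guarantee.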
LLMs Need Tools to Handle Math Reliably
This is why many modern LLM architectures are now integrating external tools — like Python interpreters, Wolfram Alpha, or code execution environments — to offload the actual computation.
One such tool is LangChain’s PythonREPL, which allows the model to delegate the math to Python, get the correct result, and then present it as part of its response.
In the next section, we’ll look at how PythonREPL works and how it can be seamlessly integrated into your AI-powered workflows.
What Is LangChain’s PythonREPL?
LangChain is an open-source framework that enables LLMs to interact with external tools like Python interpreters, databases, search engines, and more. One such tool is PythonREPL, which gives LLMs access to a live Python execution environment.
PythonREPL is a LangChain tool that acts as a bridge between the LLM and the Python runtime. It allows the language model to:
- Write Python code as part of its reasoning
- Execute that code in a real Python interpreter (note that it runs on the host machine and is not sandboxed by default)
- Use the output to continue or revise its answer
In short, it transforms your LLM from a token guesser into a code-powered problem solver.
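To make the idea concrete, here is a simplified stand-in for what such a tool does: take a string of Python code, run it, and hand back whatever it printed. This is a sketch using only the standard library, not LangChain's actual implementation; the function name `run_python` is an assumption made for illustration.

```python
import contextlib
import io

def run_python(code: str) -> str:
    """Execute a string of Python code and return what it printed.

    Errors are returned as text so the caller (for example an agent)
    can read them and try again.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # runs with full host privileges, no sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()

result = run_python("print(51234 / 6)")
print(result)                 # 8539.0
print(run_python("1 / 0"))    # ZeroDivisionError: division by zero
```

Because only printed output is captured, the code must call `print(...)` on the value it wants to report, which is the same convention the tool description later in this post asks the model to follow.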
Setting Up LangChain with PythonREPL and Ollama (Internal LLM)
Refer to my previous blog post for how to set up an internal LLM and expose its API.
Step 1: Install Dependencies
First, make sure you have the necessary Python packages installed:
pip install langchain langchain-community langchain-experimental langchain-core
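Step 2: Load Your Internal LLM (via Ollama)
The model is loaded with ChatOllama, pointed at the URL of your Ollama server. The code for this step is in the full code listing below.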
Step 3: Add the PythonREPL Tool
python_repl = PythonREPL()
repl_tool = Tool(
    name="python_repl",
    description="A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.",
    func=python_repl.run,
)
Step 4: Create a Math Agent
# Create the agent
agent = initialize_agent(
    tools=[repl_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,  # Set to False if you don't want to see the thoughts and actions
)
Now our math agent is ready. Let's run the whole code and see the response.
I gave the same task “Divide 51234 by 6” but this time, using the LangChain framework. The LLM delegated the task to our Math Agent, which in turn executed the calculation using the PythonREPL tool and returned the correct answer.
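Note that PythonREPL executes whatever code the model generates on the machine running the agent, so read LangChain’s security guidance before using it beyond experiments: https://python.langchain.com/docs/security/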
Here’s a step-by-step breakdown of how the flow works:
- Input: You ask a math-related question.
- Agent Reasoning: The LLM decides it needs to use the PythonREPL tool.
- Tool Execution: It generates Python code and passes it to the REPL.
- Output Retrieval: The result is fetched and returned to the agent.
- Final Answer: The agent uses the result to respond accurately.
Note: If you set verbose=True, you may see several iterations before the final answer. This happens because an agent created with AgentType.ZERO_SHOT_REACT_DESCRIPTION runs an internal reasoning loop that looks like this:
Thought → Action → Observation → (repeat) → Final Answer
For more details, refer to the paper ReAct: Synergizing Reasoning and Acting in Language Models: https://arxiv.org/abs/2210.03629
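The Thought → Action → Observation loop can be sketched in a few lines of plain Python. This is a toy illustration, not LangChain's agent code: `scripted_model` is a hypothetical stand-in that returns canned text where a real LLM would generate it, and `run_python` is a simplified substitute for the PythonREPL tool.

```python
import contextlib
import io
import re

def run_python(code: str) -> str:
    """Simplified stand-in for the REPL tool: run code, return printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

def scripted_model(transcript: str) -> str:
    """Hypothetical stand-in for an LLM: returns canned ReAct-style text."""
    if "Observation:" not in transcript:
        return ("Thought: I should compute this with Python.\n"
                "Action: python_repl\n"
                "Action Input: print(51234 / 6)")
    observation = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I now know the answer.\nFinal Answer: {observation}"

def react_loop(question: str, max_steps: int = 5):
    """Repeat Thought -> Action -> Observation until a Final Answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = scripted_model(transcript)
        transcript += "\n" + reply
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            return final.group(1).strip()
        action_input = re.search(r"Action Input:\s*(.*)", reply, re.DOTALL)
        if action_input is None:
            break  # the model produced neither an action nor an answer
        observation = run_python(action_input.group(1))
        transcript += f"\nObservation: {observation}"
    return None

answer = react_loop("Divide 51234 by 6")
print(answer)  # 8539.0
```

With a real model in place of `scripted_model`, each pass through the loop is one of the iterations you see in the verbose output.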
Full code for your reference
# Step 1: Import all required libraries
from langchain_community.chat_models import ChatOllama
from langchain_core.tools import Tool
from langchain_experimental.utilities import PythonREPL
from langchain.agents import initialize_agent
from langchain.agents.agent_types import AgentType
# Step 2: Load your internal LLM (via Ollama)
# Define your remote Ollama server URL
OLLAMA_API_BASE = "http://xx.1xx.1xx.1xx:11434"  # Replace with your actual server IP
# Load the Llama model hosted on Ollama
llm = ChatOllama(
    base_url=OLLAMA_API_BASE,  # API endpoint of the Ollama server
    model="llama3.1:8b",
    temperature=0.7,
)
# Step 3: Add the PythonREPL tool
python_repl = PythonREPL()
repl_tool = Tool(
    name="python_repl",
    description="A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.",
    func=python_repl.run,
)
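# Step 4: Create a math agent
agent = initialize_agent(
    tools=[repl_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)
# Run the agent on the example question and print its final answer
response = agent.invoke({"input": "Divide 51234 by 6"})
print(response["output"])
Conclusion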
While Large Language Models (LLMs) are powerful for understanding math problems and suggesting solutions, they are not inherently designed to calculate with precision. Every number they generate is a prediction, not the result of actual computation. This can lead to subtle but critical errors, especially in tasks involving:
- Complex arithmetic
- Floating-point precision
- Domain-specific math logic (like logs, matrix ops, or financial formulas)
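The categories above are easy to get exactly right once the work is handed to Python. The snippet below is a small sketch using only the standard library; the loan figures (a principal of 1000 at 5% for 10 years) are made-up numbers for illustration.

```python
import math
from decimal import Decimal

# Floating-point precision: binary floats cannot represent 0.1 exactly,
# while Decimal works in base 10.
float_sum_matches = (0.1 + 0.2 == 0.3)
decimal_sum_matches = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(float_sum_matches, decimal_sum_matches)   # False True

# Logarithms: computed, not guessed
log_value = math.log2(1024)
print(log_value)                                # 10.0

# Financial formula: compound interest, amount = principal * (1 + rate) ** years
principal = Decimal("1000")   # hypothetical principal
rate = Decimal("0.05")        # hypothetical annual rate
years = 10
amount = (principal * (1 + rate) ** years).quantize(Decimal("0.01"))
print(amount)                                   # 1628.89
```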
By integrating LangChain’s PythonREPL tool, we bring the best of both worlds:
- The reasoning and natural language understanding of LLMs
- The mathematical accuracy of Python’s computation engine
This hybrid approach not only improves the reliability of results but also enhances the transparency of the reasoning process through Thought → Action → Observation chains, as described in the ReAct paper.
It’s important to acknowledge that recent advancements have led to LLMs specifically optimized for math and code.
These models incorporate built-in computation or are fine-tuned on high-quality math datasets, which often reduces hallucinations and improves symbolic reasoning.
However, even with these advances:
- No model is perfect at math without grounding in an execution environment.
- Using tools like LangChain’s PythonREPL or Wolfram Alpha still provides a layer of reliability for critical tasks.
So, while the ecosystem is evolving rapidly, the LLM + PythonREPL combo remains a practical, transparent, and extensible approach for doing math with confidence.
Thanks for reading! If you found it relevant, don’t forget to leave a clap.
References:
[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset
[2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[2302.04761] Toolformer: Language Models Can Teach Themselves to Use Tools
[2210.03629] ReAct: Synergizing Reasoning and Acting in Language Models