LLMs Can’t Calculate: Why You Should Use Tools for Math

Apr 12, 2025

Large Language Models like GPT have amazed the world with their ability to write, code, and reason. But ask one to divide 51234 by 6, and you might be surprised by the answer, because it may well be wrong.

Why? Because LLMs don’t actually calculate anything. They’re not calculators or symbolic math engines. They’re probabilistic models that predict the next token in a sequence based on massive text data.

This blog post explores why LLMs are bad at math — and how you can use LangChain’s “PythonREPL” tool to make them great at it. I'll walk through real examples, show you how to set it up, and explain how this simple tool transforms LLMs into powerful, math-savvy agents.

Why Do LLMs Struggle with Math?

Despite their impressive abilities in language, summarization, and even basic reasoning, Large Language Models (LLMs) have a fundamental weakness: they’re not good at math.

And the reason is simple, but often misunderstood: LLMs don’t actually calculate — they predict.

At their core, LLMs are trained to do one thing: predict the next token in a sequence, based on the context of previous tokens. When you ask a question like:

“Divide 51234 by 6”

The model doesn’t “compute” this like a calculator. It has no access to its training data at answer time and runs no division algorithm. Instead, it emits the digits that are statistically most likely, given the patterns it learned from similar text during training.

Sometimes it gets it right, sometimes not.
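For comparison, here is what actually computing the answer looks like. This is a minimal sketch in plain Python (the variable names are only illustrative): the interpreter performs the division deterministically, and multiplying back confirms the result.

```python
# Deterministic arithmetic: the interpreter computes, it does not guess.
dividend, divisor = 51234, 6

quotient, remainder = divmod(dividend, divisor)  # exact integer division
print(quotient, remainder)  # 8539 0
print(dividend / divisor)   # 8539.0

# Multiplying back confirms the result.
assert quotient * divisor + remainder == dividend
```

Running this always gives the same answer, which is exactly the property a next-token predictor cannot guarantee.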

LLMs Need Tools to Handle Math Reliably

This is why many modern LLM-based systems now integrate external tools, such as Python interpreters, Wolfram Alpha, or other code execution environments, to offload the actual computation.

One such tool is LangChain’s PythonREPL, which allows the model to delegate the math to Python, get the correct result, and then present it as part of its response.

In the next section, we’ll look at how PythonREPL works and how it can be seamlessly integrated into your AI-powered workflows.

What Is LangChain’s PythonREPL?

LangChain is an open-source framework that enables LLMs to interact with external tools like Python interpreters, databases, search engines, and more. One such tool is PythonREPL, which gives LLMs access to a live Python execution environment.

PythonREPL is a LangChain tool that acts as a bridge between the LLM and the Python runtime. It allows the language model to:

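Write Python code as part of its reasoning
Execute that code in a live Python interpreter
Use the output to continue or revise its answer

Note that PythonREPL is not sandboxed on its own: the generated code runs directly on the host machine, so use it inside an isolated environment.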

In short, it transforms your LLM from a token guesser into a code-powered problem solver.
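To make the idea concrete, here is a simplified stand-in for such a tool, written with only the standard library. It is not LangChain's actual implementation, and `run_python` is a hypothetical name chosen for this sketch: it executes a code string and returns whatever the code printed, which is the text the model would read back.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a code string and return what it printed (or the error)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # runs on the host machine: NOT sandboxed
    except Exception as exc:
        # Return the error as text so the caller can react to it.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


print(run_python("print(51234 / 6)"))  # 8539.0
```

Because only printed output is captured, the generated code has to call `print(...)` on the value it wants to see, which is why the tool description used later asks the model to do so.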

Setting Up LangChain with PythonREPL and Ollama (Internal LLM)

Refer to my previous blog post for how to set up an internal LLM and expose its API:

How to Deploy & Access LLMs on a Remote Server Using APIs (Python)

Step 1: Install Dependencies

First, make sure you have the necessary Python packages installed (langchain-community is needed for the ChatOllama import used below):

pip install langchain langchain-core langchain-community langchain-experimental

Step 2: Load Your Internal LLM (via Ollama)

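from langchain_community.chat_models import ChatOllama

# Define your remote Ollama server URL
OLLAMA_API_BASE = "http://xx.1xx.1xx.1xx:11434"  # Replace with your actual server IP

# Load the LLaMA model hosted on Ollama
llm = ChatOllama(
    base_url=OLLAMA_API_BASE,  # API endpoint of the Ollama server
    model="llama3.1:8b",
    temperature=0.7
)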

Step 3: Add the PythonREPL Tool


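from langchain_core.tools import Tool
from langchain_experimental.utilities import PythonREPL

# PythonREPL runs the code string it is given and returns what was printed
python_repl = PythonREPL()

# Wrap the REPL as a named tool that the agent can choose to call
repl_tool = Tool(
    name="python_repl",
    description="A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.",
    func=python_repl.run,
)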

Step 4: Create a Math-Agent

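from langchain.agents import initialize_agent
from langchain.agents.agent_types import AgentType

# 4. Create agent
agent = initialize_agent(
    tools=[repl_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,  # Set to False if you don't want to see the thoughts and actions
)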

Now our math agent is ready. Let's run the whole code and see the response.

I gave the same task “Divide 51234 by 6” but this time, using the LangChain framework. The LLM delegated the task to our Math Agent, which in turn executed the calculation using the PythonREPL tool and returned the correct answer.

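For more on the tool, see the Python REPL documentation: https://python.langchain.com/docs/integrations/tools/python/

For guidance on running it safely, see LangChain's security page: https://python.langchain.com/docs/security/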

Here’s a step-by-step breakdown of how the flow works:

Input: You ask a math-related question.
Agent Reasoning: The LLM decides it needs to use the PythonREPL tool.
Tool Execution: It generates Python code and passes it to the REPL.
Output Retrieval: The result is fetched and returned to the agent.
Final Answer: The agent uses the result to respond accurately.

Note: if you set verbose=True, you may observe multiple iterations of the calculation process. This occurs because LangChain’s agent with AgentType.ZERO_SHOT_REACT_DESCRIPTION uses an internal reasoning loop that looks like this:

Thought → Action → Observation → (repeat) → Final Answer

For more details, refer to the paper “ReAct: Synergizing Reasoning and Acting in Language Models”: https://arxiv.org/abs/2210.03629
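The loop can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LangChain's agent code: `scripted_llm` is a hypothetical stand-in that replies from a fixed script instead of calling a real model, and `run_python` and `react_loop` are names invented for this example.

```python
import contextlib
import io
import re


def run_python(code: str) -> str:
    """Tool: execute a code string and return its printed output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # not sandboxed; fine for a local demo only
    return buffer.getvalue().strip()


def scripted_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model: replies from a fixed script."""
    if "Observation:" not in prompt:
        return (
            "Thought: I should compute this with Python.\n"
            "Action: python_repl\n"
            "Action Input: print(51234 / 6)"
        )
    observation = prompt.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I now know the answer.\nFinal Answer: {observation}"


def react_loop(question: str, max_steps: int = 5) -> str:
    """Repeat Thought -> Action -> Observation until a Final Answer appears."""
    prompt = f"Question: {question}"
    for _ in range(max_steps):
        reply = scripted_llm(prompt)
        final = re.search(r"Final Answer:\s*(.*)", reply)
        if final:
            return final.group(1)
        # Extract the code the "model" wants to run and execute it.
        code = re.search(r"Action Input:\s*(.*)", reply, re.DOTALL).group(1)
        observation = run_python(code)
        # Feed the result back so the next reply can use it.
        prompt += f"\n{reply}\nObservation: {observation}"
    raise RuntimeError("No final answer within max_steps")


print(react_loop("Divide 51234 by 6"))  # 8539.0
```

With a real model, each pass through the loop is another LLM call, which is why the verbose output can show several Thought and Action steps before the final answer.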

Full code for your reference

# Step 1: Import all required libraries
from langchain_community.chat_models import ChatOllama
from langchain_core.tools import Tool
from langchain_experimental.utilities import PythonREPL
from langchain.agents import initialize_agent
from langchain.agents.agent_types import AgentType

# Step 2: Load your internal LLM (via Ollama)
# Define your remote Ollama server URL
OLLAMA_API_BASE = "http://xx.1xx.1xx.1xx:11434"  # Replace with your actual server IP

# Load the LLaMA model hosted on Ollama
llm = ChatOllama(
    base_url=OLLAMA_API_BASE,  # API endpoint of the Ollama server
    model="llama3.1:8b",
    temperature=0.7
)

# Step 3: Add the PythonREPL tool
python_repl = PythonREPL()

repl_tool = Tool(
    name="python_repl",
    description="A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.",
    func=python_repl.run,
)

# Step 4: Create a math agent
agent = initialize_agent(
    tools=[repl_tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Step 5: Ask the agent the same question as before
response = agent.invoke({"input": "Divide 51234 by 6"})
print(response["output"])

Conclusion

While Large Language Models (LLMs) are powerful for understanding math problems and suggesting solutions, they are not inherently designed to calculate with precision. Every number they generate is a prediction, not the result of actual computation. This can lead to subtle but critical errors, especially in tasks involving:

Complex arithmetic
Floating-point precision
Domain-specific math logic (like logs, matrix ops, or financial formulas)
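Each of these is easy to get exactly right once the work is handed to Python. The snippet below is a small illustration using only the standard library; the loan figures (principal, rate, compounding periods, years) are made-up values for the example.

```python
import math
from decimal import Decimal

# Floating-point precision: binary floats cannot represent 0.1 exactly.
print(0.1 + 0.2 == 0.3)                                   # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# Logarithms: exact for powers of two.
print(math.log2(1024))  # 10.0

# Financial formula: compound interest A = P * (1 + r/n) ** (n * t)
principal, rate, periods_per_year, years = 1000, 0.05, 12, 10
amount = principal * (1 + rate / periods_per_year) ** (periods_per_year * years)
print(round(amount, 2))
```

These are the kinds of results a model tends to approximate from memory, while the interpreter computes them from the formula every time.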

By integrating LangChain’s PythonREPL tool, we bring the best of both worlds:
The reasoning and natural language understanding of LLMs
The mathematical accuracy of Python’s computation engine

This hybrid approach not only improves the reliability of results but also enhances the transparency of the reasoning process through Thought → Action → Observation chains, as described in the ReAct paper.

It’s important to acknowledge that recent advancements have led to LLMs specifically optimized for math and code.

These models incorporate built-in computation or are fine-tuned on high-quality math datasets, which often reduces hallucinations and improves symbolic reasoning.

However, even with these advances:

No model is perfect at math without grounding in an execution environment.
Using tools like LangChain’s PythonREPL or Wolfram Alpha still provides a layer of reliability for critical tasks.

So, while the ecosystem is evolving rapidly, the LLM + PythonREPL combo remains a practical, transparent, and extensible approach for doing math with confidence.

Thanks for reading! If you found it relevant, don’t forget to leave a clap.

References:

[2103.03874] Measuring Mathematical Problem Solving With the MATH Dataset

[2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

[2302.04761] Toolformer: Language Models Can Teach Themselves to Use Tools

[2210.03629] ReAct: Synergizing Reasoning and Acting in Language Models, https://arxiv.org/abs/2210.03629

Python REPL | LangChain, https://python.langchain.com/docs/integrations/tools/python/

Written by Manu Madhavan
