13 Ways To Speed Up Python Loops With Minimal Effort
In this article, I cover a few simple ways to achieve 1.3x to 970x speedups of Python for loops with minimal effort. I have included code snippets for the baseline and improved versions of each.
TL;DR
- 970x speedup using map() [Tip #9]
- 131x speedup using filterfalse [Tip #12]
- 59x speedup using Memoization [Tip #10]
Quantifying Performance Improvements
A commonly used tool built into Python is the timeit module. It has both a Python interface and a command-line interface. The code snippets in the following sections use the former to measure the baseline and improved performance of the loops.
100K x 10
For each approach, a baseline is first established by running the function under test 100K times (loops) over 10 test runs, and then calculating the average time per loop (measured in nanoseconds, ns).
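A rough sketch of such a harness using the timeit Python interface (my own illustration, not the notebook's exact code; it assumes a function under test such as test_01_v0 from Tip #1 and a numbers input list):

import timeit

# Hypothetical harness: 10 test runs of 100K loops each,
# then report the average time per loop in nanoseconds.
numbers = list(range(100))
run_totals = timeit.repeat(
    "test_01_v0(numbers)",
    globals=globals(),
    number=100_000,  # loops per run
    repeat=10,       # test runs
)
avg_s_per_loop = (sum(run_totals) / len(run_totals)) / 100_000
print(f"{avg_s_per_loop * 1e9:.3f} ns per loop")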
A Few Simple Ways To Speed Up Python Loops
#1 List Comprehension
The humble list comprehension replaces an explicit for loop that builds up a list.
# Baseline version (Inefficient way)
# Calculating the power of numbers
# without using a list comprehension
def test_01_v0(numbers):
    output = []
    for n in numbers:
        output.append(n ** 2.5)
    return output

# Improved version
# (using a list comprehension)
def test_01_v1(numbers):
    return [n ** 2.5 for n in numbers]
# Summary Of Test Results
Baseline: 31.486 ns per loop
Improved: 13.019 ns per loop
% Improvement: 58.7 %
Speedup: 2.42x
2.4x speedup using List Comprehension
#2 Calculate Outside Loop
If you need to rely on the length of a list to iterate over, calculate it outside the for loop.
# Baseline version (Inefficient way)
# (length calculation inside the for loop)
def test_02_v0(numbers):
    output_list = []
    for i in range(len(numbers)):
        output_list.append(i * 2)
    return output_list

# Improved version
# (length calculation outside the for loop)
def test_02_v1(numbers):
    my_list_length = len(numbers)
    output_list = []
    for i in range(my_list_length):
        output_list.append(i * 2)
    return output_list
# Summary Of Test Results
Baseline: 112.135 ns per loop
Improved: 68.304 ns per loop
% Improvement: 39.1 %
Speedup: 1.64x
1.6x speedup by moving the list length calculation outside the for loop
#3 Use Sets
Use sets in scenarios where you would otherwise use nested for loops for comparisons.
def test_03_v0(list_1, list_2):
    # Baseline version (Inefficient way)
    # (nested lookups using a for loop)
    common_items = []
    for item in list_1:
        if item in list_2:
            common_items.append(item)
    return common_items

def test_03_v1(list_1, list_2):
    # Improved version
    # (sets replace the nested lookups)
    s_1 = set(list_1)
    s_2 = set(list_2)
    common_items = s_1.intersection(s_2)
    return common_items
# Summary Of Test Results
Baseline: 9047.078 ns per loop
Improved: 18.161 ns per loop
% Improvement: 99.8 %
Speedup: 498.17x
498x speedup using sets in scenarios where you'd otherwise use nested for loops for comparisons
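As a side note, the same intersection can be written more compactly with the & set operator, which is equivalent to the .intersection() call above:

list_1 = [1, 2, 3, 4]
list_2 = [3, 4, 5]
common_items = set(list_1) & set(list_2)  # {3, 4}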
#4 Skip Irrelevant
Avoid redundant computations, i.e. skip irrelevant iterations.
# Example of inefficient code used to find
# the first even square in a list of numbers
def function_do_something(numbers):
    for n in numbers:
        square = n * n
        if square % 2 == 0:
            return square
    return None  # No even square found

# Example of improved code that finds
# the result without redundant computations
def function_do_something_v1(numbers):
    even_numbers = [n for n in numbers if n % 2 == 0]
    for n in even_numbers:
        square = n * n
        return square
    return None  # No even square found
# Summary Of Test Results
Baseline: 16.912 ns per loop
Improved: 8.697 ns per loop
% Improvement: 48.6 %
Speedup: 1.94x
1.9x speedup
#5 Refactor Functions
In certain cases, directly incorporating the code of simple functions into loops can improve code compactness and execution speed.
# Example of inefficient code:
# a loop that calls the is_prime function n times.
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def test_05_v0(n):
    # Baseline version (Inefficient way)
    # (calls the is_prime function n times)
    count = 0
    for i in range(2, n + 1):
        if is_prime(i):
            count += 1
    return count

def test_05_v1(n):
    # Improved version
    # (inlines the logic of the is_prime function)
    count = 0
    for i in range(2, n + 1):
        if i <= 1:
            continue
        for j in range(2, int(i**0.5) + 1):
            if i % j == 0:
                break
        else:
            count += 1
    return count
# Summary Of Test Results
Baseline: 1271.188 ns per loop
Improved: 939.603 ns per loop
% Improvement: 26.1 %
Speedup: 1.35x
1.3x speedup
Key Takeaways
- Calling a function involves overhead, such as pushing and popping variables on the stack, function lookup, and argument passing.
- When a simple function is called repeatedly within a loop, the overhead of function calls can add up and impact performance.
- Inlining the function’s code directly into the loop can eliminate this overhead, potentially leading to significant speedups.
“Premature optimization is the root of all evil.”
– Sir Tony Hoare (popularized by Donald Knuth)
⚠️ Caution: Balance the need for code readability against how frequently the function is called.
#6 Avoid Pointless Checks
Where possible, avoid pointless checks, some of which might be redundant and slow down your code. For example,
for v in list_of_values:
    if v > 0:
        ...  # Do something for values greater than zero
        if v != 0:
            ...  # Do something for non-zero values
💸 The second if statement (if v != 0:) is redundant: any value that passes the first check (if v > 0:) is guaranteed to be non-zero, so the code within the second if block will always run whenever the code within the first if block runs.
A more complete example to help prove the point.
def test_06_v0(numbers):
    # Example of inefficient code
    # (redundant checks, duplicate work)
    output = []
    for v in numbers:
        if v > 0:
            output.append(v * 2)
            if v != 0:  # Redundant check
                output.append(v * 2)
    return output

def test_06_v1(numbers):
    # Improved version
    # (removes the redundant check and the duplicate work)
    return [num * 2 for num in numbers if num > 0]
# Summary Of Test Results
Baseline: 133.312 ns per loop
Improved: 67.379 ns per loop
% Improvement: 49.5 %
Speedup: 1.98x
1.9x speedup by avoiding pointless checks.
Being Lazy Can Be Good
#7 Avoid Repeats
Avoid repetitive calculations, some of which might be redundant and slow down your code. Instead, consider precomputing values where applicable.
An example to help prove the point.
def test_07_v0(n):
    # Example of inefficient code
    # (repetitive calculation within a nested loop)
    result = 0
    for i in range(n):
        for j in range(n):
            result += i * j
    return result

def test_07_v1(n):
    # Example of improved code
    # (utilizes precomputed values to help speed things up)
    pv = [[i * j for j in range(n)] for i in range(n)]
    result = 0
    for i in range(n):
        result += sum(pv[i])
    return result
# Summary Of Test Results
Baseline: 139.146 ns per loop
Improved: 92.325 ns per loop
% Improvement: 33.6 %
Speedup: 1.51x
1.5x speedup by avoiding repetitive calculations
#8 Use Generators
Consider using generators.
- Generators enable lazy evaluation, i.e. the expressions inside are only evaluated when you request the next value from it.
- This allows the processing of large datasets without loading everything into memory at once, which would otherwise have caused performance issues.
- Processing data on the fly helps reduce memory usage and improve performance.
def test_08_v0(n):
    # Baseline version (Inefficient way)
    # (calculates the nth Fibonacci number
    # by building up a full list)
    if n <= 1:
        return n
    f_list = [0, 1]
    for i in range(2, n + 1):
        f_list.append(f_list[i - 1] + f_list[i - 2])
    return f_list[n]

def test_08_v1(n):
    # Improved version
    # (yields the Fibonacci sequence lazily
    # using a generator)
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b
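Note that test_08_v1 is a generator function: calling it returns a generator object, and the Fibonacci values are computed only as they are consumed. A small usage sketch (my addition, not part of the benchmark) that mirrors the baseline's return value:

n = 10
# Consume n + 1 values; the last one is the nth Fibonacci number.
nth_fib = list(test_08_v1(n + 1))[-1]  # 55, same as test_08_v0(10)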
# Summary Of Test Results
Baseline: 0.083 ns per loop
Improved: 0.004 ns per loop
% Improvement: 95.5 %
Speedup: 22.06x
22x speedup using generators.
A Few Interesting Ways
#9 Use map()
Use Python’s built-in map() function. It allows the processing and transformation of all items in an iterable without using an explicit for loop. This technique is referred to as mapping.
An example to help prove the point.
def some_function_X(x):
    # This would normally be a function containing application
    # logic which required it to be made into a separate function.
    # (For the purpose of this test, just calculate and return
    # the square.)
    return x**2

def test_09_v0(numbers):
    # Baseline version (Inefficient way)
    output = []
    for i in numbers:
        output.append(some_function_X(i))
    return output

def test_09_v1(numbers):
    # Improved version
    # (using Python's built-in map() function)
    output = map(some_function_X, numbers)
    return output
# Summary Of Test Results
Baseline: 4.402 ns per loop
Improved: 0.005 ns per loop
% Improvement: 99.9 %
Speedup: 970.69x
970x speedup using Python’s built-in map() function instead of an explicit for loop.
How Does It Work?
The map() function takes a function object and an iterable as arguments, applies the function (passed as the first argument) to each item in the iterable, and returns an iterator that yields the transformed items on demand. Since map() is written in C and is highly optimized, its internal implied loop is far more efficient than a regular Python for loop. Hence the speed gains.
– Paraphrased from the explanation at [Reference #5]
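A small illustration of that laziness, reusing some_function_X from the snippet above (my addition): test_09_v1 returns the map object itself, so the transformation cost is only paid when the result is iterated.

mapped = map(some_function_X, [1, 2, 3])
print(mapped)        # <map object at 0x...> - nothing computed yet
print(list(mapped))  # [1, 4, 9] - the work happens on consumption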
#10 Use Memoization
Memoization optimises algorithms with recursive function calls or repeated calculations. The idea is to cache (or “memoize”) the results of expensive function calls and return them when the same inputs appear. It can reduce redundant calculations and speed up the program.
An example to help prove the point. First, an inefficient version.
# Example of inefficient code
def fibonacci(n):
    if n == 0 or n == 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def test_10_v0(numbers):
    output = []
    for i in numbers:
        output.append(fibonacci(i))
    return output
Next, an efficient version of the same functionality, using the lru_cache function from Python’s built-in functools module.
# Example of efficient code
# (using the lru_cache function from functools)
import functools

@functools.lru_cache()
def fibonacci_v2(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    return fibonacci_v2(n - 1) + fibonacci_v2(n - 2)

def test_10_v1(numbers):
    output = []
    for i in numbers:
        output.append(fibonacci_v2(i))
    return output
# Summary Of Test Results
Baseline: 77.908 ns per loop
Improved: 1.310 ns per loop
% Improvement: 98.3 %
Speedup: 59.48x
59x speedup using memoization via the lru_cache function from Python’s built-in functools module.
How Does The lru_cache Function Make This Happen?
- “LRU” is the acronym for “Least Recently Used”.
- The lru_cache function is a decorator which can be applied to functions to enable memoization.
- lru_cache stores the results of recent function calls in a cache and serves the cached result when the same inputs occur again, saving computation time.
- When applied as a decorator, lru_cache accepts an optional maxsize argument, e.g. @lru_cache(maxsize=None). The maxsize argument determines the maximum size of the cache, i.e., how many different input values it stores results for. [Reference #6]
- If the maxsize argument is set to None, the LRU feature is disabled, and the cache can grow without bound. 🚨
#11 Use Vectorization
Vectorization replaces an explicit Python-level loop with operations that run inside optimized, precompiled (C) code, e.g. via NumPy. An example to help prove the point.
import numpy as np

def test_11_v0(n):
    # Baseline version
    # (inefficient way of summing the numbers in a range)
    output = 0
    for i in range(0, n):
        output = output + i
    return output

def test_11_v1(n):
    # Improved version
    # (efficient, vectorized way of summing the numbers in a range)
    output = np.sum(np.arange(n))
    return output
# Summary Of Test Results
Baseline: 32.936 ns per loop
Improved: 1.171 ns per loop
% Improvement: 96.4 %
Speedup: 28.13x
28x speedup using Vectorization.
#12 Use filterfalse
Avoid creating intermediate lists where possible; it helps reduce memory usage.
An example to help prove the point. First, an inefficient version.
def test_12_v0(numbers):
    # Baseline version (Inefficient way)
    filtered_data = []
    for i in numbers:
        filtered_data.extend(list(
            filter(lambda x: x % 5 == 0,
                   range(1, i**2))))
    return filtered_data
Next, an improved version of the same functionality, using the filterfalse function from Python’s built-in itertools module [Reference #7].
from itertools import filterfalse

def test_12_v1(numbers):
    # Improved version
    # (using filterfalse)
    filtered_data = []
    for i in numbers:
        filtered_data.extend(list(
            filterfalse(lambda x: x % 5 != 0,
                        range(1, i**2))))
    return filtered_data
Depending on the use case, while there might not be a significant increase in execution speed, there will be a drop in memory usage, achieved by avoiding the creation of intermediate lists. The filterfalse function helps avoid creating an intermediate list [Reference #2].
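Since filterfalse returns a lazy iterator and list.extend() accepts any iterable, the list() wrapper in test_12_v1 can in fact be dropped entirely. A hypothetical variant (not part of the benchmark) that never materializes the intermediate lists:

from itertools import filterfalse

def test_12_v2(numbers):
    # Hypothetical variant: extend() consumes the lazy
    # filterfalse iterator directly; no intermediate list.
    filtered_data = []
    for i in numbers:
        filtered_data.extend(
            filterfalse(lambda x: x % 5 != 0, range(1, i**2)))
    return filtered_data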
# Summary Of Test Results
Baseline: 333167.790 ns per loop
Improved: 2541.850 ns per loop
% Improvement: 99.2 %
Speedup: 131.07x
131x speedup using the filterfalse function from itertools.
#13 Proper Way To Concatenate Strings
Repeated string concatenation using the + (or +=) operator in a loop is slow and consumes more memory. Use join instead.
An example to help prove the point.
def test_13_v0(l_strings):
    # Baseline version (Inefficient way)
    # (concatenation using the += operator)
    output = ""
    for a_str in l_strings:
        output += a_str
    return output

def test_13_v1(l_strings):
    # Improved version
    # (using join)
    output_list = []
    for a_str in l_strings:
        output_list.append(a_str)
    return "".join(output_list)
Since this test requires a large-ish list of strings, here is a simple helper function to generate one for the purpose of running the tests.
from faker import Faker

def generate_fake_names(count: int = 10000):
    # Helper function used to generate a
    # large-ish list of names
    fake = Faker()
    output_list = []
    for _ in range(count):
        output_list.append(fake.name())
    return output_list

l_strings = generate_fake_names(count=50000)
# Summary Of Test Results
Baseline: 32.423 ns per loop
Improved: 21.051 ns per loop
% Improvement: 35.1 %
Speedup: 1.54x
1.5x speedup using the join function instead of the + operator.
Why is the join function faster? String concatenation using the + operator has a time complexity of O(n²), because each concatenation copies the string built so far, whereas concatenation using the join function has a time complexity of O(n). [Reference #8]
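As a side note, when the input is already a list of strings, as it is here, the loop in test_13_v1 can be skipped altogether; assuming l_strings is an iterable of str, the following one-liner is equivalent:

l_strings = ["Alice Smith", "Bob Jones"]
output = "".join(l_strings)  # same result as test_13_v1(l_strings)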
Wrap Up
This article has covered a few simple ways to achieve 1.3x to 970x speedups of Python for loops with minimal effort.
- 970x speedup using Python’s built-in map() function instead of an explicit for loop [Tip #9]
- 498x speedup using sets instead of nested for loops [Tip #3]
- 131x speedup using the filterfalse function of itertools [Tip #12]
- 59x speedup using memoization via the lru_cache function [Tip #10]
👨🏻‍💻 All code samples can be replicated via this Google Colab notebook. Due to external factors, the numbers may differ when you run the notebook.
🙌 This article first appeared on my blog.
Acknowledgement: Thank you to Danesh Forouhari for the feedback provided on 03 Jan 2024 🙏 I have incorporated most of the suggested modifications.
References
1. timeit() Python Function Usage Guide (With Examples) (Aug 2023)
2. 15 Techniques to Optimize Python Code (Apr 2023)
3. Vectorization in Python - An Alternative to Python Loops (Mar 2023)
4. Loops in Python — Comparison and Performance (Aug 2019)
5. Python’s map(): Processing Iterables Without a Loop
6. Python’s built-in functools’ lru_cache function
7. Python’s built-in itertools’ filterfalse function
8. Time complexity of string concatenation in Python (May 2016)