13 Ways To Speed Up Python Loops With Minimal Effort
In this article, I cover a few simple ways to achieve 1.3x to 970x speedups of Python for loops with minimal effort. I have included code snippets for the baseline and improved versions of each.
TL;DR
- 970x speedup using map() [Tip #9]
- 131x speedup using filterfalse [Tip #12]
- 59x speedup using Memoization [Tip #10]
Quantifying Performance Improvements
A commonly used tool built into Python is the timeit module. It has both a Python interface and a command-line interface. The code snippets in the following sections use the former to measure the baseline and improved performance of the loops.
100K x 10
For each approach, a baseline is first established by running the function under test 100K times (loops) over 10 test runs, and then calculating the average time per loop (measured in nanoseconds, ns).
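A rough sketch of such a harness using the timeit Python interface (my own illustration, not the notebook's exact code; it assumes a function under test such as test_01_v0 from Tip #1 and a numbers input list):

import timeit

# Hypothetical harness: 10 test runs of 100K loops each,
# then report the average time per loop in nanoseconds.
numbers = list(range(100))
run_totals = timeit.repeat(
    "test_01_v0(numbers)",
    globals=globals(),
    number=100_000,  # loops per run
    repeat=10,       # test runs
)
avg_s_per_loop = (sum(run_totals) / len(run_totals)) / 100_000
print(f"{avg_s_per_loop * 1e9:.3f} ns per loop")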
A Few Simple Ways To Speed Up Python Loops
#1 List Comprehension
The humble list comprehension replaces an explicit for loop that builds up a list.
# Baseline version (Inefficient way)
# Calculating the power of numbers
# without using a list comprehension
def test_01_v0(numbers):
    output = []
    for n in numbers:
        output.append(n ** 2.5)
    return output

# Improved version
# (using a list comprehension)
def test_01_v1(numbers):
    return [n ** 2.5 for n in numbers]
# Summary Of Test Results
Baseline: 31.486 ns per loop
Improved: 13.019 ns per loop
% Improvement: 58.7 %
Speedup: 2.42x
2.4x speedup using List Comprehension
#2 Calculate Outside Loop
If you need to rely on the length of a list to iterate over, calculate it outside the for loop.
# Baseline version (Inefficient way)
# (length calculation inside the for loop)
def test_02_v0(numbers):
    output_list = []
    for i in range(len(numbers)):
        output_list.append(i * 2)
    return output_list

# Improved version
# (length calculation outside the for loop)
def test_02_v1(numbers):
    my_list_length = len(numbers)
    output_list = []
    for i in range(my_list_length):
        output_list.append(i * 2)
    return output_list
# Summary Of Test Results
Baseline: 112.135 ns per loop
Improved: 68.304 ns per loop
% Improvement: 39.1 %
Speedup: 1.64x
1.6x speedup by moving the list length calculation outside the for loop
#3 Use Sets
Use sets in scenarios where you would otherwise use nested for loops for comparisons.
def test_03_v0(list_1, list_2):
    # Baseline version (Inefficient way)
    # (nested lookups using a for loop)
    common_items = []
    for item in list_1:
        if item in list_2:
            common_items.append(item)
    return common_items

def test_03_v1(list_1, list_2):
    # Improved version
    # (sets replace the nested lookups)
    s_1 = set(list_1)
    s_2 = set(list_2)
    common_items = s_1.intersection(s_2)
    return common_items
# Summary Of Test Results
Baseline: 9047.078 ns per loop
Improved: 18.161 ns per loop
% Improvement: 99.8 %
Speedup: 498.17x
498x speedup using sets in scenarios where you'd otherwise use nested for loops for comparisons
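As a side note, the same intersection can be written more compactly with the & set operator, which is equivalent to the .intersection() call above:

list_1 = [1, 2, 3, 4]
list_2 = [3, 4, 5]
common_items = set(list_1) & set(list_2)  # {3, 4}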
#4 Skip Irrelevant
Avoid redundant computations, i.e. skip irrelevant iterations.
# Example of inefficient code used to find
# the first even square in a list of numbers
def function_do_something(numbers):
    for n in numbers:
        square = n * n
        if square % 2 == 0:
            return square
    return None  # No even square found

# Example of improved code that finds
# the result without redundant computations
def function_do_something_v1(numbers):
    even_numbers = [n for n in numbers if n % 2 == 0]
    for n in even_numbers:
        square = n * n
        return square
    return None  # No even square found
# Summary Of Test Results
Baseline: 16.912 ns per loop
Improved: 8.697 ns per loop
% Improvement: 48.6 %
Speedup: 1.94x
1.9x speedup
#5 Refactor Functions
In certain cases, directly incorporating the code of simple functions into loops can improve code compactness and execution speed.
# Example of inefficient code:
# a loop that calls the is_prime function n times.
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def test_05_v0(n):
    # Baseline version (Inefficient way)
    # (calls the is_prime function n times)
    count = 0
    for i in range(2, n + 1):
        if is_prime(i):
            count += 1
    return count

def test_05_v1(n):
    # Improved version
    # (inlines the logic of the is_prime function)
    count = 0
    for i in range(2, n + 1):
        if i <= 1:
            continue
        for j in range(2, int(i**0.5) + 1):
            if i % j == 0:
                break
        else:
            count += 1
    return count
# Summary Of Test Results
Baseline: 1271.188 ns per loop
Improved: 939.603 ns per loop
% Improvement: 26.1 %
Speedup: 1.35x
1.3x speedup
Key Takeaways
- Calling a function involves overhead, such as pushing and popping variables on the stack, function lookup, and argument passing.
- When a simple function is called repeatedly within a loop, the overhead of function calls can add up and impact performance.
- Inlining the function’s code directly into the loop can eliminate this overhead, potentially leading to significant speedups.
“Premature optimization is the root of all evil.”
– Sir Tony Hoare (popularized by Donald Knuth)
⚠️ Caution: Balance the need for code readability against how frequently the function is called.
#6 Avoid Pointless Checks
Where possible, avoid pointless checks, some of which might be redundant and slow down your code. For example,
for v in list_of_values:
    if v > 0:
        ...  # Do something for values greater than zero
        if v != 0:
            ...  # Do something for non-zero values
💸 The second if statement (if v != 0:) is redundant: any value that passes the first check (if v > 0:) is guaranteed to be non-zero, so the code within the second if block will always run whenever the code within the first if block runs.
A more complete example to help prove the point.
def test_06_v0(numbers):
    # Example of inefficient code
    # (redundant checks, duplicate work)
    output = []
    for v in numbers:
        if v > 0:
            output.append(v * 2)
            if v != 0:  # Redundant check
                output.append(v * 2)
    return output

def test_06_v1(numbers):
    # Improved version
    # (removes the redundant check and the duplicate work)
    return [num * 2 for num in numbers if num > 0]
# Summary Of Test Results
Baseline: 133.312 ns per loop
Improved: 67.379 ns per loop
% Improvement: 49.5 %
Speedup: 1.98x
1.9x speedup by avoiding pointless checks.
Being Lazy Can Be Good
#7 Avoid Repeats
Avoid repetitive calculations, some of which might be redundant and slow down your code. Instead, consider precomputing values where applicable.
An example to help prove the point.
def test_07_v0(n):
    # Example of inefficient code
    # (repetitive calculation within a nested loop)
    result = 0
    for i in range(n):
        for j in range(n):
            result += i * j
    return result

def test_07_v1(n):
    # Example of improved code
    # (utilizes precomputed values to help speed things up)
    pv = [[i * j for j in range(n)] for i in range(n)]
    result = 0
    for i in range(n):
        result += sum(pv[i])
    return result
# Summary Of Test Results
Baseline: 139.146 ns per loop
Improved: 92.325 ns per loop
% Improvement: 33.6 %
Speedup: 1.51x
1.5x speedup by avoiding repetitive calculations
#8 Use Generators
Consider using generators.
- Generators enable lazy evaluation, i.e. the expressions inside are only evaluated when you request the next value from it.
- This allows the processing of large datasets without loading everything into memory at once, which would otherwise have caused performance issues.
- Processing data on the fly helps reduce memory usage and improve performance.
def test_08_v0(n):
    # Baseline version (Inefficient way)
    # (calculates the nth Fibonacci number
    # by building up a full list)
    if n <= 1:
        return n
    f_list = [0, 1]
    for i in range(2, n + 1):
        f_list.append(f_list[i - 1] + f_list[i - 2])
    return f_list[n]

def test_08_v1(n):
    # Improved version
    # (yields the Fibonacci sequence lazily
    # using a generator)
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b
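Note that test_08_v1 is a generator function: calling it returns a generator object, and the Fibonacci values are computed only as they are consumed. A small usage sketch (my addition, not part of the benchmark) that mirrors the baseline's return value:

n = 10
# Consume n + 1 values; the last one is the nth Fibonacci number.
nth_fib = list(test_08_v1(n + 1))[-1]  # 55, same as test_08_v0(10)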
# Summary Of Test Results
Baseline: 0.083 ns per loop
Improved: 0.004 ns per loop
% Improvement: 95.5 %
Speedup: 22.06x
22x speedup using generators.
A Few Interesting Ways
#9 Use map()
Use Python’s built-in map() function. It allows the processing and transformation of all items in an iterable without using an explicit for loop. This technique is referred to as mapping.
An example to help prove the point.
def some_function_X(x):
    # This would normally be a function containing application
    # logic which required it to be made into a separate function.
    # (For the purpose of this test, just calculate and return
    # the square.)
    return x**2

def test_09_v0(numbers):
    # Baseline version (Inefficient way)
    output = []
    for i in numbers:
        output.append(some_function_X(i))
    return output

def test_09_v1(numbers):
    # Improved version
    # (using Python's built-in map() function)
    output = map(some_function_X, numbers)
    return output
# Summary Of Test Results
Baseline: 4.402 ns per loop
Improved: 0.005 ns per loop
% Improvement: 99.9 %
Speedup: 970.69x
970x speedup using Python’s built-in map() function instead of an explicit for loop.
How Does It Work?
The map() function takes a function object and an iterable as arguments, applies the function (passed as the first argument) to each item in the iterable, and returns an iterator that yields the transformed items on demand. Since map() is written in C and is highly optimized, its internal implied loop is far more efficient than a regular Python for loop. Hence the speed gains.
– Paraphrased from the explanation at [Reference #5]
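A small illustration of that laziness, reusing some_function_X from the snippet above (my addition): test_09_v1 returns the map object itself, so the transformation cost is only paid when the result is iterated.

mapped = map(some_function_X, [1, 2, 3])
print(mapped)        # <map object at 0x...> - nothing computed yet
print(list(mapped))  # [1, 4, 9] - the work happens on consumption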
#10 Use Memoization
Memoization optimises algorithms with recursive function calls or repeated calculations. The idea is to cache (or “memoize”) the results of expensive function calls and return them when the same inputs appear. It can reduce redundant calculations and speed up the program.
An example to help prove the point. First, an inefficient version.
# Example of inefficient code
def fibonacci(n):
    if n == 0 or n == 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

def test_10_v0(numbers):
    output = []
    for i in numbers:
        output.append(fibonacci(i))
    return output
Next, an efficient version of the same functionality, using the lru_cache function from Python’s built-in functools module.
# Example of efficient code
# (using the lru_cache function from functools)
import functools

@functools.lru_cache()
def fibonacci_v2(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    return fibonacci_v2(n - 1) + fibonacci_v2(n - 2)

def test_10_v1(numbers):
    output = []
    for i in numbers:
        output.append(fibonacci_v2(i))
    return output
# Summary Of Test Results
Baseline: 77.908 ns per loop
Improved: 1.310 ns per loop
% Improvement: 98.3 %
Speedup: 59.48x
59x speedup using memoization via the lru_cache function from Python’s built-in functools module.
How Does The lru_cache Function Make This Happen?
- “LRU” is the acronym for “Least Recently Used”.
- The lru_cache function is a decorator which can be applied to functions to enable memoization.
- lru_cache stores the results of recent function calls in a cache and serves the cached result when the same inputs occur again, saving computation time.
- When applied as a decorator, lru_cache accepts an optional maxsize argument, e.g. @lru_cache(maxsize=None). The maxsize argument determines the maximum size of the cache, i.e., how many different input values it stores results for. [Reference #6]
- If the maxsize argument is set to None, the LRU feature is disabled, and the cache can grow without bound. 🚨
#11 Use Vectorization
Vectorization replaces an explicit Python-level loop with operations that run inside optimized, precompiled (C) code, e.g. via NumPy. An example to help prove the point.
import numpy as np

def test_11_v0(n):
    # Baseline version
    # (inefficient way of summing the numbers in a range)
    output = 0
    for i in range(0, n):
        output = output + i
    return output

def test_11_v1(n):
    # Improved version
    # (efficient, vectorized way of summing the numbers in a range)
    output = np.sum(np.arange(n))
    return output
# Summary Of Test Results
Baseline: 32.936 ns per loop
Improved: 1.171 ns per loop
% Improvement: 96.4 %
Speedup: 28.13x
28x speedup using Vectorization.
#12 Use filterfalse
Avoid creating intermediate lists where possible; it helps reduce memory usage.
An example to help prove the point. First, an inefficient version.
def test_12_v0(numbers):
    # Baseline version (Inefficient way)
    filtered_data = []
    for i in numbers:
        filtered_data.extend(list(
            filter(lambda x: x % 5 == 0,
                   range(1, i**2))))
    return filtered_data
Next, an improved version of the same functionality, using the filterfalse function from Python’s built-in itertools module [Reference #7].
from itertools import filterfalse

def test_12_v1(numbers):
    # Improved version
    # (using filterfalse)
    filtered_data = []
    for i in numbers:
        filtered_data.extend(list(
            filterfalse(lambda x: x % 5 != 0,
                        range(1, i**2))))
    return filtered_data
Depending on the use case, while there might not be a significant increase in execution speed, there will be a drop in memory usage, achieved by avoiding the creation of intermediate lists. The filterfalse function helps avoid creating an intermediate list [Reference #2].
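Since filterfalse returns a lazy iterator and list.extend() accepts any iterable, the list() wrapper in test_12_v1 can in fact be dropped entirely. A hypothetical variant (not part of the benchmark) that never materializes the intermediate lists:

from itertools import filterfalse

def test_12_v2(numbers):
    # Hypothetical variant: extend() consumes the lazy
    # filterfalse iterator directly; no intermediate list.
    filtered_data = []
    for i in numbers:
        filtered_data.extend(
            filterfalse(lambda x: x % 5 != 0, range(1, i**2)))
    return filtered_data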
# Summary Of Test Results
Baseline: 333167.790 ns per loop
Improved: 2541.850 ns per loop
% Improvement: 99.2 %
Speedup: 131.07x
131x speedup using the filterfalse function from itertools.
#13 Proper Way To Concatenate Strings
Repeated string concatenation using the + (or +=) operator in a loop is slow and consumes more memory. Use join instead.
An example to help prove the point.
def test_13_v0(l_strings):
    # Baseline version (Inefficient way)
    # (concatenation using the += operator)
    output = ""
    for a_str in l_strings:
        output += a_str
    return output

def test_13_v1(l_strings):
    # Improved version
    # (using join)
    output_list = []
    for a_str in l_strings:
        output_list.append(a_str)
    return "".join(output_list)
Since this test requires a large-ish list of strings, here is a simple helper function to generate one for the purpose of running the tests.
from faker import Faker

def generate_fake_names(count: int = 10000):
    # Helper function used to generate a
    # large-ish list of names
    fake = Faker()
    output_list = []
    for _ in range(count):
        output_list.append(fake.name())
    return output_list

l_strings = generate_fake_names(count=50000)
# Summary Of Test Results
Baseline: 32.423 ns per loop
Improved: 21.051 ns per loop
% Improvement: 35.1 %
Speedup: 1.54x
1.5x speedup using the join function instead of the + operator.
Why is the join function faster? String concatenation using the + operator has a time complexity of O(n²), because each concatenation copies the string built so far, whereas concatenation using the join function has a time complexity of O(n). [Reference #8]
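As a side note, when the input is already a list of strings, as it is here, the loop in test_13_v1 can be skipped altogether; assuming l_strings is an iterable of str, the following one-liner is equivalent:

l_strings = ["Alice Smith", "Bob Jones"]
output = "".join(l_strings)  # same result as test_13_v1(l_strings)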
Wrap Up
This article has covered a few simple ways to achieve 1.3x to 970x speedups of Python for loops with minimal effort.
- 970x speedup using Python’s built-in map() function instead of an explicit for loop [Tip #9]
- 498x speedup using sets instead of nested for loops [Tip #3]
- 131x speedup using the filterfalse function of itertools [Tip #12]
- 59x speedup using memoization via the lru_cache function [Tip #10]
👨🏻‍💻 All code samples can be replicated via this Google Colab notebook. Due to external factors, the numbers may differ when you run the notebook.
🙌 This article first appeared on my blog.
Acknowledgement: Thank you to Danesh Forouhari for the feedback provided on 03 Jan 2024 🙏 I have incorporated most of the suggested modifications.
References
1. timeit() Python Function Usage Guide (With Examples) (Aug 2023)
2. 15 Techniques to Optimize Python Code (Apr 2023)
3. Vectorization in Python - An Alternative to Python Loops (Mar 2023)
4. Loops in Python — Comparison and Performance (Aug 2019)
5. Python’s map(): Processing Iterables Without a Loop
6. Python’s built-in functools’ lru_cache function
7. Python’s built-in itertools’ filterfalse function
8. Time complexity of string concatenation in Python (May 2016)