All about rate limiters in Pipedrive — Part 1
This blog post is the first one in a two-part series called “All about rate limiters in Pipedrive”. (See Part 2)
Both parts are co-authored by Andy Kohv and Max Goldenberg.
Every backend system eventually hits the same wall: someone, or something, starts flooding your API. At some point, your carefully tuned infrastructure starts buckling under the load you technically allowed — but didn’t really expect.
At Pipedrive, we hit that wall a while ago. Between our public API, integrations, and internal microservices, we realized we needed to protect our large system of 800 microservices from abuse, keep response times stable, and ensure that one noisy client wouldn’t ruin it for everyone else. That led us to build not just one, but two different rate limiting solutions — each solving a distinct problem.
Why this article matters
If you’re working on scaling backend systems, this post will be a deep dive into what a real-world, production-grade rate limiting setup looks like in a large SaaS company. The two solutions we’ll cover map closely to what most companies eventually need:
- A public rate limiter to defend your API from overload and abuse.
- A token-based rate limiter to align API usage with customer plans and purchased packages, using a 24-hour token window.
This article is deliberately thorough. We’re not just describing what we built — we’ll walk you through the design, trade-offs, edge cases, performance constraints, and what we learned the hard way.
A quick disclaimer: rate limiting is deeply context-dependent. The “right” solution for one company might be completely wrong for another. It depends on your product, business model, traffic patterns, and infrastructure. So don’t treat this as a plug-and-play recipe. Instead, think of it as a set of battle-tested ideas that might help you design your own approach.
Along the way, we’ll also highlight a few unique decisions in our architecture. Both of our rate limiters are written in Go and use a fully asynchronous processing model — something that, as far as we know, is fairly uncommon. This asynchronous design has proven extremely fast, robust, and reliable, powering our API traffic with zero major issues for over four years.
What we’re going to talk about
This is a comprehensive article, aimed at giving a full overview of all the rate limiting solutions we have at Pipedrive. We split it into two parts:
Part 1 (this blog post): Everything about the backstory and the new Asynchronous Public Rate Limiter
Part 2 (will be released shortly): Will expand on this topic by talking about the Token-Based Rate Limiter
So let’s start Part 1 with the backstory: what was our first version of a rate limiter, and what issues did it have?
Backstory: The First Iteration
When we first built a rate limiter at Pipedrive, we did what most teams do: we embedded it directly into our API Gateway called Barista. The logic was straightforward: for every incoming request, make a synchronous call to Memcached to check and update rate limiting counters, then either allow or block the request based on the response.
This approach worked fine for light to moderate traffic. But as our API usage grew, we started hitting the fundamental problem with synchronous rate limiting: the rate limiter itself became a bottleneck.
The Memcached Bottleneck
The core issue was simple but painful. Every single API request required a round-trip to Memcached to read the current request count, increment it, and, if the key didn’t exist yet, set a TTL for the rate-limiting window.
Under normal load, these operations were fast enough: maybe 1–2ms of added latency per request. But when traffic spiked (exactly when rate limiting is most needed), Memcached would get overwhelmed. Response times would climb from milliseconds to hundreds of milliseconds, sometimes even timing out entirely.
The Self-Defeating Fallback
To prevent rate limiting from killing our API performance, we implemented a fallback strategy: if the Memcached lookup took longer than a threshold (initially 8ms), we’d skip rate limiting entirely and let the request through. This seemed reasonable — better to occasionally let some requests through than to add hundreds of milliseconds to every API call.
But this created a problem of its own. The harder someone hammered our API, the more likely they were to overwhelm Memcached, and the more likely we were to disable rate limiting altogether. Our rate limiter was most likely to fail exactly when we needed it most.
We’d see this pattern repeatedly during traffic spikes:
- High request volume starts overwhelming a service
- Memcached gets saturated with rate limiting lookups
- Memcached response times exceed our skip threshold
- Rate limiting gets disabled, letting even more traffic through
- The original service gets hit with the full, unthrottled load
Why Scaling Wasn’t the Answer
The obvious solution might seem like “just add more Memcached capacity.” But that missed the fundamental architectural problem. Synchronous rate limiting means your rate limiter has to be as fast and available as your main API path. Any latency or reliability issues in your rate limiting infrastructure directly impact every API request.
Plus, we realized we wanted to do more sophisticated rate limiting — different rules for different customers, complex penalty logic, integration with external systems like Cloudflare. Adding all that logic to the critical request path would only make the latency problem worse.
We needed a completely different approach.
New Asynchronous Public Rate Limiter
We realized we could flip the problem around: instead of making every request wait for rate limiting decisions, what if we made rate limiting decisions asynchronously and only blocked future requests when necessary?
The Core Architectural Shift
Our new design follows a simple principle: allow requests by default, block only when explicitly told to. Here’s how it works:
- API Gateway (Barista) produces traffic events: For every API request, our gateway produces a lightweight Kafka event containing request metadata (IP, company ID, endpoint, timestamp, etc.); a sketch of these events follows after this list
- Rate limiter consumes asynchronously: A separate rate limiter service consumes these events, applies rate limiting logic, and maintains counters in Memcached
- Block events trigger restrictions: When a rate limit is exceeded, the rate limiter produces a “block event” to another Kafka topic
- API Gateway enforces blocks in-memory: The gateway consumes block events and maintains an in-memory cache of blocked clients, applying restrictions to subsequent requests
- Headers added asynchronously: Rate limiting headers (current usage, reset time, etc.) are fetched from Memcached and added to responses, but this happens asynchronously and doesn’t block request processing
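To make this flow a bit more tangible, here is a minimal sketch of what the two event payloads could look like. This is our illustration only, with assumed field names, not the actual Pipedrive schema:

```go
// Illustrative sketch only: field names and types are assumptions,
// not the actual Pipedrive event schema.
package ratelimit

import "time"

// TrafficEvent is the lightweight message the API Gateway produces
// to Kafka for every incoming API request.
type TrafficEvent struct {
	IP        string    `json:"ip"`
	CompanyID int64     `json:"company_id"`
	UserAgent string    `json:"user_agent"`
	Service   string    `json:"service"`
	Endpoint  string    `json:"endpoint"`
	Method    string    `json:"method"`
	AuthType  string    `json:"auth_type"` // "web", "api_token", or "oauth"
	Timestamp time.Time `json:"timestamp"`
}

// BlockEvent is produced by the rate limiter when a limit is exceeded;
// the gateway consumes it and adds the source to its in-memory blocklist.
type BlockEvent struct {
	CompanyID int64     `json:"company_id"`
	RuleID    string    `json:"rule_id"`
	BlockedAt time.Time `json:"blocked_at"`
	ExpiresAt time.Time `json:"expires_at"`
}
```

The important property is that the events stay small: only the metadata needed for rate limiting decisions travels over Kafka.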
Why This Architecture Works
- No critical path blocking. The main request processing never waits for rate limiting decisions. Even if the rate limiter service is down or Memcached is slow, API requests continue flowing normally.
- Kafka as the reliability layer. Kafka provides exactly the durability and ordering guarantees we need. Events are never lost, and we can replay them if needed. The rate limiter can fall behind during traffic spikes and catch up later without impacting API availability.
- Independent scaling. We can scale rate limiter instances independently of gateway instances. High traffic? Spin up more rate limiter consumers. Complex rate limiting rules? Add more processing power without touching the request path.
- Graceful degradation. If there’s a problem with rate limiting infrastructure, the worst case is that we temporarily allow more traffic than we should, but the API stays responsive.
To sum it up, the asynchronous nature of this design brings many benefits: decreased latency, decoupling, improved reliability, and more.
However, it has an obvious tradeoff: some requests might still get through while the events for previous ones are being processed. Even if this is acceptable to a certain degree, we need to set up monitoring and, ideally, SLOs to account for such “slipped-through” requests. We will discuss this in a bit more detail in the Observability section below.
Rate Limiting Algorithm
The second important factor that plays a role in the overall architecture is the choice of a rate limiting algorithm.
💡 Check out this visual comparison of rate limiting algorithms. It explains them better than any wall of text could.
We implemented a custom rolling window algorithm, where counters continuously slide forward rather than resetting at fixed intervals. This provides much smoother enforcement and eliminates burst allowances at window boundaries.
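We won’t reproduce the production code here (it also handles penalties and uses bitwise optimizations, as described later), but the core idea of a continuously sliding counter can be sketched with the classic sliding-window-counter approximation. Everything below is a simplified illustration, not our actual implementation:

```go
// A deliberately simplified, in-memory sliding window counter.
// The production algorithm differs (Memcached-backed counters, penalty
// logic, bitwise optimizations); this only illustrates the rolling idea.
package ratelimit

import "time"

type SlidingWindow struct {
	Limit  int           // max requests per window
	Window time.Duration // window length, e.g. 10 * time.Second

	prevCount int       // requests counted in the previous fixed window
	currCount int       // requests counted in the current fixed window
	currStart time.Time // start of the current fixed window
}

// Allow reports whether a request at time t fits under the limit and,
// if so, counts it.
func (s *SlidingWindow) Allow(t time.Time) bool {
	switch {
	case t.Sub(s.currStart) >= 2*s.Window:
		// Long gap: nothing in either window is relevant any more.
		s.prevCount, s.currCount = 0, 0
		s.currStart = t.Truncate(s.Window)
	case t.Sub(s.currStart) >= s.Window:
		// Slide one window forward.
		s.prevCount, s.currCount = s.currCount, 0
		s.currStart = s.currStart.Add(s.Window)
	}

	// Weight the previous window by how much of it still overlaps the
	// rolling window that ends at t, so counts decay smoothly instead of
	// resetting at fixed boundaries.
	elapsed := float64(t.Sub(s.currStart)) / float64(s.Window)
	estimated := float64(s.prevCount)*(1-elapsed) + float64(s.currCount)

	if estimated >= float64(s.Limit) {
		return false
	}
	s.currCount++
	return true
}
```

With a 100-request limit and a 10-second window, a client that burned its budget gradually regains capacity as old requests age out of the window, instead of getting a full fresh allowance at a fixed boundary.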
Rate Limiting Rules
And the last important bit about the architecture is the variety of features we wanted to provide for rate limiting rules.
The more precise the rules can be, the more control we will have over potential API abuse.
We want to be able to create both:
- Global rules. These are publicly documented here.
- Custom rules. These help us deal with API usage overloads or incorrectly set up integrations on a case-by-case basis.
At the moment of writing this article, we have the following options to create a rate limiting rule:
- Filtering:
- By company ID or User-Agent
- By microservice and per endpoint
- By the HTTP method
- By the company’s billing tier
- By authentication type (web app, API token, or OAuth)
- Time range: It is possible to apply a rate limit temporarily
- Cloudflare block: It is also possible to automatically block requests on the CDN level
This level of granularity allows us to create custom rules to rate limit only the requests we need, so that the rest of the functionality of Pipedrive is not affected for selected customers. Because of that, rate limiting is one of the most useful tools we have to deal with cases of slowness or API abuse.
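To give a rough idea of how such a rule can be represented in code, here is a hedged sketch that reuses the TrafficEvent from earlier. The field names and matching semantics are our assumptions, not the actual configuration format:

```go
// Illustrative only: a possible in-memory representation of a rate limiting
// rule, covering the filter dimensions listed above.
package ratelimit

import (
	"regexp"
	"time"
)

type Rule struct {
	ID              string
	CompanyID       int64          // 0 = any company
	UserAgent       string         // "" = any user agent
	Service         string         // "" = any microservice
	PathPattern     *regexp.Regexp // compiled endpoint pattern, nil = any path
	Method          string         // "" = any HTTP method
	BillingTier     string         // "" = any billing tier
	AuthType        string         // "" = any auth type (web, api_token, oauth)
	ActiveFrom      time.Time      // zero values = rule is always active
	ActiveUntil     time.Time
	Limit           int
	Window          time.Duration
	CloudflareBlock bool // escalate to a CDN-level block on continued abuse
}

// Matches reports whether a traffic event (see the earlier sketch) falls
// under this rule; billingTier comes from customer data, not the event.
func (r *Rule) Matches(e TrafficEvent, billingTier string, now time.Time) bool {
	if r.CompanyID != 0 && r.CompanyID != e.CompanyID {
		return false
	}
	if r.UserAgent != "" && r.UserAgent != e.UserAgent {
		return false
	}
	if r.Service != "" && r.Service != e.Service {
		return false
	}
	if r.Method != "" && r.Method != e.Method {
		return false
	}
	if r.BillingTier != "" && r.BillingTier != billingTier {
		return false
	}
	if r.AuthType != "" && r.AuthType != e.AuthType {
		return false
	}
	if !r.ActiveFrom.IsZero() && (now.Before(r.ActiveFrom) || now.After(r.ActiveUntil)) {
		return false
	}
	if r.PathPattern != nil && !r.PathPattern.MatchString(e.Endpoint) {
		return false
	}
	return true
}
```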
Now that the overall architectural idea is clear, let’s circle back to the tech stack we used to implement this new rate limiter.
Technologies We Used
For the microservice itself, we chose Go for several key reasons:
- Ease of working with concurrency, which is crucial in such a high-throughput environment
- Explicit error handling
- Strong typing and clear interfaces
- Go’s extensive standard library means you rely on far fewer external packages, so you spend much less time on constant dependency upgrades. That significantly lowers (but doesn’t eliminate) the maintenance burden for a mission-critical service.
Since the concept implied asynchronous processing, we needed Kafka. For storing the request counts themselves, we went with Memcached, which was already used in the API Gateway stack at the time.
Lastly, we wanted to create dedicated observability for our rate limiter traffic and needed a time-series storage solution for our traffic events. For that, we decided to use Elasticsearch.
Implementation: Rules
Let’s start with the rate limiter rules we just talked about.
- They are created through our internal backoffice plugin and produced as events to a Kafka topic with log compaction
- At startup, each Rate Limiter instance loads the entire rules topic into memory and compiles any regex-based paths, caching the compiled patterns so incoming requests are matched with minimal overhead.
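The startup step can be sketched roughly as follows, building on the Rule type from the previous sketch. How the payloads are actually read from the compacted topic depends on the Kafka client, so that part is elided; the point is that regex compilation happens once, outside the hot path. Field names are assumptions:

```go
// A hedged sketch of the startup step: decode every rule read from the
// compacted Kafka topic and compile the regex paths once, up front.
package ratelimit

import (
	"encoding/json"
	"fmt"
	"regexp"
	"time"
)

// rawRule mirrors the wire format of a rule event; names are assumptions.
type rawRule struct {
	ID          string `json:"id"`
	PathPattern string `json:"path_pattern"` // regex source, e.g. "^/v1/deals/[0-9]+$"
	Limit       int    `json:"limit"`
	WindowSec   int    `json:"window_sec"`
}

// buildRuleCache turns raw topic payloads into an in-memory index of rules
// with pre-compiled patterns, so the hot path never calls regexp.Compile.
func buildRuleCache(payloads [][]byte) (map[string]Rule, error) {
	cache := make(map[string]Rule, len(payloads))
	for _, p := range payloads {
		if len(p) == 0 {
			continue // tombstone for a deleted rule in the compacted topic
		}
		var raw rawRule
		if err := json.Unmarshal(p, &raw); err != nil {
			return nil, fmt.Errorf("decode rule: %w", err)
		}
		re, err := regexp.Compile(raw.PathPattern)
		if err != nil {
			return nil, fmt.Errorf("compile pattern for rule %s: %w", raw.ID, err)
		}
		cache[raw.ID] = Rule{
			ID:          raw.ID,
			PathPattern: re,
			Limit:       raw.Limit,
			Window:      time.Duration(raw.WindowSec) * time.Second,
		}
	}
	return cache, nil
}
```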
We also implemented multi-DC replication:
- Rules are written to a local Kafka topic in each separate datacenter
- Kafka MirrorMaker then handles syncing them to the global topic
Why is that necessary? Customers in Pipedrive are bound to a specific geographical region, but sometimes they need to be migrated to another region (see Migrations). Multi-DC replication keeps the rule configs synced across all regions, so they never have to be migrated separately.
Implementation: Data Flow
Let’s now look at how requests are processed and blocked.
- API Gateway receives a request and produces a traffic event to Kafka.
- Rate limiter consumes these events in batches
- It matches the request against in-memory cached configs
- It then looks up the requester’s counter in Memcached
- It implements the rolling window logic and penalty handling, which is the most complex part of the codebase. The code uses various local optimizations and bitwise operations to maximize performance.
- If the counter algorithm concludes that a limit is breached, it produces a block event to Kafka
- API Gateway consumes these block events
- It adds the block to the in-memory blocklist
- It starts rejecting further traffic from the offending source with a 429 Too Many Requests response
- Finally, it also uses Memcached to inject headers like x-ratelimit-remaining in responses (via a non-blocking async callback)
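The gateway itself is a Node.js service, but to keep all sketches in one language, here is a rough Go version of the in-memory blocklist logic from the last few steps above. The key format and expiry handling are our assumptions:

```go
// A hedged sketch of the gateway-side blocklist (the real gateway is a
// Node.js service; this Go version only illustrates the idea).
package gateway

import (
	"sync"
	"time"
)

type Blocklist struct {
	mu      sync.RWMutex
	entries map[string]time.Time // block key -> expiry
}

func NewBlocklist() *Blocklist {
	return &Blocklist{entries: make(map[string]time.Time)}
}

// Add is called when a block event is consumed from Kafka.
func (b *Blocklist) Add(key string, expiresAt time.Time) {
	b.mu.Lock()
	b.entries[key] = expiresAt
	b.mu.Unlock()
}

// IsBlocked is called on the hot path for every incoming request;
// a blocked client gets an immediate 429 without touching Memcached.
func (b *Blocklist) IsBlocked(key string, now time.Time) bool {
	b.mu.RLock()
	expiry, ok := b.entries[key]
	b.mu.RUnlock()
	if !ok {
		return false
	}
	if now.After(expiry) {
		// Lazily drop stale entries.
		b.mu.Lock()
		delete(b.entries, key)
		b.mu.Unlock()
		return false
	}
	return true
}
```

Because the lookup is a local map read, rejecting a blocked client costs the gateway almost nothing, which is exactly what you want during an abuse spike.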
Implementation: Important Principles
- Low-latency event production. We optimized our API Gateway to produce Kafka messages with minimal buffering and maximum throughput. Each request generates a small event that includes only essential metadata.
- Batch processing. The Rate Limiter consumes events in configurable batches and processes them in parallel using multiple goroutines, which dramatically improves throughput.
- Per-client consistency. Since the Rate Limiter operates on mutable counters, it keeps updates for the same client consistent using mutex locks (see the sketch after this list).
- In-memory caching. Both rate limiting rules and current usage counts are cached aggressively in memory, with Memcached as the backing store for persistence and cross-instance synchronization. In-memory caching is also used in the API Gateway for the blocklist.
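As a hedged sketch of how batch processing and per-client consistency can fit together, here is one way to process a batch in parallel while serializing updates per company. The real worker layout and locking granularity are more nuanced than this:

```go
// Illustration only: fan a Kafka batch out across a bounded number of
// goroutines while serializing updates for the same company.
package ratelimit

import "sync"

type keyedLocks struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex // one lock per company ID
}

func (k *keyedLocks) lockFor(companyID int64) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = make(map[int64]*sync.Mutex)
	}
	l, ok := k.locks[companyID]
	if !ok {
		l = &sync.Mutex{}
		k.locks[companyID] = l
	}
	return l
}

// processBatch handles one consumed batch: events run in parallel, but
// counter updates for the same company never race with each other.
func processBatch(events []TrafficEvent, locks *keyedLocks, workers int, handle func(TrafficEvent)) {
	sem := make(chan struct{}, workers) // bounds the number of parallel goroutines
	var wg sync.WaitGroup
	for _, e := range events {
		wg.Add(1)
		sem <- struct{}{}
		go func(ev TrafficEvent) {
			defer wg.Done()
			defer func() { <-sem }()
			l := locks.lockFor(ev.CompanyID)
			l.Lock()
			defer l.Unlock()
			handle(ev)
		}(e)
	}
	wg.Wait()
}
```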
Implementation: The Header Challenge
One tricky aspect was providing accurate rate limiting headers in API responses. Since rate limiting happens asynchronously, we can’t know the exact current state at response time. This was our solution:
- API Gateway makes a non-blocking request (without await) to Memcached for current usage data
- If the data arrives before the response is sent, we include accurate headers
- If not, we either omit the headers or include the last known values
This is implemented using JS Promises, allowing request processing to continue while headers are fetched in parallel.
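The gateway does this with JS Promises, as mentioned; purely for illustration (and to keep the sketches in one language), the same “use the data only if it arrived in time” race could look like this in Go, with fetchUsage as a hypothetical placeholder for the Memcached lookup:

```go
// Illustration only: the real gateway implements this race with JS Promises
// in Node.js. fetchUsage is a hypothetical stand-in for the Memcached read.
package gateway

import (
	"context"
	"net/http"
	"strconv"
	"time"
)

type usage struct {
	Remaining int
}

// fetchUsage stands in for the Memcached lookup of current counters.
func fetchUsage(ctx context.Context, key string) (usage, error) {
	// ... Memcached lookup elided ...
	return usage{}, nil
}

// setRateLimitHeaders starts the lookup in the background and waits for it
// only up to a small budget, so the response is never delayed by it.
func setRateLimitHeaders(w http.ResponseWriter, key string, budget time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	done := make(chan usage, 1)
	go func() {
		if u, err := fetchUsage(ctx, key); err == nil {
			done <- u
		}
	}()

	select {
	case u := <-done:
		// Counter data arrived in time: include accurate headers.
		w.Header().Set("x-ratelimit-remaining", strconv.Itoa(u.Remaining))
	case <-ctx.Done():
		// Too slow: skip the headers rather than delay the response.
	}
}
```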
Implementation: Cloudflare Blocking
Sometimes, certain API clients — usually simply misconfigured — create such aggressive bursts of requests that they keep slamming our system, despite the HTTP 429s.
For such cases, we use CDN-level blocking through Cloudflare.
- If a default rate limit rule is exceeded and abuse continues, we create a Cloudflare firewall rule automatically
- These rules block the client across all regions
- Rules live for 5 minutes, but can be extended up to 30 minutes based on traffic behavior
- A “master” Rate Limiter instance (elected via Consul) manages these rules per region
This gives us edge-level protection without burdening our backend, and keeps misbehaving clients out until they calm down (or fix their integrations).
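We won’t show the Cloudflare API calls themselves, but the leader election part can be sketched with Consul’s lock primitive. The key name and surrounding function are our assumptions, not the actual implementation:

```go
// A hedged sketch of electing a single "master" rate limiter instance per
// region using Consul's lock primitive.
package ratelimit

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func runCloudflareManagerIfLeader(manage func(lostLeadership <-chan struct{})) error {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		return err
	}

	// All instances compete for the same lock; only the holder manages
	// Cloudflare firewall rules for this region.
	lock, err := client.LockKey("service/rate-limiter/cloudflare-leader")
	if err != nil {
		return err
	}

	// Lock blocks until this instance becomes the leader. The returned
	// channel closes if leadership is later lost (e.g. the session expires).
	lostCh, err := lock.Lock(nil)
	if err != nil {
		return err
	}
	defer lock.Unlock()

	log.Println("became Cloudflare rule manager for this region")
	manage(lostCh)
	return nil
}
```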
Performance Results
This experiment in creating an asynchronous rate limiter turned out to be an absolute success!
The improvements were significant. After switching to the new architecture:
- API latency: P99 overhead latency dropped from 44.9ms to 3.42ms — a 92% improvement
- Rate limiter reliability: Zero incidents of rate limiting being disabled due to infrastructure load in over four years of operation
- Stability: Rate limiter scales independently of the API gateway, allowing us to handle traffic spikes without degradation
Perhaps most importantly, the rate limiter now works best when we need it most, during traffic spikes and potential abuse scenarios.
Observability
The last piece of this puzzle is monitoring, observability, and SLOs.
In addition to rate limiting functionality, we created another Go service to aggregate all traffic events into Elasticsearch. This allowed us to create powerful dashboards for all Pipedrive public traffic:
- Backoffice plugin dashboards. Used by product managers, sales, and support to review rate limit stats.
- Grafana dashboards. For engineers and SREs to monitor real-time usage patterns, error rates, and rule hits.
These Grafana dashboards are one of the most used tools in our SREs’ arsenal for drilling down into traffic when investigating slowness cases and incidents. Being able to see traffic patterns visually and correlate them with other events (deployments, maintenance windows, support cases, etc.) has proven to be absolutely crucial.
And lastly, we wanted to make sure we always know that our Rate Limiter performs well in the long run. In addition to standard operational alerts that inform us if the service starts experiencing issues, we also introduced an SLO for “slipped-through events”. This SLO uses a few metrics to calculate the number of requests that went through the API Gateway when they should’ve been blocked. We formulated it like this: “99.9% of the time, the ratio of slipped-through requests is less than 1 in 10000”.
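Conceptually, the SLO only needs two counters. As a hedged sketch, assuming a Prometheus-style metrics client and made-up metric names:

```go
// Illustration only: two counters that could feed a "slipped-through" SLO.
// Metric names are assumptions, and the actual metrics backend may differ.
package observability

import "github.com/prometheus/client_golang/prometheus"

var (
	totalRequests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "gateway_requests_total",
		Help: "All requests that passed through the API Gateway.",
	})
	slippedRequests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "gateway_slipped_through_requests_total",
		Help: "Requests allowed although an active block should have rejected them.",
	})
)

func init() {
	prometheus.MustRegister(totalRequests, slippedRequests)
}

// recordRequest is called once per gateway request; slipped marks requests
// that were let through even though a block should have applied.
func recordRequest(slipped bool) {
	totalRequests.Inc()
	if slipped {
		slippedRequests.Inc()
	}
}
```

The SLO evaluation then checks that, per window, slipped divided by total stays below 1 in 10000, and that this holds at least 99.9% of the time.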
In Part 2, we will talk about the Token-Based Rate Limiter, which we added to our rate limiter setup to handle longer-term API usage. Stay tuned!