All about rate limiters in Pipedrive — Part 1
This blog post is the first one in a two-part series called “All about rate limiters in Pipedrive”. (See Part 2)
Both parts are co-authored by Andy Kohv and Max Goldenberg.
Every backend system eventually hits the same wall: someone, or something, starts flooding your API. At some point, your carefully tuned infrastructure starts buckling under the load you technically allowed — but didn’t really expect.
At Pipedrive, we hit that wall a while ago. Between our public API, integrations, and internal microservices, we realized we needed to protect our large system of 800 microservices from abuse, keep response times stable, and ensure that one noisy client wouldn’t ruin it for everyone else. That led us to build not just one, but two different rate limiting solutions — each solving a distinct problem.
Why this article matters
If you’re working on scaling backend systems, this post will be a deep dive into what a real-world, production-grade rate limiting setup looks like in a large SaaS company. The two solutions we’ll cover map closely to what most companies eventually need:
- A public rate limiter to defend your API from overload and abuse.
- A token-based rate limiter to align API usage with customer plans and purchased packages, using a 24-hour token window.
This article is deliberately thorough. We’re not just describing what we built — we’ll walk you through the design, trade-offs, edge cases, performance constraints, and what we learned the hard way.
A quick disclaimer: rate limiting is deeply context-dependent. The “right” solution for one company might be completely wrong for another. It depends on your product, business model, traffic patterns, and infrastructure. So don’t treat this as a plug-and-play recipe. Instead, think of it as a set of battle-tested ideas that might help you design your own approach.
Along the way, we’ll also highlight a few unique decisions in our architecture. Both of our rate limiters are written in Go and use a fully asynchronous processing model — something that, as far as we know, is fairly uncommon. This asynchronous design has proven extremely fast, robust, and reliable, powering our API traffic with zero major issues for over four years.
What we’re going to talk about
This is a comprehensive article, aimed at giving a full overview of all the rate limiting solutions we have at Pipedrive. We split it into two parts:
Part 1 (this blog post): Everything about the backstory and the new Asynchronous Public Rate Limiter
Part 2 (will be released shortly): Will expand on this topic by talking about the Token-Based Rate Limiter
So let’s start Part 1 with the backstory: what was our first version of a rate limiter, and what issues did it have?
Backstory: The First Iteration
When we first built a rate limiter at Pipedrive, we did what most teams do: we embedded it directly into our API Gateway called Barista. The logic was straightforward: for every incoming request, make a synchronous call to Memcached to check and update rate limiting counters, then either allow or block the request based on the response.
This approach worked fine for light to moderate traffic. But as our API usage grew, we started hitting the fundamental problem with synchronous rate limiting: the rate limiter itself became a bottleneck.
The Memcached Bottleneck
The core issue was simple but painful. Every single API request required a round-trip to Memcached to read the current request count, increment it, and, if the key didn’t exist yet, set a TTL for the rate-limiting window.
Under normal load, these operations were fast enough: maybe 1–2ms of added latency per request. But when traffic spiked (exactly when rate limiting is most needed), Memcached would get overwhelmed. Response times would climb from milliseconds to hundreds of milliseconds, sometimes even timing out entirely.
The Self-Defeating Fallback
To prevent rate limiting from killing our API performance, we implemented a fallback strategy: if the Memcached lookup took longer than a threshold (initially 8ms), we’d skip rate limiting entirely and let the request through. This seemed reasonable — better to occasionally let some requests through than to add hundreds of milliseconds to every API call.
But this created a problem of its own. The harder someone hammered our API, the more likely they were to overwhelm Memcached, and the more likely we were to disable rate limiting altogether. Our rate limiter was most likely to fail exactly when we needed it most.
We’d see this pattern repeatedly during traffic spikes:
- High request volume starts overwhelming a service
- Memcached gets saturated with rate limiting lookups
- Memcached response times exceed our skip threshold
- Rate limiting gets disabled, letting even more traffic through
- The original service gets hit with the full, unthrottled load
Why Scaling Wasn’t the Answer
The obvious solution might seem like “just add more Memcached capacity.” But that missed the fundamental architectural problem. Synchronous rate limiting means your rate limiter has to be as fast and available as your main API path. Any latency or reliability issues in your rate limiting infrastructure directly impact every API request.
Plus, we realized we wanted to do more sophisticated rate limiting — different rules for different customers, complex penalty logic, integration with external systems like Cloudflare. Adding all that logic to the critical request path would only make the latency problem worse.
We needed a completely different approach.
New Asynchronous Public Rate Limiter
We realized we could flip the problem around: instead of making every request wait for rate limiting decisions, what if we made rate limiting decisions asynchronously and only blocked future requests when necessary?
The Core Architectural Shift
Our new design follows a simple principle: allow requests by default, block only when explicitly told to. Here’s how it works:
- API Gateway (Barista) produces traffic events: For every API request, our gateway produces a lightweight Kafka event containing request metadata (IP, company ID, endpoint, timestamp, etc.); a sketch of these events follows after this list
- Rate limiter consumes asynchronously: A separate rate limiter service consumes these events, applies rate limiting logic, and maintains counters in Memcached
- Block events trigger restrictions: When a rate limit is exceeded, the rate limiter produces a “block event” to another Kafka topic
- API Gateway enforces blocks in-memory: The gateway consumes block events and maintains an in-memory cache of blocked clients, applying restrictions to subsequent requests
- Headers added asynchronously: Rate limiting headers (current usage, reset time, etc.) are fetched from Memcached and added to responses, but this happens asynchronously and doesn’t block request processing
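To make this flow a bit more tangible, here is a minimal sketch of what the two event payloads could look like. This is our illustration only, with assumed field names, not the actual Pipedrive schema:

```go
// Illustrative sketch only: field names and types are assumptions,
// not the actual Pipedrive event schema.
package ratelimit

import "time"

// TrafficEvent is the lightweight message the API Gateway produces
// to Kafka for every incoming API request.
type TrafficEvent struct {
	IP        string    `json:"ip"`
	CompanyID int64     `json:"company_id"`
	UserAgent string    `json:"user_agent"`
	Service   string    `json:"service"`
	Endpoint  string    `json:"endpoint"`
	Method    string    `json:"method"`
	AuthType  string    `json:"auth_type"` // "web", "api_token", or "oauth"
	Timestamp time.Time `json:"timestamp"`
}

// BlockEvent is produced by the rate limiter when a limit is exceeded;
// the gateway consumes it and adds the source to its in-memory blocklist.
type BlockEvent struct {
	CompanyID int64     `json:"company_id"`
	RuleID    string    `json:"rule_id"`
	BlockedAt time.Time `json:"blocked_at"`
	ExpiresAt time.Time `json:"expires_at"`
}
```

The important property is that the events stay small: only the metadata needed for rate limiting decisions travels over Kafka.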
Why This Architecture Works
- No critical path blocking. The main request processing never waits for rate limiting decisions. Even if the rate limiter service is down or Memcached is slow, API requests continue flowing normally.
- Kafka as the reliability layer. Kafka provides exactly the durability and ordering guarantees we need. Events are never lost, and we can replay them if needed. The rate limiter can fall behind during traffic spikes and catch up later without impacting API availability.
- Independent scaling. We can scale rate limiter instances independently of gateway instances. High traffic? Spin up more rate limiter consumers. Complex rate limiting rules? Add more processing power without touching the request path.
- Graceful degradation. If there’s a problem with rate limiting infrastructure, the worst case is that we temporarily allow more traffic than we should, but the API stays responsive.
To sum it up, the asynchronous nature of this design brings many benefits: decreased latency, decoupling, improved reliability, and more.
However, it has an obvious tradeoff: some requests might still get through while the events for previous ones are being processed. Even if this is acceptable to a certain degree, we need to set up monitoring and, ideally, SLOs to account for such “slipped-through” requests. We will discuss this in a bit more detail in the Observability section below.
Rate Limiting Algorithm
The second important factor that plays a role in the overall architecture is the choice of a rate limiting algorithm.
💡 Check out this visual comparison of rate limiting algorithms. It explains them better than any wall of text could.
We implemented a custom rolling window algorithm, where counters continuously slide forward rather than resetting at fixed intervals. This provides much smoother enforcement and eliminates burst allowances at window boundaries.
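We won’t reproduce the production code here (it also handles penalties and uses bitwise optimizations, as described later), but the core idea of a continuously sliding counter can be sketched with the classic sliding-window-counter approximation. Everything below is a simplified illustration, not our actual implementation:

```go
// A deliberately simplified, in-memory sliding window counter.
// The production algorithm differs (Memcached-backed counters, penalty
// logic, bitwise optimizations); this only illustrates the rolling idea.
package ratelimit

import "time"

type SlidingWindow struct {
	Limit  int           // max requests per window
	Window time.Duration // window length, e.g. 10 * time.Second

	prevCount int       // requests counted in the previous fixed window
	currCount int       // requests counted in the current fixed window
	currStart time.Time // start of the current fixed window
}

// Allow reports whether a request at time t fits under the limit and,
// if so, counts it.
func (s *SlidingWindow) Allow(t time.Time) bool {
	switch {
	case t.Sub(s.currStart) >= 2*s.Window:
		// Long gap: nothing in either window is relevant any more.
		s.prevCount, s.currCount = 0, 0
		s.currStart = t.Truncate(s.Window)
	case t.Sub(s.currStart) >= s.Window:
		// Slide one window forward.
		s.prevCount, s.currCount = s.currCount, 0
		s.currStart = s.currStart.Add(s.Window)
	}

	// Weight the previous window by how much of it still overlaps the
	// rolling window that ends at t, so counts decay smoothly instead of
	// resetting at fixed boundaries.
	elapsed := float64(t.Sub(s.currStart)) / float64(s.Window)
	estimated := float64(s.prevCount)*(1-elapsed) + float64(s.currCount)

	if estimated >= float64(s.Limit) {
		return false
	}
	s.currCount++
	return true
}
```

With a 100-request limit and a 10-second window, a client that burned its budget gradually regains capacity as old requests age out of the window, instead of getting a full fresh allowance at a fixed boundary.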
Rate Limiting Rules
And the last important bit about the architecture is the variety of features we wanted to provide for rate limiting rules.
The more precise the rules can be, the more control we will have over potential API abuse.
We want to be able to create both:
- Global rules. These are publicly documented here.
- Custom rules. These help us deal with API usage overloads or incorrectly set up integrations on a case-by-case basis.
At the moment of writing this article, we have the following options to create a rate limiting rule:
- Filtering:
- By company ID or User-Agent
- By microservice and per endpoint
- By the HTTP method
- By the company’s billing tier
- By authentication type (web app, API token, or OAuth)
- Time range: It is possible to apply a rate limit temporarily
- Cloudflare block: It is also possible to automatically block requests on the CDN level
This level of granularity allows us to create custom rules to rate limit only the requests we need, so that the rest of the functionality of Pipedrive is not affected for selected customers. Because of that, rate limiting is one of the most useful tools we have to deal with cases of slowness or API abuse.
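To give a rough idea of how such a rule can be represented in code, here is a hedged sketch that reuses the TrafficEvent from earlier. The field names and matching semantics are our assumptions, not the actual configuration format:

```go
// Illustrative only: a possible in-memory representation of a rate limiting
// rule, covering the filter dimensions listed above.
package ratelimit

import (
	"regexp"
	"time"
)

type Rule struct {
	ID              string
	CompanyID       int64          // 0 = any company
	UserAgent       string         // "" = any user agent
	Service         string         // "" = any microservice
	PathPattern     *regexp.Regexp // compiled endpoint pattern, nil = any path
	Method          string         // "" = any HTTP method
	BillingTier     string         // "" = any billing tier
	AuthType        string         // "" = any auth type (web, api_token, oauth)
	ActiveFrom      time.Time      // zero values = rule is always active
	ActiveUntil     time.Time
	Limit           int
	Window          time.Duration
	CloudflareBlock bool // escalate to a CDN-level block on continued abuse
}

// Matches reports whether a traffic event (see the earlier sketch) falls
// under this rule; billingTier comes from customer data, not the event.
func (r *Rule) Matches(e TrafficEvent, billingTier string, now time.Time) bool {
	if r.CompanyID != 0 && r.CompanyID != e.CompanyID {
		return false
	}
	if r.UserAgent != "" && r.UserAgent != e.UserAgent {
		return false
	}
	if r.Service != "" && r.Service != e.Service {
		return false
	}
	if r.Method != "" && r.Method != e.Method {
		return false
	}
	if r.BillingTier != "" && r.BillingTier != billingTier {
		return false
	}
	if r.AuthType != "" && r.AuthType != e.AuthType {
		return false
	}
	if !r.ActiveFrom.IsZero() && (now.Before(r.ActiveFrom) || now.After(r.ActiveUntil)) {
		return false
	}
	if r.PathPattern != nil && !r.PathPattern.MatchString(e.Endpoint) {
		return false
	}
	return true
}
```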
Now that the overall architectural idea is clear, let’s circle back to the tech stack we used to implement this new rate limiter.
Technologies We Used
For the microservice itself, we chose Go for several key reasons:
- Ease of working with concurrency, which is crucial in such a high-throughput environment
- Explicit error handling
- Strong typing and clear interfaces
- Go’s extensive standard library means you rely on far fewer external packages, so you spend much less time on constant dependency upgrades. That significantly lowers (but doesn’t eliminate) the maintenance burden for a mission-critical service.
Since the concept implied asynchronous processing, we needed Kafka. For storing the request counts themselves, we went with Memcached, which was already used in the API Gateway stack at the time.
Lastly, we wanted to create dedicated observability for our rate limiter traffic and needed a time-series storage solution for our traffic events. For that, we decided to use Elasticsearch.
Implementation: Rules
Let’s start with the rate limiter rules we just talked about.
- They are created through our internal backoffice plugin and produced as events to a Kafka topic with log compaction
- At startup, each Rate Limiter instance loads the entire rules topic into memory and compiles any regex-based paths, caching the compiled patterns so incoming requests are matched with minimal overhead.
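The startup step can be sketched roughly as follows, building on the Rule type from the previous sketch. How the payloads are actually read from the compacted topic depends on the Kafka client, so that part is elided; the point is that regex compilation happens once, outside the hot path. Field names are assumptions:

```go
// A hedged sketch of the startup step: decode every rule read from the
// compacted Kafka topic and compile the regex paths once, up front.
package ratelimit

import (
	"encoding/json"
	"fmt"
	"regexp"
	"time"
)

// rawRule mirrors the wire format of a rule event; names are assumptions.
type rawRule struct {
	ID          string `json:"id"`
	PathPattern string `json:"path_pattern"` // regex source, e.g. "^/v1/deals/[0-9]+$"
	Limit       int    `json:"limit"`
	WindowSec   int    `json:"window_sec"`
}

// buildRuleCache turns raw topic payloads into an in-memory index of rules
// with pre-compiled patterns, so the hot path never calls regexp.Compile.
func buildRuleCache(payloads [][]byte) (map[string]Rule, error) {
	cache := make(map[string]Rule, len(payloads))
	for _, p := range payloads {
		if len(p) == 0 {
			continue // tombstone for a deleted rule in the compacted topic
		}
		var raw rawRule
		if err := json.Unmarshal(p, &raw); err != nil {
			return nil, fmt.Errorf("decode rule: %w", err)
		}
		re, err := regexp.Compile(raw.PathPattern)
		if err != nil {
			return nil, fmt.Errorf("compile pattern for rule %s: %w", raw.ID, err)
		}
		cache[raw.ID] = Rule{
			ID:          raw.ID,
			PathPattern: re,
			Limit:       raw.Limit,
			Window:      time.Duration(raw.WindowSec) * time.Second,
		}
	}
	return cache, nil
}
```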
We also implemented multi-DC replication:
- Rules are written to a local Kafka topic in each separate datacenter
- Kafka MirrorMaker then handles syncing them to the global topic
Why is that necessary? Customers in Pipedrive are bound to a specific geographical region, but sometimes they need to be migrated to another region (see Migrations). Multi-DC replication keeps the rule configs synced across all regions, so they never have to be migrated separately.
Implementation: Data Flow
Let’s now look at how requests are processed and blocked.
- API Gateway receives a request and produces a traffic event to Kafka.
- Rate limiter consumes these events in batches
- It matches the request against in-memory cached configs
- It then looks up the requester’s counter in Memcached
- It implements the rolling window logic and penalty handling, which is the most complex part of the codebase. The code uses various local optimizations and bitwise operations to maximize performance.
- If the counter algorithm concludes that a limit is breached, it produces a block event to Kafka
- API Gateway consumes these block events
- It adds the block to the in-memory blocklist
- It starts rejecting further traffic from the offending source with a 429 Too Many Requests response
- Finally, it also uses Memcached to inject headers like x-ratelimit-remaining in responses (via a non-blocking async callback)
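The gateway itself is a Node.js service, but to keep all sketches in one language, here is a rough Go version of the in-memory blocklist logic from the last few steps above. The key format and expiry handling are our assumptions:

```go
// A hedged sketch of the gateway-side blocklist (the real gateway is a
// Node.js service; this Go version only illustrates the idea).
package gateway

import (
	"sync"
	"time"
)

type Blocklist struct {
	mu      sync.RWMutex
	entries map[string]time.Time // block key -> expiry
}

func NewBlocklist() *Blocklist {
	return &Blocklist{entries: make(map[string]time.Time)}
}

// Add is called when a block event is consumed from Kafka.
func (b *Blocklist) Add(key string, expiresAt time.Time) {
	b.mu.Lock()
	b.entries[key] = expiresAt
	b.mu.Unlock()
}

// IsBlocked is called on the hot path for every incoming request;
// a blocked client gets an immediate 429 without touching Memcached.
func (b *Blocklist) IsBlocked(key string, now time.Time) bool {
	b.mu.RLock()
	expiry, ok := b.entries[key]
	b.mu.RUnlock()
	if !ok {
		return false
	}
	if now.After(expiry) {
		// Lazily drop stale entries.
		b.mu.Lock()
		delete(b.entries, key)
		b.mu.Unlock()
		return false
	}
	return true
}
```

Because the lookup is a local map read, rejecting a blocked client costs the gateway almost nothing, which is exactly what you want during an abuse spike.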
Implementation: Important Principles
- Low-latency event production. We optimized our API Gateway to produce Kafka messages with minimal buffering and maximum throughput. Each request generates a small event that includes only essential metadata.
- Batch processing. The Rate Limiter consumes events in configurable batches and processes them in parallel using multiple goroutines, which dramatically improves throughput.
- Per-client consistency. Since the Rate Limiter operates on mutable counters, it keeps updates for the same client consistent using mutex locks (see the sketch after this list).
- In-memory caching. Both rate limiting rules and current usage counts are cached aggressively in memory, with Memcached as the backing store for persistence and cross-instance synchronization. In-memory caching is also used in the API Gateway for the blocklist.
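As a hedged sketch of how batch processing and per-client consistency can fit together, here is one way to process a batch in parallel while serializing updates per company. The real worker layout and locking granularity are more nuanced than this:

```go
// Illustration only: fan a Kafka batch out across a bounded number of
// goroutines while serializing updates for the same company.
package ratelimit

import "sync"

type keyedLocks struct {
	mu    sync.Mutex
	locks map[int64]*sync.Mutex // one lock per company ID
}

func (k *keyedLocks) lockFor(companyID int64) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = make(map[int64]*sync.Mutex)
	}
	l, ok := k.locks[companyID]
	if !ok {
		l = &sync.Mutex{}
		k.locks[companyID] = l
	}
	return l
}

// processBatch handles one consumed batch: events run in parallel, but
// counter updates for the same company never race with each other.
func processBatch(events []TrafficEvent, locks *keyedLocks, workers int, handle func(TrafficEvent)) {
	sem := make(chan struct{}, workers) // bounds the number of parallel goroutines
	var wg sync.WaitGroup
	for _, e := range events {
		wg.Add(1)
		sem <- struct{}{}
		go func(ev TrafficEvent) {
			defer wg.Done()
			defer func() { <-sem }()
			l := locks.lockFor(ev.CompanyID)
			l.Lock()
			defer l.Unlock()
			handle(ev)
		}(e)
	}
	wg.Wait()
}
```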
Implementation: The Header Challenge
One tricky aspect was providing accurate rate limiting headers in API responses. Since rate limiting happens asynchronously, we can’t know the exact current state at response time. This was our solution:
- API Gateway makes a non-blocking request (without await) to Memcached for current usage data
- If the data arrives before the response is sent, we include accurate headers
- If not, we either omit the headers or include the last known values
This is implemented using JS Promises, allowing request processing to continue while headers are fetched in parallel.
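The gateway does this with JS Promises, as mentioned; purely for illustration (and to keep the sketches in one language), the same “use the data only if it arrived in time” race could look like this in Go, with fetchUsage as a hypothetical placeholder for the Memcached lookup:

```go
// Illustration only: the real gateway implements this race with JS Promises
// in Node.js. fetchUsage is a hypothetical stand-in for the Memcached read.
package gateway

import (
	"context"
	"net/http"
	"strconv"
	"time"
)

type usage struct {
	Remaining int
}

// fetchUsage stands in for the Memcached lookup of current counters.
func fetchUsage(ctx context.Context, key string) (usage, error) {
	// ... Memcached lookup elided ...
	return usage{}, nil
}

// setRateLimitHeaders starts the lookup in the background and waits for it
// only up to a small budget, so the response is never delayed by it.
func setRateLimitHeaders(w http.ResponseWriter, key string, budget time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	done := make(chan usage, 1)
	go func() {
		if u, err := fetchUsage(ctx, key); err == nil {
			done <- u
		}
	}()

	select {
	case u := <-done:
		// Counter data arrived in time: include accurate headers.
		w.Header().Set("x-ratelimit-remaining", strconv.Itoa(u.Remaining))
	case <-ctx.Done():
		// Too slow: skip the headers rather than delay the response.
	}
}
```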
Implementation: Cloudflare Blocking
Sometimes, certain API clients — usually simply misconfigured — create such aggressive bursts of requests that they keep slamming our system, despite the HTTP 429s.
For such cases, we use CDN-level blocking through Cloudflare.
- If a default rate limit rule is exceeded and abuse continues, we create a Cloudflare firewall rule automatically
- These rules block the client across all regions
- Rules live for 5 minutes, but can be extended up to 30 minutes based on traffic behavior
- A “master” Rate Limiter instance (elected via Consul) manages these rules per region
This gives us edge-level protection without burdening our backend, and keeps misbehaving clients out until they calm down (or fix their integrations).
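We won’t show the Cloudflare API calls themselves, but the leader election part can be sketched with Consul’s lock primitive. The key name and surrounding function are our assumptions, not the actual implementation:

```go
// A hedged sketch of electing a single "master" rate limiter instance per
// region using Consul's lock primitive.
package ratelimit

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func runCloudflareManagerIfLeader(manage func(lostLeadership <-chan struct{})) error {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		return err
	}

	// All instances compete for the same lock; only the holder manages
	// Cloudflare firewall rules for this region.
	lock, err := client.LockKey("service/rate-limiter/cloudflare-leader")
	if err != nil {
		return err
	}

	// Lock blocks until this instance becomes the leader. The returned
	// channel closes if leadership is later lost (e.g. the session expires).
	lostCh, err := lock.Lock(nil)
	if err != nil {
		return err
	}
	defer lock.Unlock()

	log.Println("became Cloudflare rule manager for this region")
	manage(lostCh)
	return nil
}
```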
Performance Results
This experiment in creating an asynchronous rate limiter turned out to be an absolute success!
The improvements were significant. After switching to the new architecture:
- API latency: P99 overhead latency dropped from 44.9ms to 3.42ms — a 92% improvement
- Rate limiter reliability: Zero incidents of rate limiting being disabled due to infrastructure load in over four years of operation
- Stability: Rate limiter scales independently of the API gateway, allowing us to handle traffic spikes without degradation
Perhaps most importantly, the rate limiter now works best when we need it most, during traffic spikes and potential abuse scenarios.
Observability
The last piece of this puzzle is monitoring, observability, and SLOs.
In addition to rate limiting functionality, we created another Go service to aggregate all traffic events into Elasticsearch. This allowed us to create powerful dashboards for all Pipedrive public traffic:
- Backoffice plugin dashboards. Used by product managers, sales, and support to review rate limit stats.
- Grafana dashboards. For engineers and SREs to monitor real-time usage patterns, error rates, and rule hits.
These Grafana dashboards are one of the most used tools in our SREs’ arsenal for drilling down into traffic when investigating slowness cases and incidents. Being able to see traffic patterns visually and correlate them with other events (deployments, maintenance windows, support cases, etc.) has proven to be absolutely crucial.
And lastly, we wanted to make sure we always know that our Rate Limiter performs well in the long run. In addition to standard operational alerts that inform us if the service starts experiencing issues, we also introduced an SLO for “slipped-through events”. This SLO uses a few metrics to calculate the number of requests that went through the API Gateway when they should’ve been blocked. We formulated it like this: “99.9% of the time, the ratio of slipped-through requests is less than 1 in 10000”.
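Conceptually, the SLO only needs two counters. As a hedged sketch, assuming a Prometheus-style metrics client and made-up metric names:

```go
// Illustration only: two counters that could feed a "slipped-through" SLO.
// Metric names are assumptions, and the actual metrics backend may differ.
package observability

import "github.com/prometheus/client_golang/prometheus"

var (
	totalRequests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "gateway_requests_total",
		Help: "All requests that passed through the API Gateway.",
	})
	slippedRequests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "gateway_slipped_through_requests_total",
		Help: "Requests allowed although an active block should have rejected them.",
	})
)

func init() {
	prometheus.MustRegister(totalRequests, slippedRequests)
}

// recordRequest is called once per gateway request; slipped marks requests
// that were let through even though a block should have applied.
func recordRequest(slipped bool) {
	totalRequests.Inc()
	if slipped {
		slippedRequests.Inc()
	}
}
```

The SLO evaluation then checks that, per window, slipped divided by total stays below 1 in 10000, and that this holds at least 99.9% of the time.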
In Part 2, we will talk about the Token-Based Rate Limiter, which we added to our rate limiter setup to handle longer-term API usage. Stay tuned!