Locality-sensitive hashing (LSH) is a set of techniques that dramatically speed up search-for-neighbors or near-duplication detection on data. These techniques can be used, for example, to filter out duplicates of scraped web pages at an impressive speed, or to perform near-constant-time lookups of nearby points from a geospatial data set.
Figure 1: A preview of LSH in action. Only hash collisions were used to find the weights in this image — no pairwise distances were explicitly computed.
Let’s take a quick look at other types of hash functions to get a bird’s-eye view of what counts as a hash function, and how LSH fits into that world. A traditional use for hash functions is in hash tables. As a reminder, the hash functions used in a hash table are designed to map a piece of data to an integer that can be used to look in a particular bucket within the hash table to retrieve or delete that element. Many containers with string keys, such as JavaScript objects or Python dictionaries, are based on hash tables. Although hash tables might not guarantee constant-time lookups, in practice they effectively provide them.
There are other classes of hash functions as well. For example, the SHA-1 cryptographic hash function is designed to be difficult to reverse, which is useful if you want to store someone’s password as a hashed value. Hash functions like these are called cryptographic hash functions.
Hash functions typically have these key properties:
- They map some type of input, such as strings or floats, to discrete values, such as integers.
- They’re designed so that two inputs will result in hash outputs that are either different or the same based on key properties of the inputs.
Here’s how LSH fits in: Locality-sensitive hash functions are specifically designed so that hash value collisions are more likely for two input values that are close together than for inputs that are far apart. Just as there are different implementations of secure hash functions for different use cases, there are different implementations of LSH functions for different data types and for different definitions of being close together. I’ll use the terms neighbors or being nearby to refer to points that we deem “close enough” together that we’d want to notice their similarity. In this post, I’ll give a brief overview of the key ideas behind LSH and take a look at a simple example based on an idea called random projections, which I’ll define in section 2 below.
1 A human example
It will probably be much easier to grasp the main idea behind LSH with an example you can relate to. This way we can build some intuition before diving into those random projections that we’ll use in the next section.
Suppose you have a million people from across the United States all standing in a huge room. It’s your job to get people who live close together to stand in their own groups. Imagine how much time it would take to walk up to each person, ask for their street address, map that to a lat/long pair, then write code to find geographic clusters, and walk up to every person again and tell them how to find the rest of their cluster. I cringe just thinking about the time complexity.
Here’s a much better way to solve this problem: Write every U.S. zip code on poster boards and hang those from the ceiling. Then tell everyone to go stand under the zip code where they live.
Voila! That’s much easier, right? That’s also the main idea behind LSH. We’re taking an arbitrary data type (a person, who we can think of as a ton of data including their street address), and mapping that data into a set of discrete values (zip codes) such that people who live close together probably hash to the same value. In other words, the clusters (people with the same zip code) are very likely to be groups of neighbors.
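If it helps to see the bucketing idea as code, here is a minimal Python sketch of the same scheme; the names and zip codes below are made up for illustration and are not from the text.

```python
from collections import defaultdict

# Hypothetical example data: (name, zip code) pairs standing in for
# the million people in the room.
people = [
    ("Ann", "94110"),
    ("Bob", "94110"),
    ("Cara", "10001"),
    ("Dev", "60614"),
]

# The "hash value" of a person is simply their zip code; everyone
# walks straight to the poster board labeled with that value.
clusters = defaultdict(list)
for name, zip_code in people:
    clusters[zip_code].append(name)

print(dict(clusters))
# {'94110': ['Ann', 'Bob'], '10001': ['Cara'], '60614': ['Dev']}
```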
A nice benefit of the zip code approach is that it’s parallel-friendly. Instead of requiring a center of communication, every person can walk directly to their destination without further coordination. This is a bit surprising in light of the fact that the result (clusters of neighbors) is based entirely on the relationships between the inputs.
Another property of this example is that it is approximate: some people may live across the street from each other, but happen to have different zip codes, in which case they would not be clustered together here. As we’ll see below, it’s also possible for data points to be clustered together even when they’re very far apart, although a well-designed LSH can at least give some mathematical evidence that this will be a rare event, and some implementations manage to guarantee this can never happen.
2 Hashing points with projections
In this section, I’ll explain exactly how a relatively straightforward LSH approach works, explore some key parameters for this LSH system, and review why this technique is an order of magnitude faster than some other approaches.
Let’s start with an incredibly simple mathematical function that we can treat as an LSH. Define $h_1$ for a point $\vec x = (x_1, x_2) \in \mathbb{R}^2$ by

$$h_1(\vec x) := \lfloor x_1 \rfloor;$$

that is, $h_1(\vec x)$ is the largest integer $a$ for which $a \le x_1$. For example, $h_1((3.7, -2.2)) = 3$.

Let’s suppose we choose a set of points

$$X = \{\vec x_1, \ldots, \vec x_n\}$$

at random by uniformly sampling from the origin-centered circle with radius 4.

Suppose we want to find which of our points in $X$ are close together. We can estimate this relationship by considering points to be clustered together when $h_1(\vec x) = h_1(\vec y)$. It will be handy to introduce the notation $\vec x \sim \vec y$ to indicate that $\vec x$ and $\vec y$ are in the same cluster. With that notation, we can write our current hash setup as

$$\vec x \sim \vec y \iff h_1(\vec x) = h_1(\vec y).$$
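Here is a small Python sketch of this setup, assuming points are plain (x, y) tuples; the sampling helper is my own addition, not something defined in the text.

```python
import math
import random
from collections import defaultdict

def sample_point(radius=4.0):
    """Sample a point uniformly from the origin-centered disk of the given radius."""
    r = radius * math.sqrt(random.random())  # sqrt keeps the sampling uniform in area
    theta = random.uniform(0.0, 2.0 * math.pi)
    return (r * math.cos(theta), r * math.sin(theta))

def h1(point):
    """h1(x) = floor of the first coordinate."""
    return math.floor(point[0])

points = [sample_point() for _ in range(20)]

# Two points land in the same cluster exactly when their h1 values match,
# so grouping by h1 recovers the clusters without any pairwise comparisons.
clusters = defaultdict(list)
for p in points:
    clusters[h1(p)].append(p)

for value, members in sorted(clusters.items()):
    print(value, len(members))
```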
Figure 2 shows an example of such a clustering.
Figure 2: Twenty points chosen randomly in a circle with radius 4. Each point is colored based on its hash value $h_1(\vec x)$.
You can immediately see that some points are far apart yet clustered, while others are relatively close yet unclustered. There’s also a sense that this particular hash function was arbitrarily chosen to focus on the x-axis. What would have happened with the same data if we had used $h_2(\vec x) := \lfloor x_2 \rfloor$ instead? The result is figure 3.
Figure 3: The same twenty points as figure 2, except that we’re using the $h_2$ values (instead of $h_1$ values) to determine the hash-based cluster colors this time around.
While neither clustering alone is amazing, things start to work better if we use both of them simultaneously. That is, we can redefine our clustering via

$$\vec x \sim \vec y \iff h_1(\vec x) = h_1(\vec y) \text{ and } h_2(\vec x) = h_2(\vec y). \tag{1}$$
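In code, using both hashes just means keying the buckets on the pair of values rather than a single value. A sketch that reuses `h1`, `points`, and the imports from the snippet above:

```python
def h2(point):
    """h2(x) = floor of the second coordinate."""
    return math.floor(point[1])

# The bucket key is now the pair (h1, h2), i.e. the unit grid cell a point
# falls into, so both hash values must match for two points to be clustered.
clusters = defaultdict(list)
for p in points:
    clusters[(h1(p), h2(p))].append(p)
```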
Our same example points are shown under this new clustering in figure 4.
Figure 4: The same twenty points clustered via two different hashes — one using $h_1$, the other using $h_2$. As before, points are colored based on which cluster they’re in; a cluster is the set of all points that share both their $h_1$ and their $h_2$ values.
This does a much better job of avoiding clusters with points far apart, although, as we’ll see below, we can still make some improvements.
2.1 Randomizing our hashes
So far we’ve defined deterministic hash functions. Let’s change that by choosing a random rotation matrix $M$ (a rotation around the origin) along with a random offset $b \in [0, 1)$. Given such a random $M$ and $b$, we could define a new hash function via

$$h(\vec x) := \lfloor (M \vec x)_1 + b \rfloor,$$

where I’m using the notation $\mathrm{vec}_1$ to indicate the first coordinate of the vector value $\mathrm{vec}$. (That is, the notation $(M \vec x)_1$ means the first coordinate of the vector $M \vec x$.) This function $h$ is the random projection I referred to earlier.
It may seem a tad arbitrary to use only the first coordinate here rather than any other, but the fact that we’re taking a random rotation first means that we have the same set of possibilities, with the same probability distribution, as we would when pulling out any other single coordinate value.
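Here is one way such a randomized hash might look in Python. This is a sketch under my own assumptions: 2-D points as tuples, the rotation built from a single random angle, and the offset drawn from [0, 1).

```python
import math
import random

def make_random_projection_hash():
    """Build h(x) = floor((M x)_1 + b) for a random rotation M and random offset b."""
    theta = random.uniform(0.0, 2.0 * math.pi)  # angle of the random rotation M
    b = random.random()                         # random offset in [0, 1)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def h(point):
        x, y = point
        first_coord = cos_t * x - sin_t * y     # first coordinate of M x
        return math.floor(first_coord + b)

    return h
```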
A key advantage of using randomized hash functions is that any probabilistic statements we want to make about performance (e.g., “99% of the time this algorithm will give us the correct answer”) apply equally to all data, as opposed to applying to some data sets but not to others. As a counterpoint, consider the way quicksort is typically fast, but ironically uses $O(n^2)$ time to sort a pre-sorted list; this is a case where performance depends on the data, and we’d like to avoid that. If we were using deterministic hash functions, then someone could choose the worst-case data for our hash function, and we’d be stuck with that poor performance (for example, choosing maximally far-apart points that are still clustered together by our $h_1$ function above). By using randomly chosen hash functions, we can ensure that any average-case behavior of our hash functions applies equally well to all data. This same perspective is useful for hash tables in the form of universal hashing.
Let’s revisit the example points we used above, but now apply some randomized hash functions. In figure 4, points are clustered if and only if both of their hash values (from $h_1$ and $h_2$) collide. We’ll use that same idea, but this time choose four random rotations $M_1, \ldots, M_4$ as well as four random offsets $b_1, \ldots, b_4$ to define $h_1, \ldots, h_4$ via

$$h_i(\vec x) := \lfloor (M_i \vec x)_1 + b_i \rfloor. \tag{2}$$
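Continuing the sketch, clustering under these four hashes amounts to using the 4-tuple of hash values as the bucket key; this reuses `make_random_projection_hash` and `sample_point` from the snippets above.

```python
from collections import defaultdict

# Four independent random-projection hashes, as in equation (2).
hashes = [make_random_projection_hash() for _ in range(4)]

points = [sample_point() for _ in range(100)]

# A cluster is the set of points sharing all four hash values.
clusters = defaultdict(list)
for p in points:
    key = tuple(h(p) for h in hashes)
    clusters[key].append(p)
```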
Figure 5 shows the resulting clustering. This time, there are 100 points since using more hash functions has effectively made the cluster areas smaller. We need higher point density to see points that are clustered together now.
Figure 5: One hundred random points clustered using four random hash functions as defined by (2). Points have the same color when all four of their hash values are the same. Each set of parallel light gray lines delineates the regions with the same hash value for each of the functions $h_i$.
It’s not obvious that we actually want to use all four of our hash functions. The issue is that our clusters have become quite small. There are a couple ways to address this. One is to simply increase the scale of the hash functions; for example, set

$$h_i(\vec x) := \left\lfloor \frac{(M_i \vec x)_1 + b_i}{s} \right\rfloor,$$

where $s$ is a scale factor. In this setup, larger values of $s$ will result in larger clusters.
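A scaled variant of the earlier sketch might look like this, where `s` plays the role of the scale factor in the formula above:

```python
import math
import random

def make_scaled_random_hash(s=2.0):
    """Build h(x) = floor(((M x)_1 + b) / s); larger s means coarser, larger buckets."""
    theta = random.uniform(0.0, 2.0 * math.pi)
    b = random.random()
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def h(point):
        x, y = point
        return math.floor((cos_t * x - sin_t * y + b) / s)

    return h
```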
However, there is something a bit more nuanced we can look at, which is to allow some adaptability in terms of how many hash collisions we require. In other words, suppose we have $k$ total hash functions (just above, we had $k = 4$). Instead of insisting that all $k$ hash values must match before we say two points are in the same cluster, we could look at cases where some number $j \le k$ of them match. To state this mathematically, we would rewrite equation (1) as

$$\vec x \sim \vec y \iff \#\{i : h_i(\vec x) = h_i(\vec y)\} \ge j. \tag{3}$$
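As a sketch, the rule in equation (3) is just a collision count compared against a cutoff. Here `hashes` is any list of $k$ hash functions like the ones built above, and `j` is the cutoff:

```python
def are_similar(p, q, hashes, j):
    """True when at least j of the hash functions collide on p and q (equation 3)."""
    collisions = sum(1 for h in hashes if h(p) == h(q))
    return collisions >= j
```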
Something interesting happens here, which is that the relationship $\sim$ is no longer a clustering, but becomes more like adjacency (that is, sharing an edge) in a graph. The difference is that, in a clustering, if $\vec x \sim \vec y$ and $\vec y \sim \vec z$, then we must have $\vec x \sim \vec z$ as well; this is called being transitively closed. Graphs don’t need to have this property. Similarly, it’s no longer true that our similarity relationship $\sim$ is transitively closed.
It may help your intuition to see this new definition of $\sim$ in action on the same 100 points from figure 5. This time, in figure 6, there are twenty random hashes, and we’re seeing the graphs generated by equation (3) using cutoff values (values of $j$) of 6, 7, 8, and 9. The top-left graph in figure 6 has an edge drawn between two points $\vec x$ and $\vec y$ whenever there are at least 6 hash functions $h_i$ with $h_i(\vec x) = h_i(\vec y)$, out of the 20 hash functions used.
Figure 6: A set of 100 random points with graph edges drawn according to (3). There are 20 random hash functions used. The top-left graph uses the cutoff value $j = 6$. The remaining three graphs, from top-left to bottom-right, have cutoff values 7, 8, and 9, respectively; this means each graph is a subgraph (having a subset of the edges) of the previous one.
In fact, we can visualize all possible cutoff values of 6 or higher — these are values of $j$ in equation (3) — using a single image with weighted edges, as seen in figure 7. Keep in mind that we haven’t explicitly computed any pairwise distances to arrive at this data.
Figure 7: The same 100 random points from figure 6, this time rendered with edge weights that depend on how many hash collisions are present between any two points. A black edge represents the maximum of 20 hash collisions; the lightest edge represents only 6 hash collisions.
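One way to compute edge weights in the style of figure 7, without ever computing a pairwise distance, is to bucket the points once per hash function and count how often each pair shares a bucket. A sketch, with a function name of my own choosing:

```python
from collections import defaultdict
from itertools import combinations

def collision_counts(points, hashes):
    """Map each pair of point indices to the number of hash functions on which they collide."""
    counts = defaultdict(int)
    for h in hashes:
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            buckets[h(p)].append(i)
        # Only pairs that actually share a bucket are ever touched.
        for bucket in buckets.values():
            for i, j in combinations(bucket, 2):
                counts[(i, j)] += 1
    return counts

# Edge weights for an image like figure 7: keep pairs with at least 6
# collisions and use the collision count itself as the weight.
# weights = {pair: c for pair, c in collision_counts(points, hashes).items() if c >= 6}
```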
There’s another fun way to build intuition for what information our hashes provide. Let’s visualize regions of the circle where all points have the same number of hash collisions with a given query point. We can do this by showing an example query point $\vec q$, and shading each region based on the number of hash collisions the region’s points have with $\vec q$; this is shown in figure 8. Every point in each shaded region has the same hash values for all the hash functions used. The first part of figure 8 shows a scaled version of the two-hash system (using $h_1$ and $h_2$, similar to figure 4) that we saw before; the second part uses 5 random hashes. The darkest region contains points $\vec x$ where all hash values collide, so $h_i(\vec x) = h_i(\vec q)$ for all $i$. In a lightly shaded region, that equation will only hold true for a smaller subset of the hash functions $h_i$.
The second part of figure 8 (with 5 hashes) shows nicer behavior, and I’ll try to explain why. Imagine that we were drawing these same images for some theoretically perfect LSH setup that somehow managed to match point $\vec q$ to every point $\vec x$ with $\|\vec x - \vec q\| \le r$ for some radius $r$, and matched no other points. For that perfect LSH setup, images like figure 8 would show a fixed-size circle with center at $\vec q$ that moved along with $\vec q$. With that in mind as the perfect LSH result, notice that the second part of figure 8 is much closer to this ideal than the first part. Keep in mind that lookups within the shaded regions are no longer linear searches through data, but rather the intersection of hash table lookups — that is, lookups of nearby points are significantly faster.
Figure 8: The first part shows the regions where points would be matched by either two (dark regions) or just one (lighter shade) hash collision with the moving query point $\vec q$. The second part shows the analogous idea for 5 random hash functions; in the latter case, the lightest shaded region indicates 3 hash collisions.
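To make the hash-table point from the previous paragraph concrete: a lookup can be served by $k$ ordinary hash tables, one per hash function, so a query only touches points that collide with it on at least one hash. A sketch, again with hypothetical helper names, building on the random hashes above:

```python
from collections import defaultdict

def build_tables(points, hashes):
    """One table per hash function, mapping a hash value to indices of points with that value."""
    tables = []
    for h in hashes:
        table = defaultdict(list)
        for i, p in enumerate(points):
            table[h(p)].append(i)
        tables.append(table)
    return tables

def query(q, tables, hashes, j):
    """Indices of points that collide with q on at least j of the hash functions.

    With j equal to the number of hashes, this is exactly the intersection
    of the per-table lookups mentioned above.
    """
    counts = defaultdict(int)
    for h, table in zip(hashes, tables):
        for i in table.get(h(q), []):
            counts[i] += 1
    return [i for i, c in counts.items() if c >= j]
```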
It may further help your intuition to see how weighted edges connecting a point to its neighbors, like those in figure 7, change as a single query point moves. This is the idea behind figure 9, where weighted edges are drawn between a moving query point and 100 random points. Notice that the edge weightings make intuitive sense: they tend to connect strongly to very close neighbors, weakly to farther neighbors, and not at all to points beyond a certain distance.
Figure 9: Edges weighted by how many hash collisions are present between the moving query point and 100 random points. Darker edges indicate more hash collisions. This image uses 12 random hashes, and requires at least 6 hash collisions for an edge to appear.