The other day the HB team was chatting and Ben, our dev-ops master, mentioned that he wished he'd used ULIDs instead of UUIDs for a particular system.
Like any seasoned engineer, my reaction was to mumble something non-committal then sneak over to Google to try to figure out what the hell a ULID is.
Two hours later I emerged with a thousand-yard stare and the realization that the world of unique identifiers is larger and more wondrous than I ever could have imagined.
Before we get started with ULIDs, let's go back to the basics and discuss what UUIDs are:
What's the problem with "regular" ids?
Most web applications that use databases default to numeric ids that increment automatically. For example, in Rails you might see behavior like this:
p1 = Person.create! p1.id # => 1 p2 = Person.create! p2.id # => 2
The database can generate sequential ids because it stores a counter that increments on record creation.
This pattern can also be seen outside of databases. Sometimes we need to assign ids manually, and we might store a custom counter in - say - a Redis instance.
Sequential ids are easy to implement for low-volume use-cases, but they become more problematic as volume increases:
- It's impossible to create records concurrently because each insert has to wait in line to receive its id.
- Requesting a sequential id may require a network round trip and result in slower performance.
- It's difficult to scale out data stores that provide sequential ids. You have to worry about counters on different servers getting out of sync.
- It's easy for the node with the counter to become a single point of failure.
Sequential ids also leak data, which may be a problem in some cases:
- You can easily guess the ids of resources that may not belong to you.
- If you create a user and its id is 20, you know that the service has 20 users.
UUIDs are web-scale
UUIDs look a little different than sequential ids. They are 128-bit numbers, typically expressed as 32 hexadecimal digits:
UUIDs are created using specific algorithms defined in RFC 4122. They attempt to solve many of the problems that occur with sequential ids:
- You can generate UUIDs on any number of nodes without any shared state or coordination between nodes.
- They're a little less guessable than sequential ids (more on that later)
- They don't divulge the size of your dataset.
The catch is that there's a small chance of two nodes independently generating the same id. This event is called a "collision."
Many Flavors of UUID
There are five types of UUID algorithm defined in RFC 4122. They fall into two categories:
- Time and randomness-based algorithms are the ones we've been discussing. They result in a new UUID for every run.
- Type 4: A randomly-generated id. Probably our best bet for new code.
- Type 1: The ID contains the host's MAC address and the current timestamp. These are deprecated because they're too easy to guess.
- Type 2: These seem to be uncommon. They appear to be purpose-built for an antiquated form of RPC.
- Name based algorithms are a little different. They always produce the same UUID for a given set of inputs.
- Type 5: Uses an SHA-1 hash to generate the UUID. Recommended.
- Type 3: Uses an MD5 hash and is deprecated because MD5 is too insecure.
In Ruby, you can generate UUIDs via the
uuidtools gem. It supports every type, except the mysterious type 2;
# Code stolen from the uuidtools readme. :) require "uuidtools" # Type 1 UUIDTools::UUID.timestamp_create # => #<UUID:0x2adfdc UUID:64a5189c-25b3-11da-a97b-00c04fd430c8> # Type 4 UUIDTools::UUID.random_create # => #<UUID:0x19013a UUID:984265dc-4200-4f02-ae70-fe4f48964159> # Type 3 UUIDTools::UUID.md5_create(UUIDTools::UUID_DNS_NAMESPACE, "www.widgets.com") # => #<UUID:0x287576 UUID:3d813cbb-47fb-32ba-91df-831e1593ac29> # Type 5 UUIDTools::UUID.sha1_create(UUIDTools::UUID_DNS_NAMESPACE, "www.widgets.com") # => #<UUID:0x2a0116 UUID:21f7f8de-8051-5b89-8680-0195ef798b6a>
Moving on to ULIDs
ULIDs are a useful new take on unique identifiers. The most obvious difference is that they look a little different:
They are made up of two base32-encoded numbers; a UNIX timestamp followed by a random number.
01AN4Z07BY 79KA1307SR9X4MV3 |----------| |----------------| Timestamp Randomness 48bits 80bits
This structure is fascinating! If you recall, UUIDs rely either on timestamps or randomness, but ULIDs use both timestamps and randomness.
As a result, ULIDs have some interesting properties:
- They are lexicographically (i.e., alphabetically) sortable.
- The timestamp is accurate to the millisecond
- They're prettier than UUIDs :)
These open up some cool possibilities:
- If you're partitioning your database by date, you can use the timestamp embedded in the ULID to select the correct partition.
- You can sort by ULID instead of a separate created_at column if millisecond precision is acceptable.
There are some possible downsides too:
- If exposing the timestamp is a bad idea for your application, ULIDs may not be the best option.
sort by ulidapproach may not work if you need sub-millisecond accuracy.
- According to the internet, some ULID implementations aren't bulletproof.
UUIDs are and will continue to be the standard. They've been around forever, and libraries are available in every language imaginable. However, new approaches are worth considering, especially as we enter a world that's increasingly run by distributed systems. New unique-id approaches may help us solve problems that weren't prevalent at the publication of RFC4122.