The Hidden Complexity of Email Addresses
from liuzhen932
Alright, in this article, we're going to dive into the world of email addresses.
Most developers think they understand email addresses. After all, what's complicated about john@example.com? But dig deeper into the RFC specifications, and you'll discover a rabbit hole of edge cases, surprising rules, and implementation quirks that would make even seasoned engineers do a double-take.
The Deceptive Simplicity
At first glance, email validation seems straightforward. You might write a regex, check for an @ symbol, maybe validate the domain format, and call it a day. But the reality is far more nuanced. The email address specification, primarily governed by RFC 5322 and RFC 5321[^1], contains layers of complexity that most systems simply ignore or handle incorrectly.
Consider this seemingly innocent address: user+shopping@gmail.com. It's valid, widely supported, and useful for filtering emails. But what about "very.VERY.\"very@\\ \"very\".unusual"@strange.example? Believe it or not, this monstrosity is also perfectly valid according to the specifications[^2].
Breaking Down the Anatomy
An email address consists of two parts separated by the @ symbol: the local-part and the domain. The local-part can be up to 64 octets, while the domain can be up to 255 octets[^3]. But within these constraints lies a world of possibilities that most developers never encounter.
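The split and the two length limits can be sketched in a few lines. This is a minimal illustration, not a validator: the function names are made up for this example, and it assumes the domain itself contains no @, so splitting on the last @ is safe even when a quoted local-part contains one.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split on the LAST '@', since a quoted local-part may itself contain '@'."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' separator found")
    return local, domain


def within_length_limits(address: str) -> bool:
    """Check the octet limits quoted above (64 for local-part, 255 for domain)."""
    local, domain = split_address(address)
    # The limits are expressed in octets, not characters, so measure UTF-8 bytes.
    return (
        0 < len(local.encode("utf-8")) <= 64
        and 0 < len(domain.encode("utf-8")) <= 255
    )
```

Measuring the encoded bytes rather than the string length matters once non-ASCII characters enter the picture, because one visible character can occupy several octets.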
The Local-Part Mysteries
The local-part (the bit before the @ symbol) has two forms: quoted and unquoted. Unquoted local-parts can contain letters, digits, and a specific set of special characters: !#$%&'*+-/=?^_`{|}~. The period is allowed too, but with restrictions: it can't be first, last, or consecutive.
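The unquoted rules translate fairly directly into a regular expression. The sketch below covers only the ASCII, unquoted form described above; the names ATEXT, DOT_ATOM, and is_unquoted_local_part are chosen for this example.

```python
import re

# Characters allowed in an unquoted local-part besides the dot (RFC 5322 "atext").
ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"

# One or more atext runs joined by single dots: no leading, trailing, or doubled dot.
DOT_ATOM = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")


def is_unquoted_local_part(local: str) -> bool:
    """True if the string is a valid unquoted (dot-atom) local-part."""
    return DOT_ATOM.fullmatch(local) is not None
```

Building the pattern as "run, then dot-plus-run repeated" encodes all three period restrictions at once, with no separate checks needed.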
But here's where it gets interesting. What happens when you add quotes? Suddenly, you can include almost any ASCII character. Want to have a space in your email address? "john doe"@example.com is valid. Need to include a comma? "user,name"@example.com works too.
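For the quoted form, a rough check might look like the following. It is a sketch based on the SMTP grammar in RFC 5321, where the content between the quotes is either printable ASCII other than the double quote and backslash, or a backslash followed by any printable ASCII character. The names are illustrative.

```python
import re

# Between the quotes: printable ASCII except '"' and '\', or a backslash
# escape of any printable ASCII character. Zero content items is allowed.
QUOTED_LOCAL = re.compile(r'"(?:[\x20-\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"')


def is_quoted_local_part(local: str) -> bool:
    """True if the string is a quoted-string local-part, quotes included."""
    return QUOTED_LOCAL.fullmatch(local) is not None
```

Because the pattern allows zero content items, a bare pair of quotes is accepted as well.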
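Then there are comments. Take this example: normal(wtf is this?)@example.com. In the RFC 5322 header syntax, the parenthetical comment carries no meaning for addressing, making the address equivalent to normal@example.com. Comments can appear at the beginning or end of the local-part (the obsolete syntax also tolerates them around the dots in the middle), and they're stripped away during processing. The SMTP grammar in RFC 5321 has no comments at all, so they exist only in message headers.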
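Stripping comments is slightly trickier than deleting everything between parentheses, because comments can nest and parentheses inside a quoted string are literal. A minimal sketch, with a hypothetical strip_comments helper, could track nesting depth and quoting state:

```python
def strip_comments(text: str) -> str:
    """Remove parenthesized comments, honoring nesting, quotes, and escapes."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted string, parentheses are literal
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash escapes the next character; keep the pair
            # only when we are outside any comment.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"' and depth == 0:
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "(" and not in_quotes:
            depth += 1
        elif ch == ")" and not in_quotes and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced comment")
    return "".join(out)
```

This only removes the comments; it does not check that they sit in a position where the grammar allows them.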
Domain Surprises
Most people expect domains to look like example.com, but the specification allows for more exotic forms. IP address literals are valid: user@[192.168.1.1] will work, though it's rarely seen outside of spam. IPv6 addresses are also supported, but they need an IPv6: tag inside the brackets: admin@[IPv6:2001:db8::1].
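Python's standard ipaddress module can do the heavy lifting when recognizing these literals. The sketch below follows the RFC 5321 form, in which an IPv6 literal carries an IPv6: tag inside the brackets; parse_address_literal is a name invented for this example, and the rarely used "general" literal form is deliberately not handled.

```python
import ipaddress


def parse_address_literal(domain: str):
    """Return an ipaddress object for a bracketed literal, or None otherwise."""
    if not (domain.startswith("[") and domain.endswith("]")):
        return None
    body = domain[1:-1]
    try:
        if body.lower().startswith("ipv6:"):
            # Drop the 5-character "IPv6:" tag before parsing.
            return ipaddress.IPv6Address(body[5:])
        return ipaddress.IPv4Address(body)
    except ValueError:
        # Raised by ipaddress for anything that is not a well-formed address.
        return None
```

Note that a bare IPv6 address in brackets, without the tag, is rejected by this sketch.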
Perhaps more surprisingly, domains don't always need a dot. admin@example is syntactically valid, though ICANN prohibits such “dotless” domains for new gTLDs[^4].
The Emoji Revolution
With the advent of internationalized email addresses (RFC 6530[^5]), we've entered a new era. Unicode characters are now permissible in both the local-part and domain, leading to addresses like test@россия.рф or even 👋@example.com.
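On the domain side, a Unicode name is usually converted to an ASCII-compatible form before a DNS lookup. As a rough sketch, Python's built-in idna codec can do this; be aware that it implements the older IDNA 2003 rules, and that the helper names here are hypothetical. The local-part gets no such conversion: it travels as UTF-8 and needs a server that supports the SMTPUTF8 extension.

```python
def domain_to_ascii(domain: str) -> str:
    """Convert a Unicode domain to its ASCII ("xn--") form, label by label."""
    # The built-in codec follows IDNA 2003, not the newer IDNA 2008 rules.
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(domain: str) -> str:
    """Convert an ASCII-encoded domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")
```

Plain ASCII domains pass through unchanged, so the conversion can be applied to every domain without a special case.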
But the real mind-bender comes with RFC 6532, which allows Unicode in domain literals. This means user@[💩] is theoretically valid. While most mail servers won't handle this gracefully, the specification doesn't prohibit it[^6].
The Quote Conundrum
Here's something that breaks most people's mental model: ""@example.com is valid. An empty quoted string in the local-part is perfectly acceptable, even though an empty unquoted local-part is not. The distinction between @example.com (invalid) and ""@example.com (valid) illustrates the subtle complexities in the specification.
Even more bizarre, a shell command can serve as a local-part. ":(){:|:&};:"@example.com is valid: it's a fork bomb wrapped in quotes. As far as email syntax is concerned, the quotes turn it into an ordinary quoted string, just another run of characters[^7].
Case Sensitivity: The Great Debate
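RFC 5321 states that local-parts MUST be treated as case-sensitive[^8]. This means John@example.com and john@example.com are technically different addresses. However, the same RFC notes that exploiting case sensitivity impedes interoperability and is discouraged, and it recommends that hosts avoid defining mailboxes whose local-part is case-sensitive. Domains, by contrast, follow DNS rules and are not case-sensitive. This tension between a strict rule and a practical recommendation has led to inconsistent implementations across different mail systems.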
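For code that stores or compares addresses, one conservative approach is to lowercase only the domain, which follows DNS rules and is not case-sensitive, and to leave the local-part untouched. This is a sketch of that policy rather than a rule from the RFC, and normalize_case is a name made up for the example.

```python
def normalize_case(address: str) -> str:
    """Lowercase only the domain; leave the local-part exactly as given."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' separator found")
    return f"{local}@{domain.lower()}"
```

Only the receiving server knows whether John and john are the same mailbox, so folding the local-part is a decision best left to it.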
Gmail takes this further by ignoring periods in the local-part entirely. john.smith@gmail.com, johnsmith@gmail.com, and j.o.h.n.s.m.i.t.h@gmail.com all deliver to the same inbox[^9].
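A service that wants to detect such duplicates might collapse the dots itself. The sketch below is deliberately narrow: it applies only to domains listed in a hypothetical GMAIL_DOMAINS set, because dot handling is a provider policy and not part of any specification.

```python
# Assumption: only domains listed here are known to ignore dots.
GMAIL_DOMAINS = {"gmail.com"}


def collapse_gmail_dots(address: str) -> str:
    """Remove dots from the local-part, but only for dot-insensitive domains."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' separator found")
    if domain.lower() in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

Applying the same rule to other domains would merge mailboxes that are genuinely distinct.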
Plus Addressing: The Power User's Secret
Many mail servers support sub-addressing, where everything after a plus sign in the local-part is ignored for delivery purposes. user+newsletter@example.com gets delivered to user@example.com, but the tag remains visible in the headers, allowing for powerful filtering rules[^10].
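Splitting an address into its base user and tag is straightforward for the common unquoted case. The following sketch assumes a plus sign as the separator and an unquoted local-part; SubAddress and split_subaddress are illustrative names, and whether a given server honors the tag at all is up to that server.

```python
from typing import NamedTuple, Optional


class SubAddress(NamedTuple):
    user: str
    tag: Optional[str]  # None when the address carries no tag
    domain: str


def split_subaddress(address: str, separator: str = "+") -> SubAddress:
    """Split 'user+tag@domain' at the FIRST separator in the local-part."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no '@' separator found")
    user, found, tag = local.partition(separator)
    return SubAddress(user, tag if found else None, domain)
```

For example, split_subaddress("user+newsletter@example.com") yields the user "user", the tag "newsletter", and the domain "example.com".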
This feature, supported by Gmail, Outlook, Fastmail, and others, is invaluable for tracking email sources and creating disposable addresses. Yet many web forms reject these addresses as “invalid,” demonstrating the gap between specification and implementation.
Implementation Reality vs. Specification
While the RFCs define what's technically possible, real-world implementations are far more restrictive. Windows Live Hotmail, for instance, only allows alphanumeric characters, periods, underscores, and hyphens[^11]. Many web forms implement overly strict validation that rejects perfectly valid addresses.
This creates a frustrating situation where an address might be valid according to the specification, deliverable by some mail servers, but rejected by the very websites that need to send confirmation emails.
The Postmaster Exception
There's one special local-part that deserves mention: postmaster. This address is case-insensitive and should be forwarded to the domain's email administrator[^12]. Every domain that accepts email must provide a working postmaster address, making it a crucial administrative contact point.
International Considerations
The push for internationalized email addresses has opened up possibilities for users worldwide to have addresses in their native scripts. Chinese users can have addresses like 我買@屋企.香港, while Russian users might prefer медведь@с-балалайкой.рф[^13].
However, support remains patchy. While the specifications exist (RFCs 6530-6533), many legacy systems and poorly implemented validators still struggle with non-ASCII characters.
The Validation Nightmare
All this complexity makes email validation surprisingly difficult. A truly compliant validator must handle quoted strings, comments, internationalization, IP literals, and numerous edge cases. Most developers opt for simplified validation that covers the vast majority of real-world addresses while rejecting some technically valid ones.
The common advice is to send a verification email rather than relying solely on format validation. If the user receives and responds to the email, the address is valid for all practical purposes.
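The bookkeeping behind a verification email can be sketched as a one-time token store. This is an in-memory illustration only: the PendingVerifications class is hypothetical, sending the message is left out, and a real system would also want token expiry and persistent storage.

```python
import secrets
from typing import Optional


class PendingVerifications:
    """In-memory store mapping one-time tokens to unconfirmed addresses."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}

    def start(self, address: str) -> str:
        """Record the address and return a token to embed in the email link."""
        token = secrets.token_urlsafe(32)  # unguessable, URL-safe random token
        self._pending[token] = address
        return token

    def confirm(self, token: str) -> Optional[str]:
        """Return the confirmed address, or None for an unknown or used token."""
        # pop() removes the entry, so each token works at most once.
        return self._pending.pop(token, None)
```

With this approach the address string is stored exactly as the user typed it, and the only question that matters is whether the token comes back.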
Looking Forward
Email addresses have evolved far beyond their original simple design. While the core concept remains unchanged, the specification has grown to accommodate global needs, security concerns, and changing technology landscapes.
Understanding these complexities isn't just academic—it affects how we build systems, validate user input, and handle edge cases. The next time you implement email validation, remember that behind every email address lies a specification rich with history, compromise, and surprising flexibility.
The humble email address, something we use dozens of times daily, contains multitudes of complexity hidden beneath its familiar facade. In a world where we take digital communication for granted, it's worth appreciating the engineering effort required to make something so complex appear so simple.
References
[^1]: RFC 5322: Internet Message Format; RFC 5321: Simple Mail Transfer Protocol
[^2]: Example from RFC 3696 demonstrating complex but valid email address syntax
[^3]: RFC 5321, Section 4.5.3.1: Size limits and minimums for email address components
[^4]: ICANN announcement 2013-08-30: New gTLD Dotless Domain Names Prohibited
[^5]: RFC 6530: Overview and Framework for Internationalized Email
[^6]: RFC 6532: Internationalized Email Headers, allowing Unicode in all email components
[^7]: Quoted strings in RFC 5322 allow most ASCII characters when properly escaped
[^8]: RFC 5321, Section 2.4: Local-part case sensitivity requirements
[^9]: Gmail support documentation on address handling and dot notation
[^10]: RFC 5233: Sieve Email Filtering: Subaddress Extension specification
[^11]: Windows Live Hotmail registration requirements (archived documentation)
[^12]: RFC 5321, Section 4.5.1: Required postmaster address specification
[^13]: RFC 6530-6533: Internationalized email address examples and implementation
- e-mail.wtf: the inspiration for some of the examples in this article