Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plaintext protocol; should I encode it with something like Base64?

4  
Yes it can contain exactly one value of zero (right at the end) – Chris Aug 2 '11 at 4:40
3  
UTF-8 uses 8 bits so you can't send it over ASCII (7-bit) plaintext. Base64 encoding would help. Not because of null bytes, though. – Tim Pietzcker Aug 2 '11 at 4:40
    
thanks everyone for quick response, I'll use encoding or fix proto. – einclude Aug 2 '11 at 5:07
up vote 57 down vote accepted

Yes, the zero byte in UTF8 is code point 0, NUL. There is no other Unicode code point that will be encoded in UTF8 with a zero byte anywhere within it.

The possible code points and their UTF8 encoding are:

Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

You can see that all the non-zero ASCII characters are represented as themselves, while all multibyte sequences have a high bit of 1 in all their bytes.

You may need to check that your ASCII plaintext protocol doesn't mishandle bytes with the high bit set, since every non-ASCII code point is encoded using only such bytes.
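As a rough check of the table above, here is a minimal Python sketch that encodes each code point by hand and looks for zero bytes. The function names `utf8_encode` and `code_points_with_zero_byte` are made up for this illustration; surrogates (U+D800 to U+DFFF) are skipped because they are not encodable in UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # One byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def code_points_with_zero_byte() -> list[int]:
    """Return every encodable code point whose UTF-8 form contains a 0x00 byte."""
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        encoded = utf8_encode(cp)
        # Cross-check the hand-written encoder against Python's built-in codec.
        assert encoded == chr(cp).encode("utf-8")
        if 0 in encoded:
            found.append(cp)
    return found


if __name__ == "__main__":
    print(code_points_with_zero_byte())
    print(utf8_encode(0x0800).hex())
```

The scan walks the whole code space, so it takes a second or two to run.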

4  
Pacerier, there is no such thing as invalid UTF8. By definition, if it's not valid, it's not UTF8 :-) – paxdiablo Jan 26 '15 at 19:41
2  
The definition of UTF-8 has been so overloaded by too many to mean "bytes to be interpreted as UTF-8" instead of the original "bytes according to UTF-8". – Pacerier Jan 30 '15 at 19:52
2  
Pacerier, you raise a good point, and that may be the case, but then they're just wrong. As wrong as people who try to claim EBCDIC is ASCII, COBOL is C, or French is Swahili :-) I can see no reasonable interpretation that would call something UTF8 if it wasn't actually valid according to the UTF8 rules. If it's not valid UTF8, then it's just some sort of arbitrary bytestream. – paxdiablo Jan 31 '15 at 5:04
1  
Nice. – Pacerier Feb 1 '15 at 13:59
1  
@gardarh: no, the UTF-8 encoding of 0x0800 is not 08, 00, it's e0, a0, 80, with no zero byte in sight. See fileformat.info/info/unicode/char/0800/index.htm for more details but it's basically the first value in my third range in the answer, with all bytes having the high bit set, hence no possibility of 00. – paxdiablo Nov 1 '16 at 12:01

A UTF-8 encoded string can have most values from 0x00 to 0xff in a given byte position of backing memory, although a few specific combinations are not allowed (see http://en.wikipedia.org/wiki/UTF-8) and the octet values C0, C1, and F5 to FF never appear.
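To see which octet values never show up, one option is to encode every code point with Python's built-in codec and collect the bytes that appear. This is only a sketch, and `unused_octets` is a name invented here. The missing values have a simple explanation: C0 and C1 could only start overlong two-byte encodings of ASCII characters, and F5 to FF could only start sequences beyond U+10FFFF.

```python
def unused_octets() -> list[int]:
    """Return the byte values that appear in no valid UTF-8 encoding."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - seen)


if __name__ == "__main__":
    print([f"{b:02X}" for b in unused_octets()])
```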

If you are transporting across a channel such as an ASCII stream that does not support binary data, you will have to encode the data appropriately. Base64 is broadly supported and will certainly solve that problem, though it is not entirely efficient: it uses a 64-character alphabet (6 bits per character, so the output is about a third larger than the input), whereas ASCII allows for a 128-character space.

There is a SourceForge project that provides base 91 encoding, which is more space efficient while avoiding non-printable characters: http://base91.sourceforge.net/
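For the Base64 route, a minimal Python sketch using the standard `base64` module might look like the following. The helper names `to_ascii_wire` and `from_ascii_wire` are hypothetical, and the sketch assumes the protocol can carry the Base64 alphabet (letters, digits, `+`, `/`, `=`) unchanged.

```python
import base64


def to_ascii_wire(text: str) -> str:
    """Encode text as UTF-8, then as Base64, giving a pure-ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_ascii_wire(wire: str) -> str:
    """Reverse to_ascii_wire; rejects characters outside the Base64 alphabet."""
    return base64.b64decode(wire, validate=True).decode("utf-8")


if __name__ == "__main__":
    # Sample text with a NUL and two non-ASCII characters.
    original = "na\u00efve \u0000 \u20ac"
    wire = to_ascii_wire(original)
    print(wire)
    print(from_ascii_wire(wire) == original)
```

The encoded form contains neither zero bytes nor bytes with the high bit set, at the cost of roughly 4 output characters for every 3 input bytes.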

    
thank you, Eric J., this is one very useful link – einclude Aug 2 '11 at 5:09
1  
I don't think your first sentence is correct. The sequence 11111110 could only occur in a seven-unit sequence, which I believe is not specified, and 11111111 can *never* appear as far as I know. (How would it? Perhaps in a hypothetical extension to more than seven code units?) – Kerrek SB Aug 2 '11 at 8:00
    
You can use base-128 on ASCII or UTF-8 channels, that's even more efficient: stackoverflow.com/a/3956975/309483 – Janus Troelsen Feb 18 '14 at 11:09
    
Your first sentence is not correct. According to page 2 of RFC 3629 (an internet standard published in 2003-11), "The octet values C0, C1, F5 to FF never appear." – user824425 Oct 26 '15 at 8:28
    
@Rhymoid: Thanks, I was not aware of that. Any idea why? Updated my answer accordingly. – Eric J. Oct 26 '15 at 14:41

ASCII text is restricted to byte values between 0 and 127. UTF-8 text has no such restriction - text encoded with UTF-8 may have its high bit set. So it's not safe to send UTF-8 text over a channel that doesn't guarantee safe passage for that high bit.

If you're forced to deal with an ASCII-only channel, Base-64 is a reasonable (though not particularly space-efficient) choice. Are you sure you're limited to 7-bit data, though? That's somewhat unusual in this day.
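If it helps to decide per message, here is a small Python sketch that tests whether a string's UTF-8 form stays within 7 bits. The name `is_seven_bit_clean` is made up for this example, and note that it treats a NUL byte as acceptable, which your protocol may not.

```python
def is_seven_bit_clean(text: str) -> bool:
    """True if every byte of the UTF-8 encoding has its high bit clear."""
    return all(byte < 0x80 for byte in text.encode("utf-8"))


if __name__ == "__main__":
    print(is_seven_bit_clean("plain ASCII"))
    print(is_seven_bit_clean("h\u00e9llo"))
```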

    
You can use base-128 to deal with binary data in an UTF-8/ASCII-only channel, because the lower 128 byte values are all single-byte codepoints, AFAIK. – Janus Troelsen Feb 18 '14 at 11:12
