> Most of these don't actually use UTF16 but UCS2 with surrogates
Most of these actually use UTF16.
.NET has the System.Globalization.StringInfo class, Java has string methods like String.codePointCount. Operator [] returns 2-byte code units in these languages for backward compatibility. Newer languages don’t need to be backward compatible (see e.g. https://docs.swift.org/swift-book/LanguageGuide/StringsAndCh...), but their internal format is still UTF-16. If you need to interop with any of these languages, your life will be much easier if your C++ code uses UTF-16 as well.
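For illustration, here is roughly what that code unit vs. code point distinction looks like, sketched with Go's standard unicode/utf16 package (the character is just an arbitrary non-BMP example):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        s := "𝄞" // U+1D11E, a single code point outside the BMP
        units := utf16.Encode([]rune(s))

        fmt.Println(len([]rune(s)))                   // 1 code point
        fmt.Println(len(units))                       // 2 UTF-16 code units (what operator [] indexes)
        fmt.Printf("%04X %04X\n", units[0], units[1]) // D834 DD1E, a surrogate pair
    }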
All of .NET, Java and JavaScript[0] allow unpaired surrogates, meaning none of them uses UTF-16. I expect the same holds for Symbian, OpenOffice, Qt, … So did Python on pre-FSR narrow builds.
Depends on your definition of “uses”, and I disagree with yours.
It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-Type: application/json and invalid utf-8 in the response.
Yes, technically these programming languages allow unpaired surrogates in strings. This is not necessarily a bad thing; there are valid uses for such strings. For example, you can concatenate strings without decoding UTF16 into code points, so the code is slightly faster, but while you’re concatenating, at some moment the destination will contain an unpaired surrogate.
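A rough Go sketch of that scenario, with []uint16 standing in for the UTF-16 code units such a string holds (the string and the split point are made up for illustration):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        units := utf16.Encode([]rune("a𝄞b")) // 0061 D834 DD1E 0062

        // Split at an arbitrary code unit index, the way substring-by-index works
        // in Java/.NET. Index 2 falls between the two halves of a surrogate pair.
        left, right := units[:2], units[2:]
        fmt.Println(utf16.IsSurrogate(rune(left[len(left)-1]))) // true: left ends with an unpaired surrogate

        // Concatenate code units directly, without decoding to code points.
        // After the first append the intermediate slice still ends with that
        // unpaired surrogate; appending right restores a valid sequence.
        dst := append(append([]uint16{}, left...), right...)
        fmt.Println(string(utf16.Decode(dst))) // "a𝄞b" again
    }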
I hope you agree golang uses utf-8 strings. Despite that, the unicode/utf8 standard package has a ValidString function, which means you can still create invalid strings in Go: https://golang.org/src/unicode/utf8/utf8.go Does that mean golang uses wtf-8 instead of utf-8?
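A minimal sketch of that (the strings are just made-up examples):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        good := "héllo"
        bad := good[:2] // slicing mid-rune leaves a truncated UTF-8 sequence

        fmt.Println(utf8.ValidString(good))       // true
        fmt.Println(utf8.ValidString(bad))        // false, yet it is a perfectly legal Go string
        fmt.Println(utf8.ValidString("\xff\xfe")) // false as well
    }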
> Depends on your definition of “uses”, and I disagree with yours.
It really does not, unless you're also disagreeing with the definition of "is". Unpaired surrogates are not valid in a UTF-16 stream. If there are unpaired surrogates it's not UTF-16. If the system normally allows and generates unpaired surrogates, it's not dealing in UTF-16.
> It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-type: application/json and invalid utf-8 in the response.
And that will blow up in the client, because your shit system has sent it garbage.
> Yes, technically these programming languages allow unpaired surrogates in strings.
Making them not-UTF-16.
> This is not necessarily a bad thing, there’re valid uses for such strings.
Like taking a pile of garbage and making it into a bigger pile of garbage.
Which is not relevant to the issue at hand: such piles of crap are not UTF-16.
> For example, you can concatenate strings without decoding UTF16 into code points
That's not actually an example of your claims: you can concatenate valid UTF-16 without decoding it into code points as well.
> I hope you agree golang uses utf-8 strings.
I most certainly do not. Why would you hope I agree with an obviously incorrect statement? What is wrong with you?
> Despite that, the unicode/utf8 standard package has ValidString function, which means you can still create invalid strings
Well, you've got cause and effect reversed, but yes, you can create non-utf8 Go strings, making Go's "strings" not utf8 at all.
> Does it mean golang uses wtf-8 instead of utf-8?
No, WTF-8 is something well-defined[0].
Golang's "strings" are just random bags of bytes, much like C's, and they're no more utf-8 than C's: you may assume they are in whatever encoding you're interested in, but with no guarantee whatsoever that that's actually the case, and your assumptions may well blow up in your face.
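Concretely (the junk bytes here are just an example):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "\xc3\x28" // two bytes that are not valid UTF-8

        // Nothing stops you from constructing this or passing it around; the
        // assumption only surfaces when something actually tries to decode it.
        r, size := utf8.DecodeRuneInString(s)
        fmt.Println(r == utf8.RuneError, size) // true 1
        for _, r := range s {
            fmt.Printf("%U ", r) // U+FFFD U+0028
        }
    }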
No, just because languages allow representing invalid strings in their type system doesn’t mean they use some other encoding.
The ability to represent invalid data is often a good thing. The English language allows representing all kinds of garbage, but this is OK; see e.g. Jabberwocky by Lewis Carroll.
There are programming languages that enforce string encoding and other constraints with strict type systems and/or runtimes. They are only practical for a very limited set of problems. The majority of real-world software has to deal with invalid data: compilers do because users type invalid programs, and any sufficiently complex system does because it receives data from external components written in other languages or from I/O. I think not being able to represent invalid strings is a bug, not a feature.
Mainstream, i.e. practical, languages allow representing invalid strings, provide functionality to validate & normalize them, and often raise runtime errors when you try to use them in a way that makes this a problem. For example, .NET throws an ArgumentException saying "Invalid Unicode code point found at index ##" when you try to normalize an invalid string.
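Go, for what it's worth, leans toward explicit validate-and-repair rather than throwing; a minimal sketch (the input is a made-up example):

    package main

    import (
        "fmt"
        "strings"
        "unicode/utf8"
    )

    func main() {
        s := "abc\xffdef" // representable as a Go string, but not valid UTF-8

        if !utf8.ValidString(s) {
            // Replace invalid byte sequences with U+FFFD before handing the
            // string to anything that insists on well-formed UTF-8.
            s = strings.ToValidUTF8(s, "\uFFFD")
        }
        fmt.Println(s, utf8.ValidString(s)) // abc�def true
    }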