> Most of these don't actually use UTF16 but UCS2 with surrogates
Most of these actually use UTF16.
.NET has the System.Globalization.StringInfo class, Java has string methods like String.codePointCount. Operator [] returns 2-byte code units in these languages for backward compatibility. Newer languages don’t need to be backward compatible (see e.g. https://docs.swift.org/swift-book/LanguageGuide/StringsAndCh...), but their internal format is still UTF-16. If you need to interop with any of these languages, your life will be much easier if your C++ code uses UTF-16 as well.
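For illustration, here is roughly what that code unit vs. code point distinction looks like, sketched with Go's standard unicode/utf16 package (the character is just an arbitrary non-BMP example):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        s := "𝄞" // U+1D11E, a single code point outside the BMP
        units := utf16.Encode([]rune(s))

        fmt.Println(len([]rune(s)))                   // 1 code point
        fmt.Println(len(units))                       // 2 UTF-16 code units (what operator [] indexes)
        fmt.Printf("%04X %04X\n", units[0], units[1]) // D834 DD1E, a surrogate pair
    }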
All of .NET, Java and JavaScript[0] allow unpaired surrogates, meaning none of them uses UTF-16. I expect the same holds for Symbian, OpenOffice, Qt, … So did Python on pre-FSR narrow builds.
Depends on your definition of “uses”, and I disagree with yours.
It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-Type: application/json and invalid utf-8 in the response.
Yes, technically these programming languages allow unpaired surrogates in strings. This is not necessarily a bad thing; there are valid uses for such strings. For example, you can concatenate strings without decoding UTF16 into code points, so the code is slightly faster, but while you’re concatenating, at some moment the destination will contain an unpaired surrogate.
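A rough Go sketch of that scenario, with []uint16 standing in for the UTF-16 code units such a string holds (the string and the split point are made up for illustration):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        units := utf16.Encode([]rune("a𝄞b")) // 0061 D834 DD1E 0062

        // Split at an arbitrary code unit index, the way substring-by-index works
        // in Java/.NET. Index 2 falls between the two halves of a surrogate pair.
        left, right := units[:2], units[2:]
        fmt.Println(utf16.IsSurrogate(rune(left[len(left)-1]))) // true: left ends with an unpaired surrogate

        // Concatenate code units directly, without decoding to code points.
        // After the first append the intermediate slice still ends with that
        // unpaired surrogate; appending right restores a valid sequence.
        dst := append(append([]uint16{}, left...), right...)
        fmt.Println(string(utf16.Decode(dst))) // "a𝄞b" again
    }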
I hope you agree golang uses utf-8 strings. Despite that, the unicode/utf8 standard package has a ValidString function, which means you can still create invalid strings in Go: https://golang.org/src/unicode/utf8/utf8.go Does that mean golang uses wtf-8 instead of utf-8?
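A minimal sketch of that (the strings are just made-up examples):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        good := "héllo"
        bad := good[:2] // slicing mid-rune leaves a truncated UTF-8 sequence

        fmt.Println(utf8.ValidString(good))       // true
        fmt.Println(utf8.ValidString(bad))        // false, yet it is a perfectly legal Go string
        fmt.Println(utf8.ValidString("\xff\xfe")) // false as well
    }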
> Depends on your definition of “uses”, and I disagree with yours.
It really does not, unless you're also disagreeing with the definition of "is". Unpaired surrogates are not valid in a UTF-16 stream. If there are unpaired surrogates it's not UTF-16. If the system normally allows and generates unpaired surrogates, it's not dealing in UTF-16.
> It’s always technically possible to make invalid data. For example, you can write a web app that will send Content-type: application/json and invalid utf-8 in the response.
And that will blow up in the client, because your shit system has sent it garbage.
> Yes, technically these programming languages allow unpaired surrogates in strings.
Making them not-UTF-16.
> This is not necessarily a bad thing, there’re valid uses for such strings.
Like taking a pile of garbage and making it into a bigger pile of garbage.
Which is not relevant to the issue at hand: such piles of crap are not UTF-16.
> For example, you can concatenate strings without decoding UTF16 into code points
That's not actually an example of your claims: you can concatenate valid UTF-16 without decoding it into code points as well.
> I hope you agree golang uses utf-8 strings.
I most certainly do not. Why would you hope I agree with an obviously incorrect statement? What is wrong with you?
> Despite that, the unicode/utf8 standard package has ValidString function, which means you can still create invalid strings
Well, you've got cause and effect reversed, but yes, you can create non-utf8 Go strings, making Go's "strings" not utf8 at all.
> Does it mean golang uses wtf-8 instead of utf-8?
No, WTF-8 is something well-defined[0].
Golang's "strings" are just random bags of bytes, much like C's, and they're no more utf-8 than C's: you may assume they are in whatever encoding you're interested in, but with no guarantee whatsoever that that's actually the case, and your assumptions may well blow up in your face.
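Concretely (the junk bytes here are just an example):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "\xc3\x28" // two bytes that are not valid UTF-8

        // Nothing stops you from constructing this or passing it around; the
        // assumption only surfaces when something actually tries to decode it.
        r, size := utf8.DecodeRuneInString(s)
        fmt.Println(r == utf8.RuneError, size) // true 1
        for _, r := range s {
            fmt.Printf("%U ", r) // U+FFFD U+0028
        }
    }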
No, just because languages allow representing invalid strings in their type system doesn’t mean they use some other encoding.
The ability to represent invalid data is often a good thing. The English language allows representing all kinds of garbage, but this is OK; see e.g. Jabberwocky by Lewis Carroll.
There are programming languages that enforce string encoding and other constraints with strict type systems and/or runtimes. They are only practical for a very limited set of problems. The majority of real-world software has to deal with invalid data: compilers do because users type invalid programs, and any sufficiently complex system does because it receives data from external components written in other languages or from I/O. I think not being able to represent invalid strings is a bug, not a feature.
Mainstream, i.e. practical, languages allow representing invalid strings, provide functionality to validate & normalize them, and often raise runtime errors when you try to use them in a way that makes this a problem. For example, .NET throws an ArgumentException saying "Invalid Unicode code point found at index ##" when you try to normalize an invalid string.
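Go, for what it's worth, leans toward explicit validate-and-repair rather than throwing; a minimal sketch (the input is a made-up example):

    package main

    import (
        "fmt"
        "strings"
        "unicode/utf8"
    )

    func main() {
        s := "abc\xffdef" // representable as a Go string, but not valid UTF-8

        if !utf8.ValidString(s) {
            // Replace invalid byte sequences with U+FFFD before handing the
            // string to anything that insists on well-formed UTF-8.
            s = strings.ToValidUTF8(s, "\uFFFD")
        }
        fmt.Println(s, utf8.ValidString(s)) // abc�def true
    }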