Description
TL;DR https://play.golang.org/p/LPkPTRF7fC
The above code looks quite plain and obvious, except it does something completely different than you'd expect (feel free to run it). The obvious thing that should happen is that it counts the number of bits set in the given string. The non-obvious thing that happens is that the mask is actually 0, not 0x01.
So, what happened there? The abuse in the above code is around the invisible Unicode characters that mark following text to be right-to-left or left-to-right. Since Go permits arbitrary Unicode characters to be present in string literals, it also allows me to have a string of the form "bla bla blaabc". Since we're dealing with valid Unicode sequences here, any modern editor/website will actually interpret those special characters, causing the content in between the two special marks to be reversed to the end of the line (alas still part of the string literal).
In my playground code this is abused by having the following source code:
str, mask := "Hello, World!<rtl>10x<ltr>", 0
Which will be displayed by all modern editors/websites as:
str, mask := "Hello, World!", 0x01
The security aspect of this issue is social engineering attacks. The "display" line of my sample code is obvious beyond doubt, so noone will ever inspect such a thing; however it managed to flip one bit in a mask (imagine doing this for file permission umasks). The issue is that such code could easily get past reviews and into a final product.
The tricky part is how to avoid these issues. My only meaningful suggestion would be to disallow these two special characters in string literals. This does break Go 1.0 compatibility guarantees, however I think it's worth it (I can't really figure out a meaningful use case for it). Using it in a single line string will screw up the display of the source code, so it doesn't make sense to do it, and using it in a multi line string is questionable at best. I think requiring users to explicitly use \x notation for adding these characters it a good compromise to protect source code sanity.
Activity
cznic commentedon May 3, 2017
String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.
I understand the problem (nice trick BTW), but I don't think this is a way to solve it.
bradfitz commentedon May 3, 2017
@cznic,
Yes, but not in their raw form. Only if they're escaped.
Even today:
... you need to escape the double quote. We could require that RTL/LTR characters also need to be escaped.
cznic commentedon May 3, 2017
Yes, but that's unfortunately only a partial solution: https://play.golang.org/p/ZL_QF7xc-e.
karalabe commentedon May 3, 2017
bradfitz commentedon May 3, 2017
@karalabe, yes.
cznic commentedon May 3, 2017
I don't like to say that, but then we are back to square 1 (modulo the backtick). To make things even worse, there's also the Go 1 compatibility promise and I'm quite unsure if such breaking change qualifies under it.
karalabe commentedon May 3, 2017
That's the catch 22 :)
karalabe commentedon May 3, 2017
Perhaps a nice compromise (alas it makes for ugly parsing rules and introduces corner cases) is to forbid the two characters in single line strings (irrelevant of the type) and only allow it in multiline strings for lines that are fully inside (i.e. not the same line as the opening or closing tick is on). In reality I think it's enough to disallow for the closing tick's line.
These "forbidden" cases would already even now screw up the code so I cannot imagine anyone benevoletly using the in such a way. Yet by limiting to the closing mark only, we can retain the functionality of rtl characters that some may have in backtick code.
All in all personally I'd just forbid everywhere but I'm not accustomed to rtl use cases so I can't really comment on how people use them in completely valid and meaningful ways.
jimmyfrasche commentedon May 3, 2017
This could be handled by a linter of some kind, but then people need to be aware of that tool and run it.
I know this doesn't match all the criteria for go vet, but adding it there may be best.
If go vet is added to go test per #18084 it would be run by default in a lot of situations making it harder to sneak this into a codebase.
Of course, that doesn't close the hole: it just makes it smaller.
Given how unlikely this is to be done on purpose without malicious intent, breaking Go 1 (which is allowed when it comes to security) and making it illegal across the board probably wouldn't cause anyone any pain.
randall77 commentedon May 3, 2017
I think solving this problem in the language is the wrong place.
How about in your code review tool? It would be as simple as highlighting weird/unusual UTF8, homographs, ... Then a human can decide whether it is ok or not.
If you're not doing code reviews, you have bigger problems.
karalabe commentedon May 3, 2017
GitHub flat out stated they don't care about these kinds of attack vectors. Given that most Go code is hosted on GitHub, most reviews will be done there too.
Also these two characters are not homoglyphs, they are invisible.
20 remaining items
rsc commentedon Jun 13, 2017
Does someone want to come up with a list of what a vet "unicode" check would check?
rsc commentedon Oct 10, 2017
I think we agree that the first step is to add a vet check and only after building experience with it think about actual language restrictions (or not). Marking this proposal-hold until there is a proposal (a short list would be fine) of what a vet "unicode" check would check.
/cc @nigeltao @mpvl
dolmen commentedon Aug 13, 2020
In #40717 I propose 3 codepoints to add to that list for a vet check: U+115F, U+1160, U+3164
holiman commentedon Nov 1, 2021
"‘Trojan Source’ Bug Threatens the Security of All Code": https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/
Apparently some researchers with the University of Cambridge have given this (type of) bug a cool-sounding name now.
komuw commentedon Nov 1, 2021
https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html
lint: add linter for unicode directional formatting characters