proposal: spec: disallow LTR/RTL characters in string literals?

TL;DR https://play.golang.org/p/LPkPTRF7fC

The above code looks quite plain and obvious, except it does something completely different than you'd expect (feel free to run it). The obvious thing that should happen is that it counts the number of bits set in the given string. The non-obvious thing that happens is that the mask is actually 0, not 0x01.

So, what happened there? The abuse in the above code is around the invisible Unicode characters that mark following text to be right-to-left or left-to-right. Since Go permits arbitrary Unicode characters to be present in string literals, it also allows me to have a string of the form "bla bla blaabc". Since we're dealing with valid Unicode sequences here, any modern editor/website will actually interpret those special characters, causing the content in between the two special marks to be reversed to the end of the line (alas still part of the string literal).

In my playground code this is abused by having the following source code:

str, mask := "Hello, World!<rtl>10x<ltr>", 0

Which will be displayed by all modern editors/websites as:

str, mask := "Hello, World!", 0x01

The security aspect of this issue is social engineering attacks. The "display" line of my sample code is obvious beyond doubt, so noone will ever inspect such a thing; however it managed to flip one bit in a mask (imagine doing this for file permission umasks). The issue is that such code could easily get past reviews and into a final product.

The tricky part is how to avoid these issues. My only meaningful suggestion would be to disallow these two special characters in string literals. This does break Go 1.0 compatibility guarantees, however I think it's worth it (I can't really figure out a meaningful use case for it). Using it in a single line string will screw up the display of the source code, so it doesn't make sense to do it, and using it in a multi line string is questionable at best. I think requiring users to explicitly use \x notation for adding these characters it a good compromise to protect source code sanity.

String literals must be able to hold any byte sequence whatsoever. Using a string to hold binary data is not uncommon and this proposal will introduce a forbidden byte sequence.

I understand the problem (nice trick BTW), but I don't think this is a way to solve it.

@cznic

@cznic,

String literals must be able to hold any byte sequence whatsoever.

Yes, but not in their raw form. Only if they're escaped.

Even today:

      s := "`\"`"

... you need to escape the double quote. We could require that RTL/LTR characters also need to be escaped.

We could require that RTL/LTR characters also need to be escaped.

Yes, but that's unfortunately only a partial solution: https://play.golang.org/p/ZL_QF7xc-e.

It would need to be forbidden from `` as well.

…

On May 2, 2017 19:01, "cznic" ***@***.***> wrote: We could require that RTL/LTR characters also need to be escaped. Yes, but that's unfortunately only a partial solution: https://play.golang.org/p/ZL_QF7xc-e. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#20209 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAH6Gbwbo-lFuOIrMxjxn-Vhz7D-cgn0ks5r11N1gaJpZM4NOMYM> .

@karalabe

@karalabe, yes.

It would need to be forbidden from `` as well.

I don't like to say that, but then we are back to square 1 (modulo the backtick). To make things even worse, there's also the Go 1 compatibility promise and I'm quite unsure if such breaking change qualifies under it.

That's the catch 22 :)

Perhaps a nice compromise (alas it makes for ugly parsing rules and introduces corner cases) is to forbid the two characters in single line strings (irrelevant of the type) and only allow it in multiline strings for lines that are fully inside (i.e. not the same line as the opening or closing tick is on). In reality I think it's enough to disallow for the closing tick's line.

These "forbidden" cases would already even now screw up the code so I cannot imagine anyone benevoletly using the in such a way. Yet by limiting to the closing mark only, we can retain the functionality of rtl characters that some may have in backtick code.

All in all personally I'd just forbid everywhere but I'm not accustomed to rtl use cases so I can't really comment on how people use them in completely valid and meaningful ways.

This could be handled by a linter of some kind, but then people need to be aware of that tool and run it.

I know this doesn't match all the criteria for go vet, but adding it there may be best.

If go vet is added to go test per #18084 it would be run by default in a lot of situations making it harder to sneak this into a codebase.

Of course, that doesn't close the hole: it just makes it smaller.

Given how unlikely this is to be done on purpose without malicious intent, breaking Go 1 (which is allowed when it comes to security) and making it illegal across the board probably wouldn't cause anyone any pain.

I think solving this problem in the language is the wrong place.
How about in your code review tool? It would be as simple as highlighting weird/unusual UTF8, homographs, ... Then a human can decide whether it is ok or not.

If you're not doing code reviews, you have bigger problems.

GitHub flat out stated they don't care about these kinds of attack vectors. Given that most Go code is hosted on GitHub, most reviews will be done there too.

Also these two characters are not homoglyphs, they are invisible.

Does someone want to come up with a list of what a vet "unicode" check would check?

@nigeltao

I think we agree that the first step is to add a vet check and only after building experience with it think about actual language restrictions (or not). Marking this proposal-hold until there is a proposal (a short list would be fine) of what a vet "unicode" check would check.

/cc @nigeltao @mpvl

In #40717 I propose 3 codepoints to add to that list for a vet check: U+115F, U+1160, U+3164

"‘Trojan Source’ Bug Threatens the Security of All Code": https://krebsonsecurity.com/2021/11/trojan-source-bug-threatens-the-security-of-all-code/

Apparently some researchers with the University of Cambridge have given this (type of) bug a cool-sounding name now.

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

lint: add linter for unicode directional formatting characters

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: spec: disallow LTR/RTL characters in string literals? #20209

20 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

proposal: spec: disallow LTR/RTL characters in string literals? #20209

Description

Activity

cznic commented on May 3, 2017

bradfitz commented on May 3, 2017

cznic commented on May 3, 2017

karalabe commented on May 3, 2017

bradfitz commented on May 3, 2017

cznic commented on May 3, 2017

karalabe commented on May 3, 2017

karalabe commented on May 3, 2017

jimmyfrasche commented on May 3, 2017

randall77 commented on May 3, 2017

karalabe commented on May 3, 2017

20 remaining items

rsc commented on Jun 13, 2017

rsc commented on Oct 10, 2017

dolmen commented on Aug 13, 2020

holiman commented on Nov 1, 2021

komuw commented on Nov 1, 2021

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions