The boundaries which divide Life from Death are at best shadowy and vague. Who shall say where the one ends and where the other begins?
This post looks at the details of how a program running under Windows splits its command line into individual arguments.
Parsers
We’re only going to consider the two most common command line parsers: parse_cmdline and CommandLineToArgvW. The first, parse_cmdline, is part of the Microsoft CRT library initialization code and is called automatically at program startup to generate argv[]. It has both an ANSI and a Unicode version. CommandLineToArgvW is called explicitly by the programmer and has just a Unicode version. There is no standard function to un-parse a command line, that is, one that takes a set of arguments and outputs a command line that regenerates the same set of arguments (something like ArgvToCommandLine), but I will give you a sample utility that does just that.
Whitespace Defined
I already mentioned in my previous post that the command line is divided into arguments at whitespace boundaries by both parse_cmdline and CommandLineToArgvW. Conceptually the process is simple— the parser scans the command line string from left to right looking for sequences of one or more whitespace characters which it discards. The substrings that remain between these blocks of whitespace become the individual arguments. This simplistic description ignores for the time being the handling of special characters like double quote and does not take into account the special handling of the first argument. We will discuss these issues later.
Both parse_cmdline and CommandLineToArgvW define whitespace as just space (0x20) and horizontal tab (0x09).
Given that whitespace is afforded special treatment, the obvious question is how do you supply an argument that contains literal whitespace characters and prevent them from being interpreted as an argument separator? We’ve already seen the answer: add a double quote character ["] to the command line to switch the parser state from InterpretSpecialChars to IgnoreSpecialChars. This has to be done somewhere after the previous separator whitespace, and before the literal space. After encountering this double quote character (which is removed) the parser stops recognizing whitespace as an argument separator. A subsequent double quote character (also removed) re-enables this recognition. I will show later how this toggling behavior can be selectively overridden.
The following has already been stated (with different wording), but deserves repeating:
Quoting serves only to tell the parser to stop or restart interpreting whitespace as an argument separator. It does not determine where an argument begins or ends.
You do not need to enclose an entire argument containing spaces in double quotes— just make sure the recognition of whitespace as an argument separator is disabled (state is IgnoreSpecialChars) prior to any literal whitespace.
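To make the toggling concrete, here is a minimal sketch that models only what has been described so far: splitting on space and tab, with a double quote switching between the two states. It deliberately ignores backslashes and the first-argument rules, and the function name split_with_quote_toggle is mine, not something from the CRT or the Windows API.

```python
def split_with_quote_toggle(cmdline: str) -> list[str]:
    """Split on space/tab; a double quote only toggles the parser state."""
    args: list[str] = []
    current: list[str] = []
    started = False          # True once the current argument has begun
    ignore_special = False   # False = InterpretSpecialChars, True = IgnoreSpecialChars
    for ch in cmdline:
        if ch == '"':
            ignore_special = not ignore_special  # toggle; the quote itself is dropped
            started = True
        elif ch in " \t" and not ignore_special:
            if started:                          # separator whitespace ends the argument
                args.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)                   # includes whitespace while ignoring
            started = True
    if started:
        args.append("".join(current))
    return args


print(split_with_quote_toggle('one tw"o th"ree four'))
```

Note that the quotes sit in the middle of the second argument. They do not mark where it begins or ends; they only protect the one space between them.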
Escaping Defined
Since the double quote character is now given special meaning, you have a problem similar to the one you had for whitespace: how do you insert a literal double quote character without toggling the parser state?
This is where the concept of escaping comes in. Another special character is defined, the backslash ([\]), which means don’t interpret the next character as special. If you want to include a literal double quote character, whether the state is InterpretSpecialChars or IgnoreSpecialChars, just precede it by a backslash: [\"]. The word escape reflects the idea that you are temporarily escaping from the normal flow of processing. The concept is subtly different from that of quoting because it’s active for only a single character after which the state automatically switches back to normal processing. Double quotes on the other hand switch and latch the parser state until another double quote character is seen.
Once again this raises the question: how do you include a real escape character? It’s beginning to feel like we’re never going to get off this treadmill! As soon as we define a way to handle a special case, our so-called fix introduces a new special case. We turned whitespace as the special case into double quote as the special case, then double quote as the special case into backslash as the special case.
Luckily for us, the chain is broken at this point through a simple mechanism: allow the escape character to escape itself! Conceptually, it works a lot like string constants in source code— to include a literal backslash, just supply two backslashes: [first\\second] for [first\second]. No new special case is introduced.
In an ideal world that’s all there would be to it. You would separate arguments with whitespace, protect literal whitespace by including a double quote character to enter IgnoreSpecialChars, place a backslash before any double quote that you don’t want to change the parser state, and always double up (or self escape) backslashes that are not escaping something.
Not so fast! Dealing With a Proliferation of Backslashes
But we don’t live in an ideal world. If you read the Microsoft documentation, you’ll see a set of rules that complicate this ideal behavior. I’m going to explain what I think is the reasoning behind some of these rules with the hope that doing so will help you understand and remember them better.
But first I’ll just list the Microsoft rules from the documentation, with a few comments of my own added in parentheses and marked "Comment:".
- Arguments are delimited by white space, which is either a space or a tab.
- The caret character (^) is not recognized as an escape character or delimiter. The character is handled completely by the command-line parser in the operating system before being passed to the argv array in the program. (Comment: this is a potentially confusing mix of information about cmd.exe and the executable’s processing of the command line. You should keep these separate in your mind!)
- A string surrounded by double quotation marks ("string") is interpreted as a single argument, regardless of white space contained within. A quoted string can be embedded in an argument. (Comment: what exactly does this mean? See the discussion below.)
- A double quotation mark preceded by a backslash (\") is interpreted as a literal double quotation mark character ("). (Comment: this is independent of the parser state.)
- Backslashes are interpreted literally, unless they immediately precede a double quotation mark.
- If an even number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is interpreted as a string delimiter. (Comment: this reinforces the error-prone concept that double quotes enclose some meaningful string, or, worse, a complete argument.)
- If an odd number of backslashes is followed by a double quotation mark, one backslash is placed in the argv array for every pair of backslashes, and the double quotation mark is “escaped” by the remaining backslash, causing a literal double quotation mark (") to be placed in argv.
- (there is one more undocumented rule which we will discuss later)
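The two even/odd rules boil down to a single integer division. As a small sketch (the helper name backslashes_before_quote is hypothetical and only illustrates the documented rules for a run of backslashes that is immediately followed by a double quote):

```python
def backslashes_before_quote(count: int) -> tuple[str, bool]:
    """Apply the even/odd rule to `count` backslashes followed by a double quote.

    Returns (backslashes placed in argv, True if the quote is a literal quote).
    """
    pairs, leftover = divmod(count, 2)
    # Every pair becomes one literal backslash; a leftover backslash escapes the quote.
    return "\\" * pairs, leftover == 1


for n in range(5):
    print(n, backslashes_before_quote(n))
```

An even count leaves the double quote free to toggle the parser state; an odd count turns it into a literal character.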
I don’t know what is meant by the statement "a quoted string can be embedded in an argument" (from the third rule). To me it suggests you can do something like the following and still have the entire string interpreted as a single argument:
["She said "you can't do this!", didn't she?"]
But this is interpreted as four arguments:
argv[1] = [She said you]
argv[2] = [can't]
argv[3] = [do]
argv[4] = [this!, didn't she?]
Or maybe it means you can embed a quoted string in the middle if you’re not already in a quoted string? That’s true, but then you can’t have spaces elsewhere:
[Shesaid"you can't do this!",didn'tshe?]
Not quite.
So just forget about quoted strings and embedded quotes.
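The first command line above happens to be a good example to demonstrate some of the things I have been saying. I’ll show the parse image first then point out just the interesting parts:
(Figure: parse diagram of that command line, colored by parser state: green for InterpretSpecialChars, red for IgnoreSpecialChars. The numbered points in the diagram legend are:)
- 3) Switch to IgnoreSpecialChars
- 4) Switch to InterpretSpecialChars
- 8) Single double quote character- Enter IgnoreSpecialChars
- 9) Single double quote character- Return to InterpretSpecialChars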
The second double quote character, immediately before [you], does not end the first argument. This particular situation seems to cause conceptual difficulty for some people.
The double quote character simply causes the parser to leave IgnoreSpecialChars and re-enter InterpretSpecialChars (indicated by the change from red to green). Because there is no whitespace, the argument continues. But because the parser is now in InterpretSpecialChars, the space between [you] and [can't] does start a new argument. So does the space between [can't] and [do], as well as between [do] and [this!].
Finally, the third double quote character (before the comma) causes another transition into IgnoreSpecialChars, preventing any further arguments from being generated.
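The same walk-through can be reproduced with a small tracer. This is only a sketch: it looks at double quotes and whitespace and nothing else (no backslash handling), and trace_states is a name I made up for illustration.

```python
def trace_states(cmdline: str) -> list[str]:
    """Report what each double quote and each separating space/tab does."""
    events: list[str] = []
    ignore = False  # False = InterpretSpecialChars, True = IgnoreSpecialChars
    for index, ch in enumerate(cmdline):
        if ch == '"':
            ignore = not ignore
            state = "IgnoreSpecialChars" if ignore else "InterpretSpecialChars"
            events.append(f"{index}: quote -> {state}")
        elif ch in " \t" and not ignore:
            events.append(f"{index}: whitespace starts a new argument")
    return events


for event in trace_states('"She said "you can\'t do this!", didn\'t she?"'):
    print(event)
```

Reading the output from top to bottom gives the sequence described above: the second quote returns to InterpretSpecialChars, three spaces then split arguments, and the third quote switches back to IgnoreSpecialChars for the rest of the line.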
Backslash Rules
Backslashes are very common because they’re the file system path separator under Windows. The writers of the C/C++ runtime library must have realized that it’s not only inconvenient, but also just plain ugly to have to fully escape these backslashes:
A UNC path such as:
[\\SomeComputer\subdir1\subdir2\]
Would need to be passed to CreateProcess in the command line string as:
[\\\\SomeComputer\\subdir1\\subdir2\\]
Which would in turn be expressed in source code as:
["\\\\\\\\SomeComputer\\\\subdir1\\\\subdir2\\\\"]
Ouch!
I believe that it was to avoid this proliferation of backslashes that Microsoft introduced the rule that backslashes are interpreted literally (except when they precede a double quote character).
But this rule results in an ambiguity: when the parser encounters [\"], is it an escaped double quote or a backslash followed by an un-escaped quote?
The strange-sounding rules for backslashes resolve the ambiguity. They might be more easily understood if you think of an escaped double quote as a single special compound character [\"] (call it an e-quote) instead of two separate characters, and re-word the rule (in terms of creating the command line instead of parsing it) to read something like: Escape any literal backslashes that occur immediately prior to either a double quote or an e-quote character so they’re not interpreted as an escape character.
The intent may have been to facilitate the quoting of paths generated or received by a script or executable (not a human)— i.e., to allow you to always unconditionally enclose paths in double quote characters without further escaping. But doing so fails if the path contains a trailing backslash because the backslash is interpreted as an escape, as we have seen.
The following example illustrates the problem. The command line is unexpectedly split into just the two arguments shown. The final two components are appended to the second argument because the parser state is still IgnoreSpecialChars after the escaped double quote character and remains so until the end of the command line, where the parser ends in the IgnoreSpecialChars state. You should think of it that way instead of as an un-closed quoted string that is implicitly closed at the end of the command line:
Command line:
[test.exe "c:\Path With Spaces\Ending In Backslash\" Arg2 Arg3]
Actual arguments generated:
[test.exe] [c:\Path With Spaces\Ending In Backslash" Arg2 Arg3]
Probably what was expected:
[test.exe] [c:\Path With Spaces\Ending In Backslash\] [Arg2] [Arg3]
You might think you can avoid this by always stripping trailing backslashes. But doing so fails for one important special case— a root directory:
The path
[c:\]
(the root directory of drive c:) has a different meaning than
[c:]
(the current working directory on drive c:, not necessarily the root).
You can unconditionally remove trailing backslashes from paths, except for a root directory which must be explicitly handled as a special case.
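An alternative to stripping is to apply the re-worded rule from earlier when building the command line: double any backslashes that end up immediately before a double quote, including the closing one. The following is a sketch under that assumption only. quote_arg is a hypothetical helper, not the sample utility mentioned in this post and not a Windows API, and it always adds quotes even when they are not strictly needed.

```python
def quote_arg(arg: str) -> str:
    """Wrap arg in double quotes so the backslash rules reproduce it unchanged."""
    out = ['"']
    pending = 0  # backslashes seen since the last non-backslash character
    for ch in arg:
        if ch == "\\":
            pending += 1
        elif ch == '"':
            # Double the pending backslashes, then add one more to escape the quote.
            out.append("\\" * (pending * 2 + 1) + '"')
            pending = 0
        else:
            # Backslashes not followed by a quote are literal and stay as they are.
            out.append("\\" * pending + ch)
            pending = 0
    # Backslashes in front of the closing quote must be doubled as well.
    out.append("\\" * (pending * 2) + '"')
    return "".join(out)


print(quote_arg("c:\\Path With Spaces\\Ending In Backslash\\"))
print(quote_arg("c:\\"))
```

With this approach the root directory needs no special case: [c:\] becomes ["c:\\"], which parses back to [c:\].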
Parser Specifics
So far we’ve covered that the command line is split on whitespace, that you can disable or re-enable (toggle) this splitting using a double quote character, and that you can mask the special toggling behavior of the double quote character by escaping it with a backslash.
We also covered the special rules Microsoft introduced for backslashes to avoid the need to escape backslashes in paths.
We’re now going to delve into the specifics of the parsers themselves.
The following pseudocode is for both parse_cmdline and CommandLineToArgvW. It ignores the special handling of the first argument. Except for one very minor difference (marked by a comment in the pseudocode), it is the same for both. Ironically, the one situation where the behavior is not the same involves the undocumented rule I alluded to when listing the Microsoft rules, above.
I have not rigorously verified the following pseudocode. If you have Visual Studio installed you can find the actual source code for parse_cmdline in the file stdargv.c in the CRT source directory.
One aspect of parse_cmdline that I do not cover is the expansion of wildcards (* and ?), a feature that must be compiled into the executable.
State = InterpretSpecialChars
advance past leading whitespace (space or tab)
while (command line string not finished) {
    count and advance past leading backslashes
    if (current character is ["]) {
        for each pair of leading backslashes counted, output a single [\]
        if (a backslash is left over) {
            skip the leftover [\] and append ["] to the current argument
        } else if (the current ["] is followed by a second ["]) && (State == IgnoreSpecialChars) {
            skip the first ["] and append the second ["] to the current argument
            if (parser is CommandLineToArgvW) {
                // parse_cmdline remains in IgnoreSpecialChars
                State = InterpretSpecialChars
            }
        } else {
            // toggle parser state
            if (State == InterpretSpecialChars) {
                State = IgnoreSpecialChars
            } else {
                State = InterpretSpecialChars
            }
        }
    } else {
        for each leading backslash counted, output a single [\]
        if (current character is space or tab) && (State == InterpretSpecialChars) {
            advance past the whole run of whitespace and start a new argument
        } else {
            append the current character to the current argument
        }
    }
}
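For experimenting, the pseudocode can be turned into a few lines of Python. Treat this as a sketch of the rules as described in this post, not as a port of the real CRT or shell32 source: parse_rest and its argvw_variant flag are names I invented, the first argument is not handled, and whitespace seen while in IgnoreSpecialChars is kept as literal text, as the earlier sections describe.

```python
def parse_rest(cmdline: str, argvw_variant: bool = False) -> list[str]:
    """Parse the arguments that follow the first one.

    argvw_variant=False takes the parse_cmdline branch of the pseudocode,
    argvw_variant=True takes the CommandLineToArgvW branch.
    """
    args: list[str] = []
    cur: list[str] = []
    started = False   # True once the current argument has begun
    ignore = False    # False = InterpretSpecialChars, True = IgnoreSpecialChars
    i, n = 0, len(cmdline)
    while i < n:
        run_start = i
        while i < n and cmdline[i] == "\\":   # count a run of backslashes
            i += 1
        slashes = i - run_start
        if i < n and cmdline[i] == '"':
            cur.append("\\" * (slashes // 2))  # one backslash per pair
            started = True
            if slashes % 2:
                cur.append('"')                # leftover backslash escapes the quote
            elif ignore and i + 1 < n and cmdline[i + 1] == '"':
                cur.append('"')                # undocumented rule: "" gives one quote
                i += 1                         # skip the first quote of the pair
                if argvw_variant:
                    ignore = False             # parse_cmdline stays in IgnoreSpecialChars
            else:
                ignore = not ignore            # plain quote toggles the state
            i += 1
            continue
        if slashes:
            cur.append("\\" * slashes)         # backslashes not before a quote are literal
            started = True
        if i >= n:
            break
        ch = cmdline[i]
        if ch in " \t" and not ignore:
            if started:                        # separator: close the current argument
                args.append("".join(cur))
                cur, started = [], False
        else:
            cur.append(ch)
            started = True
        i += 1
    if started:
        args.append("".join(cur))
    return args


twelve_quotes = "foo" + '"' * 12 + "bar"
print(parse_rest(twelve_quotes))                      # parse_cmdline branch
print(parse_rest(twelve_quotes, argvw_variant=True))  # CommandLineToArgvW branch
```

The single if on argvw_variant is the whole difference between the two parsers in this model; everything else is shared.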
Other than how the first argument is handled (discussed later), the only difference between parse_cmdline and CommandLineToArgvW is what the parser does after it encounters two double quote characters in a row when the state is IgnoreSpecialChars. Both parsers treat the first double quote character as a kind of escape for the second one and discard it (this is the undocumented rule). The second double quote causes CommandLineToArgvW to exit IgnoreSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.
You are not likely to encounter a practical command line that causes the two parsers to generate different results. Because I have seen numerous examples of contrived command lines that demonstrate the difference, and people seem to be interested in how to explain them, I’m going to go over one extreme example that I encountered on the Internet:
[DumpArgs foo""""""""""""bar]
parse_cmdline generates the following arguments for the prior command line:
[DumpArgs] [foo"""""bar] (5 literal double quote characters)
But CommandLineToArgvW generates these:
[DumpArgs] [foo""""bar] (only 4 literal double quote characters)
I’ve seen people try, unsuccessfully, to explain this example by the usual method of looking for pairs of double quote characters. But if you simply scan from left to right as I have shown, tracking the parser state, you’ll find that the difference is due to the fact that CommandLineToArgvW exits then re-enters IgnoreSpecialChars repeatedly (every time it encounters two consecutive double quote characters), but parse_cmdline only does so one time.
This will be easier to visualize by seeing the diagrams.
First we see how parse_cmdline parses the example. It removes a double quote both when entering and when leaving IgnoreSpecialChars (at points 3 and 4). While the state remains IgnoreSpecialChars, it removes one of each of the five pairs of double quote characters (the points marked 6), for a total of seven double quote characters removed:
(Figure: parse diagram of the example as processed by parse_cmdline)
Next we see how CommandLineToArgvW parses the same example. Like parse_cmdline, this parser removes a double quote character every time it enters or leaves IgnoreSpecialChars. Unlike parse_cmdline, where IgnoreSpecialChars is entered and left only once, here it happens four times each, at the points marked 3 (enter) and 5 (leave). So eight double quote characters are removed instead of just the seven that parse_cmdline removes:
(Figure: parse diagram of the example as processed by CommandLineToArgvW. The numbered points in the diagram legend are:)
- 3) Switch to IgnoreSpecialChars
- 4) Switch to InterpretSpecialChars
- 5) First of 2 double quote characters in a row escapes next double quote. Next double quote switches to InterpretSpecialChars
- 6) First of 2 double quote characters in a row escapes next double quote
- 7) CommandLineToArgvW: saw 2 double quote characters in a row. Return to InterpretSpecialChars
- 8) Single double quote character- Enter IgnoreSpecialChars
- 9) Single double quote character- Return to InterpretSpecialChars
To convince you this is not just academic, the next example is my attempt to come up with a plausible real-world scenario (though to me it still seems unlikely to occur):
Suppose we have a processing pipeline where the first program generates the string [hello world]. The next 2 stages in the pipeline blindly double-quote the argument and pass it on, generating first ["hello world"], then [""hello world""]. Finally, the last stage double quotes the entire command line and passes it on to FinalProgram.exe:
[FinalProgram.exe "first second ""embedded quote"" third"]
Command Line Arguments From CommandLineToArgvW:
arg 0 = [FinalProgram.exe]
arg 1 = [first second "embedded]
arg 2 = [quote]
arg 3 = [third]
Command Line Arguments From argv Array (argc = 2):
argv[0] = [FinalProgram.exe]
argv[1] = [first second "embedded quote" third]
Here, we again see the undocumented rule come into play— two double quote characters in a row while the state is IgnoreSpecialChars are interpreted as an escaped double quote character. Both parsers consume the first one and output the second one. But the difference is that for CommandLineToArgvW, there is a transition back to InterpretSpecialChars, while parse_cmdline remains in IgnoreSpecialChars.
First Command Line Argument
Both parsers process the first argument differently than the remainder of the command line. There may be a good reason for this, but I can’t think of one and I think it just causes additional confusion without providing much, if any, benefit.
To compound the confusion, there is a much greater difference between the two parsers for the first argument than there is for the remainder of the command line.
I will discuss the two parsers separately.
Pseudocode for parse_cmdline Handling of First Argument
The following is pseudocode for parse_cmdline:
ParserState = InterpretSpecialChars
loop while not end of command line and not end of argv[0] {
    if (char is space or tab) and (ParserState is InterpretSpecialChars) {
        Overwrite the whitespace char with string terminator
        End argv[0]
    } else if (char is ["]) {
        Toggle ParserState
        Discard the ["]
    } else {
        Append char to argv[0]
    }
}
parse remainder of command line normally
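Here is the same first-argument logic as a short Python sketch. It follows the pseudocode above and nothing more, so it has not been checked against the real CRT source, and first_arg_parse_cmdline is a made-up name. It returns argv[0] together with the text that is left for normal parsing.

```python
def first_arg_parse_cmdline(cmdline: str) -> tuple[str, str]:
    """Return (argv[0], remainder of the command line)."""
    out: list[str] = []
    ignore = False  # False = InterpretSpecialChars, True = IgnoreSpecialChars
    i = 0
    while i < len(cmdline):
        ch = cmdline[i]
        i += 1
        if ch in " \t" and not ignore:
            break                  # separator ends argv[0] and is consumed
        if ch == '"':
            ignore = not ignore    # every quote toggles the state and is discarded
        else:
            out.append(ch)         # backslashes are ordinary characters here
    return "".join(out), cmdline[i:]


print(first_arg_parse_cmdline('"C:\\Program Files\\App\\app.exe" /flag'))
```

There is no backslash counting and no double-quote pairing in this function, which is why neither of the usual escaping methods works in the first argument.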
Notes for parse_cmdline
- You can enter and exit IgnoreSpecialChars as many times as you want, the same as when parsing the remainder of the command line. The double quote characters are removed.
The following command line:
["F"i"r"s"t S"e"c"o"n"d" T"h"i"r"d"]
generates just one argument (when it appears at the start of the command line):
[First Second Third]
If you trace it carefully you will find that the state is IgnoreSpecialChars when each of the two spaces is encountered.
- You cannot use either one of the usual ways of escaping double quote characters (with a backslash or with another double quote character):
The backslash in the following does not escape the double quote that follows it and the two pairs of back-to-back double quote characters do nothing (they’re simply removed because they just cause the state to toggle then immediately toggle back):
[F""ir"s""t \"Second Third"]
Even though the 6th double quote looks like it’s escaped, the backslash does not escape anything when parsing the first argument. Therefore this double quote causes the state to change back to InterpretSpecialChars and the following space ends the first argument. For the same text later in the command line, the backslash would escape that double quote and parse_cmdline would generate a single argument ([Firs"t "Second Third], because the second pair of back-to-back double quotes falls under the undocumented rule there and yields a literal double quote). Here, two arguments are generated instead:
argv[0] = [First \Second]
argv[1] = [Third]
- If the first character of the command line is a space or tab, an empty first argument is generated and the remainder of the command line is parsed normally:
This command line:
[ Something Else]
Generates these 3 arguments:
argv[0] = []
argv[1] = [Something]
argv[2] = [Else]
Pseudocode for CommandLineToArgvW Handling of First Argument
The following is pseudocode for CommandLineToArgvW:
if ((first char >= 0x01) and (first char <= 0x20)) {
    arg[0] is empty string
} else if (first char is ["]) {
    Discard the ["]
    loop while not end of arg[0] {
        if (char is ["]) {
            Discard the ["]
            End arg[0]
        } else if end of command line {
            End arg[0]
        } else {
            Append char to arg[0]
        }
    }
} else {
    loop while not end of arg[0] {
        if (char >= 0x01 and char <= 0x20) {
            Discard char
            End arg[0]
        } else if end of command line {
            End arg[0]
        } else {
            Append char to arg[0]
        }
    }
}
parse remainder of command line normally
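A Python sketch of the same three branches follows. It is based only on the pseudocode above, not on the actual CommandLineToArgvW implementation, and first_arg_argvw is a name I chose. Two details are my assumptions because the pseudocode does not spell them out: the leading character that produces the empty first argument is consumed, and an empty command line yields an empty first argument.

```python
def first_arg_argvw(cmdline: str) -> tuple[str, str]:
    """Return (arg[0], remainder of the command line)."""

    def is_low(ch: str) -> bool:
        return "\x01" <= ch <= "\x20"   # every character from 0x01 through 0x20

    if not cmdline:
        return "", ""                   # assumption: nothing to parse
    if is_low(cmdline[0]):
        return "", cmdline[1:]          # assumption: the leading character is consumed
    if cmdline[0] == '"':
        end = cmdline.find('"', 1)      # only the next quote can end arg[0]
        if end == -1:
            return cmdline[1:], ""      # no closing quote: runs to end of line
        return cmdline[1:end], cmdline[end + 1:]
    for i, ch in enumerate(cmdline):
        if is_low(ch):                  # quotes are ordinary characters in this branch
            return cmdline[:i], cmdline[i + 1:]
    return cmdline, ""


print(first_arg_argvw('"C:\\Program Files\\App\\app.exe" /flag'))
```

Compared with the parse_cmdline version, there is no state to toggle: the first character alone decides which of the three branches handles the whole first argument.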
Notes for CommandLineToArgvW
CommandLineToArgvW processing of the first argument has some (at least to me) strange behavior, most notably the way it sometimes accepts any character between 0x01 and 0x20 as whitespace.
- The same as for parse_cmdline, if the first character of the command line is whitespace, an empty first argument is generated and the remainder of the command line is parsed normally. However, any character between 0x01 and 0x20, inclusive, is considered whitespace!
This command line:
[  Something Else]
Generates these 3 arguments:
argv[0] = []
argv[1] = [Something]
argv[2] = [Else]
The first character is shown as a space, but any character between 0x01 and 0x20 will cause an empty first argument. If the second character, again shown as a space, is something in the same range, other than a space or tab, it will become the first character of the next argument.
- If the first character is a double quote then any non-zero character is accepted as part of the first argument (even 0x01 through 0x20) until another double quote or the end of the command line is encountered.
If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20):
["123 456*abc\def"ghi]
It generates these arguments:
[123 456*abc\def] [ghi]
(it is correct that [ghi] is a new argument even though there is no whitespace after the second double quote)
- If the first character is not a double quote and not a space, then any non-whitespace character, including a double quote, is accepted as part of the argument. Here also, whitespace is defined as any character between 0x01 and 0x20, inclusive (not just space and tab).
If the [*] in the next line is really \x05 (or any other character in the range 0x01 through 0x20), then it acts as an argument separator:
[123"456"*abc]
These two arguments are generated:
[123"456"] [abc]
You can download a sample CommandLines.txt file containing all the sample command lines in this post. It can be used with the RunTest utility. The file will need to be renamed as CommandLines.txt before using it.
The last thing I need to cover is the behavior of cmd.exe and batch files. I hope to get this posted in the next week or so. See you next time!