Re: Bug report: Concat operator parsing error.

On Mon, Nov 1, 2021 at 3:49 PM Alex Light <allight@google.com> wrote:

Interesting. I wasn't aware that '0.' was a valid number token (in Lua or elsewhere). This does still seem like a situation where the '..' token should have a higher precedence than the number token in parsing, similar to how '..' has higher precedence than the '.' token when parsing names (so 'x..y' is parsed correctly as '(concat x y)'). In any event, this was just a minor inconsistency I noticed with Lua, as it's the only place outside of reserved words where spaces are meaningful as far as I can tell.

-Alex

On Mon, Nov 1, 2021 at 12:03 PM Coda Highland <chighland@gmail.com> wrote:
On Mon, Nov 1, 2021 at 12:48 PM Alex Light <allight@google.com> wrote:
Lua fails to parse code such as '0.."foo"' because it misinterprets '0.' as the start of a number and fails to recognize the concat operator. The Lua syntax indicates this should be permitted.

For example:

% lua
Lua 5.4.3 Copyright (C) 1994-2021 Lua.org, PUC-Rio
> 0 .. "hi"
0hi
> 0.."hi"
stdin:1: malformed number near '0..'

-Alex

The Lua syntax indicates no such thing. The grammar is defined in terms of tokens, and as with the vast majority of programming languages, tokens are always processed as the longest series of consecutive characters that matches a token. Otherwise, it would be ambiguous whether << is the left-shift operator or two less-than operators side by side. "0." matches the definition of a Numeral token, being a numeric constant with a radix point, therefore it must be parsed as a Numeral. However, the documentation only permits a Numeral to contain *A* radix point, so the appearance of a second radix point means that the token is malformed.
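
One way to watch the longest-match rule at work is to hand the same expression to `load` with and without a separator after the digit. This is only a sketch: the helper name `try` is made up for illustration, it assumes Lua 5.2 or later (where `load` accepts a string), and the exact error text may vary between versions.

```lua
-- Hypothetical helper: compile and run "return <src>", giving back
-- either the value or nil plus the compiler's error message.
local function try(src)
  local chunk, err = load("return " .. src)
  if not chunk then
    return nil, err
  end
  return chunk()
end

print(try('0 .. "hi"'))   -- whitespace ends the Numeral before the concat operator
print(try('(0).."hi"'))   -- a closing parenthesis ends it as well
print(try('0.."hi"'))     -- here the lexer keeps consuming and reports a malformed number
```

The first two calls produce the concatenated string; the third never reaches the parser proper, because the failure happens while the Numeral token is still being read.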

Now, the argument could be made that the second . should be interpreted as the start of a new token because including it would no longer match a token. JavaScript parses "0.." as the floating-point constant 0 followed by the dot operator. (C does not: its preprocessing-number rule absorbs both dots into a single token, which is then rejected.) However, that would still make the Lua code in your example erroneous, as it would fail with `<name> expected near '"hi"'`. Under no circumstances would that ever be parsed as a concat operator.

The argument could also be made that the definition I gave for a token above is not explicitly noted in the Lua documentation, making it a documentation defect. This is plausibly true, and it's up to the PUC-Rio team to decide if it's necessary to add such a call-out or if it's sufficient to assume that it's consistent with the way it's done in other languages. (The documentation also doesn't define what a numeric constant is aside from saying that it can have an optional fractional part and an optional decimal exponent. We reasonably assume that a numeric constant is formed of the ASCII digits 0 through 9.)

/s/ Adam

Precedence is a grammar-level thing. Tokenizing the code happens before you get to the grammar. `..` does not have precedence over `.` when parsing names. It is merely longer, and as I said the tokenizer extracts the longest token it can from the input. Lua doesn't even know that it's looking at names on both sides when it decides that `..` is the concat operator there.
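
To illustrate with names on both sides, here is a small sketch (the variables `x` and `y` are placeholders, and `load` taking a string assumes Lua 5.2 or later). The concat operator only exists when the two dots are adjacent; separate them and the tokenizer hands the parser two field-access dots instead.

```lua
local x, y = "foo", "bar"

print(x..y)     -- Name, `..`, Name: the two dots form one token
print(x .. y)   -- same token stream; the whitespace changes nothing here

-- With a space between the dots the tokenizer produces `.` twice,
-- and the parser then expects a field name after the first one.
local chunk, err = load("return x. .y")
print(chunk, err)
```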

Another example where whitespace distinguishes between tokens is -- [[ versus --[[. The former is a line comment that's ignoring what would otherwise be the start of a long string. The latter is the beginning of a block comment.
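
A minimal sketch of that difference, assuming Lua 5.2 or later so that `load` accepts a string. The two chunks are made-up examples that differ only in the space after the dashes; the names `src_line` and `src_block` are just for illustration.

```lua
-- Level-2 long brackets let the sample source contain [[ and ]] literally.
local src_line = [==[
-- [[ a line comment; the long bracket is just comment text
return "reached"
]==]

local src_block = [==[
--[[ a block comment that swallows
return "reached"
]] return "skipped past the comment"
]==]

print(load(src_line)())    -- the comment ends at the end of its line
print(load(src_block)())   -- the comment ends only at the closing ]]
```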

One place where whitespace does NOT distinguish between parses when you might expect it to is this example from the documentation (section 3.3.1):

a = b + c

(print or io.write)('done')

This is parsed as calling c as a function! You have to use a non-whitespace token to disambiguate here, such as a semicolon or a do/end block.
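
Here is a runnable sketch of that trap, assuming Lua 5.2 or later (Lua 5.1 rejected this layout with an "ambiguous syntax" error instead). The stand-in `c`, the `log` table, and the variable `d` are invented for the demonstration; `c` returns a function so that the second call, `('done')`, also succeeds.

```lua
local log = {}
local b = 1

-- Stand-in for `c`: records that it was called with (print or io.write),
-- then returns a function that will receive 'done'.
local function c(f)
  log[#log + 1] = "c was called"
  return function(msg) return 10 end
end

local a
a = b + c
(print or io.write)('done')   -- continues the line above: c(print or io.write)('done')

print(a, log[1])              -- shows that c ran and its result was added to b

-- With a semicolon, the parenthesized expression starts a new statement.
local d = b + 10;
(print or io.write)('done')
```

Note that the first `'done'` is never printed: it is passed to the function returned by `c` rather than to `print`.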

(To go off on a tangent: In C++, the > symbol is used as a bracket in addition to being used as a comparison operator and part of the right-shift operator. Until C++11, if you had nested <> brackets, you had to manually insert whitespace so that it was > > instead of >>, because the tokenizer would always parse >> as the right shift operator. C++11 made it mandatory to be able to leave out that whitespace. The grammar was, in effect, altered to allow the right shift operator to close two angle brackets, although there are other ways you could modify the parser to get the same effect. However, as a consequence, if you want to use the right shift operator inside of angle brackets, you must put the expression in parentheses, which breaks source code compatibility!)

/s/ Adam