On 2017-10-10 11:48 PM, Jonathan Goble wrote:
On Tue, Oct 10, 2017 at 10:06 PM Soni L. <fakedme@gmail.com> wrote:



    On 2017-10-10 10:54 PM, Martin wrote:
    > On 10/10/2017 10:43 PM, Soni L. wrote:
    >> However, I noticed something kinda weird:
    >>
    >> string.match("''", "%f[']'", 2) --> nil
    >> string.match("''", "^'", 2) --> '
    > string.match("''", "%f[']'", 2). From position 2 of string [['']]
    > locate to index <i> such that s[i] ~= [[']] and s[i + 1] == [[']] and
    > s[i + 1] == [[']].
    >
    > There is no such index so nil is returned. In case of string [['' ']]
    > there is match.
    >
    > string.match("''", "^'", 2). From position 2 of string [['']] find
    > string [[']]. Do not skip characters (due "^" anchor). This matches
    > [[']], second apostrophe.
    >
    > -- Martin
    >

    But from position 1, the first pattern matches. From reading the manual,
    both operate on the start of the subject string, so either the first
    should match or the second should fail.


Regarding the frontier pattern: it does not match ON a character. It matches the boundary BETWEEN two characters, specifically the boundary immediately before the next character to be matched. So when matching from position 1, the frontier pattern attempts to match the boundary immediately before the first character in the string. Since there is no previous character at this boundary, the frontier pattern matches against "\0" in lieu of the previous character.

"\0" ~= [[']] and [[']] (the first character of the string) == [[']], so the frontier pattern successfully matches. The literal [[']] then matches the first character of the string. If you passed this pattern to string.find instead, it would return a start and end point both equal to 1, since only the first character was matched. The frontier pattern does not consume the source string or add anything to the overall match; it's essentially just an assertion.
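A quick way to check this is to look at the indices string.find returns. The snippet below is only a sketch, assuming Lua 5.2 or later (where %f is documented); the variable names are made up for illustration.

```lua
local s = "''"

-- Frontier plus literal apostrophe, searching from the default position 1.
-- The "previous character" at the start of the subject is treated as "\0".
local i, j = string.find(s, "%f[']'")
print(i, j)      --> 1 1

-- The frontier on its own consumes nothing, so the match is empty:
-- string.find reports an end index one less than the start index.
local ei, ej = string.find(s, "%f[']")
print(ei, ej)    --> 1 0
```

The second call shows the "assertion" nature of the frontier: it succeeds at position 1 but contributes no characters to the match.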

Now when you match from position 2, Lua first tries to match the frontier pattern at the boundary immediately before the character next to be matched. In this case, that is the second apostrophe, so the boundary to be checked is between the first and second characters of the source string. The character previous to that boundary is the first apostrophe, which is equal to an apostrophe, so the frontier assertion fails and the function returns nil.

The takeaways here are that frontier patterns check a boundary, not a character, and specifying an explicit start does not prevent Lua from checking the character previous to that if the pattern starts with a frontier assertion.
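To see both takeaways side by side, here is a small sketch (again assuming Lua 5.2+; the names from1, from2 and so on are just for illustration). It also covers Martin's [['' ']] example, where a non-apostrophe precedes the last apostrophe.

```lua
local s = "''"

-- From position 1 the boundary has no previous character ("\0"), so it matches.
local from1 = string.match(s, "%f[']'", 1)
print(from1)     --> '

-- From position 2 the previous character is the first apostrophe,
-- even though it lies before the explicit start, so the frontier fails.
local from2 = string.match(s, "%f[']'", 2)
print(from2)     --> nil

-- With a space before the last apostrophe there is a valid boundary at 4.
local t = "'' '"
local ti, tj = string.find(t, "%f[']'", 2)
print(ti, tj)    --> 4 4
```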

Now to the second pattern. The "^" character can be a bit surprising when an explicit start position is specified, because the description in the manual does not match the actual behavior. The manual states that the caret "anchors the match at the beginning of the subject string", but in practice the caret anchors the match at the starting position (i.e. it prevents the matching engine from skipping characters past that position).

Thus your second pattern attempts to match [[']] beginning at position 2, without skipping characters. Position 2 is in fact [[']], so the match succeeds. As a further example, consider string.match("'!'", "^'", 2); this will fail to match and return nil because the [[']] cannot match at position 2 and the caret prevents it from skipping characters. The fact that the first character is [[']] is irrelevant, since you've overridden that by telling Lua to start at position 2 instead.
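The anchored and unanchored cases can be compared directly. This is a minimal sketch; the variable names are only there so the results can be printed.

```lua
-- Anchored at the explicit start: position 2 of "''" is an apostrophe.
local a = string.match("''", "^'", 2)
print(a)         --> '

-- Anchored at the explicit start: position 2 of "'!'" is "!", so no match.
local b = string.match("'!'", "^'", 2)
print(b)         --> nil

-- Without the caret the engine may skip ahead and finds position 3.
local ci, cj = string.find("'!'", "'", 2)
print(ci, cj)    --> 3 3
```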

The bug in this second issue appears to be in the documentation, not the code, since there's no good reason why Lua should ignore an explicit start argument when the caret is given, and valid reasons why it shouldn't ignore the start argument. Consider a simple tokenizer that repeatedly matches consecutive tokens without skipping anything else; the pattern could be prefixed with a ^ and suffixed with a position capture, and on each iteration, the result from the previous position capture is fed to the next call as the start argument. In this case, you would rely on the ^ meaning "anchor to start point", and a nil result would mean a syntax error in whatever you're tokenizing.
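Such a tokenizer might look roughly like the sketch below. The token kinds, the patterns, and the tokenize function are all invented for illustration; the only parts taken from the description above are the leading ^, the trailing position capture (), and feeding the captured position back in as the start argument.

```lua
-- Hypothetical token table: every pattern is anchored with ^ and ends
-- with a position capture, so string.match returns the next start index.
local token_patterns = {
  { kind = "space",  pattern = "^%s+()" },
  { kind = "number", pattern = "^%d+()" },
  { kind = "name",   pattern = "^[%a_][%w_]*()" },
  { kind = "symbol", pattern = "^[%+%-%*/=%(%)]()" },
}

local function tokenize(source)
  local tokens, pos = {}, 1
  while pos <= #source do
    local next_pos, kind
    for _, entry in ipairs(token_patterns) do
      -- ^ anchors at pos, so nothing between tokens can be skipped.
      next_pos = string.match(source, entry.pattern, pos)
      if next_pos then kind = entry.kind; break end
    end
    if not next_pos then
      -- No pattern matched exactly at pos: treat it as a syntax error.
      return nil, "syntax error at position " .. pos
    end
    if kind ~= "space" then
      tokens[#tokens + 1] = { kind = kind, text = source:sub(pos, next_pos - 1) }
    end
    pos = next_pos
  end
  return tokens
end

local toks = tokenize("x = 12 + y1")
for _, tok in ipairs(toks) do print(tok.kind, tok.text) end

local bad, err = tokenize("x = 12 $ y")
print(bad, err)  --> nil     syntax error at position 8
```

If the caret anchored only at the true beginning of the subject, every call after the first would fail, and the loop would need to cut substrings with string.sub on each iteration instead.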

Or you could interpret the caret definition to define the beginning of the subject string as starting at the explicit start position. In which case frontier is wrong.

It's one of:

- caret is wrong
- frontier is wrong
- manual is wrong

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.