On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>
> You can already easily implement this 'getchar' in standard Lua (except
> that it assumes a well-formed string):
>
>   S = "∂ƒ"
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
>   print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil

Thanks for this pattern trick, it helped me to improve my
`getcodepoint()` routine (although I eventually found a faster
method). A validation routine won't be of much help
if you deal with a document that sports multiple encodings.

If you want to validate the characters on the go and always get a position as
the second return value, you need something like the function below. If the
character is not valid, it returns `false, position`. At the end of the
stream, it returns `nil` and the position it was called with:

    local s_byte, s_match, s_sub = string.byte, string.match, string.sub

    function getchar(S, first)
        if #S < first then
            return nil, first
        end

        local match, next = S:match("^([^\128-\191][\128-\191]*)()", first)

        if not match then
            return false, first
        end

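        -- `lead` is the lead byte; the `first` parameter (a position) must
        -- not be shadowed, since it is returned on failure.
        local lead, n = s_byte(match), #match
        local success
            =  lead < 128 and n == 1
            or lead >= 192 and lead < 224 and n == 2
            or lead >= 224 and lead < 240 and n == 3
            or lead >= 240 and lead < 248 and n == 4
            or lead >= 248 and lead < 252 and n == 5
            or lead >= 252 and lead < 254 and n == 6
            --UTF-16 surrogate code point checking left out for clarity.

        if success then
            return match, next
        else
            return false, first
        end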
    end


or this (same speed in Lua 5.1/5.2, but twice as fast in LuaJIT, where
`gmatch()` is not compiled):


    function utf8_get_char_jit_valid2(subject, i)
        if i > #subject then
            return nil, i
        end

        local byte = s_byte(subject, i)

        if byte < 128 then
            return s_sub(subject, i, i), i + 1

        elseif byte < 192 then
            return false, i

        elseif byte < 224 and s_match(subject,
            "^[\128-\191]",
            i + 1) then
            return s_sub(subject, i, i + 1), i + 2

        elseif byte < 240 and s_match(subject,
            "^[\128-\191][\128-\191]",
            i + 1) then
            return s_sub(subject, i, i + 2), i + 3

        elseif byte < 248 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191]",
            i + 1) then
            return s_sub(subject, i, i + 3), i + 4

        elseif byte < 252 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191][\128-\191]",
            i + 1) then
            return s_sub(subject, i, i + 4), i + 5

        elseif byte < 254 and s_match(subject,
            "^[\128-\191][\128-\191][\128-\191][\128-\191][\128-\191]",
            i + 1) then
            return s_sub(subject, i, i + 5), i + 6

        else
            return false, i
        end
    end
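
A function with this `(char | false | nil, position)` protocol can drive a
simple scanning loop. The sketch below is only an illustration: `scan` is a
hypothetical helper, and `naive_getchar` is a non-validating stand-in built on
the quoted pattern so that the block runs on its own; in practice you would
pass one of the validating functions instead. Skipping one byte after a
`false` is just one possible recovery policy.

```lua
-- Stand-in with the same (char | false | nil, position) protocol.
-- It does NOT validate lead bytes; it only rejects a continuation byte
-- found at the position it is asked to read from.
local function naive_getchar(S, first)
    if first > #S then
        return nil, first
    end
    local char, next_pos = string.match(S, "^([^\128-\191][\128-\191]*)()", first)
    if not char then
        return false, first
    end
    return char, next_pos
end

-- Hypothetical helper: collect the characters and the positions of bad bytes.
local function scan(S, getchar)
    local chars, errors = {}, {}
    local pos = 1
    while true do
        local char, next_pos = getchar(S, pos)
        if char == nil then
            break                           -- end of the string
        elseif char == false then
            errors[#errors + 1] = next_pos  -- position of the invalid byte
            pos = next_pos + 1              -- skip it and resynchronize
        else
            chars[#chars + 1] = char
            pos = next_pos
        end
    end
    return chars, errors
end

-- A stray continuation byte, then "a", then U+2202 encoded as three bytes.
local chars, errors = scan("\128a\226\136\130", naive_getchar)
print(#chars, #errors, errors[1])
```

Passing the reader function as an argument keeps the loop independent of
which validation strategy is used.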


This is not that complex, but still rather slow in Lua, and the same
goes for getting the code point to perform a range query (useful to
test if a code point is part of some alphabet).
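
For reference, the code point extraction and range test mentioned above might
look roughly like this in plain Lua. Both `codepoint` and `in_range` are
hypothetical names used for illustration, and the sketch assumes every
argument is a single well-formed UTF-8 sequence of at most four bytes (no
validation is done).

```lua
-- Decode one well-formed UTF-8 sequence (1 to 4 bytes) into its code point.
-- Hypothetical helper; it performs no validation. Plain arithmetic is used
-- so that it does not depend on a bit library.
local function codepoint(char)
    local b1, b2, b3, b4 = string.byte(char, 1, 4)
    if b1 < 128 then
        return b1
    elseif b1 < 224 then
        return (b1 - 192) * 64 + (b2 - 128)
    elseif b1 < 240 then
        return ((b1 - 224) * 64 + (b2 - 128)) * 64 + (b3 - 128)
    else
        return (((b1 - 240) * 64 + (b2 - 128)) * 64 + (b3 - 128)) * 64 + (b4 - 128)
    end
end

-- Hypothetical range test: is `char` between `lower` and `upper`, inclusive?
local function in_range(char, lower, upper)
    local cp = codepoint(char)
    return codepoint(lower) <= cp and cp <= codepoint(upper)
end

print(codepoint("\226\136\130"))                     -- U+2202
print(in_range("\206\187", "\206\177", "\207\137"))  -- lambda within alpha..omega
```

Each call decodes up to three strings, which is part of why this stays
comparatively slow when done per character in Lua.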

To that end, you could provide a `utf8.range(char, lower, upper)`, though.

This assumes you don't deprecate patterns in the next Lua version (or
the one after, to ease the transition?).

But I understand the need to balance features and light weight.
`getchar()` and `getcodepoint()` are damn useful to write parsers, but
if LPeg is part of the next version, the point is probably moot.

-- Pierre-Yves