|
As an aside, I like the demarcation point of "Lua does UTF-8, but it does not know Unicode." It is always good to be clear what you are *not* trying to do.
On Apr 16, 2014 2:10 AM, "Hisham" <h@hisham.hm> wrote:
> Recent threads here on lua-l and discussion on Twitter about the
> necessity of including UTF-8 support into core Lua (as opposed to a
> library) got me thinking about how hard it would be to get proper
> UTF-8 support in Lua patterns.
> The idea is to avoid things like this:
>
> Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio
> > print( ("páscoa"):match("[é]") )
> Ã
string.match() is being passed a utf8 haystack and a utf8 pattern. But string.match does not interpret its inputs as utf8; it silently treats *both* as single-byte character-set strings. As a function, match()'s result is in the byte-string range regardless of its domain.
The good news is that this works most of the time. The bad news is that when it fails, it fails silently. Because pattern matching is used in many security-sensitive contexts, an attacker can drive that silent failure into a security failure.
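Here is the quoted session spelled out in bytes, a sketch of what the byte-wise interpretation is actually doing:

```lua
-- Why the session above prints "Ã": both arguments are read as bytes.
-- "é" is the two bytes "\xC3\xA9", so the class "[é]" is really the
-- byte class "[\xC3\xA9]", and "páscoa" contains "\xC3" as the first
-- byte of "á" ("\xC3\xA1").
assert(("páscoa"):match("[é]") == "\xC3")  -- a lone lead byte; a
                                           -- Latin-1 terminal shows "Ã"
```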
The pattern side has not concerned me as much. There are two ways strings end up in patterns. The dangers of incorporating outside strings into patterns unsafely are relatively well known; vigorous C compiler warnings have stamped out "printf(s)".
You already need a good "quote_as_literal_pattern" function if you are constructing patterns from other strings. Let's call its result q. You can use it like this:
"/"..q
" +"..q.."$"
Given a perfect quoter, here are some things you still can't do with it:
"["..q.."]"
You need a different quoting tool for this anyway, since the language inside [] is different from the outer language.
q.."+"
The + symbol will only apply to the last byte of the literal. As it happens, when both haystack and pattern are utf8, repetition can only take effect on characters in the ASCII subset. This is because start bytes are distinct from all continuation bytes, and any multibyte sequence at the end of q is complete, since we assumed q is utf8.
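A quick demonstration of the last-byte binding ("€" is the three bytes "\xE2\x82\xAC"):

```lua
-- "+" binds to the final byte of the literal, not the final character.
assert(("aa"):match("a+") == "aa")   -- ASCII: the whole character repeats
assert(("€€"):match("€+") == "€")    -- UTF-8: only "\xAC" may repeat,
                                     -- and the next byte here is "\xE2"
-- Only a (malformed) run of continuation bytes would actually repeat:
assert(("\xE2\x82\xAC\xAC"):match("€+") == "\xE2\x82\xAC\xAC")
```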
q.."*"
This is unlikely to do what people want even in ASCII; consider "arbitrary".."*". But given a final utf8 character, this pattern is disastrous when interpreted as bytes. Take "€".."*": this is "\xE2\x82\xAC".."*", and it partially matches other characters. If the haystack contains the Korean won sign "₩", coded "\xE2\x82\xA9", then "€*" matches "\xE2\x82" with zero repetitions of "\xAC". If we capture this as a match, we now get a string outside utf8. Where does the "\xA9" end up?
If this were part of a larger pattern, "€*.", we could recover depending on the haystack. "₩" would match whole, as would "€x". But "€€" would again split up a utf8 character, as "." matches the head byte of the second "€", "\xE2". So there is no way to be certain isolated "€*" is broken, or is not broken; its behavior simply cannot be described in terms of utf8 alone.
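The cases above, spelled out as assertions ("₩" is "\xE2\x82\xA9", "€" is "\xE2\x82\xAC"):

```lua
-- "€*" byte-matches the shared prefix "\xE2\x82" of the won sign,
-- with zero repetitions of "\xAC" -- the match is not valid utf8.
assert(("₩"):match("€*") == "\xE2\x82")

-- With a trailing ".", recovery depends on the haystack:
assert(("₩"):match("€*.") == "₩")    -- "." happens to absorb "\xA9"
assert(("€x"):match("€*.") == "€x")  -- one "\xAC" repetition, "." = "x"

-- But here "." matches "\xE2", the head byte of the second "€",
-- splitting that character:
assert(("€€"):match("€*.") == "\xE2\x82\xAC\xE2")
```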
I do not want to teach the entire world the implementation details of utf8 just so they can avoid serious errors, and worse, errors which simply cannot be explained in the naive mental model of strings. Because I am one of those "fascist pigs with a read-only mind", I would prefer a utf8.* analog of string.*, one which blows up on operations that are ill-defined on its domain and range. Any pattern which destructures utf8 code sequences is fair game to be blown up at (notional) pattern-compile time, or whenever the execution engine notices it happening or is about to let it happen.
Since we're moving functionality out of liblua.so into Lua-language libraries[1], analysis of patterns could be performed there, giving a utf8.match wrapper which undoubtedly memoizes patterns. The problem is that very common patterns like ".." are beyond analysis.
If I had one wish for utf8.match, it would be for "." to either match complete utf8 characters or fail.
...but wait a minute, that’s exactly what the range [\0-\u10FFFF] means with Hisham's patch, right? So a utf8.match wrapper could parse the pattern, replacing all instances of "." with the universal range.[2] It would still have to check for the repetition qualifiers "+-*?" following a non-ASCII literal. In the case of "€*" it would rewrite it as a single-character range, "[\u20AC]*".
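For the unquantified "." case, the rewriting wrapper can even be sketched in stock Lua, substituting the classic two-item "one UTF-8 character" pattern for the patch's universal range. All the names here are mine, and the sketch deliberately punts on quantified "." (and does not even detect escaped "%." or "." inside ranges), which is precisely where engine support is needed:

```lua
-- CHARPAT matches exactly one UTF-8 character (it is what Lua 5.3
-- later shipped as utf8.charpattern). Being two pattern items, it
-- cannot carry a quantifier -- hence the assert below.
local CHARPAT = "[\x00-\x7F\xC2-\xFD][\x80-\xBF]*"

local cache = {}   -- memoized pattern rewrites

local function rewrite(pat)
  if cache[pat] then return cache[pat] end
  -- Replace each "." with CHARPAT; refuse quantified "." outright.
  local out = pat:gsub("%.([%+%-%*%?]?)", function(quant)
    assert(quant == "", "quantified '.' needs engine support")
    return CHARPAT
  end)
  cache[pat] = out
  return out
end

local function utf8_match(s, pat)
  return s:match(rewrite(pat))
end

assert(("é"):match(".") == "\xC3")    -- plain match: one byte
assert(utf8_match("é", ".") == "é")   -- rewritten: one character
assert(utf8_match("páscoa", "p..") == "pás")
```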
If you think this starts to sound like an enormous kludge, I would agree. It would be much simpler with some assistance from C.
--
Jay
[1]: Moved out of liblua.so, moved into hypothetical Lua language libraries, never to be seen again. Or rather, to be seen again, occasionally, in n slightly differently broken implementations. I have a hypothesis about what stays in liblua.so: code which would otherwise be require""d for every example in a chapter of _Programming in Lua_. I have some hope that utf8.match will go into liblua.so on this basis. Can you imagine opening the UTF-8 chapter's pattern-matching section with the awful kludge of the pattern-hacking, memoizing string.match wrapper? Wouldn't it be simpler just to write the C than to have to explain a hundred lines of Lua workaround code?
[2]: The kind of pedants reading (and writing!) footnotes about universal ranges in UTF-8 are going to wonder why I do not need to exclude the surrogate range from "[\0-\u10FFFF]". I could, but I don't have to. We already said we're working on valid UTF-8 strings, and valid UTF-8 strings do not include sequences coding the surrogate range.
If that kind of dependency makes you reach for assert(), you're not alone. More on that later, perhaps.