> While character classes like "%g" would require the entire Unicode
> tables, what about patterns like this:
> 
>     utf8.match("\u{e4}", "[\u{e4}-\u{e6}\u{f3}-\u{f5}]")
> 
> It wouldn't require Unicode tables but "just" UTF-8 support for the
> matching functions.
> 
> Would that be possible without adding too much bloat? When I had to
> match codepoint ranges before, I had to use multiple patterns to match
> certain ranges of UTF-8 encodings.

I believe you are asking to add this kind of class into the
pattern-matching constructions in Lua. That would require some
non-trivial changes to the engine, as the whole engine would have to
be 'utf8' aware. For instance, a class repetition such as [aá]* could
not just count the number of bytes it matched, but would have to count
the number of characters. That is not compatible with the byte-oriented
behavior, so the engine would need two modes (or maybe two different
engines).
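
To make the byte-oriented behavior concrete, here is a small sketch. It assumes Lua 5.3 or later (for the "\u{...}" escapes and the utf8 library); the variable names are only illustrative. A set such as [aá] is really a set of three bytes, so it can match part of a different character, and a repetition over it advances one byte at a time.

```lua
-- Assumes Lua 5.3 or later (for "\u{...}" escapes and the utf8 library).
local a_acute = "\u{e1}"   -- "á", encoded in UTF-8 as the two bytes C3 A1
local a_tilde = "\u{e3}"   -- "ã", encoded in UTF-8 as the two bytes C3 A3

-- The set below really contains three bytes: "a", 0xC3 and 0xA1.
local class = "[a" .. a_acute .. "]"

-- "ã" is not in the intended set, yet its first byte (0xC3) matches,
-- so the match returns half of a character.
local m = a_tilde:match("^" .. class)
print(#m, m == "\xC3")        --> 1   true

-- Repetition advances byte by byte: "aá" is two characters
-- but the match covers three bytes.
local run = ("a" .. a_acute):match("^" .. class .. "*")
print(#run, utf8.len(run))    --> 3   2
```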

(That is a problem of the current simple implementation. For a more
powerful engine, such as LPeg, that already handles subexpressions,
it would be much easier.)
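
Without touching the engine, a range test like the one in the question can be done outside the pattern matcher with the standard utf8 library. The following is only a sketch under assumptions: it requires Lua 5.3 or later, the helper name first_in_ranges is hypothetical (not part of Lua), and the input is assumed to be valid UTF-8, since utf8.codepoint raises an error on invalid sequences.

```lua
-- Hypothetical helper, not part of Lua: returns true if the first character
-- of s has a code point inside one of the given {lo, hi} ranges.
-- Assumes s is valid UTF-8.
local function first_in_ranges(s, ranges)
  -- utf8.charpattern matches exactly one UTF-8 byte sequence.
  local ch = s:match("^" .. utf8.charpattern)
  if not ch then return false end      -- empty string: nothing to test
  local cp = utf8.codepoint(ch)
  for _, r in ipairs(ranges) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

-- The ranges from the question: U+00E4..U+00E6 and U+00F3..U+00F5.
local ranges = { {0xE4, 0xE6}, {0xF3, 0xF5} }
print(first_in_ranges("\u{e4}", ranges))   --> true
print(first_in_ranges("\u{e7}", ranges))   --> false
```

This tests a single character only; it does not give repetitions or captures over such ranges, which is where the engine changes described above would come in.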

-- Roberto