- Subject: Re: Unicode
- From: Björn De Meyer <bjorn.demeyer@...>
- Date: Sun, 26 May 2002 23:04:27 +0200
Roberto Ierusalimschy wrote:
>
> The problem here is that pattern modifiers (`*', `+', etc.) in Lua work
> only over a single char. If someone writes "ã*", she wants the "whole"
> ã to repeat (and not only the last byte in the representation of ã), so
> pattern matching must be `UTF-8 aware' (and `UTF-8-able'...)
Oops! Thanks for pointing this out!
This problem is related to the fact that currently, in Lua,
a-tilde (ã) is a "Latin" character that lies outside the 7-bit ASCII
range. Such characters are not directly compatible with UTF-8
and need to be converted to a 2-byte representation.
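
To make the byte-level issue concrete, here is a small sketch. It assumes the string is UTF-8 encoded and spells out the bytes with decimal escapes so that it does not depend on the encoding of the source file; the variable names are only illustrative.

```lua
-- "ã" (U+00E3) is the single byte 227 in Latin-1,
-- but the two bytes 195 163 in UTF-8.
local a_tilde = "\195\163"
print(#a_tilde)  -- 2

-- Three ã in a row, as UTF-8 bytes.
local subject = a_tilde .. a_tilde .. a_tilde

-- The `*` binds only to the last byte (163), not to the whole character,
-- so the match stops after the first ã.
local match = string.match(subject, a_tilde .. "*")
print(#match)  -- 2
```

The match is 2 bytes long rather than 6, which is the behaviour Roberto describes.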
But, and please correct me if I am wrong, don't Lua regexps
support multi-byte patterns? Can't you write (abc)+ to match
"abcabcabc"? If that is (made) possible, the modification to gsub and
substr should be relatively simple.
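
For reference, a quick check of the question above. As far as I can tell, in stock Lua a repetition modifier applies only to a single character class, so the `+` after a capture is read as a literal plus sign. The helper `match_repeated` below is hypothetical (not part of Lua); it only sketches how a multi-byte unit could be repeated by hand.

```lua
-- The "+" after the capture is a literal character here, not a repetition.
print(string.match("abcabcabc", "(abc)+"))  -- nil
print(string.match("abc+", "(abc)+"))       -- abc

-- Hypothetical helper: number of bytes covered by consecutive
-- copies of `unit` in `s`, starting at byte position `init`.
local function match_repeated(s, unit, init)
  init = init or 1
  local pos = init
  while unit ~= "" and string.sub(s, pos, pos + #unit - 1) == unit do
    pos = pos + #unit
  end
  return pos - init
end

print(match_repeated("abcabcabc", "abc"))                  -- 9
print(match_repeated("\195\163\195\163x", "\195\163"))     -- 4
```

A UTF-8-aware matcher would need something along these lines inside the pattern engine itself, treating each multi-byte sequence as one unit before applying `*` or `+`.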
--
"No one knows true heroes, for they speak not of their greatness." --
Daniel Remar.
Björn De Meyer
bjorn.demeyer@pandora.be