Roberto Ierusalimschy writes:
> When NUM_CHARS is 1, I guess you can do this:
>   string.find(s, "[^\128-\191]", index)
> In general, one thing to be decided is how much we can stretch the
> standard library to provide utf8 functions. For instance, the following
> code iterates through all code points in a string:
>   s = "aloáéíЉМНЊО"
>   for oneutf8 in string.gmatch(s, ".[\128-\191]*") do
>     print(oneutf8)
>   end

Hmm, no doubt, but I still think this is a worthwhile function to provide,
for several reasons.

 * It probably _is_ worthwhile to provide the backwards operation
   (NUM_CHARS = -1).  (scanning backwards in a string is not as common
   as forwards, but it's definitely something people do)

 * Even if using UTF-8 is generally agreed-upon, having user code
   littered with magic constants like "[^\128-\191]" is ... very
   confusing and extremely easy to screw up (even though I know how
   UTF-8 works, I can never keep straight which upper bits are

 * Given that this is probably a _very_ common operation, an obvious and
   easily understandable name for it would be valuable for making user
   code readable, at little cost.

 * The implementation as a function is essentially trivial, so given that
   it's a common operation it would be nice to provide something that's
   more efficient than going through the regexp machinery.
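
To make the "essentially trivial" claim concrete, here is a rough sketch in
plain Lua rather than C. The name utf8_step, its argument order, and the
choice to return nil when the scan runs off either end are all assumptions
for illustration, not an existing API. It also assumes the string is valid
UTF-8 and that the starting index points at the first byte of a character.

```lua
-- Hypothetical helper: true if b is a UTF-8 continuation byte (128..191).
local function is_continuation(b)
  return b ~= nil and b >= 128 and b <= 191
end

-- Hypothetical function: return the byte index of the character n
-- characters away from byte index i (n > 0 scans forwards, n < 0 scans
-- backwards), or nil if the scan runs off either end of the string.
-- Assumes s is valid UTF-8 and i is the first byte of a character.
function utf8_step(s, i, n)
  local len = #s
  while n > 0 do
    i = i + 1
    -- Skip the continuation bytes of the character we just left.
    while is_continuation(s:byte(i)) do i = i + 1 end
    if i > len then return nil end
    n = n - 1
  end
  while n < 0 do
    i = i - 1
    -- Walk back over continuation bytes until we reach a lead byte.
    while i >= 1 and is_continuation(s:byte(i)) do i = i - 1 end
    if i < 1 then return nil end
    n = n + 1
  end
  return i
end
```

With something like this, user code would say utf8_step(s, i, -1) instead of
spelling out the byte ranges, and the backwards case comes for free.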

I think it would be a good general goal of even a very minimal UTF-8
library to try and relieve users from thinking about the _details_ of
the encoding where doing so is easy (sometimes it's not, of course, and
then, well, there's no choice but to punt).
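
For instance, even the pattern quoted above could sit behind a readable name.
This is only a sketch: the name utf8_chars is made up here, and it assumes
the input is valid UTF-8.

```lua
-- Hypothetical wrapper: iterate over the UTF-8 characters of s.
-- Assumes s is valid UTF-8; malformed input is not detected.
function utf8_chars(s)
  -- One lead byte followed by any continuation bytes (128..191).
  return string.gmatch(s, ".[\128-\191]*")
end

-- Prints each character of the string on its own line.
for ch in utf8_chars("aloáéí") do
  print(ch)
end
```

The magic constants then live in exactly one place, and callers never see
them.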


Success, n. The one unpardonable sin against one's fellows.