Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive

Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Miles Bader <miles@...>
Date: Fri, 10 Feb 2012 12:18:28 +0900

Roberto Ierusalimschy <roberto@inf.puc-rio.br> writes:
> When NUM_CHARS is 1, I guess you can do this:
>
>   string.find(s, "[^\128-\191]", index)
>
> In general, one thing to be decided is how much we can stretch the
> standard library to provide utf8 functions. For instance, the following
> code iterates through all code points in a string:
>
>   s = "aloáéíЉМНЊО"
>   for oneutf8 in string.gmatch(s, ".[\128-\191]*") do
>     print(oneutf8)
>   end

Hmm, no doubt, but I still think this is a worthwhile function to provide,
for several reasons.

 * It probably _is_ worthwhile to provide the backwards operation
   (NUM_CHARS = -1).  (scanning backwards in a string is not as common
   as forwards, but it's definitely something people do)

 * Even if using UTF-8 is generally agreed-upon, having user code
   littered with magic constants like "[^\128-\191]" is ... very
   confusing and extremely easy to screw up (even though I know how
   UTF-8 works, I can never keep straight which upper bits are
   which...).

 * Given that this is probably a _very_ common operation, an obvious and
   easily understandable name for it would be valuable for making user code
   readable, at little cost.

 * The implementation as a function is essentially trivial, so given that
   it's a common operation it would be nice to provide something that's
   more efficient than going through the regexp machinery.
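
To make the points above concrete, here is a rough sketch of such a
function in plain Lua. The name utf8_offset and its signature are made
up for illustration (this is not an existing API), and it assumes the
string is valid UTF-8 and that the starting index sits on a character
boundary:

```lua
-- Hypothetical helper; name and signature are illustrative only.
-- Moves num_chars UTF-8 characters from byte index `index` in `s`
-- (forwards if positive, backwards if negative) and returns the
-- resulting byte index, or nil if the string runs out first.
-- Assumes `s` is valid UTF-8 and `index` is the start of a character
-- (or #s + 1, i.e. just past the end).
local function utf8_offset(s, index, num_chars)
  -- Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
  local function is_continuation(i)
    local b = s:byte(i)
    return b ~= nil and b >= 0x80 and b <= 0xBF
  end

  while num_chars > 0 do
    if index > #s then return nil end
    -- Step over the lead byte, then over any continuation bytes.
    index = index + 1
    while is_continuation(index) do index = index + 1 end
    num_chars = num_chars - 1
  end

  while num_chars < 0 do
    if index <= 1 then return nil end
    -- Step back one byte, then keep going until we reach a lead byte.
    index = index - 1
    while index > 1 and is_continuation(index) do index = index - 1 end
    num_chars = num_chars + 1
  end

  return index
end
```

With something like this, user code can write utf8_offset(s, i, 1) or
utf8_offset(s, i, -1) and never mention "[^\128-\191]" at all; the
magic constants live in exactly one place.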

I think it would be a good general goal of even a very minimal UTF-8
library to try and relieve users from thinking about the _details_ of
the encoding where doing so is easy (sometimes it's not, of course, and
then, well, there's no choice but to punt).

-miles

-- 
Success, n. The one unpardonable sin against one's fellows.
