Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive


Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Miles Bader <miles@...>
Date: Fri, 10 Feb 2012 10:59:52 +0900

Roberto Ierusalimschy <roberto@inf.puc-rio.br> writes:
> A very basic support for UTF-8, in the lines suggested by Miles Bader,
> seems a good start. Something more or less like this:

Oooh, nice to see something real!

Maybe I'm missing something, but there doesn't seem to be a way to
efficiently compute "incremental" character byte-offsets in a string,
which would be needed when iterating over the utf8 characters in a
string (possibly starting from some deep interior point).

[In my prev message I called this "char_offset" (maybe not such a good name):

    utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX]

Your utf8.byteoffsets seems the closest in spirit, but won't be
efficient in many cases because it always has to scan the string from
the beginning.

Perhaps you could add an optional "start_offset" parameter to
utf8.byteoffsets:

   utf8.byteoffset(s, l, [start_offset])
      -> offset (in bytes) where 'l'-th code point from START_OFFSET (in
         bytes, default 1) starts
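
To make the intended semantics concrete, here is a minimal pure-Lua sketch
of the "char_offset" idea. It is only an illustration, not part of any Lua
library: the name char_offset and its behavior at the end of the string are
assumptions, and it assumes the input is valid UTF-8 with BYTE_INDEX
pointing at the first byte of a character. The cost is proportional to the
bytes skipped, not to the distance from the start of the string.

```lua
-- Hypothetical helper: return the byte index reached by moving NUM_CHARS
-- characters forward from BYTE_INDEX in S.  Returns #S + 1 when the move
-- lands exactly past the last character, and nil when it would go further.
local function char_offset (s, byte_index, num_chars)
   local i = byte_index
   local len = #s
   while num_chars > 0 and i <= len do
      i = i + 1   -- step past the lead byte of the current character
      -- skip continuation bytes, which have the form 10xxxxxx (0x80..0xBF)
      while i <= len do
         local b = s:byte (i)
         if b < 0x80 or b > 0xBF then break end
         i = i + 1
      end
      num_chars = num_chars - 1
   end
   if num_chars > 0 then return nil end   -- ran off the end of S
   return i
end
```

For example, in "a\195\169b" (an "a", a two-byte "é", and a "b"),
char_offset (s, 2, 1) starts at the "é" and returns 4, the index of "b",
without looking at byte 1.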

I think many higher-level utf8-aware interfaces will tend to be
written in terms of string byte-offsets, so having an efficient way to
operate on interior string segments is important.

Consider an "output unicode characters to MUMBLE" function:

   function output_unicode_chars_to_mumble (mumble, string, start, stop)
      start = start or 1
      stop = stop or #string

      -- iterate over STRING, outputting a single character at a time;
      -- STOP is an inclusive byte index, hence "<="
      while start <= stop do
         local codepoint = utf8.codepoints (string, start)
         output_unicode_codepoint_to_mumble (mumble, codepoint)
         start = utf8.byteoffset (string, 1, start)   -- increment START
      end
   end
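
For readers who want to run a loop of this shape, here is a sketch that
assumes Lua 5.3 or later, where the standard utf8 library provides
utf8.codepoint (s, i) and utf8.offset (s, n, i). It is not the API proposed
in this thread: each_codepoint and the emit callback are hypothetical names
standing in for the "mumble" output function, and the string is assumed to
be valid UTF-8 with START on a character boundary.

```lua
-- Call EMIT with each code point of S whose first byte lies in
-- [START, STOP] (byte indices, STOP inclusive).  Assumes Lua 5.3+.
local function each_codepoint (s, start, stop, emit)
   start = start or 1
   stop = stop or #s
   while start <= stop do
      emit (utf8.codepoint (s, start))
      -- the 2nd character counting from START is the next one; for the
      -- last character this yields #s + 1, which ends the loop
      start = utf8.offset (s, 2, start)
   end
end

-- Usage: collect the code points of "a\195\169b" starting at byte 2.
local out = {}
each_codepoint ("a\195\169b", 2, nil, function (cp) out[#out + 1] = cp end)
```

Each step touches only the bytes of the current character, so starting
deep inside a long string costs nothing extra.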

Thanks,

-miles

-- 
"She looks like the wax version of herself."
     	   	    		   [Comment under a Paris Hilton fashion pic]

