

Roberto Ierusalimschy <roberto@inf.puc-rio.br> writes:
> A very basic support for UTF-8, in the lines suggested by Miles Bader,
> seems a good start. Something more or less like this:

Oooh, nice to see something real!

Maybe I'm missing something, but there doesn't seem to be a way to
efficiently compute "incremental" character byte-offsets in a string,
which would be needed when iterating over the utf8 characters in a
string (possibly starting from some deep interior point).

[In my previous message I called this "char_offset" (maybe not such a good name):

    utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX]

Your utf8.byteoffset seems the closest in spirit, but it won't be
efficient in many cases because it always has to scan the string from
the beginning.

Maybe you could add an optional "start_offset" parameter to
utf8.byteoffset:

   utf8.byteoffset(s, l, [start_offset])
      -> offset (in bytes) where 'l'-th code point from START_OFFSET (in
         bytes, default 1) starts
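
To make the cost argument concrete, here is a minimal pure-Lua sketch of
such an incremental offset function. The name char_offset, its argument
order, and its nil-at-end behaviour are assumptions for illustration only,
not the proposed library API; it also assumes valid UTF-8 and that the
byte index it is given sits on the first byte of a character.

```lua
-- Hypothetical helper, for illustration only (not the proposed utf8 API).
-- Steps forward N characters from byte index I, which is assumed to sit on
-- the first byte of a character, and returns the new byte index.
-- Returns nil if S ends before N characters could be skipped.
function char_offset (s, i, n)
   local len = #s
   while n > 0 do
      if i > len then return nil end
      i = i + 1
      -- skip UTF-8 continuation bytes (10xxxxxx, i.e. 0x80..0xBF)
      while i <= len do
         local b = string.byte (s, i)
         if b < 0x80 or b >= 0xC0 then break end
         i = i + 1
      end
      n = n - 1
   end
   return i
end

-- "a", U+00E9, U+20AC, "z" written as UTF-8 bytes (1 + 2 + 3 + 1 bytes)
local s = "a\195\169\226\130\172z"

-- collect character start offsets, beginning at an interior byte index
local i, starts = 4, {}
while i and i <= #s do
   starts[#starts + 1] = i
   i = char_offset (s, i, 1)
end
-- starts is now {4, 7}
```

The work done per call is proportional to the number of bytes skipped,
not to the distance of the start index from the beginning of the string.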

I think many higher-level utf8-aware interfaces will tend to be written
in terms of string byte-offsets, so having an efficient way to operate
on interior string segments is important.

Consider an "output unicode characters to MUMBLE" function:

   function output_unicode_chars_to_mumble (mumble, str, start, stop)
      start = start or 1
      stop = stop or #str    -- STOP is an inclusive byte index

      -- iterate over STR, outputting a single character at a time
      while start <= stop do
         local codepoint = utf8.codepoints (str, start)
         output_unicode_codepoint_to_mumble (mumble, codepoint)
         start = utf8.byteoffset (str, 1, start)   -- increment START
      end
   end

Thanks,

-miles

-- 
"She looks like the wax version of herself."
     	   	    		   [Comment under a Paris Hilton fashion pic]