Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)

lua-l archive


Subject: Re: Unicode and UTF-8 the Lua way, mid-discussion (was Re: What do you miss most in Lua)
From: Roberto Ierusalimschy <roberto@...>
Date: Fri, 10 Feb 2012 00:49:58 -0200

> Maybe I'm missing something, but there seems to be no way to
> efficiently compute "incremental" character byte-offsets in a string,
> which might be used when iterating over utf8 characters in a string
> (possibly starting from some deep interior point).
> 
> [In my prev message I called this "char_offset" (maybe not such a good name):
> 
>     utf8.char_offset (STRING, BYTE_INDEX, NUM_CHARS) => NEW_BYTE_INDEX]
> 
> Your utf8.byteoffsets seems the closest in spirit, but won't be
> efficient in many cases because it always has to scan the string from
> the beginning.

When NUM_CHARS is 1, I guess you can do this:

  string.find(s, "[^\128-\191]", index)
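As a rough sketch of how that find call might be extended to NUM_CHARS > 1, the helper below repeats the search once per character. The name next_char_offset and its return-nil-at-end behavior are assumptions made for illustration, not something proposed in the thread. It starts each search at index + 1, because a search starting at index would match the lead byte at index itself, and it assumes s is valid UTF-8.

```lua
-- Hypothetical helper (name and semantics are illustrative only):
-- return the byte index of the character that starts n characters
-- after byte index i, or nil if fewer than n characters follow.
local function next_char_offset(s, i, n)
  n = n or 1
  for _ = 1, n do
    -- Skip the byte at i, then look for the next byte that is not a
    -- UTF-8 continuation byte (continuation bytes are \128-\191).
    i = string.find(s, "[^\128-\191]", i + 1)
    if not i then
      return nil
    end
  end
  return i
end
```

Each call scans only the bytes between the old and the new offset, so the cost depends on NUM_CHARS rather than on how deep into the string the index is.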


In general, one thing to be decided is how much we can stretch the
standard library to provide utf8 functions. For instance, the following
code iterates through all code points in a string:

  s = "aloáéíЉМНЊО"
  for oneutf8 in string.gmatch(s, ".[\128-\191]*") do
    print(oneutf8)
  end

(Of course, it does not detect invalid sequences.)
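Two more things the standard pattern functions can be stretched to do, sketched here under the assumption that the input is valid UTF-8. The names utf8_len and utf8_chars are hypothetical and used only for illustration. The first counts code points by counting non-continuation bytes; the second uses a "()" position capture so the loop also receives the byte offset of each character.

```lua
-- Hypothetical helpers; both assume s is valid UTF-8.

-- Count code points: every code point has exactly one byte that is
-- not a continuation byte, and gsub's second result is the match count.
local function utf8_len(s)
  local _, count = string.gsub(s, "[^\128-\191]", "")
  return count
end

-- Iterate over (byte offset, character) pairs. "()" captures the
-- current position; the second capture is one lead byte plus any
-- continuation bytes that follow it.
local function utf8_chars(s)
  return string.gmatch(s, "()([^\128-\191][\128-\191]*)")
end

for pos, ch in utf8_chars("aloáéí") do
  print(pos, ch)
end
```

Unlike ".[\128-\191]*", the pattern in utf8_chars never starts a match on a continuation byte, so stray continuation bytes at the start of the string are skipped instead of being returned as a character. Neither version validates the input.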

-- Roberto
