On Jun 15, 2013, at 6:56 PM, Pierre-Yves Gérardy wrote:

> On Sat, Jun 15, 2013 at 10:08 PM, Jay Carlson <nop@nop.com> wrote:
>> UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with
>> 
>> function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
>>  -- each argument is a string holding one UTF-8 encoded codepoint
>>  return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint
>> end
> 
> I hadn't realized this. I'm accreting knowledge on the go, I've yet to
> rigorously explore Unicode... I find UTF-8 beautiful in lots of
> regards. UTF-16 baffles me, though. Do you know why they reserved
> codepoints, which are supposed to correspond to symbols, to the
> implementation details of an encoding? I wish there was a UTF-16'
> that followed the UTF-8 strategy.

Originally, Unicode was sold as "double-wide ASCII". "All you have to do to support the world's scripts is use 2-byte characters." Then they decided 64k codepoints was *not* enough for everyone. Before they ran out of space, they allocated the surrogate blocks to let existing software using 2-byte characters have access to the other planes. It's a good design given the constraints.

UCS-2 was attractive; all codepoints were a fixed size. Adding UTF-16 to support the astral planes meant codepoints were *not* all the same size, and once you had to deal with that, other variable-width codes like UTF-8 were more competitive.

>> and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership.
> 
> I don't understand what you mean here :-/

I'm sorry, that really was cryptic. Let's look at an ASCII range tester: is the first character of this string a capital letter?

first_member = "A"
last_member = "Z"
after_last = string.char(string.byte(last_member)+1)

-- outside the range
assert(not( "" >= first_member ))
assert(not( "@home" >= first_member ))

-- inside the range
assert( "Alphabet" >= first_member )
assert( "Yow" < last_member )
assert( "Zymurgy" < after_last )

-- outside
assert(not( "[" < after_last ))

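Any string that starts with a character inside the range must be strictly less than the very first value after the range. Comparing against last_member with <= is not enough: "Zymurgy" <= "Z" is false, because a longer string sorts after its own prefix.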
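
The checks above can be folded into one helper. This is only a sketch: the name starts_in_range is made up for illustration, and it assumes Lua compares strings byte by byte. That holds in the default "C" locale, but Lua compares strings with strcoll, so a different collation locale could change the ordering.

```lua
-- Pin collation to the "C" locale so string comparison is bytewise.
os.setlocale("C", "collate")

-- Hypothetical helper: true if s begins with a character in the
-- half-open range [first_member, after_last).
function starts_in_range(s, first_member, after_last)
  return s >= first_member and s < after_last
end

local first_member = "A"
local last_member = "Z"
-- Adding 1 to the byte only works for single-byte (ASCII) characters.
local after_last = string.char(string.byte(last_member) + 1)

assert(starts_in_range("Alphabet", first_member, after_last))
assert(starts_in_range("Zymurgy", first_member, after_last))
assert(not starts_in_range("", first_member, after_last))
assert(not starts_in_range("@home", first_member, after_last))
assert(not starts_in_range("[", first_member, after_last))
```

Using a half-open range means the caller never has to care how long the tested string is.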

This same property is true of UTF-8 strings. Any string starting with a codepoint inside the range will be strictly less than the string consisting of the first codepoint outside the range. You can test membership of the codepoint at any byte-index without extracting it. You could use C code like this:

  strcmp(s+7, first_member) >= 0 && strcmp(s+7, after_last) < 0

as long as you know s+7 is a valid offset inside the string. (This mostly works for invalid offsets too.)
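
In Lua, the same test at a byte offset might look like the sketch below. The helper name codepoint_in_range_at and the Greek lowercase range are assumptions chosen for illustration. Note that s:sub(i) copies the tail of the string, which the C pointer arithmetic avoids. The bounds are written as decimal byte escapes so the example does not depend on the encoding of the source file.

```lua
-- Bytewise comparison, as with strcmp.
os.setlocale("C", "collate")

-- Hypothetical helper: true if the codepoint starting at byte index i
-- of s lies in the half-open range [first_member, after_last).
function codepoint_in_range_at(s, i, first_member, after_last)
  local tail = s:sub(i)  -- Lua has no pointer arithmetic; this copies
  return tail >= first_member and tail < after_last
end

-- Greek lowercase alpha..omega, U+03B1..U+03C9, as UTF-8 bytes.
local alpha = "\206\177"        -- U+03B1
local after_omega = "\207\138"  -- U+03CA, the first codepoint past omega

-- "a", then lambda (U+03BB, bytes 206 187), then "z".
local s = "a\206\187z"
assert(not codepoint_in_range_at(s, 1, alpha, after_omega))  -- "a"
assert(codepoint_in_range_at(s, 2, alpha, after_omega))      -- lambda
assert(not codepoint_in_range_at(s, 4, alpha, after_omega))  -- "z"

-- Byte 3 is the continuation byte of lambda, an invalid offset.
-- For this particular range it is rejected, since 187 < 206.
assert(not codepoint_in_range_at(s, 3, alpha, after_omega))
```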

Jay