On Sat, Jun 15, 2013 at 10:08 PM, Jay Carlson <nop@nop.com> wrote:
> UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with
>
> function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
>   -- all three arguments are strings holding one UTF-8 encoded code point
>   return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint
> end

I hadn't realized this. I'm accreting knowledge on the go, and I've yet
to explore Unicode rigorously... I find UTF-8 beautiful in many
respects. UTF-16 baffles me, though. Do you know why they reserved
code points, which are supposed to correspond to symbols, for the
implementation details of an encoding? I wish there were a UTF-16'
that followed the UTF-8 strategy.
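
Back to the ordering property: here is a quick check, as a minimal
sketch rather than anything definitive. It assumes the "C" collation
locale, because Lua compares strings with strcoll, and the variable
names are made up for illustration. The byte sequences are written
with decimal escapes so the chunk does not depend on the source file
encoding.

```lua
-- Force byte-wise collation; string comparison in Lua goes through strcoll.
os.setlocale("C", "collate")

-- The quoted function as a plain local; each argument is a string
-- holding exactly one UTF-8 encoded code point.
local function inrange(ch, lower, upper)
  return ch >= lower and ch <= upper
end

local a_grave = "\195\160"      -- U+00E0, 2 bytes
local e_acute = "\195\169"      -- U+00E9, 2 bytes
local y_diaer = "\195\191"      -- U+00FF, 2 bytes
local euro    = "\226\130\172"  -- U+20AC, 3 bytes

print(inrange(e_acute, a_grave, y_diaer))  -- U+00E9 is inside U+00E0..U+00FF
print(inrange(euro, a_grave, y_diaer))     -- U+20AC is outside
print("z" < e_acute, e_acute < euro)       -- 1-byte < 2-byte < 3-byte forms
```

The comparison also holds across encodings of different lengths,
because a longer encoding always starts with a larger lead byte.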

> and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership.

I don't understand what you mean here :-/
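
One possible reading, sketched under my own assumptions: compare the
whole tail of the string, starting at the byte offset, against the
lower bound and against the encoding of the code point just past the
upper bound. The helper name inrange_at is hypothetical, the offset
is assumed to sit on the first byte of a character, and the "C"
collation locale is assumed again.

```lua
os.setlocale("C", "collate")

-- Hypothetical helper: is the character that starts at byte offset i
-- of s inside the half-open range [lower, upper_plus_one)?
local function inrange_at(s, i, lower, upper_plus_one)
  local tail = s:sub(i)  -- the character plus everything after it
  return tail >= lower and tail < upper_plus_one
end

local a_grave = "\195\160"        -- U+00E0
local y_diaer = "\195\191"        -- U+00FF
local past    = "\196\128"        -- U+0100, i.e. U+00FF plus one
local s       = "caf\195\169 noir"  -- e-acute (U+00E9) starts at byte 4

print(inrange_at(s, 4, a_grave, past))  -- tests the e-acute
print(inrange_at(s, 1, a_grave, past))  -- tests the "c"

-- Why "plus one": with a closed upper bound, trailing bytes break the test.
local t = y_diaer .. "x"
print(t <= y_diaer, inrange_at(t, 1, a_grave, past))
```

With a closed upper bound the trailing bytes make the tail compare
greater than the bound, so the last character of the range is wrongly
rejected; the strict comparison against the next code point avoids
that without extracting the character first.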

-- Pierre-Yves