lua-l archive



2014-04-19 1:50 GMT+02:00 Hisham <h@hisham.hm>:
> On 17 April 2014 21:55, Keith Matthews <keith.l.matthews@gmail.com> wrote:
>> It is indeed a well defined target. Don't get me wrong, the computer
>> scientist side of me would love to see pattern matching with support
>> for UTF-8 encoded Unicode code points in Lua. It would be useful for
>> low-level manipulation of UTF-8 data. However, my software engineer
>> side thinks that it's not worth opening that can of worms.

> In that case, if one starts from the assumption that people will
> mistake UTF-8 for Unicode and mishandle Unicode by using UTF-8, what's
> the point of having any UTF-8 support in core Lua at all?
>
> I started from the assumption that UTF-8 support in Lua meant
> low-level (as in codepoint-level) UTF-8 manipulation and nothing else.
> I was just trying to assess how complete this UTF-8 support could be.

IMHO UTF-8 support in Lua 5.3 is exactly at the point where the law of
diminishing returns kicks in.

You can split up UTF-8 into codepoints represented as integers,
either via an iterator (utf8.codes) or via a return list (utf8.codepoint).
You can find out where a codepoint starts if you know the location of
any byte in it (utf8.offset) and you can recognize codepoints using the
string library (utf8.charpatt). You can build UTF-8 from codepoints
(utf8.char) and you can get its length (utf8.len).
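Assuming a Lua 5.3 work version (both the utf8 library and the \u{...} escape are new in 5.3), that inventory can be sketched as:

```lua
-- Lua 5.3: codepoint-level access to a UTF-8 string.
local s = "h\u{E9}llo"            -- 'é' occupies bytes 2-3

for pos, cp in utf8.codes(s) do   -- iterate codepoints with byte positions
  io.write(pos, ":", cp, " ")     -- 1:104 2:233 4:108 5:108 6:111
end
print()

print(utf8.codepoint(s, 1, -1))   -- 104 233 108 108 111 (return list)
print(utf8.offset(s, 0, 3))       -- 2: byte 3 sits inside the 'é' codepoint
print(utf8.char(104, 233))        -- hé: build UTF-8 from codepoints
print(utf8.len(s), #s)            -- 5 codepoints, 6 bytes
```

(In the released 5.3 the recognition pattern is spelled utf8.charpattern rather than utf8.charpatt.)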

Also don't forget that string.pack and string.unpack add
further capabilities.
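In the released 5.3 those appear under the names string.pack and string.unpack, driven by a format string; a minimal sketch of the integer case:

```lua
-- Lua 5.3: convert an integer to a binary string and back.
local bin = string.pack("<i4", 2014)   -- little-endian 4-byte integer
assert(#bin == 4)

local n, nextpos = string.unpack("<i4", bin)
print(n, nextpos)                      -- 2014  5 (first unread byte)
```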

We could have had fewer than six functions, since some can
be written in terms of the others, but they are so frequently needed
that the utility/cost ratio is high enough.
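For example, the length function can be written in terms of the iterator alone (mylen is a hypothetical name; the built-in utf8.len is still worth keeping, since it also reports the position of the first invalid byte instead of raising an error mid-iteration as utf8.codes does):

```lua
-- One utf8 function expressed via another: length from the iterator.
local function mylen(s)
  local n = 0
  for _ in utf8.codes(s) do n = n + 1 end
  return n
end

print(mylen("h\u{E9}llo"))   -- 5, same as utf8.len
```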

Beyond that, as Keith says, we start opening a can of worms.
Things that some people absolutely must have are things that
other people would never need, and the utility/cost ratio drops.

On the face of it, the modest target of a library for codepoints
that corresponds function-by-function to the byte-oriented string
library is a worthy compromise.

But implementing it as a UTF-8-specific change to the pattern
engine would miss a great opportunity.

Suppose the string library had a function `string.defineclass`.

> string.defineclass("%u",utf8.charpatt)
stdin:1: attempt to redefine a built-in character class
> string.defineclass("%Z",utf8.charpatt)
stdin:1: bad argument #1 to 'defineclass' (must be lowercase)
> string.defineclass("%z",utf8.charpatt) -- then do some great stuff with %z and %Z
> string.defineclass("%z",nil) -- %z and %Z not available anymore

What could be allowed as the second argument (a string, a table,
a function?) would become clear during implementation.
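string.defineclass does not exist in any Lua release; under that caveat, here is a pure-Lua approximation of the string-argument case, which expands a user-defined class textually before the pattern reaches the stock engine (defineclass, expand and the class %z are all hypothetical names, and the released 5.3 spells the character pattern utf8.charpattern):

```lua
-- Pure-Lua sketch of user-defined character classes: a registry of
-- lowercase class names mapped to replacement patterns, expanded
-- textually before matching.  Not the proposed C implementation,
-- and no protection yet against shadowing built-in classes.
local classes = {}

local function defineclass(cls, pat)
  assert(cls:match("^%%%l$"), "must be lowercase")
  classes[cls] = pat                         -- pat == nil removes the class
end

local function expand(pattern)
  return (pattern:gsub("%%(%l)", function(c)
    return classes["%" .. c] or ("%" .. c)   -- leave undefined classes alone
  end))
end

defineclass("%z", utf8.charpattern)          -- one UTF-8 encoded codepoint
print(("h\u{E9}llo"):match(expand("%z")))    -- h  (first UTF-8 character)
print(("\u{E9}llo"):match(expand("^%z")))    -- é  (multi-byte character)
```

A real implementation inside the pattern engine could then accept a table of ranges or a match function as well, which is exactly the design question left open above.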

If the proposed "text" mode could be implemented on top of
that, we would have gained a lot more.