Petite Abeille wrote:
> What's wrong with hex sequences?
> 
> print( '\xE2\x86\x92' )

I guess the problem is that most character tables list only the code
point, not its UTF-8 encoding. Writing out the UTF-8 bytes does have one
advantage at present: it makes clear to the user why the usual pattern
matching fails.
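
To make the gap between code point and encoding concrete, here is a
minimal sketch of the conversion in pure Lua. The name utf8_encode is
made up for this illustration, and the sketch only covers the BMP
(code points up to U+FFFF) without validating its input.

```lua
-- Encode one BMP code point (0 .. 0xFFFF) as a UTF-8 byte string.
-- utf8_encode is a hypothetical helper, not part of standard Lua.
local function utf8_encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)                                -- 0xxxxxxx
  elseif cp < 0x800 then
    return char(192 + floor(cp / 64),              -- 110xxxxx
                128 + cp % 64)                     -- 10xxxxxx
  else
    return char(224 + floor(cp / 4096),            -- 1110xxxx
                128 + floor(cp / 64) % 64,         -- 10xxxxxx
                128 + cp % 64)                     -- 10xxxxxx
  end
end

local arrow = utf8_encode(0x2192)   -- U+2192 RIGHTWARDS ARROW
print(arrow:byte(1, -1))            -- prints 226, 134 and 146 (0xE2 0x86 0x92)
```

The three bytes are the same ones spelled out by hand in the quoted
'\xE2\x86\x92'.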

However, Mike does have a point. If someone builds UTF-8 libraries, it
would be quite convenient if codepoint escape sequences were available.
Still, I guess the sequences shouldn't be advertised in the Lua manual,
and their limitations would have to be clearly stated. Without the
correct library functions to support them, the codepoint escapes are
likely to cause confusion.
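
A small sketch of the kind of confusion meant here, using only the
standard string library: the functions operate on bytes, so a
multi-byte character inside a character class does not behave like a
single character.

```lua
local right = "\226\134\146"   -- U+2192 RIGHTWARDS ARROW as UTF-8
local left  = "\226\134\144"   -- U+2190 LEFTWARDS ARROW as UTF-8

-- The length operator counts bytes, not characters.
print(#right)                          -- 3

-- A character class is a set of bytes, so "[→]" also matches the
-- lead byte that "←" happens to share with "→".
print(left:find("[" .. right .. "]"))  -- 1  1

-- A plain substring search compares whole byte sequences and is safe.
print(left:find(right, 1, true))       -- nil
```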

I wonder how compactly you can store the character classes for the 65k
codepoints in the BMP, together with the lowercase/uppercase pairs (for
string.lower, string.upper). Maybe that can be compressed far enough to
be included in official Lua (5.3?). That would be great.
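
One possible direction, purely as an illustration: store each class as
a sorted list of code point ranges and look code points up by binary
search. The table below is a tiny hand-picked subset of uppercase
ranges, not real Unicode data, and the names upper_ranges and in_ranges
are made up for this sketch.

```lua
-- Illustrative subset only; a real table would be generated from the
-- Unicode data files. Entries are {first, last}, sorted by first.
local upper_ranges = {
  {0x0041, 0x005A},  -- Latin A-Z
  {0x00C0, 0x00D6},  -- Latin-1 capitals before U+00D7 (multiplication sign)
  {0x00D8, 0x00DE},  -- Latin-1 capitals after U+00D7
  {0x0391, 0x03A1},  -- Greek Alpha-Rho
  {0x03A3, 0x03A9},  -- Greek Sigma-Omega
  {0x0410, 0x042F},  -- Cyrillic A-Ya
}

-- Binary search for a range that contains cp.
local function in_ranges(ranges, cp)
  local lo, hi = 1, #ranges
  while lo <= hi do
    local mid = math.floor((lo + hi) / 2)
    local r = ranges[mid]
    if cp < r[1] then
      hi = mid - 1
    elseif cp > r[2] then
      lo = mid + 1
    else
      return true
    end
  end
  return false
end

print(in_ranges(upper_ranges, 0x0416))  -- true  (Cyrillic capital Zhe)
print(in_ranges(upper_ranges, 0x00D7))  -- false (multiplication sign)
```

How far this shrinks the full BMP tables depends on how contiguous the
real classes are, which this sketch does not measure.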

-- David


PS: An illustration of the usefulness of UTF-8 libraries:

In my JSON implementation I wanted to include the JavaScript regexp

> /[\\\"\x00-\x1f\x7f-\x9f\u00ad\u0600-\u0604\u070f\u17b4\u17b5\u200c-\u200f\u2028-\u202f\u2060-\u206f\ufeff\ufff0-\uffff]/g


In Lua that turned into:

> local function quotestring (value)
>   -- based on the regexp "escapable" in https://github.com/douglascrockford/JSON-js
>   value = fsub (value, "[%z\1-\31\"\\\127]", escapeutf8)
>   if strfind (value, "[\194\216\220\225\226\239]") then
>     value = fsub (value, "\194[\128-\159\173]", escapeutf8)
>     value = fsub (value, "\216[\128-\132]", escapeutf8)
>     value = fsub (value, "\220\143", escapeutf8)
>     value = fsub (value, "\225\158[\180\181]", escapeutf8)
>     value = fsub (value, "\226\128[\140-\143\168-\175]", escapeutf8)
>     value = fsub (value, "\226\129[\160-\175]", escapeutf8)
>     value = fsub (value, "\239\187\191", escapeutf8)
>     value = fsub (value, "\239\191[\176-\191]", escapeutf8)
>   end
>   return "\"" .. value .. "\""
> end

(fsub is just an optimization for gsub).
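
Neither fsub nor escapeutf8 is shown above, so here is a rough sketch
of what such helpers could look like. Both bodies are assumptions made
for illustration: fsub is taken to mean "only call gsub if the pattern
occurs at all", and escapeutf8 is taken to decode one UTF-8 sequence of
up to three bytes and return a \uXXXX escape.

```lua
local strfind, gsub = string.find, string.gsub
local strbyte, strformat = string.byte, string.format

-- Assumed behaviour: skip gsub (which always builds a new string)
-- when a cheap find shows there is nothing to replace.
local function fsub(str, pattern, repl)
  if strfind(str, pattern) then
    return (gsub(str, pattern, repl))
  end
  return str
end

-- Hypothetical escapeutf8: decode a 1- to 3-byte UTF-8 sequence and
-- return it as a JSON \uXXXX escape. No validation is done.
local function escapeutf8(seq)
  local a, b, c = strbyte(seq, 1, 3)
  local cp
  if not b then
    cp = a                                               -- 0xxxxxxx
  elseif not c then
    cp = (a - 192) * 64 + (b - 128)                      -- 110xxxxx 10xxxxxx
  else
    cp = (a - 224) * 4096 + (b - 128) * 64 + (c - 128)   -- 1110xxxx 10xxxxxx 10xxxxxx
  end
  return strformat("\\u%04x", cp)
end

-- U+00AD SOFT HYPHEN is "\194\173" in UTF-8.
print(fsub("soft\194\173hyphen", "\194[\128-\159\173]", escapeutf8))
-- prints soft\u00adhyphen
```

With this sketch a double quote or backslash would come out as \u0022
or \u005c, which is valid JSON but longer than the usual \" and \\.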

Or LPeg:

>   local SpecialChars = (R"\0\31" + S"\"\\\127" +
>     P"\194" * (R"\128\159" + P"\173") +
>     P"\216" * R"\128\132" +
>     P"\220\143" +
>     P"\225\158" * S"\180\181" +
>     P"\226\128" * (R"\140\143" + R"\168\175") +
>     P"\226\129" * R"\160\175" +
>     P"\239\187\191" +
>     P"\239\191" * R"\176\191") / escapeutf8
> 
>   local QuoteStr = g.Cs (g.Cc "\"" * (SpecialChars + 1)^0 * g.Cc "\"")

(I guess there are already libraries and Lua bindings to make this
easier, but the point of my JSON library was to stay independent and
easy to use in environments like MUD clients where you might not have
much more than pure Lua).