Re: UTF-8 patterns in Lua 5.3

lua-l archive

Subject: Re: UTF-8 patterns in Lua 5.3
From: Jay Carlson <nop@...>
Date: Sun, 20 Apr 2014 15:42:13 -0400

On Apr 19, 2014 10:00 PM, "Keith Matthews" <keith.l.matthews@gmail.com> wrote:

> On Sat, Apr 19, 2014 at 1:00 PM, Jay Carlson <nop@nop.com> wrote:

> > As an aside, I like the demarcation point of "Lua does UTF-8, but it does
> > not know Unicode." It is always good to be clear what you are *not* trying
> > to do.
>
> I'm no Unicode expert, but this doesn't make sense to me. UTF-8 is
> merely a Unicode encoding, so of course Lua 5.3 work 2 "knows"
> Unicode.

OK, different angle:

Lua has read RFC 3629 (https://tools.ietf.org/html/rfc3629), and it SHOULD be easy to write programs which conform to that standard.

Lua knows nothing about NFC or IDN or anything like that. If you want NFC, you MAY write it in Lua or MAY call your platform's C frameworks.
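For readers unsure where the line between "UTF-8" and "Unicode" falls, here is a small sketch in Python (not Lua, and not any proposed Lua API). It assumes Python's standard `unicodedata` module as a stand-in for "your platform's frameworks". Both strings are well-formed UTF-8 under RFC 3629, yet only a Unicode-aware library can tell that they are the same text.

```python
import unicodedata

# "é" written two ways: one precomposed code point, or "e" plus a combining accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# At the UTF-8 level these are simply different, equally valid byte strings.
precomposed_bytes = precomposed.encode("utf-8")   # 2 bytes
decomposed_bytes = decomposed.encode("utf-8")     # 3 bytes

# Only normalization (a Unicode algorithm, not an encoding rule) equates them.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

An encoding layer only has to get the bytes right. Deciding that the two spellings are equivalent needs the Unicode data tables.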

I'll note again that slnunicode uses Tcl's compact Unicode tables.

Anyway, there is one bit of utf8 functionality I want in liblua.so not mentioned so far: safe handling of invalid sequences. Personally, I am going to assert() validity everywhere. But I know that "crash the current process" is unpopular. http://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences describes a nasty set of bugs of the form

<span style=width:100%$> _onMouseOver_=doBadStuff()...

where the $ is an incomplete sequence. If a processor deletes the invalid sequence together with the following character it tried to read as a continuation byte (here the >), it can turn that into

<span style=width:100% _onMouseOver_=doBadStuff()...

Which is Bad Stuff. UTR#36 proposes "always produce some output for all input" as a general rule for converters. If the $ is turned to U+FFFD ( � REPLACEMENT CHARACTER), attackers are not able to delete any syntactically important following characters.

I think a Lua utf8 version of "convert and keep going" needs to follow a rule like this. "Keep going" is what I think people are going to demand for a wrapped version of io.lines().
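A minimal sketch of such a rule, written in Python rather than Lua and not taken from any Lua source: the function name `decode_keep_going` and the helper `_lead_info` are hypothetical. It assumes the "replace the maximal ill-formed prefix with one U+FFFD" policy, with byte ranges following the well-formed sequence table implied by RFC 3629. The important property is that a byte which fails to be a valid continuation is never consumed. It is re-examined as the start of the next character.

```python
REPLACEMENT = "\ufffd"

def _lead_info(b):
    """Return (length, min_second, max_second) for a lead byte, or None."""
    if 0xC2 <= b <= 0xDF:
        return 2, 0x80, 0xBF
    if b == 0xE0:
        return 3, 0xA0, 0xBF          # excludes overlong 3-byte forms
    if 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
        return 3, 0x80, 0xBF
    if b == 0xED:
        return 3, 0x80, 0x9F          # excludes surrogates U+D800..U+DFFF
    if b == 0xF0:
        return 4, 0x90, 0xBF          # excludes overlong 4-byte forms
    if 0xF1 <= b <= 0xF3:
        return 4, 0x80, 0xBF
    if b == 0xF4:
        return 4, 0x80, 0x8F          # excludes values above U+10FFFF
    return None                       # stray continuation, 0xC0, 0xC1, 0xF5..0xFF

def decode_keep_going(data: bytes) -> str:
    """Decode UTF-8, emitting U+FFFD for ill-formed input and never
    swallowing a byte that did not belong to the broken sequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                  # ASCII
            out.append(chr(b))
            i += 1
            continue
        info = _lead_info(b)
        if info is None:              # cannot start a sequence
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = info
        cp = b & (0xFF >> (length + 1))
        j = i + 1
        ok = True
        for k in range(1, length):
            if j >= n:                # input ended mid-sequence
                ok = False
                break
            c = data[j]
            low, high = (lo, hi) if k == 1 else (0x80, 0xBF)
            if not (low <= c <= high):
                ok = False            # leave data[j] for the next iteration
                break
            cp = (cp << 6) | (c & 0x3F)
            j += 1
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j                         # always advances by at least one byte
    return "".join(out)
```

Applied to the example above with a lone 0xE0 in place of the $, `decode_keep_going(b"<span style=width:100%\xe0> ...")` keeps the `>` and puts a single U+FFFD in front of it, so the attribute that follows stays outside the tag.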

> Of course, it provides no Unicode algorithms since they typically need
> lookup tables larger than Lua itself,

I wish this meme would go away. It's misleading in two different ways.

First, subsetted but useful character class info is a lot smaller than people think. https://github.com/LuaDist/slnunicode/blob/1.1/slnudata.c compiles to 14k of const/text segment data. Three and a half pages.[1]

Second, a very large number of systems have system libraries to handle Unicode character classification and normalization, and those libraries tend to be paged into RAM because other software is touching them non-stop. That makes the actual memory overhead for Lua something like the number of GOT/PLT entries dirtied to dynamically link those libraries.

To be clear: this is stuff I'm not talking about putting in liblua.so. Code to do lookups on slnudata-style data maybe. Lua user code can slurp in binary lookup blobs as strings, then use them in utf8 iterators. Perhaps this is how to handle utf8.match character classes.
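To make the "binary lookup blob" idea concrete, here is a rough sketch in Python rather than Lua. The record layout, the class ids and the tiny range table are all hypothetical and chosen only for illustration. This is not the slnudata format, and the table is nowhere near complete. The point is only that a packed, sorted range table held in one immutable string can be binary-searched directly, without unpacking it into a larger in-memory structure first.

```python
import struct

# Hypothetical record: first code point, last code point, class id (big-endian, 9 bytes).
RECORD = struct.Struct(">IIB")

# Hypothetical class ids.
OTHER, UPPER, LOWER, DIGIT = 0, 1, 2, 3

# Deliberately tiny illustrative table; ranges must be sorted and non-overlapping.
_RANGES = [
    (0x0030, 0x0039, DIGIT),   # 0..9
    (0x0041, 0x005A, UPPER),   # A..Z
    (0x0061, 0x007A, LOWER),   # a..z
    (0x00C0, 0x00D6, UPPER),   # Latin-1 capitals before the multiplication sign
    (0x00DF, 0x00F6, LOWER),   # Latin-1 small letters before the division sign
    (0x0391, 0x03A1, UPPER),   # Greek capital alpha..rho
    (0x03B1, 0x03C9, LOWER),   # Greek small alpha..omega
]

# In practice this would be read from a file; here it is built in place.
BLOB = b"".join(RECORD.pack(*r) for r in _RANGES)

def char_class(blob: bytes, cp: int) -> int:
    """Binary-search the packed range table for code point cp."""
    lo, hi = 0, len(blob) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        first, last, cls = RECORD.unpack_from(blob, mid * RECORD.size)
        if cp < first:
            hi = mid
        elif cp > last:
            lo = mid + 1
        else:
            return cls
    return OTHER   # not covered by any range
```

A character-class test in a pattern matcher would then reduce to something like `char_class(BLOB, cp) == UPPER` for each code point the iterator yields.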

--
Jay

[1]: For a lot of people, that 14k slnudata.o is less than ~3.5k on disk, because if you're counting individual pages, you're using LZO (btrfs etc) or gzip (OS X) or LZMA (squashfs) for on-disk compression. I suspect the 2k I got with xz-tiny is not actually achievable in a real squashfs.

Yes, I have a port of Lua to Teensy 3.1 too. I don't need a lecture about code size and embedded environments, thanks.

Follow-Ups:
- Re: UTF-8 patterns in Lua 5.3, Dirk Laurie
- Re: UTF-8 patterns in Lua 5.3, Roberto Ierusalimschy

References:
- UTF-8 patterns in Lua 5.3, Hisham
- Re: UTF-8 patterns in Lua 5.3, Jay Carlson
- Re: UTF-8 patterns in Lua 5.3, Keith Matthews
