- Subject: Re: Could Lua itself become UTF8-aware?
- From: Jay Carlson <nop@...>
- Date: Sun, 30 Apr 2017 17:01:23 -0400
[Why do I write the same messages every year or two? I forget when and whether I've sent them, and every time I write about the subjects, I learn more about them. Hopefully this is more coherent a summary than before, although it needs a lot more polishing. Maybe there's a workshop paper in here if I ever finish the code.]
On Apr 29, 2017, at 9:21 AM, Roberto Ierusalimschy <firstname.lastname@example.org> wrote:
>> At present all the entries from 0x80 to 0xFF in the constant array
>> luai_ctype in lctype.c are zero: no bit set.
>> There are three unused bits. Couldn't two of them be used to mean
>> UTF8_FIRST and UTF8_CONT?
That works as long as people still check for things like UTF-16 surrogates in UTF-8. IMO C code is almost necessary for very unsexy things like this, because it can be done once, correctly, and with good performance.
Are these ctype values going to be turned into pattern classes?
>> This is only the first step, but if the idea is shot down here already,
>> the others need not be mentioned.
> This particular idea has very low cost, so I don't see why to shot it
> down before knowing the rest of the story. What does it mean for Lua
> to be "UTF-8 aware"?
One thing I should make clear at the start of this: I agree that there is not more than one kind of string in Lua, at least not in the obvious future. There can't be a type-distinction between UTF-8, ASCII, or byte buffer--I think?
So what would I like?
I'd like Lua to better support checking whether something is RFC-legal UTF-8. My goal is prompt, syntactically local diagnosis of string manipulation errors. I don't want either me or the bozos down the hall passing my function a bogus byte stream when it needs to be in UTF-8 for the program to be correct.
I already heavily assert(is_utf8) on application core utilities; that way I won't let bogus byte buffers propagate into my internal text models. For a big class of string-manipulating programs, not letting bad UTF-8 in magically solves it on output, as many strings are just passed through. (IIRC, the biggest place you can shoot yourself in the foot is by using "." in Lua patterns. But you're safe if all your "."s are surrounded by literal codepoints.)
Somewhere I have a patch for 5.3 which uses a stolen alignment byte in a TString to store character subset info. Lua strings are immutable, but the implementation can keep notes about them.
I use the three-valued logic of "string known to be in a charset", "known not in a charset", and "unknown". is_utf8(s) just looks at the UTF-8 field, and if its membership is known, it returns it. Otherwise it scans the string as usual and stores what it has found in the TString flags before returning.
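A minimal sketch of the "scan once, remember the answer" idea described above. This is not the patch itself: DemoString, its notes byte, CsState, cached_member, and the scans_performed counter are all hypothetical names made up for illustration, and the scanner here is a simple ASCII check standing in for whichever charset test is being cached.

```c
#include <assert.h>
#include <stddef.h>

/* Three-valued membership. The two-bit encoding is an assumption. */
typedef enum { CS_UNKNOWN = 0, CS_YES = 1, CS_NO = 2 } CsState;

/* Stand-in for a TString: immutable bytes plus a byte of mutable notes. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
    unsigned char notes;   /* low two bits hold a CsState */
} DemoString;

typedef int (*Scanner)(const unsigned char *bytes, size_t len);

static int scans_performed = 0;   /* demonstration only: counts full scans */

/* Example scanner: true when every byte is 7-bit ASCII. */
static int scan_is_ascii(const unsigned char *bytes, size_t len) {
    scans_performed++;
    for (size_t i = 0; i < len; i++)
        if (bytes[i] & 0x80) return 0;
    return 1;
}

/* Return the cached answer if known; otherwise scan and record it. */
static int cached_member(DemoString *s, Scanner scan) {
    assert(s != NULL && scan != NULL);
    CsState known = (CsState)(s->notes & 0x3);
    if (known == CS_UNKNOWN) {
        known = scan(s->bytes, s->len) ? CS_YES : CS_NO;
        s->notes = (unsigned char)((s->notes & ~0x3u) | (unsigned)known);
    }
    return known == CS_YES;
}
```

Calling cached_member twice on the same DemoString runs the scanner only the first time; the second call is a flag lookup.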
This is a lot cheaper than it looks, and I think it can be done for minimal costs for most existing programs. The library would be opt-in. The great majority of costs are placed on people who want utf8 validation. But there are some things that only the language can do cheaply.
The first thing we can check for is "does this string belong to ASCII?". For strings we're fully hashing, this test is very cheap: just OR together all the bytes as we process them. Check the high bit.
If we're not fully hashing a string, we don't know anything about it, so set all flags to "unknown".
However, if we know a string is definitely in ASCII, it is also in UTF-8, so we mark the string as known to be both UTF-8 and ASCII whenever we notice. There are a lot of strings in this world written in ASCII, so marking them as UTF-8 too is cheap.
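The ASCII check can ride along with hashing roughly as follows. This is a sketch under assumptions: the hash mixing step is only a stand-in (it is not claimed to be Lua's actual hash function), and HashResult, CsState, and hash_and_classify are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CS_UNKNOWN = 0, CS_YES = 1, CS_NO = 2 } CsState;

/* Hypothetical result of hashing a short string in full. */
typedef struct {
    unsigned int hash;
    CsState ascii;
    CsState utf8;
} HashResult;

static HashResult hash_and_classify(const char *str, size_t len,
                                    unsigned int seed) {
    assert(str != NULL || len == 0);
    unsigned int h = seed ^ (unsigned int)len;
    unsigned char seen = 0;               /* OR of every byte visited */
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)str[i];
        h ^= (h << 5) + (h >> 2) + b;     /* stand-in mixing step */
        seen |= b;                        /* the only extra work per byte */
    }
    HashResult r;
    r.hash = h;
    if (seen & 0x80) {
        r.ascii = CS_NO;                  /* at least one byte >= 0x80 */
        r.utf8 = CS_UNKNOWN;              /* needs a real validation pass */
    } else {
        r.ascii = CS_YES;
        r.utf8 = CS_YES;                  /* ASCII is a subset of UTF-8 */
    }
    return r;
}
```

A string that is not fully hashed would simply get CS_UNKNOWN in both fields.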
If a string flunks the ASCII test (or its ASCII status is unknown because it's long), the string might still be legal UTF-8 data. The first time somebody asks us whether a string is in UTF-8 with is_utf8, we have to walk the entire string and determine if it's legal. As before, we remember the result in the TString before we return.
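For reference, the full walk might look like the sketch below. It follows the RFC 3629 rules (no overlong forms, no UTF-16 surrogates, nothing above U+10FFFF). The function name utf8_valid is hypothetical, and this is one plain way to write the check, not necessarily how the patch does it.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if str[0..len) is well-formed UTF-8 per RFC 3629, else 0. */
static int utf8_valid(const char *str, size_t len) {
    assert(str != NULL || len == 0);
    const unsigned char *s = (const unsigned char *)str;
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t need;             /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */
        if (b < 0x80) { i++; continue; }
        else if (b >= 0xC2 && b <= 0xDF) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if (b >= 0xE0 && b <= 0xEF) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;           /* stray continuation, C0/C1, or F5..FF */
        if (len - i - 1 < need) return 0;            /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return 0;        /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        i += need + 1;
    }
    return 1;
}
```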
The good news about ASCII and UTF-8 is that they're closed under concatenation. When two strings are concatenated ("a".."‡"), we can automatically set the result's "known" flags by combining the flags on the two strings. We may not know anything about one of the strings, in which case the result's status is unknown. If anybody had ever called is_utf8() on "a" and "‡", we'd immediately know the result is in UTF-8 and know it's not ASCII.
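The flag combination at concatenation time could be sketched like this. Notes, CsState, and concat_notes are hypothetical names. One detail goes slightly beyond the text and is an assumption of this sketch: a known non-ASCII operand makes the result known non-ASCII even if the other operand is unknown, since the offending byte is still there. For UTF-8 the sketch stays conservative and only propagates "yes".

```c
#include <assert.h>

typedef enum { CS_UNKNOWN = 0, CS_YES = 1, CS_NO = 2 } CsState;

/* Hypothetical per-string notes. */
typedef struct { CsState ascii; CsState utf8; } Notes;

static Notes concat_notes(Notes a, Notes b) {
    /* Invariant assumed by this sketch: known ASCII implies known UTF-8. */
    assert(a.ascii != CS_YES || a.utf8 == CS_YES);
    assert(b.ascii != CS_YES || b.utf8 == CS_YES);
    Notes r;
    if (a.ascii == CS_YES && b.ascii == CS_YES)
        r.ascii = CS_YES;      /* closed under concatenation */
    else if (a.ascii == CS_NO || b.ascii == CS_NO)
        r.ascii = CS_NO;       /* a high-bit byte survives the join */
    else
        r.ascii = CS_UNKNOWN;
    /* Only "yes and yes" is propagated: two invalid pieces can rejoin
       into valid UTF-8, so anything else is left for a later scan. */
    r.utf8 = (a.utf8 == CS_YES && b.utf8 == CS_YES) ? CS_YES : CS_UNKNOWN;
    return r;
}
```

With "a" marked as ASCII and UTF-8, and "‡" marked as UTF-8 but not ASCII, the result comes out as UTF-8 and not ASCII without touching any bytes.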
To be honest, the place where I stopped working on this was before adding a library to liblua. I had proved to myself that this would work on a small scale, and examining TStrings in the debugger showed them working right. Next goal was to make an analog of string.* in utf8.*; all calls to the utf8 library will blow up on receiving or producing invalid UTF-8. Then I got distracted.
Conveniently, any strings that come out of utf8.* functions have to be validated by is_utf8(); this means the strings produced by the library already have "is UTF-8" and "is ASCII" set accordingly.
I've got two more status flags left in TString. To use the first, I've considered "has no NULs" or "is in UTF-8 and has no NULs", since I've needed both from time to time.
If you've stuck around this long, I have a premature optimization: the last free flag space in TString could be used for diagnosis of why a string failed to validate. "UTF-8 only broken at start of string", "UTF-8 only broken at end", "UTF-8 broken internally", "unknown". Clearly if you concatenate any two strings with any validation failures in the middle, you get a non-UTF-8 string. But there is a chance that concatenating two non-UTF-8 strings will produce a valid string; perhaps a buffer split up a sequence.
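One way the diagnosis flags might be used when joining two strings already known to be invalid. This is speculative: Diag, JoinVerdict, and join_invalid_pair are hypothetical names, and the start/end rules are assumptions derived for this sketch, reading "broken at start" as "begins with stray continuation bytes" and "broken at end" as "ends in a truncated sequence". Only the "broken internally" rule comes straight from the text.

```c
#include <assert.h>

/* Why a known-invalid string failed; four values fit in two bits. */
typedef enum {
    DIAG_UNKNOWN = 0,
    DIAG_BROKEN_START,     /* only defect: stray continuation bytes up front */
    DIAG_BROKEN_END,       /* only defect: truncated sequence at the end */
    DIAG_BROKEN_INTERNAL   /* some defect in the middle */
} Diag;

typedef enum { JOIN_INVALID = 0, JOIN_NEEDS_SCAN = 1 } JoinVerdict;

/* Both operands are assumed to be already known NOT to be UTF-8. */
static JoinVerdict join_invalid_pair(Diag left, Diag right) {
    assert(left <= DIAG_BROKEN_INTERNAL && right <= DIAG_BROKEN_INTERNAL);
    if (left == DIAG_BROKEN_INTERNAL || right == DIAG_BROKEN_INTERNAL)
        return JOIN_INVALID;   /* a middle defect survives any join */
    if (left == DIAG_BROKEN_START)
        return JOIN_INVALID;   /* result still begins with the stray bytes */
    if (right == DIAG_BROKEN_END)
        return JOIN_INVALID;   /* result still ends mid-sequence */
    /* For example, a left piece cut mid-sequence followed by a right
       piece holding the leftover bytes: the halves may fit together. */
    return JOIN_NEEDS_SCAN;
}
```

JOIN_NEEDS_SCAN does not mean the result is valid, only that a full validation pass is required to find out.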
If anybody has a good benchmark for me to try, I can figure out how much slowdown this causes for existing code. My hope is that the performance hit is less than ~2% on existing string.*-only Lua code, and that validation is significantly faster than sprinkling assertions around. As with LINQ-for-Lua, no promises I'll ship it.
Fascists like me would type:
local string = utf8
in most code.
: How did I make it through that message without writing any footnotes?