On 2017-05-01 05:34 PM, Jay Carlson wrote:
> On May 1, 2017, at 9:05 AM, Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>>> I'd like to have Lua better support checking whether something is RFC-legal UTF-8.
>> What is wrong with 'utf8.len'?
> Nothing I see now. Considering that you fixed it three years ago[1], I am embarrassed; I had written my own and kept it.
> The assert-heavy style for is_utf8 is O(n*m). If you know the rules for UTF-8 manipulation in Lua, the number of callpoints, m, can stay small.
Perhaps you might wanna memoize such calls (with "__mode")? If the string is unchanged between calls, a repeated check is an O(1) lookup instead of an O(n) scan, so you get O(n+m) overall instead. If the string does get changed, interning the new string still ends up being more expensive than the check.
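A minimal sketch of what I mean, assuming Lua 5.3 or later and using utf8.len as the validity check. The names is_utf8 and cache are just illustrative. One caveat I should flag: as far as I read the reference manual, strings are never removed from weak tables, so "__mode" would not actually bound a cache keyed by strings. The sketch therefore uses an ordinary table that you would have to clear yourself.

```lua
-- Hypothetical memoized validity check; is_utf8 and cache are illustrative names.
-- Keys are the strings themselves, values are booleans.
-- This is a plain table: weak mode does not evict string keys.
local cache = {}

local function is_utf8(s)
  local ok = cache[s]
  if ok == nil then
    -- utf8.len returns nil (plus a byte position) at the first invalid sequence.
    ok = utf8.len(s) ~= nil
    cache[s] = ok
  end
  return ok
end
```

The first call on a given string scans it once; later calls on the same string are a single table lookup.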
> I have been saved several times by failed is_utf8 assertions, usually in strings not from my code. OK, my definition of "saved" includes "not producing results outside the domain, possibly causing the next program to crash. maybe." This level of obsession is probably not for everyone.
>
> Jay
>
> [1]: https://github.com/lua/lua/commit/3a044de5a1df82ed5d76f2c5afdf79677c92800f
-- Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.