On Thu, Apr 17, 2014 at 08:55:36PM -0400, Keith Matthews wrote:
<snip>
> I applied your patch to Lua 5.3 work 2, and copied your example:
> 
> (2) > print( ("páscoa"):match("[^é]*$") )
> páscoa
> 
> Great! The patch works. I carefully typed the same example by hand:
> 
> (3) > print( ("páscoa"):match("[^é]*$") )
> scoa
> 
> See the difference? It looks the same, but I used a combining acute
> accent (U+0301) for á and é. So even with your patch, you don't
> avoid this problem: it was only pushed to another abstraction layer.
> Example (1) fails because string.match works at the byte level while
> "é" is composed of two bytes, and example (3) fails because the
> patched string.match works at the code point level while "é" is
> composed of two code points.
> 
> In a way, this is even worse: a programmer will likely type the code
> and write the unit tests with the same text editor, using precomposed
> characters, and everything will appear to work. The bug will only crop
> up much later when the pattern is used on real-world data that
> contains combining characters.
> 
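
For reference, here is a minimal sketch of what the stock, byte-level
string.match does with the two spellings quoted above. It assumes an
unpatched Lua 5.1 or later and UTF-8 encoded text. The decimal escapes spell
out the UTF-8 bytes explicitly so the result doesn't depend on how an editor
composes the accents; the variable names are only illustrative.

```lua
-- Precomposed forms: one code point each, two UTF-8 bytes each.
local a_acute = "\195\161"          -- U+00E1
local e_acute = "\195\169"          -- U+00E9
-- Combining acute accent: a separate code point, two UTF-8 bytes.
local combining_acute = "\204\129"  -- U+0301

-- Precomposed subject and set. The set really contains the bytes 0xC3 and
-- 0xA9, so the 0xC3 lead byte of a_acute is excluded and the match starts
-- in the middle of a character.
local byte_level = ("p" .. a_acute .. "scoa"):match("[^" .. e_acute .. "]*$")
print(#byte_level)  -- 5: the stray continuation byte 0xA1 followed by "scoa"

-- Decomposed subject and set. The set contains "e", 0xCC and 0x81, so the
-- combining accent after "a" stops the match.
local decomposed = ("pa" .. combining_acute .. "scoa"):match("[^e" .. combining_acute .. "]*$")
print(decomposed)   -- scoa
```

In both cases the pattern quietly splits what a reader sees as a single
character; only the layer at which the split happens differs.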

Sometime in the near future there's going to be an avalanche of Unicode
exploits: from string comparison bugs to perhaps even buffer overflows, all
because people erroneously believed they understood how it works, or didn't
consider that what they didn't know might actually matter in ways that
weren't obvious to them. (I certainly don't understand all the relevant
security-critical bits.)

This hasn't been helped by Windows' and Java's early adoption of UTF-16,
which engenders the idea that Unicode text can be indexed just like
ASCII as long as your datatype is wide enough. Yes, many people understand
that a single 16-bit code unit can't hold every code point, but they
erroneously think all one needs to do is upgrade to a wider character type,
et voilà, you've mastered I18N.
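
A small sketch of why a wider character type doesn't settle the matter: the
two strings below display as the same letter, yet differ in byte count, in
code point count, and under ==. The helper codepoint_count is a hypothetical
name, and it assumes its input is valid UTF-8.

```lua
-- Count code points by counting every byte that is not a UTF-8
-- continuation byte (0x80-0xBF). Assumes valid UTF-8 input.
local function codepoint_count(s)
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end

local precomposed = "\195\169"   -- U+00E9, e with acute as one code point
local decomposed  = "e\204\129"  -- U+0065 followed by combining U+0301

print(#precomposed, codepoint_count(precomposed))  -- 2 bytes, 1 code point
print(#decomposed, codepoint_count(decomposed))    -- 3 bytes, 2 code points
print(precomposed == decomposed)                   -- false
```

Even with one array slot per code point, the decomposed form still occupies
two slots for one visible character, and a plain comparison still treats the
two spellings as different strings.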

IMO, dealing with Unicode strings requires novel string handling primitives.
Regular expressions are a great place to start from a practical viewpoint,
because they already operate on string vectors and provide property classes,
so people don't worry about indexing. And you can hide a tremendous amount
of complexity behind predefined patterns. It's just that implementing it all
properly is a huge undertaking. And if you don't implement it properly
you're literally shipping buggy code, either intentionally or negligently,
and it will cause havoc eventually.

I think Lua is a great language to play around with ideas for new string
primitive operations, what with its support for metatables, iterators, and
first-class C modules. But hacking the existing low-level string pattern
library is just not one of the avenues I think makes sense. For one thing,
you lose two of its best features, speed and simplicity. You can get really
far by treating UTF-8 as ASCII, such as safely parsing most protocols and
formats.
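
A minimal sketch of that last point, assuming a made-up "key=value;key=value"
line format (the format and field names are only for illustration). In UTF-8
every byte of a multi-byte sequence is 0x80 or above, so an ASCII delimiter
can never appear inside a character, and the byte-level patterns split the
line correctly without knowing anything about Unicode.

```lua
-- "p\195\161scoa" is the UTF-8 encoding of the word with a precomposed
-- U+00E1 as its second character.
local line = "name=p\195\161scoa;lang=pt"

local fields = {}
-- Keys run up to "=", values run up to ";". Both delimiters are ASCII, so
-- they cannot collide with any byte of a multi-byte UTF-8 sequence.
for key, value in line:gmatch("([^=;]+)=([^;]*)") do
  fields[key] = value
end

print(fields.name == "p\195\161scoa")  -- true: the value arrives intact
print(fields.lang)                     -- pt
```

The non-ASCII value passes through untouched; trouble only starts when a
pattern tries to name a non-ASCII character itself.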