lua-l archive



uri cohen wrote:
[...]
My question is: how can I verify that my port works? Other than the toy scripts I created, I'm looking for a comprehensive set of tests I can run to verify that no important language features were broken...

Unicode is harder than it looks... one reason Lua doesn't really use it is that once you start dealing with Unicode you start finding places where you get conflicting requirements.

For example, é can be represented either as U+0065 U+0301 (an e followed by a combining acute accent) or as the single code point U+00E9. Do these compare equal? They are technically the same thing.
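
To make that concrete, here is a rough sketch in Python (chosen only because its standard library ships the normalization tables; none of this is Lua API). It assumes the decomposed spelling is an e followed by U+0301 COMBINING ACUTE ACCENT, and it uses NFC as one possible definition of "equal"; which definition a port should adopt is exactly the open question.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# A plain comparison looks at code points, so the two spellings differ.
raw_equal = composed == decomposed

# After canonical composition (NFC) both spellings become U+00E9.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# The encoded lengths differ too, which matters to anything counting bytes.
utf8_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

print(raw_equal, nfc_equal, utf8_lengths)
```

This prints `False True (2, 3)`: the same visible character, two different strings, two different lengths.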

What about sorting order? Does Ё (U+0401) sort before, after, or equal to Ë (U+00CB)? For that matter, what about ഐ (U+0D10) and ᚔ (U+1694)? What about E (U+0045), Ε (U+0395), Е (U+0415), ⋿ (U+22FF), ⴹ (U+2D39), Ｅ (U+FF25), 𝐄 (U+1D404), 𝐸 (U+1D438), 𝑬 (U+1D46C), 𝔼 (U+1D53C), 𝖤 (U+1D5A4), 𝗘 (U+1D5D8), 𝘌 (U+1D60C), 𝙀 (U+1D640), 𝙴 (U+1D674), 𝚬 (U+1D6AC), 𝛦 (U+1D6E6), 𝜠 (U+1D720), or 𝝚 (U+1D75A)?
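
As a hedged illustration of why "sorting order" has no single answer, the Python sketch below takes six of the E lookalikes listed above and sorts them three ways: by code point, by UTF-8 bytes, and by UTF-16 code units. None of these is a real collation (that would need locale-aware tailoring, which the standard library does not provide); the helper name `code_points` is made up for this example.

```python
import unicodedata

# A few of the E lookalikes, deliberately out of order.
lookalikes = ["\U0001d6ac", "\uff25", "\u0415", "\U0001d404", "\u0395", "E"]

def code_points(chars):
    """Render each character as U+XXXX so the output is readable without fonts."""
    return [f"U+{ord(ch):04X}" for ch in chars]

# Python compares str values by code point.
by_code_point = sorted(lookalikes)

# Byte-wise order of UTF-8 matches code point order.
by_utf8_bytes = sorted(lookalikes, key=lambda ch: ch.encode("utf-8"))

# Code-unit order of UTF-16 does not: surrogates (0xD800-0xDFFF) sort below U+FF25.
by_utf16_units = sorted(lookalikes, key=lambda ch: ch.encode("utf-16-be"))

# Compatibility normalization (NFKC) folds some lookalikes to ASCII E, but not all.
folds_to_ascii_e = [ch for ch in by_code_point
                    if unicodedata.normalize("NFKC", ch) == "E"]

print(code_points(by_code_point))
print(code_points(by_utf16_units))
print(code_points(folds_to_ascii_e))
```

The first line of output ends `'U+FF25', 'U+1D404', 'U+1D6AC'`, the second ends `'U+1D404', 'U+1D6AC', 'U+FF25'`, and only U+0045, U+FF25 and U+1D404 fold to a plain E; the Greek and Cyrillic letters stay distinct. So a port that compares strings by UTF-16 code units already disagrees with one that compares by code point, before any linguistic rules enter the picture.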

Do you mean UTF-16 or UCS-2? UCS-2 can't handle anything above U+FFFF, which rules out some of the really freaky Unicode characters like 𝌆 (U+1D306) or 🀎 (U+1F00E) --- I don't even have the font to display that last one!
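
The difference shows up as soon as a code point needs more than 16 bits. The following Python sketch computes the UTF-16 surrogate pair for the two characters above by hand and checks it against the built-in codec; the function names `to_surrogates` and `utf16_units` are invented for this example.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

def utf16_units(text):
    """Return the 16-bit code units the standard codec produces for text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

for cp in (0x1D306, 0x1F00E):
    high, low = to_surrogates(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
```

This prints `U+1D306 -> D834 DF06` and `U+1F00E -> D83C DC0E`. In UTF-16 each of these characters takes two code units, so "length" and "index" stop meaning "characters"; in UCS-2 they simply cannot be stored.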

And, most importantly of all, can you still use Lua strings to represent arbitrary binary data, or is the data forced into well-formed UTF-16?
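
A quick way to see the tension, sketched in Python under the assumption that "well-formed" means "decodes as big-endian UTF-16 without error" (the helper `is_well_formed_utf16` is hypothetical, not part of any Lua port):

```python
def is_well_formed_utf16(data):
    """Return True if the bytes decode cleanly as big-endian UTF-16."""
    try:
        data.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "plain text": "hi".encode("utf-16-be"),
    "odd byte count": b"\x00\x41\x00",           # 3 bytes cannot be whole code units
    "lone high surrogate": b"\xd8\x00\x00\x41",  # 0xD800 not followed by a low surrogate
}

for label, data in samples.items():
    print(label, is_well_formed_utf16(data))
```

Only the first sample passes. A string type that insists on well-formed UTF-16 would have to reject the other two, yet both are perfectly ordinary things to find in a file, a network packet, or a compiled chunk.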

One reason people tend to use UTF-8 in Lua is not that it solves all these problems, but that it cleanly divides the problems into soluble ones and non-soluble ones! And it turns out that most people don't care about the non-soluble ones. Unfortunately, once you start trying to *natively* support Unicode, you suddenly find yourself having to care about these things...

(adjusts signature)

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ ⍎'⎕',∊N⍴⊂S←'←⎕←(3=T)⋎M⋏2=T←⊃+/(V⌽"⊂M),(V⊝"M),(V,⌽V)⌽"(V,V←1⎺1)⊝"⊂M)'
│ --- Conway's Game Of Life, in one line of APL