- Subject: Re: Testing LUA: verify the correctness of a UTF16 LUA port
- From: David Given <dg@...>
- Date: Thu, 15 Oct 2009 05:16:40 +0100
Joshua Jensen wrote:
[...]
> It does the equivalent of wcscmp(), only it doesn't rely on the C
> runtime to achieve this. That's because on some non-Visual C++
> compilers, sizeof(wchar_t) != 2. sizeof(lua_WChar) is always 2.
Yes, in the Unix world it's usually 4 (wchar_t is typically a 32-bit int).
It's easier in the console world --- you've got complete control over
all the text on your system, so you can ensure you're not using any
weird stuff like RTL, surrogates, unsupported combining characters, etc.
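A wcscmp() equivalent that avoids the C runtime, as described in the quoted text, could look roughly like the sketch below. This is not LuaPlus's actual code: the type name wchar16 and the function name wstrcmp16 are made up for illustration, and uint16_t stands in for lua_WChar on the assumption that it is a 16-bit unsigned unit.

```c
#include <stdint.h>

/* Hypothetical stand-in for lua_WChar: always 2 bytes, whatever
 * sizeof(wchar_t) happens to be on this compiler. */
typedef uint16_t wchar16;

/* Compare two 0-terminated strings of 16-bit units, like wcscmp().
 * Returns <0, 0 or >0. */
int wstrcmp16(const wchar16 *a, const wchar16 *b)
{
    /* Walk both strings while the units match and a has not ended. */
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    /* Sign of the first differing unit (or of the terminator). */
    return (*a > *b) - (*a < *b);
}
```

Note that this compares UTF-16 code units, not code points, so strings containing surrogate pairs sort differently from their UTF-8 or UTF-32 forms. That only matters if surrogates can appear at all.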
[...]
> I would consider ditching the LuaPlus wide character support if there
> was a small library that supported UTF-8 and allowed easy embedding of
> UTF-8 string types in Lua source files.
Well, UTF-8 in Lua source files already Just Works. (They're treated by
Lua as Bags of Bytes.) As far as libraries go, I wrote some very simple
UTF-8 parsing code for WordGrinder:
http://wordgrinder.svn.sourceforge.net/viewvc/wordgrinder/wordgrinder/src/c/utils.c?view=markup
This will let you read and write raw code points from/to a string in a
relatively simple manner.
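For a rough idea of what reading and writing raw code points involves, here is a minimal sketch. It is not the WordGrinder code linked above; the names utf8_read and utf8_write are invented for this example, and it assumes 0-terminated input. It does not reject overlong encodings or encoded surrogates, which a stricter decoder would.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one code point from *s and advance *s past it.
 * Returns 0 at the terminator and U+FFFD for malformed input. */
uint32_t utf8_read(const char **s)
{
    const unsigned char *p = (const unsigned char *)*s;
    uint32_t c = *p++;
    int extra; /* number of continuation bytes expected */

    if (c < 0x80)                { extra = 0; }
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }
    else { *s = (const char *)p; return 0xFFFD; } /* stray byte */

    while (extra-- > 0) {
        if ((*p & 0xC0) != 0x80) { c = 0xFFFD; break; } /* truncated */
        c = (c << 6) | (uint32_t)(*p++ & 0x3F);
    }
    *s = (const char *)p;
    return c;
}

/* Write code point c to out (at least 4 bytes of room).
 * Returns the number of bytes written, or 0 if c is out of range. */
size_t utf8_write(uint32_t c, char *out)
{
    if (c < 0x80) {
        out[0] = (char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (char)(0xC0 | (c >> 6));
        out[1] = (char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (char)(0xE0 | (c >> 12));
        out[1] = (char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {
        out[0] = (char)(0xF0 | (c >> 18));
        out[1] = (char)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;
}
```

A caller would typically loop with something like `while ((c = utf8_read(&s)) != 0)` to walk a string one code point at a time.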
Thinking about this, a while back I did actually find that Unicode has
real rules for splitting up a UTF-8 string into 'characters', each of
which is an arbitrary-sized string representing a single drawable thing
(I forget the exact term --- grapheme clusters?). So theoretically it
ought to be possible to *truly* do random-access on a string. Maybe I
should revisit this at some point.
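To make the random-access idea concrete, here is a heavily simplified sketch. The real segmentation rules (Unicode Standard Annex #29) depend on per-character property tables; this example only glues code points from the Combining Diacritical Marks block (U+0300..U+036F) onto the preceding character, and assumes valid, 0-terminated UTF-8. The function names are invented for illustration.

```c
#include <stddef.h>

/* Simplifying assumption: the only "extending" characters are
 * U+0300..U+036F, which encode as CC 80..BF and CD 80..AF. */
static int is_combining_at(const unsigned char *p)
{
    return (p[0] == 0xCC && p[1] >= 0x80 && p[1] <= 0xBF) ||
           (p[0] == 0xCD && p[1] >= 0x80 && p[1] <= 0xAF);
}

/* Record the byte offset where each drawable unit starts.
 * Writes at most max offsets into starts; returns the total
 * number of units found in s. */
size_t index_clusters(const char *s, size_t *starts, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0, n = 0;

    while (p[i] != 0) {
        /* A unit starts here unless this code point is a combining
         * mark attached to something before it. */
        if (n == 0 || !is_combining_at(p + i)) {
            if (n < max)
                starts[n] = i;
            n++;
        }
        /* Skip the lead byte and any continuation bytes. */
        i++;
        while ((p[i] & 0xC0) == 0x80)
            i++;
    }
    return n;
}
```

Once the index is built, unit k occupies bytes starts[k] up to starts[k+1] (or the end of the string), so lookup by position is constant time at the cost of one pass up front.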
--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ ⍎'⎕',∊N⍴⊂S←'←⎕←(3=T)⋎M⋏2=T←⊃+/(V⌽"⊂M),(V⊝"M),(V,⌽V)⌽"(V,V←1⎺1)⊝"⊂M)'
│ --- Conway's Game Of Life, in one line of APL