> > An important consideration to be made is whether all strings are Unicode
> > or whether a new Unicode type is to be added (as is done in Python).
> 
> I think we can live outside these two options. Strings may contain
> Unicode data or not (e.g. they may contain raw binary data, as now).
> If you call a function from the new "utf8" library, it will assume
> the string is a Unicode-utf8 string.

This approach sounds reasonable. If the utf8 library is the only Unicode 
string manipulation library, then UTF-8 will effectively be the internal 
encoding of Unicode strings. This has the advantage of some backward 
compatibility, but probably decreases efficiency. Most languages I know of 
use UTF-16 for encoding Unicode strings, but that choice depends on a number 
of factors, so it is not necessarily valid for Lua.
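
To illustrate one of those factors, here is a rough sketch in Python (chosen 
only because it is convenient for a quick test; the helper name encoded_sizes 
is made up for this example). It compares how many bytes the same text takes 
in UTF-8 and in UTF-16. It says nothing about processing speed, only storage.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the given text."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "ascii": "hello",
    "cjk": "\u65e5\u672c\u8a9e",  # three CJK characters
}

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

ASCII-heavy text is smaller in UTF-8, while text made mostly of CJK 
characters is smaller in UTF-16, so which encoding is more compact depends on 
the data being handled.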

> > It is essential that such byte patterns [non-valid Unicode character]
> > do not exist in the internal encoding since this opens several
> > security issues.
> 
> I think it would be easier to allow such patterns (among other things
> because strings may contain other stuff besides Unicode data), and to
> check for consistency when needed (that is, inside the functions of the
> "utf8" library).

Yes, that is equally good. The essential feature is that invalid bit 
patterns are never interpreted as valid UTF-8/UTF-16/etc. data. This is 
normally done by ensuring that any Unicode string created is guaranteed to be 
valid, but that would not permit binary data to be stored in this datatype. 
Checking consistency on read may bring a small performance penalty, but this 
will probably not be significant.
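
A minimal sketch of the "check when needed" idea, written in Python rather 
than Lua. The names is_valid_utf8 and utf8_length are hypothetical and not 
part of any proposed library. Byte strings may hold anything, and only the 
functions that treat them as UTF-8 perform the check. It relies on Python's 
strict UTF-8 decoder, which rejects overlong forms and encoded surrogates.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def utf8_length(data: bytes) -> int:
    """Count characters; raises UnicodeDecodeError on malformed input."""
    return len(data.decode("utf-8"))


slash = b"/"                # ordinary one-byte encoding of "/"
overlong = b"\xc0\xaf"      # overlong two-byte form of "/", must be rejected
binary = b"\x00\xff\xfe"    # raw binary data: fine to store, not valid UTF-8

print(is_valid_utf8(slash), is_valid_utf8(overlong), is_valid_utf8(binary))
```

The overlong form is the kind of pattern behind the security issues mentioned 
above: a filter that looks for the byte "/" would miss it, so a decoder that 
accepted it would let the character slip through.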

However, it would be desirable to check the consistency of Unicode data as it 
is read from files, since errors would then be caught immediately rather than 
later during processing. A Unicode I/O library would be necessary anyway, 
since data may have to be read or written in a format other than UTF-8. 
The consistency of UTF-8 strings must also be checked before they are written 
out to Unicode files.
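
As a sketch of what such an I/O layer might do, again in Python and with 
made-up names (read_utf8 and write_as are assumptions for illustration only): 
input is validated as soon as it is read, and a byte string is validated again 
before it is transcoded and written in another encoding.

```python
import io


def read_utf8(stream) -> bytes:
    """Read all bytes and fail immediately if they are not valid UTF-8."""
    data = stream.read()
    data.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    return data


def write_as(stream, data: bytes, encoding: str = "utf-16-le") -> None:
    """Validate data as UTF-8, then write it out in the target encoding."""
    text = data.decode("utf-8")  # consistency check before output
    stream.write(text.encode(encoding))


# In-memory streams stand in for real files here.
source = io.BytesIO("h\u00e9".encode("utf-8"))
target = io.BytesIO()
write_as(target, read_utf8(source))
print(target.getvalue())
```

A malformed input stream makes read_utf8 raise straight away, which is the 
"caught immediately" behaviour described above.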

Steven Murdoch.