Hi:

On Sat, Feb 22, 2014 at 1:07 PM, Enrico Colombini <erix@erix.it> wrote:
> I may be mistaken, not being an Unicode expert (to put it mildly) but I am
> under the impression that using a 'traditional' line input function for
> UTF-8 (with or without '\0') could open another, larger, can of worms.

The same could be said of any character set with support for
extended things, like those formats which used to set the high bit for
soft spaces / soft newlines. But UTF-8 is designed in such a way that
I can fopen my config file, written with my nifty X Window UTF-8-aware
editor, read 'Aquí' from one line and 'allá' from another, use
sprintf("%s/%s", ...) to join them, and then fopen the 'Aquí/allá' file.
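A minimal sketch of that workflow in C, in case it helps. The helper names read_line and join_path are made up for this example, and I use snprintf instead of sprintf only to bound the output. Nothing in it knows about UTF-8: the multibyte sequences for 'í' and 'á' pass through as opaque bytes.

```c
#include <stdio.h>
#include <string.h>

/* Read one line with fgets and strip the trailing "\n" (or "\r\n").
   Returns 1 on success, 0 on EOF or error.
   Hypothetical helper, not part of any library. */
static int read_line(FILE *fp, char *buf, size_t size)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';
    return 1;
}

/* Join two path components with '/'.
   Returns 1 if the result fit in out, 0 if it was truncated or failed. */
static int join_path(char *out, size_t size, const char *dir, const char *name)
{
    int n = snprintf(out, size, "%s/%s", dir, name);
    return n >= 0 && (size_t)n < size;
}
```

Reading two lines with read_line, passing them to join_path and handing the result to fopen is all it takes, provided the file system stores names as the same bytes the editor wrote.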

> The set of line terminators and white space characters seems to be
> different; for example, U+2028 is a line separator and cannot be recognized
> by a simple test on the value returned by getc(). An UTF-8 oriented line
> iterator would probably be needed.

The standard functions normally deal with only a single line
terminator, in the ASCII range, which UTF-8 includes verbatim. This
kind of stuff is only relevant for text processing systems; for those
you read in binary, or getc & compose, as their texts are not normally
made of simple lines and words.
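For the "getc & compose" route, something along these lines would do as a starting point. It is only a sketch: utf8_getc, is_line_break and count_lines are names invented here, the decoder does not reject overlong forms or surrogates, and treating exactly '\n', U+2028 and U+2029 as line breaks is my assumption for the example, not a complete list.

```c
#include <stdio.h>

/* Compose one code point from the bytes returned by getc.
   Returns -1 at EOF and U+FFFD on malformed input.
   Sketch only: overlong forms and surrogates are not rejected. */
static long utf8_getc(FILE *fp)
{
    int c = getc(fp);
    int extra;
    long cp;

    if (c == EOF)
        return -1;
    if (c < 0x80)
        return c;                      /* ASCII is included verbatim */
    if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else
        return 0xFFFD;                 /* stray continuation or invalid lead byte */

    while (extra-- > 0) {
        c = getc(fp);
        if (c == EOF)
            return 0xFFFD;             /* truncated sequence */
        if ((c & 0xC0) != 0x80) {
            ungetc(c, fp);             /* not a continuation byte: give it back */
            return 0xFFFD;
        }
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

/* Line breaks assumed for this example: LF, U+2028, U+2029. */
static int is_line_break(long cp)
{
    return cp == '\n' || cp == 0x2028 || cp == 0x2029;
}

/* Count lines, including a final line with no terminator. */
static long count_lines(FILE *fp)
{
    long lines = 0, cp;
    int pending = 0;                   /* code points seen since the last break */

    while ((cp = utf8_getc(fp)) != -1) {
        if (is_line_break(cp)) { lines++; pending = 0; }
        else pending = 1;
    }
    return lines + pending;
}
```

The file should be opened in binary mode ("rb") so the C library does not translate line endings before the decoder sees them.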

Francisco Olarte.