
Edgar Toernig wrote:
> 
> Björn De Meyer wrote:
> > ...
> > supply your own replacements for isalpha() and isalnum().
> > Fortunately, with UTF-8, you can see from a single byte
> > whether a character is part of an "alphabetical" sequence.
> 
> I'm not a UTF-8 expert but I doubt that.  How's that gonna work?
> Afaik, the "alphabetical" characters are spread out around the
> whole charset...
> 
> Ciao, ET.

Well, first of all, let me clarify that apart from
[a-z][A-Z] I would consider any valid character outside
the 7-bit ASCII range as "alphabetical", or more precisely,
as acceptable in an identifier name. In UTF-8 encoding,
you can see from the current byte whether it belongs to
the 7-bit range or to a multi-byte sequence that encodes a
non-ASCII Unicode character: every byte of such a sequence
has its high bit set (0x80 or above).
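
To make the "single byte" point concrete, here is a minimal sketch of how
one byte can be classified on its own. The enum and the name
utf8_byte_kind are made up for illustration and do not come from Lua or
any library; the ranges follow the UTF-8 bit patterns (0xxxxxxx for
7-bit characters, 10xxxxxx for continuation bytes, 11xxxxxx for lead
bytes).

```c
/* Hypothetical helper, for illustration only. */
enum utf8_kind { UTF8_ASCII, UTF8_CONT, UTF8_LEAD, UTF8_INVALID };

/* b must be a byte value in the range 0..255. */
enum utf8_kind utf8_byte_kind(int b)
{
  if (b < 0x80) return UTF8_ASCII;    /* 0xxxxxxx: 7-bit character */
  if (b < 0xc0) return UTF8_CONT;     /* 10xxxxxx: continuation byte */
  if (b <= 0xfd) return UTF8_LEAD;    /* 11xxxxxx: starts a multi-byte sequence */
  return UTF8_INVALID;                /* 0xfe and 0xff never occur in UTF-8 */
}
```

For an identifier test only the first case needs a closer look; the
second and third can be treated alike.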

Basically, utf8_isalpha would need to become:

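#include <ctype.h>

/* ch must be an unsigned char value (0..255) or EOF, as for isalpha(). */
int utf8_isalpha(int ch)
{
  return
  (
    isalpha(ch)                        /* ASCII letters */
    || ((ch >= 0x80) && (ch <= 0xfd))  /* any byte of a multi-byte sequence */
  );
}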
 
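The bytes 0xfe and 0xff are invalid in UTF-8,
so they are the only ones in the non-ASCII
8-bit range that this test rejects. (Strictly speaking,
0xc0 and 0xc1 cannot occur in valid UTF-8 either, since
they would only start overlong encodings, and RFC 3629
also rules out 0xf5 through 0xfd.)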
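
As a rough sketch of how a lexer might use this, the following scans an
identifier byte by byte. It repeats the utf8_isalpha test from above so
that it compiles on its own; utf8_isalnum is an assumed counterpart built
the same way, and ident_length is a hypothetical helper, not Lua's actual
lexer code.

```c
#include <ctype.h>
#include <stddef.h>

/* Same test as utf8_isalpha above: ASCII letter or any byte 0x80..0xfd. */
static int utf8_isalpha(int ch)
{
  return isalpha(ch) || (ch >= 0x80 && ch <= 0xfd);
}

/* Assumed counterpart for isalnum(): also accepts ASCII digits. */
static int utf8_isalnum(int ch)
{
  return isalnum(ch) || (ch >= 0x80 && ch <= 0xfd);
}

/* Hypothetical helper: length in bytes of the identifier at the
   start of s, or 0 if s does not start with one. */
size_t ident_length(const char *s)
{
  /* Read bytes as unsigned so values above 0x7f are not negative. */
  const unsigned char *p = (const unsigned char *)s;
  size_t n;
  if (!(utf8_isalpha(p[0]) || p[0] == '_')) return 0;
  for (n = 1; utf8_isalnum(p[n]) || p[n] == '_'; n++)
    ;
  return n;
}
```

Note that the result counts bytes, not characters, and that nothing here
checks whether the multi-byte sequences are well formed.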


-- 
"No one knows true heroes, for they speak not of their greatness." -- 
Daniel Remar.
Björn De Meyer 
bjorn.demeyer@pandora.be