On Tue, Sep 28, 2010 at 06:38:53AM -0300, Luiz Henrique de Figueiredo wrote:
> > The utf-8 bom, by definition of unicode, is actually a "space" character.
> > Shall we just treat utf-8 bom like a normal space character, instead of
> > strip it off? Is that easier to handle in the lexer?
> 
> In Lua 5.2 you don't even have to patch the lexer: just edit lctype.c
> and say that 0xFF and 0xFE are whitespace. This of course is not the
> perfect solution, because BOM is a 2-byte entity, not a 1-byte one...

Note that the UTF-8 representation of the BOM is in fact 0xEF,0xBB,0xBF, not
0xFF,0xFE, and that in ISO-8859-1 those three bytes are
lowercase-i-with-diaeresis, right-chevron and upside-down-question-mark.  While
uncommon, they're very much not whitespace.
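
To make the byte values concrete, here is a minimal sketch in C (the language
of the Lua core) of recognising the three-byte UTF-8 BOM at the start of a
buffer.  The name utf8_bom_length is hypothetical and is not part of Lua; this
only illustrates the comparison, it is not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* UTF-8 encoding of U+FEFF, the byte order mark. */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/*
 * Hypothetical helper (not a Lua API): returns how many leading bytes of
 * buf form a UTF-8 BOM, i.e. 3 if the buffer starts with EF BB BF and 0
 * otherwise.  A loader could skip that many bytes before lexing.
 */
static size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= sizeof UTF8_BOM && memcmp(buf, UTF8_BOM, sizeof UTF8_BOM) == 0)
        return sizeof UTF8_BOM;
    return 0;
}
```

A buffer starting with 0xFF,0xFE yields 0 here, since that pair is not the
UTF-8 form of the mark.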

So if Windows Notepad is adding 0xFF, 0xFE then not only is it adding a BOM to
a file encoding for which the Unicode standard does not recommend one, it's
actually adding the *wrong* marker.

No concessions should be made in Lua for this.  If someone wants to do a
Windows-specific fix for this, they're welcome, but as you say, they should
just patch the Lua core themselves.

All in all, Microsoft should not be encouraged to let this abomination stand.

D.

-- 
Daniel Silverstone                         http://www.digital-scurf.org/
PGP mail accepted and encouraged.            Key Id: 3CCE BABE 206C 3B69