I am new to the list.
I have searched the archives but did not find
(or perhaps did not quite understand) how to
handle UTF-8 characters in Lua and LPeg.

Here is the situation:
I wrote a very simple domain-specific language compiler
using LPeg and Lua.  The source text comes from a web page
(I use the Dojo Toolkit's dijit.Editor to provide text editing), and the
page is encoded in UTF-8.

(I am also new to Lua, compiler writing and
web development, so I may be missing something fundamental.)

The text gets sent via JSON to my PHP script. In PHP I remove the HTML tags
and replace <br /> with the end-of-line character (PHP_EOL):

/* Convert to UTF-8 before doing anything else, input is also utf-8
    so this is just in case */
$utf8_text = iconv( 'utf-8', "utf-8", $raw_text ); 
/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );   
/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_COMPAT, "UTF-8" );

Then I simply invoke
Lua using proc_open and pass the source text from the web page
via STDIN to Lua (note that it is not possible to open STDIN as 'binary').

In Lua, I have specified the following LPeg pattern for space:

local space=lpeg.S('\r\n\f\t ')^1

I have spent about 2 days trying to figure out why certain things do not
get recognized by my compiler, and it turns out that it only happens if
I put a space preceding my identifiers in the web page.

So I traced it from the web page, to JSON, to PHP, to Lua,
and I see that I am getting the two bytes 0xC2 0xA0 (the UTF-8 encoding
of U+00A0, the no-break space) from the web page when I enter
the 'space'.
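
The kind of check described above can be sketched in a few lines of Lua 5.1. This is only a minimal sketch: hex_dump is a hypothetical helper name, not part of Lua or LPeg, and it simply prints each byte of a string in hexadecimal so that a no-break space shows up as C2 A0 instead of the 20 of an ASCII space.

```lua
-- Hypothetical debugging helper: show each byte of a string in hex.
local function hex_dump(s)
  local out = {}
  for i = 1, #s do
    -- s:byte(i) returns the numeric value of the i-th byte
    out[#out + 1] = string.format("%02X", s:byte(i))
  end
  return table.concat(out, " ")
end

-- "\194\160" is the decimal spelling of the bytes 0xC2 0xA0
-- (Lua 5.1 has no \x escapes in string literals).
print(hex_dump(" x"))          -- ASCII space followed by x
print(hex_dump("\194\160x"))   -- no-break space followed by x
```

Calling hex_dump on the text read from STDIN would show whether the two bytes survive the trip from PHP unchanged.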

I now think that this trips up LPeg when it tries to match
a space.  I would like to tell LPeg to also treat
0xC2A0 as a space.

But I do not know how to do that; this would be the ideal solution, though,
I think.
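
One possible shape for this, as a sketch rather than a confirmed answer: lpeg.S matches single bytes only, so the two-byte sequence cannot go inside the set, but it can be added as an alternative with lpeg.P. The sketch assumes the bytes 0xC2 0xA0 reach Lua unchanged; the names nbsp, ident and leading are hypothetical and exist only to exercise the pattern.

```lua
local lpeg = require("lpeg")

-- U+00A0 (no-break space) in UTF-8 is the two bytes 0xC2 0xA0.
-- Lua 5.1 has no \x escapes, so the bytes are written in decimal.
local nbsp = lpeg.P("\194\160")

-- Either one of the ASCII whitespace bytes or the two-byte sequence,
-- repeated one or more times.
local space = (lpeg.S("\r\n\f\t ") + nbsp)^1

-- Hypothetical identifier rule, only here to try the space pattern out.
local ident = lpeg.C(lpeg.R("az", "AZ")^1)
local leading = space^-1 * ident

-- With a capture in the pattern, lpeg.match returns the captured text.
print(lpeg.match(leading, "\194\160foo"))
```

The same idea should extend to other multi-byte whitespace characters by adding further lpeg.P alternatives.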

Alternatively, maybe there is a way to replace 0xC2A0 with
an ASCII space (for my purposes, a space is a 'dead' symbol
just used to separate keywords and operators).

But again, I do not know how to do that either -- as I am somewhat
confused between hex notation, two bytes and strings.
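
For the replacement route, a minimal sketch using only the standard string library: a Lua string is a plain sequence of bytes, so the hex value 0xC2A0 is simply the two-byte string "\194\160" (0xC2 is 194 and 0xA0 is 160 in decimal). The function name normalize_spaces is hypothetical.

```lua
-- Replace every UTF-8 no-break space (bytes 0xC2 0xA0) with an ASCII space.
local function normalize_spaces(text)
  -- Neither byte is a magic character in Lua patterns, so the sequence
  -- can be used as a pattern directly. The outer parentheses drop gsub's
  -- second return value (the number of replacements).
  return (text:gsub("\194\160", " "))
end

print(normalize_spaces("let\194\160x\194\160=\194\1601"))
```

The source could then be cleaned right after reading it, for example with normalize_spaces(io.read("*a")), before it is handed to the LPeg grammar.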

I am using Lua 5.1.3 and LPeg 0.8x

On a side note -- I really like Lua, and LPeg is very easy for my
simple brain to understand.  Between Lua, PHP and JavaScript -- it all
seems like one language with closures, iterators and dynamic typing,
just with different libraries (although Lua has the most elegant syntax
of the 3 :-), but PHP has lots more functions and soon closures)

Thank you in advance
  V S P
