- Subject: [Q] handling 0xC2A0 (space in utf8)
- From: "V S P" <toreason@...>
- Date: Thu, 16 Oct 2008 15:20:18 -0400
I am new to the list.
I have searched the archives but did not find
(or perhaps did not quite understand) how to
handle UTF-8 characters in Lua and LPeg.
Here is the situation,
I wrote a very simple domain-specific language compiler
using LPeg and Lua. The source text comes from a web page
(I use the Dojo Toolkit's dijit.Editor to provide text editing) and the page is
encoded in UTF-8.
(I am new to Lua, compiler writing and
web development, so I may be missing something fundamental.)
The text gets sent via JSON to my PHP script. In PHP I remove HTML tags
and replace <br /> with the end-of-line character (PHP_EOL):
/* Convert to UTF-8 before doing anything else, input is also utf-8
so this is just in case */
$utf8_text = iconv( 'utf-8', "utf-8", $raw_text );
/* Strip HTML tags */
$utf8_text = strip_tags( $utf8_text );
/* Decode HTML entities */
$utf8_text = html_entity_decode( $utf8_text, ENT_COMPAT, "UTF-8" );
Then I simply invoke
Lua using proc_open and pass the source text from the web page
via STDIN to Lua (note, it is not possible to open STDIN as 'binary').
In Lua, I have specified the following LPeg pattern for space:
local space=lpeg.S('\r\n\f\t ')^1
I have spent about 2 days figuring out why certain things do not
get recognized by my compiler, and it turns out that it only happens if
I put a space preceding my identifiers in the web page.
So I traced it from the web page, to JSON, to PHP, to Lua,
and I see that I am getting 0xC2A0 from the web page when I enter a space.
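One way to confirm which bytes actually arrive on the Lua side is to print them in hex. The following is only a minimal sketch; `hexdump` is a hypothetical helper name, not part of Lua or LPeg. It uses decimal escapes (`\194\160`), since Lua 5.1 string literals do not support `\x` hex escapes.

```lua
-- Hypothetical helper: render each byte of a string as two hex digits.
local function hexdump(s)
  local out = {}
  for i = 1, #s do
    out[#out + 1] = string.format('%02X', s:byte(i))
  end
  return table.concat(out, ' ')
end

-- '\194\160' is the two-byte UTF-8 encoding of U+00A0 (no-break space).
print(hexdump('a\194\160b'))  -- 61 C2 A0 62
```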
I am thinking now that this messes up LPeg when trying to match
for the space. I would like to tell LPeg to also understand
0xC2A0 as a space.
But I do not know how to do that, though this would be the ideal solution.
Alternatively, maybe there is a way to replace 0xC2A0 with an
ASCII space (for my purposes, space is a 'dead' symbol
just used to separate keywords and operators).
But again, I do not know how to do that either -- as I am somewhat
confused between hex notation, two bytes and strings.
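As a rough sketch of the replacement idea: 0xC2A0 is not one number but two bytes, 0xC2 (decimal 194) followed by 0xA0 (decimal 160), and a Lua string is just a sequence of bytes. Assuming the input really is UTF-8, the pair could be swapped for an ASCII space before parsing. `normalize_spaces` is a hypothetical name used only for illustration.

```lua
-- 0xC2 = 194 and 0xA0 = 160, so both spellings give the same two-byte string.
local NBSP = string.char(0xC2, 0xA0)
assert(NBSP == '\194\160')
assert(#NBSP == 2)

-- Hypothetical helper: replace every UTF-8 no-break space with an ASCII space.
-- Neither byte is a magic character in Lua patterns, so no escaping is needed.
-- The extra parentheses drop gsub's second return value (the match count).
local function normalize_spaces(text)
  return (text:gsub(NBSP, ' '))
end

-- Possible usage on the whole input read from STDIN:
--   local source = normalize_spaces(io.read('*a'))
```

This changes the input text itself, so it would also alter no-break spaces inside string literals of the DSL, if the language has any.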
I am using Lua 5.1.3 and LPeg 0.8x
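For the LPeg side, a possible sketch (an assumption about what would fit this grammar, not a tested fix for the compiler) is to add the two-byte sequence as an alternative to the existing character set. Note that lpeg.S takes a set of single bytes, so putting '\194\160' inside lpeg.S would match either byte on its own; lpeg.P matches the two bytes as a sequence.

```lua
local lpeg = require('lpeg')

-- The UTF-8 no-break space as a literal two-byte sequence.
local nbsp  = lpeg.P('\194\160')
-- One or more of: an ASCII whitespace byte, or the no-break space sequence.
local space = (lpeg.S('\r\n\f\t ') + nbsp)^1

-- lpeg.match returns the position just past the match, or nil on failure.
print(lpeg.match(space, ' \194\160\tfoo'))  -- 5
print(lpeg.match(space, 'foo'))             -- nil
```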
On a side note -- I really like Lua and LPEG is very easy for my
seems like one language with closures, iterators, dynamic
with just different libraries (although Lua has the more elegant syntax
of the 3 :-), but PHP has lots more functions and soon closures )
Thank you in advance
V S P