On 27-Oct-05, at 2:30 PM, Walter Cruz wrote:

Hi all. Some days ago, someone sent a mail to the list asking for an htmlentities function.

Well, I have been thinking about how I can get a more complete list of HTML entities to use as the translation table.

There is a complete (and official) list at www.w3.org, for each version of HTML (they are very similar).

For HTML 4.01, you can get the entities from:

http://www.w3.org/TR/html401/HTMLlat1.ent
http://www.w3.org/TR/html401/HTMLsymbol.ent
http://www.w3.org/TR/html401/HTMLspecial.ent

(The equivalent HTML 4 files are at similar URLs, with html4 in place of html401.)

For XHTML:

http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-lat1.ent
http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-special.ent
http://www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml-symbol.ent

In each case, lat1 contains entities corresponding to all Unicode code points from U+00A0 to U+00FF (the upper half of the Latin-1 character set); special contains some special characters, including < > & " (and ' in the case of XHTML) as well as the euro (U+20AC); and symbol contains a variety of symbols used in mathematics and technical writing. Some of the entities in special and symbol are needed to cope with the 32 code points in CP 1252 (used by some versions of Microsoft Windows), 0x80 to 0x9F, which differ from the ISO standards. (This is the code page where the euro symbol has code point 0x80; you can find a complete conversion chart at http://www.microsoft.com/typography/unicode/1252.htm)
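
To make the CP 1252 point concrete, here is a minimal sketch that maps a handful of those bytes to their Unicode code points and HTML 4 entity names. The names cp1252 and cp1252_to_entity are made up for this example, and the table is only a sample, not the full chart.

```lua
-- A few CP 1252 bytes from the 0x80-0x9F range, each with its Unicode
-- code point and HTML 4 entity name. Sample only; see the chart above
-- for the complete list. (cp1252 is a hypothetical name.)
cp1252 = {
  [0x80] = { 0x20AC, "euro"   },
  [0x85] = { 0x2026, "hellip" },
  [0x91] = { 0x2018, "lsquo"  },
  [0x92] = { 0x2019, "rsquo"  },
  [0x93] = { 0x201C, "ldquo"  },
  [0x94] = { 0x201D, "rdquo"  },
  [0x96] = { 0x2013, "ndash"  },
  [0x97] = { 0x2014, "mdash"  },
  [0x99] = { 0x2122, "trade"  },
}

-- Return "&name;" for a CP 1252 byte in the sample table, or nothing
-- if the byte is not listed.
function cp1252_to_entity(byte)
  local m = cp1252[byte]
  if m then return "&" .. m[2] .. ";" end
end

print(cp1252_to_entity(0x80))  --> &euro;
```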

The following code snippet may help interpret the entity definitions; it should work with either the HTML 4 or the XHTML entity formats, but I haven't tested it.

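-- Maps between entity names and Unicode code points (stored as numbers).
ent2uni, uni2ent = {}, {}

-- Read one .ent file and record every numeric entity definition in it.
function readents(filename)
  for l in io.lines(filename) do
    -- HTML 4:  <!ENTITY nbsp   CDATA "&#160;" -- no-break space ...
    -- XHTML:   <!ENTITY nbsp   "&#160;">
    -- [CDAT]* lets the CDATA keyword be present or absent.
    local name, val = l:match '^<!ENTITY%s+(%w+)%s+[CDAT]*%s*"&#(%d+);"'
    if name then
      val = tonumber(val)
      ent2uni[name], uni2ent[val] = val, name
    end
  end
end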
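
For the htmlentities function that was originally asked about, something along these lines might do once the tables are filled in. It is only a sketch: it assumes the input string is Latin-1 (so each byte is its own code point), it assumes the table is keyed by numeric code point, and the small uni2ent table here is a hand-written sample standing in for what the .ent files would provide.

```lua
-- Hand-written sample of a code point -> entity name table; a real one
-- would be built from the .ent files.
uni2ent = { [34] = "quot", [38] = "amp", [60] = "lt", [62] = "gt",
            [160] = "nbsp", [169] = "copy", [233] = "eacute" }

-- Replace each markup character or high Latin-1 byte that has a named
-- entity with "&name;". Assumes Latin-1 input, one byte per character.
function htmlentities(s)
  return (s:gsub('[<>&"\160-\255]', function(c)
    local name = uni2ent[c:byte()]
    if name then return "&" .. name .. ";" end
    -- returning nothing leaves the character as it was
  end))
end

print(htmlentities('caf\233 <b>'))  --> caf&eacute; &lt;b&gt;
```

UTF-8 input would need the bytes decoded into code points first, which this sketch does not attempt.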