In short: yes and no. Lua gives you bare-bones support, enough rope, and not much else. Unicode is a large and complex standard, and a question like "does Lua support Unicode?" is too vague to answer on its own.
Some of the issues are:
Lua strings are fully 8-bit clean, so simple uses are supported (like storing and retrieving), but there's no built in support for more sophisticated uses. For a fuller story, see below.
A Lua string is an arbitrary sequence of values which have at least 8 bits (octets); they map directly into the char type of the C compiler. (This may be wider than eight bits, but eight bits are guaranteed.) Lua does not reserve any value, including the NUL character (\0). That means that you can store a UTF-8 string in Lua without problems.
UTF-8 is just one option for storing Unicode strings. There are many other encoding schemes, such as UTF-16, UTF-32, and their various big-endian/little-endian variants. However, all of these are simply sequences of octets and can be stored in a Lua string without problems.
Input and output of strings in Lua (using the io library) uses the C stdio library. ANSI C does not require the stdio library to handle arbitrary octet sequences unless the file is opened in binary mode; furthermore, in non-binary (text) mode, some octet sequences are converted into others (in order to deal with varying end-of-line markers on different platforms).
This may affect your ability to do non-binary file input and output of Unicode strings in formats other than UTF-8. UTF-8 strings will probably be safe, because UTF-8 does not use control characters such as \r as part of multi-octet encodings. However, there are no guarantees; if you need to be certain, you must use binary mode input and output. (If you do so, line endings will not be converted.)
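A minimal sketch of the difference (the file name sample.bin is arbitrary, chosen for illustration, and the snippet assumes a writable current directory):

```lua
-- Open in binary mode ("rb"/"wb") so the C stdio layer performs no
-- end-of-line translation; the octets are read back exactly as written.
local utf16_bytes = "a\0b\0\r\0\n\0"       -- "ab\r\n" in UTF-16LE
local f = assert(io.open("sample.bin", "wb"))
f:write(utf16_bytes)
f:close()

f = assert(io.open("sample.bin", "rb"))
local read_back = f:read("*a")
f:close()
os.remove("sample.bin")
assert(read_back == utf16_bytes)           -- true only in binary mode
```

In text mode, the embedded \r and \n octets could be rewritten on some platforms, corrupting the UTF-16 data.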
Unix file I/O has been 8-bit clean for a long time. If you are not concerned with portability and use only Unix and Unix-like operating systems, you almost certainly need not worry about the above.
If your use of Unicode is restricted to passing the strings to external libraries which support Unicode, you should be OK. For example, you should be able to extract a Unicode string from a database and pass it to a Unicode-aware graphics library. But see the sections below on pattern matching and string equality.
Literal Unicode strings can appear in your Lua programs. Either a UTF-8 encoded string can appear directly with 8-bit characters, or you can use the \ddd escape syntax (note that ddd is a decimal number, unlike in some other languages). However, there is no facility for encoding a multi-octet sequence from a code point (such as U+20B4); you would need to either manually encode it to UTF-8, or insert the individual octets in the correct big-endian/little-endian order (for UTF-16 or UTF-32 encodings).
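For example, the hryvnia sign U+20B4 has the UTF-8 encoding 0xE2 0x82 0xB4, which can be written by hand with decimal escapes:

```lua
-- U+20B4 (HRYVNIA SIGN) encoded manually: UTF-8 octets 0xE2 0x82 0xB4,
-- written with Lua's decimal \ddd escapes.
local hryvnia = "\226\130\180"
assert(string.len(hryvnia) == 3)  -- three octets, one Unicode character
-- Same bytes as the literal character, assuming this source file is
-- itself saved as UTF-8:
assert(hryvnia == "₴")
```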
Unless you are using an operating system in which a char is more than eight bits wide, you will not be able to use arbitrary Unicode characters in Lua identifiers (the names of variables and so on). You may be able to use eight-bit characters outside of the ASCII range. Lua uses the C function isalnum to identify valid characters in identifiers, so what is accepted will depend on the current locale. To be honest, using characters outside of the ASCII range in Lua identifiers is not a good idea, since your programs will not compile in the standard "C" locale.
Lua string comparison (using the == operator) is done byte by byte. That means that == can only be used to compare Unicode strings for equality if the strings have been normalized into one of the four Unicode normalization forms. (See the [Unicode FAQ on normalization] for details.) The standard Lua library does not provide any facility for normalizing Unicode strings. Consequently, non-normalized Unicode strings cannot be reliably used as table keys.
If you want to use the Unicode notion of string equality, or use Unicode strings as table keys, and you cannot guarantee that your strings are normalized, then you'll have to write or find a normalization function and use that; this is a non-trivial exercise!
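To see why normalization matters, consider "é", which can be encoded either precomposed (NFC) or decomposed (NFD). The two forms are canonically equivalent, but they compare unequal byte by byte:

```lua
-- NFC: the single code point U+00E9 (UTF-8 octets 0xC3 0xA9).
-- NFD: "e" followed by U+0301 COMBINING ACUTE ACCENT (0xCC 0x81).
local nfc = "\195\169"
local nfd = "e\204\129"
assert(nfc ~= nfd)       -- canonically equivalent, but == compares bytes

local t = {}
t[nfc] = true
assert(t[nfd] == nil)    -- and they are distinct table keys, too
```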
The Lua comparison operators on strings (< and <=) use the C function strcoll, which is locale-dependent. This means that two strings can compare in different ways according to the current locale. For example, strings will compare differently under Spanish Traditional sorting than under Welsh sorting.
It may be that your operating system has a locale that implements the sorting algorithm that you want, in which case you can just use that, otherwise you will have to write a function to sort Unicode strings. This is an even more non-trivial exercise.
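A minimal sketch using os.setlocale (the locale name "es_ES.UTF-8" is an assumption; available names vary by platform, and setlocale returns nil if the locale is not installed):

```lua
-- In the portable "C" locale, < on strings is plain octet order.
os.setlocale("C", "collate")
assert("a" < "b")

-- Switching the collate category changes how < and <= behave.
local ok = os.setlocale("es_ES.UTF-8", "collate")
if ok then
  -- comparisons now follow Spanish collation rules
  print("llama" < "luz")
end
os.setlocale("C", "collate")  -- restore the default
```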
UTF-8 was designed so that a naive octet-by-octet comparison of two UTF-8 strings produces the same result as a comparison of the corresponding sequences of Unicode code points. This is also true of UTF-32BE, but I do not know of any system which uses that encoding. Unfortunately, naive octet-by-octet comparison is not the collation order used by any language.
(Note: sometimes people use the terms UCS-2 and UCS-4 for "two-byte" and "four-byte" encodings. These are not Unicode standards; they come from the ISO/IEC 10646-1:2000 standard, and they currently differ in that they allow codes outside of the Unicode range, which runs from U+0000 to U+10FFFF.)
Lua's pattern matching facilities work byte by byte. In general, this will not work for Unicode pattern matching, although some things will work as you want. For example, the pattern "%u" will not match all Unicode upper case letters. You can match individual Unicode characters in a normalized Unicode string, but you might want to worry about combining character sequences. If there are no following combining characters, the pattern "a" will match only the letter a in a UTF-8 string. In UTF-16LE you could use the pattern "a%z". (Remember that you cannot use \0 directly in a Lua pattern; %z matches it instead.)
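A small illustration, assuming the default "C" locale at startup:

```lua
-- "página" in UTF-8: the "á" is the two octets 0xC3 0xA1, neither of
-- which equals the octet 0x61 ("a"), so the pattern "a" matches only
-- the plain "a" at position 7.
local utf8_str = "p\195\161gina"
assert(string.find(utf8_str, "a") == 7)

-- "%u" tests each octet with C's isupper, so in the "C" locale it
-- does not match "Á" (UTF-8 octets 0xC3 0x81):
assert(string.find("\195\129", "%u") == nil)
```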
If you want to know the length of a Unicode string there are different answers you might want according to the circumstances.
If you just want to know how many bytes the string occupies, so
that you can make space for copying it into a buffer for example,
then the existing Lua function
string.len will work.
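For example (string.len counts octets; the # operator in Lua 5.1 and later behaves the same way):

```lua
-- "é" as UTF-8 is the two octets 0xC3 0xA9: one character, length 2.
assert(string.len("\195\169") == 2)
assert(#"\195\169" == 2)
```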
You might want to know how many Unicode characters are in a string.
Depending on the encoding used, a single Unicode character may occupy up to four bytes. Only UTF-32BE and UTF-32LE are constant-length encodings (four bytes per character); UTF-32 is mostly a constant-length encoding, but the first element in a UTF-32 sequence should be a "Byte Order Mark", which does not count as a character. UTF-32 and its variants were only added to Unicode in recent versions of the standard. Some implementations of UTF-16 assume that all characters are two bytes long, but this has not been true since Unicode version 3.0.
UTF-8 is designed so that it is relatively easy to count the number of Unicode characters in a string: simply count the number of octets that are in the ranges 0x00 to 0x7F (inclusive) or 0xC2 to 0xF4 (inclusive). (In decimal, 0-127 and 194-244.) These are the octets which can start a UTF-8 character code. Octets 0xC0, 0xC1, and 0xF5 to 0xFF (192, 193 and 245-255) cannot appear in a conforming UTF-8 sequence; octets in the range 0x80 to 0xBF (128-191) can only appear in the second and subsequent octets of a multi-octet encoding. Remember that you cannot use \0 directly in a Lua pattern.
For example, you could use the following code snippet to count UTF-8 characters in a string you knew to be conforming (it will incorrectly count some invalid characters):
local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
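For instance, applying the snippet above to a conforming string: "héllo" is six octets but five characters.

```lua
-- "héllo" in UTF-8: the "é" is two octets (0xC3 0xA9), and only the
-- first of them falls outside 128-193, so gsub counts it once.
local unicode_string = "h\195\169llo"
local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
assert(count == 5)                          -- five characters
assert(string.len(unicode_string) == 6)     -- six octets
```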
If you want to know how many printing columns a Unicode string will
occupy when you print it out using a fixed-width font (imagine you are
writing something like the Unix
ls program that formats its
output into several columns), then that is a different answer again.
That's because some Unicode characters do not have a printing width,
while others are double-width characters. Combining characters are
used to add accents to other letters, and generally they do not
take up any extra space when printed.
So that's at least three different notions of length that you might want at different times. Lua provides one of them (string.len); the others you'll need to write functions for.
There's a similar issue with indexing the characters of a string by position. For example, string.sub(s, -3) will return the last 3 bytes of the string, which is not necessarily the same as the last three characters of the string, and may or may not be a complete UTF-8 character sequence.
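For example:

```lua
-- string.sub indexes octets, so it can split a multi-octet sequence.
local s = "caf\195\169"                    -- "café" in UTF-8: 5 octets
assert(string.sub(s, -3) == "f\195\169")   -- last 3 octets = "fé": only
                                           -- 2 characters
assert(string.sub(s, 1, 4) == "caf\195")   -- ends mid-character: the
                                           -- trailing 0xC3 is not valid
                                           -- UTF-8 on its own
```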
You could use the following code snippet to iterate over UTF-8 sequences (this will simply skip over most invalid codes):
for uchar in string.gfind(ustring, "([%z\1-\127\194-\244][\128-\191]*)") do
  -- do something with uchar
end
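For example, collecting the characters of a UTF-8 string one at a time (string.gfind is the Lua 5.0 name; in Lua 5.1 and later the same iterator is string.gmatch):

```lua
-- Each match is one lead octet plus any continuation octets, so each
-- multi-octet character comes out whole.
local chars = {}
for uchar in string.gmatch("h\195\169llo", "[%z\1-\127\194-\244][\128-\191]*") do
  chars[#chars + 1] = uchar
end
assert(#chars == 5)              -- five characters in six octets
assert(chars[2] == "\195\169")   -- the multi-octet "é"
```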
As you might have guessed by now, Lua provides no support for things like bidirectional printing or the proper formatting of Thai accents. Normally such things will be taken care of by a graphics or typography library. It would of course be possible to interface to such a library that did these things if you had access to one.
See UnicodeIdentifers for platform independent Unicode Lua programs.