I think it's good that Lua keeps treating strings as byte strings. This makes it very transparent what's going on (no implicitly assumed encoding nonsense) and makes strings suitable for working with binary data.

For the most part, byte string operations break far less often on UTF-8 strings than you might assume. Okay, the string length is off, but at least you know the byte length, which is just as useful, if not more so. Finding a substring works, provided the string and the needle use the same encoding and the needle is not treated as a pattern. The same goes for literal string replacement.
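A minimal sketch of the above, assuming Lua 5.3 or later (for the \u{...} escapes and the utf8 library). The helpers escape_pattern and replace_literal are hypothetical names introduced only for this illustration; they are not part of the standard library.

```lua
-- Assumes Lua 5.3+; escape_pattern and replace_literal are hypothetical helpers.
local s = "na\u{EF}ve caf\u{E9}"   -- "naïve café" encoded as UTF-8

local byte_len = #s                -- length in bytes, constant time
local cp_len = utf8.len(s)         -- length in codepoints, linear time

-- Plain find: the fourth argument (true) turns pattern matching off.
-- The returned indices are byte positions, not codepoint positions.
local first, last = s:find("caf\u{E9}", 1, true)

-- gsub always treats its needle as a pattern, so a literal replacement has
-- to escape the magic characters in the needle and any "%" in the replacement.
local function escape_pattern(text)
  return (text:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

local function replace_literal(subject, needle, replacement)
  replacement = replacement:gsub("%%", "%%%%")
  return (subject:gsub(escape_pattern(needle), replacement))
end

local replaced = replace_literal(s, "\u{EF}", "i")

print(byte_len, cp_len)   -- 12  10
print(first, last)        -- 8   12
print(replaced)           -- naive café
```

Both the search and the replacement work without any knowledge of the encoding, because a valid UTF-8 needle can only match at codepoint boundaries of a valid UTF-8 subject.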

The one place where it does break is patterns, which are "character"-centric: "character classes" and quantifiers apply only to single characters, which here means bytes. This can yield quite surprising results, e.g. if you use character ranges of UTF-8 encoded codepoints: rather than getting a-b, you get <last byte of a>-<first byte of b>. Similarly, if you try to quantify a multibyte codepoint, you only quantify its last byte.
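Both failure modes can be reproduced in a few lines. This is a sketch assuming Lua 5.3 or later for the \u{...} escapes; the sample characters are arbitrary choices for the demonstration.

```lua
-- Assumes Lua 5.3+ for the \u{...} escapes.
local e_acute = "\u{E9}"             -- "é", bytes C3 A9

-- "é+" is really the byte pattern "\xC3\xA9+": only the last byte repeats.
local run = (e_acute .. e_acute):match(e_acute .. "+")
print(#run)                          -- 2 bytes: one "é", not both

-- "[à-é]" is really the byte set { C3, A0..C3, A9 }.
local class = "[\u{E0}-\u{E9}]"

-- "å" (U+00E5, bytes C3 A5) lies between à and é, but only its first byte matches.
local hit = ("\u{E5}"):match(class)
print(#hit)                          -- 1

-- "¿" (U+00BF, bytes C2 BF) lies outside à..é, yet its first byte C2 is in A0..C3.
local false_hit = ("\u{BF}"):find(class)
print(false_hit)                     -- 1
```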

There are two ways to fix this. One would be to make patterns character-aware, e.g. in this case UTF-8 aware. But another option would be extending patterns to include proper RegEx: Multibyte characters could then be quantified as (c)<quantifier>; the parentheses around UTF-8 encoded codepoints could be made implicit, such that c<quantifier> would work out of the box. [x-y] could be treated as a lexicographic bytestring range if x and y are multibyte.

Note also that length in codepoints does not take into account graphemes, grapheme clusters etc. and thus is not the "real" string length as perceived by the user either. Taking all of this into account would bloat Lua significantly, since it would have to include the Unicode database. This is best left to libraries.
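To illustrate the point about graphemes, here is a small sketch (assuming Lua 5.3 or later) comparing the precomposed and the decomposed form of "é". Both render as one user-perceived character, yet neither the byte length nor the codepoint length agrees between them.

```lua
-- Assumes Lua 5.3+ for the \u{...} escapes and the utf8 library.
local precomposed = "\u{E9}"     -- U+00E9 LATIN SMALL LETTER E WITH ACUTE
local decomposed = "e\u{301}"    -- "e" followed by U+0301 COMBINING ACUTE ACCENT

print(#precomposed, utf8.len(precomposed))   -- 2  1
print(#decomposed, utf8.len(decomposed))     -- 3  2
print(precomposed == decomposed)             -- false: no normalization happens
```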

Operations such as finding the length of strings and search and replace should IMO not treat bytestrings as UTF-8, at least not by default. Doing so has the potential to completely wreck the performance of naive programs which assume that finding the length of a string is a constant time operation (as is currently the case). Most if not all string operations would gain a linear time term in their runtime (say, a substring operation taking codepoint indices rather than byte indices). Simply converting codepoint indices to byte indices for every operation will be too slow for the majority of applications; using codepoint indices requires consideration from the programmer.
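As a sketch of where the linear term comes from, here is a substring function that takes codepoint indices. utf8_sub is a hypothetical helper, not a standard function; it assumes Lua 5.3 or later, valid UTF-8 input and positive indices.

```lua
-- Hypothetical helper: substring by codepoint indices i..j (inclusive).
-- Each utf8.offset call scans the string from the start, so this is linear
-- in the length of the string, whereas s:sub(i, j) only copies j - i + 1 bytes.
local function utf8_sub(s, i, j)
  local first = utf8.offset(s, i)       -- byte position where codepoint i starts
  local past = utf8.offset(s, j + 1)    -- byte position just past codepoint j
  if not first or not past then
    return nil                          -- an index is out of range
  end
  return s:sub(first, past - 1)
end

local s = "na\u{EF}ve caf\u{E9}"        -- "naïve café"
print(utf8_sub(s, 7, 10))               -- café
print(s:sub(8, 12))                     -- café (same slice, byte indices)
```

Calling something like this in a loop over a long string turns a linear algorithm into a quadratic one, which is why it helps that the conversion is an explicit step.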

On the other hand, UTF-32 (e.g. slices of runes in Golang) takes 4 times the memory of an ASCII bytestring. By Lua standards this may be acceptable, but it would also need to be explicit IMO.
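The factor of 4 is easy to check by building a UTF-32 encoded bytestring by hand. This is only a sketch, assuming Lua 5.3 or later for string.pack and the utf8 library; little-endian byte order is an arbitrary choice here.

```lua
-- Assumes Lua 5.3+ (string.pack, utf8 library).
local ascii = "hello"
local n = utf8.len(ascii)

-- Pack every codepoint as a little-endian unsigned 32-bit integer.
local utf32 = string.pack("<" .. string.rep("I4", n), utf8.codepoint(ascii, 1, -1))

print(#ascii, #utf32)   -- 5  20
```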

Finally, Lua already has a simple utf8 library (https://www.lua.org/manual/5.4/manual.html#6.5) for dealing with UTF-8 encoded bytestrings. It doesn't provide a pattern reimplementation, but it does give you much bang for the buck and forces you to make it explicit when you're using expensive UTF-8 operations on bytestrings.
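For reference, a short sketch of what the utf8 library offers (Lua 5.3 or later). utf8.codes iterates over byte positions and codepoints, and utf8.charpattern is a pattern that matches exactly one UTF-8 byte sequence, assuming the subject is valid UTF-8.

```lua
-- Assumes Lua 5.3+ for the utf8 library.
local s = "caf\u{E9}"                  -- "café"

-- Iterate over codepoints; pos is the byte position where each one starts.
for pos, cp in utf8.codes(s) do
  print(pos, cp, utf8.char(cp))
end

-- Split into characters with utf8.charpattern.
local chars = {}
for ch in s:gmatch(utf8.charpattern) do
  chars[#chars + 1] = ch
end
print(#chars, chars[4])                -- 4  é
```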

- Lars

On 30.07.23 00:07, Roger Leigh wrote:

I also ran into this problem a few weeks back.  One portable solution is to escape the ISO-8859-1 characters with \x so that the source file encoding can be UTF-8 but the string literals will remain ISO-8859-1.  This keeps the tests passing.
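A sketch of what that escaping looks like. The source file below is pure ASCII, so it survives any editor, while the first literal still contains a single ISO-8859-1 byte. The \u{...} escape and utf8.len assume Lua 5.3 or later; the sample word is an arbitrary choice.

```lua
local latin1 = "caf\xE9"       -- byte E9 is "é" in ISO-8859-1
local as_utf8 = "caf\u{E9}"    -- U+00E9 encoded as UTF-8: bytes C3 A9

print(#latin1, #as_utf8)       -- 4  5

-- The ISO-8859-1 string is not valid UTF-8: utf8.len fails and reports
-- the position of the first invalid byte.
print(utf8.len(latin1))        -- nil  4
```

A test that counts bytes keeps passing with the \x form, because the byte count of the literal no longer depends on how the file is saved.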

This doesn’t solve the problem that the string length, search, replace and manipulation functions don’t work with multibyte encodings like UTF-8, which I suspect is the default encoding for pretty much everyone nowadays on Unix platforms, with other platforms having adopted Unicode well before that.  Has moving the internal string representation to UTF-8 been considered?  Or tagging strings with the encoding so that they can be converted as needed into the appropriate encoding?

Kind regards,

Roger

From: Michael Lenaghan <michaell@dazzit.com>
Sent: Saturday, July 29, 2023 10:42 PM
To: lua-l@lists.lua.org
Subject: Five Lua test files are ISO-8859-1 encoded

Hello, all.

Five Lua test files are actually ISO-8859-1 encoded:

  • db.lua
  • files.lua
  • pm.lua
  • sort.lua
  • strings.lua

Two of the files have tests that count bytes, so you can’t just convert them to UTF-8. Well, not if you want your tests to succeed. :-)

Not fatal — the tests work as they are! — but unusual in an increasingly UTF-8 world.

The real problem is that it’s such an increasingly UTF-8 world that many editors don’t try to auto-detect the encoding. Save any changes in such an editor — hello, VS Code! — and you corrupt the files.