On 2/7/2012 3:11 PM, Egil Hjelmeland wrote:
> What about sorting/collating? That would be useful. But is that a big-table-thing in Unicode?

What language are you sorting for?

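In traditional Spanish sort order, "LL" comes after "LZ" and "CH" comes after "CZ", because "LL" and "CH" were considered "letters" in their own right and so sort differently. (Spanish dictionaries have sorted them as ordinary letter pairs since 1994.) Other languages have their own digraph rules; Czech, for example, sorts "CH" after "H".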

There are many other rules that apply to specific languages and contexts. A dictionary sort in English is different from an alphanumeric sort, for instance: the first compares character by character and gives a1, a10, a9, while the second compares runs of digits as numbers and gives a1, a9, a10.
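
The a1/a9/a10 difference can be reproduced in a few lines of Python. This is only a sketch: natural_key is a name invented here, it looks at ASCII digits only, and it ignores every locale-specific rule.

```python
import re

# Capturing group: split() keeps the digit runs, at the odd indices.
DIGITS = re.compile(r"([0-9]+)")

def natural_key(s):
    """Build a key in which runs of digits compare as numbers."""
    key = []
    for i, part in enumerate(DIGITS.split(s)):
        if part == "":
            continue
        if i % 2 == 1:
            key.append((0, int(part), ""))      # digit run, compared numerically
        else:
            key.append((1, 0, part.lower()))    # text run, compared as text
    return key

names = ["a10", "a9", "a1"]
print(sorted(names))                   # character by character
print(sorted(names, key=natural_key))  # digit runs as numbers
```

The leading 0 or 1 in each tuple keeps numbers and text from ever being compared with each other, and puts digit runs ahead of text at the same position.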

But the short answer is: Yes, it's one of the things you need a huge table and supporting source code to completely handle in Unicode. Or an even bigger table if you REALLY need to do it right in multiple locales.

If you're curious, there's a table generator online where you can build those Big Tables from just the pieces you need (mappings from other charsets, a break iterator, collators, rule-based number format handling, and more). [1] That's a good thing, because ALL of it together comes to about 18 MB. Keep in mind that's JUST the data tables, not the code needed to parse them; in one case that built to a 900 KB DLL on Windows (with 300 KB of embedded tables) just for generic collation handling (trying to make collation sane, though not actually correct, for all languages).

More examples of strange collation exceptions and general detail about collation can be found on unicode.org [2].

Tim

[1] http://apps.icu-project.org/datacustom/
[2] http://www.unicode.org/reports/tr10/