
On 2019-12-02 06:58, Antonio Scuri wrote:
   Yes, the problem is not the conversion itself.

   IUP already uses Unicode strings when setting and retrieving native
element strings. The problem is that we can NOT change the IUP API without
breaking compatibility with several thousand lines of existing application
code. Adding a new API in parallel is something we don't have time to do,
so we are still going to use char* in our API for some time. UTF-16 would
imply using wchar_t* in everything.

   What we can do now is provide some mechanism for the application to be
able to use the string returned by IupFileDlg both in another control and
in fopen. A different attribute will probably be necessary. Some
applications I know simply set UTF8MODE_FILE=NO so they can use the string
in fopen, and convert it to UTF-8 before displaying it in another control.

Best,
Scuri

G'day,

GNU/Linux user here, using Lua and IM/CD/IUP across multiple distros,
mainly Debian-based (Ubuntu, Linux Mint), and hopefully expanding the
supported-distro list over time.

Remember that there is a very wide gulf between:

        - Unicode (which defines abstract code points, their meaning, and
          ways to handle localisation such as left-to-right versus
          right-to-left rendering); and

        - Code point representation (UTF-8, UTF-7, UTF-16, UCS-4, etc.;
          UTF-8 is the clear leader here, and Lua itself (not the IUP*
          tools) now includes a "utf8" library).  The utf8 library makes
          no attempt to move from the (relatively straightforward) job of
          representing code points to the much, much more demanding job
          of interpreting and rendering the code point stream as a
          Unicode entity; the sketch after this list illustrates the gap.
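
To make the gap concrete, here is a minimal sketch in plain Lua 5.3+
(only the standard utf8 library is assumed): the library can count and
enumerate code points, but it has no notion of normalisation or grapheme
clusters, so "how many glyphs will this render as?" is a question it
cannot answer.

        -- "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one
        -- glyph, but is two code points and three bytes of UTF-8.
        local s = "e\u{0301}"

        print(#s)            --> 3   (bytes)
        print(utf8.len(s))   --> 2   (code points)

        -- The utf8 library stops at the code point level: it offers no
        -- normalisation to the precomposed U+00E9, and no grapheme
        -- segmentation, so it cannot tell us that these two code points
        -- form a single visual glyph.
        for _, cp in utf8.codes(s) do
                print(string.format("U+%04X", cp))   --> U+0065, U+0301
        end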


The adoption of UTF-8 on POSIX-compliant OSes (it passes cleanly through
the byte-oriented interfaces that POSIX defines), together with Lua's
decision to bundle the utf8 library itself in version 5.3, shows a strong
preference for UTF-8:

        "UTF-8 Support"

        https://www.lua.org/manual/5.3/manual.html#6.5
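
For reference, that library is a small set of code-point-level functions;
a short sketch of the main ones (again plain Lua 5.3+, nothing assumed
beyond the standard utf8 table):

        -- Encoding and decoding code points.
        local s = utf8.char(0x48, 0x65, 0x6A, 0x21, 0x263A)  -- "Hej!" plus U+263A

        print(utf8.len(s))                 --> 5 code points (7 bytes)

        -- utf8.codes iterates byte positions and code points.
        for pos, cp in utf8.codes(s) do
                print(pos, string.format("U+%04X", cp))
        end

        -- utf8.offset maps a code point index to a byte index, e.g. for
        -- use with string.sub.
        local i = utf8.offset(s, 5)        -- byte index of the 5th code point
        print(string.sub(s, i))            --> U+263A, as 3 bytes of UTF-8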

An early essay that lays out the UTF-8/Unicode landscape is this 2003
article from the blog "Joel on Software":

        "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

        https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

However, the present IM/CD/IUP discussion has moved beyond POSIX
boundaries: it has stepped up from code point representation to graphical
user interface interpretation and rendering.

For encodings with multi-byte code units (UTF-16, UCS-4), issues such as
big-endian versus little-endian byte order arise -- hence the Byte Order
Mark (the code point U+FEFF) at the start of character strings.  These
marks become cumbersome when manipulating strings (e.g. appending two
UTF-16 strings of different endianness).  UTF-8, being byte-oriented,
avoids this need entirely; a short sketch follows.
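
In Lua 5.3+ (string.pack is standard as of 5.3; the byte values below
are fixed by the encodings themselves):

        -- The BOM is the code point U+FEFF.  In UTF-16 its byte order
        -- depends on endianness; in UTF-8 there is only one possible
        -- byte sequence.
        local bom_le = string.pack("<I2", 0xFEFF)     -- little-endian UTF-16
        local bom_be = string.pack(">I2", 0xFEFF)     -- big-endian UTF-16

        local function hex(s)
                return (s:gsub(".", function(c)
                        return string.format("%02X ", c:byte())
                end))
        end

        print(hex(bom_le))            --> FF FE
        print(hex(bom_be))            --> FE FF
        print(hex(utf8.char(0xFEFF))) --> EF BB BF  (one order, always)

        -- Naively concatenating a big-endian UTF-16 string onto a
        -- little-endian one leaves a stray mark (read back as U+FFFE)
        -- in the middle; UTF-8 strings can simply be concatenated.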

Unfortunately, this is where my knowledge stops.  I know that UTF-8
should work on all POSIX-defined interfaces, and that the Lua utf8
library is now defined and bundled, very strictly staying away from the
explosion in complexity that happens when you move from code point
representation to a conforming Unicode implementation.

My best guess, at this point: given that quite a large number of programs
now work on both Windows and *NIX platforms, go and study how they handle
this problem rather than reinvent the wheel.  For example, Tcl uses UTF-8
encoding, and its graphical partner, Tk, takes UTF-8 strings for the
definition of Tk widgets and converts them as required.  From the "I18n"
page of the Tcl/Tk documentation (version 8.1):

        "Fonts, Encodings, and Tk Widgets"

        https://tcl.tk/doc/howto/i18n.html

        Tk widgets that display text now require text strings in Unicode/UTF-8
        encoding. Tk automatically handles any encoding conversion necessary
        to display the characters in a particular font.

        If the master font that you set for a widget doesn't contain a glyph
        for a particular Unicode character that you want to display, Tk
        attempts to locate a font that does. Where possible, Tk attempts to
        locate a font that matches as many characteristics of the widget's
        master font as possible (for example, weight, slant, etc.). Once Tk
        finds a suitable font, it displays the character in that font. In other
        words, the widget uses the master font for all characters it is capable
        of displaying, and alternative fonts only as needed.

        In some cases, Tk is unable to identify a suitable font, in which case
        the widget cannot display the characters. (Instead, the widget displays
        a system-dependent fallback character such as "?") The process of
        identifying suitable fonts is complex, and Tk's algorithms don't always
        find a font even if one is actually installed on the system. Therefore,
        for best results, you should try to select as a widget's master font
        one that is capable of handling the characters you expect to display.
        For example, "Times" is likely to be a poor choice if you know that you
        need to display Japanese or Arabic characters in a widget.

        If you work with text in a variety of character sets, you may need to
        search out fonts to represent them. Markus Kuhn has developed a free
        6x13 font that supports essentially all the Unicode characters that can
        be displayed in a 6x13 glyph. This does not include Japanese, Chinese,
        and other Asian languages, but it does cover many others. The font is
        available at

                  http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html

        His site also contains many useful links to other sources of fonts and
        font information.
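
Returning to the workaround Scuri describes above, here is a minimal
IupLua sketch of the idea.  The UTF8MODE and UTF8MODE_FILE globals are
real IUP attributes, but to_utf8() is a hypothetical placeholder for
whatever system-encoding-to-UTF-8 conversion (an iconv binding, say)
your application has at hand:

        require("iuplua")

        -- Keep the GUI in UTF-8, but have file-name attributes use the
        -- system encoding so they can be passed straight to io.open.
        iup.SetGlobal("UTF8MODE", "YES")
        iup.SetGlobal("UTF8MODE_FILE", "NO")

        -- HYPOTHETICAL helper: convert a system-encoded string to
        -- UTF-8.  On most modern GNU/Linux systems the locale is
        -- already UTF-8 and this is the identity function.
        local function to_utf8(s) return s end

        local dlg = iup.filedlg{dialogtype = "OPEN", title = "Choose a file"}
        dlg:popup(iup.CENTER, iup.CENTER)

        if tonumber(dlg.status) >= 0 then
                -- dlg.value is in the system encoding (UTF8MODE_FILE=NO),
                -- so it can be handed straight to io.open.
                local path = dlg.value
                local f = assert(io.open(path, "rb"))
                f:close()
                -- Convert before handing the string to another control.
                local label = iup.label{title = to_utf8(path)}
        end

On a UTF-8 locale (the usual case on current GNU/Linux distros) the two
representations coincide; the conversion only really bites on Windows,
where the system code page is typically not UTF-8.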

Hope this helps,

s-b etc etc