- Subject: Re: Could Lua itself become UTF8-aware?
- From: Paul Merrell <marbux@...>
- Date: Sat, 29 Apr 2017 23:08:25 -0700
On Sat, Apr 29, 2017 at 8:05 PM, Patrick Donnelly <email@example.com> wrote:
> I'm very against even inching towards this destination. Lua is a
> *language*. As soon as we start allowing identifiers outside of ASCII,
> we begin to cultivate "dialects". Only with full support would
> anyone's Lua be able to load scripts written with identifiers from
> another language. And, of course, programmers not fluent in that
> language would be at great disadvantage.
> Air traffic control for flight standardized on English so any pilot
> can communicate with any flight controller. In the same way, I think
> it makes a lot of sense for programmers to accept that English is the
> lingua franca for programming, including comments, documentation, and
> identifiers. There really is no upside to allowing non-English (ASCII)
> Maybe that's self-serving as an American for which I apologize.
[long post warning]
Some telegraphy history might stretch your horizon a bit.
One of the early experimental systems in France at the sending end
involved selecting a wire corresponding to a character and briefly
connecting it to an electrical power source, which caused a charge to
be sent down the wire. At the receiving end was a group of people,
each holding a wire in their hand corresponding to a character. When
one of them felt a shock, they were to shout the name of their
character. A transcriber seated among the group was to write down each
character as it was received. Crude and slow, but it worked.
Skip ahead and we arrive at Baudot, who devised what amounted to a
5-bit code that could handle an English or French alphabet. Never
commercialized, but his system led to similar systems that were
successful. Five-bit telegraphy code peaked with invention and
commercialization of the Teletype, a wondrous device that allowed
using a typewriter keyboard and eventually punched 5-bit TTY paper
tape for storage and advance recording of telegraph messages.
But not expressive enough for some folk, such as newspaper types who
wanted both lower case and upper case characters plus special
characters and a few codes for the forerunners of today's markup
languages. Enter the Teletypesetter in 1928, which used a 6-bit TTS
code that, with two code pages (the typewriter shift key was the
switch between them), fit the bill. Suddenly, lines of type could be
set on one end of the continent but cast from molten lead at the
other end. All you had to do was to learn to cope with a
typewriter-like keyboard that had both a Shift key and an Unshift key.
Then in the 1950s, when IBM et ilk were looking for ways to
commercialize computers, they realized that the newspaper industry was
working with punched paper tape that was inherently binary and that
newspapers had the money to buy computers that could reduce very
expensive labor costs. So we got very crude hyphenation programs that
could process "idiot tape," TTS tape punched without line breaks, each
paragraph a continuous stream. The computers could process the tape
and add the line endings and do the hyphenation. But as word
processing technology developed in the newspaper industry, 6 bits
with two code pages just wasn't expressive enough, so a third code
page was engrafted onto TTS, using the dollar-sign character as its
toggle (we had to put two consecutive dollar signs to get a dollar
sign after that). But that gave us a code page for cryptic commands
like "$d52" that were the forerunners of today's word processing
But even with a third code page, TTS just wasn't expressive enough. So
along comes ASCII, a 7-bit code usually stored in 8-bit bytes. Room
for the entire English alphabet, both
upper and lower case, plus a lot more special characters and computer
codes. And what had been the printer's handwritten markup language
that had evolved over some 500 years was abruptly translated into
computer code that could be processed far more quickly. ASCII was even
expressive enough for modern "markup" languages and programming
But ASCII was not expressive enough for all human languages, notably
the CJK languages that depend on iconic symbols rather than alphabetic
characters. And thus Unicode was born.
UTF-8 is now recognizable as ASCII's successor. It's become the most
common character set specified for web pages. And some of the
most-used programming libraries speak UTF-8, e.g., the multi-platform
GTK and Qt families of libraries.
I work on a GTK-based program that runs on a wide variety of
operating systems and is localized for about 17 human languages,
including CJK languages and right-to-left languages. We couldn't do
that if our data files were written to ASCII. We have to use UTF-8.
And we embed Lua, which means that Lua has to handle UTF-8 strings. We
get there by also embedding Xavier Wang's lua-utf8 code, which
lets our scripters get and set character offsets that allow for UTF-8
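
To make the byte-versus-character distinction concrete, here is a minimal sketch. It assumes Lua 5.3 or later and uses only the standard `utf8` library, not lua-utf8's own API. The sample string and the helper name `char_at` are made up for illustration.

```lua
-- Assumption: Lua 5.3+ (standard utf8 library and \u{...} escapes).
-- The sample text is "naive" with a diaeresis on the i, a space, and
-- three CJK characters, written with escapes so the file stays ASCII.
local s = "na\u{EF}ve \u{65E5}\u{672C}\u{8A9E}"

-- Lua's # operator and string.sub count bytes, not characters.
print(#s)             -- byte length
print(utf8.len(s))    -- character (code point) count

-- utf8.offset maps a character index to the byte position where it starts.
print(utf8.offset(s, 7))

-- Hypothetical helper: return the n-th character as a string, or nil
-- when n is out of range.
local function char_at(str, n)
  local first = utf8.offset(str, n)
  local next_start = utf8.offset(str, n + 1)
  if not first or not next_start then
    return nil
  end
  return str:sub(first, next_start - 1)
end

print(char_at(s, 9))

-- Slicing by raw byte index can cut a multi-byte character in half;
-- utf8.len then reports the result as invalid by returning nil.
print(utf8.len(s:sub(3, 3)))
```

A script that indexes by bytes works on ASCII data and silently breaks on text like this, which is why an offset-mapping layer is needed whenever scripters think in characters.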
But my message here is that Lua either makes the transition to more
elegant handling of UTF-8 data or Lua will go the way of those people
who held those wires, waiting for a shock so they could shout their
character. Lua's code may be written in ASCII, but Lua has to make the
transition to UTF-8 strings or Lua will be obsoleted much sooner than
it has to be.
Or in other words, it's the character set of the data content rather
than the character set of the programming code that points the way
here. The data is going to be UTF-8 and Lua must evolve to handle it.
My 2 cents.
Virtually all of the basic modern word processing techniques were
originally developed for the newspaper industry while it was still
using TTS code. There is probably more here about TTS and related
technology than you will want to view/read.
<http://www.gochipmunk.com/html/contents.html>. I worked as a
typographer in daily newspapers in the years spanning the introduction
of computers and the transition from hot metal linecasting to