On 21/11/2016 at 08:08, TW wrote:
As said, I'm in fact more interested in an Esperanto localization; though I have no fundamental opposition to localization into various native languages, it's not my main interest. Esperanto is far easier to learn than many, if not all, other spoken languages out there. In fact, it seems that, in the same amount of time spent learning only one language such as English, French or German, learning Esperanto first and then another language makes people more skilled than those who studied the latter alone. I won't spend too much time advocating this topic here; you can look at Wikipedia's Esperanto article for more sources, and at the Esperanto myth-busting essays of Claude Piron, which give you the point of view of someone who worked as a translator for the UN. Note that I'm open to further discussion on this topic, but if I'm not mistaken, this list is not the fittest place for such a conversation, is it?

> Can you elaborate on the benefits of programming language localization? I am quite skeptical about this. Learning a programming language is like learning a new language and IMO it hardly makes it more difficult to learn the English keywords. But learning a localized programming language makes it harder to apply one's gathered knowledge beyond private projects or other programming languages. It's also much harder to find help.

For more (mostly academic) sources about the use of localized programming languages and compilers, here are a few links:
I didn't use other systems over the last years, but I doubt there are many systems which don't have environment variables, are there?

> I'm less fundamentally skeptical concerning the internationalization of error messages, though there are problems as well - like finding help by feeding a search engine with a localized error message.

I usually use `LANG=C` when I want help on errors, so this is not a big problem, at least not on *nix systems.

I agree. Now, note that this formulation tacitly implies that the opposite situation is better, and with that I don't agree. This is just a situation where you have to choose between two solutions which both have their pros and cons. Sure, giving full latitude to diversity always comes with a cost, but so does the cult of standardized monoculture. Again, I'm not encouraging further debate over this on this mailing list. Please answer in private or suggest a fitter public channel. This holds for anything in this answer which is not directly related to Lua-i18n, Lupa, or Mallupa, unless a wide consensus is expressed to do otherwise in some way.

> The bigger problem with internationalization is that in theory it is good, but alas, in practice, translations are frequently hair-raising, confusing and misleading.

Again, I do agree, and in my translations I take time to produce not only a translation which is relevant in the given context, but one which also provides a coherent lexicon, so that semantic proximity is reflected by lexical proximity. I also take the length of the lexemes into account, preferring shorter words where the lexeme stays meaningful, and longer lexemes where the existing shorter options are meaningless. For Babylscript's Esperanto translation, you might consult the token translation document and the error message document, which both provide explanations of the proposed choices, along with other alternatives. If you have feedback to improve these documents, please leave it in the documents themselves. For the other Babylscript translations I'm not involved, but a concrete example of what you mean would be welcome.

> Commercial applications might be doing slightly better on average than open source applications, depending on their budget. Quality assurance is hard as maintainers don't understand all the languages. My impression is that in many cases, well-meaning enthusiasts without deeper knowledge of the technical terminology in the respective domain just do a quick translation without too much thought. I looked at a Babylscript translation that is a perfect example of that. I hope no one ever attempted to learn or will learn JavaScript using that translation. Translation is hard and requires quite some thought. In programming language design, especially much thought should have been spent on the choice of the original keywords.

To my mind, that sounds more like a chicken-and-egg problem than a fundamental problem of localized programming languages.

> So, to really let people benefit from translations, true experts in both languages, the subject to translate and the respective technical terminologies in both languages are needed. And I'm not at all convinced that one does people a favor by translating programming languages. I think it becomes harder rather than easier to learn the language because learning resources are very limited or not existent at all. There is a plethora of books and free material on e.g. JavaScript in many languages, some of it excellent, but probably hardly any material for translated versions of JavaScript/Babylscript, if any at all. No offense.
Constructive criticism, comments and questions are always welcome. Thank you for having taken the time to give some feedback.

> Sorry for the negativity - I hope my reasoning has been rational and unoffensive. I'd encourage translations of messages with advice to install a thorough review process before unleashing things on confused users.
>
> Thomas W.

2016-11-20 22:20 GMT+01:00 mathieu stumpf guntz <psychoslave@culture-libre.org>:

Hello, if you are interested in Lua, internationalization and possibly programming language localization, you might be interested in this thread.

A bit of background

You can skip this section if you are less interested in the human story and more interested in the technical aspects of Lua internationalization.

My initial motivation was to have a programming language that only uses phonetic signs. One way to do that is to assign a unique phonetic value to each sign usable in the programming language. For example in Lua, which currently accepts only ASCII tokens, a 256-sign mapping is enough (string contents apart). That is rather easy, and you can even easily make the map monosyllabic. For example, my native language has 16 vowels (V) and 20 consonants (C), which is more than enough to give a CV (or VC) syllable to each ASCII code. A quick and dumb mapping would just assign them in some arbitrary order, and a somewhat less dumb solution would try to build a mapping with some mnemonic relation to the sign usually associated with each ASCII code; a sketch of the dumb variant is given below.
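To make the idea concrete, here is a minimal Lua sketch of such a "quick and dumb" mapping. The vowel and consonant inventories below are illustrative placeholders rather than the phonemes of any particular language; a 16x16 grid already covers all 256 byte values exactly.

    -- Illustrative sketch: give every byte 0..255 a unique CV syllable.
    -- The two inventories below are placeholders, not a real phoneme
    -- list; 16 consonants x 16 vowels = 256 syllables, exactly enough.
    local vowels     = { "a","e","i","o","u","y","aa","ee","ii","oo","uu","ai","au","ei","eu","oi" }
    local consonants = { "b","d","f","g","j","k","l","m","n","p","r","s","t","v","z","w" }

    local syllable = {}  -- byte value -> syllable
    for code = 0, 255 do
      local c = consonants[math.floor(code / 16) + 1]
      local v = vowels[code % 16 + 1]
      syllable[code] = c .. v
    end

    -- Spell out a piece of source code, one syllable per byte.
    local function phonetize(source)
      local out = {}
      for i = 1, #source do
        out[#out + 1] = syllable[source:byte(i)]
      end
      return table.concat(out, " ")
    end

    print(phonetize("x = 1"))  -- five syllables, one per ASCII byte

A mnemonic variant would simply order the grid less arbitrarily, for instance so that related signs (brackets, arithmetic operators) share a consonant.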
But even then, to my mind that would remain a rather impractical solution for anything useful. So I looked for a spoken language which has a phonetic transcription and would ideally offer, as a bonus, morpho-syntactic properties friendly to a programming language use case. In this regard Lojban might be a good choice, I guess; at least it passed within my radar. But another aspect of usefulness is the number of speakers. On that point, without forgetting the previous one, Esperanto makes a better candidate. So I began to write research projects about it, namely Fabeleblaĵo and Algoritmaro, but in my native language, as I didn't feel skilled enough in Esperanto at the time. More recently I transferred some courses of the International Academy of Sciences San Marino, which uses Esperanto as a common working language. Indeed, as I wanted to begin translating my still-in-progress work on Fabeleblaĵo and Algoritmaro to Esperanto, I discovered that the Esperanto version of Wikiversity is still in beta. I'm trying to change that by adding some courses, and in the process creating useful wiki templates and filing feedback and tickets. The latter are grouped under the Eliso tag on Wikimedia Phabricator (Eliso stands for "Esperanto kaj LIbera Scio", i.e. "Esperanto and free knowledge"). For now, I have completely finished only one wikification, the course on internationalization. I also made an Esperanto localization for Babylscript. I'm not completely satisfied with this solution, as JS, at least the version implemented in the Rhino branch from which Babylscript is derived, doesn't even allow you to import other scripts, which is a huge restriction. Then, as a Wikimedia contributor, I met Lua, which is used there to create frontend-editable modules, as you may know. So came the wish to make an Esperantist version of Lua.

Lupa and Mallupa

For those who are really only interested in Lua internationalization, you may skip this section and its subsections. It mainly focuses on presenting (still in progress) projects which so far took the more direct approach of localization to Esperanto, the problems encountered, and the solutions used or considered.

Lupa

So far, Lupa aims to provide an Esperantist version of Lua. At first I just wanted to make it a pure Esperanto version of Lua. Now, thanks to Luiz Henrique de Figueiredo's advice and implementation suggestions, I have already shifted from a complete replacement of keywords to a more backward-compatible approach which only provides aliases for the miscellaneous built-in tokens. The current implementation is not in a sane state; for example, a simple single `:` will make lupe (the Lua interpreter counterpart) crash. Still, it already makes it possible to write small pieces of code like `se 1 plus 1 egalas 2 tiam printu 'tio estas bona aritmetiko' hop`, and it works.

As Esperanto has a very regular grammar, unlike most spoken languages out there, parsing it is a rather practicable task. Even without such support, you can already make most statements coincide with semantically sound Esperanto sentences, if you choose your tokens carefully. That's another driving criterion behind the list of lexeme translations on the project wiki. For example, one can write the statement `tabelo['enigo'] = 3` as `tabelo kies 'enigo' ero iĝu 3`, the latter also being a plain Esperanto sentence meaning "table whose 'entry' element shall become 3". This Esperanto version is a bit longer than its grapho-ideographic mixup counterpart, but keep in mind that the tokens are only aliased, so one can still use the former mixup. Also note that plain Esperanto offers shorter ways to express the same thing, like `tabelenigiĝu 3`, or, in a more parser-friendly form which is still valid Esperanto, `tabel-enig-iĝu 3`. But of course, that kind of syntax can't be treated within the scope of mere relexicalisation.
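To illustrate what such a relexicalisation amounts to, here is a minimal Lua sketch in the spirit of the source-to-source approach that Mallupa takes (presented further down). The alias table is merely guessed from the example above; the actual lexeme list lives on the project wiki.

    -- Minimal sketch of "static aliases only": rewrite aliased words
    -- and hand the result to the stock Lua interpreter.  The aliases
    -- are guesses for illustration, not the project's official list.
    local aliases = {
      se = "if", tiam = "then", hop = "end",
      plus = "+", egalas = "==", printu = "print",
    }

    local function relexicalize(source)
      -- Naive word-by-word gsub: it would also rewrite aliased words
      -- inside string literals, which is why a real implementation
      -- reuses a proper lexer, as Mallupa does with ltokenp.
      return (source:gsub("[%a_][%w_]*", aliases))
    end

    local src = "se 1 plus 1 egalas 2 tiam printu 'tio estas bona aritmetiko' hop"
    print(relexicalize(src))
    --> if 1 + 1 == 2 then print 'tio estas bona aritmetiko' end
    assert(load(relexicalize(src)))()  -- Lua 5.2+: prints the sentence

Note that because the tokens are aliased rather than replaced, the rewritten output and plain Lua source can coexist in the same code base.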
Even sticking to the scope of "static aliases only", there are still some problems in localizing Lua toward Esperanto. First, Lua doesn't provide support for Unicode in identifiers and other built-in tokens. Esperanto does have a set of non-ASCII characters in its usual written form, namely ĉ, ĝ, ĥ, ĵ, ŝ and ŭ. But when it's not possible to use them, it's a recognized practice to append -h or -x to the letter stripped of its diacritic (e.g. `cx` for `ĉ`). As "x" isn't part of the Esperanto alphabet, it is the less problematic of the two regarding possible lexeme collisions. So far, Lupa uses the -x notation to circumvent the script encoding limitation.

A minor problem is that, as far as Esperanto is concerned, numbers normally use a comma as decimal separator rather than a dot, at least if you refer to the most authoritative sources. It's minor in the sense that, in practice, usage varies, and not every Esperantist takes great care of such typographic "subtleties". On the technical side, it's more annoying, as `3,14` already has a well-defined, completely different meaning in Lua: the comma is a list separator, so `print(3,14)` prints the two values 3 and 14. Babylscript, for example, proposes to use a space-surrounded comma to resolve the ambiguity in the similar case of French. As far as I'm concerned, I would rather use a token like `plie` ("and ... as well", "and also", "together with") as the list separator operator. From a broader internationalization perspective, the number recognition of the lexer would require far more thought to support more diverse numbering systems, such as १.६ for Hindi.

Future development of Lupa should somewhat reverse its approach, so as to modify the official interpreter as little as possible. Hence the Lua-i18n project presented below, which should focus on providing internationalization facilities, ideally with an approach that allows building on top of it other tools flexible enough to support some syntactic changes. Lupa could then base its later evolutions on Lua-i18n.

Mallupa

While Lupa modifies Lua directly, Mallupa just translates a localized dialect to a plain old Lua script. Currently it uses ltokenp, which itself reuses the Lua lexer, to retrieve lexemes. And it includes a Lupa dialect which already provides more features than Lupa. As the main part of the code is in Lua, development is far easier. On the other hand it comes with its cons: it's a source-to-source compiler, so it makes debugging harder due to the additional layer of translation. And as it relies on the Lua lexer, there are some flexions which still can't be performed. In particular, I wanted to add support for the numeral suffix "-a" on digits, which makes sense for table locations. But a string like `1a` will be taken as a malformed number by the lexer, and it will never reach the dialect converter script. To avoid that, either the lexer should be changed, or the project should rely on another lexer.

Lua-i18n

So, Lua-i18n is focused on providing internationalization facilities while modifying the official Lua release as little as possible. Some related issues have been added and described on the project page:

- Internationalization of built-in messages
- Internationalization of built-in tokens
- Unicode support

For the last one, Luiz suggested the following:

    A hack to allow unicode identifiers is to set chars over 128 to be letters. You can do this by editing lctype.c. Ask in the mailing list about this.

He also provided me the attached file with this comment:

    Here is what I had in mind for a token filter in C. This piece of C code centralizes all needed changes. Just add <<#include "proxy.c">> just before the definition of luaX_next in llex.c. That's the only change in the whole Lua C code.

With such a hack, an identifier like `ĉevalo` becomes acceptable, since every byte in the UTF-8 encoding of its non-ASCII letters is above 128. So, so far I can't say that I lack help or a path that needs deeper exploring, and I thank Luiz again for all this. But still, if you are interested in Lua-i18n, or have any advice, comment, or question, please feel free to reply here or to add it to the relevant project issue tracker.

Kind regards,

Mathieu