lua-users home
lua-l archive





Hello, if you are interested in Lua, internationalization, and possibly the localization of programming languages, this thread may interest you.


# A bit of background

You can skip this section if you are less interested in the human story and more interested in the technical aspects of Lua internationalization.

My initial motivation was to have a programming language that uses only phonetic signs.

Well, one way to do that is to assign a unique phonetic value to each sign usable in the programming language. For example, in Lua, which currently can only use ASCII tokens, a 256-sign mapping is enough (string content apart). That's rather easy, and you can even easily have a monosyllabic map. For example, in my native language we have 16 vowels (V) and 20 consonants (C); that's more than enough to make a CV (or VC) syllable for each ASCII element. A quick and dumb mapping would just assign them in some arbitrary order, and a somewhat less dumb solution would try to make a mapping with some mnemonic relation to the usual sign denoted in ASCII. But even then, to my mind that would remain a rather impractical solution for anything useful.

So I looked for a spoken language which had a phonetic transcription and, as a bonus, would ideally have morpho-syntactic properties friendly to a programming language use case. In this regard Lojban might be a good choice, I guess; at least it crossed my radar. But another aspect of usefulness is the number of speakers. On that point, without forgetting the previous one, Esperanto makes a better candidate. So I began to write research projects about it, namely [Fabeleblaĵo][] and [Algoritmaro][], though in my native language, as I didn't feel skilled enough in Esperanto at the time.

More recently I transferred some courses of the [International Academy of Sciences San Marino][AIS], which uses [Esperanto as a common working language][A UNIVERSITY MAINLY IN ESPERANTO]. Indeed, as I wanted to begin translating my still-in-progress works on Fabeleblaĵo and Algoritmaro into Esperanto, I discovered that [the Esperanto version of Wikiversity][Vikiklerigejo] is still in beta. I'm trying to change that by adding some courses, and in the process creating useful wiki templates and filing feedback and tickets. The latter are grouped under the Wikimedia [phabricator Eliso tag][] (Eliso stands for Esperanto kaj LIbera Scio, i.e. Esperanto and free knowledge). For now, the only wikification I have completely finished is the [course on internationalization][].

I also made an [Esperanto localization][] for [Babylscript][]. I'm not completely satisfied with this solution, as JS (at least the version implemented in the Rhino branch from which Babylscript is derived) doesn't even allow you to import other scripts, which is a huge restriction. Then, as a Wikimedia contributor, I met Lua, which is used there to create frontend-editable modules, as you may know. Hence the wish to make an Esperantist version of Lua.

# Lupa and Mallupa

If you are really only interested in Lua internationalization, you may skip this section and its subsections. It mainly focuses on presenting (still in progress) projects which have so far taken an approach of direct localization to Esperanto, the problems encountered, and the solutions used or considered.

## Lupa

So far [Lupa][] aims to provide an Esperantist version of Lua. At first I just wanted to make it a pure Esperanto version of Lua. Now, thanks to Luiz Henrique de Figueiredo's advice and implementation suggestions, I have already shifted from a complete replacement of keywords to a more backward-compatible approach which only adds aliases for various built-in tokens. The current implementation is not in a sane state; for example, a simple single `:` will make `lupe` (the `lua` interpreter counterpart) crash.

Still, it already enables writing little pieces of code such as `se 1 plus 1 egalas 2 tiam printu 'tio estas bona aritmetiko' hop`, and it works. (Assuming the aliases behave as intended, this corresponds to the plain Lua `if 1 + 1 == 2 then print 'tio estas bona aritmetiko' end`.)

As Esperanto has a very regular grammar, unlike most spoken languages out there, parsing it is a rather practicable task. Even without such support, you can already make most statements coincide with semantically sound Esperanto sentences, if you choose your tokens carefully. That's another driving criterion behind the list of lexeme translations on the [project wiki][]. For example, one can write the statement `tabelo['enigo'] = 3` as `tabelo kies 'enigo' ero iĝu 3`, the latter also being a plain Esperanto sentence meaning "table whose element 'enigo' (entry) becomes 3". This Esperanto version is a bit longer than its grapheme-ideographic mixup counterpart, but keep in mind that the tokens are only aliased, so one can also use the former mixup. Also note that plain Esperanto offers shorter ways to express the same thing, like `tabelenigiĝu 3`, or, in a more parser-friendly form which is still valid Esperanto, `tabel-enig-iĝu 3`. But of course, that kind of syntax can't be treated within the scope of mere relexicalization.

Even sticking to the scope of "static aliases only", there are still some problems in localizing Lua toward Esperanto. First, Lua doesn't support Unicode in identifiers and other built-in tokens. Esperanto does have a set of non-ASCII characters in its usual written form, namely ĉ, ĝ, ĥ, ĵ, ŝ and ŭ. But when it's not possible to use them, it's a recognized practice to append -h or -x to the letter stripped of its diacritic. As "x" isn't part of the Esperanto alphabet, it's less problematic regarding possible lexeme collisions. So far, Lupa uses the -x notation to circumvent the script encoding limitation.

A minor problem is that, as far as Esperanto is concerned, numbers normally use a comma as the decimal separator rather than a dot, at least according to the most authoritative sources. It's minor in the sense that, in practice, usage varies, and not every Esperantist takes great care of such typographic "subtleties". On the technical side, it's more annoying, as `3,14` has a well-defined, completely different meaning in Lua. Babylscript, for example, proposes to [use a space-separated comma to resolve the ambiguity][babnumber] in the similar case of French. As far as I'm concerned, I would rather use a token like `plie` ("and ... as well", "and also", "together with") as the list separator operator. From a broader internationalization perspective, the lexer's number recognition would require far more thought to support more diverse numbering systems, such as १.६ in Hindi.

Future development of Lupa should somewhat reverse its approach, so as to modify the official interpreter as little as possible. Hence the Lua-i18n project presented below, which should focus on providing internationalization facilities, ideally in a way that allows building other tools on top of it that are flexible enough to support some syntactic changes. Lupa could then base its later evolutions on this Lua-i18n.



## Mallupa

While Lupa modifies Lua directly, Mallupa just translates a localized dialect to a plain old Lua script. Currently it uses [ltokenp][], which itself reuses the Lua lexer, to retrieve lexemes. And it includes a Lupa dialect which already provides more features than Lupa. As the main part of the code is in Lua, development is far easier. On the other hand, it comes with its cons: being a source-to-source compiler, it makes debugging harder due to the additional layer of translation.

As it relies on the Lua lexer, there are some inflections which still can't be handled. In particular, I wanted to add support for the numeral suffix "-a" on digits, which makes sense for table locations. But a string like `1a` will be treated as a malformed number by the lexer, so it will never reach the dialect converter script. To avoid that, either the lexer has to be changed, or the project has to rely on another lexer.


# Lua-i18n

So, [Lua-i18n][] is focused on providing internationalization facilities while modifying the official Lua release as little as possible.

Some related issues have been added and described on the project page.

1. Internationalization of built-in messages
2. Internationalization of built-in tokens
3. Unicode support

For the last one, Luiz suggested the following:

    A hack to allow unicode identifiers is to set chars over 128 be letters.
    You can do this by editing lctype.c.
    Ask in the mailing list about this.

He also provided me the attached file with this comment:

    Here is what I had I mind for a token filter in C. This piece of C code
    centralizes all needed changes. Just add <<#include "proxy.c">> just
    before the definition of luaX_next in llex.c. That's the only change in
    the whole Lua C code.

So, so far I can't say that I lack help or a path that needs deeper exploration, and I thank Luiz again for all his help. Still, if you are interested in Lua-i18n and have any advice, comment, or question, please feel free to reply here or add it to the relevant project issue tracker.

Kind regards,
Mathieu

# References

[Esperanto localization]: https://github.com/psychoslave/babylscript
[Babylscript]: http://babylscript.com/
[Fabeleblaĵo]: https://fr.wikiversity.org/wiki/Recherche:Fabelebla%C4%B5o
[Algoritmaro]: https://fr.wikiversity.org/wiki/Recherche:Algoritmaro
[AIS]: https://en.wikipedia.org/wiki/Akademio_Internacia_de_la_Sciencoj_San_Marino
[A UNIVERSITY MAINLY IN ESPERANTO]: https://www.academia.edu/28199232/A_University_Mainly_in_Esperanto
[Vikiklerigejo]: https://beta.wikiversity.org/wiki/%C4%88efpa%C4%9Do
[phabricator Eliso tag]: https://phabricator.wikimedia.org/tag/eliso/
[course on internationalization]: https://beta.wikiversity.org/wiki/Internaciigo_de_komputilaj_programoj
[Lupa]: https://github.com/psychoslave/lupa
[project wiki]: https://github.com/psychoslave/lupa/wiki
[babnumber]: http://www.babylscript.com/wiki/FeatureOverview.html#Numbers
[ltokenp]: http://lua-users.org/lists/lua-l/2016-05/msg00028.html
[Lua-i18n]: https://github.com/psychoslave/lua-i18n
/*
* proxy.c
* lexer proxy for Lua parser -- allows aliases
* Luiz Henrique de Figueiredo <lhf@tecgraf.puc-rio.br>
* Sun Nov 13 09:24:13 BRST 2016
* This code is hereby placed in the public domain.
* Add <<#include "proxy.c">> just before the definition of luaX_next in llex.c
*/

#define TK_ADD		'+'
#define TK_BAND		'&'
#define TK_BNOT		'~'
#define TK_BOR		'|'
#define TK_BXOR		'~'	/* in Lua, '~' is both bitwise not and xor; '^' is exponentiation */
#define TK_DIV		'/'
#define TK_GT		'>'
#define TK_LT		'<'
#define TK_MINUS	'-'
#define TK_MOD		'%'
#define TK_POW		'^'
#define TK_SUB		'-'

static const struct {
    const char *name;
    int token;
} aliases[] = {
    { "nee",     TK_BNOT },
    { "disauxe", TK_BXOR },
    { "superas", TK_GT },
    { "malinfraas", TK_GT },
    { "suras", TK_GE },
    { "almenauxas", TK_GE },
    { "malsubas", TK_GE },
    { "egalas",  TK_EQ },
    { "samas",   TK_EQ },
    { "malsamas",TK_NE },
    { "neegalas",TK_NE },
    { "infraas", TK_LT },
    { "malsuperas", TK_LT },
    { "subas", TK_LE },
    { "malsuras", TK_LE },
    { "malalmenauxas", TK_LE },
    { "kaje", TK_BAND },
    { "auxe", TK_BOR },
    { "sobsxove", TK_SHR },
    { "sorsxove", TK_SHL },
    { "plus", TK_ADD },
    { "mal", TK_MINUS },
    { "kontraux", TK_MINUS },
    { "minus", TK_SUB },
    { "disige", TK_DIV },
    { "divide", TK_DIV },
    { "ozle", TK_DIV },
    { "onige", TK_IDIV },
    { "parte", TK_IDIV },
    { "pece", TK_IDIV },
    { "kvociente", TK_IDIV },
    { "module", TK_MOD },
    { "kongrue", TK_MOD },
    { "alt", TK_POW },
    { "potencige", TK_POW },
};

static int nexttoken(LexState *ls, SemInfo *seminfo)
{
	int t=llex(ls,seminfo);
	if (t==TK_NAME && strcmp(getstr(seminfo->ts),"sia")==0) {
		seminfo->ts = luaS_new(ls->L,"self");
		return t;
	}
	if (t==TK_NAME) {
		int i;
		int n = sizeof(aliases)/sizeof(aliases[0]);
		for (i=0; i<n; i++) {
			if (strcmp(getstr(seminfo->ts),aliases[i].name)==0)
				return aliases[i].token;
		}
	}
	return t;
}

#define llex nexttoken