Re: [OT] Re: Allow UTF8 value names in lua?

But the sinograms used in Chinese are not really pictographic, and most of them are NOT ideographic: they are compounds, most frequently pairing two symbols (sometimes more), one of them being a semantic classifier and the other encoding a true phonetic syllable in one of the Chinese dialects (sometimes a historic dialect, with important differences in pronunciation across Chinese languages and regional dialects, but still with a clear etymology).

That's why Chinese people can read the written text: the phonetic part of each sinogram is easily recognized in the compound symbol, as is the ideographic classifier. Then come some variations of strokes or simplifications for aesthetic or readability reasons. That's why it's not possible to easily render each sinogram from its decomposition: each compound has its specific layout rules and alterations of strokes, and some variations of strokes are also made for cultural reasons or vary across regions or rendering styles.

Chinese sinograms are thus formed from a smaller set of base symbols, some with phonetic values, others with ideographic values; both kinds were derived from early logograms that were used in isolation to represent a single concept.

Also, not all Chinese words are monosyllabic, and in a few cases some words are now pronounced with more or fewer syllables, but the historic compounds were monosyllabic and each compound was associated with a single syllable. In some more recent developments the monosyllabic rules have changed in the spoken language without altering the orthography significantly: in some cases the changes were marked by altering, adding or suppressing some strokes in one part of the base phonograms or ideograms used in the compound; in other cases overly complex compounds were broken into several parts, with a dummy phonetic symbol added.

Those who claim that the Chinese Han sinograms are logographic are wrong. Most Chinese people (as well as Japanese with kanji) would not be able to read their language, with its rich vocabulary and semantics, if they did not understand how the system works. In Korea, sinograms were used but were reduced to a smaller set of usable simplified phonograms, to create an alphabet with much stricter composition and layout rules for the compounds. The same occurred for the Yi and Mongolian languages (which adapted the early Brahmic scripts to the inherited Sino-Tibetan, Turkic and Semitic scripts, from which new base sinograms were adapted to adopt the Chinese calligraphy). Recently Chinese has also adapted alphabetic scripts (notably Latin and Cyrillic) and now uses them in derived forms as part of its compounds, to create a richer set capable of representing more phonological differences.

Sinograms are thus extremely complex and include many features from a lot of languages, cultures, scripts and dialects. But it's a fact that they remain relatively easy to read for natives even if they have never seen a specific sinogram compound: this is the same skill that allows readers of alphabetic scripts to recognize syllables and words, which just use a simpler linear layout, while sinograms have complex compositions and a lot of contextual variants for the base glyphs. Consequently, all attempts to build auto-composed fonts for Chinese sinograms based on a normalized decomposition have failed: it is impossible to do without a lot of exceptions to the rendering rules, so many that it's just simpler to encode each compound separately. This is a problem only for computer fonts. Chinese people know how to create, read and understand new compounds easily, but they also now have a wide choice among the multiple possible variants, so there are several competing traditions. The Chinese script is NOT unified: its set of common compounds is not fixed and varies across regions and dialects of the same language, or depending on the context of use and the material used to reproduce these characters, with artistic calligraphy using many more variations than those available in common computer fonts and supported by several competing standards.

Unicode recognizes this variety, and for this reason sinograms are not encoded by Unicode as compositions of their parts and have few properties (even their IDS decomposition is still not standardized and it probably never will be). Unicode instead chose to encode only the compounds supported in one of the competing standards. It was hard for Unicode to convince the Chinese, Japanese, Korean, and Vietnamese governments to adopt a common standard. Vietnam gave up, as did Singapore. China initially wanted to promote a single standard but also gave up (for now; no change is likely to occur before 2047): it again allowed the use of the Latin, Cyrillic, Arabic, Tibetan, Yi and Mongolian scripts, and finally approved the Southern Chinese variants as well as the Taiwanese ones, since it wants to reunify Taiwan politically with the mainland without inflaming popular opposition to Han domination in culture. Japan maintained its kanji standard.
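To make the encoding point concrete, here is a small sketch using Python's standard unicodedata module. The character 媽 is my own choice of illustration (a compound usually analyzed as the semantic part 女 plus the phonetic part 馬); nothing in this thread names it. The sketch only shows what Unicode exposes for such a compound: one code point, no canonical decomposition, and an IDS description that is just a separate three-character sequence.

```python
import unicodedata

# Assumption: 媽 is only an illustrative phono-semantic compound
# (semantic part 女, phonetic part 馬); the post names no specific character.
compound = "媽"
ids = "⿰女馬"  # Ideographic Description Sequence: layout operator + two components


def describe(ch):
    """Return (code point, Unicode name, canonical decomposition) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decomposition(ch))


# The compound and each element of its IDS are all separate, atomic code points.
for ch in compound + ids:
    print(describe(ch))

# One code point for the compound, encoded as three bytes in UTF-8.
print(len(compound), len(compound.encode("utf-8")))

# The IDS is a description, not an equivalent encoding:
# normalization never turns it into the compound.
print(unicodedata.normalize("NFC", ids) == compound)
```

The decomposition field comes back empty for the compound, and the last line prints False: the IDS is only a human-readable description of the layout, which fits the point that each compound ends up encoded separately.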

So one lesson to learn: Chinese is not logographic, and it features a phonetic system as part of its script, which is more precisely described as a script family encompassing multiple scripts with some basic calligraphic and typographic traditions and only very basic composition in fixed grids. That grid does not reflect the effective composition of each cell, which is treated on computers as a single "character" even though the cells are clearly compounds with a clear reading and decipherable basic phonetics (sufficient, along with the additional semantic classifier, to create a very precise lexicon; lexemes in Chinese are not full words, and not exactly individual syllables except in historic variants). The spoken language also uses a lot of affixes that may or may not be written as separate lexemes or added into the compounds, using the aesthetic calligraphic traditions for visual harmony.