Re: unicode char ranges

lua-l archive


Subject: Re: unicode char ranges
From: Hans Hagen <pragma@...>
Date: Thu, 06 Dec 2012 10:48:50 +0100

On 12/6/2012 6:12 AM, Dirk Laurie wrote:

> 2012/12/5 Jay Carlson <nop@nop.com>:
>
>> Here's a nickel. Get yourself a real operating system
>> (or perhaps just a real MUA).
>
> You're the second poster to make snide remarks at my OS.
> Adam called it "crappy".
>
> Actually unnecessary decomposed characters cannot arise
> on my system without great inconvenience, so I can't blame
> the authors for failing to provide an output mechanism that
> uncraps crappy input.

Typographic issues are a bit beyond this list, but here is how it works (a bit simplified, as more is involved):

- input can consist of either a sequence of characters that are turned into one (u + diaeresis = udiaeresis) or of direct code points (udiaeresis); from the linguistic point of view the two dots can represent something different per language, e.g. an umlaut in german

- a font can provide a composed character as precomposed or as decomposed, and most modern (truetype/opentype) fonts provide for this; some fonts have composed glyphs but at the same time carry the information of how to compose them from other glyphs

- the way composition happens depends on the font logic: it can be done via substitution (resulting in a precomposed glyph) or relative positioning; fonts may also require a decomposition step and start from the individual characters

- in most cases collapsing already takes place at the input stage, i.e. decomposed sequences get turned into composed ones (but a font might demand decomposition later on)

- characters get represented by glyphs and there is a one to many relationship, think of smallcap, oldstyle and other renderings; a font can have rulesets that are to be applied in sequence

- in a precomposed glyph the (for instance) accent is part of the package and the graphic definition might provide clues for rendering (hinting)

- in the decomposed case the base character and the accent (officially called mark) get positioned relative to each other using so-called anchors; in that case you can run into rounding errors and hinting can be less optimal

- if none of this works, which is the case if no entry for the composed glyph is provided, i.e. no information is available on how to deal with the situation, the characters get overlaid due to the fact that an accent has either width zero or some fixed width (fonts are not consistent in this)

- of course a font renderer can apply some heuristics, e.g. centering the accent over the base character

- in addition, operating systems often use technologies where, if a font has no entry, a glyph from another font is taken

- in situations where ligaturing is involved (nb: an accented character is not a ligature) things can be more complex, as each component of the ligature can get its own marks (for instance in arabic scripts)

- some languages have stacked marks, for example vietnamese, so there we run into base to mark and mark to mark situations (given that no precomposed glyph is present)
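
The composed/decomposed distinction at the character level (not the font level) can be tried out with a small script. What follows is only a sketch, and Python with its standard unicodedata module is an assumption made for illustration; the thread itself names no tool. The collapsing step at the input stage is roughly what Unicode calls NFC normalization, and NFD gives the decomposed sequence; what a particular font or renderer does internally may well differ. The helper name describe is made up for this example.

```python
import unicodedata

def describe(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

# Decomposed input: u followed by a combining diaeresis.
decomposed = "u\u0308"

# Collapsing at the input stage, approximated here by NFC normalization.
composed = unicodedata.normalize("NFC", decomposed)

# Going back to base character plus mark, approximated by NFD.
redecomposed = unicodedata.normalize("NFD", composed)

# Stacked marks: vietnamese e with circumflex and dot below (U+1EC7)
# decomposes into a base character followed by two marks.
stacked = unicodedata.normalize("NFD", "\u1ec7")

print(describe(decomposed))
print(describe(composed))
print(describe(stacked))
```

The stacked case yields one base character and two combining marks, which is the kind of sequence where a font without a precomposed glyph needs both base to mark and mark to mark positioning.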


Now to operating systems (just some personal observations):

- windows: the font rendering technology (volt, cleartype, etc) is quite good given that a decent font is used; in xp one had to turn on cleartype explicitly

- osx: no issues (apart from occasional issues in the built in pdf renderer); there is some apple font technology but I think it's being phased out in favor of generic opentype

- linux: the technology is okay, but not always applied / configured right; one of the things i like about (x)ubuntu is that right from the start they got this right, i.e. enabled anti-aliasing and other features as well as chose fonts that render okay (so, in case of doubt about the quality, just check the settings)

microsoft and apple have some advantage here as they are behind the current font technologies (truetype and opentype)

so: rendering is not so much os related, but more a matter of using the right fonts and setting up the machinery right; of course a high res screen helps too

> My system composes at keyboard entry level. I hit Compose,
> `a`, and `^`, and a genuine `â` appears, no matter which
> program is asking for input.

it might be less optimal for chinese, korean or arabic (more font dependent as well as renderer dependent); for arabic one can often see the font machinery in action in real time when one keys in characters, because sequences of characters are turned into combined shapes that need relative (vertical and horizontal) positioning as well as mark anchoring

> To produce the second, decomposed, one in my post I had
> to remind myself of the Unicode for combining circumflex
> by consulting a document I wrote in August 2011 (revised
> thanks to the present discussion and appended, helpful
> comments welcome).

such documents are actually good tests for checking support of characters in an editor
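
For such a test it can help to look a mark up by name instead of remembering its number, and to check whether a given string is precomposed or carries separate marks. A minimal sketch, again assuming Python and its standard unicodedata module (not something used in this thread); has_combining_marks is a hypothetical helper name.

```python
import unicodedata

# Look up the code point by its Unicode name instead of remembering it.
circumflex = unicodedata.lookup("COMBINING CIRCUMFLEX ACCENT")

def has_combining_marks(text):
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)

precomposed = "\u00e2"         # a with circumflex as a single code point
decomposed = "a" + circumflex  # a followed by the combining circumflex

print("U+%04X" % ord(circumflex))
print(has_combining_marks(precomposed), has_combining_marks(decomposed))
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings should look the same on screen when the editor and font handle marks properly; a visible difference between them points at the rendering side, not at the text.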


Hans



-----------------------------------------------------------------
                                          Hans Hagen | PRAGMA ADE
              Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
    tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
                                             | www.pragma-pod.nl
-----------------------------------------------------------------
