On Wed, 20 Mar 2019 at 07:39, Daurnimator <quae@daurnimator.com> wrote:
>
> On Tue, 19 Mar 2019 at 07:25, Roberto Ierusalimschy
> <roberto@inf.puc-rio.br> wrote:
> >
> > >  Roberto> Why is rejecting surrogates a backwards step?
> > >
> > > Rejecting surrogates is a forward step, that's not the problem.
> > >
> > > Accepting values over 10FFFF is the backward step.
> >
> > Did you read the documentation? By default the functions reject any
> > value over 10FFFF. They only accept these values if you give an explicit
> > parameter to that end. You explicitly say: I want invalid codes.
> > That, as others pointed out, may be useful for other purposes.
> >
> > If you want to accept invalid codes, it is not the lack of this
> > parameter that will stop you.
> >
> > (Again, did you read the documentation? Maybe that point is not
> > clear there?)

> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes above 10FFFF should really be an error IMO.
>
> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.
>
> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/

The Unicode and UTF-8 standards have changed over time. Who can say
that they will forever be frozen as they stand now?

On the other hand, non-negative 4-byte integers will always be
non-negative 4-byte integers.

The change to the utf8 library offers a one-to-one conversion
algorithm between non-negative four-byte integers and an encoding that
translates seven-bit integers to themselves and other integers to two
or more bytes. For historical reasons, and because of its ability to
translate valid Unicode to and from valid UTF-8, this library is
called utf8. It is by no means the only application of the encoding,
any more than counting sheep is the only application of integers.
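
As a rough illustration of the encoding described above, here is a
minimal sketch in Python. It is not the Lua utf8 library's code: the
names encode and decode are hypothetical, and rejecting overlong forms
in decode is an assumption made here so that the mapping stays
one-to-one. It covers the integers 0 through 7FFFFFFF using one to six
bytes, and agrees with standard UTF-8 on valid Unicode code points.

```python
# Smallest value that does NOT fit in 1, 2, ..., 6 bytes.
_LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
# Marker bits of the leading byte for each sequence length.
_LEADS = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode(n: int) -> bytes:
    """Encode an integer in 0..0x7FFFFFFF as 1 to 6 bytes."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    length = next(i for i, lim in enumerate(_LIMITS, start=1) if n < lim)
    if length == 1:
        return bytes([n])  # seven-bit integers map to themselves
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (n & 0x3F))  # continuation byte: 10xxxxxx
        n >>= 6
    out.append(_LEADS[length - 1] | n)  # leading byte carries the top bits
    return bytes(reversed(out))


def decode(b: bytes) -> int:
    """Decode exactly one sequence produced by encode()."""
    if not b:
        raise ValueError("empty input")
    first = b[0]
    if first < 0x80:
        length, n = 1, first
    else:
        # The number of leading 1 bits gives the sequence length.
        length, mask = 0, 0x80
        while first & mask:
            length += 1
            mask >>= 1
        if not 2 <= length <= 6:
            raise ValueError("invalid leading byte")
        n = first & (mask - 1)
    if len(b) != length:
        raise ValueError("wrong sequence length")
    for c in b[1:]:
        if c & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        n = (n << 6) | (c & 0x3F)
    # Assumption: reject overlong forms so each integer has one encoding.
    if length > 1 and n < _LIMITS[length - 2]:
        raise ValueError("overlong encoding")
    return n
```

With this sketch, encode(0x20AC) yields the same three bytes as
standard UTF-8 for the euro sign, encode(0x110000) yields four bytes
even though 110000 is not a valid Unicode code point, and
encode(0x7FFFFFFF) yields six bytes.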

-- Dirk