[ogb-discuss] RFC: Emancipation Community
John Sonnenschein
johnsonnenschein at gmail.com
Tue May 13 09:17:48 PDT 2008
I don't mean to be rude; this discussion has plenty of value, but
perhaps ogb-discuss is not the right place for it.
On Tue, May 13, 2008 at 8:18 AM, Don Cragun <don.cragun at sun.com> wrote:
> >Date: Tue, 13 May 2008 16:54:16 +0200
> >From: Roland Mainz <roland.mainz at nrubsig.org>
>
> >
> >Joerg Schilling wrote:
> >> Don Cragun <don.cragun at sun.com> wrote:
> >> > >BTW: Regarding our talk... I checked the POSIX standard and it turns out
> >> > >that od(1) support for UTF-8 "chars" is fully optional. There is no need to
> >> > >support it.
> >> >
> >> > >Jörg
> >> >
> >> > Joerg,
> >> > This is only partly true.
> >>
> >> Please also comment on Roland's claim that UNICODE is not a lossless coding.
> >> Roland mentioned this recently without giving evidence.
>
> Joerg,
> In addition to the comments Roland made below, there are also a
> lot of "private" character sets that contain characters (e.g., the AT&T
> deathstar logo, the Sun logo, etc.) that do not appear in any ISO
> standard character set. Also, just as new English words are created
> every year, new ideographs appear in the languages that use ideographic
> character sets. These ideographs may be used for a long time before
> they are included in a UNICODE revision (and when the new ideographs
> represent children's names, they may never be included).
>
> - Don
>
>
>
> >
> >There wasn't enough time during our meeting to show the problem in
> >detail...
> >
> >> I can hardly believe that the 21 bit coding used by UNICODE still has problems
> >> to map other codings. UNICODE has been designed to be a lossless coding....
> >
> >... I'll try to keep it short: Some encodings (e.g. ISO-2022) can define
> >the language being used in the following characters (similar to the
> >xml:lang="<lang>" attribute in XML). Since Unicode folds some characters which
> >are shared between languages to one codepoint (search for
> >"han-unification") this information is lost[1], making Unicode not 100%
> >lossless. This sounds trivial, but it results in some unhappy && nasty issues
> >when users mix text from multiple languages. One of the "harmless"
> >things is that browsers will choose fonts based on the language being
> >used, which may lead to issues like a Japanese font being used for a
> >single lonely character in the middle of an otherwise completely Chinese
> >text... and vice versa... (and if you've followed the history of both
> >countries in the last >= 1500 years you may realise that they don't like
> >that much...). This is unfortunate for languages where the matching countries
> >are hyper-picky about their characters (note: That's an understatement).
> >
> >[1]=Technically there are language-selector characters in a block
> >outside the BMP (= Basic Multilingual Plane) but I'm not sure whether
> >they are really intended for this use - at least the existing converters
> >do not use them and I can't find a standard (or even draft) which
> >defines their usage. In short: The situation is stuck badly in the mud.
> >
> >If you want the long story, ask in i18n-discuss@; AFAIK Ienup can explain
> >all the details better than I can...
> >
> >----
> >
> >Bye,
> >Roland
>
>
>
--
PGP Public Key 0x437AF1A1
Available on hkp://pgp.mit.edu