Fix Phobos dependencies on autodecoding
Gregor Mückl
gregormueckl at gmx.de
Thu Aug 15 19:05:32 UTC 2019
On Thursday, 15 August 2019 at 11:02:54 UTC, Argolis wrote:
> From the examples above, most of the time you simply need
> opaque memory management, so decaying the
> string/dstring/wstring to a binary blob, but that's not string
> processing
This is the point we're trying to get across to you: this isn't
sufficient. Depending on the context and the script/language, you
need access to the string at various levels. E.g. a font renderer
sometimes needs to iterate over code points rather than graphemes
in order to compose the correct glyphs.
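To make the levels concrete, here is a minimal sketch in D
(byGrapheme lives in std.uni, walkLength in std.range; the counts
assume well-formed UTF-8):

    import std.uni;
    import std.range;

    void main()
    {
        // "o" followed by a combining diaeresis
        string s = "o\u0308";
        assert(s.length == 3);                // UTF-8 code units
        assert(s.walkLength == 2);            // code points (via autodecoding)
        assert(s.byGrapheme.walkLength == 1); // one grapheme
    }

The same buffer has three different lengths depending on which
level you ask at, and each level is the right answer for some task.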
Binary blob comparisons for comparing strings are *also* not
sufficient, again depending on both script/language of the text
in the string and the context in which the comparison is
performed. If the comparison is to be purely semantic, the
following strings should be equal: "\u00f6" and "\u006f\u0308".
They both represent the same "Latin Small Letter O with
Diaeresis". Their in-memory representations clearly aren't equal,
so a memcmp won't yield the correct result. The same applies to
sorting.
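In D terms, a semantic comparison has to run both sides through
std.uni's normalize first; a small sketch (NFC chosen here, either
canonical form would do):

    import std.uni;

    void main()
    {
        string precomposed = "\u00F6";       // ö as a single code point
        string decomposed  = "\u006F\u0308"; // o + combining diaeresis
        assert(precomposed != decomposed);   // raw bytes differ
        assert(normalize!NFC(precomposed) ==
               normalize!NFC(decomposed));   // semantically equal
    }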
If you decide to force a specific string normalization
internally, you put the burden on the user to explicitly select a
different normalization whenever they require one. Worse, there is
no way to perfectly reconstruct the binary representation of the
input string if it was given in a non-normalized form (say, a mix
of NFC and NFD). Once such a string has been through a
normalization algorithm, the exact input is unrecoverable. This
makes interfacing with other code that has idiosyncrasies around
all of this difficult or outright impossible.
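A small sketch of that lossiness, using a made-up input that mixes
NFC and NFD forms of the same character:

    import std.uni;

    void main()
    {
        // First ö precomposed (NFC), second one decomposed (NFD)
        string original = "\u00F6\u006F\u0308";
        string canon = normalize!NFC(original);
        assert(canon == "\u00F6\u00F6"); // both precomposed now
        assert(canon != original);       // the original byte mix is gone for good
    }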
One such system that I worked on in the past was a small embedded,
microcontroller-driven HCI module with very limited capabilities,
but with the requirement to be multilingual. I carefully worked
out that for the languages that were required, a UTF-8 encoding
with a very specific normalization would just about work. This
choice was only viable because the user interface was created in a
custom tool where I could control the code and data generation
just enough to make it work.
Another case where normalization is troublesome is ligatures.
Purely stylistic ligatures like "ff", "fi", "ffi", "ffl" and "st"
have their own code points, yet whether to use them is entirely a
stylistic choice. So in terms of the contained text, the ligature
\ufb00 is equal to the string "ff", but it is not the same
grapheme. Whether you can normalize this away depends on the
context: the user may have selected the ligature representation
deliberately to have it appear as such on screen. If, on the other
hand, you want to do spell checking, you need to resolve the
ligature to its individual letters.
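Unicode models exactly this split as canonical vs. compatibility
normalization; roughly, in D:

    import std.uni;

    void main()
    {
        string lig = "\uFB00"; // LATIN SMALL LIGATURE FF
        // Canonical normalization preserves the user's ligature choice
        assert(normalize!NFC(lig) == "\uFB00");
        // Compatibility normalization folds it to plain letters,
        // which is what a spell checker wants
        assert(normalize!NFKC(lig) == "ff");
    }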
And then there is Hangul: this is a prime example of a writing
system that is "weird" to westerners. It is based on 40 symbols
(19 consonants, 21 vowels) which aren't written individually, but
merged syllable by syllable into rectangular blocks of two or
three such symbols. These symbols are arranged in different
layouts depending on which symbols make up the syllable. As far as
I understand, this follows a clear algorithm. The result is
thousands of distinct syllable blocks that are actually written
(Unicode encodes 11,172 precomposed ones). Yet each of these is a
group of two or three letters and is read as such. So depending on
whether you're interested in individual letters or in syllables,
you need a different string representation for processing that
language.
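A sketch of that duality in D (the jamo code points below are the
standard canonical decomposition of U+D55C; whether byGrapheme
applies the full Hangul clustering rules is worth verifying
against your Phobos version):

    import std.uni;
    import std.range;

    void main()
    {
        string han = "\uD55C"; // 한, one precomposed syllable block
        // NFD splits the block into its individual jamo letters
        assert(normalize!NFD(han) == "\u1112\u1161\u11AB");
        // Grapheme segmentation still sees a single syllable either way
        assert(han.byGrapheme.walkLength == 1);
        assert(normalize!NFD(han).byGrapheme.walkLength == 1);
    }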
OK, these are all just examples that came to my mind while
brainstorming the question a little bit. However, none of us are
experts in language processing, so whatever examples we can come
up with are very likely just the very tip of the iceberg.
There is a reason why libraries like ICU give the user a lot of
control over string handling and expose many variants of their
functions depending on user intent and context. This design rests
on a body of expert knowledge that we don't have, but we know that
it is sound. Going against that wisdom is inviting trouble.
Autodecoding is an example of doing just that.