[review] new string type
Ali Çehreli
acehreli at yahoo.com
Fri Dec 3 14:11:35 PST 2010
spir wrote:
> What I have in mind is a "UText" type that provides the right abstraction
> for text processing / string manipulation, as one has when dealing with
> ASCII (in fact, any legacy character set). All that is needed is having a
> true one-to-one mapping between characters (in the common sense) and
> elements of strings (what I call "code stacks"); one given stack
> unambiguously denotes one character. To reach this point, in addition to
> decoding (e.g. from utf8 to code points), we must:
> * group codes into stacks
> * normalize (into 'NFD')
Are those operations independent of the context? Is stacking always desired?
I guess one would use one of the D string types when grouping or
normalization is not desired, right? Makes sense.
> * sorts points in stacks
Ok, I see that it is possible with NFD. I am not experienced with
Unicode, but I think there will be issues with other types of Unicode
normalization. (Judging from your posts, I know that you know all of
this. :) )
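
To make the three steps (decode, normalize into NFD, group into stacks) concrete, here is a minimal sketch. It is in Python rather than D only because Python's standard `unicodedata` module already provides NFD; it is not spir's implementation. The name `to_stacks` is made up for illustration, and "stack" is simplified here to "a code point followed by its combining marks", which is cruder than full Unicode grapheme clustering.

```python
import unicodedata

def to_stacks(text):
    """Normalize to NFD, then attach each combining mark to the
    code point that precedes it (a simplified 'code stack')."""
    stacks = []
    for ch in unicodedata.normalize("NFD", text):
        # combining() returns the canonical combining class;
        # 0 means the code point starts a new stack.
        if unicodedata.combining(ch) != 0 and stacks:
            stacks[-1] += ch
        else:
            stacks.append(ch)
    return stacks

precomposed = "\u00e7a"   # 'ç' as a single code point, then 'a'
decomposed = "c\u0327a"   # 'c' + COMBINING CEDILLA, then 'a'

# Both spellings end up as the same two stacks.
assert to_stacks(precomposed) == to_stacks(decomposed) == ["c\u0327", "a"]

# NFD also puts the marks inside one stack into canonical order:
# acute (class 230) typed before dot below (class 220) gets swapped.
assert to_stacks("a\u0301\u0323") == ["a\u0323\u0301"]
```

The last assertion is the "sort points in stacks" step: with NFD it comes for free from canonical ordering, so no separate sorting pass is needed in this sketch.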
> Then, we can for instance index or slice in O(1) as usual, and get a
> consistent substring of _characters_ [...] I do not want to deal with
> anything related to script-, language-, locale- specific issues.
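
A small sketch of what that indexing claim could look like. `StackedText` is a hypothetical stand-in for the proposed UText, written in Python for brevity; the point is only that once the stacks are stored as array elements, indexing and slicing work on whole characters, while slicing raw code points can cut a character in half.

```python
import unicodedata

class StackedText:
    """Hypothetical UText-like type: an array of code stacks."""

    def __init__(self, text):
        self.stacks = []
        for ch in unicodedata.normalize("NFD", text):
            # A nonzero combining class means "belongs to the previous stack".
            if self.stacks and unicodedata.combining(ch):
                self.stacks[-1] += ch
            else:
                self.stacks.append(ch)

    def __len__(self):
        return len(self.stacks)

    def __getitem__(self, index):
        picked = self.stacks[index]
        # A slice yields a list of stacks; join it back into one string.
        return "".join(picked) if isinstance(index, slice) else picked

raw = "re\u0301sume\u0301"      # "résumé" spelled with combining accents
text = StackedText(raw)

assert len(raw) == 8            # code points
assert len(text) == 6           # characters, as a reader counts them
assert text[1] == "e\u0301"     # one index, one whole character
assert text[:2] == "re\u0301"   # the slice keeps the accent
assert raw[:2] == "re"          # the code-point slice drops it
```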
Is the concept of _character_ well defined in Unicode outside the
context of an alphabet? (I think your "script" covers alphabet.)
It is an interesting decision when we actually want to see an array of
code points as characters. When would it be correct to do so? I think
the answer is when we start treating the string as a piece of text.
For a string to be considered as text, it must be based on an alphabet.
ASCII strings are pieces of text, because they are based on the
26-letter alphabet.
I hope I don't sound like I am arguing against anything that you said. I am
also thinking about the other common operations that work on pieces of text:
- sorting (e.g. ç is between c and d in many alphabets)
- lowercasing, uppercasing (e.g. i<->İ and ı<->I in many alphabets)
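
Both operations can be sketched with nothing more than an ordered list of the letters of the alphabet. The following is a Python illustration, not the code from our module; the names `tr_upper`, `tr_lower` and `tr_sort_key` are made up here, and the sort key assumes that every character of the word belongs to the Turkish alphabet.

```python
# The 29 letters of the Turkish alphabet, in alphabetical order.
TURKISH_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TURKISH_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

TO_UPPER = str.maketrans(TURKISH_LOWER, TURKISH_UPPER)
TO_LOWER = str.maketrans(TURKISH_UPPER, TURKISH_LOWER)
RANK = {letter: i for i, letter in enumerate(TURKISH_LOWER)}

def tr_upper(text):
    return text.translate(TO_UPPER)

def tr_lower(text):
    return text.translate(TO_LOWER)

def tr_sort_key(word):
    # Assumes every character of the word is a Turkish letter.
    return [RANK[c] for c in tr_lower(word)]

# Casing: i <-> İ and ı <-> I, unlike the default mapping.
assert tr_upper("ali") == "ALİ"
assert "ali".upper() == "ALI"
assert tr_lower("IŞIK") == "ışık"

# Sorting: ç goes between c and d, not after z.
words = ["dal", "çay", "cam"]
assert sorted(words, key=tr_sort_key) == ["cam", "çay", "dal"]
assert sorted(words) == ["cam", "dal", "çay"]
```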
As a part of the Turkish D community, we've played with the idea of such
a text type. It took advantage of D's support for Unicode encoded source
code, so it's fully in Turkish. Yay! :)
Here is the module that takes care of sorting, capitalization, and
producing the base forms of the letters of the alphabets:
http://code.google.com/p/trileri/source/browse/trunk/tr/alfabe.d
It is also based on dchar[], as you recommend elsewhere in this thread.
It is written with the older D2 operator overloading, doesn't support
ranges, etc. But it currently supports ten alphabets (including the
26-letter English, and the Old Irish alphabet).
Going out of the context of this thread, we've also worked on a type
that contains pieces of text from different alphabets to make a "text",
where a text like "jim & ali" is correctly capitalized as "JIM & ALİ".
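
A rough sketch of that idea, in Python for brevity and with made-up names (`upper_piece`, `upper_text`); it is not how our type is actually written. Here a "text" is simply a list of pieces, each tagged with the alphabet it belongs to, or with `None` for material such as punctuation that no alphabet should touch.

```python
# Each alphabet is a (lowercase letters, uppercase letters) pair.
ENGLISH = ("abcdefghijklmnopqrstuvwxyz",
           "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
TURKISH = ("abcçdefgğhıijklmnoöprsştuüvyz",
           "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ")

def upper_piece(alphabet, piece):
    """Uppercase one piece according to the rules of its own alphabet."""
    lower, upper = alphabet
    return piece.translate(str.maketrans(lower, upper))

def upper_text(pieces):
    """Uppercase a text made of (alphabet, piece) pairs.
    Pieces tagged None are passed through unchanged."""
    return "".join(
        piece if alphabet is None else upper_piece(alphabet, piece)
        for alphabet, piece in pieces
    )

text = [(ENGLISH, "jim"), (None, " & "), (TURKISH, "ali")]
assert upper_text(text) == "JIM & ALİ"
```

The alphabet tag has to come from somewhere outside the string itself, since the code points of "ali" alone do not say whether the i should become I or İ.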
I have in mind more than what you describe. But your string would be
useful for implementing ours, as we don't have normalization or stacking
support at all.
Thanks,
Ali