standard ranges

Jonathan M Davis jmdavisProg at gmx.com
Wed Jun 27 14:00:22 PDT 2012


On Wednesday, June 27, 2012 22:54:28 Gor Gyolchanyan wrote:
> On Wed, Jun 27, 2012 at 10:42 PM, Jonathan M Davis <jmdavisProg at gmx.com> wrote:
> > On Wednesday, June 27, 2012 22:29:25 Gor Gyolchanyan wrote:
> >> Agreed. Having struct strings (with slices and everything) will set
> >> the record straight.
> > 
> > Except that they couldn't have slicing, because it would be very
> > inefficient. You'd have to get at the actual array of code units to slice
> > anything. A struct string type would have to be restricted to exactly the
> > same set of operations that range-based functions consider strings to
> > have and then give you a way to get at the underlying code unit
> > representation to be able to use it when special-casing for strings for
> > efficiency, just like you do now.
> > 
> > You _can't_ get away from the fact that you're dealing with an array (or
> > list or whatever) of code units even if you do want to operate on it as a
> > range of code points most of the time. Having a struct would fix the
> > issues like foreach iterating over char by default whereas range-based
> > functions iterate over dchar - it would make it consistent by making it
> > dchar for everything - but the issue of code unit vs code point still
> > remains and you can't get rid of it. Anyone wanting to write efficient
> > string-processing code _needs_ to understand Unicode. There's no way
> > around it (which is part of the reason that Walter isn't keen on the idea
> > of changing how strings work in the language itself).
> > 
> > So, while having a string type which is a struct does help eliminate the
> > schizophrenia, the core problem of code unit vs code point is still there,
> > and you still need to understand it. There is no fix for it, because it's
> > intrinsic to how Unicode works.
> > 
> > - Jonathan M Davis
> 
> Yes, you can get away from it. The struct string would have ubyte[],
> ushort[], and uint[] as the representation. Maybe even char[], wchar[],
> and dchar[], but those won't be strings as we know them now. The string
> struct will take care of encoding 100% transparently and will provide
> access to the representation, which is good for bit blitting and other
> encoding-agnostic operations, but the representation is then known NOT
> to be a valid string and will need to be placed back into the string
> struct in order to use string operations.
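
To make that concrete, here's a minimal sketch of the kind of struct string 
being discussed (my own illustration, not a worked-out design; the String 
name and the "weiß" literal are just examples):

import std.range;
import std.stdio;
import std.utf;

// Hypothetical struct string, for illustration only: iteration is
// consistently by dchar, and the code units are only reachable
// through an explicit accessor.
struct String
{
    private immutable(char)[] data; // UTF-8 code units

    @property bool empty() const { return data.length == 0; }

    @property dchar front() const
    {
        size_t index = 0;
        return decode(data, index); // std.utf.decode
    }

    void popFront()
    {
        data = data[stride(data, 0) .. $]; // std.utf.stride
    }

    // The escape hatch for efficient, encoding-aware code.
    @property immutable(ubyte)[] representation() const
    {
        return cast(immutable(ubyte)[]) data;
    }
}

void main()
{
    // The current schizophrenia: foreach over a built-in string
    // defaults to char, while range-based code sees dchar.
    string builtin = "weiß"; // 4 code points, 5 UTF-8 code units
    foreach (c; builtin)     // typeof(c) is immutable(char)
        writef("%d ", c);    // prints the 5 code unit values
    writeln();
    static assert(is(ElementType!string == dchar));

    // The struct makes it dchar everywhere...
    auto s = String("weiß");
    foreach (c; s)           // typeof(c) is dchar
        write(c);
    writeln();
    writeln(s.walkLength);   // 4 code points

    // ...but the code units don't go away.
    writeln(s.representation.length); // 5
}

Even with iteration consistently in dchar, the representation is still an 
array of code units, and any code that wants to be efficient has to go 
through it.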

If you want efficient strings, you _must_ worry about the encoding. It's 
_impossible_ for it to be otherwise. It helps quite a bit if you're using 
functions that someone else already wrote which take this into account rather 
than having to write them yourself, but if you're doing much in the way of 
string processing, you _must_ understand Unicode in order to handle strings 
properly. I fully understand that it's something that most people don't want 
to have to worry about, but the reality is that they can't avoid it unless 
they don't care about efficiency. The fact that strings are variable-length 
encoded has a huge impact on how they need to be used if you care about both 
correctness and efficiency. You can't escape it.
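
To put that in concrete terms, a quick sketch (the "über" literal is just an 
example; the functions are plain Phobos):

import std.array;
import std.range;
import std.stdio;

void main()
{
    string s = "über"; // 4 code points, 5 UTF-8 code units ('ü' is 2)

    // Code unit operations are O(1), but they see bytes, not characters:
    // s[1] is the second byte of 'ü', not 'b'.
    writefln("%d", s[1]);

    // Slicing is O(1) too, but the indices are code unit offsets, so this
    // is correct only because index 2 happens to start a code point.
    writeln(s[2 .. $]); // "ber"

    // Anything operating on code points has to decode as it goes: O(n).
    writeln(s.walkLength); // 4, unlike s.length, which is 5
    s.popFront();          // pops one dchar, i.e. two code units here
    writeln(s.length);     // 3
}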

- Jonathan M Davis

