Random string samples & unicode - Reprise

Sun Sep 12 19:57:11 PDT 2010

On Sunday 12 September 2010 19:22:02 bearophile wrote:
> Andrei Alexandrescu:
> > No, you end up having string-processing code dealing with ranges of
> > dchar.
> 
> Well, in several situations it's better to produce a real string/dstring.
> Even in Haskell, which is designed to manage lazy computation well, you
> sometimes create eager lists/arrays to simplify the types or the code or
> to make the code more deterministic.

Personally, I've had to use strict functions rather than lazy ones in Haskell
primarily to save memory: forcing the program to actually do the computations
keeps it from deferring them and piling up the whole list of pending operations
in memory. When working on my thesis, I had a program which made me run out of
memory - all 4 GB of RAM and 6 GB of swap - because it wasn't processing _any_
of the files that I gave it until it had gotten the last one. I had to make it
process each file and save the result before processing the next file rather
than processing them all and then saving the results.
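
To make the "process each file and save the result before the next one" idea
concrete, here is a rough sketch in Python rather than Haskell. It is not the
thesis program: summarize() and process_incrementally() are made-up names, and
the word count is just a stand-in for whatever the real per-file work was.

```python
import os


def summarize(text):
    # Hypothetical stand-in for the real per-file computation.
    return len(text.split())


def process_incrementally(paths, out_path):
    """Handle one file at a time, writing each result before reading the next."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                result = summarize(f.read())
            # The result is saved right away, so nothing from this file
            # needs to stay in memory while the next one is processed.
            out.write(f"{os.path.basename(path)}\t{result}\n")
```

The point is only that peak memory is bounded by the largest single file
instead of growing with the number of files.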

> > If you want to keep the
> > comparison with Python complete, Python's support for Unicode also needs
> > to be part of the discussion.
> 
> Right. My code was written in Python 2.x. In Python 3.x the situation is
> different, all strings are Unicode by default (they are all UTF-16 or
> UTF-32 according to the way you have compiled CPython) (and there is a
> built-in bytearray, that is an array of bytes that in some situations is
> seen as an ASCII string). So in Python it's like using dstrings everywhere
> (in Python there's no char type, it's a string of length 1) or using lazy
> generators of them.

Well, then in comparing Python 3 with D, it would seem that you wouldn't
really lose anything by using dstrings everywhere. Sure, it's nice to be able
to save space by using string, but if it's a comparison between Python and D and
you end up using UTF-32 in both, then it doesn't seem to me that it's all that
big a deal when porting code. Now, in comparing Python 2 and D, that may be a
different issue, but it sounds like Python 2 strings aren't Unicode, which
could be problematic.

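The issues with UTF-8 vs UTF-32 and random access are just a natural side effect
of having all strings be Unicode: UTF-8 is variable-width, so finding the nth
character means scanning from the start, while UTF-32 is fixed-width and can be
indexed directly. And honestly, I _really_ don't want non-Unicode strings to be
at all normal in D. The fact that D forces Unicode is a _good_ thing.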
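
For anyone who wants to see the random-access trade-off spelled out, here is a
small Python 3 sketch (Python rather than D, purely for illustration). The
sample string and the two helper functions are made up for this example; only
the encodings themselves are standard.

```python
# Sample text mixing 1-, 2-, 3-, and 4-byte UTF-8 characters:
# "naïve €uro" followed by an emoji, written with escapes to avoid ambiguity.
s = "na\u00efve \u20acuro \U0001F600"

utf8 = s.encode("utf-8")        # variable width: 18 bytes for 12 code points
utf32 = s.encode("utf-32-le")   # fixed width: 4 bytes each, 48 bytes, no BOM


def nth_code_point_utf32(data, n):
    """Constant-time lookup: every code point occupies exactly 4 bytes."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    return data[4 * n:4 * n + 4].decode("utf-32-le")


def nth_code_point_utf8(data, n):
    """Linear-time lookup: scan from the start, counting lead bytes."""
    count = -1
    start = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte (0b10xxxxxx)
            count += 1
            if count == n:
                start = i            # first byte of the wanted code point
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")   # wanted code point was the last one
```

Both helpers return the same character for the same index; the difference is
that the UTF-8 one has to walk the bytes to get there, which is the price of
the smaller encoding.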

- Jonathan M Davis