Random string samples & unicode - Reprise
Jonathan M Davis
jmdavisprog at gmail.com
Sun Sep 12 19:57:11 PDT 2010
On Sunday 12 September 2010 19:22:02 bearophile wrote:
> Andrei Alexandrescu:
> > No, you end up having string-processing code dealing with ranges of
> > dchar.
>
> Well, in several situations it's better to produce a real string/dstring.
> Even in Haskell, that is designed to manage lazy computation well, you
> sometimes create eager lists/arrays to simplify the types or the code or
> to make the code more deterministic.
Personally, I've had to use strict functions rather than lazy ones in Haskell
primarily to save memory: forcing the program to actually do the computations
rather than putting them off and piling up the whole list of pending operations
in memory. When working on my thesis, I had a program which ran out of memory
(all 4 GB of RAM and 6 GB of swap) because it wasn't processing _any_ of the
files that I gave it until it had gotten the last one. I had to make it process
each file and save the result before processing the next file rather than
processing them all and then saving the results.
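
A rough sketch of that restructuring, in Python rather than Haskell. Python is eager, so this only illustrates the "process each file and save" shape, not Haskell's build-up of deferred computations. The summarize function and both driver names are hypothetical stand-ins, not the thesis code.

```python
def summarize(path):
    """Hypothetical per-file computation: count the lines in one file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def process_all_then_save(paths, out_path):
    # Every result stays in memory until the last file has been processed.
    results = [(p, summarize(p)) for p in paths]
    with open(out_path, "w", encoding="utf-8") as out:
        for p, n in results:
            out.write(f"{p}\t{n}\n")


def process_each_and_save(paths, out_path):
    # Each result is written as soon as it is computed, so only one
    # result is held at a time.
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paths:
            out.write(f"{p}\t{summarize(p)}\n")
```

Both functions write the same output; the difference is only in how much is held in memory before anything gets saved.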
> > If you want to keep the
> > comparison with Python complete, Python's support for Unicode also needs
> > to be part of the discussion.
>
> Right. My code was written in Python 2.x. In Python 3.x the situation is
> different, all strings are Unicode by default (they are all UTF-16 or UTF-32
> according to the way you have compiled CPython) (and there is a
> built-in bytearray, that is an array of bytes that in some situations is
> seen as an ASCII string). So in Python it's like using dstrings everywhere
> (in Python there's no char type, it's a string of length 1) or using lazy
> generators of them.
Well, in comparing Python 3 with D, it would seem that you wouldn't really lose
anything by using dstrings everywhere. Sure, it's nice to be able to save space
by using string, but if it's a comparison between Python and D and you end up
using UTF-32 in both, then it doesn't seem to me that it's all that big a deal
when porting code. Now, comparing Python 2 and D may be a different issue, but
it sounds like Python 2 strings aren't Unicode, which could be problematic.
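
To put a number on the space trade-off, here is a small Python 3 sketch that measures the same text in the three encodings that D's string, wstring and dstring use (UTF-8, UTF-16 and UTF-32). The helper name encoded_sizes is made up for illustration, and the "-le" codec variants are used only so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


print(encoded_sizes("hello"))  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
```

For ASCII-heavy text, UTF-32 costs four times the space of UTF-8; the gap narrows as more non-ASCII characters appear.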
The issues with UTF-8 vs UTF-32 and random access are just a natural side effect
of having all strings be Unicode. And honestly, I _really_ don't want
non-Unicode strings to be at all normal in D. The fact that D forces Unicode is
a _good_ thing.
- Jonathan M Davis