Fixing std.string

Mon Aug 23 23:16:25 PDT 2010

On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:

> As I mentioned buried deep in another thread, std.string is in serious
> need of fixing, for two reasons:
> 
> 1.  Most of it doesn't work with UTF-16/UTF-32 strings.
> 
> 2.  Much of it requires the input to be immutable even when there's no
> good reason for this constraint.
> 
> I'm trying to understand a few things before I dive into fixing it:
> 
> 1.  How did it get to be this way?  Why did it seem like a good idea at
> the time to only support UTF-8 and only immutable strings?
> 
> 2.  Is there any "deep" design/technical issue that makes these hard to
> fix, or is it basically just lack of manpower and other priorities?
> 

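The problems are combinatorial, because of the encoding schemes: every
function has to cope with char, wchar and dchar strings, and with mutable,
const and immutable variants of each.
I imagine that when someone wants a function that is missing from 
std.string, they might write one themselves, and might even add it to 
std.string.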

I also found that std.utf did not contain exactly what I needed.
The functions toUTF16 and toUTF8 have signatures like
wstring toUTF16(const(dchar)[] s).

But when hacking on a class I found I wanted functions that
would have almost the very same innards, but could also append to mutable 
character arrays of any sort.

// Does almost the same as toUTF16, but creates or appends a mutable
// array.

void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}

At the expense of another nested function call, which I imagine most 
people would not want to pay, toUTF16 becomes a call to append_UTF16m.

wstring toUTF16(const(dchar)[] s)
{
	wchar[] temp = null;
	append_UTF16m(temp, s);
	return assumeUnique(temp);
}
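
The body of append_UTF16m is elided above, so the following is only a
minimal sketch of the idea, not the original code. It assumes that
std.utf.encode is available with overloads for char[], wchar[] and dchar[]
buffers. The names appendUTF and toUTF16Sketch are hypothetical. Making the
appender a template over the target character type is one way to avoid
writing the same loop three times.

```d
import std.exception : assumeUnique;
import std.utf : encode;

/// Hypothetical helper: decode nothing, just re-encode each code point of s
/// and append it to the mutable buffer r (created on first append).
void appendUTF(C)(ref C[] r, const(dchar)[] s)
    if (is(C == char) || is(C == wchar) || is(C == dchar))
{
    foreach (dchar c; s)
        encode(r, c);   // appends 1..4 code units depending on C and c
}

/// The immutable-returning conversion becomes a thin wrapper.
wstring toUTF16Sketch(const(dchar)[] s)
{
    wchar[] temp;
    appendUTF(temp, s);
    // Safe only because no other reference to temp escapes this function.
    return assumeUnique(temp);
}
```

With this shape, callers who want a mutable result call the appender
directly, and only the wrapper pays for the extra call.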

For isNumeric, though, I needed a parsing function that works on ranges 
(I was religiously trying to use ranges) and that tells me what sort of 
conversion function to call afterwards. I know it's really simple-minded, 
but it did the required job.

enum NumberClass {
	NUM_ERROR = -1,
	NUM_EMPTY,
	NUM_INTEGER,
	NUM_REAL
}

/// R is an input range, P is an output range (put).
/// Returns a NumberClass value.
/// Collects characters in P for later processing.
/// Does not handle NaN or INF; only checks for error, empty, integer, or real.
/// E or e might be an exponent, or just the end of a number.

NumberClass
getNumberString(R, P)(R ipt, P opt, int recurse = 0 )
{
int   digitct = 0;
bool  done = ipt.empty;
bool  decPoint = false;
for(;;)
{
  if (ipt.empty)
    break;
  auto test = ipt.front;
  ipt.popFront;
  switch(test)
  {
  case '-':
  case '+':
    if (digitct > 0)
    {
      done = true;
    }
    break;
  case '.':
    if (!decPoint)
      decPoint = true;
    else
      done = true;
    break;
  default:
    if (!isdigit(test))
    {
      done = true;
      if (test == 'e' || test == 'E')
      {
        // Ambiguous end of number, or exponent?
        if (recurse == 0)
        {
          opt.put(test);
          if (getNumberString(ipt, opt, recurse+1) == NumberClass.NUM_INTEGER)
            return NumberClass.NUM_REAL;
          else 
            return NumberClass.NUM_ERROR;
        }
        // assume end of number
      }
    }
    else
      digitct++;
    break;
  }
  if (done)
    break;
  opt.put(test);
}

if (digitct == 0)
	return NumberClass.NUM_EMPTY;
if (decPoint)
	return NumberClass.NUM_REAL;
return NumberClass.NUM_INTEGER;
}
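
To illustrate the "know what sort of conversion function to call
afterwards" part, here is a small sketch of the dispatch step. It is not
from the original code: the Number struct and convertNumber are
hypothetical, and the enum is repeated only so that the sketch compiles on
its own. It assumes the collected characters are handed over as a char
slice and that std.conv.to does the actual conversion.

```d
import std.conv : to;

// Repeated from above so this sketch is self-contained.
enum NumberClass {
    NUM_ERROR = -1,
    NUM_EMPTY,
    NUM_INTEGER,
    NUM_REAL
}

/// Hypothetical result holder: either an integer or a real value.
struct Number
{
    bool   isReal;
    long   i;
    double r;
}

/// Picks the conversion based on the class reported by the parser.
/// Returns false when there is nothing valid to convert.
bool convertNumber(NumberClass nc, const(char)[] text, out Number result)
{
    final switch (nc)
    {
    case NumberClass.NUM_INTEGER:
        result.i = to!long(text);
        return true;
    case NumberClass.NUM_REAL:
        result.isReal = true;
        result.r = to!double(text);
        return true;
    case NumberClass.NUM_EMPTY:
    case NumberClass.NUM_ERROR:
        return false;
    }
}
```

The point is that the parser only classifies and collects; the expensive
or lossy conversion is chosen once, after the class is known.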

A string class.
http://dsource.org/projects/xmlp/trunk/alt/ustring.d

The component structures maintain a terminating null character and 
pretend it is not there. It seemed a good idea at the time, when I was 
making a lot of Windows API calls that expected null-terminated C strings 
of char or wchar. The UString class does conversions on accessing cstr(), 
wstr() or dstr(), on the assumption that the last-used encoding will be 
the most frequent, and ideally caches a decent hash value.  I only have 
some limited uses of UString so far, because character arrays are so 
powerful.
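
The hidden terminator idea can be sketched as follows. This is not the
code from ustring.d, whose details are elided below; ZText and its members
are hypothetical names, and the sketch only covers the char case.

```d
/// Hypothetical sketch: a char buffer that always keeps a trailing '\0'
/// but never shows it to D code.
struct ZText
{
    private char[] buf_;   // ends in '\0' whenever it is non-empty

    void append(const(char)[] s)
    {
        if (buf_.length > 0)
        {
            buf_ = buf_[0 .. $ - 1];   // drop the old terminator
            buf_.assumeSafeAppend();   // we own buf_, so reuse its capacity
        }
        buf_ ~= s;
        buf_ ~= '\0';                  // restore the terminator
    }

    /// The text as D code sees it, without the terminator.
    const(char)[] text() const
    {
        return buf_.length > 0 ? buf_[0 .. $ - 1] : null;
    }

    /// Pointer suitable for C APIs that expect a null-terminated string.
    const(char)* ptr() const
    {
        return buf_.length > 0 ? buf_.ptr : "".ptr;
    }
}
```

The cost is one extra code unit per string and some care on every
mutation; the gain is that handing the string to a C API needs no copy.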

struct cstext {
	char[]	  str_ = null;
...
}

struct wstext {
	wchar[]	  str_ = null;
...
}

struct dstext {
	dchar[]	  str_ = null;
...
}

class UString {

	private {
		union {
			vstruc vstr; // not fully supported?
			cstext cstr;
			wstext wstr;
			dstext dstr;
		}

		UStringType ztype;
		hash_t		hash_;
	}

	...