Fixing std.string
Michael Rynn
michaelrynn at optusnet.com.au
Mon Aug 23 23:16:25 PDT 2010
On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:
> As I mentioned buried deep in another thread, std.string is in serious
> need of fixing, for two reasons:
>
> 1. Most of it doesn't work with UTF-16/UTF-32 strings.
>
> 2. Much of it requires the input to be immutable even when there's no
> good reason for this constraint.
>
> I'm trying to understand a few things before I dive into fixing it:
>
> 1. How did it get to be this way? Why did it seem like a good idea at
> the time to only support UTF-8 and only immutable strings?
>
> 2. Is there any "deep" design/technical issue that makes these hard to
> fix, or is it basically just lack of manpower and other priorities?
>
The problems are combinatorial: every function has to cope with each
encoding scheme (UTF-8, UTF-16, UTF-32) and with each mutability
qualifier.
I imagine that when someone wants a function that is missing from
std.string, they might write one, and might even add to it.
I also found std.utf to not contain exactly what I needed.
The functions toUTF16 and toUTF8 have signatures like

    wstring toUTF16(const(dchar)[] s)
But when hacking a class I found I wanted functions that
would almost have the very same innards, but could also append mutable
character arrays of any sort.
    // Does almost the same as toUTF16, but creates or appends to a
    // mutable array.
    void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}
At the expense of another nested function call, which I imagine most
people would not want to pay, toUTF16 becomes a call to append_UTF16m.
    wstring toUTF16(const(dchar)[] s)
    {
        wchar[] temp = null;
        append_UTF16m(temp, s);
        return assumeUnique(temp);
    }
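One way to keep the combinations down is to write the appending helper once as a template. The following is only a sketch under my own assumptions: appendEncoded and toUTF16Sketch are hypothetical names, not the append_UTF16m from my code, and it relies on std.utf.encode having overloads that append to char[], wchar[] and dchar[].

```d
import std.exception : assumeUnique;
import std.utf : encode;

// Hypothetical generic helper (not the original append_UTF16m).
// Appends the text in s, re-encoded, to the mutable buffer r.
// Dst and Src may each be char, wchar or dchar.
void appendEncoded(Dst, Src)(ref Dst[] r, const(Src)[] s)
{
    // foreach with a dchar loop variable decodes s one code point at a time.
    foreach (dchar c; s)
        encode(r, c);   // std.utf.encode appends the encoded code point to r
}

// The immutable-returning conversion is then a thin wrapper.
wstring toUTF16Sketch(const(dchar)[] s)
{
    wchar[] temp;
    appendEncoded(temp, s);
    return assumeUnique(temp);  // temp has no other references
}

void main()
{
    wchar[] w;
    appendEncoded(w, "héllo");          // UTF-8 source, UTF-16 buffer
    assert(w == "héllo"w);

    char[] c = "a".dup;
    appendEncoded(c, "é→"d);            // UTF-32 source, UTF-8 buffer
    assert(c == "aé→");

    assert(toUTF16Sketch("héllo"d) == "héllo"w);
}
```

The cost is the one already mentioned: the convenience function becomes a wrapper around another call.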
For me, isNumeric called for a parsing function instead: I was religiously
trying to use ranges, and I needed to know what sort of conversion function
to call afterwards. I know it's really simple-minded, but it did the
required job.
    enum NumberClass {
        NUM_ERROR = -1,
        NUM_EMPTY,
        NUM_INTEGER,
        NUM_REAL
    }
/// R is an input range, P is an output range (put).
/// Return a NumberClass value.
/// Collect characters in P for later processing.
/// Does no NAN or INF, only checks for error, empty, integer, or real.
/// E or e might be an exponent, or just the end of a number.
NumberClass
getNumberString(R, P)(R ipt, P opt, int recurse = 0 )
{
    int digitct = 0;
    bool done = ipt.empty;
    bool decPoint = false;
    for(;;)
    {
        if (ipt.empty)
            break;
        auto test = ipt.front;
        ipt.popFront();
        switch(test)
        {
        case '-':
        case '+':
            if (digitct > 0)
            {
                done = true;
            }
            break;
        case '.':
            if (!decPoint)
                decPoint = true;
            else
                done = true;
            break;
        default:
            if (!isdigit(test))
            {
                done = true;
                if (test == 'e' || test == 'E')
                {
                    // Ambiguous end of number, or exponent?
                    if (recurse == 0)
                    {
                        opt.put(test);
                        if (getNumberString(ipt, opt, recurse+1)
                                == NumberClass.NUM_INTEGER)
                            return NumberClass.NUM_REAL;
                        else
                            return NumberClass.NUM_ERROR;
                    }
                    // assume end of number
                }
            }
            else
                digitct++;
            break;
        }
        if (done)
            break;
        opt.put(test);
    }
    if (digitct == 0)
        return NumberClass.NUM_EMPTY;
    if (decPoint)
        return NumberClass.NUM_REAL;
    return NumberClass.NUM_INTEGER;
}
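To show what the R and P parameters look like at a call site, here is a much smaller sketch of the same input-range/output-range pattern. copyDigits is a hypothetical stand-in, not getNumberString: it only copies leading ASCII digits. It assumes std.array.appender as the output range and a dstring as the input range.

```d
import std.array : appender;
import std.ascii : isDigit;
import std.range.primitives : empty, front, popFront;

// Hypothetical, simplified stand-in for the parser above.
// Copies leading decimal digits from the input range into the output range
// and returns how many were copied.
size_t copyDigits(R, P)(ref R ipt, ref P opt)
{
    size_t n = 0;
    while (!ipt.empty && isDigit(ipt.front))
    {
        opt.put(ipt.front);   // collect the character for later conversion
        ipt.popFront();
        ++n;
    }
    return n;
}

void main()
{
    auto src = "123abc"d;               // a dstring is an input range of dchar
    auto buf = appender!(dchar[])();    // Appender acts as the output range
    auto n = copyDigits(src, buf);
    assert(n == 3);
    assert(buf.data == "123"d);
    assert(src == "abc"d);              // the unread remainder
}
```

This sketch takes both ranges by ref so the caller can see where parsing stopped; the function above takes them by value, which is a different trade-off.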
A string class.
http://dsource.org/projects/xmlp/trunk/alt/ustring.d
The component structures maintain a terminating null character and
pretend it is not there. It seemed a good idea at the time, when I was
doing a lot of Windows API calls which expected null-terminated C strings
of char or wchar. The UString class converts on access through cstr(),
wstr() or dstr(), on the assumption that the last-used form will be the
most frequently used, and ideally caches a decent hash value. I have only
a few limited uses of UString so far, because character arrays are so
powerful.
    struct cstext {
        char[] str_ = null;
        ...
    }
    struct wstext {
        wchar[] str_ = null;
        ...
    }
    struct dstext {
        dchar[] str_ = null;
        ...
    }
    class UString {
        private {
            union {
                vstruc vstr; // not fully supported?
                cstext cstr;
                wstext wstr;
                dstext dstr;
            }
            UStringType ztype;
            hash_t hash_;
        }
        ...
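The convert-on-access idea can be sketched in a few lines. This is not the actual UString: CachedText and Form are hypothetical names, and the sketch leaves out the terminating null and the cached hash. It keeps only one representation at a time in a tagged union and converts with std.utf when a different form is requested.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

enum Form { utf8, utf16, utf32 }

// Hypothetical simplification of the idea; not the real UString.
final class CachedText
{
    private
    {
        union
        {
            string  s8;
            wstring s16;
            dstring s32;
        }
        Form form;      // tells which union member is currently valid
    }

    this(string s) { s8 = s; form = Form.utf8; }

    string cstr()
    {
        final switch (form)
        {
        case Form.utf8:  break;                     // already UTF-8
        case Form.utf16: s8 = toUTF8(s16); break;
        case Form.utf32: s8 = toUTF8(s32); break;
        }
        form = Form.utf8;   // the last-used form is the one kept
        return s8;
    }

    wstring wstr()
    {
        final switch (form)
        {
        case Form.utf8:  s16 = toUTF16(s8); break;
        case Form.utf16: break;
        case Form.utf32: s16 = toUTF16(s32); break;
        }
        form = Form.utf16;
        return s16;
    }

    dstring dstr()
    {
        final switch (form)
        {
        case Form.utf8:  s32 = toUTF32(s8); break;
        case Form.utf16: s32 = toUTF32(s16); break;
        case Form.utf32: break;
        }
        form = Form.utf32;
        return s32;
    }
}

void main()
{
    auto t = new CachedText("añ");
    assert(t.wstr() == "añ"w);   // converts UTF-8 to UTF-16 and keeps that
    assert(t.dstr() == "añ"d);   // converts UTF-16 to UTF-32
    assert(t.cstr() == "añ");    // and back to UTF-8
}
```

Repeated calls to the same accessor cost nothing, while alternating between accessors converts every time, which is why the "last used will be most frequent" assumption matters.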