[review] new string type

Tue Nov 30 10:34:50 PST 2010

On Tuesday, November 30, 2010 07:48:03 Steven Schveighoffer wrote:
> In a prior thread, I promised to create a narrow string type that would
> enforce the requirements of narrow strings.  That is, the new string type
> should respect the encoding of narrow strings.
> 
> Here is a rough draft, tested minimally, but it does compile and pass
> simple tests.  It's pretty simple, which is what I would expect.  I copied
> a lot of stuff from std.array to get this to work.
> 
> The point of this type is -- if we replace what the compiler considers
> "strings" with this type, then we get both the compiler *and* phobos
> agreeing as to what this type is:  A bi-directional range of dchar.
> 
> As a bonus, char[] and wchar[] now would become arrays and be manipulated
> consistently with other arrays, which if not done correctly could cause
> problems, but may provide more flexibility and opportunity for
> performance.  Instead of the library fighting you on it.
> 
> Anyways, here it is, released under the boost license, commence attack ;)
> 
> 
> // Written in the D programming language.
> 
> /**
> Copyright: Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010
> License:   <a href="http://www.boost.org/LICENSE_1_0.txt">Boost License
> 1.0</a>.
> Authors:   $(WEB erdani.org, Andrei Alexandrescu, Steven Schveighoffer)
> 
> Copyright Andrei Alexandrescu and Steven Schveighoffer 2008-2010.
> Distributed under the Boost Software License, Version 1.0.
>     (See accompanying file LICENSE_1_0.txt or copy at
>           http://www.boost.org/LICENSE_1_0.txt)
> */
> import std.utf;
> import std.traits;
> 
> struct string_t(T) if (is(Unqual!T == char) || is(Unqual!T == wchar))
> {
>      private T[] _data;
>      this(T[] data)
>      {
>          this._data = data;
>      }
> 
>      // note, this assumes that idx is a valid index
>      private size_t _charStart(size_t idx) const
>      {
>          static if(is(Unqual!T == wchar))
>          {
>              immutable c = _data.ptr[idx];
>              if(c >= 0xDC00 && c <= 0xDFFF)
>              {
>                  // surrogate pair detected, verify we have at least 2
> wchars,
>                  // and that both wchars are properly encoded.
>                  assert(idx > 0, "Invalid UTF character at beginning of
> string");
>                  return idx-1;
>              }
>              else
>                  return idx;
>          }
>          else
>          {
>              const p = _data.ptr + idx;
>              if ((p[0] & 0b1100_0000) != 0b1000_0000)
>              {
>                  return idx;
>              }
>              else if (idx >= 1 && (p[-1] & 0b1100_0000) != 0b1000_0000)
>              {
>                  return idx - 1;
>              }
>              else if (idx >= 2 && (p[-2] & 0b1100_0000) != 0b1000_0000)
>              {
>                  return idx - 2;
>              }
>              else if (idx >= 3 && (p[-3] & 0b1100_0000) != 0b1000_0000)
>              {
>                  return idx - 3;
>              }
>              else
>              {
>                  assert(false, "Invalid UTF character in string");
>              }
>          }
>      }
> 
>      void popFront()
>      {
>          auto nc = std.utf.stride(_data, 0);
>          assert(nc <= _data.length && nc != 0xFF, "Invalid sequence at
> beginning of string");
>          _data = _data[nc .. $];
>      }
> 
>      void popBack()
>      {
>          immutable n = _data.length;
>          assert(n, "Attempting to pop back of an empty string");
>          _data = _data.ptr[0.._charStart(n-1)];
>      }
> 
>      @property dchar front() const
>      {
>          assert(_data.length, "Attempting to fetch the front of an empty
> string");
>          size_t i = 0;
>          return decode(_data, i);
>      }
> 
>      @property dchar back() const
>      {
>          immutable n = _data.length;
>          assert(n, "Attempting to fetch the back of an empty string");
>          auto idx = _charStart(n-1);
>          return std.utf.decode(_data, idx);
>      }
> 
>      @property bool empty() const
>      {
>          return !_data.length;
>      }
> 
>      @property typeof(this) save()
>      {
>          return this;
>      }
> 
>      // support read-only random access via code unit index.
>      dchar opIndex(size_t idx)
>      {
>          idx = _charStart(idx);
>          return std.utf.decode(_data, idx);
>      }
> 
>      string_t opSlice()
>      {
>          return this;
>      }
> 
>      string_t opSlice(size_t start, size_t end)
>      {
>          if(start != _data.length)
>              start = _charStart(start);
>          if(end != _data.length)
>              end = _charStart(end);
>          return string_t(_data[start..end]);
>      }
> 
>      // note we don't call this length because length can be assumed to be
> the
>      // number of elements, which this isn't.
>      @property size_t codeUnits() const
>      {
>          return _data.length;
>      }
> 
>      // support append and concat
>      // TODO: need to support appending various types of strings to
> eachother
>      // (right now only same-type strings can be appended, or raw arrays)
>      ref string_t opOpAssign(string op, U)(U data) if (op == "~" && is(U ==
> string_t))
>      {
>          _data ~= data._data;
>          return this;
>      }
> 
>      ref string_t opOpAssign(string op, U)(U data) if (op == "~" && !is(U
> == string_t) && is(typeof(_data ~= U.init)))
>      {
>          _data ~= data;
>          return this;
>      }
> 
>      string_t opBinary(string op, U)(U data) if (op == "~" && is(U ==
> string_t))
>      {
>          return string_t(_data ~ data._data);
>      }
> 
>      string_t opBinary(string op, U)(U data) if (op == "~" && !is(U ==
> string_t) && is(typeof(_data ~ U.init)))
>      {
>          return string_t(_data ~ data);
>      }
> }
> 
> template string_t(T) if (is(Unqual!T == dchar))
> {
>      alias T[] string_t;
> }
> 
> /** begin test code **/
> import std.stdio;
> 
> alias string_t!(immutable char) mystring;
> alias string_t!(immutable wchar) mywstring;
> alias string_t!(immutable dchar) mydstring;
> 
> void main()
> {
>      auto str = mystring("hello");
>      str ~= " world";
>      str ~= mystring("!!!");
>      writeln(str._data);
> }

1. At least until universal function syntax is in the language, you use the 
ability to do str.func().

2. Functions that would work just fine treating strings as arrays of code units 
(presumably because they don't care about what the actual data is) lose 
efficiency, because now a string isn't an array.

3. You have no access to the underlying array unless you're dealing with an 
actual array of dchar.

4. Indexing is no longer O(1), which violates the guarantees of the index 
operator.

5. Slicing (other than a full slice) is no longer O(1), which violates the 
guarantees of the slicing operator.

What you're doing here is forcing the view of a string as a range of dchar in 
all cases. Granted, that's what you want in most cases, but it can degrade 
efficiency, and the fact that some operations (in particular indexing and slicing) 
are not O(1) like they're supposed to be means that algorithms which rely on 
O(1) behavior from them could increase their cost by an order of magnitude. All 
the cases where treating a string as an actual array which are currently valid 
are left out to dry

The inherent problem with strings is that we _want_ to be able to view them as 
both arrays of code units and as ranges of dchar. Trying to view them only as 
ranges of dchar is inefficient in a number of cases. Even something as simple as 
getting a substring is less efficient with this code. Only in cases where you 
don't actually know the length of the substring that you want is this 
essentially as efficient as what we have now. You're ignoring all cases where 
viewing a string as an array of code units is correct and desirable. You're only 
solving half of the problem (albeit the more prevalent half).

It could very well be that we need a better way to handle when a string is 
treated as a range of dchar and when it is treated as an array of code units, 
but trying to treat them as ranges of dchar in all cases is not the right 
answer.

Now, making it possible to have a wrapper struct like this which works in many 
cases where you'd use strings could reduce errors in code in many situations, so 
giving the programmer the ability to do that could be interesting. But it seems 
to me that this solution is too limiting to be a viable replacement for strings 
as they are.

- Jonathan M Davis