VLERange: a range in between BidirectionalRange and RandomAccessRange

spir denis.spir at gmail.com
Tue Jan 11 06:23:00 PST 2011


On 01/11/2011 02:30 PM, Steven Schveighoffer wrote:
> On Mon, 10 Jan 2011 22:57:36 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> wrote:
>
>> I've been thinking on how to better deal with Unicode strings.
>> Currently strings are formally bidirectional ranges with a
>> surreptitious random access interface. The random access interface
>> accesses the support of the string, which is understood to hold data
>> in a variable-encoded format. For as long as the programmer
>> understands this relationship, code for string manipulation can be
>> written with relative ease. However, there is still room for writing
>> wrong code that looks legit.
>>
>> Sometimes the best way to tackle a hairy reality is to invite it to
>> the negotiation table and offer it promotion to first-class
>> abstraction status. In that vein, I was thinking of defining a new
>> range: VLERange, i.e. Variable Length Encoding Range. Such a range
>> would have the power somewhere in between bidirectional and random
>> access.
>>
>> The primitives offered would include empty, access to front and back,
>> popFront and popBack (just like BidirectionalRange), and in addition
>> properties typical of random access ranges: indexing, slicing, and
>> length. Note that the result of the indexing operator is not the same
>> as the element type of the range, as it only represents the unit of
>> encoding.
>>
>> In addition to these (and connecting the two), a VLERange would offer
>> two additional primitives:
>>
>> 1. size_t stepSize(size_t offset) gives the length of the step needed
>> to skip to the next element.
>>
>> 2. size_t backstepSize(size_t offset) gives the size of the _backward_
>> step that goes to the previous element.
>>
>> In both cases, offset is assumed to be at the beginning of a logical
>> element of the range.
>>
>> I suspect that a lot of functions in std.string can be written without
>> Unicode-specific knowledge just by relying on such an interface.
>> Moreover, algorithms can be generalized to other structures that use
>> variable-length encoding, such as those used in data compression. (In
>> that case, the support would be a bit array and the encoded type would
>> be ubyte.)
>>
>> Writing to such ranges is not addressed by this design. Ideas are
>> welcome.
>>
>> Adding VLERange would legitimize strings and would clarify their
>> handling, at the cost of adding one additional concept that needs to
>> be minded. Is the trade-off worthwhile?
>
> While this makes it possible to write algorithms that only accept
> VLERanges, I don't think it solves the major problem with strings --
> they are treated as arrays by the compiler.
>
> I'd also rather see an indexing operation return the element type, and
> have a separate function to get the encoding unit. This makes more sense
> for generic code IMO.
>
> I noticed you never commented on my proposed string type...
>
> That reminds me, I should update with suggested changes and re-post it.
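
To make the quoted proposal concrete, here is a minimal sketch of what a
VLERange over UTF-8 might look like. It is written in Python only for
brevity. The class Utf8VLERange, the snake_case method names, and the helper
count_elements are hypothetical stand-ins for the proposed primitives, not an
existing API, and the sketch assumes well-formed UTF-8. Indexing and length
work on code units, while front, back, step_size and backstep_size work on
whole code points.

```python
class Utf8VLERange:
    """Hypothetical sketch of the proposed VLERange over UTF-8 bytes."""

    def __init__(self, data: bytes):
        self.data = data  # the "support": raw UTF-8 code units

    # Random-access side: operates on code units, not on elements.
    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]  # an int code unit, or a bytes slice

    def empty(self):
        return len(self.data) == 0

    def step_size(self, offset):
        """Code units to skip from the element starting at offset."""
        lead = self.data[offset]
        if lead < 0x80:
            return 1
        if lead >> 5 == 0b110:
            return 2
        if lead >> 4 == 0b1110:
            return 3
        if lead >> 3 == 0b11110:
            return 4
        raise ValueError("offset is not at the start of an element")

    def backstep_size(self, offset):
        """Code units to step back from offset to the previous element."""
        n = 1
        # Walk back over continuation bytes (bit pattern 10xxxxxx).
        while n < min(4, offset) and (self.data[offset - n] & 0xC0) == 0x80:
            n += 1
        return n

    # Bidirectional side: operates on decoded elements.
    def front(self):
        return self.data[:self.step_size(0)].decode("utf-8")

    def back(self):
        end = len(self.data)
        return self.data[end - self.backstep_size(end):].decode("utf-8")

    def pop_front(self):
        self.data = self.data[self.step_size(0):]

    def pop_back(self):
        end = len(self.data)
        self.data = self.data[:end - self.backstep_size(end)]


def count_elements(r):
    """Generic algorithm: counts elements using only len and step_size."""
    offset, count = 0, 0
    while offset < len(r):
        offset += r.step_size(offset)
        count += 1
    return count
```

Note that count_elements knows nothing about Unicode: it relies only on the
length of the support and on step_size, which is the kind of genericity the
proposal is after. Here r[i] returns a code unit, as in the original
proposal; having it return the decoded element instead, as Steven suggests,
would change only __getitem__.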

Anyone interested in solving the general problem with Unicode strings
may want to have a look at https://bitbucket.org/denispir/denispir-d. All
constructive feedback is welcome.
(I will submit this for review shortly. The main, client-facing
interface module is Text.d. A (long) presentation of the issues, the
reasoning, and the solution can be found in the text called "U missing
level of abstraction".)

Denis
_________________
vita es estrany
spir.wikidot.com



More information about the Digitalmars-d mailing list