[review] new string type

Steven Schveighoffer schveiguy at yahoo.com
Sat Dec 4 20:56:16 PST 2010


On Fri, 03 Dec 2010 22:08:37 -0500, Jerry Quinn <jlquinn at optonline.net>  
wrote:

> I'm actually working in C++ but keeping an eye on things going on
> in D-land.  The kind of stuff we do is to normalize text in preparation
> for natural language processing.
>
> As a simple example, let's say you want to use a set of regexes to
> identify patterns in text.  You want to return the offsets of each regex
> that matches.  However, before the regexes run, you replace all html tags
> with a placeholder, so they can easily span tags without worrying about
> the contents.

I'm assuming you are not changing the length of the string, or is that not  
correct?

> Before you return the results to the user, though, you need to translate
> those offsets back to ones for the original string.

Hm... I guess you must be changing the lengths if the offsets are
different.  That seems odd; wouldn't you encounter performance issues when
processing large documents?
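
For illustration only, here is a minimal D sketch of one way the offsets
could be translated back without re-scanning the document: record, for
every code unit written to the normalized text, the offset it came from in
the original.  The names Normalized, replaceTags and origOffset are
hypothetical, the tag matching is deliberately naive (anything between '<'
and '>'), and nothing here is claimed to be how Jerry's code works.

```d
/// Normalized text plus a side table mapping each of its code units
/// back to an offset in the original string (hypothetical helper type).
struct Normalized
{
    string text;
    size_t[] origOffset; // origOffset[i] = original offset of text[i]
}

/// Replaces every <...> span with a single placeholder code unit and
/// records where each output code unit came from.
Normalized replaceTags(string src, char placeholder = '#')
{
    Normalized r;
    size_t i = 0;
    while (i < src.length)
    {
        if (src[i] == '<')
        {
            immutable start = i;
            while (i < src.length && src[i] != '>')
                ++i;
            if (i < src.length)
                ++i; // step over the closing '>'
            r.text ~= placeholder;
            r.origOffset ~= start;
        }
        else
        {
            // Non-tag code units are copied verbatim, so multi-byte
            // UTF-8 sequences stay intact.
            r.text ~= src[i];
            r.origOffset ~= i;
            ++i;
        }
    }
    r.origOffset ~= src.length; // sentinel so end offsets can be mapped too
    return r;
}
```

With this table, a match covering [a, b) in the normalized text maps to
[origOffset[a], origOffset[b]) in the original.  The lookup is constant
time per offset, at the cost of one size_t per code unit of normalized
text.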

> Everything is unicode of course and we care about processing unicode  
> code points, but want to maintain UTF-8 storage underneath to keep size  
> down.
>
> In reality, we're often doing things like single character  
> normalizations as well as larger spans, but still need to maintain  
> alignment to the original data.
>
> As long as this is reasonable to do, I'm fine.  I just wasn't sure from  
> the descriptions I was seeing.

What you will have is access to the underlying char[] array, which should
give you full edit access.  I just don't want strings to be easily
editable, since editing UTF-8 data correctly can be very difficult.

Any offset to a dchar code point in the string will be the same as the
offset of the char code unit where that code point starts.  In effect, you
are always indexing by code unit, even though the string type gives you
code points back.
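
To make the code-unit/code-point relationship concrete, here is a small
sketch using std.utf.decode from Phobos on an ordinary D string.  The
helper name codePointOffsets is made up for this example; it only shows
that the positions at which code points start are plain code-unit offsets
into the UTF-8 data.

```d
import std.utf : decode;

/// Returns the code-unit offset at which each code point of `s` starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    size_t i = 0;
    while (i < s.length)
    {
        offsets ~= i;
        decode(s, i); // decodes one code point and advances i past it
    }
    return offsets;
}
```

For "aé€z" the code points take 1, 2, 3 and 1 code units respectively, so
the offsets are not consecutive: the three-byte '€' starts at offset 3 and
'z' at offset 6.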

It should be as simple as accessing a member (like str.data) or casting
(e.g. cast(char[])str).  I'm not yet sure whether it's dangerous enough to
require a cast.
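
As a rough sketch of the member-access variant, assuming a wrapper that
simply exposes its storage: the struct Str below is a hypothetical
stand-in, not the proposed type, and its data member and opIndex are
assumptions made for this example.

```d
import std.utf : decode;

/// Hypothetical stand-in for the proposed string type.
struct Str
{
    char[] data; // underlying UTF-8 code units, exposed for editing

    /// Index by code-unit offset, get a decoded code point back.
    dchar opIndex(size_t offset) const
    {
        const(char)[] view = data;
        size_t i = offset;
        return decode(view, i);
    }
}
```

With auto s = Str("naïve".dup), s[2] yields the code point U+00EF even
though it occupies code units 2 and 3, while s.data[0] = 'N' edits the
storage directly.  An edit through data that splits a multi-byte sequence
would leave invalid UTF-8 behind, which is the danger that might justify
requiring a cast.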

-Steve

