[review] new string type

Jerry Quinn jlquinn at optonline.net
Fri Dec 3 19:08:37 PST 2010


Steven Schveighoffer Wrote:

> On Fri, 03 Dec 2010 14:40:30 -0500, Jerry Quinn <jlquinn at optonline.net>  
> wrote:
> 
> > I tend to do a lot of transforming strings, but I need to track offsets  
> > back to the original text to maintain alignment between the results and  
> > the input.  For that, indexes are necessary and we use them a lot.
> 
> In my daily usage of strings, I generally use a string as a whole, not  
> individual characters.  But I do occasionally use it.
> 
> Let's also understand that indexing is still present, what is deactivated  
> is the ability to index to arbitrary code-units.  It sounds to me like  
> this new type would not affect your ability to store offsets (you can  
> store an index, use it later when referring to the string, etc. just like  
> you can now).
> 
> My string type does not allow for writeable strings.  My plan was to allow  
> you access to the underlying char[] and let you edit that way.  Letting  
> someone write a dchar into the middle of a UTF-8 string could cause lots of  
> problems, so I just disabled it by default.

That's reasonable.  I'm not trying to create invalid strings.

I'm actually working in C++ but keeping an eye on things going on
in D-land.  The kind of stuff we do is to normalize text in preparation
for natural language processing. 

As a simple example, let's say you want to use a set of regexes to
identify patterns in text and return the offsets of each match.  Before
the regexes run, though, you replace all HTML tags with a placeholder, so
that a pattern can easily span tags without worrying about their contents.
Before you return the results to the user, you need to translate those
offsets back into offsets in the original string.
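
A minimal C++ sketch of that kind of bookkeeping follows.  It is only an
illustration under stated assumptions, not our actual code: the names
MaskedText, mask_tags and find_in_original are hypothetical, a "tag" is
naively taken to be anything from '<' to the next '>', and the placeholder
is a single 0x01 byte.  Each byte of the masked text remembers the byte
offset it came from, so match offsets can be mapped back afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: the masked text plus, for every byte of it,
// the byte offset that byte came from in the original string.
struct MaskedText {
    std::string text;
    std::vector<std::size_t> origin;  // text.size() + 1 entries
};

// Replace each "<...>" run with one placeholder byte (naive tag detection,
// assumed here purely for illustration).
MaskedText mask_tags(const std::string& src, char placeholder = '\x01') {
    MaskedText out;
    std::size_t i = 0;
    while (i < src.size()) {
        out.origin.push_back(i);
        if (src[i] == '<') {
            std::size_t close = src.find('>', i);
            if (close != std::string::npos) {
                out.text.push_back(placeholder);
                i = close + 1;  // skip the whole tag
                continue;
            }
        }
        out.text.push_back(src[i]);  // ordinary byte, or an unterminated '<'
        ++i;
    }
    out.origin.push_back(src.size());  // maps the one-past-the-end offset
    return out;
}

// Run a regex over the masked text and report each match as a
// [begin, end) pair of byte offsets into the ORIGINAL string.
std::vector<std::pair<std::size_t, std::size_t>>
find_in_original(const MaskedText& m, const std::regex& re) {
    assert(m.origin.size() == m.text.size() + 1);
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::sregex_iterator it(m.text.begin(), m.text.end(), re);
    std::sregex_iterator end;
    for (; it != end; ++it) {
        std::size_t b = static_cast<std::size_t>(it->position(0));
        std::size_t e = b + static_cast<std::size_t>(it->length(0));
        spans.emplace_back(m.origin[b], m.origin[e]);
    }
    return spans;
}
```

With the input "a <b>big</b> dog" and the pattern "big. dog", the match
covers bytes 3 to 11 of the masked text and maps back to bytes 5 to 16 of
the original, i.e. "big</b> dog".  The offsets here are byte (code unit)
offsets, which is why being able to store and reuse them matters to me.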

Everything is Unicode, of course, and we care about processing Unicode
code points, but we want to keep UTF-8 storage underneath to keep size down.
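
To make "code points on top of UTF-8 storage" concrete, here is a rough
C++ sketch that walks a UTF-8 string and yields each code point together
with the byte offset where it starts.  The names CodePointAt and
decode_with_offsets are hypothetical, and the decoder assumes well-formed
UTF-8: it does no validation of overlong forms, surrogates or stray
continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one decoded code point and where it lives
// in the underlying UTF-8 storage.
struct CodePointAt {
    char32_t value;
    std::size_t offset;  // byte offset of the first code unit
    std::size_t length;  // number of UTF-8 bytes (1 to 4)
};

// Decode a string ASSUMED to be well-formed UTF-8.
std::vector<CodePointAt> decode_with_offsets(const std::string& s) {
    std::vector<CodePointAt> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        char32_t cp = lead;
        // The lead byte tells us how many bytes the sequence has.
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }
        assert(i + len <= s.size());  // precondition: no truncated sequence
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back({cp, i, len});
        i += len;
    }
    return out;
}
```

The processing logic can then work on the value fields, while the offset
and length fields keep the alignment to the original bytes.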

In reality, we often normalize single characters as well as larger spans,
but we still need to maintain alignment with the original data.

As long as this is reasonable to do, I'm fine.  I just wasn't sure from the descriptions I was seeing.

Jerry