How to check i

Uranuz via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Thu Oct 16 11:46:42 PDT 2014


I have a string *str* of Unicode characters. The question is: 
how do I check whether a valid Unicode code point starts at code 
unit *index*?

I need this because I am trying to write a parser that operates on 
a string by *code unit*. More precisely, I am trying to write a 
function *matchWord* that should extract whole words (which may 
consist of more than just English letters) from text. The extracted 
word is then compared with the word passed as a parameter. I don't 
want to decode unless it is necessary. But it looks like I can't do 
it without decoding, because I need to know whether the current 
character is a letter of an alphabet and not punctuation or 
whitespace, for example.

Here is how I think it might look. In the real code I have a template 
algorithm that operates on different types of strings: string, 
wstring, dstring.

struct Lexer
{
	string str;
	size_t index;

	bool matchWord(string word)
	{
		size_t i = index;
		while( !str[i..$].empty )
		{
			if( !str.isValidChar(i) )
			{
				i++;
				continue;
			}
			
			uint len = str.graphemeStride(i);

			if( !isAlpha(str[i..i+len]) )
			{
				break;
			}
			i++;
		}
		
		return word == str[index..i];
	}
}

It is just a draft of the idea. Maybe it is overcomplicated. What I 
want to get as a result is a boolean flag (matched or not), and the 
position should be set after the word if it matched. And it should 
match whole words, of course.

How do I implement it correctly, without overhead and without 
additional UTF decoding if possible?

Also, how can I validate a single character of a string starting at 
code unit index? I also don't like that graphemeStride can throw 
an Exception if I point to the wrong position. Is there some nothrow 
version? I don't want extra allocations for exceptions.
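
For illustration, here is a minimal sketch of one direction this could 
take, not a definitive answer. It assumes UTF-8 input only (string, not 
wstring or dstring) and a Phobos version in which std.utf.decode accepts 
the useReplacementDchar flag, so that invalid input yields U+FFFD 
instead of throwing. The helpers startsCodePoint and wordLength are 
hypothetical names invented for this sketch. It decodes each code point 
once and classifies it by code point rather than by grapheme.

```d
import std.typecons : Yes;
import std.uni : isAlpha;
import std.utf : decode;

/// Hypothetical helper: true if the UTF-8 code unit at `i` is not a
/// continuation byte (10xxxxxx), i.e. it could begin a code point.
/// This does not validate the rest of the sequence.
bool startsCodePoint(const(char)[] s, size_t i) pure nothrow @nogc @safe
{
    return i < s.length && (s[i] & 0xC0) != 0x80;
}

/// Hypothetical helper: number of code units covered by the run of
/// alphabetic code points that begins at `start`.
size_t wordLength(const(char)[] s, size_t start)
{
    size_t i = start;
    while (i < s.length)
    {
        size_t next = i;
        // Assumption: with Yes.useReplacementDchar, an invalid sequence
        // decodes to U+FFFD (not alphabetic) instead of throwing.
        immutable dchar c = decode!(Yes.useReplacementDchar)(s, next);
        if (!isAlpha(c))
            break;
        i = next; // advance by the whole code point, not by one unit
    }
    return i - start;
}

struct Lexer
{
    string str;
    size_t index; // assumed to satisfy index <= str.length

    /// Matches only a whole word; advances `index` past it on success.
    bool matchWord(string word)
    {
        immutable len = wordLength(str, index);
        if (str[index .. index + len] != word)
            return false;
        index += len;
        return true;
    }
}
```

Because the whole alphabetic run is extracted before the comparison, a 
prefix such as "при" would not match the text "привет". Each code point 
is still decoded once, which seems hard to avoid if the letter test has 
to work beyond ASCII.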

