How to check i
Uranuz via Digitalmars-d-learn
digitalmars-d-learn at puremagic.com
Thu Oct 16 11:46:42 PDT 2014
I have a string *str* of Unicode characters. The question is
how to check whether a valid Unicode code point starts at code
unit *index*.
I need this because I am trying to write a parser that operates
on a string by *code unit*. More precisely, I am trying to write
a function *matchWord* that extracts whole words (which may
contain more than just English letters) from the text. The word
is then compared with the word passed as a parameter. I don't
want to decode unless it is necessary. But it looks like I can't
do it without decoding, because I need to know whether the
current character is a letter of an alphabet rather than, for
example, punctuation or whitespace.
Here is how I think it would look. In the real code I have a
template algorithm that operates on different types of strings:
string, wstring, dstring.
struct Lexer
{
    string str;
    size_t index;

    bool matchWord(string word)
    {
        size_t i = index;
        while( !str[i..$].empty )
        {
            if( !str.isValidChar(i) )
            {
                i++;
                continue;
            }

            uint len = str.graphemeStride(i);

            if( !isAlpha(str[i..i+len]) )
            {
                break;
            }

            i++;
        }

        return word == str[index..i];
    }
}
It is just a draft of the idea. Maybe it is too complicated.
What I want as a result is a boolean flag (matched or not), with
the position set just after the word if it matched. And it
should match whole words, of course.
How do I implement this correctly, without overhead and
additional UTF decoding if possible?
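One possible shape for this, as a rough sketch rather than a
definitive implementation: code units below 0x80 are ASCII and can
be classified without decoding, so only non-ASCII sequences go
through std.utf.decode. The sketch assumes that a "word" is a
maximal run of alphabetic code points, that the struct is
specialised to string (UTF-8), and that advancing index only on a
match is the desired behaviour. The names isAsciiAlpha and
isUniAlpha are local import aliases, not library names.

```d
import std.ascii : isAsciiAlpha = isAlpha;
import std.uni : isUniAlpha = isAlpha;
import std.utf : decode;

struct Lexer
{
    string str;
    size_t index;

    /// Sketch: scan the run of alphabetic characters starting at `index`
    /// and report whether it equals `word`. `index` moves only on a match.
    bool matchWord(string word)
    {
        size_t i = index;
        while (i < str.length)
        {
            if (str[i] < 0x80)
            {
                // ASCII fast path: one code unit, no decoding needed.
                if (!isAsciiAlpha(str[i]))
                    break;
                ++i;
            }
            else
            {
                // Non-ASCII: decode one code point; `next` ends up just
                // past it. decode throws UTFException on malformed input.
                size_t next = i;
                immutable dchar c = decode(str, next);
                if (!isUniAlpha(c))
                    break;
                i = next;
            }
        }

        // Whole-word match: the entire scanned run must equal `word`.
        if (str[index .. i] != word)
            return false;
        index = i;
        return true;
    }
}
```

This classifies code point by code point rather than by grapheme
cluster, so text that relies on combining marks may need extra
handling.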
Also, how can I validate a single character of a string starting
at a given code unit index? I also don't like that
graphemeStride can throw an Exception if I point to the wrong
position. Is there a nothrow version? I don't want extra
allocations for exceptions.
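For illustration, the check for a single UTF-8 sequence can be
written by hand so that it neither throws nor allocates. The helper
below, validSequenceLength, is hypothetical (it is not a Phobos
function) and handles only UTF-8 (char) input. It returns the
length of the well-formed sequence starting at the given code unit,
or 0 if no valid sequence starts there.

```d
/// Hypothetical helper (not part of Phobos): length in code units of the
/// well-formed UTF-8 sequence starting at s[i], or 0 if there is none.
size_t validSequenceLength(const(char)[] s, size_t i) pure nothrow @nogc @safe
{
    if (i >= s.length)
        return 0;

    immutable uint b0 = s[i];
    if (b0 < 0x80)
        return 1; // ASCII: always a complete code point

    size_t len;
    uint cp, minCp;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
    else
        return 0; // continuation byte or invalid lead byte

    if (s.length - i < len)
        return 0; // sequence is cut off by the end of the string

    foreach (k; 1 .. len)
    {
        immutable uint b = s[i + k];
        if ((b & 0xC0) != 0x80)
            return 0; // expected a continuation byte (10xxxxxx)
        cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < minCp)
        return 0; // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0; // UTF-16 surrogate, not a valid scalar value
    if (cp > 0x10FFFF)
        return 0; // beyond the Unicode range
    return len;
}
```

A caller can treat a result of 0 as "skip one code unit" and any
other result as the number of code units to step over.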