Notepad++

Fri Aug 14 17:36:26 PDT 2009

Sergey Gromov wrote:
<snip>
> Well, you can write a regexp to handle a simple C string.  That is, if
> your regexp is matched against the whole file, which is usually not the
> case.  Otherwise you'll have troubles with C string:
> 
> "foo\
> bar"
> 
> or D string:
> 
> "foo
> bar"

So there is a problem if the highlighter works by matching regexps on a 
line-by-line basis.  But matching regexps over a whole file is no harder 
in principle than matching line by line and, as long as the maximal 
munch principle is never called into play, it can't be much less 
efficient.  (The only bit of C or D strings that relies on maximal munch 
is octal escapes.)

> Then you want to highlight string escapes and probably format
> specifiers.  Therefore you need not simple regexps but hierarchies of
> them, and also you need to know where *internals* of the string start
> and end.

Let's just concentrate for the moment on the simple process of finding 
the beginning and end of a string.  Here's a snippet of a TextPad syntax 
file:

StringsSpanLines = Yes
StringStart = "
StringEnd = "
StringEsc = \

A possible snippet of lexer code to handle this (which, for all I know, 
might be near enough how TP does it):

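if (*c == StringStart) {
     beginHighlightString(c);
     /* Scan to the closing quote, the end of the file or, if strings
        may not span lines, the end of the line. */
     for (++c; *c != StringEnd && *c != '\0'
           && (StringsSpanLines || *c != '\n'); ++c) {
         if (*c == StringEsc) ++c;   /* skip the escaped character */
     }
     /* Include the closing quote only if we actually stopped on one. */
     endHighlightString(*c == StringEnd ? c + 1 : c);
}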

It's simple and it should work.  (OK, there are two assumptions made for 
simplicity: that line breaks are normalised to LF, and that the file is 
terminated by at least two null bytes in memory, but you get the idea.)
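
For anyone who wants to try the idea without those two assumptions about the buffer, here is a minimal self-contained sketch in C.  It takes an explicit length instead of relying on trailing null bytes.  The names StringSyntax and scan_string are made up for illustration and have nothing to do with TextPad's or N++'s actual code; line breaks are still assumed to be normalised to LF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical settings mirroring the four syntax-file entries. */
typedef struct {
    char string_start;
    char string_end;
    char string_esc;
    bool strings_span_lines;
} StringSyntax;

/*
 * text[start] must be the opening quote.  Returns the index one past the
 * closing quote.  For an unterminated string, returns the index of the
 * offending line break (when strings may not span lines) or len.
 */
size_t scan_string(const char *text, size_t len, size_t start,
                   const StringSyntax *syn)
{
    assert(text != NULL && syn != NULL);
    assert(start < len && text[start] == syn->string_start);

    size_t i = start + 1;
    while (i < len) {
        char ch = text[i];
        if (ch == syn->string_end)
            return i + 1;           /* include the closing quote */
        if (ch == '\n' && !syn->strings_span_lines)
            return i;               /* unterminated: stop at the line break */
        if (ch == syn->string_esc && i + 1 < len)
            i += 2;                 /* skip the escape and the escaped char */
        else
            i += 1;
    }
    return len;                     /* unterminated at end of buffer */
}
```

With StringsSpanLines off, this stops at the line break in the D-style "foo<LF>bar" but still runs through the C-style "foo\<LF>bar", because the backslash escapes the line break just as in the loop above.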

While it doesn't support highlighting of escapes, I can't see this fact 
as being the reason N++'s developers haven't implemented even this in 
the generic lexer module.  I probably couldn't see it being the reason 
even if the C lexer did highlight escapes (which it doesn't).

> Then you have r"foo" which probably can be handled with regexps.
> 
> Then you have q"/foo/" where "/" can be anything.  Still can be handled
> by extended regexps, even though they won't be regular expressions in
> scientific sense.
> 
> Then you have q"{foo}" where "{" and "}" can be any of ()[]<>{}.
> Regexps cannot translate while substituting, so you must create regexps
> for all possible parens.

Yes, these aspects are more complicated.  Both TP and N++ (out of the 
box, anyway) are probably far from being able to lex D2 properly.  But 
they certainly could do better in supporting D1.  Still, once N++ gains 
access to Scintilla's D lexer, things will certainly be better.
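
To illustrate why a hand-written lexer has an easier time than regexps with the q"{foo}" form quoted above, here is a rough sketch in C.  The function names are hypothetical.  It assumes a single-character delimiter (the q"BLAH heredoc form is not handled) and assumes that bracket delimiters nest, which is my understanding of D2 but should be checked against the spec.

```c
#include <assert.h>
#include <stddef.h>

/* Map an opening bracket to its partner; any other character closes itself. */
char closing_delimiter(char open)
{
    switch (open) {
    case '(': return ')';
    case '[': return ']';
    case '<': return '>';
    case '{': return '}';
    default:  return open;
    }
}

/*
 * text[start..] must begin with q"X, where X is the delimiter character.
 * Returns the index one past the closing quote, or len if the string is
 * unterminated or malformed.
 */
size_t scan_delimited(const char *text, size_t len, size_t start)
{
    assert(text != NULL);
    assert(start + 2 < len && text[start] == 'q' && text[start + 1] == '"');

    char open = text[start + 2];
    char close = closing_delimiter(open);
    size_t depth = 1;   /* nesting depth; used only for bracket pairs */

    for (size_t i = start + 3; i < len; i++) {
        char ch = text[i];
        if (open != close) {                    /* bracket pair */
            if (ch == open)
                depth++;
            else if (ch == close && --depth == 0)
                return (i + 1 < len && text[i + 1] == '"') ? i + 2 : len;
        } else if (ch == close && i + 1 < len && text[i + 1] == '"') {
            return i + 2;                       /* delimiter followed by quote */
        }
    }
    return len;
}
```

The "translation" from opening to closing delimiter that regexps can't do is a single switch here, so one routine covers all four bracket pairs as well as the q"/foo/" form.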

> And of course q"BLAH
> whatever BLAH here
> BLAH", well, probably nice for help texts.
> 
> And these are only strings.  Try to write regexp which treats .__15 as
> number(.__15), .__foo as operator(.), ident(__foo), and 2..3 as
> number(2), operator(..), number(3).
<snip>

We'd need many regexps to handle every case, but one set that covers 
these cases and a few others, tried in the order listed (highest 
priority first), is:

\._*[0-9][0-9_]*
([1-9][0-9]*)(\.\.)
[0-9]+\.[0-9]*
[1-9][0-9]*
\.\.
\.
[a-zA-Z_][a-zA-Z0-9_]*

Note the use of capturing groups to handle the 2..3 case.  Each 
capturing group would match a token, while in the other cases the whole 
regexp matches a token.
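
As a rough check of the list above, here is a sketch in C that tries the seven regexps in priority order at the current position, using POSIX <regex.h> (so it assumes a POSIX system).  The names RULES, emit and tokenize, and the '|' separator in the output, are made up for this example.  Each pattern is anchored with ^ so that it can only match at the current position.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* The seven rules, highest priority first, anchored at the scan position. */
static const char *const RULES[] = {
    "^\\._*[0-9][0-9_]*",
    "^([1-9][0-9]*)(\\.\\.)",
    "^[0-9]+\\.[0-9]*",
    "^[1-9][0-9]*",
    "^\\.\\.",
    "^\\.",
    "^[a-zA-Z_][a-zA-Z0-9_]*",
};
enum { NRULES = sizeof RULES / sizeof RULES[0], MAXGROUPS = 3 };

/* Append one token to out, '|'-separated.  Returns -1 if it does not fit. */
static int emit(char *out, size_t outsz, const char *tok, size_t n)
{
    size_t used = strlen(out);
    size_t sep = used ? 1 : 0;
    if (used + sep + n + 1 > outsz)
        return -1;
    if (sep)
        out[used++] = '|';
    memcpy(out + used, tok, n);
    out[used + n] = '\0';
    return 0;
}

/* Tokenise src into out.  Returns 0 on success, -1 if no rule matches
   somewhere or out is too small.  Blanks between tokens are skipped. */
int tokenize(const char *src, char *out, size_t outsz)
{
    regex_t re[NRULES];
    int status = 0;

    assert(src != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    for (size_t i = 0; i < NRULES; i++) {
        int rc = regcomp(&re[i], RULES[i], REG_EXTENDED);
        assert(rc == 0);
        (void)rc;
    }

    const char *p = src;
    while (*p != '\0' && status == 0) {
        if (*p == ' ' || *p == '\t' || *p == '\n') { p++; continue; }
        size_t i;
        for (i = 0; i < NRULES; i++) {
            regmatch_t m[MAXGROUPS];
            if (regexec(&re[i], p, MAXGROUPS, m, 0) != 0)
                continue;               /* try the next rule */
            if (re[i].re_nsub == 0) {
                /* No groups: the whole match is one token. */
                status = emit(out, outsz, p, (size_t)m[0].rm_eo);
            } else {
                /* Groups: each capturing group is a token of its own. */
                for (size_t g = 1; g <= re[i].re_nsub && status == 0; g++)
                    status = emit(out, outsz, p + m[g].rm_so,
                                  (size_t)(m[g].rm_eo - m[g].rm_so));
            }
            p += m[0].rm_eo;
            break;
        }
        if (i == NRULES)
            status = -1;                /* no rule matched here */
    }

    for (size_t i = 0; i < NRULES; i++)
        regfree(&re[i]);
    return status;
}
```

Fed ".__15", ".__foo" and "2..3", this should produce ".__15", ".|__foo" and "2|..|3" respectively, which is the split asked for in the quoted text.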

Stewart.