Request for comments: std.d.lexer

H. S. Teoh hsteoh at quickfur.ath.cx
Mon Jan 28 10:04:18 PST 2013


On Sun, Jan 27, 2013 at 10:39:13PM +0100, Brian Schott wrote:
> On Sunday, 27 January 2013 at 19:46:12 UTC, Walter Bright wrote:
[...]
> >Just a quick comment: byToken() should not accept a filename. It's
> >input should be via an InputRange, not a file.
> 
> The file name is accepted for eventual error reporting purposes. The
> actual input for the lexer is the parameter called "range".
[...]

FWIW, I've developed this little idiom in my code when it comes to
dealing with error reporting in lexing/parsing code (for my own DSLs,
not D):

The main problem I have is that my lexer/parser accepts an input range,
but input ranges don't (necessarily) have any filename/line number
associated with them. Moreover, the code that throws the exception may
be quite far down the call chain, and may not have access to the context
that knows what filename/line number the error occurred at. For example,
I may have a generic function called parseInt(), which can be called
from the lexer, the parser, or a whole bunch of other places. It
wouldn't make sense to force parseInt() to take a filename and line
number, just so it can have nicer error reporting, for example.

So I decided to move the inclusion of filename/line number information
to where they belong: in the code that knows about them. So here's a
sketch of my approach:

	class SyntaxError : Exception {
		string filename;
		int linenum;
		this(string msg) { super(msg); }
	}
	...
	int parseInt(R)(R inputRange) {
		...
		// N.B.: no filename/line number info here
		if (!isDigit(inputRange.front))
			throw new SyntaxError(format("Invalid digit: %s",
				inputRange.front));
	}
	...
	Expr parseExpr(R)(R inputRange) {
		...
		// N.B.: any exception just unrolls past this call, 'cos
		// we have no filename/line number info to insert anyway
		if (tokenType == IntLiteral) {
			value = parseInt(inputRange);
		}
		...
	}
	...
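	Expr parseFileInput(string filename) {
		auto f = File(filename);

		// Wrapper range that counts line numbers. Declared outside
		// the try block so the catch block can still see it.
		// N.B.: r needs reference semantics (or parseExpr must take
		// it by ref) for r.linenum to be up to date down there.
		auto r = NumberedSrc(f);
		try {
			return parseExpr(r);
		} catch(SyntaxError e) {
			// Insert filename/line number info into message
			e.filename = filename;
			e.linenum = r.linenum;
			e.msg = format("%s:%d %s", filename, r.linenum, e.msg);
			throw e;
		}
	}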
	...
	Expr parseConsoleInput() {
		// No filename/line number info here
		return parseExpr(stdin.byLine());
	}
	...
	Expr parseStringInput(string input) {
		// Declared outside the try block so the catch block can
		// still see it.
		auto r = NumberedSrc(input);
		try {
			return parseExpr(r);
		} catch(SyntaxError e) {
			// We don't have filename here, but we do have
			// line number, so use that.
			e.linenum = r.linenum;
			e.msg = format("Line %d: %s", r.linenum, e.msg);
			throw e;
		}
	}

Notice that I have different wrapper functions for dealing with
different kinds of input; the underlying parser doesn't even care about
filename/line numbers, but the wrapper functions catch any parsing
exceptions that are thrown from underneath and prepend this info as
appropriate. This simplifies the parsing code (don't have to keep
worrying about line numbers and propagating filenames) and also produces
output that makes sense:

- Console input doesn't need line numbers; the user doesn't care if this
  is the 500th command he typed, or the 701st.

- Internal strings don't get a nonsensical "filename", 'cos they don't
  *have* a filename in the first place. Just a single line number so you
  can locate the problem in, say, the string literal or something.

- File input has filename and line number.

- Other kinds of input contexts can be handled in the same way.

- The use of NumberedSrc (maybe better named LineNumberedRange or
  something) makes line numbers available to each of these contexts at
  the top-level. Though of course, the lexer itself can also handle this
  (but it adds complications if you have to continue detecting newlines
  in, say, string literals, when the lexer is in a different state).
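
To make the NumberedSrc idea a bit more concrete, here is a minimal,
self-contained sketch of how such a wrapper and one wrapper function
could look. Everything in it is illustrative rather than my actual code:
sumLines() and sumString() are made-up stand-ins for the parser and its
string wrapper, and I'm assuming NumberedSrc is a class, so that the
range handed to the parser and the one the wrapper function keeps share
a single line counter.

```d
import std.ascii : isDigit;
import std.format : format;
import std.range.primitives : isInputRange, empty, front, popFront;

class SyntaxError : Exception
{
    int linenum;
    this(string msg) { super(msg); }
}

// Hypothetical line-counting wrapper. It is a class (reference type), so
// the parser and the calling wrapper function see the same counter.
final class NumberedSrc(R) if (isInputRange!R)
{
    private R src;
    int linenum = 1;

    this(R src) { this.src = src; }

    @property bool empty() { return src.empty; }
    @property auto front() { return src.front; }

    void popFront()
    {
        // Stepping past a newline moves us to the next line.
        if (src.front == '\n')
            ++linenum;
        src.popFront();
    }
}

// Stand-in for the parser: sums one non-negative integer per line.
// N.B.: it knows nothing about filenames or line numbers.
int sumLines(R)(R input)
{
    int total = 0;
    while (!input.empty)
    {
        if (!isDigit(input.front))
            throw new SyntaxError(format("Invalid digit: %s", input.front));
        int value = 0;
        while (!input.empty && isDigit(input.front))
        {
            value = value * 10 + cast(int)(input.front - '0');
            input.popFront();
        }
        total += value;
        if (!input.empty && input.front == '\n')
            input.popFront();
    }
    return total;
}

// Wrapper for string input: no filename, but it can supply a line number.
int sumString(string text)
{
    auto r = new NumberedSrc!string(text);
    try
    {
        return sumLines(r);
    }
    catch (SyntaxError e)
    {
        e.linenum = r.linenum;
        e.msg = format("Line %d: %s", r.linenum, e.msg);
        throw e;
    }
}
```

With this, sumString("12\n34\nx5") throws a SyntaxError whose linenum is
3, even though sumLines() itself never counts lines. A struct would not
work the same way here, because sumLines() would then advance its own
copy and the counter seen in the catch block would stay at 1.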

The cleaner code does come at a price, though: it's probably a bit less
efficient because of the extra layers of wrapping. But I just thought
I'd share the idea.


T

-- 
You are only young once, but you can stay immature indefinitely. -- azephrahel

