Language Tutorial part 1: Scanning input source text

This post is part of a language tutorial series.

In this part of the tutorial we will define the basic components of our language and introduce the first phase of our language processor: the scanner. The scanner breaks the textual representation of the input program into a sequence of tokens; for example, 1.0 * 20 + -3 will in our language result in the tokens:

Number Times Number Plus Minus Number

We break the input text into tokens to simplify the parsing stage, which is the next part of our language processor. If we did not, we would have to do the scanning and identify tokens while parsing. Splitting a complex task into smaller, comprehensible tasks helps us develop more elegant solutions.

Tokens are not just names, but also values. When relevant, tokens are parameterized by values, e.g. 1.0 for the first token above, but nothing for the Times token, since the type (Times) of such a token is currently information enough; additional information might be the source-code position of a token, but we do not track such information yet.

We will use an empty (for now) record Token as the basic abstraction for tokens; we use C# 9 records for tokens throughout because of their succinctness, but classes would work just as well. Our token set then looks like this; for every type of word in our language we have a token to represent it. One could imagine various other ways of designing a suitable data type for tokens, involving enums and other fancier constructs. I chose records because they keep the token design simple and elegant.

record Token;
interface IgnoredToken { }
record Plus : Token;
record Minus : Token;
record Division : Token;
record Times : Token;
record LParan : Token;
record RParan : Token;
record WhiteSpace : Token, IgnoredToken;
record NewLine : Token, IgnoredToken;
record End : Token;
record Number(string Value) : Token;

We have added an interface IgnoredToken to mark tokens that are named and recognized, but filtered out of our token stream. This allows us to handle several different tokens as one type.

Character stream

For convenience we create our own definition of a character stream, used to wrap .NET streams, strings, files or the like. The interface ICharStream looks like this (a string-based implementation is found here: https://bit.ly/3vWd8cB):

public interface ICharStream
{
    char PeekCh(int n = 0);
    char Read();
    int Peek(int n = 0);
    bool Skip(params char[] expected);
    bool Skip(int n = 1);
    public bool AtEndOfStream => Peek() == -1;
}
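The interface can be backed by a plain string. The sketch below is a hypothetical minimal implementation (the implementation behind the link above may differ in details); it repeats the interface so the example is self-contained. Peek returns -1 past the end, which makes the AtEndOfStream default member work, and PeekCh uses '\0' as an end marker, which is an assumption of this sketch.

```csharp
using System;

// ICharStream as defined above.
public interface ICharStream
{
    char PeekCh(int n = 0);
    char Read();
    int Peek(int n = 0);
    bool Skip(params char[] expected);
    bool Skip(int n = 1);
    public bool AtEndOfStream => Peek() == -1;
}

// A minimal string-backed implementation sketch.
public class StringCharStream : ICharStream
{
    private readonly string _text;
    private int _pos;

    public StringCharStream(string text) => _text = text;

    // Peek n characters ahead without consuming; -1 signals end of stream.
    public int Peek(int n = 0) => _pos + n < _text.Length ? _text[_pos + n] : -1;

    // Same, as a char; '\0' marks the end (an assumption of this sketch).
    public char PeekCh(int n = 0) => _pos + n < _text.Length ? _text[_pos + n] : '\0';

    // Consume and return the next character.
    public char Read() => _text[_pos++];

    // Consume the next character only if it is one of the expected ones.
    public bool Skip(params char[] expected)
    {
        if (Array.IndexOf(expected, PeekCh()) < 0)
            return false;
        _pos++;
        return true;
    }

    // Unconditionally skip n characters (clamped to the end of the text).
    public bool Skip(int n = 1)
    {
        _pos = Math.Min(_pos + n, _text.Length);
        return true;
    }
}
```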

Implementing the scanner

The transformation of text into its constituent tokens is a mechanical process: we inspect each input character and, depending on the current state of the scanner, process the character and switch state. This is basically a state machine, formally known as a finite state machine; such machines are useful for recognizing regular languages, but we will not discuss the formal definitions further in this post.

The most basic operation of the scanner is to produce tokens, skipping ignored tokens (of type IgnoredToken), until we reach the end of our stream, signalled by the End token. If you are unfamiliar with yield, in short, we are generating an IEnumerable containing whatever is yielded. Find a more thorough explanation here.

private IEnumerable<Token> GetTokens(ICharStream cs)
{
    Token next;
    do
    {
        next = NextToken(cs);
        if (next is not IgnoredToken)
            yield return next;
    } while (next is not End);
}

The NextToken method does the actual processing of our character-stream:

private Token NextToken(ICharStream s) 
{
    if (s.AtEndOfStream)
        return new End();
    if (s.PeekCh() == '.' || char.IsDigit(s.PeekCh()))
        return NextNumber(s);
    else
        return NextSimpleToken(s);
}

If we have reached the end of the character stream, we return the End token. If not, the choice is either a Number token or one of the fixed-size simple tokens, e.g. newline, parentheses or the like. Note that we allow numbers to start with a '.', so we can write .1 instead of 0.1.

Numbers are then read using the NextNumber method:

private Number NextNumber(ICharStream s)
{
    StringBuilder sb = new();
    bool dotted = false;
    while (char.IsDigit(s.PeekCh()) || (!dotted && (dotted = s.PeekCh() == '.')))
        sb.Append(s.Read());
    return new Number(sb.ToString());
}
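The digit/single-dot loop can be exercised on its own. The sketch below is hypothetical (it substitutes a plain string index for ICharStream to stay self-contained) but uses the same scan rule as NextNumber, showing the leading-dot case and that a second dot ends the number:

```csharp
using System;
using System.Text;

class NumberScanDemo
{
    // Same scan rule as NextNumber above: digits, plus at most one '.',
    // which may also appear first (allowing .1 for 0.1).
    public static string ScanNumber(string text)
    {
        var sb = new StringBuilder();
        int pos = 0;
        bool dotted = false;
        while (pos < text.Length &&
               (char.IsDigit(text[pos]) || (!dotted && (dotted = text[pos] == '.'))))
            sb.Append(text[pos++]);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ScanNumber("1.25+3")); // 1.25
        Console.WriteLine(ScanNumber(".5*2"));   // .5  (leading dot allowed)
        Console.WriteLine(ScanNumber("1.2.3")); // 1.2 (second dot ends the number)
    }
}
```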

We do, however, leave the translation to actual numeric values to a later processing stage. This will allow us to choose, and later change, the numeric representation. Simple tokens are whitespace, operators, etc.:

Token NextSimpleToken(ICharStream s) => s.Read() switch
{
    '(' => new LParan(),
    ')' => new RParan(),
    '+' => new Plus(),
    '-' => new Minus(),
    '*' => new Times(),
    '/' => new Division(),
    ' ' or '\t' => new WhiteSpace(),
    '\r' when s.Skip('\n') => new NewLine(),
    '\n' when s.Skip('\r') => new NewLine(),
    '\n' or '\r' => new NewLine(),
    var c => throw new ArgumentException($"Unexpected character '{c}'")
};

We make sure to handle both tabs and spaces, and the various newline conventions. A similar approach can be used when we add variables, functions and other keywords to the language. You may, however, already suspect that this process and its implementation will turn out to be rather tedious. There are tools for generating tokenizers, and even complete parsers, such as ANTLR or SableCC, or, one step further, Xtext with IDE generation. Such tools significantly ease scanner and parser implementations.
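Before moving on to parsing, here is a condensed, self-contained sketch of the scanner developed above. It is hypothetical in that it substitutes a plain string index for ICharStream and includes only the tokens needed for the demo, but running it on the expression from the introduction reproduces the token sequence shown there (plus the trailing End token):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

record Token;
interface IgnoredToken { }
record Plus : Token;
record Minus : Token;
record Times : Token;
record Division : Token;
record WhiteSpace : Token, IgnoredToken;
record End : Token;
record Number(string Value) : Token;

class Scanner
{
    private readonly string _s;
    private int _pos;
    public Scanner(string s) => _s = s;

    private char PeekCh() => _pos < _s.Length ? _s[_pos] : '\0';

    // Produce tokens, filtering out IgnoredToken, until End is reached.
    public IEnumerable<Token> GetTokens()
    {
        Token next;
        do
        {
            next = NextToken();
            if (next is not IgnoredToken)
                yield return next;
        } while (next is not End);
    }

    private Token NextToken()
    {
        if (_pos >= _s.Length)
            return new End();
        if (PeekCh() == '.' || char.IsDigit(PeekCh()))
            return NextNumber();
        return _s[_pos++] switch
        {
            '+' => new Plus(),
            '-' => new Minus(),
            '*' => new Times(),
            '/' => new Division(),
            ' ' or '\t' => new WhiteSpace(),
            var c => throw new ArgumentException($"Unexpected character '{c}'")
        };
    }

    // Digits plus at most one '.', which may also come first.
    private Number NextNumber()
    {
        var sb = new StringBuilder();
        bool dotted = false;
        while (char.IsDigit(PeekCh()) || (!dotted && (dotted = PeekCh() == '.')))
            sb.Append(_s[_pos++]);
        return new Number(sb.ToString());
    }
}

class Demo
{
    static void Main()
    {
        var names = new Scanner("1.0 * 20 + -3").GetTokens()
            .Select(t => t.GetType().Name);
        Console.WriteLine(string.Join(" ", names));
        // Number Times Number Plus Minus Number End
    }
}
```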

In the next post we discuss and implement a parser for our expression language.
