So once again I found myself looking at the documentation for the Python re module, trying to remember the difference between matches and matching groups and I saw a code snippet that was so beautiful that I wanted to break it down and share it. It’s a fully-functional mini-language tokenizer in only 41 lines of code. I’ll go through it line-by-line and explain what it’s doing and why it is so great below:
from typing import NamedTuple
import re
class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int
This snippet only has two dependencies, both from the core Python runtime. The reason for the re dependency is clear, but the choice to use NamedTuple isn’t quite as obvious. As a refresher, tuples are an immutable, ordered sequence of items. They’re a lightweight data structure when compared to a mutable collection like a list or dict, but since their values are indexed using slice notation, tuples that contain more than two or three distinct values quickly become unreadable:
>>> token = ('NUMBER', 42, 3, 1) # an anonymous token tuple
>>> print(token[2]) # print the line number (or is it the column number?, the value?)
3
But after creating the new Token NamedTuple, we get a lot of the benefits of collection classes without the overhead:
>>> token = Token('NUMBER', 42, 3, 1)
>>> token
Token(type='NUMBER', value=42, line=3, column=1)
>>> token.line
3
>>> token.value
42
So by line 8, there is a great foundation for our tokenizer to return parsed data efficiently.
def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
The tokenize function doesn’t start with executable code at all! Instead, it starts with data declarations that describe the tokens that will be parsed. This is a common pattern in well-written code: complex information is expressed in data, simple logic is expressed as code. Reading this, it’s clear what keywords the tokenizer recognizes and what patterns will be matched for each of the token types. And since they are simple assignment statements, adding new keywords and tokens is straightforward and unlikely to introduce new errors.
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
This is the first real executable code in this snippet, and it doesn’t appear until line 22! It uses a list comprehension to construct a truly gnarly regular expression. A newer programmer would be tempted to hand-code this regex, but if you were to do so, you would end up with this:
tok_regex = r'(?P<NUMBER>d+(.d*)?)|(?P<ASSIGN>:=)|(?P<END>;)|(?P<ID>[A-Za-z]+)|(?P<OP>[+-*/])|(?P<NEWLINE> )|(?P<SKIP>[ ]+)|(?P<MISMATCH>.)'
This expression is barely legible and maintainable. But add a couple of more tokens to your language (function calls anyone?) and it will turn into a huge plate of parenthesized spaghetti. Constructing the string dynamically makes this approach scale to languages of surprising complexing without compromising code legibility.
An important thing to note here is the use of named groups (?P<name>…). Note that the name of each group corresponds to the token type from the token_specification array. This is DRY at its best. Nothing is wasted, and the code is working with the language instead of against it.
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
Now we are in the meat of the tokenizer. The function is looping through all of the matches in the code parameter while tracking the kind of token (using the lastgroup property from the match object), the matched value, and the current line and column number (useful for error reporting).
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)
Now that all of the relevant data has been collected, all we need to do is do the per-token-type processing based on the token kind. On line 2 of this snippet, numbers are automatically coerced into the correct type based on the presence (or absence) of a decimal. On line 4, something very tricky is going on. The ‘ID’ rule matches sequences of alphabetic characters, but if the matched string happens to appear in the keywords set the token kind is set to the keyword name itself. This is a little sneaky, but very efficient.
The rest of the code ensures that whitespace is skipped, and it even recognizes malformed input and raises an exception! Note the strategic use of the continue statement in the whitespace handling on lines 8 and 10 to avoid the yield at the bottom of the loop.
If you’re not familiar with yield and Python generator functions, you might want to search for a primer on them, they are a very powerful tool to have in your programming toolbox. You might also want to check out the relevant Python Enhancement Proposal (PEP 255) to get the information direct from the source.
The rest of the sample code just shows the tokenizer in action on a simple code fragment:
statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''
for token in tokenize(statements):
    print(token)
Which produces:
Token(type='IF', value='IF', line=2, column=4) Token(type='ID', value='quantity', line=2, column=7) Token(type='THEN', value='THEN', line=2, column=16) Token(type='ID', value='total', line=3, column=8) Token(type='ASSIGN', value=':=', line=3, column=14) Token(type='ID', value='total', line=3, column=17) Token(type='OP', value='+', line=3, column=23) Token(type='ID', value='price', line=3, column=25) Token(type='OP', value='*', line=3, column=31) Token(type='ID', value='quantity', line=3, column=33) Token(type='END', value=';', line=3, column=41) Token(type='ID', value='tax', line=4, column=8) Token(type='ASSIGN', value=':=', line=4, column=12) Token(type='ID', value='price', line=4, column=15) Token(type='OP', value='*', line=4, column=21) Token(type='NUMBER', value=0.05, line=4, column=23) Token(type='END', value=';', line=4, column=27) Token(type='ENDIF', value='ENDIF', line=5, column=4) Token(type='END', value=';', line=5, column=9)
And that’s it! From here it is a pretty simple exercise to write a recursive-descent parser to implement this toy language, or even extend it if necessary. All from 41 lines of code! I couldn’t track down the author of this code fragment, but it is a master class in writing elegant, efficient, and most importantly maintainable code. Enjoy!
Leave a Reply