How to hack on WAQL-PP

This document is a guide that introduces the internal workings of the preprocessor and explains how it can be modified. It is intended for developers starting to work on the preprocessor itself and is divided into the following chapters.

The preprocessor source follows the usual directory structure of a Java project, so the directories themselves need no further explanation. However, a short description of the different packages and their meaning is in order.

  • org.antforge.waqlpp.core: This is where the central API classes are located. It defines the public interface to the preprocessor. Changes to these classes should be well thought through to keep a stable interface and not break compatibility. It is also the only package for which API documentation in the form of Javadoc pages is generated.
  • org.antforge.waqlpp.engine: The underlying implementations of the API classes and all other driving parts of the preprocessor are located here. Keep in mind that the user should never be forced to import classes from this package.
  • org.antforge.waqlpp.parser: The actual query parser which translates a textual query into an intermediate representation is located here. All the model classes of the intermediate representation like Node and Token are also part of this package.
  • org.antforge.waqlpp.printer: All predefined DataPrinter implementations are located in this package. The actual printer pipeline however is part of the preprocessor engine.
  • org.antforge.waqlpp.test: All test suites should be placed into this package. There is a common base class called TestBase available for easy construction of new test cases. This package is excluded from the binary distribution.

The following chapters will explain some of the implementation details and what to keep in mind when hacking on them.

1. About the parser and the WAQL grammar

The most important fact about the parser is that it is generated using a parser generator. We use JavaCC to accomplish this at compile time of the preprocessor. The build system takes care of (re)generating the source files when necessary, so don’t worry about having to call JavaCC yourself.

The grammar of a WAQL query is described using an EBNF-like notation in the Parser.jjt file. Originally, most of it was generated by converting the EBNF rules in the XQuery specification into the JavaCC syntax using a shell script. There are two different types of rules inside this file.

  1. Lexical rules: These rules describe the tokens (terminal symbols) of the grammar. The lexical analyzer (aka TokenManager) simply creates a stream of consecutive tokens and forms a singly linked list with them (see the sketch after this list).
  2. Parser rules: These rules describe which nodes (non-terminal symbols) are used to form the AST. This tree is the actual intermediate representation of the query after parsing has completed.
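
To make the first kind concrete, a lexical rule in JavaCC syntax could look roughly like the following sketch. The token names are taken from the Data Dependency rule shown later in this chapter, but the exact definitions in Parser.jjt may differ:

// Rough sketch of lexical rules in JavaCC syntax. The token names match
// those used by the Data Dependency rule below, but the actual
// definitions in Parser.jjt may differ.
TOKEN : {
    < T_DLR : "$" >
  | < T_CLB : "{" >
  | < T_CRB : "}" >
  | < DIGITS : (["0"-"9"])+ >
}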

Even though JavaCC folds those two rule types into one phase, it is best to think of them as being executed in two distinct phases. This helps to prevent misconceptions about when tokens are actually created. Also note that the XQuery grammar is mostly LL(1), with some parts being LL(2) or LL(3); the number in parentheses states how many tokens the parser has to look into the future to decide which alternative to take. This behavior is called lookahead. Actually, this leads to the most important lesson of this chapter: most of the time, tokens are generated before parser rules are even looked at.
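
To illustrate, a choice point that needs two tokens of lookahead could be written like this in JavaCC; the rule names here are invented for the sake of the example:

// Invented rules, for illustration only. LOOKAHEAD(2) tells the
// generated parser to inspect two upcoming tokens before committing to
// the first alternative -- by that time the lexical analyzer has
// already created those tokens.
void pExample() : {}{
    LOOKAHEAD(2) pFoo()
  | pBar()
}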

Let’s now take a look at the parser rule for a Data Dependency as an example of how these rules are written.

public SimpleNode pDataDependency() : {
    Token t = null;
}{
    <T_DLR> ( t = <DIGITS> )? <T_CLB> pPathExpr() <T_CRB>
    {
        Integer id = (t != null) ? Integer.parseInt(t.image) : null;
        jjtThis.jjtSetValue(id);
        tableDataDependency.addFirst(jjtThis);
        return jjtThis;
    }
}

Each such rule is transformed into a Java method with the signature specified in the first line. The actual EBNF rule appears in the fourth line; the names in angle brackets refer to different tokens. If the rule matches, the code block below it is executed. In this example the integer identifier is parsed and the generated node is added to a table that allows fast visiting of Data Dependency nodes.
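
Assuming the token names T_DLR, T_CLB and T_CRB stand for the characters "$", "{" and "}" (their names strongly suggest this, but verify against Parser.jjt), the rule matches Data Dependencies such as:

$1{/order/total}     (: <T_DLR> <DIGITS> <T_CLB> PathExpr <T_CRB> :)
${/order/total}      (: <DIGITS> is optional, so the identifier stays null :)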

The WAQL specific language constructs which extend the original XQuery syntax can be used in two different contexts, as an expression or as the content of an element or attribute. Those two contexts are reflected by the following enumeration and also stored as part of the AST node.

/** Types describing in which context a "WAQLExtension" is used. */
public static enum UsageType { AS_EXPR, AS_TEXT }
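
For illustration, and assuming the Data Dependency syntax sketched above, the two contexts could look roughly like this; treat the exact fragments as assumptions, not verified queries:

(: AS_EXPR: the Data Dependency is used as an ordinary expression :)
for $x in $1{//order} return $x

(: AS_TEXT: the Data Dependency appears as element content :)
<total>$1{//order/total}</total>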

To deal with the huge number of token types the WAQL grammar has to cope with, we use several different lexical states that define which lexical rules are active at a given position in the query. Usually these lexical states are only changed from within the lexical analyzer. However, there are some special cases in which the parser also changes the lexical state from within a parser rule. This is only possible because JavaCC folds those two phases into one. Extreme caution has to be taken when using this technique, because it easily introduces bugs in combination with lookahead. The method we use to change the state from within the parser is called chgState() and is defined at the top of the parser definition file.
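
The following hedged sketch shows both variants. The state, token and rule names are invented, and chgState() presumably delegates to JavaCC’s token_source.SwitchTo(); verify the details against the parser definition file:

// Invented state and token names, for illustration only. A lexical rule
// can switch the lexical state as part of matching a token:
<DEFAULT> TOKEN : {
    < T_LT : "<" > : IN_CONTENT
}

// A parser rule can also switch the state explicitly via chgState().
// Note that any token already fetched for lookahead was produced under
// the old state, which is exactly where the subtle bugs come from.
void pExample() : {}{
    <T_LT> { chgState(IN_CONTENT); }
}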

To learn more about JavaCC or grammars in general, have a look at the JavaCC documentation and the grammar section of the XQuery specification.

2. Testing the parser with XQTS

The first prototype of the parser was finished quickly; unfortunately, bugs kept popping up and we were in desperate need of a test suite for the parser. Then we stumbled upon XQTS, a test suite intended for XQuery engines. There is an XQTSHarness available in the test package which tries to parse the queries in this suite with our parser and collects the results.

At the time of writing of this guide we use XQTS-1.0.3 and execute a total of 6012 queries from the suite. This helped us reduce the number of incorrectly parsed queries from initially 4247 down to 3, when excluding XPath 2.0 and XQuery 1.1 extensions. So the main lesson of this chapter is: whenever you change something in the parser, run the XQTS harness and look for regressions.

The harness needs to find the test suite and will run all the queries directly out of the ZIP file; you don’t even need to unpack it. To download the test suite or learn more about it, visit the XQTS homepage.
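
How exactly the harness is launched depends on your build setup; as an unverified sketch, an invocation could look like the following. The classpath and the archive file name are assumptions:

java -cp build/classes org.antforge.waqlpp.test.XQTSHarness XQTS_1_0_3.zip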

3. On modifying the token stream

So far we have only looked at the parser, but the actual transformation of a WAQL query is done by the engine. Most of the time we say that the intermediate representation is modified by the engine. However, this is only partially true: what is actually modified is the token stream generated by the lexical analyzer.

We enable token tracking for the parser, which links each node in the AST to a portion of the token stream. These links are available through the node’s jjtGetFirstToken() and jjtGetLastToken() methods. Note that we also track special tokens like whitespace characters, line breaks and comments, because we want to be able to reconstruct the original WAQL query without losing them.
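
To see what these links buy us, here is a minimal sketch of how the text covered by a node could be reconstructed. It assumes JavaCC’s default convention of chaining special tokens through the specialToken field; the real engine code may differ:

/** Minimal sketch: rebuilds the text covered by a node, including
 *  special tokens (whitespace, comments), assuming JavaCC's default
 *  specialToken chaining. The actual engine code may differ. */
private static String textOf(SimpleNode node) {
    StringBuilder sb = new StringBuilder();
    Token end = node.jjtGetLastToken();
    for (Token t = node.jjtGetFirstToken(); t != null; t = t.next) {
        // Walk back to the earliest special token preceding t, then
        // emit them in order, so whitespace and comments are not lost.
        Token s = t.specialToken;
        while (s != null && s.specialToken != null) s = s.specialToken;
        for (; s != null; s = s.next) sb.append(s.image);

        sb.append(t.image);
        if (t == end) break;
    }
    return sb.toString();
}

With this picture in mind, the following is an actual example of how the token stream is modified by the engine.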

/**
 * Replaces the token stream for the given node. This actually changes the
 * textual representation for the given node. However the abstract syntax
 * tree remains unchanged.
 * @param node The node for which to replace tokens.
 * @param replacement The textual representation of the replacement.
 */
private static void replaceNode(SimpleNode node, String replacement) {
    // Reuse the node's first token as the carrier of the replacement text.
    Token token = node.jjtGetFirstToken();
    token.kind = -1;                          // mark as an artificial token
    token.image = replacement;
    token.next = node.jjtGetLastToken().next; // unlink all covered tokens

    // Ancestors ending on the node's last token must be relinked as well.
    for (SimpleNode parent = (SimpleNode) node.jjtGetParent(); parent != null;
            parent = (SimpleNode) parent.jjtGetParent()) {
        if (parent.jjtGetLastToken() != node.jjtGetLastToken())
            break;
        parent.jjtSetLastToken(token);
    }
    node.jjtSetLastToken(token);
}

Here you can see that the token stream is modified by changing the token attributes, while the AST itself remains unchanged; only the links into the token stream are updated accordingly. The main lesson of this chapter is: the intermediate representation consists of an AST and a token stream, and only the latter is modified.

Having read this guide, you have caught a glimpse of some prominent implementation details of the preprocessor. Hopefully this helps you get to know the implementation on your own. Have fun hacking!