  # Hack book of `Hoa\Compiler`

Compilers allow us to analyze and manipulate textual data. There are numerous use cases. `Hoa\Compiler` offers a way to manipulate several compilers based on needs.

## Introduction

What is conceived well is expressed clearly, and the right words come easily.

A language is a way to express or to formulate a solution to a problem. And there exists a lot of problems. We read and write in several languages daily, and some of these languages are understood by machines. This operation is possible thanks to compilers.

Formal language theory covers the automatic analysis of those languages through tools like automata or grammars. A detailed explanation would be necessary to understand all these concepts well; however, we are going to popularize them as much as needed to allow a correct usage of `Hoa\Compiler`.

### Language and grammar

A language is a set of words. Each word is a sequence of symbols belonging to an alphabet. A symbol represents the smallest lexical unit of a language; it is atomic and we will call it a lexeme (also often mentioned as a token). These sequences of lexemes representing words are built with rules. From a word and a root rule, we will try to derive its sub-rules. If a derivation exists, then the word is considered valid, else it is considered invalid. We also speak about matching words. For example, if we consider the following rules:

```
    exp ::= exp + exp
          | number
 number ::= digit number
          | digit
  digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

The word that we are going to match is `7 + 35`. The root rule is `exp`. If we derive (from the left to the right and from the top to the bottom), we can have `exp + exp` or `number` (the disjunction, i.e. the “or”, is represented by the “`|`” symbol):

```
exp + exp | number
→ exp + exp
→ ( exp + exp | number ) + exp
→ number + exp
→ ( digit number | digit ) + exp
```

We continue to derive until we have eliminated every rule, in order to have only lexemes:

```
…
→ ( digit number | digit ) + exp
→ digit + exp
→ ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) + exp
→ 7 + exp
→ 7 + ( exp + exp | number )
→ 7 + number
→ 7 + ( digit number | digit )
→ 7 + digit number
→ 7 + ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) number
→ 7 + 3 number
→ 7 + 3 ( digit number | digit )
→ 7 + 3 digit
→ 7 + 3 ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
→ 7 + 3 5
```

A derivation exists to match the word `7 + 35`, thus it is a valid word for those rules.

A set of rules is called a grammar. And so, a grammar represents a language!

However, there are different kinds of grammars. In 1956, the Chomsky hierarchy was formulated, classifying grammars into four levels:

1. unrestricted grammars, matching languages known as Turing languages; rules have no restriction,
2. context-sensitive grammars, matching contextual languages,
3. context-free grammars, matching algebraic languages, based on pushdown automata,
4. regular grammars, matching regular languages.

Each level includes the next one. `Hoa\Compiler` only treats languages defined by grammars of levels 3 and 4. To give a quick idea, regular grammars can be related to regular expressions (like PCRE), well known to developers. But regular grammars are not able to match a pair of symbols (like parentheses, braces or quotes), while algebraic languages allow that (thanks to the concept of a stack of lexemes).
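To make this difference concrete, here is a short sketch, independent of `Hoa\Compiler` (the `isBalanced` function is a hypothetical helper), showing that matching paired symbols requires a counter acting as a stack, which a plain regular language cannot provide:

```php
// Check that every opening parenthesis has a matching closing one.
// The counter plays the role of a stack reduced to its depth.
function isBalanced($word)
{
    $depth = 0;

    foreach (str_split($word) as $symbol) {
        if ('(' === $symbol) {
            ++$depth;          // “Push” an opening symbol.
        } elseif (')' === $symbol && --$depth < 0) {
            return false;      // A closing symbol without its opening pair.
        }
    }

    return 0 === $depth;       // Every opening symbol has been closed.
}

var_dump(isBalanced('(()())')); // bool(true)
var_dump(isBalanced('(()'));    // bool(false)
```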

### Matching words

In general, the compilation process begins with two analyses: lexical and syntactic. A lexical analysis splits a word into a sequence of lexemes. This sequence will thereafter be used by the syntactic analyzer in order to verify that the word belongs to the language.

According to the grammar, the matching will not be done in the same way, but the principle stays the same: using lexemes one after the other in the sequence, and verifying that they allow going forward in the derivation of the grammar's rules.

Syntactic analyzers are also classified into categories: LL, LR, LALR etc. `Hoa\Compiler` only offers LL syntactic analyzers, which stands for Left-to-right Leftmost derivation, i.e. from the topmost rule to the deepest, with rules derived from the left to the right. Here again, there are sub-categories, two of which are supported by `Hoa\Compiler`: LL(1) and LL(*). Globally, we speak about LL(k) syntactic analyzers: if a lexeme does not allow a rule to be derived as expected, then the analyzer is allowed to go k steps backward; we also speak about backtracking. Put another way, rules can be ambiguous: each time we derive a rule of the grammar, we have several possible choices and the analyzer can be wrong; this is why it sometimes needs to backtrack. The k variable defines the ambiguity level. If a grammar can be analyzed by an LL(1) syntactic analyzer, it is known as unambiguous: each time a lexeme is used to derive our rules, there is only one choice. And if we have an LL(*) syntactic analyzer, it means that the k variable is undefined. The following example illustrates an unambiguous grammar:

```
rule ::= a b c | d e f
```

And this example illustrates an ambiguous grammar:

```
rule1 ::= a rule2
rule2 ::= b rule3 | b rule4
rule3 ::= c d
rule4 ::= e f
```

Let's see what happens when trying to find a derivation for the word `abef` from the root rule `rule1`:

```
rule1
→ a rule2                    a    bef  ✔
→ a (b rule3 | b rule4)      a    bef
→ a b rule3                  ab   ef   ✔
→ a b c d                    abe  f    ✖
← a b rule3                  ab   ef   ✖
← a (b rule3 | b rule4)      a    bef
→ a b rule4                  ab   ef   ✔
→ a b e f                    abef       ✔
```

The `rule2` rule is ambiguous, which can lead to a bad derivation and consequently a backtrack.
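To illustrate the backtracking mechanism outside of `Hoa\Compiler`, here is a minimal sketch of this derivation in PHP (the `rule1`, `rule2` and `matchSequence` functions are hypothetical helpers, not part of the library): each alternative of `rule2` is tried in order, and the cursor position is simply restored when an alternative fails.

```php
// Try to consume every lexeme of $sequence from position $i of $word;
// return the new position on success, or false on failure.
function matchSequence($word, $i, $sequence)
{
    foreach (str_split($sequence) as $lexeme) {
        if (!isset($word[$i]) || $word[$i] !== $lexeme) {
            return false;
        }

        ++$i;
    }

    return $i;
}

function rule2($word, $i)
{
    // First choice: b rule3, i.e. the sequence “bcd”.
    if (false !== ($j = matchSequence($word, $i, 'bcd'))) {
        return $j;
    }

    // Backtrack: the cursor is still $i; second choice: b rule4, i.e. “bef”.
    return matchSequence($word, $i, 'bef');
}

function rule1($word)
{
    $i = matchSequence($word, 0, 'a');

    if (false === $i) {
        return false;
    }

    $i = rule2($word, $i);

    // The whole word must have been consumed.
    return false !== $i && $i === strlen($word);
}

var_dump(rule1('abef')); // bool(true)
var_dump(rule1('abcf')); // bool(false)
```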

## LL(k) compiler-compiler

Writing compilers is a laborious task. It is not always difficult, but most of the time it is repetitive and it takes time. This is why there exist compiler-compilers, in other words, compiler generators. Most of the time, these compiler-compilers use an intermediary language to write a grammar. The `Hoa\Compiler` library provides the `Hoa\Compiler\Llk\Llk` class that allows the writing of compiler-compilers through a dedicated language.

### PP language

The PP language, standing for PHP Parser, allows expressing algebraic grammars. It is written in files having the `.pp` extension (please see the `hoa://Library/Compiler/.Mime` file).

A grammar is composed of lexemes and rules. The declaration of a lexeme has the following syntax: `%token namespace_in:name value -> namespace_out`, where `name` represents the name of the lexeme, `value` represents its value, expressed in the PCRE format (take care not to match an empty value, in which case an exception will be thrown), and `namespace_in` and `namespace_out` represent namespace names and are optional (the default name is `default`). For example, a `number` lexeme describing a number composed of digits from `0` to `9`:

```
%token number \d+
```

Namespaces represent disjoint subsets of lexemes, used to ease analyses. A `%skip` declaration is similar to `%token` except that it represents a lexeme to skip, that is, not to consider. A common example of a `%skip` lexeme is whitespace:

```
%skip space \s
```
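As a rough sketch of what a lexical analyzer does with such declarations (this is not Hoa's implementation; the `lex` function and the token table layout are assumptions for illustration), each lexeme value is a PCRE anchored at the current offset, and skipped lexemes are matched but not emitted:

```php
$tokens = [
    // name   => [PCRE,  emit?]
    'number'  => ['\d+', true],
    'plus'    => ['\+',  true],
    'space'   => ['\s+', false],   // Would be declared with %skip.
];

function lex($data, array $tokens)
{
    $sequence = [];
    $offset   = 0;
    $length   = strlen($data);

    while ($offset < $length) {
        foreach ($tokens as $name => $token) {
            list($pcre, $emit) = $token;

            // \G anchors the match exactly at the current offset.
            if (1 === preg_match('#\G(?:' . $pcre . ')#', $data, $m, 0, $offset)) {
                if (true === $emit) {
                    // Each entry: [name, value, offset].
                    $sequence[] = [$name, $m[0], $offset];
                }

                $offset += strlen($m[0]);

                continue 2;
            }
        }

        throw new RuntimeException('Unrecognized token at offset ' . $offset);
    }

    return $sequence;
}

print_r(lex('7 + 35', $tokens)); // Spaces are consumed but never emitted.
```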

To explain rules, we will use the `Json.pp` grammar as an example, which is a slightly simplified version of the JSON language (please see RFC 4627). The complete grammar is located in the `hoa://Library/Json/Grammar.pp` file. Thus:

```
%skip   space          \s
// Scalars.
%token  true           true
%token  false          false
%token  null           null
// Strings.
%token  quote_         "        -> string
%token  string:string  [^"]+
%token  string:_quote  "        -> default
// Objects.
%token  brace_         {
%token _brace          }
// Arrays.
%token  bracket_       \[
%token _bracket        \]
// Rest.
%token  colon          :
%token  comma          ,
%token  number         \d+

value:
    <true> | <false> | <null> | string() | object() | array() | number()

string:
    ::quote_:: <string> ::_quote::

number:
    <number>

#object:
    ::brace_:: pair() ( ::comma:: pair() )* ::_brace::

#pair:
    string() ::colon:: value()

#array:
    ::bracket_:: value() ( ::comma:: value() )* ::_bracket::
```

Notice that we have two namespaces of lexemes: `default` and `string` (this avoids confusion between the `quote_` and `string:_quote` lexemes that have the same representation). We also notice the `value` rule, which is a disjunction of several lexemes and rules. The constructions of the PP language are the following:

• `rule()` to call a rule,
• `<token>` and `::token::` to declare a lexeme (namespaces do not appear here),
• `|` for a disjunction (a choice),
• `(…)` for a group,
• `e?` to say that `e` is optional,
• `e+` to say that `e` can be present 1 or more times,
• `e*` to say that `e` can be present 0 or more times,
• `e{x,y}` to say that `e` can be present between `x` and `y` times,
• `#node` to create a node `node` in the resulting tree,
• `token[i]` to unify lexemes among each other.

Few constructions but largely enough.

Finally, the grammar of the PP language is written… with the PP language! You can find it in the `hoa://Library/Compiler/Llk/Llk.pp` file.

### Compilation process

The compilation process used by `Hoa\Compiler\Llk\Llk` is classical: it starts by lexically analyzing the textual data, the word, i.e. transforming our data into a sequence of lexemes. The declaration order of the lexemes is primordial, because the lexical analyzer will try them one after the other. Next, the syntactic analyzer comes into play in order to match our data.

If the syntactic analysis is successful, we obtain a trace. This trace can be transformed into an AST, which stands for Abstract Syntax Tree. This tree represents our textual data after the analysis. One advantage is that it can be visited (we will detail this part later), which allows us to add new constraints which cannot be expressed in the grammar, such as type verification.

Let's manipulate `Hoa\Compiler\Llk\Llk` a little bit. This class is a helper to easily read a grammar expressed in the PP format. Its `load` static method takes only one argument, an input stream that points to the grammar, and returns a `Hoa\Compiler\Llk\Parser` compiler ready to use. On this compiler, we will call the `Hoa\Compiler\Llk\Parser::parse` method to analyze some JSON data. If the data is correct, we will get an AST, otherwise an exception will be thrown. Finally, we will use the `Hoa\Compiler\Visitor\Dump` visitor to print our AST:

```php
// 1. Load grammar.
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 2. Parse a data.
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');

// 3. Dump the AST.
$dump     = new Hoa\Compiler\Visitor\Dump();
echo $dump->visit($ast);

/**
 * Will output:
 *     >  #object
 *     >  >  #pair
 *     >  >  >  token(string:string, foo)
 *     >  >  >  token(true, true)
 *     >  >  #pair
 *     >  >  >  token(string:string, bar)
 *     >  >  >  #array
 *     >  >  >  >  token(null, null)
 *     >  >  >  >  token(number, 42)
 */
```

When we write and test a grammar, we will repeat these three tasks very often. This is why the `hoa` script provides the `compiler:pp` command. This command proposes to analyze a data based on a grammar and to apply a visitor if needed on the resulting AST. Notice that the grammar can be read through a pipe:

```
$ echo '[1, [1, [2, 3], 5], 8]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #array
>  >  token(number, 1)
>  >  #array
>  >  >  token(number, 1)
>  >  >  #array
>  >  >  >  token(number, 2)
>  >  >  >  token(number, 3)
>  >  >  token(number, 5)
>  >  token(number, 8)
```

This is a useful way to quickly test a data against a grammar. Do not hesitate to read the help of the `compiler:pp` command to get more details!

The analysis starts at the root rule, but we can use another rule thanks to the second argument of the `Hoa\Compiler\Llk\Parser::parse` method:

```php
$compiler->parse('{"foo": true, "bar": [null, 42]}', 'object');
```

To use the root rule, we have to use `null`.

And finally, to not generate the AST but only know whether a data is valid or not, we can use the last argument of this method by passing `false`:

```php
$valid = $compiler->parse('{"foo": true, "bar": [null, 42]}', null, false);
var_dump($valid);

/**
 * Will output:
 *     bool(true)
 */
```

#### Errors

The compilation errors are represented by exceptions, thus:

```
$ echo '{"foo" true}' | hoa compiler:pp Json.pp 0 --visitor dump
Uncaught exception (Hoa\Compiler\Exception\UnexpectedToken):
Hoa\Compiler\Llk\Parser::parse(): (0) Unexpected token "true" (true) at line 1
and column 8:
{"foo" true}
       ↑
in hoa://Library/Compiler/Llk/Parser.php at line 1
```

Several exceptions can be thrown according to the context:

• during the lexical analysis, `Hoa\Compiler\Exception\UnrecognizedToken` when a lexeme is not recognized, i.e. when the textual data cannot be split into a sequence of lexemes, and `Hoa\Compiler\Exception\Lexer` when other more general errors happen, for example if a lexeme matches an empty value,
• during the syntactic analysis, `Hoa\Compiler\Exception\UnexpectedToken` when a lexeme is not expected, i.e. when it does not allow the grammar's rules to be derived.

The parent exception is `Hoa\Compiler\Exception\Exception`.

#### Nodes

The compilation process very often ends with the production of an AST. It is important to control its form, its size, the data it holds etc. This is why it is necessary to understand the `#node` notation, because it allows creating nodes in the AST. First, lexemes declared with the `<token>` syntax will appear in the tree, while the other lexemes, declared with the `::token::` syntax, will not. Indeed, in our previous example, the `quote_`, `brace_`, `colon`, `comma` etc. lexemes do not appear. Next, it is important to notice that the declaration of nodes can be overwritten inside a same rule. Finally, a rule name can be preceded by `#`, just like the `#array` rule, which allows defining a default node, but it can be overwritten too. For example, if we replace the `array` rule by:

```
#array:
    ::bracket_:: value() ( ::comma:: value() #bigarray )* ::_bracket::
```

If the array contains only one value, the node will be named `#array`, else it will be named `#bigarray`. Let's see:

```
$ echo '[42]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #array
>  >  token(number, 42)
$ echo '[4, 2]' | hoa compiler:pp Json.pp 0 --visitor dump
>  #bigarray
>  >  token(number, 4)
>  >  token(number, 2)
```

Of course, a node can be created or not based on the derivation of a rule. The mechanism is normally fairly intuitive.

#### Namespaces

Let's detail a little bit the behavior of the lexical analyzer regarding the namespaces.

Lexemes are put in namespaces, which means that only the lexemes of the current namespace are used by the lexical analyzer. The default namespace is `default`. To declare the namespace of a lexeme, we use the `:` operator. When a lexeme is consumed, it can change the current namespace for the rest of the lexical analysis, thanks to the `->` operator. Thus, we will declare five lexemes: `foo` and `bar` in the `default` namespace, `baz` in `ns1`, and `qux` and `foo` in `ns2`. The fact that `foo` is declared twice is not a problem because the two declarations are put in different namespaces:

```
%token      foo   fo+     -> ns1
%token      bar   ba?r+   -> ns2
%token  ns1:baz   ba?z+   -> default
%token  ns2:qux   qux+
%token  ns2:foo   FOO     -> ns1
```

Writing `default:bar` is strictly equivalent to `bar`. The `foo` lexeme leads into `ns1`, `bar` into `ns2`, `ns1:baz` into `default`, `ns2:qux` stays in `ns2` and `ns2:foo` leads into `ns1`. Let's observe the sequence of lexemes produced by the lexical analyzer with the following data `fooooobzzbarrrquxFOObaz`:

```php
$pp = '%token      foo   fo+     -> ns1'     . "\n" .
      '%token      bar   ba?r+   -> ns2'     . "\n" .
      '%token  ns1:baz   ba?z+   -> default' . "\n" .
      '%token  ns2:qux   qux+'               . "\n" .
      '%token  ns2:foo   FOO     -> ns1';

// 1. Parse PP.
Hoa\Compiler\Llk\Llk::parsePP($pp, $tokens, $rawRules);

// 2. Run the lexical analyzer.
$lexer    = new Hoa\Compiler\Llk\Lexer();
$sequence = $lexer->lexMe('fooooobzzbarrrquxFOObaz', $tokens);

// 3. Pretty-print the result.
$format   = '%' . (strlen((string) count($sequence)) + 1) . 's  ' .
            '%-13s %-20s  %-20s  %6s' . "\n";
$header   = sprintf($format, '#', 'namespace', 'token name', 'token value', 'offset');
// Print the header and a separator line.
echo $header, str_repeat('-', strlen($header) - 1), "\n";

foreach ($sequence as $i => $token) {
    printf(
        $format,
        $i,
        $token['namespace'],
        $token['token'],
        $token['value'],
        $token['offset']
    );
}

/**
 * Will output:
 *      #  namespace     token name            token value           offset
 *     ---------------------------------------------------------------------
 *      0  default       foo                   fooooo                     0
 *      1  ns1           baz                   bzz                        6
 *      2  default       bar                   barrr                      9
 *      3  ns2           qux                   qux                       14
 *      4  ns2           foo                   FOO                       17
 *      5  ns1           baz                   baz                       20
 *      6  default       EOF                   EOF                       23
 */
```

We read the lexemes, their namespace and their value in the table. The data has been successfully split, and we have jumped from one namespace to another. If this time we try with the `foqux` data, we will get an error: `fo` is matched by the `foo` lexeme in the `default` namespace, we then jump to the `ns1` namespace, and there, no lexeme of this namespace is able to recognize at least `q`. An error is thrown.

So far, we have seen how to jump from one namespace to another with the `->` operator. No history of the used namespaces is stored. However, in some rare cases, it is possible for a lexeme to be accessible through several namespaces, and we would like a lexeme to jump back to the previous namespace. Put another way, we would like a history of the used namespaces and to be able to navigate (backward only) in it. This is the role of the `__shift__ * n` keyword, used in conjunction with the `->` operator as usual. `__shift__` means: jump back to the previous namespace. `__shift__` is equivalent to `__shift__ * 1`, and `__shift__ * n` means: jump back `n` times to the previous namespace.

When the `__shift__` keyword appears in the grammar, namespaces are managed like a stack, hence the shift vocabulary. We have to take care to shift namespaces regularly in order to avoid a huge stack.
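The stack behavior can be sketched in a few lines of plain PHP (hypothetical helpers, not the library's API): entering a namespace pushes the current one, and `__shift__ * n` pops `n` entries:

```php
$stack     = [];
$namespace = 'default';

// '-> ns': remember where we come from, then switch.
$enter = function ($target) use (&$stack, &$namespace) {
    $stack[]   = $namespace;
    $namespace = $target;
};

// '-> __shift__ * n': pop n namespaces from the history.
$shift = function ($n = 1) use (&$stack, &$namespace) {
    for ($i = 0; $i < $n; ++$i) {
        $namespace = array_pop($stack);
    }
};

$enter('ns1');        // A lexeme declared with “-> ns1” is matched.
$enter('ns3');        // Then one declared with “-> ns3”.
$shift();             // Then one declared with “-> __shift__”.
var_dump($namespace); // string(3) "ns1"
```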

Let's take the example of the following grammar:

```
%token       foo1  a    -> ns1
%token       foo2  x    -> ns2
%token       end   e

%token   ns1:bar   b
%token   ns1:baz   c    -> ns3
%token   ns1:tada  t    -> __shift__

%token   ns2:bar   y
%token   ns2:baz   z    -> ns3
%token   ns2:tada  T    -> __shift__

%token   ns3:qux   =    -> __shift__

#test:
    ( <foo1> | <foo2> ) <bar> <baz> <qux> <tada> <end>
```

This grammar recognizes two words: `abc=te` and `xyz=Te`. If the first matched lexeme is `foo1`, the lexical analyzer will jump to the `ns1` namespace. When it matches the `ns1:baz` lexeme, it will jump into `ns3`. Then, when it matches `ns3:qux`, it will jump back to the previous namespace, namely `ns1`. And so on. Now, if it does not match `foo1` but `foo2`, it will jump to the `ns2` namespace, then into `ns3` and then into `ns2` again. The `__shift__` keywords in the `ns1:tada` and `ns2:tada` declarations are only used to shift namespaces; no ambiguity exists: these namespaces are accessible from only one namespace, namely `default`.

Now, let's save this grammar in the `NamespaceStack.pp` file and use the `-s/--token-sequence` option of the ```hoa compiler:pp``` command:

```
$ echo -n 'abc=te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
#  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
0  default       foo1                  a                                    0
1  ns1           bar                   b                                    1
2  ns1           baz                   c                                    2
3  ns3           qux                   =                                    3
4  ns1           tada                  t                                    4
5  default       end                   e                                    5
6  default       EOF                   EOF                                  6

$ echo -n 'xyz=Te' | hoa compiler:pp NamespaceStack.pp 0 --token-sequence
#  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
0  default       foo2                  x                                    0
1  ns2           bar                   y                                    1
2  ns2           baz                   z                                    2
3  ns3           qux                   =                                    3
4  ns2           tada                  T                                    4
5  default       end                   e                                    5
6  default       EOF                   EOF                                  6
```

We see that the lexical analyzer has successfully jumped between all namespaces, as expected. We had two ways to access to the `ns3` namespace: either from `ns1` or from `ns2`. The analyzer has succeeded to create a history of namespaces and navigate in it.

#### Unification

A new feature brought by the PP language compared to other grammar languages is the capability to express the unification of lexemes. Let's imagine the following grammar:

```
%token  quote   '|"
%token  string  \w+

rule:
    ::quote:: <string> ::quote::
```

The quotes surrounding the string can be of two kinds: simple, with the “`'`” symbol, or double, with the “`"`” symbol. Thus, the `"foo"` and `'foo'` data are valid, but also `"foo'` and `'foo"`!

The unification of lexemes allows adding an additional constraint on the value of lexemes at runtime. The syntax is the following: `token[i]`. The value of `i` indicates which lexemes will have the same value. And finally, the unification is local to an instance of a rule, which means that there is no unification between lexemes of different rules, and the unification is only applied on the current rule. Thus, the example becomes:

```
rule:
    ::quote[0]:: <string> ::quote[0]::
```

Which invalidates the `"foo'` and `'foo"` data.
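Outside of the compiler, the effect of this unification can be sketched with a plain PCRE match followed by an equality check between the two captured quotes (the `matchesUnifiedString` helper is hypothetical):

```php
// Match quote, string, quote, then “unify”: both captured quotes
// must hold the same value.
function matchesUnifiedString($data)
{
    // quote = '|" and string = \w+, as in the grammar above.
    if (1 !== preg_match('#^(\'|")(\w+)(\'|")$#', $data, $m)) {
        return false;
    }

    return $m[1] === $m[3];
}

var_dump(matchesUnifiedString('"foo"'));  // bool(true)
var_dump(matchesUnifiedString('"foo\'')); // bool(false)
```

Note that in raw PCRE the same constraint could be written directly with a backreference (`\1`); the PP unification generalizes this idea at the grammar level.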

Unification finds numerous applications, such as matching the names of opening and closing tags in the XML language.

### Abstract Syntax Tree

An abstract syntax tree represents a textual data in a structural form. Each node of this tree is represented by the `Hoa\Compiler\Llk\TreeNode` class. Among the most useful methods, we find:

• `getId` to get the identifier of the node,
• `getValueToken` to get the name of the lexeme,
• `getValueValue` to get the value of the lexeme,
• `isToken` to know whether the node represents a lexeme,
• `getChild(\$i)` to get the `\$i`-th child of the node,
• `getChildren` to get all the children,
• `getChildrenNumber` to count the number of children,
• `getParent` to get the parent node,
• `getData` to get a reference to a data array,
• `accept` to apply the visitor.

Visitors are the simplest way to traverse an AST. As an example, let's write a `PrettyPrinter` visitor which will rewrite a JSON data with our own conventions (spacing, line breaking etc.). A visitor must implement the `Hoa\Visitor\Visit` interface and has only one method to implement: `visit`, with three arguments: the element and two optional arguments (by reference and by copy respectively). Let's see:

```php
class PrettyPrinter implements Hoa\Visitor\Visit
{
    public function visit(
        Hoa\Visitor\Element $element,
        &$handle = null,
        $eldnah = null
    ) {
        static $i      = 0;
        static $indent = '    ';

        // One behaviour per node in the AST.
        switch ($element->getId()) {
            // Object: { … }.
            case '#object':
                echo '{', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), '}';

                break;

            // Array: [ … ].
            case '#array':
                echo '[', "\n";
                ++$i;

                foreach ($element->getChildren() as $e => $child) {
                    if (0 < $e) {
                        echo ',', "\n";
                    }

                    echo str_repeat($indent, $i);
                    $child->accept($this, $handle, $eldnah);
                }

                echo "\n", str_repeat($indent, --$i), ']';

                break;

            // Pair: "…": ….
            case '#pair':
                echo
                    $element->getChild(0)->accept($this, $handle, $eldnah),
                    ': ',
                    $element->getChild(1)->accept($this, $handle, $eldnah);

                break;

            // Many tokens.
            case 'token':
                switch ($element->getValueToken()) {
                    // String: "…".
                    case 'string':
                        echo '"', $element->getValueValue(), '"';

                        break;

                    // Booleans.
                    case 'true':
                    case 'false':

                    // Null.
                    case 'null':

                    // Number.
                    case 'number':
                        echo $element->getValueValue();

                        break;
                }

                break;
        }
    }
}
```

We are going to see an example:

```php
$compiler    = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast         = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$prettyPrint = new PrettyPrinter();
echo $prettyPrint->visit($ast);

/**
 * Will output:
 *     {
 *         "foo": true,
 *         "bar": [
 *             null,
 *             42
 *         ]
 *     }
 */
```

The `getData` method is very useful to store data that are likely to be reused, for example from one visitor to another. This method returns a reference to an array; thus:

```php
$data = $element->getData();

if (!isset($data['previousComputing'])) {
    throw new Exception('Need a previous computing.', 0);
}

$previous = $data['previousComputing'];
```

It is common to use one visitor per constraint: verification of symbols, verification of types etc. Some of them can set data for the next ones. The `getData` method does not require any particular data structure; it only provides access to an array. Feel free to do what you want with it.

Using `Hoa\Compiler\Llk\TreeNode` is trivial, and we will use it most of the time through a visitor.

### Traces

The LL(k) compiler provided by Hoa is clearly composed of layers. Each layer can be used with maximal modularity and extensibility. When the grammar has been translated into “machine rules” and the lexical and syntactic analyzers have validated a data, a trace is computed. This trace is a flat array containing only three kinds of classes:

• `Hoa\Compiler\Llk\Rule\Entry` when the syntactic analyzer has opened a rule,
• `Hoa\Compiler\Llk\Rule\Ekzit` when the syntactic analyzer has closed a rule,
• `Hoa\Compiler\Llk\Rule\Token` when the syntactic analyzer has encountered a lexeme.

We can get the trace with the `Hoa\Compiler\Llk\Parser::getTrace` method. To understand this trace, we will start with an example:

```php
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$ast      = $compiler->parse('{"foo": true, "bar": [null, 42]}');
$i        = 0;

foreach ($compiler->getTrace() as $element) {
    if ($element instanceof Hoa\Compiler\Llk\Rule\Entry) {
        echo str_repeat('>   ', ++$i), 'enter ', $element->getRule(), "\n";
    } elseif ($element instanceof Hoa\Compiler\Llk\Rule\Token) {
        echo
            str_repeat('    ', $i + 1), 'token ', $element->getTokenName(),
            ', consumed ', $element->getValue(), "\n";
    } else {
        echo str_repeat('<   ', $i--), 'ekzit ', $element->getRule(), "\n";
    }
}

/**
 * Will output:
 *     >   enter value
 *     >   >   enter object
 *                 token brace_, consumed {
 *     >   >   >   enter pair
 *     >   >   >   >   enter string
 *                         token quote_, consumed "
 *                         token string, consumed foo
 *                         token _quote, consumed "
 *     <   <   <   <   ekzit string
 *                     token colon, consumed :
 *     >   >   >   >   enter value
 *                         token true, consumed true
 *     <   <   <   <   ekzit value
 *     <   <   <   ekzit pair
 *     >   >   >   enter 13
 *     >   >   >   >   enter 12
 *                         token comma, consumed ,
 *     >   >   >   >   >   enter pair
 *     >   >   >   >   >   >   enter string
 *                                 token quote_, consumed "
 *                                 token string, consumed bar
 *                                 token _quote, consumed "
 *     <   <   <   <   <   <   ekzit string
 *                             token colon, consumed :
 *     >   >   >   >   >   >   enter value
 *     >   >   >   >   >   >   >   enter array
 *                                     token bracket_, consumed [
 *     >   >   >   >   >   >   >   >   enter value
 *                                         token null, consumed null
 *     <   <   <   <   <   <   <   <   ekzit value
 *     >   >   >   >   >   >   >   >   enter 21
 *     >   >   >   >   >   >   >   >   >   enter 20
 *                                             token comma, consumed ,
 *     >   >   >   >   >   >   >   >   >   >   enter value
 *                                                 token number, consumed 42
 *     <   <   <   <   <   <   <   <   <   <   ekzit value
 *     <   <   <   <   <   <   <   <   <   ekzit 20
 *     <   <   <   <   <   <   <   <   ekzit 21
 *                                     token _bracket, consumed ]
 *     <   <   <   <   <   <   <   ekzit array
 *     <   <   <   <   <   <   ekzit value
 *     <   <   <   <   <   ekzit pair
 *     <   <   <   <   ekzit 12
 *     <   <   <   ekzit 13
 *                 token _brace, consumed }
 *     <   <   ekzit object
 *     <   ekzit value
 */
```

This example reveals several things. First, the information given by the trace can be useful: if we are on a rule, we have its name (with the `getRule` method), and if we are on a lexeme, we have its name (with the `getTokenName` method), its representation (expressed in the PCRE format, with the `getRepresentation` method), its value (with the `getValue` method), whether it is a node to keep in the AST (with the `isKept` method) and its unification index if it exists (with the `getUnificationIndex` method). Of course, all this information can be edited, which can influence the construction of the AST, which is the (optional) next step.

Next, we notice that sometimes we enter in a rule that exists in the grammar, like `object`, `pair`, `value` etc., and sometimes we enter in a numerical rule, like `13`, `12`, `21`, `20` etc. When the grammar is interpreted in order to be translated into “machine rules”, each one of these rules is linearized regarding the PP operators:

• `Hoa\Compiler\Llk\Rule\Choice` for a disjunction,
• `Hoa\Compiler\Llk\Rule\Concatenation` for a concatenation,
• `Hoa\Compiler\Llk\Rule\Repetition` for a repetition,
• `Hoa\Compiler\Llk\Rule\Token` for a token (like seen above).

All these rules in this format are accessible through the `Hoa\Compiler\Llk\Parser::getRules` method, represented by an array. We will print all the rules accessible from the root for a better understanding:

```php
$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));

// 1. Get all rules.
$rules = $compiler->getRules();

// 2. Start the stack of rules to print with the root rule.
$stack = [$compiler->getRootRule() => true];

while (false !== current($stack)) {
    $rule = key($stack);
    next($stack);
    echo
        "\n", '"', $rule, '" is a ',
        strtolower(substr(get_class($rules[$rule]), 22));

    $subrules = $rules[$rule]->getContent();

    // 3a. Token.
    if (null === $subrules) {
        continue;
    }

    echo ' of rules: ';

    // 3b. Other rules.
    foreach ((array) $subrules as $subrule) {
        if (!array_key_exists($subrule, $stack)) {
            // 4. Unshift new rules to print.
            $stack[$subrule] = true;
        }

        echo $subrule, ' ';
    }
}

/**
 * Will output:
 *     "value" is a choice of rules: 1 2 3 string object array number
 *     "1" is a token
 *     "2" is a token
 *     "3" is a token
 *     "string" is a concatenation of rules: 5 6 7
 *     "object" is a concatenation of rules: 10 pair 13 14
 *     "array" is a concatenation of rules: 18 value 21 22
 *     "number" is a token
 *     "5" is a token
 *     "6" is a token
 *     "7" is a token
 *     "10" is a token
 *     "pair" is a concatenation of rules: string 16 value
 *     "13" is a repetition of rules: 12
 *     "14" is a token
 *     "18" is a token
 *     "21" is a repetition of rules: 20
 *     "22" is a token
 *     "16" is a token
 *     "12" is a concatenation of rules: 11 pair
 *     "20" is a concatenation of rules: 19 value
 *     "11" is a token
 *     "19" is a token
 */
```

If we read the `object` rule, we know that it is a concatenation of the `10`, `pair`, `13` and `14` rules. `10` is a lexeme, `pair` is a concatenation of the `string`, `16` and `value` rules, and so on. The initial grammar is translated into its smallest form. This makes it easier and faster to reason about the rules. Indeed, computations on the grammar are not restricted to the AST: in the previous step, with the trace, we were already able to apply computations.
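For example, a small computation over the linearized rules, without touching any AST, could count how many of them are tokens. A sketch, reusing the grammar loaded above:

``````$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$tokens   = 0;

// Count the linearized rules that are plain tokens.
foreach ($compiler->getRules() as $rule) {
    if ($rule instanceof Hoa\Compiler\Llk\Rule\Token) {
        ++$tokens;
    }
}

echo $tokens, ' linearized rules are tokens', "\n";``````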

### Generation

A grammar can be used to satisfy two goals: to validate data (if we see it as an automaton) or to generate data. So far, we have seen how to validate data through several steps: initialization of our compiler, running the lexical and syntactic analyzers, producing a trace, and transforming this trace into an AST in order to visit that tree. But we can stop at the first step, the initialization of our compiler, and exploit the rules in order to generate data that is valid regarding our grammar.

`Hoa\Compiler\Llk\Sampler` provides three data generation algorithms:

• random uniform generation,
• bounded exhaustive generation,
• grammar coverage-based generation.

Why provide three algorithms? Because it is illusory to think that one algorithm can satisfy numerous usage contexts. Each one fits different needs; we will explain them later.

Generating data based on a grammar requires two steps:

1. generating values for all lexemes in order to get raw data,
2. generating a path in the rules of the grammar.

A path is equivalent to a derivation; the vocabulary differs according to our goal: validation or generation.

#### Isotropic generation of lexemes

In order to generate values for lexemes, whatever algorithm is used, it has to be fast. We use a process called isotropic: we start with a rule and move forward only by making random and locally uniform choices (uniform only for each single choice). For instance, if we have a disjunction between three sub-rules, we will randomly and uniformly choose between 1 and 3. If we have a concatenation, we just concatenate each part. Finally, a repetition is nothing more than a disjunction of concatenations: indeed, `e{1,3}` is strictly equivalent to `e | ee | eee`. Let's see:

``````([ae]+|[x-z]!){1,3}              repeat [ae]+|[x-z]! 2 times
→ ([ae]+|[x-z]!)([ae]+|[x-z]!)  choose between [ae]+ and [x-z]!
→ ([ae]+)([ae]+|[x-z]!)         repeat [ae] 2 times
→ [ae][ae]([ae]+|[x-z]!)        choose between a and e
→ e[ae]([ae]+|[x-z]!)           again
→ ea([ae]+|[x-z]!)              choose between [ae]+ and [x-z]!
→ ea([x-z]!)                    choose between x, y and z
→ eay!``````
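The process above can be sketched in plain PHP. The following function is only an illustration of the isotropic principle, not Hoa's implementation: the rule `([ae]+|[x-z]!){1,3}` is hard-coded, and the unbounded `+` is arbitrarily capped at 3 for the example.

``````function generate(): string
{
    $out = '';

    // {1,3}: choose a repetition count, uniformly among 1, 2 and 3.
    for ($i = 0, $repeat = mt_rand(1, 3); $i < $repeat; ++$i) {
        if (0 === mt_rand(0, 1)) {
            // First branch of the disjunction: [ae]+ (capped at 3 here).
            for ($j = 0, $length = mt_rand(1, 3); $j < $length; ++$j) {
                $out .= ['a', 'e'][mt_rand(0, 1)];
            }
        } else {
            // Second branch: [x-z]!.
            $out .= ['x', 'y', 'z'][mt_rand(0, 2)] . '!';
        }
    }

    return $out;
}

echo generate(), "\n"; // For example: eay!``````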

This generation is naive, but that is not important. What matters is the path generation in the rules or, put another way, the generation of sequences of lexemes.

#### Random uniform generation

The first algorithm is the random and uniform generation. This algorithm is useful if we want to generate sequences of lexemes of a given size n with a uniformity, not local (as in the isotropic generation), but over the set of all possible sequences. Succinctly, the algorithm works in two steps: a pre-computation (once per size) followed by the generation. The pre-computation is an automatic step that counts all the possible sequences and sub-sequences of size n. This step helps to compute cumulative distribution functions in order to guide the final sequences of lexemes.

We will generate 10 random data of size 7, i.e. composed of 7 lexemes. That is why we use the `Hoa\Compiler\Llk\Sampler\Uniform` class, which takes our grammar as first argument, the lexeme value generator as second argument (here `Hoa\Regex\Visitor\Isotropic`) and finally a size:

``````$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$sampler  = new Hoa\Compiler\Llk\Sampler\Uniform(
    // Grammar.
    $compiler,
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

for ($i = 0; $i < 10; ++$i) {
    echo $i, ' => ', $sampler->uniform(), "\n";
}

/**
* Will output:
*     0 => [ false , null , null ]
*     1 => [ " l " , null ]
*     2 => [ [ true ] , true ]
*     3 => [ [ [ 4435 ] ] ]
*     4 => [ [ [ 9366 ] ] ]
*     5 => [ true , false , null ]
*     6 => { " |h<# " : false }
*     7 => [ [ [ false ] ] ]
*     8 => [ false , true , 7 ]
*     9 => [ false , 5 , 79 ]
*/``````

We can redefine the size with the `Hoa\Compiler\Llk\Sampler\Uniform::setLength` method. Note that some repetition operators have no upper bound, like `+` or `*`; in this case, we set it to n.
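For instance, reusing the `$sampler` instance from the previous example:

``````// Switch to sequences of 11 lexemes instead of 7.
$sampler->setLength(11);

echo $sampler->uniform(), "\n";``````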

#### Bounded exhaustive generation

The second algorithm is the bounded exhaustive generation. This algorithm generates all sequences of lexemes (for the exhaustive side) from size 1 to n (for the bounded side). An interesting aspect of this algorithm is that the exhaustiveness acts as a kind of proof! The algorithm behaves like an iterator and is very simple to use:

``````$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$sampler  = new Hoa\Compiler\Llk\Sampler\BoundedExhaustive(
    // Grammar.
    $compiler,
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random()),
    // Length.
    7
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
* Will output:
*     0 => true
*     1 => false
*     2 => null
*     3 => " 8u2 "
*     4 => { " ,M@aj{ " : true }
*     5 => { " x`|V " : false }
*     6 => { " NttB " : null }
*     7 => { " eJWwA " : 0 }
*     8 => [ true ]
*     9 => [ true , true ]
*     10 => [ true , true , true ]
*     11 => [ true , true , false ]
*     12 => [ true , true , null ]
*     13 => [ true , true , 32 ]
*     14 => [ true , false ]
*     15 => [ true , false , true ]
*     16 => [ true , false , false ]
*     17 => [ true , false , null ]
*     18 => [ true , false , 729 ]
*     19 => [ true , null ]
*     20 => [ true , null , true ]
*     21 => [ true , null , false ]
*     …
*     157 => [ 780 , 01559 , 32 ]
*     158 => 344
*/``````

As with the previous algorithm, we can redefine the upper bound with the `Hoa\Compiler\Llk\Sampler\BoundedExhaustive::setLength` method, and the repetition operators with no upper bound are bounded to n.

#### Grammar coverage-based generation

The last algorithm is the grammar coverage-based generation. This algorithm reduces the combinatory explosion encountered with the previous algorithm, but the goal is different: it generates sequences of lexemes that “activate” all the rule branches of the grammar. A rule is said to be covered if and only if all its sub-rules are covered, and a lexeme is said to be covered if it has been used in a sequence. To ensure diversity in the produced sequences, choice points between uncovered sub-rules are resolved randomly. The algorithm also behaves like an iterator:

``````$compiler = Hoa\Compiler\Llk\Llk::load(new Hoa\File\Read('Json.pp'));
$sampler  = new Hoa\Compiler\Llk\Sampler\Coverage(
    // Grammar.
    $compiler,
    // Token sampler.
    new Hoa\Regex\Visitor\Isotropic(new Hoa\Math\Sampler\Random())
);

foreach ($sampler as $i => $data) {
    echo $i, ' => ', $data, "\n";
}

/**
* Will output:
*     0 => true
*     1 => { " )o?bz " : null , " %3W) " : [ false , 130 , " 6 " ] }
*     2 => [ { " ny  " : true } ]
*     3 => { " Ne;[3 " : [ true , true ] , " th: " : true , " C[8} " : true }
*/``````

To avoid the combinatory explosion and to ensure the termination of the algorithm, we use the following heuristic on the repetition operators: `*` repeats `0`, `1` and `2` times, `+` repeats `1` and `2` times, and finally `{x,y}` repeats `x`, `x + 1`, `y - 1` and `y` times. This heuristic finds its origin in boundary testing.

We notice in our example that 4 generated data are enough to entirely cover our grammar!

#### Comparison between algorithms

Here are some clues to know when to use one algorithm instead of another.

Random uniform generation:
• fast for small data, high data diversity and a fixed size,
• the pre-computation step is exponential and therefore long, but the subsequent generation is very fast.

Bounded exhaustive generation:
• fast for small data, and the exhaustiveness is efficient,
• the number of data is exponential.

Grammar coverage-based generation:
• fast for average or big data, and good data diversity,
• does not consider the size.

## LL(1) compiler-compiler

The documentation of the LL(1) compiler, through the `Hoa\Compiler\Ll1` class, is not written yet. Its goal is different from `Hoa\Compiler\Llk`: there is no grammar description language, automata are hard-coded with arrays, and this type of compiler is oriented towards high performance. That is why its behavior is monolithic.

## Conclusion

`Hoa\Compiler` provides two compiler-compilers: `Hoa\Compiler\Llk` and `Hoa\Compiler\Ll1`, in order to validate data. The PP language allows writing algebraic grammars in a simple and natural way. The LL(k) compiler-compiler is split into distinct processes, which encourages hackability. Finally, several algorithms allow generating data based on a grammar according to specific usages.

An error or a suggestion about the documentation? Contributions are welcome! 