Defining a grammar
Pegmatite grammars are Parsing Expression Grammars (PEGs). As such, they are always matched greedily and there is no ambiguity in the grammar: the first rule to match is always the one taken.
Traditional PEGs do not support left recursion, because greedy matching of a left-recursive rule would simply involve it repeatedly trying to match the left recursive part until it ran out of stack space. OMeta added support for left recursion, as part of its parsing algorithm. Pegmatite uses a simplified form of the OMeta algorithm, which has worse worst-case running time, but better worst-case space usage.
Once left recursion is encountered, including indirect left recursion, the parser will backtrack until it finds a conditional that has a branch that is not left recursive.
You can create freestanding Rule
instances, however the recommended way of
creating a grammar is to make each rule a field in a class. The Calculator
example follows this pattern, with a CalculatorGrammar
class containing one
field for each rule. This is recommended for two reasons:
- It allows lazy creation, rather than requiring the rules to be instantiated on program creation. This gives faster start-up times and means that memory is only used when the grammar is actually used.
- It improves encapsulation. Although it is completely safe to refer to rules in other grammars, placing them in a class makes it easy for people binding actions to the grammar to identify all of the rules that they must (or, might want to) provide actions for.
Simple character or string recognising expressions can be created using the
_E
custom literal suffix. For example "int"_E
creates an expression that
will match the literal string “int”. This is useful for terminals. The _S
suffix defines a set, for example "eE"_S
matches either the character ‘e’ or
‘E’.
You can define more complex operations from these by using the following
operators (where a
and b
are expressions):
*a
matches zero or more instances ofa
+a
matches one or more instances ofa
-a
matches zero or one instance ofa
a >> b
matchesa
and thenb
a | b
matches eithera
orb
Newline rules have no special meaning in parsing, but are used to increment the line counter for input ranges. If you declare a rule as matching a newline, it will increment the line counter every time it is successfully matched.
Whitespace rules allow implicit whitespace in between all non-terminal expressions (sequences).
In Pegmatite, the grammar is kept separate from the set of actions that are invoked when matching it. The grammar from the calculator example is shown below:
Note that the _E
suffix is only required for the first element in an expression. For example, '0'_E - '9'
defines a range expression that will match any character in the range 0–9. The suffix is required on the first literal, but the argument to the -
operator (which defines a range) will accept either another expression or a literal character.
This can be seen in other sequences, for example this rule defines subtract operations:
The literals here do not need the suffix, because they are all on the right-hand side of an operator that defines a compound expression.
This grammar is a singleton and can be reused by multiple consumers.