2013-06-19

Parsing Erlang Terms in OCaml (part 2)

In this installment, we'll continue my example of parsing with OCaml by focusing on the parser.

As I was developing this program, I didn't first write the tokenizer and then the parser. It was an iterative process where some progress was made breaking the input into tokens, and then supporting it in the grammar. Other times, I defined more of the grammar and then tweaked the tokenizer to support it. Describing this in a few blog entries would get confusing if I kept going back and forth between the stages, so I'll simply focus on the parser for now and explain the tokenizer in the next article.

How to Invoke

The grammar is defined in a source file with a .yml suffix. Once compiled, an .ml file is generated. So, for instance, if my grammar specification is in a file called grammar.yml, I would compile it with:

ocamlyacc grammar.yml

This command generates grammar.ml, which holds the parser implementation. Amid all the various functions and tables used to recognize the language is an exported function which starts all the processing. This function takes two parameters: a function that breaks the input into tokens, and a source for the lexer. The function used for the first parameter will be created when we define the tokenizer. The second parameter is obtained from the Lexing module in the standard library.

Defining the Grammar

First we'll define tokens we expect to get from the input. The order in which the tokens are arranged determines the grammar of the language. Some tokens are fixed, like operators in an expression or keywords in a programming language. Some tokens represent values which vary, but have a consistent format. For instance, a string token is represented by text between two quotes and a number token is represented by 1 or more consecutive digits.

ocamlyacc has a section where the tokens are defined. If a token is fixed (i.e. punctuation), then the format is "%token NAME [NAME2 ...]". If the token represents a changing entity, then the syntax is extended to include the type of the value: "%token <type> NAME [NAME2 ...]"

For our Erlang term parser, we have several fixed tokens which represent the punctuation used; square brackets are used to mark where a list begins and ends, braces mark the beginning We can itemize all of them in one line:

%token OLIST OTUPLE CLIST CTUPLE COMMA PERIOD PIPE EOF

There are four primitive types that the tokenizer will recognize (the container types will be handled by the parser.) The four types read by the tokenizer are integers, floating point numbers, strings, and atoms. We associate these tokens with their expected types using the following three lines:

%token <int> INT
%token <float> FLOAT
%token <string> STRING ATOM

In the token section of the .yml file, we also get to specify the name of the function which begins the parsing and what its return value will be. In this case, we have a file containing a series of Erlang terms which we want to parse one term at a time. We'll have the function return an Erlang.t option so if a value is read, it will return Some v and at end of file it will return None.

%start next
%type <Erlang.t option> next

The next section of the source file defines the rules of the grammar. These rules are made up of the name of the rule, followed by the pattern of tokens that define the rule and then OCaml code to execute, if a match is found. The name of the rule will end up being the name of a function in the generated .ml file.

Our first rule will be our starting point, the function next. This function will return a parsed term or None if we reach the end of file. We know all top level terms are followed by a period, so we define the next rule as:

next: term PERIOD               { Some $1 }
| EOF                           { None }

The $1 in the OCaml code will be replaced with the value found in the first matched term. The second matched term is accessed with $2, but it's simply a period so it isn't very interesting. Typically we only take the values of tokens that have a type argument since the fixed tokens are used to build the context of the grammar.

Now we need to add a term rule so Erlang terms can be matched. The four primitive types are simple:

term:
| ATOM                          { Erlang.Atom $1 }
| STRING                        { Erlang.String $1 }
| INT                           { Erlang.Integer $1 }
| FLOAT                         { Erlang.Number $1 }

The four tokens we match against (ATOM, STRING, INT, and FLOAT) are tokens that are defined with a type argument. When the OCaml code refers to a token (the $1 in these examples) it wants the value that was associated with the token. Think of tokens with types as a variant type constructor that takes an argument. In fact, these four rules are transforming ocamlyacc "constructors" into our variant type's constructors.

We're not fully done with this rule, though. We also need to match Erlang tuples and lists:

| OLIST terms CLIST             { Erlang.List $2 }
| OLIST terms PIPE term CLIST   { Erlang.List (List.append $2 [$4]) }
| OTUPLE terms CTUPLE           { Erlang.Tuple $2 }

Even though I differentiate between Erlang tuples and lists, I stored the contents for both in an OCaml list. I didn't want to define an Erlang.Tuple2, Erlang.Tuple3, etc. for varying lengths of tuples. My OCaml knowledge isn't strong enough to know if there's a way to support different tuple sizes in a single variant type constructor, so I went for simplicity and saved the values in a list.

These rules match on the opening and closing tokens to determine whether it's a list or tuple. It relies on the terms rule to read in the values.

You may have also noticed that there are two rules to make a list. The Erlang designers, for some reason, allow building a list where the tail isn't another list! For instance: [ 3 | 9 ] is a list that is more like a 2-tuple. You can't pass this "list" to many of the standard library list functions without throwing an exception, so I'm not sure of the wisdom in allowing these lists to be built. Unfortunately, they can occur (there are some in the profiler output!) so we need to support them. The grammer I've defined will recognize degenerate lists and convert them into a normal OCaml list.

The last rule to define is the terms rule:

terms:                          { [] }
| term                          { [$1] }
| term COMMA terms              { ($1 :: $3) }

In the first line, there isn't a pattern, just OCaml code. This rule is an empty rule and it matches if the other rules don't. This rule will become a function named terms and it's use in previous rules show it needs to return a list of Erlang.t values. So if we don't match any pattern, we return the empty list.

What we're trying to do in the terms rule is find a comma separated list of terms. The second line finds a term that isn't followed by a comma, so it returns a one-item list. The third line finds a term followed by a comma and then recursively calls itself to parse the value after the list. The associated OCaml code builds up an OCaml list using the values of the two rules.

Done for now

The calculator examples you typically find in YACC tutorials are very useful because they show operator precedence along with doing slightly more processing in each rule (by computing a result.) This example shows a different use where we're trying to take text representing data and converting it into something usable by OCaml functions. My example, along with the traditional calculator example, will hopefully make ocamlyacc more accessible to programmers.

The next article will cover tokenizing the input.

Here is the full content of grammar.yml:

%{
%}

%token <int> INT
%token <float> FLOAT
%token <string> STRING ATOM
%token OLIST OTUPLE CLIST CTUPLE COMMA PERIOD PIPE EOF

%start next
%type <Erlang.t option> next

%%

next: term PERIOD               { Some $1 }
| EOF                           { None }

term:
| ATOM                          { Erlang.Atom $1 }
| STRING                        { Erlang.String $1 }
| INT                           { Erlang.Integer $1 }
| FLOAT                         { Erlang.Number $1 }
| OLIST terms CLIST             { Erlang.List $2 }
| OLIST terms PIPE term CLIST   { Erlang.List (List.append $2 [$4]) }
| OTUPLE terms CTUPLE           { Erlang.Tuple $2 }

terms:                          { [] }
| term                          { [$1] }
| term COMMA terms              { ($1 :: $3) }

No comments:

Post a Comment