There are actually two pgens: the original, written in C, and a rewrite in Python, under lib2to3/pgen2.
I wrote both. The original was actually the first code I wrote for Python, although technically I had to write the lexer first (pgen and Python share their lexer, but pgen has no use for most of its tokens).
The reason for writing my own parser generator was that at the time the landscape of parser generators (that I was familiar with) was pretty sparse -- basically there was Yacc (the GNU rewrite named Bison existed, but I don't know if I knew of it), or you could write a parser by hand (which is what most people did). I had used Yacc in college and was familiar with how it worked from the Dragon book, but for some reason I didn't like it much; IIRC I found it hard to reason about what the limitations on an LALR(1) grammar would be. I was also familiar with LL(1) parsers and had written some serious recursive-descent LL(1) parsers by hand -- I liked that fine, but I was also familiar (again from the Dragon book) with LL(1) parser generation techniques, and I had one improvement over them that I wanted to try out: using regular expressions (sort of) instead of the standard BNF format. The Dragon book had also taught me how to turn regular expressions into DFAs, so I combined all that and pgen was born. [Update: see below for a slightly different version of the reason.]
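To give a rough sense of that combination, here is a hand-written sketch (not pgen's actual code or data structures): the right-hand side of each rule is read as a regular expression over grammar symbols and compiled into its own little DFA. Below, the DFA such a construction might produce for a made-up rule stmt: NAME '=' expr [';'] is simply written out by hand, in Python for readability.

    # Toy illustration: a DFA for a hypothetical rule  stmt: NAME '=' expr [';']
    # Each state maps a grammar symbol to the next state; a state is
    # "accepting" if the rule is allowed to end there.
    DFA = {
        0: {"NAME": 1},
        1: {"=": 2},
        2: {"expr": 3},
        3: {";": 4},   # the optional trailing ';'
        4: {},
    }
    ACCEPTING = {3, 4}  # state 3: ';' omitted, state 4: ';' present

    def matches(symbols):
        """Check whether a symbol sequence is a valid expansion of the rule."""
        state = 0
        for sym in symbols:
            if sym not in DFA[state]:
                return False
            state = DFA[state][sym]
        return state in ACCEPTING

    print(matches(["NAME", "=", "expr"]))       # True  (optional part absent)
    print(matches(["NAME", "=", "expr", ";"]))  # True  (optional part present)
    print(matches(["NAME", "expr"]))            # False

pgen of course derives such tables automatically, via the regex -> NFA -> DFA construction, rather than having them written out by hand.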
I was not familiar with more advanced techniques, or I deemed them too inefficient. (I think the same was true of most people working on parsers at the time.)
For the lexer I decided I would not use a generator -- I had a much lower opinion of Lex than of Yacc, since the version of Lex that I was familiar with would just segfault when trying to scan a token longer than 255 bytes (really!). Also I figured that indentation would be hard to teach to a lexer generator.
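For context on the indentation point: emitting INDENT/DEDENT tokens means remembering a stack of indentation levels across lines, which is awkward to express in a purely regex-driven lexer specification. Here is a rough, hand-rolled sketch of the technique (only loosely mirroring what Python's tokenizer does; all names are illustrative, and error handling for inconsistent dedents is omitted):

    def indent_tokens(lines):
        """Yield ('INDENT'|'DEDENT'|'LINE', text) tokens for a list of source lines."""
        levels = [0]                      # stack of active indentation widths
        for line in lines:
            stripped = line.lstrip(" ")
            if not stripped:              # skip blank lines
                continue
            width = len(line) - len(stripped)
            if width > levels[-1]:
                levels.append(width)
                yield ("INDENT", "")
            while width < levels[-1]:
                levels.pop()
                yield ("DEDENT", "")
            yield ("LINE", stripped)
        while len(levels) > 1:            # close any blocks still open at EOF
            levels.pop()
            yield ("DEDENT", "")

    src = ["if x:", "    y = 1", "    if y:", "        z = 2", "done = True"]
    for tok in indent_tokens(src):
        print(tok)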
The story of pgen2 is quite different: I was employed by a startup in San Mateo (Elemental Security; it died in 2007, after I had left and joined Google), where I was given the task of designing a custom language (for security assertions about system configuration) and pretty much free rein. I decided to design something vaguely like Python, implemented in Python, and decided that I wanted to reuse pgen, but with a backend in Python, using tokenize.py as the lexer. So I just rewrote the exact algorithm from pgen in Python and then moved on to build the rest of that language. Open-sourcing the tooling made sense to management, so they quickly approved that, and some time later (I had probably moved on to Google by then?) it made sense to use those tools for 2to3. (It was easy to generate a Python parser with it because the input format was identical to the original pgen's -- I just had to feed the Grammar file into the tool. :-)
Update: more color behind my reason for creating pgen
I don't quite recall why I did things this way, but I just peeked at https://en.wikipedia.org/wiki/LL_parser#Conflicts and I may have seen this as a new (to me) way of addressing the various conflicts that may occur, without adding helper rules. E.g. what that page calls left-factoring (replace A -> X | X Y Z with A -> X B; B -> Y Z | <empty>) I would rewrite instead as A -> X [Y Z]. IIRC the parsing engine (the syntacticAnalysis function from earlier on that page) still works with parsing tables derived from such rules through the regex -> NFA -> DFA transformation; I think the requirement not to have any empty productions was needed to make that work.
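To make that concrete, here is a toy sketch in Python (neither the Wikipedia routine nor pgen's real engine, and X, Y, Z are treated as plain terminals for brevity): the rule A -> X [Y Z] becomes a single small DFA, and the decision that left-factoring would otherwise push into a helper rule B is just the question of whether the current state has a transition on the next token.

    # DFA for the rule  A -> X [Y Z]  (state numbering invented for illustration).
    DFA_A = {0: {"X": 1}, 1: {"Y": 2}, 2: {"Z": 3}, 3: {}}
    ACCEPTING_A = {1, 3}   # state 1: [Y Z] omitted, state 3: [Y Z] present

    def parse_A(tokens):
        """Parse a token list against A -> X [Y Z]; return an ('A', children) node."""
        state, pos, children = 0, 0, []
        while pos < len(tokens) and tokens[pos] in DFA_A[state]:
            children.append(tokens[pos])        # shift the token into the node
            state = DFA_A[state][tokens[pos]]
            pos += 1
        if state not in ACCEPTING_A or pos != len(tokens):
            raise SyntaxError("input does not match A -> X [Y Z]")
        return ("A", children)

    print(parse_A(["X"]))            # ('A', ['X'])
    print(parse_A(["X", "Y", "Z"]))  # ('A', ['X', 'Y', 'Z'])

Note that there is no empty production and no helper nonterminal anywhere; one token of lookahead at state 1 settles whether the optional part is present.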
The other thing I came up with was that the parse tree nodes produced by the parsing engine would have a variable number of children, e.g. for the rule A -> X [Y Z] above the node for A would have either 1 child (X) or 3 (X Y Z). A simple check in the code generator would then be needed to determine which alternative the parser had encountered. (This has turned out to be a double-edged sword, and later we added a parse tree -> AST step driven by a separate generator to simplify the byte code generator.)
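A minimal sketch of what such a check could look like (the ('A', children) node shape and the handler names are invented for this example, not taken from CPython's compiler):

    def gen_A(node):
        """Branch on the child count to see which alternative of A -> X [Y Z] was parsed."""
        name, children = node
        assert name == "A"
        if len(children) == 1:       # only X was present
            return "handle_x(%s)" % children[0]
        else:                        # the optional [Y Z] part was taken
            return "handle_xyz(%s)" % ", ".join(children)

    print(gen_A(("A", ["X"])))            # handle_x(X)
    print(gen_A(("A", ["X", "Y", "Z"])))  # handle_xyz(X, Y, Z)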
So the reason I used regular expressions was probably to make the grammar easier to read: after resolving the conflicts using the needed rewrites I found the grammar not all that easy to read (insert Zen of Python reference here :-), hence the regular expressions -- no helper rules with awkward names, and [optional] parts or repeated* parts are closer to how I think about a typical language's syntax. It certainly did not change the power beyond LL(1), nor did it reduce the power. And yes, by "regular expressions" I meant the EBNF conventions -- I'm not sure that at the time "EBNF" was a well-defined notation; it might just have meant any extension to BNF.
Converting EBNF to BNF and taking it from there would cause awkward extra parse tree node types, so I don't think that would have been an improvement.
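As a small illustration (using the same invented tuple shapes as above), left-factoring the optional part into a helper rule B would wrap it in an extra node that every consumer then has to look through, whereas the EBNF form keeps the children flat:

    # A -> X B; B -> Y Z | <empty>   versus   A -> X [Y Z]
    tree_with_helper = ("A", ["X", ("B", ["Y", "Z"])])   # extra B node to unwrap
    tree_flat        = ("A", ["X", "Y", "Z"])            # what the pgen-style rule gives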
UPDATE March 2019: Python 3.8 will drop the C version of pgen, in favor of a tweaked version of pgen2. See https://github.com/python/cpython/pull/11814.