[Haskell-cafe] Parsec - separating Parsing from Lexing

Discussion:

Fernando Henrique Sanches

2009-11-10 19:23:29 UTC

Hello.

I'm currently implementing a MicroJava compiler for a college assignment
(the implemented language was defined, the implementation language was of
free choice).

I've sucessfully implemented the lexer using Parsec. It has the type String
-> Parser [MJVal], where MJVal are all the possible tokens.

However, I don't know how to implement the parser, or at least how to do it
keeping it distinguished from the lexer.

For example, the MicroJava grammar specifies:

Program = "program" ident {ConstDecl | VarDecl | ClassDecl}
"{" {MethodDecl} "}".

The natural solution (for me) would be:

program = do
string "program"
programName <- identifier
...

However, I can't do this because the file is already tokenized, what I have
is something like:
[Program_, identifier_ "testProgram", lBrace_, ...]
for the example program:

program testProgram {
...

How should I implement the parser separated from the lexer? That is, how
should I parse Tokens instead of Strings in the "Haskell way"?

Fernando Henrique Sanches

Sean Leather

2009-11-10 20:07:00 UTC

Permalink

Post by Fernando Henrique Sanches
I've sucessfully implemented the lexer using Parsec. It has the type String
-> Parser [MJVal], where MJVal are all the possible tokens.

Great! You're partway there.

Post by Fernando Henrique Sanches
How should I implement the parser separated from the lexer? That is, how
should I parse Tokens instead of Strings in the "Haskell way"?

AFAIK, the difference is between having an input stream of type [Char] vs.
having a stream of tokens, e.g. [MJVal]. I haven't used Parsec myself, but
perhaps you want something of type 'GenParser MJVal s r' for some state s
and return type r?

There's an excellent set of lecture notes from a class we have at Utrecht,
formerly called Grammars and Parsing and now Languages and Compilers, in
which we use parser combinators in Haskell .

http://people.cs.uu.nl/johanj/publications/MAIN.pdf

It's not Parsec, but there's plenty of useful general information in there,
cf. 4.5.1, 4.5.2.

Regards,
Sean

Jason Dusek

2009-11-10 20:21:04 UTC

Permalink

You have to bootstrap your parser with something that takes
an `MJVal` and updates the parser state. Here is a simple
example:

http://github.com/jsnx/system-uuid/blob/master/Options.hs

This is a parser for command line options. It parses a list
of `String`s, not `Char`s (because `argv` is `char**` and not
`char*`, right?) so we introduce `stringPrim` and then build
up the primitives from that.

--
Jason Dusek

Fernando Henrique Sanches

2009-11-30 15:00:13 UTC

Permalink

Hello again.

First, I'd like to thank you both for your help. You gave me nice tips and
nice material.

However, I think there is something I'm missing here. I've coded most of the
parser, and I have a feeling I'm doing some great mistake. Not only my code
won't compile, I don't know how to deal with multiple "possible return
values" and worse yet - I don't know how I'll proceed from this to Semantic
Analysis and, later, code generation. Honestly, I'm thinking I'll have to
scrap out all of this code and use the (ugh) Java template code the teacher
gave us.

My code is divided in three files:

Parser.hs (the real parsing, problematic):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13399#a13399

ParserTokens.hs (where I believe the problem arouse):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13400#a13400

Lexer.hs (the lexer, the only part I believe is mostly right):
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=13401#a13401

That's a lot of code and I feel I'm far from reaching the solution to my
problem. This code is getting ugly, and I'm somewhat desperate. Any help
will be greatly appreciated.

Fernando Henrique Sanches

Post by Jason Dusek
You have to bootstrap your parser with something that takes
an `MJVal` and updates the parser state. Here is a simple
http://github.com/jsnx/system-uuid/blob/master/Options.hs
This is a parser for command line options. It parses a list
of `String`s, not `Char`s (because `argv` is `char**` and not
`char*`, right?) so we introduce `stringPrim` and then build
up the primitives from that.
--
Jason Dusek

Stephen Tetley

2009-11-30 16:58:26 UTC

Permalink

Hi Fernando

Which version of Parsec are you using, the one that ships with GHC or Parsec 3?

I would imagine whichever you one you are using you have the imports
wrong for these modules:

import Text.Parsec.Pos
import Text.Parsec.Prim

should be ...

import Text.ParserCombinators.Parsec.Pos
import Text.ParserCombinators.Parsec.Prim

Also it seems like an applicative instance for Parsec's GenParser
monad is missing. This isn't defined for parsec-2.1.0.1, people seem
to define it themselves:

-- The Applicative instance for every Monad looks like this.
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap

... you will need `ap` from Control.Monad in scope.

Also there is a pdf document for Parsec (should be available from Daan
Leijen's home page) that covers separate scanners, it has quite a lot
more documentation than Parsec's Haddock docs.

For semantic analysis I'd highly recommend UUAG, it is well documented
and used for a large compiler (EHC). There is also a version of Andrew
Appel's Tiger language written with it that is much smaller and more
comprehensible, the version on Hackage doesn't seem to contain the
attribute grammar source but it is available here:

http://www.cs.uu.nl/wiki/bin/view/HUT/TigerCompiler

http://www.cs.uu.nl/wiki/bin/view/HUT/AttributeGrammarSystem

Best wishes

Stephen

Fernando Henrique Sanches

2009-11-30 21:24:13 UTC

Permalink

Hi.

I believe I'm using parsec 3.0.1, although I already fixed the imports (even
thou they were working).

Thanks for the references, specially the parsec User Guide - it saved my
life, in 30 minutes I actually *understood* my mistake and fixed it. Now my
code and it compiles and makes sense, even though it's a bit useless, since
it has no return values [yet].

I'll take a look at UUAG and Tiger. However, since this is a simple college
assignment, I want my compiler to be as simple as possible but in a way I
understand what I'm doing, so I'll probably be doing it by hand.

Fernando Henrique Sanches

Post by Stephen Tetley
Hi Fernando
Which version of Parsec are you using, the one that ships with GHC or Parsec 3?
I would imagine whichever you one you are using you have the imports
import Text.Parsec.Pos
import Text.Parsec.Prim
should be ...
import Text.ParserCombinators.Parsec.Pos
import Text.ParserCombinators.Parsec.Prim
Also it seems like an applicative instance for Parsec's GenParser
monad is missing. This isn't defined for parsec-2.1.0.1, people seem
-- The Applicative instance for every Monad looks like this.
instance Applicative (GenParser s a) where
pure = return
(<*>) = ap
... you will need `ap` from Control.Monad in scope.
Also there is a pdf document for Parsec (should be available from Daan
Leijen's home page) that covers separate scanners, it has quite a lot
more documentation than Parsec's Haddock docs.
For semantic analysis I'd highly recommend UUAG, it is well documented
and used for a large compiler (EHC). There is also a version of Andrew
Appel's Tiger language written with it that is much smaller and more
comprehensible, the version on Hackage doesn't seem to contain the
http://www.cs.uu.nl/wiki/bin/view/HUT/TigerCompiler
http://www.cs.uu.nl/wiki/bin/view/HUT/AttributeGrammarSystem
Best wishes
Stephen

Ryan Ingram

2009-11-30 20:21:34 UTC

Permalink

Hi Fernando. I tried this approach for a toy language as well, and I
was unhappy with it.

I have found that, with Parsec, it is best to *not* split your parsing
completely into "tokenization" and "parsing" phases, but rather to
interleave them. Instead of

tokenize :: Parser [MJVal]

make

token :: Parser MJVal
tokenSatisfies :: (MJVal -> Bool) -> Parser MJVal
tokenSatisfies f = try $ do
t <- token
if (f t) then return t else fail "No parse"
program :: Parser Program
program = do
tokenSatisfies (== Program_)
programName <- identifier
-- etc.
return $ Program i ...

There is a bit of an inefficiency using "try"; you'll reparse the same
token multiple times for each failure branch of a "choice" branch, but
I've found this to be the simplest solution and parsing time rarely
dominates your running time.

-- ryan

On Tue, Nov 10, 2009 at 11:23 AM, Fernando Henrique Sanches

Hello.
I'm currently implementing a MicroJava compiler for a college assignment
(the implemented language was defined, the implementation language was of
free choice).
I've sucessfully implemented the lexer using Parsec. It has the type String
-> Parser [MJVal], where MJVal are all the possible tokens.
However, I don't know how to implement the parser, or at least how to do it
keeping it distinguished from the lexer.
Program = "program" ident {ConstDecl | VarDecl | ClassDecl}
   "{" {MethodDecl} "}".
program = do
  string "program"
  programName <- identifier
  ...
However, I can't do this because the file is already tokenized, what I have
[Program_, identifier_ "testProgram", lBrace_, ...]
program testProgram {
...
How should I implement the parser separated from the lexer? That is, how
should I parse Tokens instead of Strings in the "Haskell way"?
Fernando Henrique Sanches
_______________________________________________
Haskell-Cafe mailing list
http://www.haskell.org/mailman/listinfo/haskell-cafe