Thought I forgot about Lex?

Saw my previous post? Yacc is only half the battle, so why don't I give it all away and post the lexical analyzer too? OK, I will. But I hope you appreciate how difficult it was to post this code, since many of the characters in here have meaning to HTML as well.

Hopefully no errors have crept in with all of the escaped character codes I had to add to please HTML.

Lex, like Yacc, is split into three sections separated by double percent signs. In the first section you can introduce code that is global to the lexical analyzer between the percent-brace (%{ ... %}) constructs. Here I include the token definitions generated by Yacc and also define a couple of variables which will be explained later.

    %{
    #include "y.tab.h"   /* token definitions generated by Yacc */
    int markCount = 2;
    int balanced = 0;
    %}
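For reference, the overall shape of a Lex input file looks like this (a bare skeleton, not this scanner):

```lex
%{
/* C declarations, copied verbatim into the generated scanner */
%}
 /* definitions and %x state declarations go here */
%%
 /* rules: pattern { action } */
%%
/* user code, copied to the end of the generated scanner */
```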

Lex uses regular expressions to scan the input, each followed by a small block of code in braces which is executed when the regular expression matches.

ID      [a-zA-Z_][a-zA-Z_0-9]*
DIGIT   [0-9]
SPACE   [ \t\n\r]
EOL     [\n\r]*

Above I define shorthand names for regular expressions I use a lot. ID is defined to allow a leading underscore but not a leading digit. SPACE includes both newlines and carriage returns, since text files are like a box of chocolates: you never know what line endings you're going to get.

%x        defs code comment tail

Lex includes support for making "state machines" while scanning. In the definition above I have created four alternative states - defs, code, comment, and tail - in addition to the normal scanning state of 0. These states in Lex turn out to be necessary when building what I call monolithic parsers. In my ideal world, you would build grammars like a program with subroutines: there would be sub-parsers, and they would have corresponding lexical scanners. The states in Lex just allow me to implement this all in the same file. Regular expressions prefixed by the state name in angle brackets are only seen when that state is active. (There is even more to this, but let's keep it simple.) The variables markCount and balanced do some bookkeeping while these states are active.
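As a tiny illustration of how start states work (separate from this scanner, with a made-up state name):

```lex
%x str

%%
\"        { BEGIN str; }   /* switch into the str state */
<str>\"   { BEGIN 0; }     /* and back to the normal state */
<str>.    ;                /* matched only while str is active */
```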

    %%            /* don't forget the %% separator */

    "%"{2}        { markCount--; if (!markCount) BEGIN tail; return MARK; }

The markCount variable keeps track of the double percent signs used to separate the different parts of the input file. When I get to the last mark I am just interested in grabbing the tail of the file which is all program code and not part of the grammar. So I set the "tail" state.
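That is, a Yacc input file has this shape, and hitting the second %% drops us into the tail (a minimal skeleton for illustration):

```yacc
%{ /* header code: DEFBLOCK */ %}
%token NUMBER
%%
expr : NUMBER ;
%%
/* tail: plain C program text, not grammar */
```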

    "%{"                    { BEGIN defs; return LCURL; }
    <defs>([^%}]*{EOL})*    { yymore(); }
    <defs>.*/"%}"           { yymore(); BEGIN 0; return DEFBLOCK; }
    "%}"                    { return RCURL; }

The percent-brace (LCURL) construct is used to capture the header code for the parser. Once again I am only interested in the block as a whole, so I set a state to just grab everything until the closing RCURL.

"{"                     { BEGIN code; balanced = 0; yymore(); }
<code>([^}]*{EOL})*"{"  { balanced++; yymore(); }
<code>([^}]*{EOL})*     { yymore(); }
<code>([^{]*{EOL})*"}"  {
                            if (balanced > 0) {
                                balanced--;
                                yymore();
                            }
                            else {
                                /*dlog("CODEBLOCK")*/ BEGIN 0;
                                return CODEBLOCK;
                            }
                        }

Actions are a little bit trickier because they are delimited by plain braces, as in C, and of course the code within can also have nested braces. The balanced variable keeps track of the nesting level. By the way, yymore() is a utility function which keeps what we have scanned so far and appends the next match to it. I return the whole thing, braces and all, as one token: CODEBLOCK.

"/*"                 { BEGIN comment; }
<comment>[^*]*       /* eat anything that's not a '*' */
<comment>"*"+[^*/]*  /* eat up '*'s not followed by '/'s */
<comment>"*"+"/"     { BEGIN 0; }

Comments are delimited like old-style C comments, not the double-slash variety. I don't call yymore() because I don't plan on keeping the scanned text: just skip it, and don't return any token. From here on, the regular expressions are almost all just simple tokens.

"%token"        { return TOKEN; }
"%left"         { return LEFT; }
"%right"        { return RIGHT; }
"%prec"         { return PREC; }
"%type"         { return TYPE; }
"%nonassoc"     { return NONASSOC; }
"%nassoc"       { return NONASSOC; /* Yes, I've seen these used in some YACC files */ }
"%nterm"        { return NTERM; }
"%start"        { return START; }
"%union"        { return UNION; }

The regular expression for literals is a nasty bugger because of escaped and quoted characters.

{ID}                    { return IDENTIFIER; }
"'"."'"|"'\\"."'"       { return LITERAL; }
{DIGIT}+                { return NUMBER; }
":"                     { return COLON; }
";"                     { return SEMICOLON; }
"|"                     { return VERTBAR; }
"<"                     { return LT; }
">"                     { return GT; }

The tail state just grabs everything to the end of the file.

<tail>(.*{EOL})*        { BEGIN 0; return PROGRAM; }

At the end I put the catch-all rules. You generally skip whitespace in any modern language, and the "dot" rule catches anything which has slipped past all the other expressions, which means either badly formed input or, worse, that I didn't write enough regular expressions.

{SPACE}+        /* eat up whitespace */
.               { yyerror("unexpected character"); }

The yywrap function exists so that you can continue reading input from another file instead of stopping at the end of the first. The following definition is usually good enough unless you are processing included files or some other kind of compilation that uses multiple files.

int yywrap (void) {return 1;}

And that wraps it up, so to speak. Questions?

