|
| 1 | +### Notes on the Tokenizer and Parser |
| 2 | +There are a lot of ways to write any kind of parser. Some simply embed the |
| 3 | +syntax and semantic rules in the program logic as a bunch of if/then/else and |
| 4 | +switch statements. Some are more like "real" parsers, tokenizing the input and |
| 5 | +then applying grammatical rules to parse the tokens. |
| 6 | + |
| 7 | +I'd never written any of the latter kind. But in researching some aspects of |
| 8 | +command line parsing I stumbled across **LL(1) grammars**... and a comment to the |
| 9 | +effect no one in their right mind would write one just to parse a command line |
| 10 | +because it'd be overkill. |
| 11 | + |
| 12 | +So, naturally, I decided to write my parser that way :). |
| 13 | + |
| 14 | +For more information on how LL(1) parsers work I suggest you do some web searching. |
| 15 | +Or sign up for a course on compiler design :). If you go the web route, be forewarned |
| 16 | +that there aren't a lot of good -- as in "comprehensible to someone who knows |
| 17 | +nothing about the subject" -- source documents around. Or at least I wasn't able to |
| 18 | +find many. |
| 19 | + |
| 20 | +Piecing together a bunch of sources, here's my understanding of how an LL(1) parser works. |
| 21 | +"LL(1)" means the parser scans tokens from **L**eft to right, produces a **L**eftmost |
| 22 | +derivation (it always expands the leftmost unresolved grammar rule first), and only |
| 23 | +looks **1** token ahead to decide which rule to apply. I didn't use a tree-based |
| 24 | +approach, so my parser never builds the parse tree explicitly. |
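
To make the "one token of lookahead" idea concrete, here is a minimal table-driven sketch in Python. It is not the project's code: the toy grammar, the `TABLE` contents, and the `ll1_accepts` name are all assumptions made up for illustration.

```python
# Hypothetical toy grammar (NOT the grammar used by this library):
#   Line -> Item Line | (empty)
#   Item -> KEY_PREFIX TEXT | TEXT | SEPARATOR
END = "$"  # end-of-input marker

# Parsing table: (non-terminal, lookahead token type) -> symbols to expand into.
TABLE = {
    ("Line", "KEY_PREFIX"): ["Item", "Line"],
    ("Line", "TEXT"): ["Item", "Line"],
    ("Line", "SEPARATOR"): ["Item", "Line"],
    ("Line", END): [],
    ("Item", "KEY_PREFIX"): ["KEY_PREFIX", "TEXT"],
    ("Item", "TEXT"): ["TEXT"],
    ("Item", "SEPARATOR"): ["SEPARATOR"],
}
NON_TERMINALS = {"Line", "Item"}


def ll1_accepts(token_types):
    """Return True if the token-type sequence matches the toy grammar."""
    tokens = list(token_types) + [END]
    stack = [END, "Line"]  # start symbol sits on top of the end marker
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]  # the single token of lookahead
        if top in NON_TERMINALS:
            rhs = TABLE.get((top, lookahead))
            if rhs is None:
                return False  # no rule applies for this lookahead
            # Push in reverse so the leftmost symbol is expanded first.
            stack.extend(reversed(rhs))
        elif top == lookahead:
            pos += 1  # terminal matched; consume the token
        else:
            return False
    return pos == len(tokens)
```

The decision about which rule to expand depends only on the symbol on top of the stack and the next token, which is what the "(1)" refers to.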
| 25 | + |
| 26 | +My parser isn't, technically, an LL(1) parser because it does some pre-processing of |
| 27 | +the tokens it generates before parsing them. The main such step merges all |
| 28 | +tokens between a starting "quoter" token and an ending "quoter" token into a single |
| 29 | +Text token. |
| 30 | + |
| 31 | +Command lines like this: |
| 32 | +``` |
| 33 | +-x -y "This is a single argument to the y switch" -z abc |
| 34 | +``` |
| 35 | +are initially turned into the following sequence of tokens: |
| 36 | +- KeyPrefix (i.e., the "-") |
| 37 | +- Text (the "x") |
| 38 | +- Separator (the " ") |
| 39 | +- KeyPrefix (the second "-") |
| 40 | +- Text (the "y") |
| | +- Separator (the " ") |
| 41 | +- Quoter (the '"') |
| 42 | +- Text (the "This") |
| 43 | +- Separator (the " ") |
| 44 | +- Text (the "is") |
| 45 | +- Separator (the " ") |
| 46 | +- Text (the "a") |
| 47 | +- Separator (the " ") |
| 48 | +- Text (the "single") |
| 49 | +- Separator (the " ") |
| 50 | +- Text (the "argument") |
| 51 | +- Separator (the " ") |
| 52 | +- Text (the "to") |
| 53 | +- Separator (the " ") |
| 54 | +- Text (the "the") |
| 55 | +- Separator (the " ") |
| 56 | +- Text (the "y") |
| 57 | +- Separator (the " ") |
| 58 | +- Text (the "switch") |
| 59 | +- Quoter (the '"') |
| 60 | +- Separator (the " ") |
| 61 | +- KeyPrefix (the "-") |
| 62 | +- Text (the "z") |
| 63 | +- Separator (the " ") |
| 64 | +- Text (the "abc") |
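
The tokenizing step can be sketched roughly as follows. This is a simplified Python illustration, not the library's `Tokenizer`: it assumes every token other than Text is a single character, and the `SINGLE_CHAR_TOKENS` map and `tokenize` function are hypothetical names.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)

# Assumed single-character token definitions; the real set may differ.
SINGLE_CHAR_TOKENS = {"-": "KeyPrefix", " ": "Separator", '"': "Quoter"}


def tokenize(command_line: str) -> List[Token]:
    """Split a command line into KeyPrefix/Separator/Quoter/Text tokens."""
    tokens: List[Token] = []
    text = ""  # characters collected for the current Text token
    for ch in command_line:
        kind = SINGLE_CHAR_TOKENS.get(ch)
        if kind is None:
            text += ch  # ordinary character: keep building the Text token
            continue
        if text:
            tokens.append(("Text", text))  # flush pending Text first
            text = ""
        tokens.append((kind, ch))
    if text:
        tokens.append(("Text", text))
    return tokens
```

Note that this sketch emits a Separator for every space, including one that sits directly before an opening quote, and it treats any `-` as a KeyPrefix even in the middle of a word.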
| 65 | + |
| 66 | +Preprocessing turns that sequence into this: |
| 67 | +- KeyPrefix (i.e., the "-") |
| 68 | +- Text (the "x") |
| 69 | +- Separator (the " ") |
| 70 | +- KeyPrefix (the second "-") |
| 71 | +- Text (the "y") |
| | +- Separator (the " ") |
| 72 | +- Text (the "This is a single argument to the y switch") |
| 73 | +- Separator (the " ") |
| 74 | +- KeyPrefix (the "-") |
| 75 | +- Text (the "z") |
| 76 | +- Separator (the " ") |
| 77 | +- Text (the "abc") |
| 78 | + |
| 79 | +which is much easier to parse. |
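
The quoted-text merge could look something like the sketch below. It is written in Python for illustration and is not the actual `ConsolidateQuotedText` implementation; the `(type, text)` tuple shape, the function name, and the handling of an unterminated quote are assumptions.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def consolidate_quoted_text(tokens: List[Token]) -> List[Token]:
    """Merge everything between a pair of Quoter tokens into one Text token."""
    result: List[Token] = []
    quoted: Optional[List[str]] = None  # None while outside quotes
    for kind, value in tokens:
        if kind == "Quoter":
            if quoted is None:
                quoted = []  # opening quote: start collecting
            else:
                result.append(("Text", "".join(quoted)))  # closing quote
                quoted = None
        elif quoted is not None:
            quoted.append(value)  # inside quotes: keep the raw text
        else:
            result.append((kind, value))
    if quoted is not None:
        # Assumption: an unterminated quote still yields a Text token.
        result.append(("Text", "".join(quoted)))
    return result
```

Everything inside the quotes, separators included, is concatenated verbatim, so the parser afterwards sees one Text token where it previously saw a long run of Text and Separator tokens.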
| 80 | + |
| 81 | +If you want to study the parser in more detail, it's implemented by the `Parser`, |
| 82 | +`ParsingTable`, `TokenEntry`, and `TokenEntries` classes. The two token preprocessors |
| 83 | +are `MergeSequentialSeparators` and `ConsolidateQuotedText`. |
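
Judging by its name, the separator-merging preprocessor collapses runs of adjacent Separator tokens into one. A minimal Python sketch of that idea follows; it is a guess at the behavior, not the library's code, and the function name and token shape are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def merge_sequential_separators(tokens: List[Token]) -> List[Token]:
    """Keep only the first Separator in any run of consecutive Separators."""
    merged: List[Token] = []
    for kind, value in tokens:
        if kind == "Separator" and merged and merged[-1][0] == "Separator":
            continue  # skip a Separator that directly follows another one
        merged.append((kind, value))
    return merged
```

One design point to watch if you write something similar: running this step before the quoted text is consolidated would also collapse repeated spaces inside quotes, so the order of the two preprocessors matters.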
| 84 | + |
| 85 | +The tokenizer is implemented by `AvailableTokens`, `TokenType`, `Token`, and |
| 86 | +`Tokenizer`. |