Commit 28fc21f

updated docs to include discussion of tokenizer and parser.
1 parent 7a7e425 commit 28fc21f

3 files changed: +88 -0 lines changed

J4JCommandLine.sln

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "docs", "docs", "{473DDAC4-C
 docs\goal-concept.md = docs\goal-concept.md
 LICENSE.md = LICENSE.md
 docs\logging.md = docs\logging.md
+docs\parser.md = docs\parser.md
 README.md = README.md
 EndProjectSection
 EndProject

README.md

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ available as version 0.5.0.1.
 - [Binding to static properties](docs/example-static.md)
 - [Binding to a configuration object](docs/example-instance.md)
 - [Logging and Errors](docs/logging.md)
+- [Notes on the Tokenizer and Parser](docs/parser.md)

 #### Inspiration and Dedication

docs/parser.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+### Notes on the Tokenizer and Parser
+There are a lot of ways to write any kind of parser. Some simply embed the
+syntax and semantic rules in the program logic as a bunch of if/then/else and
+switch statements. Some are more like "real" parsers, tokenizing the input and
+then applying grammatical rules to parse the tokens.
+
+I'd never written any of the latter kind. But in researching some aspects of
+command line parsing I stumbled across **LL(1) grammars**... and a comment to the
+effect no one in their right mind would write one just to parse a command line
+because it'd be overkill.
+
+So, naturally, I decided to write my parser that way :).
+
+For more information on how LL(1) parsers work I suggest you do some web searching.
+Or sign up for a course on compiler design :). If you go the web route, be forewarned
+that there aren't a lot of good -- as in "comprehensible to someone who knows
+nothing about the subject" -- source documents around. Or at least I wasn't able to
+find many.
+
+But by piecing together a bunch of stuff, I think this is how an LL(1) parser works.
+"LL(1)" means the parser scans tokens from **L**eft to right, and only looks **1**
+token ahead (the second "L" stands for **L**eftmost derivation: the parser always
+expands the leftmost unresolved grammar rule first, which matters most if you build
+a parse tree; I didn't use a tree-based approach).
+
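To make the "one token of lookahead" idea concrete, here is a minimal Python sketch of a predictive parse over an already-tokenized command line. It is an illustration only, not the project's `Parser` or `ParsingTable`: the tiny grammar, the `parse` function and the tuple-based token representation are all assumptions, and the token type names (KeyPrefix, Text, Separator) are borrowed from the lists further down in these notes.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (token type, text); a hypothetical representation


def parse(tokens: List[Token]) -> Dict[str, List[str]]:
    """Predictive parse of an assumed grammar:

    line := item*
    item := KeyPrefix Text | Text | Separator

    Every decision is made by inspecting only the type of the next token.
    """
    options: Dict[str, List[str]] = {}
    current_key: Optional[str] = None
    pos = 0

    def peek() -> Optional[str]:
        # The single token of lookahead: only its type is examined.
        return tokens[pos][0] if pos < len(tokens) else None

    while peek() is not None:
        kind = peek()
        if kind == "KeyPrefix":
            pos += 1  # consume the prefix; a key must follow
            if peek() != "Text":
                raise ValueError("expected a key after the key prefix")
            current_key = tokens[pos][1]
            options.setdefault(current_key, [])
        elif kind == "Text":
            if current_key is None:
                raise ValueError(f"value {tokens[pos][1]!r} has no preceding key")
            options[current_key].append(tokens[pos][1])
        elif kind != "Separator":
            raise ValueError(f"unexpected token type {kind!r}")
        pos += 1  # consume the token that was just handled
    return options


example = [
    ("KeyPrefix", "-"), ("Text", "x"), ("Separator", " "),
    ("KeyPrefix", "-"), ("Text", "y"),
    ("Text", "This is a single argument to the y switch"), ("Separator", " "),
    ("KeyPrefix", "-"), ("Text", "z"), ("Separator", " "), ("Text", "abc"),
]
print(parse(example))
```

The parser never backtracks and never looks further than the next token, which is the property the "(1)" refers to. A table-driven variant would move the `if`/`elif` decisions into a lookup keyed by token type.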
+My parser isn't, technically, an LL(1) parser because it does some pre-processing of
+the tokens it generates before parsing them. The main such step is to merge all
+tokens between a starting "quoter" token and an ending "quoter" token into a single
+Text token.
+
+Command lines like this:
+```
+-x -y "This is a single argument to the y switch" -z abc
+```
+are initially turned into the following sequence of tokens:
+- KeyPrefix (i.e., the "-")
+- Text (the "x")
+- Separator (the " ")
+- KeyPrefix (the second "-")
+- Text (the "y")
+- Quoter (the '"')
+- Text (the "This")
+- Separator (the " ")
+- Text (the "is")
+- Separator (the " ")
+- Text (the "a")
+- Separator (the " ")
+- Text (the "single")
+- Separator (the " ")
+- Text (the "argument")
+- Separator (the " ")
+- Text (the "to")
+- Separator (the " ")
+- Text (the "the")
+- Separator (the " ")
+- Text (the "y")
+- Separator (the " ")
+- Text (the "switch")
+- Quoter (the '"')
+- Separator (the " ")
+- KeyPrefix (the "-")
+- Text (the "z")
+- Separator (the " ")
+- Text (the "abc")
+
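The tokenizing step described above can be sketched in a few lines of Python. This is a hedged illustration, not the project's tokenizer: the `DELIMITERS` table, the `Token` tuple and the `tokenize` function are hypothetical, and only the four token type names come from these notes. Because it works character by character, the sketch emits a Separator for every space it sees, including one directly before an opening quote.

```python
from enum import Enum
from typing import List, NamedTuple


class TokenType(Enum):
    KEY_PREFIX = "KeyPrefix"
    TEXT = "Text"
    SEPARATOR = "Separator"
    QUOTER = "Quoter"


class Token(NamedTuple):
    type: TokenType
    text: str


# Assumed single-character delimiters; a real tokenizer may recognize more
# (for example "--" or "/" prefixes, or single quotes).
DELIMITERS = {
    "-": TokenType.KEY_PREFIX,
    " ": TokenType.SEPARATOR,
    '"': TokenType.QUOTER,
}


def tokenize(command_line: str) -> List[Token]:
    """Split a command line into delimiter tokens and Text tokens."""
    tokens: List[Token] = []
    buffer = ""  # characters of the Text token currently being built
    for ch in command_line:
        if ch in DELIMITERS:
            if buffer:
                # A delimiter ends whatever Text token was in progress.
                tokens.append(Token(TokenType.TEXT, buffer))
                buffer = ""
            tokens.append(Token(DELIMITERS[ch], ch))
        else:
            buffer += ch
    if buffer:
        tokens.append(Token(TokenType.TEXT, buffer))
    return tokens


for token in tokenize('-x -y "quoted text" -z abc'):
    print(token.type.value, repr(token.text))
```

One limitation of this simplification: a "-" inside a word (as in `abc-def`) is also reported as a KeyPrefix, so a fuller tokenizer would need to take position into account.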
+Preprocessing turns that sequence into this:
+- KeyPrefix (i.e., the "-")
+- Text (the "x")
+- Separator (the " ")
+- KeyPrefix (the second "-")
+- Text (the "y")
+- Text (the "This is a single argument to the y switch")
+- Separator (the " ")
+- KeyPrefix (the "-")
+- Text (the "z")
+- Separator (the " ")
+- Text (the "abc")
+
+which is much easier to parse.
+
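The preprocessing shown above can be approximated with two small passes over the token list. The Python sketch below is an assumption about how such passes might work, not the project's code: the function names echo the two preprocessors named in these notes, but their bodies, the tuple-based tokens, the error handling and the order in which the passes are applied are all guesses.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, text); a hypothetical representation


def consolidate_quoted_text(tokens: List[Token]) -> List[Token]:
    """Replace each Quoter ... Quoter span with one Text token."""
    result: List[Token] = []
    quoted: List[str] = []  # pieces of text seen since the opening quote
    in_quotes = False
    for kind, text in tokens:
        if kind == "Quoter":
            if in_quotes:
                # Closing quote: emit everything collected as a single Text.
                result.append(("Text", "".join(quoted)))
                quoted = []
            in_quotes = not in_quotes
        elif in_quotes:
            quoted.append(text)  # keep separators and prefixes verbatim
        else:
            result.append((kind, text))
    if in_quotes:
        raise ValueError("unterminated quoted text")
    return result


def merge_sequential_separators(tokens: List[Token]) -> List[Token]:
    """Collapse each run of adjacent Separator tokens into a single one."""
    merged: List[Token] = []
    for token in tokens:
        if token[0] == "Separator" and merged and merged[-1][0] == "Separator":
            continue
        merged.append(token)
    return merged


raw = [
    ("KeyPrefix", "-"), ("Text", "y"), ("Separator", " "),
    ("Quoter", '"'), ("Text", "a"), ("Separator", " "), ("Separator", " "),
    ("Text", "b"), ("Quoter", '"'),
    ("Separator", " "), ("Separator", " "), ("Text", "abc"),
]
processed = merge_sequential_separators(consolidate_quoted_text(raw))
print(processed)
```

In this sketch the quoted text is consolidated first so that repeated spaces inside the quotes survive; running the separator merge first would silently shorten them.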
+If you want to study the parser in more detail it's implemented by the `Parser`,
+`ParsingTable`, `TokenEntry` and `TokenEntries` classes. The two token preprocessors
+are `MergeSequentialSeparators` and `ConsolidateQuotedText`.
+
+The tokenizer is implemented by `AvailableTokens`, `TokenType`, `Token` and
+`Tokenizer`.
