updated docs to include discussion of tokenizer and parser.

markolbert · markolbert · commit 28fc21f3482e · 2021-01-02T16:03:45.000-08:00
diff --git a/J4JCommandLine.sln b/J4JCommandLine.sln
@@ -20,6 +20,7 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "docs", "docs", "{473DDAC4-C
 		docs\goal-concept.md = docs\goal-concept.md
 		LICENSE.md = LICENSE.md
 		docs\logging.md = docs\logging.md
+		docs\parser.md = docs\parser.md
 		README.md = README.md
 	EndProjectSection
 EndProject
diff --git a/README.md b/README.md
@@ -71,6 +71,7 @@ available as version 0.5.0.1.
   - [Binding to static properties](docs/example-static.md)
   - [Binding to a configuration object](docs/example-instance.md)
 - [Logging and Errors](docs/logging.md)
+- [Notes on the Tokenizer and Parser](docs/parser.md)
 
 #### Inspiration and Dedication
 
diff --git a/docs/parser.md b/docs/parser.md
@@ -0,0 +1,86 @@
+### Notes on the Tokenizer and Parser
+There are a lot of ways to write any kind of parser. Some simply embed the 
+syntax and semantic rules in the program logic as a bunch of if/then/else and
+switch statements. Some are more like "real" parsers, tokenizing the input and
+then applying grammatical rules to parse the tokens.
+
+I'd never written any of the latter kind. But in researching some aspects of
+command line parsing I stumbled across **LL(1) grammars**... and a comment to the
+effect that no one in their right mind would write one just to parse a command
+line because it'd be overkill.
+
+So, naturally, I decided to write my parser that way :).
+
+For more information on how LL(1) parsers work I suggest you do some web searching.
+Or sign up for a course on compiler design :). If you go the web route, be forewarned
+that there aren't a lot of good -- as in "comprehensible to someone who knows 
+nothing about the subject" -- source documents around. Or at least I wasn't able to
+find many.
+
+Piecing together a bunch of stuff, here is how I think an LL(1) parser works.
+"LL(1)" means the parser scans tokens from **L**eft to right, builds a **L**eftmost
+derivation (it always expands the leftmost unresolved grammar rule first), and
+looks only **1** token ahead to decide what to do next. I didn't build an explicit
+tree, so the second "L" matters less in my approach.
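
To make the "left to right, one token of lookahead" idea concrete, here is a minimal Python sketch. It is not the library's C# code: the token kinds (`key_prefix`, `text`), the tiny grammar, and the function name `parse_options` are all assumptions made up for illustration, and separators are assumed to have been dropped already.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (kind, text); the kinds used here are hypothetical


def parse_options(tokens: List[Token]) -> Dict[str, List[str]]:
    """Predictive, LL(1)-style parse of a toy grammar:

        options -> option options | (nothing)
        option  -> key_prefix text values
        values  -> text values | (nothing)

    Tokens are consumed strictly left to right, and every decision is made
    by peeking at exactly one upcoming token.
    """
    pos = 0
    result: Dict[str, List[str]] = {}

    def peek() -> str:
        # The single token of lookahead; "end" marks exhausted input.
        return tokens[pos][0] if pos < len(tokens) else "end"

    while peek() != "end":
        if peek() != "key_prefix":
            raise ValueError(f"expected key_prefix at token {pos}, got {peek()}")
        pos += 1  # consume the prefix
        if peek() != "text":
            raise ValueError(f"expected a key name at token {pos}, got {peek()}")
        key = tokens[pos][1]
        pos += 1  # consume the key name
        values = result.setdefault(key, [])
        while peek() == "text":  # lookahead decides: another value, or stop
            values.append(tokens[pos][1])
            pos += 1
    return result
```

Because each branch depends only on the next token's kind, the parser never has to back up and retry, which is the practical appeal of this family of grammars.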
+
+My parser isn't, technically, an LL(1) parser because it does some pre-processing of
+the tokens it generates before parsing them. The main such step merges all
+tokens between a starting "quoter" token and an ending "quoter" token into a single
+text token.
+
+Command lines like this:
+```
+-x -y "This is a single argument to the y switch" -z abc
+```
+are initially turned into the following sequence of tokens:
+- KeyPrefix (i.e., the "-")
+- Text (the "x")
+- Separator (the " ")
+- KeyPrefix (the second "-")
+- Text (the "y")
+- Separator (the " ")
+- Quoter (the '"')
+- Text (the "This")
+- Separator (the " ")
+- Text (the "is")
+- Separator (the " ")
+- Text (the "a")
+- Separator (the " ")
+- Text (the "single")
+- Separator (the " ")
+- Text (the "argument")
+- Separator (the " ")
+- Text (the "to")
+- Separator (the " ")
+- Text (the "the")
+- Separator (the " ")
+- Text (the "y")
+- Separator (the " ")
+- Text (the "switch")
+- Quoter (the '"')
+- Separator (the " ")
+- KeyPrefix (the "-")
+- Text (the "z")
+- Separator (the " ")
+- Text (the "abc")
+
+Preprocessing turns that sequence into this:
+- KeyPrefix (i.e., the "-")
+- Text (the "x")
+- Separator (the " ")
+- KeyPrefix (the second "-")
+- Text (the "y")
+- Separator (the " ")
+- Text (the "This is a single argument to the y switch")
+- Separator (the " ")
+- KeyPrefix (the "-")
+- Text (the "z")
+- Separator (the " ")
+- Text (the "abc")
+
+which is much easier to parse.
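
The two stages above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's C# implementation: the function names `tokenize` and `consolidate_quoted_text` are hypothetical, the only special characters handled are `-`, `"` and a space, and a `-` counts as a key prefix only when it starts a word.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (token type, text)

# Characters that always form their own token in this sketch.
SPECIAL = {'"': "Quoter", " ": "Separator"}


def tokenize(command_line: str) -> List[Token]:
    """Split a command line into KeyPrefix, Text, Separator and Quoter tokens."""
    tokens: List[Token] = []
    buffer = ""  # characters of the Text token currently being built
    for ch in command_line:
        # Assumption: "-" is a key prefix only at the start of a word.
        if ch in SPECIAL or (ch == "-" and not buffer):
            if buffer:
                tokens.append(("Text", buffer))
                buffer = ""
            tokens.append((SPECIAL.get(ch, "KeyPrefix"), ch))
        else:
            buffer += ch
    if buffer:
        tokens.append(("Text", buffer))
    return tokens


def consolidate_quoted_text(tokens: List[Token]) -> List[Token]:
    """Merge everything between a pair of Quoter tokens into one Text token."""
    merged: List[Token] = []
    quoted: Optional[List[str]] = None  # None while outside quotes
    for kind, text in tokens:
        if kind == "Quoter":
            if quoted is None:
                quoted = []  # opening quote: start collecting pieces
            else:
                merged.append(("Text", "".join(quoted)))
                quoted = None  # closing quote: emit the merged token
        elif quoted is not None:
            quoted.append(text)  # inside quotes: keep the raw text
        else:
            merged.append((kind, text))
    if quoted is not None:
        raise ValueError("unterminated quote")
    return merged
```

Calling `consolidate_quoted_text(tokenize(...))` on the example command line leaves the whole quoted phrase as a single `Text` token, so the parser never has to know that quotes existed.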
+
+If you want to study the parser in more detail, it's implemented by the `Parser`,
+`ParsingTable`, `TokenEntry` and `TokenEntries` classes. The two token preprocessors
+are `MergeSequentialSeparators` and `ConsolidateQuotedText`.
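
The notes above only name `MergeSequentialSeparators`, so the following Python sketch is a guess at its intent based on the name alone: collapse any run of adjacent separator tokens into one. The function name and the tuple-based token shape are assumptions, and the real C# class may behave differently.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, text)


def merge_sequential_separators(tokens: List[Token]) -> List[Token]:
    """Keep the first Separator of each run and drop the rest (assumed behavior)."""
    merged: List[Token] = []
    for kind, text in tokens:
        if kind == "Separator" and merged and merged[-1][0] == "Separator":
            continue  # skip a separator that directly follows another one
        merged.append((kind, text))
    return merged
```

With this in place, `-x    abc` (several spaces) would look the same to the parser as `-x abc`.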
+
+The tokenizer is implemented by `AvailableTokens`, `TokenType`, `Token` and
+`Tokenizer`.