Commit 2dd00d8

Add dev doc to the parser dir
1 parent 8772952 commit 2dd00d8

2 files changed

+98
-2
lines changed

src/parser/README.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
# Parser developer documentation
2+
3+
This is a quick introduction to how the parser works.
4+
It gives a short introduction to each file in the order they are loaded.
5+
6+
## texexpr.jl
7+
8+
This file contains the definition of the `TeXExpr` struct.
9+
It is used as the representation for *all* the outputs of the parser.
10+
It works similarly to Julia's built-in `Expr`, with two fields:
11+
- `head::Symbol` : the identifier of the kind of `TeXExpr` used.
12+
See the main documentation for a list of valid names.
13+
- `args::Vector{Any}` : a list of all the data associated with the expression.
14+
For example, for a `TeXExpr` with head `digit`, `args` is a list containing
15+
a single element, the digit represented by the expression.
16+
Arguments can be either `TeXExpr` themselves, or other Julia types,
17+
typically `Char` or `String`.
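
As a rough illustration of the two-field layout described above, here is a minimal sketch. The name `SketchExpr` is a hypothetical stand-in rather than the package's actual definition, and storing the digit as a `Char` is an assumption.

```julia
# Hypothetical stand-in for `TeXExpr`; the name and exact layout are
# assumptions made for illustration only.
struct SketchExpr
    head::Symbol
    args::Vector{Any}
end

# A digit expression: `args` holds a single element, the digit itself
# (stored here as a `Char`, which is an assumption).
digit = SketchExpr(:digit, Any['1'])

# Arguments may themselves be expressions, which yields a tree.
group = SketchExpr(:group, Any[digit, SketchExpr(:digit, Any['0'])])

# Walking the tree works much like walking a Julia `Expr`:
# anything that is not an expression counts as one leaf.
count_leaves(x) = 1
count_leaves(e::SketchExpr) = sum(count_leaves, e.args; init=0)
```

Calling `count_leaves(group)` recurses through both nested digit expressions, in the same way one would recurse through `Expr.args`.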
18+
19+
## commands_data.jl
20+
21+
This file simply lists families of commands for easier registration in the next
22+
step.
23+
It is based on the commands defined for `mathtext`, the LaTeX engine of
24+
`matplotlib`.
25+
26+
## commands_registration.jl
27+
28+
In this file we map a single symbol or a string representing a LaTeX
29+
command to its `TeXExpr` representation through the `canonical_expr` function.
30+
For example, the string `"\alpha"` is mapped to `TeXExpr(:symbol, 'α')`.
31+
32+
Here we introduce the concept of a canonical representation.
33+
This simply has to do with the fact that sometimes different LaTeX inputs can
34+
lead to the same expression, and we represent them in a unique and
35+
consistent way.
36+
For example, both the strings `"\alpha"` and `"α"` are mapped to the
37+
expression `TeXExpr(:symbol, 'α')`.
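
The idea of a canonical representation can be sketched with two plain lookup tables. Everything below (`SketchExpr`, `sketch_symbols`, `sketch_commands`, `sketch_canonical`) is hypothetical and only mimics the behaviour described here; the package itself uses its own `CanonicalDict` type and `canonical_expr` function.

```julia
# Hypothetical stand-in for `TeXExpr`.
struct SketchExpr
    head::Symbol
    args::Vector{Any}
end

# Two expressions are equal when their heads and arguments match.
Base.:(==)(a::SketchExpr, b::SketchExpr) = a.head == b.head && a.args == b.args

# One table for single characters, one for command strings (assumed layout).
sketch_symbols = Dict{Char, SketchExpr}('α' => SketchExpr(:symbol, Any['α']))
sketch_commands = Dict{String, SketchExpr}("\\alpha" => SketchExpr(:symbol, Any['α']))

# Look up the canonical form; `nothing` signals an unregistered input.
sketch_canonical(c::Char) = get(sketch_symbols, c, nothing)
sketch_canonical(s::String) = get(sketch_commands, s, nothing)
```

With these tables, `sketch_canonical("\\alpha")` and `sketch_canonical('α')` compare equal, which is the property the canonical form is meant to guarantee.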
38+
39+
Note that the canonical expression may not be the final expression that
40+
the parser outputs.
41+
Sometimes additional information needs to be parsed to complete the command.
42+
In such cases, the canonical expression is a `TeXExpr` that is further
43+
modified once the needed information is parsed.
44+
There are currently two main use cases:
45+
- LaTeX macros with arguments, like `\frac`, which are mapped to
46+
`TeXExpr(:argument_gatherer, [head, number_of_args])` and then converted
47+
to `TeXExpr(head, args)` once the arguments are parsed and gathered.
48+
- Constructs with optional modifiers, like `\int`, which can optionally have its
49+
bounds specified.
50+
In this case the optional arguments of the expression are initially
51+
filled with `nothing` and are later replaced with their actual value if
52+
they are found while parsing.
53+
54+
This strategy allows the parser to only move forward without explicit
55+
lookahead.
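
The two deferred-completion cases above might be sketched as follows. This is an illustration under stated assumptions, not the package's code: `SketchExpr`, `gatherer`, `push_argument!`, `is_complete` and `finalize_gatherer` are hypothetical names, and the `:integral` head with its three-slot layout is invented for the example.

```julia
# Hypothetical stand-in for `TeXExpr`.
struct SketchExpr
    head::Symbol
    args::Vector{Any}
end

# Placeholder for a macro such as `\frac`: it remembers the final head
# and how many arguments are still expected.
gatherer(head::Symbol, n::Int) = SketchExpr(:argument_gatherer, Any[head, n])

# Store one parsed argument after the two bookkeeping entries.
push_argument!(g::SketchExpr, arg) = (push!(g.args, arg); g)

# The gatherer is complete once it holds as many arguments as requested.
is_complete(g::SketchExpr) = length(g.args) - 2 == g.args[2]

# Convert the placeholder into the final expression.
finalize_gatherer(g::SketchExpr) = SketchExpr(g.args[1], g.args[3:end])

# Optional modifiers start as `nothing` and are filled in if encountered
# (head name and slot order are assumptions).
integral = SketchExpr(:integral, Any['∫', nothing, nothing])
integral.args[2] = 'a'   # a lower bound was found while parsing
```

For instance, `g = gatherer(:frac, 2)` followed by two `push_argument!` calls makes `is_complete(g)` true, after which `finalize_gatherer(g)` returns an expression with head `:frac` and the two gathered arguments. At no point does the parser need to look ahead: it only reacts to what has already been read.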
56+
57+
## parser.jl
58+
59+
This is where the magic happens, in the `texparse` function.
60+
For the most part it contains the definition of the parser using `Automa.jl`.
61+
There is a lot to learn from the `Automa.jl` documentation before diving in here.
62+
63+
In addition to the native capabilities of `Automa.jl`, parsing a rich
64+
language like LaTeX requires managing a stack
65+
that contains both the current state of parsing and the already parsed data.
66+
The strategy is relatively simple:
67+
1. We put a `TeXExpr(:expr, [])` as initial state of the stack.
68+
2. We parse LaTeX strings character by character (`Automa.jl` does it byte by
69+
byte, so some care is needed to do it Unicode char by Unicode char).
70+
3. When we encounter a new construct, we put its canonical representation on
71+
top of the stack (e.g. `{` starts a new `TeXExpr(:group)`).
72+
4. When we encounter a char that can end the current construct, we finalize it.
73+
That is, we pop it from the stack and apply some final transformation to it
74+
if needed (e.g. removing the useless `TeXExpr(:group)` layer for a
75+
group of a single element).
76+
Then we add it to the argument list of the first construct below.
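
The byte-versus-character distinction in step 2 can be seen with plain Julia strings. The snippet below uses only standard `Base` functions and does not involve `Automa.jl`; the sample string is arbitrary.

```julia
s = "α+1"

# `α` occupies two bytes in UTF-8, so bytes and characters disagree.
byte_count = ncodeunits(s)
char_count = length(s)

# Valid character start positions, expressed as byte indices.
char_starts = collect(eachindex(s))

# `nextind` steps from one valid character index to the next,
# skipping the continuation byte of `α`.
second_char_index = nextind(s, 1)
```

A byte-oriented machine that sees index 2 is in the middle of `α`, which is why extra care is needed to advance one full character at a time.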
77+
78+
Note that some constructs, like digits, are composed of only a single char, so for them
79+
steps 3 and 4 are merged and they are simply added to the current construct.
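
The four steps can be sketched with a tiny hand-written loop that only understands `{` and `}`. This is a simplified illustration, not the `Automa.jl`-based implementation: `SketchExpr` and `sketch_parse` are hypothetical names, and single characters are stored as raw `Char` values rather than being wrapped in their own expressions.

```julia
# Hypothetical stand-in for `TeXExpr`.
struct SketchExpr
    head::Symbol
    args::Vector{Any}
end

function sketch_parse(str::AbstractString)
    # Step 1: the stack starts with an empty root expression.
    stack = [SketchExpr(:expr, Any[])]
    # Step 2: iterating a Julia string yields characters, not bytes.
    for c in str
        if c == '{'
            # Step 3: a new construct goes on top of the stack.
            push!(stack, SketchExpr(:group, Any[]))
        elseif c == '}'
            # Step 4: pop the construct, simplify it, attach it below.
            length(stack) > 1 || error("unmatched '}'")
            group = pop!(stack)
            finished = length(group.args) == 1 ? group.args[1] : group
            push!(stack[end].args, finished)
        else
            # Single-char constructs: steps 3 and 4 merged.
            push!(stack[end].args, c)
        end
    end
    length(stack) == 1 || error("unclosed '{'")
    return stack[1]
end
```

For example, `sketch_parse("a{b}{cd}")` returns a root expression with three arguments: the char `a`, the char `b` (its single-element group layer removed), and a `:group` expression holding `c` and `d`.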
80+
81+
Most of the complexity in the file comes from the fact that there are
82+
many special rules for beginning or ending a construct.
83+
Think for example of superscript.
84+
Starting from the string `"10^"`, the superscript construct can be terminated
85+
by either
86+
- A single char e.g. `"10^2"`.
87+
- A command e.g. `"10^\beta"`.
88+
- A group e.g. `"10^{2 + 3}"`.
89+
90+
Regardless, at the end, when the parsing is successful, the stack
91+
collapses to a single element, `TeXExpr(:expr)`, whose arguments contain
92+
a nested representation of the full LaTeX string.
93+
94+
You can watch the rise and fall of the stack by passing `showdebug=true` to
95+
`texparse`.
96+
It is currently not as fun as watching an old empire rise and fall,
97+
but beware, it is nearly as verbose.

src/parser/commands_registration.jl

Lines changed: 1 addition & 2 deletions
@@ -28,8 +28,7 @@ function Base.get(d::CanonicalDict, key, default)
2828
end
2929

3030

31-
# Each symbol or command has a unique canonical representation, either
32-
# as a TeXExpr
31+
# Each symbol or command has a unique canonical representation
3332
const symbol_to_canonical = CanonicalDict{Char}()
3433
const command_to_canonical = CanonicalDict{String}()
3534
