|
| 1 | +# Parse developer documentation |
| 2 | + |
| 3 | +This is a quick introduction to how the parser works. |
| 4 | +It briefly describes each file, in the order in which they are loaded. |
| 5 | + |
| 6 | +## texexpr.jl |
| 7 | + |
| 8 | +This file contains the definition of the `TeXExpr` struct. |
| 9 | +It is used as the representation for *all* the outputs of the parser. |
| 10 | +It works similarly to Julia's built-in `Expr`, with two fields: |
| 11 | +- `head::Symbol` : the identifier of the kind of `TeXExpr` used. |
| 12 | + See the main documentation for a list of valid names. |
| 13 | +- `args::Vector{Any}` : a list of all the data associated with the expression. |
| 14 | + For example, for a `TeXExpr` with head `digit`, `args` is a list containing |
| 15 | + a single element, the digit represented by the expression. |
| 16 | + Arguments can either be `TeXExpr`s themselves or values of other Julia types, |
| 17 | + typically `Char` or `String`. |
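As a rough illustration of the two-field layout described above, here is a minimal sketch. `SketchTeXExpr` and `leaves` are hypothetical names used only for this example; the real `TeXExpr` lives in `texexpr.jl` and may have extra constructors and methods.

```julia
# Simplified stand-in for the struct described above (not the package's code).
struct SketchTeXExpr
    head::Symbol        # the kind of expression, e.g. :digit or :group
    args::Vector{Any}   # nested expressions or plain Julia values
end

# A digit and a symbol each hold a single Char as their only argument.
digit = SketchTeXExpr(:digit, Any['7'])
alpha = SketchTeXExpr(:symbol, Any['α'])

# Arguments can themselves be expressions, which gives a tree.
group = SketchTeXExpr(:group, Any[digit, alpha])

# Hypothetical helper: collect the non-expression leaves of the tree, in order.
leaves(x) = Any[x]
leaves(e::SketchTeXExpr) = reduce(vcat, map(leaves, e.args); init=Any[])
```

Calling `leaves(group)` walks the nested `args` and gathers the two characters stored at the bottom of the tree.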
| 18 | + |
| 19 | +## commands_data.jl |
| 20 | + |
| 21 | +This file simply lists families of commands for easier registration in the next |
| 22 | +step. |
| 23 | +It is based on the commands defined for `mathtext`, the LaTeX engine of |
| 24 | +`matplotlib`. |
| 25 | + |
| 26 | +## commands_registration.jl |
| 27 | + |
| 28 | +In this file we map a single symbol or a string representing a LaTeX |
| 29 | +command to its `TeXExpr` representation through the `canonical_expr` function. |
| 30 | +For example, the string `"\alpha"` is mapped to `TeXExpr(:symbol, 'α')`. |
| 31 | + |
| 32 | +Here we introduce the concept of a canonical representation. |
| 33 | +Sometimes different LaTeX inputs lead to the same expression, |
| 34 | +so we represent them in a single, |
| 35 | +consistent way. |
| 36 | +For example, both the strings `"\alpha"` and `"α"` are mapped to the |
| 37 | +expression `TeXExpr(:symbol, 'α')`. |
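The idea of a canonical representation can be sketched as follows. This is not the package's `canonical_expr`; `sketch_canonical`, `SKETCH_COMMANDS`, and `SketchTeXExpr` are hypothetical names, and the two-entry table is an assumption made for the example.

```julia
# Simplified stand-in for TeXExpr (not the package's code).
struct SketchTeXExpr
    head::Symbol
    args::Vector{Any}
end

# Structural equality, so two separately built expressions can be compared.
Base.:(==)(a::SketchTeXExpr, b::SketchTeXExpr) =
    a.head == b.head && a.args == b.args

# Hypothetical lookup table from LaTeX command to Unicode character.
const SKETCH_COMMANDS = Dict("\\alpha" => 'α', "\\beta" => 'β')

# Map either a command string or a single character to the same expression.
function sketch_canonical(input::AbstractString)
    if haskey(SKETCH_COMMANDS, input)
        return SketchTeXExpr(:symbol, Any[SKETCH_COMMANDS[input]])
    elseif length(input) == 1
        return SketchTeXExpr(:symbol, Any[first(input)])
    end
    error("no canonical expression known for $(repr(input))")
end
```

With this sketch, `sketch_canonical("\\alpha") == sketch_canonical("α")` holds, which is the property the canonical form is meant to guarantee.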
| 38 | + |
| 39 | +Note that the canonical expression may not be the final expression that |
| 40 | +the parser outputs. |
| 41 | +Sometimes additional information needs to be parsed to complete the command. |
| 42 | +In such cases, the canonical expression is a `TeXExpr` that is further |
| 43 | +modified once the needed information is parsed. |
| 44 | +There are currently two main use cases: |
| 45 | +- LaTeX macros with arguments, like `\frac`, which are mapped to |
| 46 | + `TeXExpr(:argument_gatherer, [head, number_of_args])` and then converted |
| 47 | + to `TeXExpr(head, args)` once the arguments are parsed and gathered. |
| 48 | +- Constructs with optional modifiers, like `\int`, which can optionally have their |
| 49 | + bounds specified. |
| 50 | + In this case the optional arguments of the expression are initially |
| 51 | + filled with `nothing` and are later replaced with their actual value if |
| 52 | + they are found while parsing. |
| 53 | + |
| 54 | +This strategy allows the parser to only move forward without explicit |
| 55 | +lookahead. |
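The first use case above, the argument gatherer, can be sketched like this. It is only an illustration under assumptions: `sketch_gatherer` and `sketch_feed!` are hypothetical helpers, the head `:frac` is assumed for `\frac`, and storing the gathered arguments after the first two entries of `args` is a choice made for this sketch, not necessarily what the parser does.

```julia
# Simplified stand-in for TeXExpr (not the package's code).
struct SketchTeXExpr
    head::Symbol
    args::Vector{Any}
end

# Placeholder pushed when a macro expecting `nargs` arguments is seen.
sketch_gatherer(head::Symbol, nargs::Int) =
    SketchTeXExpr(:argument_gatherer, Any[head, nargs])

# Give one parsed argument to the placeholder.
# Returns the placeholder while arguments are missing, and the final
# expression `SketchTeXExpr(head, args)` once all of them have arrived.
function sketch_feed!(g::SketchTeXExpr, arg)
    head, nargs = g.args[1], g.args[2]
    push!(g.args, arg)            # assumption: gathered arguments follow the header
    gathered = g.args[3:end]
    return length(gathered) == nargs ? SketchTeXExpr(head, gathered) : g
end

g = sketch_gatherer(:frac, 2)
step1 = sketch_feed!(g, '1')      # still waiting for the denominator
frac = sketch_feed!(step1, '2')   # complete: converted to a :frac expression
```

The placeholder never needs to look at upcoming input; it only reacts to arguments as they are handed over, which is what lets the parser keep moving forward.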
| 56 | + |
| 57 | +## parser.jl |
| 58 | + |
| 59 | +This is where the magic happens, in the `texparse` function. |
| 60 | +For the most part it contains the definition of the parser using `Automa.jl`. |
| 61 | +There is a lot to learn from the `Automa.jl` documentation before diving in here. |
| 62 | + |
| 63 | +In addition to `Automa.jl`'s native capabilities, to be able to parse a rich |
| 64 | +language like LaTeX, we need to manage a stack |
| 65 | +that contains both the current state of parsing and the already parsed data. |
| 66 | +The strategy is relatively simple: |
| 67 | +1. We put a `TeXExpr(:expr, [])` as initial state of the stack. |
| 68 | +2. We parse LaTeX strings character by character (`Automa.jl` does it byte by |
| 69 | + byte, so some care is needed to do it Unicode char by Unicode char). |
| 70 | +3. When we encounter a new construct, we put its canonical representation on |
| 71 | + top of the stack (e.g. `{` starts a new `TeXExpr(:group)`). |
| 72 | +4. When we encounter a char that can end the current construct, we finalize it. |
| 73 | + That is, we pop it from the stack and apply some final transformation to it |
| 74 | + if needed (e.g. removing the useless `TeXExpr(:group)` layer for a |
| 75 | + group of a single element). |
| 76 | + Then we add it to the argument list of the first construct below. |
| 77 | + |
| 78 | +Note that some constructs, like digits, consist of a single char, so for them |
| 79 | +steps 3 and 4 are merged and they are simply added to the current construct. |
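The stack strategy above can be sketched without `Automa.jl`, using a plain loop over characters. This is a toy under stated assumptions, not the code in `parser.jl`: `sketch_parse` and `SketchTeXExpr` are hypothetical, only `{`, `}`, and single characters are handled, and the `:char` head for non-digit characters is assumed for the example.

```julia
# Simplified stand-in for TeXExpr (not the package's code).
struct SketchTeXExpr
    head::Symbol
    args::Vector{Any}
end

function sketch_parse(str::AbstractString)
    # Step 1: the stack starts with an empty top-level expression.
    stack = [SketchTeXExpr(:expr, Any[])]
    # Step 2: iterating a Julia String yields Unicode chars, not bytes.
    for c in str
        if c == '{'
            # Step 3: a new construct goes on top of the stack.
            push!(stack, SketchTeXExpr(:group, Any[]))
        elseif c == '}'
            # Step 4: finalize the current construct.
            length(stack) > 1 || error("unmatched '}'")
            group = pop!(stack)
            # Drop the useless :group layer around a single element.
            finished = length(group.args) == 1 ? only(group.args) : group
            push!(last(stack).args, finished)
        else
            # Single-char constructs: steps 3 and 4 merged.
            head = isdigit(c) ? :digit : :char
            push!(last(stack).args, SketchTeXExpr(head, Any[c]))
        end
    end
    length(stack) == 1 || error("unclosed '{'")
    return only(stack)
end

tree = sketch_parse("1{2}{ab}")
```

Here `tree` has three arguments: the digit `1`, the digit `2` (its single-element group was unwrapped), and a `:group` holding `a` and `b`.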
| 80 | + |
| 81 | +Most of the complexity in the file comes from the fact that there are |
| 82 | +many special rules for beginning or ending a construct. |
| 83 | +Consider superscripts, for example. |
| 84 | +Starting from the string `"10^"`, the superscript construct can be terminated |
| 85 | +by either |
| 86 | +- A single char e.g. `"10^2"`. |
| 87 | +- A command e.g. `"10^\beta"`. |
| 88 | +- A group e.g. `"10^{2 + 3}"`. |
| 89 | + |
| 90 | +Regardless, at the end, when the parsing is successful, the stack |
| 91 | +collapses to a single element, `TeXExpr(:expr)`, whose arguments contain |
| 92 | +a nested representation of the full LaTeX string. |
| 93 | + |
| 94 | +You can watch the rise and fall of the stack by passing `showdebug=true` to |
| 95 | +`texparse`. |
| 96 | +It is currently not as fun as watching an old empire rise and fall, |
| 97 | +but beware, it is nearly as verbose. |