| 1 | +# Django Template Language Processing: Phased Analysis Approach
| 2 | + |
| 3 | +## Current State Assessment |
| 4 | + |
| 5 | +Currently, the Django template parser in `crates/djls-template-ast` uses a one-pass parsing approach: |
| 6 | + |
| 7 | +1. **Lexical Analysis:** The `Lexer` tokenizes the input template into a `TokenStream`. |
| 8 | +2. **Combined Syntactic & Semantic Analysis:** The `Parser` processes the tokens and directly generates a rich AST (`Ast`) with `Node` objects. This single pass simultaneously: |
| 9 | + * Determines syntactic structure (recognizing tags, variables, text). |
| 10 | + * Applies semantic meaning based on `TagSpecs` (understanding container/branch/closing relationships, tag arguments). |
| 11 | + * Builds the final nested tree structure. |
| 12 | +3. **Validation:** After parsing, a separate `Validator` runs on the rich AST to check for *additional* errors (some semantic checks might have already happened during the parse). |
| 13 | + |
| 14 | +The current implementation mixes syntactic structure recognition with semantic interpretation within a single parsing step. |
| 15 | + |
| 16 | +## Goal |
| 17 | + |
| 18 | +Refactor the language processing pipeline to clearly separate syntactic analysis from semantic analysis, aligning with standard compiler/interpreter design principles. |
| 19 | + |
| 20 | +1. **Phase 1: Syntactic Analysis (Parsing)** |
| 21 | + * Process the `TokenStream` to validate the grammatical structure of the template. |
| 22 | + * Produce a representation of the raw syntax, like a **Syntax Tree** (potentially a flat `NodeList` of simple nodes), *without* interpreting tag meanings or relationships. |
| 23 | + * Focus solely on whether the token sequence forms a valid Django template structure according to the language grammar. |
| 24 | + |
| 25 | +2. **Phase 2: Semantic Analysis (Single-File)** |
| 26 | + * Process the Syntax Tree generated by the Parser. |
| 27 | + * Apply `TagSpecs` to understand the *meaning* and behavior of tags. |
| 28 | + * Resolve tag relationships (container, branch, closing) within the single file. |
| 29 | + * Analyze and validate tag arguments and variable filters. |
| 30 | + * Build the final, rich, semantically-aware AST with proper nesting and semantic information. |
| 31 | + |
| 32 | +3. **Phase 3: Semantic Analysis (Cross-File / Project)** |
| 33 | + * Analyze multiple single-file ASTs in the context of a project. |
| 34 | + * Handle template inheritance (`extends`, `include`, `block`) by building an inheritance graph. |
| 35 | + * Perform cross-file validations and provide cross-file code intelligence features (handled primarily by the LSP Server). |
| 36 | + |
| 37 | +## Architecture: Phased Analysis Pipeline |
| 38 | + |
| 39 | +We will structure the processing pipeline based on distinct analysis phases: |
| 40 | + |
| 41 | +1. **Syntax Layer (Parser)**: |
| 42 | + * **Lexical Analysis:** `Source -> TokenStream` (Existing `Lexer`). |
| 43 | + * **Syntactic Analysis:** `TokenStream -> Syntax Tree` (e.g., `NodeList` of `SimpleNode`). Focuses on grammar and structure only. Generates syntax errors. |
| 44 | + |
| 45 | +2. **Single-File Semantics Layer (Semantic Analyzer)**: |
| 46 | + * **Semantic Analysis:** `Syntax Tree -> Rich AST` (Existing `Ast` structure). Applies `TagSpecs`, resolves tag relationships, validates arguments/filters within one file. Builds the nested AST. Generates single-file semantic errors. |
| 47 | + |
| 48 | +3. **Cross-File Semantics Layer (LSP Server / Project Analyzer)**: |
| 49 | + * **Project-Level Analysis:** `Multiple ASTs -> Inheritance Graph / Cross-File Insights`. Handles `extends`, `include`, `block` resolution across files. Performs cross-file validation. Generates cross-file semantic errors. |
| 50 | + |
| 51 | +This phased approach separates concerns effectively, mirroring how compilers and interpreters process code. |
| 52 | + |
| 53 | +## Rationale for Phased Analysis |
| 54 | + |
| 55 | +### Benefits for Django Template Processing |
| 56 | + |
| 57 | +1. **Alignment with Language Features**: |
| 58 | + * **Container Tags**: Correctly matching `if`/`endif`, `for`/`endfor` pairs and their branches (`elif`, `else`, `empty`) is naturally handled during semantic analysis after the basic structure is known. |
| 59 | + * **Custom Tag Libraries**: `{% load %}` directives can be identified syntactically by the Parser, with the actual tag definitions applied during Semantic Analysis. |
| 60 | + * **Filter Chains & Arguments**: Syntax can be parsed first, with validation and semantic processing (argument checking, filter resolution) deferred to Semantic Analysis. |
| 61 | + |
| 62 | +2. **LSP-Specific Advantages**: |
| 63 | + * **Faster Syntactic Feedback**: The Parser (Syntax Layer) can run quickly, providing immediate feedback on basic syntax errors as the user types. |
| 64 | + * **Clearer Error Categorization**: Errors are naturally categorized by the phase that detects them (Syntax Errors vs. Semantic Errors vs. Cross-File Errors). |
| 65 | + * **Improved Handling of Incomplete Code**: The Parser can produce a usable Syntax Tree even when the template is semantically invalid (e.g., a missing `endif`), so basic features such as highlighting keep working.
| 66 | + * **Enhanced Code Intelligence**: The dedicated Semantic Analysis phase builds a richer AST, enabling more accurate completions, hover information, and navigation within a file. |
| 67 | + |
| 68 | +3. **Technical Benefits**: |
| 69 | + * **Separation of Concerns**: Clear distinction between validating grammar (Parser) and interpreting meaning (Semantic Analyzer). |
| 70 | + * **Simplified Error Recovery**: The Parser can focus on recovering from syntax errors without the complexity of semantic context. |
| 71 | + * **More Maintainable Code**: Isolating semantic logic (tag spec application, relationship resolution) makes the system easier to understand, modify, and extend. |
| 72 | + * **Potentially Better Incremental Performance**: Changes might only require re-running the Parser on a small section and then re-running Semantic Analysis only on the affected parts of the Syntax Tree. |
| 73 | + |
| 74 | +### Performance Considerations for LSP Context |
| 75 | + |
| 76 | +1. **Incremental Processing Optimization**: |
| 77 | + * Small text changes might only require re-lexing and re-parsing a small portion of the source, updating the Syntax Tree locally. |
| 78 | + * Semantic Analysis can then be re-run, potentially only on the changed sub-tree and its ancestors/dependents. |
| 79 | + * Enables more granular invalidation and reprocessing. |
| 80 | + |
| 81 | +2. **Lazy Evaluation**: |
| 82 | + * Semantic Analysis could potentially be executed lazily, only when features requiring the rich AST are invoked. Basic syntax checks use only the Parser's output. |
| 83 | + |
| 84 | +3. **Caching Opportunities**: |
| 85 | + * The Syntax Tree (`NodeList`) output by the Parser is a potential caching point. |
| 86 | + * Results from Semantic Analysis (the rich AST) can also be cached. |
| 87 | + |
| 88 | +4. **Asynchronous Processing**: |
| 89 | + * The fast Parser phase could run synchronously for immediate feedback, while the potentially slower Semantic Analysis phase(s) could run asynchronously. |
| 90 | + |
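The lazy-evaluation and caching points above can be illustrated with `std::cell::OnceCell` (stable since Rust 1.70). This is a minimal sketch under stated assumptions, not the crate's design: `Document`, `tags`, `max_depth`, and `semantic_runs` are hypothetical names, the list of tag names stands in for the cached `NodeList`, and the nesting depth stands in for the rich `Ast`.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical per-file state held by the language server.
pub struct Document {
    source: String,
    tags: OnceCell<Vec<String>>, // stand-in for the cached syntax layer
    max_depth: OnceCell<usize>,  // stand-in for the cached semantic layer
    pub semantic_runs: Cell<u32>, // counts how often the semantic pass ran
}

impl Document {
    pub fn new(source: &str) -> Self {
        Self {
            source: source.to_string(),
            tags: OnceCell::new(),
            max_depth: OnceCell::new(),
            semantic_runs: Cell::new(0),
        }
    }

    /// Cheap syntactic layer: tag names in source order (toy scanner).
    pub fn tags(&self) -> &[String] {
        self.tags.get_or_init(|| {
            self.source
                .split("{%")
                .skip(1)
                .filter_map(|chunk| chunk.split("%}").next())
                .filter_map(|body| body.split_whitespace().next())
                .map(str::to_string)
                .collect()
        })
    }

    /// Costlier semantic layer: computed on first request, then reused.
    pub fn max_depth(&self) -> usize {
        *self.max_depth.get_or_init(|| {
            self.semantic_runs.set(self.semantic_runs.get() + 1);
            let (mut depth, mut max) = (0usize, 0usize);
            for tag in self.tags() {
                if tag.starts_with("end") {
                    depth = depth.saturating_sub(1);
                } else if matches!(tag.as_str(), "if" | "for" | "block") {
                    depth += 1;
                    max = max.max(depth);
                }
            }
            max
        })
    }

    /// An edit invalidates both cached layers (whole-file, not incremental).
    pub fn edit(&mut self, source: &str) {
        self.source = source.to_string();
        self.tags = OnceCell::new();
        self.max_depth = OnceCell::new();
    }
}
```

Calling `tags()` alone never triggers the semantic pass, and repeated `max_depth()` calls reuse the cached value until `edit` resets it. Finer-grained invalidation would need the dependency tracking discussed above.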
| 91 | +## Detailed Design |
| 92 | + |
| 93 | +### Syntax Tree Structure (`NodeList`) |
| 94 | + |
| 95 | +The output of the **Parser (Syntax Layer)** will be a `NodeList`, representing the basic syntactic structure. It's a flat, sequential list corresponding closely to the significant tokens: |
| 96 | + |
| 97 | +```rust |
| 98 | +// Output of the Parser (Syntax Layer)
| 99 | +pub struct NodeList {
| 100 | +    nodes: Vec<SimpleNode>,
| 101 | +    line_offsets: LineOffsets, // Derived during lexing/parsing
| 102 | +}
| 103 | +
| 104 | +// Represents a node recognized purely based on syntax
| 105 | +pub enum SimpleNode {
| 106 | +    Tag {
| 107 | +        name: String,    // Syntactically identified tag name
| 108 | +        content: String, // Raw content inside {% ... %}
| 109 | +        span: Span,
| 110 | +    },
| 111 | +    Variable {
| 112 | +        content: String, // Raw content inside {{ ... }}
| 113 | +        span: Span,
| 114 | +    },
| 115 | +    Text {
| 116 | +        content: String,
| 117 | +        span: Span,
| 118 | +    },
| 119 | +    Comment {
| 120 | +        // Only {# ... #} comments recognized here
| 121 | +        content: String,
| 122 | +        span: Span,
| 123 | +    },
| 124 | +    // Maybe other types like HtmlTag if needed for basic structure
| 125 | +}
| 125 | +``` |
| 126 | + |
| 127 | +Key characteristics: |
| 128 | +1. Represents output of *syntactic analysis* only. |
| 129 | +2. `SimpleNode` variants contain raw content, minimal processing. |
| 130 | +3. Flat list structure (or a very basic tree if preferred). |
| 131 | +4. No semantic understanding (tag types, relationships, filters). |
| 132 | + |
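To make the "syntax only" idea concrete, here is a minimal sketch of a parser that produces a flat list of `SimpleNode`s. It is an illustration under assumptions rather than the planned implementation: it scans the source string directly instead of consuming the existing `TokenStream`, `parse_flat` is a hypothetical name, `Span` is assumed to be a byte range, and errors are plain strings.

```rust
/// Byte range into the source (assumed shape; the crate's `Span` may differ).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub enum SimpleNode {
    Tag { name: String, content: String, span: Span },
    Variable { content: String, span: Span },
    Text { content: String, span: Span },
    Comment { content: String, span: Span },
}

/// Hypothetical helper: split a template into flat nodes without consulting
/// any `TagSpecs`. Returns the nodes plus any syntax errors.
pub fn parse_flat(src: &str) -> (Vec<SimpleNode>, Vec<String>) {
    let mut nodes = Vec::new();
    let mut errors = Vec::new();
    let mut pos = 0;

    while pos < src.len() {
        let rest = &src[pos..];
        // Nearest opening delimiter, if any.
        let next = ["{%", "{{", "{#"].iter().filter_map(|d| rest.find(*d)).min();
        let start = match next {
            Some(offset) => pos + offset,
            None => src.len(),
        };
        // Everything before the delimiter is plain text.
        if start > pos {
            nodes.push(SimpleNode::Text {
                content: src[pos..start].to_string(),
                span: Span { start: pos, end: start },
            });
        }
        if start == src.len() {
            break;
        }
        let open = &src[start..start + 2];
        let close = match open {
            "{%" => "%}",
            "{{" => "}}",
            _ => "#}",
        };
        let body_start = start + 2;
        match src[body_start..].find(close) {
            Some(len) => {
                let end = body_start + len + 2;
                let span = Span { start, end };
                let content = src[body_start..body_start + len].trim().to_string();
                nodes.push(match open {
                    "{%" => {
                        let name = content.split_whitespace().next().unwrap_or("").to_string();
                        SimpleNode::Tag { name, content, span }
                    }
                    "{{" => SimpleNode::Variable { content, span },
                    _ => SimpleNode::Comment { content, span },
                });
                pos = end;
            }
            None => {
                errors.push(format!("unclosed `{open}` at byte {start}"));
                // Recovery: keep the remainder as text so a partial list survives.
                nodes.push(SimpleNode::Text {
                    content: src[start..].to_string(),
                    span: Span { start, end: src.len() },
                });
                break;
            }
        }
    }
    (nodes, errors)
}
```

Note that `{% if x %}` and `{% endif %}` come out as two unrelated `Tag` nodes; pairing them is left entirely to the semantic phase.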
| 133 | +### Semantic Analysis Process (Single-File) |
| 134 | + |
| 135 | +The **Semantic Analyzer (Single-File Semantics Layer)** takes the `NodeList` (Syntax Tree) as input and produces the rich `Ast`: |
| 136 | + |
| 137 | +1. Iterate through the `NodeList`. |
| 138 | +2. For `SimpleNode::Tag` nodes: |
| 139 | + * Look up the `tag.name` in the `TagSpecs`. |
| 140 | + * Based on the `TagSpec` (Container, Single, Inclusion): |
| 141 | + * **Container:** Find matching closing tags (`SimpleNode::Tag` with expected name) later in the `NodeList`. Identify intermediate branch tags. Recursively process nodes between the opening/closing/branch tags to build nested `Node::Block(Block::Container)` or `Node::Block(Block::Branch)`. |
| 142 | + * **Single:** Create a `Node::Block(Block::Single)`. |
| 143 | + * **Inclusion:** Create a `Node::Block(Block::Inclusion)`. Identify template name argument syntactically. |
| 144 | + * Parse and validate tag arguments (`tag.content`) according to `ArgSpec` in the `TagSpec`. |
| 145 | +3. For `SimpleNode::Variable` nodes: |
| 146 | + * Parse the `variable.content` into variable bits and `DjangoFilter`s. |
| 147 | + * Validate filter syntax. (Actual filter existence/argument validation might involve `TagSpecs` or a separate filter registry). |
| 148 | + * Create a `Node::Variable`. |
| 149 | +4. For `SimpleNode::Text` and `SimpleNode::Comment`: Create corresponding `Node::Text` / `Node::Comment`. |
| 150 | +5. Assemble these rich `Node` objects into the final nested `Ast` structure. |
| 151 | +6. Collect semantic errors encountered during this process (mismatched tags, invalid arguments, unknown filters, etc.). |
| 152 | + |
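The container-matching step above can be sketched with an explicit stack. This is a simplified illustration, not the planned analyzer: `Flat`, `Node`, `toy_specs`, and `nest` are hypothetical names, the `HashMap` is a toy replacement for `TagSpecs`, and branch tags (`elif`, `else`, `empty`), spans, and argument validation are left out.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `SimpleNode` (spans and raw content omitted).
#[derive(Debug, Clone, PartialEq)]
pub enum Flat {
    Tag(String),
    Text(String),
}

/// Simplified stand-in for the rich `Node` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Single(String),
    Container { name: String, children: Vec<Node> },
}

/// Toy replacement for `TagSpecs`: opening tag name -> closing tag name.
pub fn toy_specs() -> HashMap<&'static str, &'static str> {
    HashMap::from([("if", "endif"), ("for", "endfor"), ("block", "endblock")])
}

/// Turn the flat list into a nested tree, collecting semantic errors.
pub fn nest(flat: &[Flat], specs: &HashMap<&str, &str>) -> (Vec<Node>, Vec<String>) {
    // Each frame holds the tag that opened it (None for the root) and its children.
    let mut stack: Vec<(Option<String>, Vec<Node>)> = vec![(None, Vec::new())];
    let mut errors = Vec::new();

    for item in flat {
        match item {
            Flat::Text(text) => stack.last_mut().unwrap().1.push(Node::Text(text.clone())),
            // Opening tag of a container: start a new frame.
            Flat::Tag(name) if specs.contains_key(name.as_str()) => {
                stack.push((Some(name.clone()), Vec::new()));
            }
            // Closing tag: it must match the innermost open container.
            Flat::Tag(name) if specs.values().any(|closer| *closer == name.as_str()) => {
                let closes_top = stack
                    .last()
                    .and_then(|(open, _)| open.as_deref())
                    .and_then(|open| specs.get(open))
                    .map_or(false, |closer| *closer == name.as_str());
                if closes_top {
                    let (open, children) = stack.pop().unwrap();
                    let node = Node::Container { name: open.unwrap(), children };
                    stack.last_mut().unwrap().1.push(node);
                } else {
                    errors.push(format!("unexpected closing tag `{name}`"));
                }
            }
            // Anything else is treated as a single (non-container) tag.
            Flat::Tag(name) => stack.last_mut().unwrap().1.push(Node::Single(name.clone())),
        }
    }

    // Containers still open at end of input are reported and attached as-is.
    while stack.len() > 1 {
        let (open, children) = stack.pop().unwrap();
        let name = open.unwrap();
        errors.push(format!("unclosed `{name}`"));
        stack.last_mut().unwrap().1.push(Node::Container { name, children });
    }
    (stack.pop().unwrap().1, errors)
}
```

A stack keeps the pass linear in the number of nodes and makes recovery simple: a stray closing tag is reported and skipped, and an unclosed container still yields a partial tree.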
| 153 | +### Template Inheritance Handling (Cross-File Semantics) |
| 154 | + |
| 155 | +Template inheritance (`extends`, `block`, `include`) is handled by the **LSP Server / Project Analyzer (Cross-File Semantics Layer)**:
| 156 | + |
| 157 | +1. The Semantic Analyzer (Single-File) identifies inheritance-related tags (`{% extends %}`, `{% block %}`, `{% include %}`) and represents them in the rich `Ast` like other tags (e.g., as `Block::Single` or `Block::Container`). |
| 158 | +2. The LSP Server Layer: |
| 159 | + * Collects `Ast`s from all relevant project files. |
| 160 | + * Builds an inheritance graph based on `extends` and `include` relationships found in the ASTs. |
| 161 | + * Resolves `block` overrides across the inheritance chain. |
| 162 | + * Provides cross-file validation (circular extends, missing templates/blocks). |
| 163 | + * Powers LSP features requiring cross-file knowledge. |
| 164 | + |
| 165 | +This separation aligns with Django's own rendering process where inheritance is resolved after individual templates are parsed. |
| 166 | + |
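As a rough sketch of the cross-file validation described above, the following walks an `extends` chain and reports circular or missing templates. The names `ChainError` and `inheritance_chain` are hypothetical, and the project index is assumed to be a simple map from template name to its optional parent, which the real LSP layer would derive from the per-file `Ast`s.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum ChainError {
    /// A template named in `{% extends %}` is not part of the project index.
    Missing(String),
    /// Following `extends` links leads back to an already visited template.
    Cycle(String),
}

/// Walk from `start` up to the root template, returning the chain in order.
/// `parents` maps each known template to the parent it extends, if any.
pub fn inheritance_chain(
    parents: &HashMap<String, Option<String>>,
    start: &str,
) -> Result<Vec<String>, ChainError> {
    let mut chain = Vec::new();
    let mut seen = HashSet::new();
    let mut current = start.to_string();

    loop {
        // Visiting the same template twice means the chain is circular.
        if !seen.insert(current.clone()) {
            return Err(ChainError::Cycle(current));
        }
        let parent = parents
            .get(&current)
            .ok_or_else(|| ChainError::Missing(current.clone()))?;
        chain.push(current.clone());
        match parent {
            Some(next) => current = next.clone(),
            None => return Ok(chain),
        }
    }
}
```

Resolving `block` overrides could then iterate over the returned chain from child to root, but that step is not shown here.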
| 167 | +## Implementation Plan |
| 168 | + |
| 169 | +### Phase 1: Define Syntax Tree Structure |
| 170 | +- [ ] Define `SimpleNode` enum (Tag, Variable, Text, Comment). |
| 171 | +- [ ] Define `NodeList` struct holding `Vec<SimpleNode>` and `LineOffsets`. |
| 172 | +- [ ] Implement basic methods for `NodeList`. |
| 173 | +- [ ] Ensure `Span` information is accurately captured for `SimpleNode`s. |
| 174 | +- [ ] Add tests for the `NodeList` and `SimpleNode` structures. |
| 175 | + |
| 176 | +### Phase 2: Implement Syntactic Parser |
| 177 | +- [ ] Refactor `parser.rs` into a `SyntacticParser` (or similar name). |
| 178 | +- [ ] Implement logic to consume `TokenStream` and produce a `NodeList`. |
| 179 | +- [ ] Focus solely on recognizing syntactic structures (`{% .. %}`, `{{ .. }}`, etc.) and mapping them to `SimpleNode`s *without* using `TagSpecs`. |
| 180 | +- [ ] Handle basic syntax error detection (e.g., unclosed `{%`). |
| 181 | +- [ ] Implement error recovery to continue parsing and produce a partial `NodeList`. |
| 182 | +- [ ] Preserve accurate `Span` information from tokens to `SimpleNode`s. |
| 183 | +- [ ] Create unit tests specifically for the Syntactic Parser. |
| 184 | + |
| 185 | +### Phase 3: Implement Single-File Semantic Analyzer |
| 186 | +- [ ] Create a new `SemanticAnalyzer` struct/module. |
| 187 | +- [ ] Implement the logic to process an input `NodeList` (Syntax Tree). |
| 188 | +- [ ] Integrate `TagSpecs` lookup. |
| 189 | +- [ ] Implement algorithms for matching container/branch/closing tags based on `TagSpecs`. |
| 190 | +- [ ] Implement logic to parse/validate tag arguments based on `ArgSpec`. |
| 191 | +- [ ] Implement logic to parse/validate variable filters. |
| 192 | +- [ ] Build the final rich `Ast` tree structure with correct nesting. |
| 193 | +- [ ] Collect and report semantic errors found during analysis (e.g., missing `endif`, bad arguments). |
| 194 | +- [ ] Create unit tests for the Semantic Analyzer. |
| 195 | + |
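The filter-parsing task in the checklist above could start from something like the following. It is a deliberately naive sketch: `Filter` and `parse_variable` are hypothetical names (the crate's `DjangoFilter` may look different), and splitting on `|` does not handle a pipe inside a quoted argument, which a real implementation would need to tokenize properly.

```rust
/// Hypothetical, simplified filter representation.
#[derive(Debug, Clone, PartialEq)]
pub struct Filter {
    pub name: String,
    pub arg: Option<String>,
}

/// Split raw variable content such as `user.name|default:"n/a"|title`
/// into lookup bits and a filter chain. Only syntax is checked here;
/// whether a filter actually exists is a separate question.
pub fn parse_variable(content: &str) -> Result<(Vec<String>, Vec<Filter>), String> {
    let mut parts = content.split('|').map(str::trim);

    // The first segment is the variable expression itself.
    let expr = parts.next().unwrap_or("");
    if expr.is_empty() {
        return Err("empty variable expression".to_string());
    }
    let bits = expr.split('.').map(str::to_string).collect();

    // Every remaining segment is `name` or `name:arg`.
    let mut filters = Vec::new();
    for part in parts {
        let (name, arg) = match part.split_once(':') {
            Some((name, arg)) => (name.trim(), Some(arg.trim().to_string())),
            None => (part, None),
        };
        if name.is_empty() {
            return Err(format!("empty filter name in `{content}`"));
        }
        filters.push(Filter { name: name.to_string(), arg });
    }
    Ok((bits, filters))
}
```

Returning a `Result` keeps the sketch short; the analyzer described in this plan would more likely record an error with a `Span` and continue.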
| 196 | +### Phase 4: LSP-Specific Optimizations |
| 197 | +- [ ] Implement incremental parsing support for the Syntactic Parser (Phase 2). |
| 198 | +- [ ] Add caching for the `NodeList` (Syntax Tree). |
| 199 | +- [ ] Implement selective invalidation/reprocessing for the Semantic Analyzer (Phase 3) based on changes in the `NodeList`. |
| 200 | +- [ ] Explore lazy execution of Semantic Analysis for syntax-only operations. |
| 201 | +- [ ] Add performance benchmarks focused on editing scenarios and LSP request timings. |
| 202 | + |
| 203 | +### Phase 5: Implement Cross-File Semantic Analysis Framework (LSP Layer) |
| 204 | +- [ ] Design components within the LSP server for managing multiple ASTs. |
| 205 | +- [ ] Create data structures for the template inheritance graph. |
| 206 | +- [ ] Implement logic to build the graph by analyzing `extends`/`include` tags in ASTs. |
| 207 | +- [ ] Implement `block` resolution logic across the inheritance chain. |
| 208 | +- [ ] Create interfaces for LSP features (diagnostics, navigation) to query this cross-file information. |
| 209 | +- [ ] Implement tests for template inheritance resolution. |
| 210 | + |
| 211 | +### Phase 6: Error Handling Strategy |
| 212 | +- [ ] Define distinct error types for Syntax Errors (from Parser), Single-File Semantic Errors (from Analyzer), and Cross-File Errors (from LSP Layer). |
| 213 | +- [ ] Implement robust error collection mechanisms for each phase. |
| 214 | +- [ ] Ensure all errors retain accurate `Span` information traceable to the original source. |
| 215 | +- [ ] Provide clear error messages indicating the nature (syntax vs. semantic) and source of the error. |
| 216 | +- [ ] Implement error recovery in both the Parser and Semantic Analyzer. |
| 217 | +- [ ] Add tests specifically for error handling and recovery across phases. |
| 218 | + |
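One possible shape for the phase-specific error types is a single enum whose variant records the phase that produced the error. This is a sketch under assumptions, not a committed design: `TemplateError` and its fields are hypothetical, `Span` is assumed to be a byte range, and separate error types per crate would be an equally valid choice.

```rust
use std::fmt;

/// Byte range into the source (assumed shape; the crate's `Span` may differ).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Hypothetical error type: one variant per analysis phase.
#[derive(Debug, Clone, PartialEq)]
pub enum TemplateError {
    Syntax { message: String, span: Span },
    Semantic { message: String, span: Span },
    CrossFile { message: String, span: Span, other_file: String },
}

impl TemplateError {
    /// Name of the phase that detected the error.
    pub fn phase(&self) -> &'static str {
        match self {
            Self::Syntax { .. } => "syntax",
            Self::Semantic { .. } => "semantic",
            Self::CrossFile { .. } => "cross-file",
        }
    }

    pub fn message(&self) -> &str {
        match self {
            Self::Syntax { message, .. }
            | Self::Semantic { message, .. }
            | Self::CrossFile { message, .. } => message.as_str(),
        }
    }

    /// Location in the file where the error is reported.
    pub fn span(&self) -> Span {
        match self {
            Self::Syntax { span, .. }
            | Self::Semantic { span, .. }
            | Self::CrossFile { span, .. } => *span,
        }
    }
}

impl fmt::Display for TemplateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let span = self.span();
        write!(f, "{} error at {}..{}: {}", self.phase(), span.start, span.end, self.message())
    }
}

impl std::error::Error for TemplateError {}
```

Because every variant carries a `Span`, diagnostics from all three phases can be merged into one list and still point back at the original source.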
| 219 | +### Phase 7: Update Validation and Public API |
| 220 | +- [ ] Review/Update the existing `Validator` (`validator.rs`). Much of its logic will move into the Semantic Analyzer (Phase 3). Determine if a separate final validation step is still needed on the rich AST. |
| 221 | +- [ ] Update the public API in `lib.rs` (e.g., `parse_template`) to reflect the new internal pipeline. Options: |
| 222 | + * Expose only the final rich `Ast` and combined errors. |
| 223 | + * Optionally expose the intermediate `NodeList` (Syntax Tree) for tools needing only syntax info. |
| 224 | +- [ ] Maintain backward compatibility during transition if possible, or clearly document API changes. |
| 225 | +- [ ] Update documentation for the new architecture and API. |
| 226 | + |
| 227 | +### Phase 8: Testing and Performance Optimization |
| 228 | +- [ ] Create comprehensive integration tests covering the entire pipeline (Lexer -> Parser -> Semantic Analyzer). |
| 229 | +- [ ] Test complex templates with nesting, various tags, filters, etc. |
| 230 | +- [ ] Test template inheritance scenarios via the LSP layer integration. |
| 231 | +- [ ] Benchmark performance of each phase and the end-to-end process. Compare against the old one-pass approach. |
| 232 | +- [ ] Identify and optimize bottlenecks, particularly for incremental updates. |
| 233 | +- [ ] Document performance characteristics and trade-offs. |
| 234 | + |
| 235 | +## Progressive Implementation Strategy |
| 236 | + |
| 237 | +(This can remain largely the same as the original plan) |
| 238 | + |
| 239 | +1. **Initial Development Phase**: Implement the Syntactic Parser and Semantic Analyzer alongside the existing parser, using a feature flag or configuration option to switch. |
| 240 | +2. **Testing Phase**: Run both the old and new pipelines on test suites, compare AST outputs (where possible) and error reporting, fix discrepancies. |
| 241 | +3. **Transition Phase**: Default to the original parser but allow opt-in to the new phased pipeline via configuration. Gather feedback. |
| 242 | +4. **Completion Phase**: Make the new phased pipeline the default. Deprecate and eventually remove the old one-pass parser code. |
| 243 | + |
| 244 | +## Detailed Progress Checklist |
| 245 | + |
| 246 | +(Update checklist items based on the revised phase names and tasks described above) |
| 247 | + |
| 248 | +### Phase 1: Define Syntax Tree Structure |
| 249 | +- [ ] Define `SimpleNode` enum... |
| 250 | +- [ ] Define `NodeList` struct... |
| 251 | +... (etc.) |
| 252 | + |
| 253 | +### Phase 2: Implement Syntactic Parser |
| 254 | +- [ ] Create `SyntacticParser` module/struct... |
| 255 | +- [ ] Implement `TokenStream` to `NodeList` conversion... |
| 256 | +- [ ] Add syntax error collection... |
| 257 | +... (etc.) |
| 258 | + |
| 259 | +### Phase 3: Implement Single-File Semantic Analyzer |
| 260 | +- [ ] Create `SemanticAnalyzer` module/struct... |
| 261 | +- [ ] Implement `NodeList` processing... |
| 262 | +- [ ] Add `TagSpecs` integration... |
| 263 | +... (etc.) |
| 264 | + |
| 265 | +*(Continue updating checklists for Phases 4-8 similarly)* |
| 266 | + |
| 267 | +## Notes and LSP Considerations |
| 268 | + |
| 269 | +- **Clear Phasing**: The Syntax -> Single-File Semantics -> Cross-File Semantics phasing provides a clean workflow, isolating different levels of complexity. |
| 270 | +- **Responsiveness**: The primary LSP benefit comes from the fast **Parser (Syntax Layer)** providing quick syntax validation. Semantic analysis can potentially run asynchronously or with delays. |
| 271 | +- **Memory Usage**: Storing the intermediate Syntax Tree (`NodeList`) plus the final rich `Ast` will increase memory usage compared to the one-pass approach. Monitor impact. |
| 272 | +- **Incremental Updates**: Key advantage for LSP. Changes trigger local re-parsing (syntactic analysis), potentially followed by targeted re-analysis (single-file semantic analysis) of affected nodes/sub-trees. Requires careful dependency tracking.
| 273 | +- **Error Resilience**: The Parser can produce a useful Syntax Tree even if semantic errors exist later. Errors are clearly tied to the phase (Syntax, Semantic) that found them. |
| 274 | +- **Template Inheritance**: Explicitly handled in the final, cross-file semantic phase (LSP Layer), keeping the core parser and single-file analyzer focused. |
| 275 | +- **Custom Tags**: `{% load %}` identified by the Parser; tag definitions applied by the Semantic Analyzer using updated `TagSpecs`. |
| 276 | +- **Performance Monitoring**: Crucial to benchmark phase timings, especially for incremental updates, to ensure LSP responsiveness goals are met. |