feat: EntityReference rule now respects AsciiDoc subs attributes

rolfedh · claude · rolfedh · commit eaaf3d280a1b · 2025-07-31T20:20:18.000-04:00
- Check code block subs attribute to determine if entities should be processed
- Only fix entities in code blocks when subs includes "replacements"
- Respect subs="none", subs="attributes+", subs="normal" etc.
- Add comprehensive tests for different subs scenarios
- Add documentation explaining the behavior

This ensures that:
- Code examples preserve literal entities by default
- Users can opt-in to entity processing with subs="replacements"
- Aditi follows AsciiDoc's substitution model correctly

Related to upstream issue jhradilek/asciidoctor-dita-vale#98
Addresses #13

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
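
The decision described in this message can be sketched in a few lines. This is a simplified, standalone illustration and not the code in this commit: `extract_subs` and `replacements_enabled` are hypothetical names, and the sketch assumes the block attribute line (for example `[source,html,subs="replacements"]`) has already been located.

```python
import re
from typing import Optional


def extract_subs(attr_line: str) -> Optional[str]:
    """Return the raw subs value from a block attribute line, or None."""
    # Quoted form: subs="..." or subs='...'
    match = re.search(r'subs\s*=\s*["\']([^"\']+)["\']', attr_line)
    if match:
        return match.group(1)
    # Unquoted form: subs=value, terminated by a comma or closing bracket
    match = re.search(r'subs\s*=\s*([^,\]]+)', attr_line)
    return match.group(1).strip() if match else None


def replacements_enabled(attr_line: str) -> bool:
    """Decide whether entities in the block should be processed.

    Assumption: a code block with no subs attribute leaves entities literal.
    """
    subs = extract_subs(attr_line)
    if subs is None or subs == "none":
        return False
    if subs == "normal":
        return True  # the normal group includes replacements
    names = set()
    for part in subs.split(","):
        part = part.strip()
        if part.startswith("-"):
            names.discard(part[1:])  # "-name" removes a substitution
        else:
            names.add(part.strip("+"))  # "name", "name+", and "+name" all add it
    return "replacements" in names


for line in ('[source,html]',
             '[source,html,subs="replacements"]',
             '[source,terminal,subs="attributes+"]',
             '[listing,subs="normal"]'):
    print(line, replacements_enabled(line))
```

Only blocks whose `subs` value resolves to a set containing `replacements` have their entities converted; every other code block keeps them literal.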
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -308,25 +308,25 @@ A comprehensive test suite prevents Jekyll deployment failures:
 ## Recent Development Focus (July 2025)
 
 ### Statistics
-- Total commits: 220
+- Total commits: 221
 
 ### Latest Achievements
+- ✅ EntityReference rule now respects AsciiDoc subs attributes.
 - ✅ Implement single-source versioning.
 - ✅ Add intermediate recheck step and fix accurate fix counting in journey workflow.
 - ✅ Add vale configuration and update asciidocdita styles for improved validation.
 - ✅ Reintroduce claude.md updater workflow with enhanced commit parsing.
-- ✅ Improve directory selection ui for better user experience.
 
 ### Development Focus
 - **Ci/Cd**: 89 commits
 - **Features**: 23 commits
 - **Bug Fixes**: 18 commits
 - **Documentation**: 17 commits
-- **Testing**: 12 commits
+- **Testing**: 13 commits
 
 ### Most Active Files
 - `docs/_data/recent_commits.yml`: 85 changes
-- `CLAUDE.md`: 63 changes
+- `CLAUDE.md`: 64 changes
 - `src/aditi/commands/journey.py`: 19 changes
 <!-- /AUTO-GENERATED:RECENT -->
 
diff --git a/docs/ENTITY_REFERENCE_HANDLING.md b/docs/ENTITY_REFERENCE_HANDLING.md
@@ -0,0 +1,91 @@
+# EntityReference Rule - Code Block Handling
+
+This document explains how Aditi's EntityReference rule handles HTML entities in different contexts, particularly in code blocks.
+
+## Overview
+
+The EntityReference rule converts unsupported HTML entities (like `&nbsp;`, `&copy;`, etc.) to DITA-compatible AsciiDoc attributes (like `{nbsp}`, `{copy}`). However, entities in code contexts should often remain literal.
+
+## Code Block Behavior
+
+### Default Behavior
+
+By default, entities in code blocks are **NOT** converted:
+
+```asciidoc
+[source,html]
+----
+<p>Hello&nbsp;World</p>  <!-- &nbsp; remains literal -->
+----
+```
+
+### With Substitutions Enabled
+
+When the `subs` attribute includes `replacements`, entities **ARE** converted:
+
+```asciidoc
+[source,html,subs="replacements"]
+----
+<p>Hello&nbsp;World</p>  <!-- &nbsp; becomes {nbsp} -->
+----
+```
+
+### Common Substitution Patterns
+
+1. **`subs="attributes+"`** - Only processes attribute references, NOT entities:
+   ```asciidoc
+   [source,terminal,subs="attributes+"]
+   ----
+   echo "Version {version}"    # {version} is replaced
+   echo "Hello&nbsp;World"     # &nbsp; remains literal
+   ----
+   ```
+
+2. **`subs="attributes+,replacements"`** - Processes both:
+   ```asciidoc
+   [source,html,subs="attributes+,replacements"]
+   ----
+   <p>{product}&nbsp;v{version}</p>  # Both {product} and &nbsp; are processed
+   ----
+   ```
+
+3. **`subs="normal"`** - All normal substitutions including replacements:
+   ```asciidoc
+   [listing,subs="normal"]
+   ----
+   Text with &copy; symbol  # &copy; becomes {copy}
+   ----
+   ```
+
+4. **`subs="none"`** - No substitutions at all:
+   ```asciidoc
+   [source,html,subs="none"]
+   ----
+   <p>&trade; {version}</p>  # Nothing is processed
+   ----
+   ```
+
+## Inline Code
+
+Entities in inline code (backticks) are **NEVER** converted:
+
+```asciidoc
+Use the `&nbsp;` entity in HTML.  # &nbsp; remains literal
+```
+
+## Why This Matters
+
+This behavior ensures that:
+1. Code examples remain accurate and don't have their entities converted
+2. When you DO want entities processed in code blocks (e.g., for documentation), you can enable it with `subs="replacements"`
+3. Aditi respects AsciiDoc's substitution model
+
+## Known Limitations
+
+- Nested code blocks are not fully supported. The outer block's settings may affect inner blocks.
+- Complex substitution patterns (like conditional processing) follow AsciiDoc's standard rules.
+
+## Related Issues
+
+- [Vale Issue #98](https://github.com/jhradilek/asciidoctor-dita-vale/issues/98) - Vale incorrectly flags entities in code blocks
+- [Aditi Issue #13](https://github.com/rolfedh/aditi/issues/13) - Aditi correctly handles these cases
diff --git a/src/aditi/rules/entity_reference.py b/src/aditi/rules/entity_reference.py
@@ -127,23 +127,194 @@ def validate_fix(self, fix: Fix, file_content: str) -> bool:
         if not line_content:
             return False
             
-        # Check if we're inside a code block
-        # Look for code block delimiters before the current line
-        lines = file_content.splitlines()
-        in_code_block = False
-        for i in range(fix.violation.line - 1):
-            line = lines[i].strip()
-            if line == "----" or line == "....":
-                in_code_block = not in_code_block
-                
-        if in_code_block:
-            return False
-            
         # Check if we're inside inline code
         # Count backticks before the entity position
         before_text = line_content[:fix.violation.column - 1]
         backtick_count = before_text.count("`")
         if backtick_count % 2 != 0:  # Odd number means we're inside inline code
             return False
             
-        return True
+        # Check if we're inside a code block and if replacements are enabled
+        lines = file_content.splitlines()
+        code_block_info = self._get_code_block_context(lines, fix.violation.line - 1)
+        
+        if code_block_info['in_code_block']:
+            # Check if replacements are enabled for this code block
+            if code_block_info['replacements_enabled']:
+                return True  # Entities should be processed
+            else:
+                return False  # Entities should remain literal
+            
+        return True
+    
+    def _get_code_block_context(self, lines: list, target_line_idx: int) -> dict:
+        """Determine if we're in a code block and check its substitution settings.
+        
+        Args:
+            lines: List of all lines in the document
+            target_line_idx: Zero-based index of the target line
+            
+        Returns:
+            Dict with 'in_code_block' and 'replacements_enabled' flags
+        """
+        in_code_block = False
+        block_type = None
+        block_start_line = -1
+        subs_value = None
+        pending_source_subs = None  # Store subs from [source] line
+        
+        for i in range(min(target_line_idx + 1, len(lines))):
+            line = lines[i].strip()
+            
+            # First check if this is a source attribute line
+            if line.startswith("[source"):
+                # Extract subs but don't mark as in block yet
+                pending_source_subs = self._extract_subs_from_line(line)
+                continue
+                
+            # Check for listing/source block delimiters
+            if line == "----":
+                if not in_code_block:
+                    in_code_block = True
+                    block_type = "listing"
+                    block_start_line = i
+                    # Check if there was a source line just before
+                    if i > 0 and pending_source_subs is not None:
+                        subs_value = pending_source_subs
+                        block_type = "source"
+                    else:
+                        # Look for other attributes in previous lines
+                        subs_value = self._find_block_attributes(lines, i)
+                    pending_source_subs = None  # Reset
+                else:
+                    # Closing delimiter
+                    in_code_block = False
+                    block_type = None
+                    subs_value = None
+                    pending_source_subs = None
+            elif line == "....":
+                if not in_code_block:
+                    in_code_block = True
+                    block_type = "literal"
+                    block_start_line = i
+                    # Look for attributes in previous lines
+                    subs_value = self._find_block_attributes(lines, i)
+                else:
+                    # Closing delimiter
+                    in_code_block = False
+                    block_type = None
+                    subs_value = None
+            else:
+                # Any other line resets pending source
+                if line and not line.startswith("["):
+                    pending_source_subs = None
+        
+        # Determine if replacements are enabled
+        replacements_enabled = False
+        if subs_value:
+            # Parse subs value
+            subs_list = self._parse_subs_value(subs_value)
+            replacements_enabled = 'replacements' in subs_list
+        
+        return {
+            'in_code_block': in_code_block,
+            'replacements_enabled': replacements_enabled,
+            'block_type': block_type,
+            'subs': subs_value
+        }
+    
+    def _find_block_attributes(self, lines: list, delimiter_idx: int) -> Optional[str]:
+        """Find block attributes that might contain subs setting.
+        
+        Args:
+            lines: List of all lines
+            delimiter_idx: Index of the block delimiter line
+            
+        Returns:
+            The subs value if found, None otherwise
+        """
+        # Scan up to 3 lines preceding the delimiter; the earliest attribute line in that window wins
+        for i in range(max(0, delimiter_idx - 3), delimiter_idx):
+            line = lines[i].strip()
+            # Check for [source,...] or [listing,...] style attributes
+            if line.startswith('[') and line.endswith(']'):
+                return self._extract_subs_from_line(line)
+        return None
+    
+    def _extract_subs_from_line(self, line: str) -> Optional[str]:
+        """Extract subs value from an attribute line.
+        
+        Args:
+            line: Line containing attributes like [source,java,subs="attributes+"]
+            
+        Returns:
+            The subs value if found, None otherwise
+        """
+        import re
+        
+        # Look for subs="value" or subs='value'
+        match = re.search(r'subs\s*=\s*["\']([^"\']+)["\']', line)
+        if match:
+            return match.group(1)
+        
+        # Look for subs=value (without quotes)
+        match = re.search(r'subs\s*=\s*([^,\]]+)', line)
+        if match:
+            return match.group(1).strip()
+        
+        return None
+    
+    def _parse_subs_value(self, subs_value: str) -> list:
+        """Parse the subs attribute value into a list of substitutions.
+        
+        Args:
+            subs_value: Value like "attributes+", "replacements", "+replacements,-attributes"
+            
+        Returns:
+            List of active substitution types
+        """
+        if not subs_value:
+            return []
+        
+        # Handle special values
+        if subs_value == 'normal':
+            # Normal substitutions
+            return ['specialcharacters', 'quotes', 'attributes', 'replacements', 'macros', 'post_replacements']
+        elif subs_value == 'none':
+            return []
+        elif subs_value == 'verbatim':
+            return ['specialcharacters']
+        
+        # For code blocks, default is no substitutions
+        # We start with empty list and only add what's explicitly requested
+        active_subs = []
+        
+        # Parse comma-separated list with +/- modifiers
+        parts = [p.strip() for p in subs_value.split(',')]
+        
+        for part in parts:
+            if not part:
+                continue
+                
+            # Check for trailing + which means "add to defaults"
+            # For code blocks, default is empty, so "attributes+" just adds attributes
+            if part.endswith('+') and not part.startswith('+'):
+                sub_type = part[:-1]  # Remove trailing +
+                if sub_type and sub_type not in active_subs:
+                    active_subs.append(sub_type)
+            elif part.startswith('+'):
+                # Explicit add with +prefix
+                sub_type = part[1:]
+                if sub_type and sub_type not in active_subs:
+                    active_subs.append(sub_type)
+            elif part.startswith('-'):
+                # Remove from existing
+                sub_type = part[1:]
+                if sub_type in active_subs:
+                    active_subs.remove(sub_type)
+            else:
+                # No modifier: add the named substitution (earlier entries are kept, not replaced)
+                if part not in active_subs:
+                    active_subs.append(part)
+        
+        return active_subs
diff --git a/tests/unit/rules/test_entity_reference_subs.py b/tests/unit/rules/test_entity_reference_subs.py