diff --git a/build/jupyterize/QUICKSTART.md b/build/jupyterize/QUICKSTART.md new file mode 100644 index 000000000..9ea119883 --- /dev/null +++ b/build/jupyterize/QUICKSTART.md @@ -0,0 +1,94 @@ +# Jupyterize - Quick Start Guide + +## Installation + +```bash +pip install nbformat +``` + +## Basic Usage + +```bash +# Convert a file (creates example.ipynb) +python build/jupyterize/jupyterize.py example.py + +# Specify output location +python build/jupyterize/jupyterize.py example.py -o notebooks/example.ipynb + +# Enable verbose logging +python build/jupyterize/jupyterize.py example.py -v +``` + +## What It Does + +Converts code example files → Jupyter notebooks (`.ipynb`) + +**Automatic:** +- ✅ Detects language from file extension +- ✅ Selects appropriate Jupyter kernel +- ✅ Excludes `EXAMPLE:` and `BINDER_ID` markers +- ✅ Includes code in `HIDE_START`/`HIDE_END` blocks +- ✅ Excludes code in `REMOVE_START`/`REMOVE_END` blocks +- ✅ Creates separate cells for each `STEP_START`/`STEP_END` block + +## Supported Languages + +| Extension | Language | Kernel | +|-----------|------------|--------------| +| `.py` | Python | python3 | +| `.js` | JavaScript | javascript | +| `.go` | Go | gophernotes | +| `.cs` | C# | csharp | +| `.java` | Java | java | +| `.php` | PHP | php | +| `.rs` | Rust | rust | + +## Input File Format + +```python +# EXAMPLE: example_id +# BINDER_ID optional-binder-id +import redis + +# STEP_START connect +r = redis.Redis() +# STEP_END + +# STEP_START set_get +r.set('foo', 'bar') +r.get('foo') +# STEP_END +``` + +## Output Structure + +Creates a Jupyter notebook with: +- **Preamble cell** - Code before first `STEP_START` +- **Step cells** - Each `STEP_START`/`STEP_END` block +- **Kernel metadata** - Automatically set based on language +- **Step metadata** - Step names stored in cell metadata + +## Common Issues + +**"Unsupported file extension"** +→ Use a supported extension (.py, .js, .go, .cs, .java, .php, .rs) + +**"File must start with EXAMPLE: 
marker"** +→ Add `# EXAMPLE: ` (or `//` for JS/Go/etc.) as first line + +**"Input file not found"** +→ Check file path is correct + +## Testing + +```bash +# Run automated tests +python build/jupyterize/test_jupyterize.py +``` + +## More Information + +- **User Guide**: `build/jupyterize/README.md` +- **Technical Spec**: `build/jupyterize/SPECIFICATION.md` +- **Implementation**: `build/jupyterize/IMPLEMENTATION.md` + diff --git a/build/jupyterize/README.md b/build/jupyterize/README.md new file mode 100644 index 000000000..8c1dfcd88 --- /dev/null +++ b/build/jupyterize/README.md @@ -0,0 +1,338 @@ +# Jupyterize - Code Example to Jupyter Notebook Converter + +## Overview + +`jupyterize` is a command-line tool that converts code example files into Jupyter notebook (`.ipynb`) files. It processes source code files that use special comment markers to delimit logical steps, converting each step into a separate cell in the generated notebook. + +This tool is designed to work with the Redis documentation code example format (documented in `build/tcedocs/`) but can be extended to support other formats. 
+ +**Key Features:** +- **Automatic language detection**: Detects programming language and Jupyter kernel from file extension +- **Smart marker processing**: Automatically handles HIDE, REMOVE, and metadata markers with sensible defaults +- **Multi-language support**: Works with any programming language supported by Jupyter kernels +- **Simple interface**: Minimal configuration required - just point it at a file + +## Purpose + +The tool enables: +- **Interactive documentation**: Convert static code examples into executable Jupyter notebooks +- **Multi-language support**: Generate notebooks for any programming language supported by Jupyter kernels +- **Step-by-step execution**: Each `STEP_START`/`STEP_END` block becomes a separate notebook cell +- **Automated workflow**: Batch convert multiple examples for documentation or educational purposes + +## Installation + +### Requirements + +- Python 3.7 or higher +- Required Python packages (install via pip): + ```bash + pip install nbformat + ``` + +### Optional Dependencies + +For enhanced functionality: +- `jupyter` - To run and test generated notebooks locally +- `jupyterlab` - For a modern notebook interface + +## Usage + +### Basic Command-Line Syntax + +```bash +python jupyterize.py [options] +``` + +### Options + +- `-o, --output ` - Output notebook file path (default: same name as input with `.ipynb` extension) +- `-v, --verbose` - Enable verbose logging +- `-h, --help` - Show help message + +### Automatic Behavior + +The tool automatically handles the following without requiring configuration: + +- **Language and kernel detection**: Determined from file extension (`.py` → Python/python3, `.js` → JavaScript/javascript, etc.) 
+- **Metadata markers**: `EXAMPLE:` and `BINDER_ID` markers are always excluded from notebook output +- **Hidden blocks**: Code within `HIDE_START`/`HIDE_END` markers is always included in notebooks (these are only hidden in web documentation) +- **Removed blocks**: Code within `REMOVE_START`/`REMOVE_END` markers is always excluded from notebooks (test boilerplate) + +### Examples + +**Convert a Python example:** +```bash +python jupyterize.py local_examples/client-specific/redis-py/landing.py +# Output: local_examples/client-specific/redis-py/landing.ipynb +# Language and kernel auto-detected from .py extension +``` + +**Specify output location:** +```bash +python jupyterize.py local_examples/client-specific/redis-py/landing.py -o notebooks/landing.ipynb +``` + +**Convert a JavaScript example:** +```bash +python jupyterize.py examples/example.js +# Output: examples/example.ipynb +# Language and kernel auto-detected from .js extension +``` + +**Batch convert all Python examples:** +```bash +find local_examples -name "*.py" -exec python jupyterize.py {} \; +``` + +**Verbose mode for debugging:** +```bash +python jupyterize.py example.py -v +# Shows detected language, kernel, parsed markers, and processing steps +``` + +## Input File Format + +The tool processes files that follow the Redis documentation code example format. See `build/tcedocs/README.md` for complete documentation. + +### Required Markers + +**Example ID** (required, must be first line): +```python +# EXAMPLE: example_id +``` + +### Step Markers + +**Step blocks** (optional, creates separate cells): +```python +# STEP_START step_name +# ... code for this step ... 
+# STEP_END +``` + +- Each `STEP_START`/`STEP_END` block becomes a separate notebook cell +- Code outside step blocks is placed in a single cell at the beginning +- Step names are used as cell metadata (can be displayed in notebook UI) + +### Optional Markers + +**BinderHub ID** (optional, line 2): +```python +# BINDER_ID commit_hash_or_branch_name +``` + +**Hidden code blocks** (optional): +```python +# HIDE_START +# ... code hidden by default in docs ... +# HIDE_END +``` +- These blocks are **included** in notebooks (only hidden in web documentation) +- Useful for setup code that users should run but doesn't need emphasis in docs + +**Removed code blocks** (optional): +```python +# REMOVE_START +# ... test framework code, imports, etc. ... +# REMOVE_END +``` +- Always **excluded** from notebooks (test boilerplate that shouldn't be in user-facing examples) + +### Example Input File + +```python +# EXAMPLE: landing +# BINDER_ID python-landing +import redis + +# STEP_START connect +r = redis.Redis(host='localhost', port=6379, decode_responses=True) +# STEP_END + +# STEP_START set_get_string +r.set('foo', 'bar') +# True +r.get('foo') +# bar +# STEP_END + +# STEP_START close +r.close() +# STEP_END +``` + +### Generated Notebook Structure + +The above example generates a notebook with 4 cells: + +1. **Cell 1** (code): `import redis` +2. **Cell 2** (code, metadata: `step=connect`): `r = redis.Redis(...)` +3. **Cell 3** (code, metadata: `step=set_get_string`): `r.set('foo', 'bar')` and `r.get('foo')` +4. **Cell 4** (code, metadata: `step=close`): `r.close()` + +**Note**: The `EXAMPLE:` and `BINDER_ID` marker lines are automatically excluded from the notebook output. + +## Language Support + +The tool supports any programming language that has a Jupyter kernel. The language is auto-detected from the file extension. 
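As a rough illustration of extension-based detection, the sketch below looks up a file's language, kernel, and comment prefix from its extension. It is not the tool's actual code: `LANGUAGE_TABLE`, `SourceInfo`, and `describe_source` are hypothetical names, and the three rows are copied from the Supported Languages and Kernels table in this section.

```python
import os
from typing import NamedTuple


class SourceInfo(NamedTuple):
    language: str
    kernel: str
    comment_prefix: str


# Hypothetical lookup table; rows mirror the README's language table.
LANGUAGE_TABLE = {
    ".py": SourceInfo("Python", "python3", "#"),
    ".js": SourceInfo("JavaScript", "javascript", "//"),
    ".rb": SourceInfo("Ruby", "ruby", "#"),
}


def describe_source(path):
    """Return language, kernel, and comment prefix for a source file path."""
    _, ext = os.path.splitext(path)
    try:
        # Lowercase the extension so "EXAMPLE.PY" is treated like "example.py".
        return LANGUAGE_TABLE[ext.lower()]
    except KeyError:
        supported = ", ".join(sorted(LANGUAGE_TABLE))
        raise ValueError(
            f"Unsupported file extension: {ext!r} (supported: {supported})"
        ) from None
```

Keeping all three values in one row means a new language is added in a single place, at the cost of diverging from the separate mappings described under "Adding New Languages".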
+ +### Supported Languages and Kernels + +| Language | File Extension | Default Kernel | Comment Prefix | +|------------|----------------|----------------|----------------| +| Python | `.py` | `python3` | `#` | +| JavaScript | `.js` | `javascript` | `//` | +| TypeScript | `.ts` | `typescript` | `//` | +| Java | `.java` | `java` | `//` | +| Go | `.go` | `gophernotes` | `//` | +| C# | `.cs` | `csharp` | `//` | +| PHP | `.php` | `php` | `//` | +| Ruby | `.rb` | `ruby` | `#` | +| Rust | `.rs` | `rust` | `//` | + +### Adding New Languages + +To add support for a new language: + +1. **Update language mappings** in `jupyterize.py`: + ```python + LANGUAGE_MAP = { + '.ext': 'language_name', + # ... + } + ``` + +2. **Update kernel mappings**: + ```python + KERNEL_MAP = { + 'language_name': 'kernel_name', + # ... + } + ``` + +3. **Update comment prefix mappings**: + ```python + COMMENT_PREFIX = { + 'language_name': '//', + # ... + } + ``` + +4. **Install the Jupyter kernel** (if not already installed): + ```bash + # Example for Go + go install github.com/gopherdata/gophernotes@latest + ``` + +## Output Format + +The tool generates standard Jupyter Notebook files (`.ipynb`) in JSON format, compatible with: +- Jupyter Notebook +- JupyterLab +- VS Code +- Google Colab +- BinderHub +- Any other Jupyter-compatible environment + +### Notebook Metadata + +Generated notebooks include: +- **Kernel specification**: Language and kernel name +- **Language info**: Programming language metadata +- **Cell metadata**: Step names (if using STEP_START/STEP_END) +- **Custom metadata**: Example ID, source file path + +## Advanced Usage + +### Integration with Build Pipeline + +The tool can be integrated into the documentation build process: + +```bash +# In build/make.py or a separate script +python build/jupyterize/jupyterize.py local_examples/**/*.py -o notebooks/ +``` + +### Custom Processing + +For custom processing logic, import the tool as a module: + +```python +from jupyterize import 
JupyterizeConverter + +converter = JupyterizeConverter(input_file='example.py') +notebook = converter.convert() +converter.save(notebook, 'output.ipynb') +``` + +The converter automatically detects language and kernel from the file extension and applies the standard processing rules for markers. + +## Troubleshooting + +### Common Issues + +**Issue**: "Kernel not found" error +- **Solution**: Install the required Jupyter kernel for your language +- **Check available kernels**: `jupyter kernelspec list` + +**Issue**: Comment markers not detected +- **Solution**: Ensure comment prefix matches the language (e.g., `#` for Python, `//` for JavaScript) +- **Check**: First line must be `# EXAMPLE: id` or `// EXAMPLE: id` + +**Issue**: Empty notebook generated +- **Solution**: Verify that the input file contains code outside of REMOVE_START/REMOVE_END blocks +- **Note**: REMOVE blocks are always excluded, HIDE blocks are always included + +**Issue**: Steps not creating separate cells +- **Solution**: Ensure `STEP_START` and `STEP_END` markers are properly paired and use correct comment syntax + +**Issue**: Unexpected code in notebook output +- **Solution**: Remember that HIDE_START/HIDE_END blocks are included in notebooks (they're only hidden in web docs) +- **Solution**: Use REMOVE_START/REMOVE_END for code that should never appear in notebooks + +### Debug Mode + +Enable verbose logging to troubleshoot issues: + +```bash +python jupyterize.py example.py -v +``` + +This will show: +- Detected language and kernel +- Parsed markers and line ranges +- Cell creation process +- Output file location + +## Related Documentation + +- **Code Example Format**: `build/tcedocs/README.md` - User guide for writing examples +- **Technical Specification**: `build/tcedocs/SPECIFICATION.md` - System architecture and implementation details +- **Example Parser**: `build/components/example.py` - Python module that parses example files + +## Future Enhancements + +Potential improvements for 
future versions: + +- **Markdown cells**: Convert comments to markdown cells for documentation +- **Output formats**: Support for other formats (e.g., Google Colab, VS Code notebooks) +- **Validation**: Verify that generated notebooks are executable +- **Testing**: Automatically run notebooks to ensure examples work +- **Metadata preservation**: Include more metadata from source files (highlight ranges, etc.) +- **Template support**: Custom notebook templates for different use cases + +## Contributing + +When contributing to this tool: + +1. Follow the existing code style and structure +2. Add tests for new features +3. Update this README with new options or features +4. Ensure compatibility with the existing code example format +5. Test with multiple programming languages + +## License + +This tool is part of the Redis documentation project and follows the same license as the parent repository. + diff --git a/build/jupyterize/SPECIFICATION.md b/build/jupyterize/SPECIFICATION.md new file mode 100644 index 000000000..8560241cb --- /dev/null +++ b/build/jupyterize/SPECIFICATION.md @@ -0,0 +1,1805 @@ +# Jupyterize - Technical Specification + +> **For End Users**: See `build/jupyterize/README.md` for usage documentation. + +## Document Purpose + +This specification provides implementation details for developers building the `jupyterize.py` script. It focuses on the essential technical information needed to convert code example files into Jupyter notebooks. + +**Related Documentation:** +- User guide: `build/jupyterize/README.md` +- Code example format: `build/tcedocs/README.md` and `build/tcedocs/SPECIFICATION.md` +- Existing parser: `build/components/example.py` + +## Quickstart for Implementers (TL;DR) + +- Goal: Convert a marked example file into a clean Jupyter notebook. +- Inputs: Source file with markers (EXAMPLE, STEP_START/END, HIDE/REMOVE), file extension for language. +- Output: nbformat v4 notebook with cells per step. 
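To make the input-to-cell mapping above concrete, here is a minimal sketch of a marker parser. It is an illustration under simplifying assumptions, not the `jupyterize.py` implementation: the name `split_into_cells`, the `(step_name, code)` return shape, and the hard-coded marker strings are all hypothetical, and unwrapping, dedenting, and boilerplate are left out.

```python
def split_into_cells(source, prefix="#"):
    """Split marked-up source text into (step_name, code) pairs."""

    def has(line, marker):
        # Accept both "# MARKER" and "#MARKER".
        return f"{prefix} {marker}" in line or f"{prefix}{marker}" in line

    cells, current, step, in_remove = [], [], None, False

    def flush():
        # Emit the accumulated lines as one cell, skipping empty cells.
        code = "\n".join(current).strip("\n")
        if code.strip():
            cells.append((step, code))
        current.clear()

    for line in source.splitlines():
        if has(line, "REMOVE_START"):
            in_remove = True
        elif has(line, "REMOVE_END"):
            in_remove = False
        elif in_remove or has(line, "EXAMPLE:") or has(line, "BINDER_ID"):
            continue  # excluded code and metadata lines
        elif has(line, "HIDE_START") or has(line, "HIDE_END"):
            continue  # drop the markers, keep the code between them
        elif has(line, "STEP_START"):
            flush()  # save any preamble before the step begins
            step = line.split("STEP_START", 1)[1].strip() or None
        elif has(line, "STEP_END"):
            flush()
            step = None
        else:
            current.append(line)
    flush()  # trailing code after the last STEP_END
    return cells
```

Each returned pair would then become one code cell, with the step name (or `None` for preamble code) stored in the cell metadata.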
+ +Steps: +1) Parse file line-by-line into blocks (preamble + steps) using marker rules +2) Detect language from extension and load `build/jupyterize/jupyterize_config.json` +3) If boilerplate is configured for the language, prepend a boilerplate cell +4) For each block: unwrap using `unwrap_patterns` → dedent → rstrip; skip empty cells +5) Assemble notebook (kernelspec/metadata) and write to `.ipynb` + +Pitfalls to avoid: +- Always `.lower()` language keys for config and kernels +- Handle both `#EXAMPLE:` and `# EXAMPLE:` formats +- Save preamble before the first step and any trailing preamble at end +- Apply unwrap patterns in listed order; for Java, remove `@Test` before method wrappers +- Dedent after unwrapping when any unwrap patterns exist for the language + +Add a new language (5 steps): +1) Copy the C# pattern set as a starting point +2) Examine 3–4 real repo files for that language (don’t guess pattern count) +3) Add language-specific patterns (e.g., Java `@Test`, `static main()`) +4) Write one synthetic test and one real-file test per client library variant +5) Iterate on patterns until real files produce clean notebooks + +--- + +## Marker Legend (1-minute reference) + +- EXAMPLE: - Skip this line; defines the example id (must be first line) +- BINDER_ID - Skip this line; not included in the notebook +- STEP_START / STEP_END - Use as cell boundaries; markers themselves are excluded +- HIDE_START / HIDE_END - Include the code inside; markers excluded (unlike web docs, code is visible) +- REMOVE_START / REMOVE_END - Exclude the code inside; markers excluded + +--- + +## Table of Contents + +1. [Critical Implementation Notes](#critical-implementation-notes) +2. [Code Quality Patterns](#code-quality-patterns) +3. [System Overview](#system-overview) +4. [Core Mappings](#core-mappings) +5. [Implementation Approach](#implementation-approach) +6. [Marker Processing Rules](#marker-processing-rules) +7. 
[Language-Specific Features](#language-specific-features) +8. [Notebook Generation](#notebook-generation) +9. [Error Handling](#error-handling) +10. [Testing](#testing) + +--- + +## Critical Implementation Notes + +> **⚠️ Read This First!** These are the most common pitfalls discovered during implementation. + +### 1. Always Use `.lower()` for Dictionary Lookups + +**Problem**: The `PREFIXES` and `KERNEL_SPECS` dictionaries use **lowercase** keys (`'python'`, `'node.js'`), but `EXTENSION_TO_LANGUAGE` returns mixed-case values (`'Python'`, `'Node.js'`). + +**Solution**: Always use `.lower()` when accessing these dictionaries: + +```python +# ❌ WRONG - Will cause KeyError +prefix = PREFIXES[language] # KeyError if language = 'Python' + +# ✅ CORRECT +prefix = PREFIXES[language.lower()] +``` + +This applies to: +- `PREFIXES[language.lower()]` in parsing +- `KERNEL_SPECS[language.lower()]` in notebook creation + +### 2. Check Both Marker Formats (Use Helper Function!) + +**Problem**: Markers can appear with or without a space after the comment prefix. + +**Examples**: +- `# EXAMPLE: test` (with space) +- `#EXAMPLE: test` (without space) + +**Solution**: Create a helper function to avoid repetition: + +```python +def _check_marker(line, prefix, marker): + """ + Check if a line contains a marker (with or without space after prefix). + + Args: + line: Line to check + prefix: Comment prefix (e.g., '#', '//') + marker: Marker to look for (e.g., 'EXAMPLE:', 'STEP_START') + + Returns: + bool: True if marker is found + """ + return f'{prefix} {marker}' in line or f'{prefix}{marker}' in line + +# ✅ CORRECT - Use helper throughout +if _check_marker(line, prefix, EXAMPLE): + # Handle EXAMPLE marker +``` + +**Why a helper function?** +- You'll check markers ~8 times in the parsing function +- DRY principle - don't repeat yourself +- Easier to maintain - one place to update if logic changes +- More readable - clear intent + +### 3. 
Import from Existing Modules + +**Problem**: Redefining constants that already exist in the build system. + +**Solution**: Import from existing modules: + +```python +# ✅ Import these - don't redefine! +from local_examples import EXTENSION_TO_LANGUAGE +from components.example import PREFIXES +from components.example import HIDE_START, HIDE_END, REMOVE_START, REMOVE_END, STEP_START, STEP_END, EXAMPLE, BINDER_ID +``` + +### 4. Handle Empty Directory Name + +**Problem**: `os.path.dirname()` returns an empty string when the path has no directory component (e.g., `example.ipynb`). + +**Solution**: Check that the directory name is non-empty before creating it: + +```python +# ❌ WRONG - os.makedirs('') raises FileNotFoundError +output_dir = os.path.dirname(output_path) +os.makedirs(output_dir, exist_ok=True) + +# ✅ CORRECT - exist_ok=True already covers directories that exist +output_dir = os.path.dirname(output_path) +if output_dir: + os.makedirs(output_dir, exist_ok=True) +``` + +### 5. Save Preamble Before Starting Step + +**Problem**: When entering a STEP, accumulated preamble code is lost unless it is saved first. + +**Solution**: Save the preamble to the cells list before starting a new step, using the `_check_marker` helper so both marker formats match: + +```python +if _check_marker(line, prefix, STEP_START): + # ✅ Save preamble first! + if preamble_lines: + cells.append({'code': ''.join(preamble_lines), 'step_name': None}) + preamble_lines = [] + + in_step = True + # ... rest of step handling +``` + +### 6. Don't Forget Remaining Preamble + +**Problem**: Code after the last STEP_END gets lost. + +**Solution**: Save remaining preamble at end of parsing: + +```python +# After the main loop +if preamble_lines: + cells.append({'code': ''.join(preamble_lines), 'step_name': None}) +``` + +### 7. Track Duplicate Step Names + +**Problem**: Users may accidentally reuse step names (copy-paste errors). 
+ +**Solution**: Track seen step names and warn on duplicates: + +```python +seen_step_names = set() + +# When processing STEP_START: +if step_name and step_name in seen_step_names: + logging.warning(f"Duplicate step name '{step_name}' (previously defined)") +elif step_name: + seen_step_names.add(step_name) +``` + +**Why warn instead of error?** +- Jupyter notebooks can have duplicate cell metadata +- Non-breaking - helps users but doesn't stop processing +- Useful for debugging example files + +### 8. Handle Language-Specific Boilerplate and Wrappers + +**Problem**: Different languages have different requirements for Jupyter notebooks: +- **C#**: Needs `#r "nuget: PackageName, Version"` directives for dependencies +- **Test wrappers**: Source files have class/method wrappers needed for testing but not for notebooks + +**Solution**: Two-part approach: + +**Part 1: Boilerplate Injection** +- Define language-specific boilerplate in configuration +- Insert as first cell (before preamble) +- Example: C# needs `#r "nuget: NRedisStack, 1.1.1"` + +**Part 2: Structural Unwrapping** +- Detect and remove language-specific structural wrappers +- C#: Remove `public class ClassName { ... }` and `public void Run() { ... }` +- Keep only the actual example code inside + +**Why this matters**: +- Without boilerplate: Notebooks won't run (missing dependencies) +- Without unwrapping: Notebooks have unnecessary test framework code +- These aren't marked with REMOVE blocks because they're needed for tests + +**See**: [Language-Specific Features](#language-specific-features) section for detailed implementation. + +### 9. Unwrapping Patterns: Single‑line vs Multi‑line, and Dedenting (Based on Implementation Experience) + +During implementation, several non‑obvious details significantly reduced bugs and rework: + +- Pattern classes and semantics + - Single‑line patterns: When `start_pattern == end_pattern`, treat as “remove this line only”. 
Examples: `public class X {` or `public void Run() {` on one line. + - Multi‑line patterns: When `start_pattern != end_pattern`, remove the start line, everything until the end line, and the end line itself. Use this to strip a wrapper’s braces while preserving the inner code with a separate “keep content” strategy. + - Use anchored patterns with `^` to avoid over‑matching. Prefer `re.match` (anchored at the start) over `re.search`. + +- Wrappers split across cells + - Real C# files often split wrappers across lines/blocks (e.g., class name on line N, `{` or `}` in later lines). Because parsing splits code into preamble/step cells, wrapper open/close tokens may land in separate cells. + - Practical approach: Use separate, simple patterns to remove opener lines (class/method declarations with `{` either on the same line or next line) and a generic pattern to remove solitary closing braces in any cell. + +- Order of operations inside cell creation + 1) Apply unwrapping patterns (in the order listed in configuration) + 2) Dedent code (e.g., `textwrap.dedent`) so content previously nested inside wrappers aligns to column 0 + 3) Strip trailing whitespace (e.g., `rstrip()`) + 4) Skip empty cells + +- Dedent all cells when unwrapping is enabled + - Even if a particular cell didn’t change after unwrapping, its content may still be indented due to having originated inside a method/class in the source file. Dedent ALL cells whenever `unwrap_patterns` are configured for the language. + +- Logging for traceability + - Emit `DEBUG` logs per applied pattern (e.g., pattern `type`) to simplify diagnosing regex issues. + +- Safety tips for patterns + - Anchor with `^` and keep them specific; avoid overly greedy constructs. + - Keep patterns minimal and composable (e.g., separate `class_opening`, `method_opening`, `closing_braces`). + - Validate patterns at startup or wrap application with try/except to warn and continue on malformed regex. + +### 10. 
Closing Brace Removal Must Be Match-Based, Not Pattern-Based (Critical Bug Fix) + +**Problem**: The initial implementation removed closing braces based on the number of unwrap patterns configured, not the number of patterns that actually matched. This caused a critical bug where closing braces from control structures (for loops, foreach loops, if statements) were incorrectly removed. + +**Example of the bug**: +```csharp +// Original code in a cell +for (var i = 0; i < resultsList.Count; i++) +{ + Console.WriteLine(i); +} + +// BUG: Closing brace was removed, resulting in: +for (var i = 0; i < resultsList.Count; i++) +{ + Console.WriteLine(i); +// Missing } +``` + +**Root cause**: The unwrapping logic counted braces to remove based on pattern configuration (e.g., "C# has 4 patterns with braces, so remove 4 closing braces from every cell"), rather than counting how many patterns actually matched in each specific cell. + +**Solution**: Modified `remove_matching_lines()` to return a tuple `(modified_code, match_count)` and updated `unwrap_code()` to only remove closing braces when patterns actually match: + +```python +# Before (WRONG): +for pattern_config in unwrap_patterns: + code = remove_matching_lines(code, pattern, end_pattern) + if '{' in pattern: + braces_removed += 1 # Always increments! + +# After (CORRECT): +for pattern_config in unwrap_patterns: + code, match_count = remove_matching_lines(code, pattern, end_pattern) + if match_count > 0 and '{' in pattern: + braces_removed += match_count # Only increments if pattern matched +``` + +**Implementation details**: +1. `remove_matching_lines()` now returns `(code, match_count)` instead of just `code` +2. `unwrap_code()` tracks `braces_removed` based on actual matches, not pattern configuration +3. `remove_trailing_braces()` scans from the end and removes only the exact number of trailing closing braces +4. 
The `closing_braces` pattern was removed from configuration files (C# and Java) since it's now handled programmatically + +**Time saved by documenting this**: ~2 hours of debugging similar issues in the future. + +**Follow-up fix**: After implementing match-based brace removal, a second issue was discovered: cells containing **only** orphaned closing braces (from removed class/method wrappers) were still being included in the notebook. These cells appeared when the closing braces were after a REMOVE block, causing them to be parsed as a separate preamble cell. + +**Solution**: Added a filter in `create_cells()` to skip cells that contain only closing braces and whitespace: + +```python +# Skip cells that contain only closing braces and whitespace +# (orphaned closing braces from removed class/method wrappers) +if lang_config.get('unwrap_patterns'): + # Remove all whitespace and check if only closing braces remain + code_no_whitespace = re.sub(r'\s', '', code) + if code_no_whitespace and re.match(r'^}+$', code_no_whitespace): + logging.debug(f"Skipping cell {i} (contains only closing braces)") + continue +``` + +This ensures that orphaned closing brace cells are completely removed from the final notebook. + +### 11. Pattern Count Differences Between Languages (Java Implementation Insight) + +**Key Discovery**: When adding Java support after C#, the pattern count increased from 5 to 8 patterns. + +**Why the difference?** + +| Language | Patterns | Unique Requirements | +|----------|----------|---------------------| +| **C#** | 5 | `class_single_line`, `class_opening`, `method_single_line`, `method_opening`, `closing_braces` | +| **Java** | 8 | All C# patterns PLUS `test_annotation`, `static_main_single_line`, `static_main_opening` | + +**Java-specific additions**: +1. **`test_annotation`** - Java uses `@Test` annotations on separate lines before methods (C# uses `[Test]` attributes which are less common in our examples) +2. 
**`static_main_single_line`** - Java examples often use `public static void main(String[] args)` instead of instance methods +3. **`static_main_opening`** - Multi-line version of static main + +**Critical insight**: Don't assume pattern counts will be identical across languages, even for similar class-based languages. + +**Pattern order matters more in Java**: +- `test_annotation` MUST come before `method_opening` (otherwise the annotation line might not be removed) +- Specific patterns (single-line) before generic patterns (multi-line) +- Openers before closers + +**Implementation tip**: When adding a new language: +1. Start with the C# patterns as a template +2. Examine 3-4 real example files from the repository +3. Look for language-specific constructs (annotations, modifiers, method signatures) +4. Add patterns incrementally and test after each addition +5. Document the pattern order rationale in the configuration + +**Time saved**: This insight would have saved ~15 minutes of debugging why `@Test` annotations weren't being removed (they were being processed after method patterns, which was too late). + + +--- + +## Code Quality Patterns + +> **💡 Best Practices** These patterns improve code maintainability and readability. + +### Pattern 1: Extract Repeated Conditionals into Helper Functions + +**When you see**: The same conditional pattern repeated multiple times + +**Example**: Checking for markers appears ~8 times in parsing: +```python +if f'{prefix} {EXAMPLE}' in line or f'{prefix}{EXAMPLE}' in line: +if f'{prefix} {BINDER_ID}' in line or f'{prefix}{BINDER_ID}' in line: +if f'{prefix} {REMOVE_START}' in line or f'{prefix}{REMOVE_START}' in line: +# ... 
5 more times +``` + +**Refactor to**: Helper function +```python +def _check_marker(line, prefix, marker): + return f'{prefix} {marker}' in line or f'{prefix}{marker}' in line + +# Usage: +if _check_marker(line, prefix, EXAMPLE): +if _check_marker(line, prefix, BINDER_ID): +if _check_marker(line, prefix, REMOVE_START): +``` + +**Benefits**: +- Reduces code by ~15 lines +- Single source of truth +- Easier to test +- More readable + +### Pattern 2: Use Sets for Membership Tracking + +**When you see**: Need to track if something has been seen before + +**Example**: Tracking duplicate step names + +**Use**: Set for O(1) lookup +```python +seen_step_names = set() + +if step_name in seen_step_names: # O(1) lookup + # Handle duplicate +else: + seen_step_names.add(step_name) +``` + +**Don't use**: List (O(n) lookup) +```python +# ❌ WRONG - O(n) lookup +seen_step_names = [] +if step_name in seen_step_names: # Slow for large lists +``` + +### Pattern 3: Warn for Non-Critical Issues + +**When you see**: Issues that are problems but shouldn't stop processing + +**Examples**: +- Duplicate step names +- Nested markers +- Unpaired markers + +**Use**: `logging.warning()` instead of raising exceptions +```python +if step_name in seen_step_names: + logging.warning(f"Duplicate step name '{step_name}'") + # Continue processing + +if in_remove: + logging.warning("Nested REMOVE_START detected") + # Continue processing +``` + +**Benefits**: +- More user-friendly +- Helps debug without breaking workflow +- Allows batch processing to continue + +### Pattern 4: Validate Early, Process Later + +**Structure**: +1. Validate all inputs first +2. Then process (assuming valid inputs) + +**Example**: +```python +def jupyterize(input_file, output_file=None, verbose=False): + # 1. Validate first + language = detect_language(input_file) + validate_input(input_file, language) + + # 2. 
Process (inputs are valid) + parsed_blocks = parse_file(input_file, language) + cells = create_cells(parsed_blocks) + notebook = create_notebook(cells, language) + write_notebook(notebook, output_file) +``` + +**Benefits**: +- Fail fast on invalid inputs +- Cleaner error messages +- Easier to test validation separately + +--- + +## System Overview + +### Purpose + +Convert code example files (with special comment markers) into Jupyter notebook (`.ipynb`) files. + +**Process Flow:** +``` +Input File → Detect Language → Parse Markers → Generate Cells → Write Notebook +``` + +### Key Principles + +1. **Simple parsing**: Read file line-by-line, detect markers with regex +2. **Automatic behavior**: Language/kernel from extension, fixed marker handling +3. **Standard output**: Use `nbformat` library for spec-compliant notebooks + +### Dependencies + +```bash +pip install nbformat +``` + +--- + +## Core Mappings + +> **📖 Source of Truth**: Import these from existing modules - don't redefine! + +### File Extension → Language + +**Import from**: `build/local_examples.py` → `EXTENSION_TO_LANGUAGE` + +Supported: `.py`, `.js`, `.go`, `.cs`, `.java`, `.php`, `.rs` + +### Language → Comment Prefix + +**Import from**: `build/components/example.py` → `PREFIXES` + +**⚠️ Critical**: Keys are lowercase (`'python'`, `'node.js'`), so use `language.lower()` when accessing. + +### Language → Jupyter Kernel + +**Define locally** (not in existing modules): + +```python +KERNEL_SPECS = { + 'python': {'name': 'python3', 'display_name': 'Python 3'}, + 'node.js': {'name': 'javascript', 'display_name': 'JavaScript (Node.js)'}, + 'go': {'name': 'gophernotes', 'display_name': 'Go'}, + 'c#': {'name': 'csharp', 'display_name': 'C#'}, + 'java': {'name': 'java', 'display_name': 'Java'}, + 'php': {'name': 'php', 'display_name': 'PHP'}, + 'rust': {'name': 'rust', 'display_name': 'Rust'} +} +``` + +**⚠️ Critical**: Also use `language.lower()` when accessing this dict. 
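One way to make the lowercase rule hard to forget is to route every lookup through a single helper. The sketch below assumes a hypothetical function `kernelspec_for` (not part of `jupyterize.py`) and copies three entries from the `KERNEL_SPECS` definition above; the error message is also an assumption.

```python
# Three entries copied from KERNEL_SPECS above; keys are lowercase.
KERNEL_SPECS = {
    "python": {"name": "python3", "display_name": "Python 3"},
    "node.js": {"name": "javascript", "display_name": "JavaScript (Node.js)"},
    "c#": {"name": "csharp", "display_name": "C#"},
}


def kernelspec_for(language):
    """Return a copy of the kernelspec for a language, matching case-insensitively."""
    key = language.lower()  # 'Python' and 'Node.js' become 'python' and 'node.js'
    try:
        # Copy so callers can modify the result without touching the shared table.
        return dict(KERNEL_SPECS[key])
    except KeyError:
        supported = ", ".join(sorted(KERNEL_SPECS))
        raise ValueError(
            f"No Jupyter kernel configured for {language!r} (supported: {supported})"
        ) from None
```

Raising `ValueError` with the supported names, rather than letting a bare `KeyError` escape, keeps the failure readable when a new extension is mapped to a language before its kernel is added.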
+ +### Marker Constants + +**Import from**: `build/components/example.py` + +```python +from components.example import ( + HIDE_START, HIDE_END, + REMOVE_START, REMOVE_END, + STEP_START, STEP_END, + EXAMPLE, BINDER_ID +) +``` + +**📖 For marker semantics**, see `build/tcedocs/SPECIFICATION.md` section "Special Comment Reference". + +--- + +## Implementation Approach + +### Recommended Strategy + +**Don't use the Example class** - it modifies files in-place for web documentation. Instead, implement a simple line-by-line parser. + +### Module Imports + +**Critical**: Import existing mappings from the build system: + +```python +#!/usr/bin/env python3 +import argparse +import logging +import os +import sys +import nbformat +from nbformat.v4 import new_notebook, new_code_cell + +# Add parent directory to path to import from build/ +sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) + +# Import existing mappings - DO NOT redefine these! +from local_examples import EXTENSION_TO_LANGUAGE +from components.example import PREFIXES + +# Import marker constants from example.py +from components.example import ( + HIDE_START, HIDE_END, + REMOVE_START, REMOVE_END, + STEP_START, STEP_END, + EXAMPLE, BINDER_ID +) +``` + +**Important**: The PREFIXES dict uses lowercase keys (e.g., `'python'`, `'node.js'`), so you must use `language.lower()` when accessing it. + +### Basic Structure + +```python +def main(): + # 1. Parse command-line arguments + # 2. Detect language from file extension + # 3. Validate input file + # 4. Parse file and extract cells + # 5. Create cells with nbformat + # 6. Create notebook with metadata + # 7. 
Write to output file + pass +``` + +### Language Detection + +```python +def detect_language(file_path): + """Detect language from file extension.""" + _, ext = os.path.splitext(file_path) + language = EXTENSION_TO_LANGUAGE.get(ext.lower()) + if not language: + supported = ', '.join(sorted(EXTENSION_TO_LANGUAGE.keys())) + raise ValueError( + f"Unsupported file extension: {ext}\n" + f"Supported extensions: {supported}" + ) + return language +``` + +--- + +## Marker Processing Rules + +> **📖 For complete marker documentation**, see `build/tcedocs/SPECIFICATION.md` section "Special Comment Reference" (lines 2089-2107). + +### Quick Reference: What to Include/Exclude + +| Marker | Action | Notebook Behavior | +|--------|--------|-------------------| +| `EXAMPLE:` line | Skip | Not included | +| `BINDER_ID` line | Skip | Not included | +| `HIDE_START`/`HIDE_END` markers | Skip markers, **include** code between them | Code visible in notebook | +| `REMOVE_START`/`REMOVE_END` markers | Skip markers, **exclude** code between them | Code not in notebook | +| `STEP_START`/`STEP_END` markers | Skip markers, use as cell boundaries | Each step = separate cell | +| Code outside any step | Include in first cell (preamble) | First cell (no step metadata) | + +**Key Difference from Web Display**: +- Web docs: HIDE blocks are hidden by default (revealed with eye button) +- Notebooks: HIDE blocks are fully visible (notebooks don't have hide/reveal UI) + +### Parsing Algorithm + +**Key Implementation Details:** + +1. **Use `language.lower()`** when accessing PREFIXES dict (keys are lowercase) +2. **Check both formats**: `f'{prefix} {MARKER}'` and `f'{prefix}{MARKER}'` (with/without space) +3. **Extract step name**: Use `line.split(STEP_START)[1].strip()` to get the step name after the marker +4. **Handle state carefully**: Track `in_remove`, `in_step` flags to know what to include/exclude +5. 
**Save cells at transitions**: When entering a STEP, save any accumulated preamble first + +```python +def parse_file(file_path, language): + """ + Parse file and extract cells. + + Returns: list of {'code': str, 'step_name': str or None} + """ + with open(file_path, 'r', encoding='utf-8') as f: + lines = f.readlines() + + # IMPORTANT: Use .lower() because PREFIXES keys are lowercase + prefix = PREFIXES[language.lower()] + + # State tracking + in_remove = False + in_step = False + step_name = None + step_lines = [] + preamble_lines = [] + cells = [] + + for line_num, line in enumerate(lines, 1): + # Skip metadata markers (check both with and without space) + if f'{prefix} {EXAMPLE}' in line or f'{prefix}{EXAMPLE}' in line: + continue + if f'{prefix} {BINDER_ID}' in line or f'{prefix}{BINDER_ID}' in line: + continue + + # Handle REMOVE blocks (exclude content) + if f'{prefix} {REMOVE_START}' in line or f'{prefix}{REMOVE_START}' in line: + in_remove = True + continue + if f'{prefix} {REMOVE_END}' in line or f'{prefix}{REMOVE_END}' in line: + in_remove = False + continue + if in_remove: + continue # Skip lines inside REMOVE blocks + + # Skip HIDE markers (but include content between them) + if f'{prefix} {HIDE_START}' in line or f'{prefix}{HIDE_START}' in line: + continue + if f'{prefix} {HIDE_END}' in line or f'{prefix}{HIDE_END}' in line: + continue + + # Handle STEP blocks + if f'{prefix} {STEP_START}' in line or f'{prefix}{STEP_START}' in line: + # Save accumulated preamble before starting new step + if preamble_lines: + cells.append({'code': ''.join(preamble_lines), 'step_name': None}) + preamble_lines = [] + + in_step = True + # Extract step name from line (text after STEP_START marker) + step_name = line.split(STEP_START)[1].strip() if STEP_START in line else None + step_lines = [] + continue + + if f'{prefix} {STEP_END}' in line or f'{prefix}{STEP_END}' in line: + if step_lines: + cells.append({'code': ''.join(step_lines), 'step_name': step_name}) + in_step = 
False + step_name = None + step_lines = [] + continue + + # Collect code lines + if in_step: + step_lines.append(line) + else: + preamble_lines.append(line) + + # Save any remaining preamble at end of file + if preamble_lines: + cells.append({'code': ''.join(preamble_lines), 'step_name': None}) + + return cells +``` + +**Common Pitfalls to Avoid:** +- Forgetting to use `.lower()` when accessing PREFIXES → KeyError +- Only checking `f'{prefix} {MARKER}'` format → Missing markers without space +- Not saving preamble before starting a step → Lost code +- Not handling remaining preamble at end → Lost code + +--- + +## Language-Specific Features + +> **⚠️ New Requirement**: Notebooks need language-specific setup that source files don't have. + +### Overview + +Different languages have different requirements for Jupyter notebooks that aren't present in the source test files: + +1. **Dependency declarations**: C# needs NuGet package directives, Node.js might need npm packages +2. **Structural wrappers**: Test files have class/method wrappers that shouldn't appear in notebooks +3. **Initialization code**: Some languages need setup code that's implicit in test frameworks + +### Problem 1: Missing Dependency Declarations + +**Issue**: C# Jupyter notebooks require NuGet package directives to download dependencies: + +```csharp +#r "nuget: NRedisStack, 1.1.1" +``` + +**Current behavior**: Source files don't have these directives (they're in project files) +**Desired behavior**: Automatically inject language-specific boilerplate as first cell + +**Example - C# source file**: +```csharp +// EXAMPLE: landing +using NRedisStack; +using StackExchange.Redis; + +public class SyncLandingExample { + public void Run() { + var muxer = ConnectionMultiplexer.Connect("localhost:6379"); + // ... 
+ } +} +``` + +**Desired notebook output**: +``` +Cell 1 (boilerplate): +#r "nuget: NRedisStack, 1.1.1" +#r "nuget: StackExchange.Redis, 2.6.122" + +Cell 2 (preamble): +using NRedisStack; +using StackExchange.Redis; + +Cell 3 (code): +var muxer = ConnectionMultiplexer.Connect("localhost:6379"); +// ... +``` + +### Problem 2: Unnecessary Structural Wrappers + +**Issue**: Test files have class/method wrappers needed for test frameworks but not for notebooks. + +**Affected languages**: C# and Java (both class-based languages with similar syntax) + +**C# example**: +```csharp +public class SyncLandingExample // ← Test framework wrapper +{ + public void Run() // ← Test framework wrapper + { + // Actual example code here + var muxer = ConnectionMultiplexer.Connect("localhost:6379"); + } +} +``` + +**Java example**: +```java +public class LandingExample { // ← Test framework wrapper + + @Test + public void run() { // ← Test framework wrapper + // Actual example code here + UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379"); + } +} +``` + +**Current behavior**: These wrappers are copied to the notebook +**Desired behavior**: Remove wrappers, keep only the code inside + +**Why not use REMOVE blocks?** +- These wrappers are needed for the test framework to compile/run +- Marking them with REMOVE would break the tests +- They're structural, not boilerplate + +**Key similarities between C# and Java**: +- Both use `public class ClassName` declarations +- Both use method declarations (C#: `public void Run()`, Java: `public void run()`) +- Both use curly braces `{` `}` for blocks +- Opening brace can be on same line or next line +- Test annotations may appear before methods (Java: `@Test`, C#: `[Test]`) + +**Detailed Java example** (from `local_examples/client-specific/jedis/LandingExample.java`): + +Before unwrapping: +```java +// EXAMPLE: landing +// STEP_START import +import redis.clients.jedis.UnifiedJedis; +// STEP_END + +public class LandingExample { // ← 
Remove this + + @Test // ← Remove this + public void run() { // ← Remove this + // STEP_START connect + UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379"); + // STEP_END + + // STEP_START set_get_string + String res1 = jedis.set("bike:1", "Deimos"); + System.out.println(res1); + // STEP_END + } // ← Remove this +} // ← Remove this +``` + +After unwrapping (desired notebook output): +```java +Cell 1 (import step): +import redis.clients.jedis.UnifiedJedis; + +Cell 2 (connect step): +UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379"); + +Cell 3 (set_get_string step): +String res1 = jedis.set("bike:1", "Deimos"); +System.out.println(res1); +``` + +Note: The class declaration, `@Test` annotation, method declaration, and closing braces are all removed, leaving only the actual example code properly dedented. + +### Solution Approach + +#### Option 1: Configuration-Based (Recommended) + +**Pros**: +- No changes to source files +- Centralized configuration +- Easy to update package versions +- Works with existing examples + +**Cons**: +- Requires maintaining configuration file +- Less visible to example authors + +**Implementation**: + +1. **Create configuration file** (`jupyterize_config.json`): +```json +{ + "c#": { + "boilerplate": [ + "#r \"nuget: NRedisStack, 1.1.1\"", + "#r \"nuget: StackExchange.Redis, 2.6.122\"" + ], + "unwrap_patterns": [ + { + "type": "class", + "pattern": "^\\s*public\\s+class\\s+\\w+.*\\{", + "end_pattern": "^\\}\\s*$", + "keep_content": true + }, + { + "type": "method", + "pattern": "^\\s*public\\s+void\\s+Run\\(\\).*\\{", + "end_pattern": "^\\s*\\}\\s*$", + "keep_content": true + } + ] + }, + "node.js": { + "boilerplate": [ + "// npm install redis" + ], + "unwrap_patterns": [] + } +} +``` + +2. 
**Load configuration** in jupyterize.py: +```python +import json # not in the module imports listed earlier + +def load_language_config(language): + """Load language-specific configuration.""" + config_file = os.path.join(os.path.dirname(__file__), 'jupyterize_config.json') + if os.path.exists(config_file): + with open(config_file, encoding='utf-8') as f: + config = json.load(f) + return config.get(language.lower(), {}) + return {} +``` + +3. **Inject boilerplate** as first cell: +```python +def create_cells(parsed_blocks, language): + """Convert parsed blocks to notebook cells.""" + cells = [] + + # Get language config + lang_config = load_language_config(language) + + # Add boilerplate cell if defined and non-empty + if lang_config.get('boilerplate'): + boilerplate_code = '\n'.join(lang_config['boilerplate']) + cells.append(new_code_cell( + source=boilerplate_code, + metadata={'cell_type': 'boilerplate', 'language': language} + )) + + # Add regular cells... + for block in parsed_blocks: + pass # ... existing logic + + return cells +``` + +4. **Unwrap structural patterns**: +```python +import re # not in the module imports listed earlier + +def unwrap_code(code, language): + """Remove language-specific structural wrappers.""" + lang_config = load_language_config(language) + unwrap_patterns = lang_config.get('unwrap_patterns', []) + + for pattern_config in unwrap_patterns: + if pattern_config.get('keep_content', True): + # Remove wrapper but keep content + code = remove_wrapper_keep_content( + code, + pattern_config['pattern'], + pattern_config['end_pattern'] + ) + + return code + +def remove_wrapper_keep_content(code, start_pattern, end_pattern): + """Remove wrapper lines but keep content between them.""" + lines = code.split('\n') + result = [] + in_wrapper = False + wrapper_indent = 0 + + for line in lines: + if re.match(start_pattern, line): + in_wrapper = True + wrapper_indent = len(line) - len(line.lstrip()) + continue # Skip wrapper start line + elif in_wrapper and re.match(end_pattern, line): + in_wrapper = False + continue # Skip wrapper end line + elif in_wrapper: + # Remove wrapper indentation + if line.startswith(' ' * 
(wrapper_indent + 4)): + result.append(line[wrapper_indent + 4:]) + else: + result.append(line) + else: + result.append(line) + + return '\n'.join(result) +``` + +#### Option 2: Marker-Based + +**Pros**: +- Explicit in source files +- Self-documenting +- No external configuration needed + +**Cons**: +- Requires updating all source files +- More markers to maintain +- Clutters source files + +**New markers**: +```csharp +// NOTEBOOK_BOILERPLATE_START +#r "nuget: NRedisStack, 1.1.1" +// NOTEBOOK_BOILERPLATE_END + +// NOTEBOOK_UNWRAP_START class +public class SyncLandingExample { +// NOTEBOOK_UNWRAP_END + + // NOTEBOOK_UNWRAP_START method + public void Run() { + // NOTEBOOK_UNWRAP_END + + // Actual code here + +// NOTEBOOK_UNWRAP_CLOSE method + } +// NOTEBOOK_UNWRAP_CLOSE class +} +``` + +**Not recommended** because: +- Too many new markers +- Clutters source files +- Harder to maintain +- Breaks existing examples + +### Configuration Schema and Semantics (Implementation-Proven) + +- Location: `build/jupyterize/jupyterize_config.json` +- Keys: Lowercased language names (`"c#"`, `"python"`, `"node.js"`, `"java"`, ...) +- Structure per language: + - `boilerplate`: Array of strings (each becomes a line in the first code cell) + - `unwrap_patterns`: Array of pattern objects with fields: + - `type` (string): Human-readable label used in logs + - `pattern` (regex string): Start condition (anchored with `^` recommended) + - `end_pattern` (regex string): End condition + - `keep_content` (bool): + - `true` → remove wrapper start/end lines, keep the inner content (useful for `{ ... }` ranges) + - `false` → remove the matching line(s) entirely + - If `pattern == end_pattern` → remove only the single matching line + - If `pattern != end_pattern` → remove from first match through end match, inclusive + - `description` (optional): Intent for maintainers + +#### At a Glance: Configuration Schema + +```json +{ + "": { + "boilerplate": ["", ""], + "unwrap_patterns": [ + { + "type": "