
Commit 47a6688

feat: parallel processing
1 parent fd54e1c commit 47a6688

File tree

7 files changed: +236 −34 lines


Projects/Nitrodigest/Docs/Getting Started/Quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ You can also run NitroDigest without any arguments to process all supported file
 nitrodigest
 ```

-This will automatically find and summarize all text files (`.txt`, `.md`, `.html`, `.json`, `.csv`, `.log`, etc.) in the current working directory.
+This will automatically find and summarize all text files (`.txt`, `.md`, `.html`, `.json`, `.csv`, `.log`, etc.) in the current working directory. NitroDigest processes multiple files in parallel (up to 4 simultaneously by default) and shows a progress bar during processing.

 ## 3. Observe the output

Projects/Nitrodigest/Docs/Guides/Summarizing All Files in a Directory.md

Lines changed: 70 additions & 14 deletions
@@ -15,8 +15,9 @@ This command will:

 1. Scan the directory and all subdirectories
 2. Find all supported text files
-3. Process each file individually using your default model
-4. Output each summary to the terminal in sequence
+3. Process multiple files in parallel using your default model (up to 4 files simultaneously by default)
+4. Display a progress bar during processing
+5. Output all summaries at the end

 ### Process Current Directory

@@ -66,19 +67,22 @@ Will process `meeting-notes.txt`, `project-report.md`, `data-analysis.csv`, `web

 ### Terminal Output (Default)

-By default, all summaries are displayed in your terminal one after another:
+By default, all summaries are displayed in your terminal after processing completes:

 ```bash
 nitrodigest documents/
 ```

-You'll see processing messages and formatted summaries for each file:
+You'll see a progress bar during processing, followed by all summaries:

 ```bash
 Processing directory: documents/
-Processing file: documents/meeting-notes.txt
-Generating summary for meeting-notes.txt...
-2025-05-26 07:55:42,615 - cli.summarizer.base.OllamaSummarizer - INFO - Sending request to Ollama API using model mistral
+Found 5 files to process with 4 workers
+
+Processing: 100%|████████████████████| 5/5 [00:12<00:00, 2.45s/file] ✓ more-notes.txt
+
+Processing complete: 5 successful, 0 failed
+
 ---
 date: '2025-05-16 07:50:22'
 id: documents/meeting-notes.txt
@@ -91,10 +95,15 @@ tokens: 189

 <summary of meeting-notes.txt>

-Processing file: documents/project-report.md
-Generating summary for project-report.md...
+================================================================================
+
+---
+date: '2025-05-16 08:15:10'
+id: documents/project-report.md
 ...
-Directory processing complete: 4 of 4 files processed successfully
+---
+
+<summary of project-report.md>
 ```

 ### Save All Summaries to One File
@@ -149,12 +158,33 @@ project/

 All three files (`overview.md`, `specifications.txt`, and `notes.txt`) will be processed.

-### File Ordering
+### Parallel Processing

-Files are processed in the order they're discovered by the file system, which typically means:
+NitroDigest processes multiple files simultaneously to improve performance. By default, it uses 4 parallel workers, meaning up to 4 files can be processed at the same time.

-- Files in the main directory first
-- Then files in subdirectories
+#### Adjusting Parallel Workers
+
+You can control the number of parallel workers based on your system resources and needs:
+
+```bash
+# Use 8 workers for faster processing (good for powerful systems)
+nitrodigest documents/ --max-workers 8
+
+# Use 2 workers for slower systems or to reduce resource usage
+nitrodigest documents/ --max-workers 2
+
+# Use 1 worker for sequential processing
+nitrodigest documents/ --max-workers 1
+```
+
+**When to adjust workers:**
+- **Increase workers (6-8):** If you have a powerful system and want maximum speed
+- **Decrease workers (1-2):** If you have limited RAM, CPU, or want to reduce system load
+- **Keep default (4):** For most use cases, this provides a good balance
+
+### File Ordering
+
+Files are processed in parallel, so they may complete in a different order than discovered. However, all files in the directory and subdirectories will be processed.

 ## Practical Use Cases

@@ -192,6 +222,20 @@ nitrodigest meeting_notes_march/ > march_meetings_summary.md

 ## Tips and Best Practices

+### Performance Optimization
+
+For best performance when processing large directories:
+
+```bash
+# Use more workers on powerful systems
+nitrodigest large_directory/ --max-workers 8 > summaries.md
+
+# Monitor your system resources (CPU, RAM) and adjust workers accordingly
+# If Ollama is running on the same machine, consider your model's resource needs
+```
+
+**Pro tip:** The optimal number of workers depends on your Ollama setup. If Ollama is using significant resources, fewer workers may actually be faster.
+
 ### Organize Your Input

 Structure your directories logically before processing:
@@ -253,6 +297,18 @@ If your directory contains specialized content, use a custom prompt:
 nitrodigest technical_docs/ --prompt "Summarize this technical document focusing on implementation details and requirements" > tech_summaries.md
 ```

+### Combining Parallel Processing with Other Options
+
+You can combine `--max-workers` with other options for optimized processing:
+
+```bash
+# Fast processing with custom model and 8 workers
+nitrodigest documents/ --model llama3 --max-workers 8 > summaries.md
+
+# Slower but thorough processing with 2 workers and custom prompt
+nitrodigest research/ --max-workers 2 --prompt-file research_prompt.txt > research_summaries.md
+```
+
 ## Next Steps

 - **[Custom Prompts](./Overriding%20Prompt%20Templates.md):** Explore Overriding Prompt Templates for specialized content
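The `--max-workers` cap documented above comes straight from `ThreadPoolExecutor`. A minimal stdlib-only sketch (the `task` function is a hypothetical stand-in for per-file summarization, not NitroDigest code) showing that `max_workers=N` bounds how many tasks ever run at once:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Track how many tasks run at once to demonstrate the max_workers cap.
lock = threading.Lock()
current = 0
peak = 0

def task(i):
    """Hypothetical stand-in for summarizing one file."""
    global current, peak
    with lock:
        current += 1
        peak = max(peak, current)
    time.sleep(0.05)  # simulate work
    with lock:
        current -= 1
    return i

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(task, range(8)))

print(results)      # [0, 1, 2, 3, 4, 5, 6, 7] — map preserves input order
print(peak <= 2)    # True — never more than 2 tasks ran simultaneously
```

Note that `executor.map` returns results in submission order even though completion order varies, which is why the guide warns that parallel runs may finish files in a different order than they were discovered.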

Projects/Nitrodigest/Docs/NitroDigest – Documentation.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ permalink: projects/nitrodigest/docs
 - **Local AI Summarization:** Uses Ollama to run LLMs on your machine, preserving privacy and working offline.
 - **Multiple Input Formats:** Supports plain text, Markdown, HTML, CSV, JSON, and other text-based files.
 - **Multiple Output Formats: By default NitroDigest returns Text, but for advanced processing it can return JSON.
-- **Batch Processing:** Summarize a single file or all files in a directory in one command.
+- **Parallel Batch Processing:** Summarize a single file or process multiple files in a directory simultaneously with configurable parallel workers for faster processing.
 - **Configurable Prompts:** Uses prompt templates that you can customize to change the style or content of summaries.
 - **Extensible:** Easily switch to different models (e.g., use a larger or domain-specific Ollama model) and adjust token budgets or segmentation for large inputs.

Projects/Nitrodigest/README.md

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ Available arguments:
 - `--ollama_api_url`: URL of Ollama API (default: <http://localhost:11434>)
 - `--format`: Output format. Can be `text` or `json` (default: text)
 - `--include-original`: Include original text in the summary output (default: False)
+- `--max-workers`: Maximum number of parallel workers for directory processing (default: 4)

 ### Custom Prompt Configuration

Projects/Nitrodigest/setup.cfg

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ install_requires =
     pyyaml>=6.0
     nltk>=3.9.1
     emoji>=2.14.1
+    tqdm>=4.67.1

 [options.packages.find]
 where = src

Projects/Nitrodigest/src/cli/main.py

Lines changed: 159 additions & 18 deletions
@@ -5,6 +5,8 @@
 import yaml
 from datetime import datetime
 import json
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from tqdm import tqdm

 from .summarizer import (
     OllamaSummarizer,
@@ -61,6 +63,12 @@ def main():
         action="store_true",
         help="Include original text in the summary output"
     )
+    parser.add_argument(
+        "--max-workers",
+        type=int,
+        default=4,
+        help="Maximum number of parallel workers for directory processing (default: 4)"
+    )

     args = parser.parse_args()

@@ -103,15 +111,15 @@ def main():
     elif not args.content:
         current_dir = os.getcwd()
         process_directory(current_dir, summarizer,
-                          args.format, args.include_original)
+                          args.format, args.include_original, args.max_workers)

     else:
         if os.path.isfile(args.content):
             process_file(args.content, summarizer,
                          args.format, args.include_original)
         elif os.path.isdir(args.content):
             process_directory(args.content, summarizer,
-                              args.format, args.include_original)
+                              args.format, args.include_original, args.max_workers)
         else:
             process_text(args.content, summarizer,
                          args.format, args.include_original)
@@ -142,7 +150,7 @@ def process_text(content: str, summarizer: OllamaSummarizer, format: str, include_original: bool):


 def process_file(file_path, summarizer, format: str, include_original: bool):
-    """Process a single file for summarization"""
+    """Process a single file for summarization and print results"""
     try:
         logger.info(f"Processing file: {file_path}")

@@ -168,31 +176,164 @@ def process_file(file_path, summarizer, format: str, include_original: bool):
         raise


-def process_directory(directory_path, summarizer, format: str, include_original: bool):
-    """Process all text files in a directory for summarization"""
-    logger.info(f"Processing directory: {directory_path}")
+def _process_file_return_result(file_path, summarizer, format: str, include_original: bool):
+    """Process a single file and return the result without printing"""
+    try:
+        with open(file_path, 'r', encoding='utf-8') as f:
+            content = f.read()
+
+        if not content.strip():
+            return None
+
+        file_name = os.path.basename(file_path)
+        metadata = {
+            'title': file_name,
+            'source': 'file://' + os.path.abspath(file_path),
+            'date': datetime.fromtimestamp(os.path.getmtime(file_path)).strftime("%Y-%m-%d %H:%M:%S"),
+            'id': file_path
+        }
+
+        result = summarizer.summarize(content, metadata)

-    file_count = 0
-    success_count = 0
+        if not result.is_success():
+            return None

+        return {
+            'content': content,
+            'metadata': metadata,
+            'summary': result.summary,
+            'model_used': result.model_used,
+            'tokens_used': result.tokens_used,
+            'file_path': file_path
+        }
+
+    except Exception:
+        raise
+
+
+def process_directory(directory_path, summarizer, format: str, include_original: bool, max_workers: int = 4):
+    """Process all text files in a directory with parallel processing and progress tracking"""
+
+    files_to_process = []
     for root, _, files in os.walk(directory_path):
         for filename in files:
-            # Only process text files - check common text file extensions
             if filename.lower().endswith(('.txt', '.md', '.html', '.htm', '.xml', '.json', '.csv', '.log')):
                 file_path = os.path.join(root, filename)
+                files_to_process.append(file_path)
+
+    file_count = len(files_to_process)
+
+    if file_count == 0:
+        print("No text files found to process")
+        return
+
+    print(f"\nProcessing directory: {directory_path}")
+    print(f"Found {file_count} files to process with {max_workers} workers\n")
+
+    import logging
+    original_levels = {}
+    for log_name in ['cli.summarizer.base.OllamaSummarizer', 'cli.main']:
+        log = logging.getLogger(log_name)
+        original_levels[log_name] = log.level
+        log.setLevel(logging.WARNING)
+
+    results = []
+    errors = []
+
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        future_to_file = {
+            executor.submit(_process_file_return_result, file_path, summarizer, format, include_original): file_path
+            for file_path in files_to_process
+        }
+
+        with tqdm(
+            total=file_count,
+            desc="Processing",
+            unit="file",
+            bar_format='{desc}: {percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}]',
+            leave=True,
+            position=0
+        ) as pbar:
+            for future in as_completed(future_to_file):
+                file_path = future_to_file[future]
+                file_name = os.path.basename(file_path)
+
                 try:
-                    process_file(file_path, summarizer,
-                                 format, include_original)
-                    success_count += 1
-                    logger.info(f"File {success_count} processed successfully")
+                    result = future.result()
+                    if result:
+                        results.append(result)
+                        pbar.set_postfix_str(
+                            f"✓ {file_name[:50]}", refresh=True)
+                    else:
+                        errors.append(
+                            (file_path, "Empty file or failed to generate summary"))
+                        pbar.set_postfix_str(
+                            f"✗ {file_name[:50]}", refresh=True)
                 except Exception as e:
-                    logger.error(
-                        f"Error when processing file {file_path}: {e}")
+                    errors.append((file_path, str(e)))
+                    pbar.set_postfix_str(f"✗ {file_name[:50]}", refresh=True)
                 finally:
-                    file_count += 1
+                    pbar.update(1)
+
+    for log_name, level in original_levels.items():
+        logging.getLogger(log_name).setLevel(level)
+
+    print(
+        f"\nProcessing complete: {len(results)} successful, {len(errors)} failed\n")

-    logger.info(
-        f"Directory processing complete: {success_count} of {file_count} files processed successfully")
+    if errors:
+        print("Failed files:")
+        for file_path, error in errors:
+            print(f"  - {os.path.basename(file_path)}: {error}")
+        print()
+
+    for idx, result in enumerate(results, 1):
+        _print_result(result, format, include_original)
+        if idx < len(results):
+            print("\n" + "=" * 80 + "\n")
+
+
+def _print_result(result, format: str, include_original: bool):
+    """Print a single result"""
+    metadata = result['metadata']
+    summary = result['summary']
+    content = result['content']
+
+    if format == 'text':
+        print('---')
+        yaml.dump(
+            {
+                'title': metadata.get('title', 'Untitled'),
+                'source': metadata.get('source', 'Unknown'),
+                'date': metadata.get('date', datetime.now().strftime("%Y-%m-%d")),
+                'id': metadata.get('id', ''),
+                'summary_date': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+                'model': result['model_used'],
+                'tokens': result['tokens_used']
+            },
+            sys.stdout,
+            default_flow_style=False,
+            allow_unicode=True
+        )
+        print('---\n')
+        print(_json_to_text(summary))
+
+        if include_original:
+            print("\n---\n")
+            print("## Original Text\n")
+            print(content)
+    elif format == 'json':
+        json_summary = json.loads(summary)
+        json_summary["metadata"] = metadata
+        json_summary["model_used"] = result['model_used']
+        json_summary["tokens_used"] = result['tokens_used']
+
+        if include_original:
+            json_summary["original_text"] = content
+
+        print(json.dumps(json_summary, ensure_ascii=False, indent=2))
+    else:
+        print(summary)


 def _generate_summary(content, summarizer, metadata, format, include_original=True) -> int:
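The new `process_directory` combines `ThreadPoolExecutor`, `as_completed`, and a tqdm progress bar. The core submit-and-collect pattern can be sketched with the stdlib alone — here `summarize_one` is a hypothetical stand-in for `_process_file_return_result`, and a plain printed counter replaces the tqdm bar to keep the sketch dependency-free:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def summarize_one(path):
    """Hypothetical stand-in for _process_file_return_result."""
    if path.endswith(".bad"):
        raise ValueError("unreadable file")
    return {"file_path": path, "summary": f"summary of {path}"}

paths = ["a.txt", "b.md", "c.bad", "d.txt"]
results, errors = [], []

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its input path, as the commit does.
    future_to_file = {executor.submit(summarize_one, p): p for p in paths}
    done = 0
    for future in as_completed(future_to_file):
        path = future_to_file[future]
        try:
            results.append(future.result())   # re-raises worker exceptions here
        except Exception as exc:
            errors.append((path, str(exc)))
        done += 1
        print(f"Processing: {done}/{len(paths)}")  # tqdm would draw a bar here

print(f"Processing complete: {len(results)} successful, {len(errors)} failed")
```

The `future_to_file` dict is the key move: `as_completed` yields futures in completion order, so the mapping is the only way to recover which input each finished future belongs to. Collecting results and printing them after the pool drains (rather than printing inside workers) is what lets the commit keep summaries from interleaving.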
