
Commit 47a6688

feat: parallel processing
1 parent fd54e1c commit 47a6688

File tree

7 files changed: +236 −34 lines


Projects/Nitrodigest/Docs/Getting Started/Quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ You can also run NitroDigest without any arguments to process all supported file
 nitrodigest
 ```

-This will automatically find and summarize all text files (`.txt`, `.md`, `.html`, `.json`, `.csv`, `.log`, etc.) in the current working directory.
+This will automatically find and summarize all text files (`.txt`, `.md`, `.html`, `.json`, `.csv`, `.log`, etc.) in the current working directory. NitroDigest processes multiple files in parallel (up to 4 simultaneously by default) and shows a progress bar during processing.

 ## 3. Observe the output

Projects/Nitrodigest/Docs/Guides/Summarizing All Files in a Directory.md

Lines changed: 70 additions & 14 deletions
@@ -15,8 +15,9 @@ This command will:

 1. Scan the directory and all subdirectories
 2. Find all supported text files
-3. Process each file individually using your default model
-4. Output each summary to the terminal in sequence
+3. Process multiple files in parallel using your default model (up to 4 files simultaneously by default)
+4. Display a progress bar during processing
+5. Output all summaries at the end

 ### Process Current Directory

@@ -66,19 +67,22 @@ Will process `meeting-notes.txt`, `project-report.md`, `data-analysis.csv`, `web

 ### Terminal Output (Default)

-By default, all summaries are displayed in your terminal one after another:
+By default, all summaries are displayed in your terminal after processing completes:

 ```bash
 nitrodigest documents/
 ```

-You'll see processing messages and formatted summaries for each file:
+You'll see a progress bar during processing, followed by all summaries:

 ```bash
 Processing directory: documents/
-Processing file: documents/meeting-notes.txt
-Generating summary for meeting-notes.txt...
-2025-05-26 07:55:42,615 - cli.summarizer.base.OllamaSummarizer - INFO - Sending request to Ollama API using model mistral
+Found 5 files to process with 4 workers
+
+Processing: 100%|████████████████████| 5/5 [00:12<00:00, 2.45s/file] ✓ more-notes.txt
+
+Processing complete: 5 successful, 0 failed
+
 ---
 date: '2025-05-16 07:50:22'
 id: documents/meeting-notes.txt
@@ -91,10 +95,15 @@ tokens: 189

 <summary of meeting-notes.txt>

-Processing file: documents/project-report.md
-Generating summary for project-report.md...
+================================================================================
+
+---
+date: '2025-05-16 08:15:10'
+id: documents/project-report.md
 ...
-Directory processing complete: 4 of 4 files processed successfully
+---
+
+<summary of project-report.md>
 ```

 ### Save All Summaries to One File
@@ -149,12 +158,33 @@ project/

 All three files (`overview.md`, `specifications.txt`, and `notes.txt`) will be processed.

-### File Ordering
+### Parallel Processing

-Files are processed in the order they're discovered by the file system, which typically means:
+NitroDigest processes multiple files simultaneously to improve performance. By default, it uses 4 parallel workers, meaning up to 4 files can be processed at the same time.

-- Files in the main directory first
-- Then files in subdirectories
+#### Adjusting Parallel Workers
+
+You can control the number of parallel workers based on your system resources and needs:
+
+```bash
+# Use 8 workers for faster processing (good for powerful systems)
+nitrodigest documents/ --max-workers 8
+
+# Use 2 workers for slower systems or to reduce resource usage
+nitrodigest documents/ --max-workers 2
+
+# Use 1 worker for sequential processing
+nitrodigest documents/ --max-workers 1
+```
+
+**When to adjust workers:**
+- **Increase workers (6-8):** If you have a powerful system and want maximum speed
+- **Decrease workers (1-2):** If you have limited RAM, CPU, or want to reduce system load
+- **Keep default (4):** For most use cases, this provides a good balance
+
+### File Ordering
+
+Files are processed in parallel, so they may complete in a different order than discovered. However, all files in the directory and subdirectories will be processed.

 ## Practical Use Cases

@@ -192,6 +222,20 @@ nitrodigest meeting_notes_march/ > march_meetings_summary.md

 ## Tips and Best Practices

+### Performance Optimization
+
+For best performance when processing large directories:
+
+```bash
+# Use more workers on powerful systems
+nitrodigest large_directory/ --max-workers 8 > summaries.md
+
+# Monitor your system resources (CPU, RAM) and adjust workers accordingly
+# If Ollama is running on the same machine, consider your model's resource needs
+```
+
+**Pro tip:** The optimal number of workers depends on your Ollama setup. If Ollama is using significant resources, fewer workers may actually be faster.
+
 ### Organize Your Input

 Structure your directories logically before processing:
@@ -253,6 +297,18 @@ If your directory contains specialized content, use a custom prompt:
 nitrodigest technical_docs/ --prompt "Summarize this technical document focusing on implementation details and requirements" > tech_summaries.md
 ```

+### Combining Parallel Processing with Other Options
+
+You can combine `--max-workers` with other options for optimized processing:
+
+```bash
+# Fast processing with custom model and 8 workers
+nitrodigest documents/ --model llama3 --max-workers 8 > summaries.md
+
+# Slower but thorough processing with 2 workers and custom prompt
+nitrodigest research/ --max-workers 2 --prompt-file research_prompt.txt > research_summaries.md
+```
+
 ## Next Steps

 - **[Custom Prompts](./Overriding%20Prompt%20Templates.md):** Explore Overriding Prompt Templates for specialized content
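The `--max-workers` cap documented above comes straight from `ThreadPoolExecutor`. A minimal stdlib-only sketch (the `task` function is a hypothetical stand-in for per-file summarization, not NitroDigest code) showing that `max_workers=N` bounds how many tasks ever run at once:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Track how many tasks run at once to demonstrate the max_workers cap.
lock = threading.Lock()
current = 0
peak = 0

def task(i):
    """Hypothetical stand-in for summarizing one file."""
    global current, peak
    with lock:
        current += 1
        peak = max(peak, current)
    time.sleep(0.05)  # simulate work
    with lock:
        current -= 1
    return i

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(task, range(8)))

print(results)      # [0, 1, 2, 3, 4, 5, 6, 7] — map preserves input order
print(peak <= 2)    # True — never more than 2 tasks ran simultaneously
```

Note that `executor.map` returns results in submission order even though completion order varies, which is why the guide warns that parallel runs may finish files in a different order than they were discovered.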

Projects/Nitrodigest/Docs/NitroDigest – Documentation.md

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ permalink: projects/nitrodigest/docs
 - **Local AI Summarization:** Uses Ollama to run LLMs on your machine, preserving privacy and working offline.
 - **Multiple Input Formats:** Supports plain text, Markdown, HTML, CSV, JSON, and other text-based files.
 - **Multiple Output Formats: By default NitroDigest returns Text, but for advanced processing it can return JSON.
-- **Batch Processing:** Summarize a single file or all files in a directory in one command.
+- **Parallel Batch Processing:** Summarize a single file or process multiple files in a directory simultaneously with configurable parallel workers for faster processing.
 - **Configurable Prompts:** Uses prompt templates that you can customize to change the style or content of summaries.
 - **Extensible:** Easily switch to different models (e.g., use a larger or domain-specific Ollama model) and adjust token budgets or segmentation for large inputs.

Projects/Nitrodigest/README.md

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ Available arguments:
 - `--ollama_api_url`: URL of Ollama API (default: <http://localhost:11434>)
 - `--format`: Output format. Can be `text` or `json` (default: text)
 - `--include-original`: Include original text in the summary output (default: False)
+- `--max-workers`: Maximum number of parallel workers for directory processing (default: 4)

 ### Custom Prompt Configuration

Projects/Nitrodigest/setup.cfg

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ install_requires =
     pyyaml>=6.0
     nltk>=3.9.1
     emoji>=2.14.1
+    tqdm>=4.67.1

 [options.packages.find]
 where = src

Projects/Nitrodigest/src/cli/main.py

Lines changed: 159 additions & 18 deletions
@@ -5,6 +5,8 @@
 import yaml
 from datetime import datetime
 import json
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from tqdm import tqdm

 from .summarizer import (
     OllamaSummarizer,
@@ -61,6 +63,12 @@ def main():
         action="store_true",
         help="Include original text in the summary output"
     )
+    parser.add_argument(
+        "--max-workers",
+        type=int,
+        default=4,
+        help="Maximum number of parallel workers for directory processing (default: 4)"
+    )

     args = parser.parse_args()

@@ -103,15 +111,15 @@ def main():
     elif not args.content:
         current_dir = os.getcwd()
         process_directory(current_dir, summarizer,
-                          args.format, args.include_original)
+                          args.format, args.include_original, args.max_workers)

     else:
         if os.path.isfile(args.content):
             process_file(args.content, summarizer,
                          args.format, args.include_original)
         elif os.path.isdir(args.content):
             process_directory(args.content, summarizer,
-                              args.format, args.include_original)
+                              args.format, args.include_original, args.max_workers)
         else:
             process_text(args.content, summarizer,
                          args.format, args.include_original)
@@ -142,7 +150,7 @@ def process_text(content: str, summarizer: OllamaSummarizer, format: str, include_original: bool):


 def process_file(file_path, summarizer, format: str, include_original: bool):
-    """Process a single file for summarization"""
+    """Process a single file for summarization and print results"""
     try:
         logger.info(f"Processing file: {file_path}")

@@ -168,31 +176,164 @@ def process_file(file_path, summarizer, format: str, include_original: bool):
         raise


-def process_directory(directory_path, summarizer, format: str, include_original: bool):
-    """Process all text files in a directory for summarization"""
-    logger.info(f"Processing directory: {directory_path}")
+def _process_file_return_result(file_path, summarizer, format: str, include_original: bool):
+    """Process a single file and return the result without printing"""
+    try:
+        with open(file_path, 'r', encoding='utf-8') as f:
+            content = f.read()
+
+        if not content.strip():
+            return None
+
+        file_name = os.path.basename(file_path)
+        metadata = {
+            'title': file_name,
+            'source': 'file://' + os.path.abspath(file_path),
+            'date': datetime.fromtimestamp(os.path.getmtime(file_path)).strftime("%Y-%m-%d %H:%M:%S"),
+            'id': file_path
+        }
+
+        result = summarizer.summarize(content, metadata)

-    file_count = 0
-    success_count = 0
+        if not result.is_success():
+            return None

+        return {
+            'content': content,
+            'metadata': metadata,
+            'summary': result.summary,
+            'model_used': result.model_used,
+            'tokens_used': result.tokens_used,
+            'file_path': file_path
+        }
+
+    except Exception:
+        raise
+
+
+def process_directory(directory_path, summarizer, format: str, include_original: bool, max_workers: int = 4):
+    """Process all text files in a directory with parallel processing and progress tracking"""
+
+    files_to_process = []
     for root, _, files in os.walk(directory_path):
         for filename in files:
-            # Only process text files - check common text file extensions
             if filename.lower().endswith(('.txt', '.md', '.html', '.htm', '.xml', '.json', '.csv', '.log')):
                 file_path = os.path.join(root, filename)
+                files_to_process.append(file_path)
+
+    file_count = len(files_to_process)
+
+    if file_count == 0:
+        print("No text files found to process")
+        return
+
+    print(f"\nProcessing directory: {directory_path}")
+    print(f"Found {file_count} files to process with {max_workers} workers\n")
+
+    import logging
+    original_levels = {}
+    for log_name in ['cli.summarizer.base.OllamaSummarizer', 'cli.main']:
+        log = logging.getLogger(log_name)
+        original_levels[log_name] = log.level
+        log.setLevel(logging.WARNING)
+
+    results = []
+    errors = []
+
+    with ThreadPoolExecutor(max_workers=max_workers) as executor:
+        future_to_file = {
+            executor.submit(_process_file_return_result, file_path, summarizer, format, include_original): file_path
+            for file_path in files_to_process
+        }
+
+        with tqdm(
+            total=file_count,
+            desc="Processing",
+            unit="file",
+            bar_format='{desc}: {percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}]',
+            leave=True,
+            position=0
+        ) as pbar:
+            for future in as_completed(future_to_file):
+                file_path = future_to_file[future]
+                file_name = os.path.basename(file_path)
+
                 try:
-                    process_file(file_path, summarizer,
-                                 format, include_original)
-                    success_count += 1
-                    logger.info(f"File {success_count} processed successfully")
+                    result = future.result()
+                    if result:
+                        results.append(result)
+                        pbar.set_postfix_str(
+                            f"✓ {file_name[:50]}", refresh=True)
+                    else:
+                        errors.append(
+                            (file_path, "Empty file or failed to generate summary"))
+                        pbar.set_postfix_str(
+                            f"✗ {file_name[:50]}", refresh=True)
                 except Exception as e:
-                    logger.error(
-                        f"Error when processing file {file_path}: {e}")
+                    errors.append((file_path, str(e)))
+                    pbar.set_postfix_str(f"✗ {file_name[:50]}", refresh=True)
                 finally:
-                    file_count += 1
+                    pbar.update(1)
+
+    for log_name, level in original_levels.items():
+        logging.getLogger(log_name).setLevel(level)
+
+    print(
+        f"\nProcessing complete: {len(results)} successful, {len(errors)} failed\n")

-    logger.info(
-        f"Directory processing complete: {success_count} of {file_count} files processed successfully")
+    if errors:
+        print("Failed files:")
+        for file_path, error in errors:
+            print(f"  - {os.path.basename(file_path)}: {error}")
+        print()
+
+    for idx, result in enumerate(results, 1):
+        _print_result(result, format, include_original)
+        if idx < len(results):
+            print("\n" + "=" * 80 + "\n")
+
+
+def _print_result(result, format: str, include_original: bool):
+    """Print a single result"""
+    metadata = result['metadata']
+    summary = result['summary']
+    content = result['content']
+
+    if format == 'text':
+        print('---')
+        yaml.dump(
+            {
+                'title': metadata.get('title', 'Untitled'),
+                'source': metadata.get('source', 'Unknown'),
+                'date': metadata.get('date', datetime.now().strftime("%Y-%m-%d")),
+                'id': metadata.get('id', ''),
+                'summary_date': datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+                'model': result['model_used'],
+                'tokens': result['tokens_used']
+            },
+            sys.stdout,
+            default_flow_style=False,
+            allow_unicode=True
+        )
+        print('---\n')
+        print(_json_to_text(summary))
+
+        if include_original:
+            print("\n---\n")
+            print("## Original Text\n")
+            print(content)
+    elif format == 'json':
+        json_summary = json.loads(summary)
+        json_summary["metadata"] = metadata
+        json_summary["model_used"] = result['model_used']
+        json_summary["tokens_used"] = result['tokens_used']
+
+        if include_original:
+            json_summary["original_text"] = content
+
+        print(json.dumps(json_summary, ensure_ascii=False, indent=2))
+    else:
+        print(summary)


 def _generate_summary(content, summarizer, metadata, format, include_original=True) -> int:
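The new `process_directory` combines `ThreadPoolExecutor`, `as_completed`, and a tqdm progress bar. The core submit-and-collect pattern can be sketched with the stdlib alone — here `summarize_one` is a hypothetical stand-in for `_process_file_return_result`, and a plain printed counter replaces the tqdm bar to keep the sketch dependency-free:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def summarize_one(path):
    """Hypothetical stand-in for _process_file_return_result."""
    if path.endswith(".bad"):
        raise ValueError("unreadable file")
    return {"file_path": path, "summary": f"summary of {path}"}

paths = ["a.txt", "b.md", "c.bad", "d.txt"]
results, errors = [], []

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its input path, as the commit does.
    future_to_file = {executor.submit(summarize_one, p): p for p in paths}
    done = 0
    for future in as_completed(future_to_file):
        path = future_to_file[future]
        try:
            results.append(future.result())   # re-raises worker exceptions here
        except Exception as exc:
            errors.append((path, str(exc)))
        done += 1
        print(f"Processing: {done}/{len(paths)}")  # tqdm would draw a bar here

print(f"Processing complete: {len(results)} successful, {len(errors)} failed")
```

The `future_to_file` dict is the key move: `as_completed` yields futures in completion order, so the mapping is the only way to recover which input each finished future belongs to. Collecting results and printing them after the pool drains (rather than printing inside workers) is what lets the commit keep summaries from interleaving.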
