Commit 9e6c674

maint: clean up classify-docs artifacts and documentation
1 parent 4563ace commit 9e6c674

3 files changed: +262 -260 lines changed

tools/classify-docs/PROJECT.md

Lines changed: 204 additions & 41 deletions
Original file line number | Diff line number | Diff line change
@@ -315,29 +315,33 @@ Initial template → Page 1 → Updated v1 → Page 2 → Updated v2 → ... →
315315
- 4 false-positive MEDIUM flags: Documents 17, 23, 8, 24 (legitimately no caveats, but flagged for review)
316316
- Trade-off accepted: Better to over-flag for human review than miss actual errors in security classification
317317

318-
### Phase 6: Testing & Validation
318+
### Phase 6: Testing & Validation
319319

320-
**Implementation Guide**: `phase-6-guide.md` (future)
321-
322-
**Objectives**:
323-
- Test system prompt generation with guide and policy documents
324-
- Test classification with real 27-page classified document
325-
- Validate classification accuracy
326-
- Measure performance metrics
327-
- Document lessons learned
320+
**Status**: Complete
328321

329-
**Deliverables**:
330-
- `system-prompt.txt` - Generated from policy documents
331-
- `classification-results.json` - Classification test results
332-
- Performance analysis document
333-
- Architecture recommendations for go-agents-document-context
322+
**Development Summary**: `_context/.archive/04-document-classification.md`
334323

335-
**Success Criteria**:
336-
- Generated system prompt is comprehensive and accurate
337-
- Classification results are accurate for sample documents
338-
- Acceptable performance (time, token usage)
339-
- Clear lessons learned documented
340-
- Validated architecture patterns for both processing patterns
324+
**Objectives**:
325+
- ✅ Test system prompt generation with guide and policy documents
326+
- ✅ Test classification with 27-document test set
327+
- ✅ Validate classification accuracy
328+
- ✅ Measure performance metrics
329+
- ✅ Document lessons learned
330+
331+
**Results**:
332+
- ✅ Generated system prompt comprehensive and cached in `.cache/system-prompt.json`
333+
- ✅ Classification accuracy: 96.3% (26/27 documents correct)
334+
- ✅ Conservative confidence scoring successfully flags edge cases for human review
335+
- ✅ Performance acceptable: ~6-10 seconds per page with o4-mini
336+
- ✅ Lessons learned documented in Phase 5 development summary
337+
- ✅ Sequential processing pattern validated for both use cases
338+
339+
**Success Criteria Met**:
340+
- ✅ Generated system prompt is comprehensive and accurate
341+
- ✅ Classification results achieve 96.3% accuracy on test set
342+
- ✅ Acceptable performance (6-10s per page, manageable token usage)
343+
- ✅ Clear lessons learned documented
344+
- ✅ Validated architecture patterns (sequential processing with context accumulation)
341345

342346
## Output Structure
343347

@@ -498,31 +502,190 @@ This POC will answer critical questions for go-agents-document-context:
498502
- Performance analysis and optimization recommendations
499503
- Document lessons learned for go-agents-document-context library design
500504

501-
## Future Library Extraction
505+
## Next Steps: Component Extraction
506+
507+
With the prototype validated (96.3% accuracy on the 27-document test set), the next phase extracts reusable components into standardized libraries for use across document processing workflows.
508+
509+
### Prompt Engineering Infrastructure
510+
511+
**Goal**: Consolidate prompts into a standardized `pkg/prompts` package with `text/template` integration.
512+
513+
**Components to Extract**:
514+
- System prompt generation templates (currently in `pkg/prompt/`)
515+
- Classification prompt templates (currently embedded in `pkg/classify/document.go`)
516+
- Self-check verification questions
517+
- Confidence scoring guidance
518+
519+
**Organization Strategy**:
520+
- Organize by execution purpose (classification, system-prompt-generation, etc.)
521+
- Use `text/template` for parameterized prompt generation
522+
- Version control for prompt iterations
523+
- Single point of reference/update for all prompts
524+
525+
**Benefits**:
526+
- Testable prompt templates
527+
- Clear separation of prompt content from execution logic
528+
- Easier prompt iteration and A/B testing
529+
- Standardized prompt management pattern
530+
531+
**Target**: Extract pattern to go-agents for standardized prompt management
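
The `text/template` approach described above could look roughly like the following. This is a minimal sketch, not the prototype's actual prompt code: the `ClassificationData` type, its fields, the `RenderClassification` function, and the template text are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// ClassificationData holds the values substituted into the prompt.
// The type and its fields are assumptions made for this sketch.
type ClassificationData struct {
	PageNumber int
	TotalPages int
	Questions  []string // self-check verification questions
}

// classificationText is illustrative prompt text, not the prototype's real prompt.
const classificationText = `Classify page {{.PageNumber}} of {{.TotalPages}}.
Before answering, verify:
{{range .Questions}}- {{.}}
{{end}}`

// Parsing once at package init keeps template errors out of the request path.
var classificationTmpl = template.Must(
	template.New("classification").Parse(classificationText),
)

// RenderClassification produces the final prompt string for one page.
func RenderClassification(data ClassificationData) (string, error) {
	var sb strings.Builder
	if err := classificationTmpl.Execute(&sb, data); err != nil {
		return "", fmt.Errorf("render classification prompt: %w", err)
	}
	return sb.String(), nil
}

func main() {
	out, err := RenderClassification(ClassificationData{
		PageNumber: 1,
		TotalPages: 27,
		Questions:  []string{"Are all portion markings accounted for?"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because rendering is a pure function of its input data, a prompt template written this way can be unit tested without calling a model, which is the "testable prompt templates" benefit listed above.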
502532

503-
After POC completion, validated patterns will inform go-agents-document-context:
533+
### Document Processing Library
504534

505-
### Core Library (`go-agents-document-context`)
506-
- Document/Page interfaces from `document/`
507-
- PDF processor implementation
508-
- Additional format processors (DOCX, XLSX, PPTX, images)
509-
- Both processing patterns (parallel, sequential)
510-
- Context optimization utilities
511-
- Caching infrastructure (if validated as valuable)
535+
**Goal**: Create standalone library for PDF processing and image conversion.
536+
537+
**Components to Extract**:
538+
- `pkg/document/` primitives (Document/Page interfaces, PDF implementation)
539+
- ImageMagick integration for page rendering
540+
- Configurable image options (DPI, format, quality)
541+
- Resource lifecycle management
542+
543+
**Future Extensions**:
544+
- Support for additional formats (DOCX, XLSX, PPTX, images)
545+
- Pluggable format processors
546+
- Text extraction capabilities
547+
- OCR integration
548+
549+
**Design Considerations**:
550+
- Provider-specific constraints (e.g., Azure 20MB image limit)
551+
- Memory efficiency for large documents
552+
- Progressive page processing vs. batch loading
553+
- Format detection and auto-selection
554+
555+
**Target**: New standalone document processing library
556+
557+
### Parallel Processing Infrastructure
558+
559+
**Goal**: Extract and preserve parallel processing pattern for future resilience improvements.
560+
561+
**Components to Extract** (from git history commit d97ab1c^):
562+
563+
**Core Implementation**:
564+
```go
565+
func ProcessPages[T any](
566+
    ctx context.Context,
567+
    cfg config.ParallelConfig,
568+
    pages []document.Page,
569+
    processor func(context.Context, document.Page) (T, error),
570+
    progress ProgressFunc,
571+
) ([]T, error)
572+
```
573+
574+
**Configuration**:
575+
```go
576+
type ParallelConfig struct {
577+
    WorkerCap int // Default: 16
578+
}
579+
```
580+
581+
**Key Features**:
582+
- Worker pool with auto-detection (`min(runtime.NumCPU()*2, cfg.WorkerCap, len(pages))`)
583+
- Result ordering preserved through indexed result collection
584+
- Fail-fast error handling with context cancellation
585+
- Background result collection to prevent deadlocks
586+
- Modern Go 1.25.2 patterns (`sync.WaitGroup.Go()`, deferred channel closure)
587+
588+
**Architecture Highlights**:
589+
- Three-channel pattern: work queue, result channel, done signal
590+
- Goroutines: N workers + work distributor + background result collector
591+
- Deadlock prevention: Result collector runs in background, drains all results
592+
- Context coordination: First error cancels context, stops all workers
593+
594+
**Current Status**:
595+
- ✅ Architecture implemented and validated (Phase 2)
596+
- ✅ Comprehensive tests written and passing
597+
- ⚠️ Removed during Phase 5 due to Azure rate limiting
598+
- 🔄 Preserved in git history (commit d97ab1c^) for future extraction
599+
600+
**Future Work**:
601+
- Make resilient to rate limiting through adaptive worker scaling
602+
- Implement backpressure mechanisms
603+
- Provider-specific rate limit detection and handling
604+
- Integration with retry infrastructure for resilience
605+
606+
**Design Considerations**:
607+
- Dynamic worker pool scaling based on rate limit feedback
608+
- Graceful degradation (parallel → sequential on rate limit detection)
609+
- Per-provider rate limit configuration
610+
- Token bucket or similar rate limiting algorithms
611+
612+
**Target**: https://github.com/JaimeStill/go-agents-orchestration
613+
614+
### Sequential Processing Infrastructure
615+
616+
**Goal**: Extract generic context accumulation pattern for broader use.
617+
618+
**Components to Extract**:
619+
- `pkg/processing/sequential.go` implementation
620+
- Generic `ContextProcessor[T]` pattern
621+
- Progress reporting with intermediate result visibility
622+
- Context accumulation across processing stages
623+
624+
**Generalization Strategy**:
625+
- Beyond document processing (applicable to any sequential workflow)
626+
- Support for streaming/incremental processing
627+
- Configurable context update strategies
628+
- Checkpoint and resume capabilities
629+
630+
**Design Considerations**:
631+
- Memory management for large accumulated contexts
632+
- Context serialization for checkpointing
633+
- Error recovery and retry integration
634+
- Performance monitoring and metrics
635+
636+
**Target**: https://github.com/JaimeStill/go-agents-orchestration
637+
638+
### Retry Infrastructure
639+
640+
**Goal**: Extract retry logic with exponential backoff and provider-specific strategies.
641+
642+
**Components to Extract**:
643+
- `pkg/retry/` implementation with exponential backoff
644+
- Configurable retry parameters (max attempts, backoff multiplier, max backoff)
645+
- Non-retryable error marking
646+
- Provider-specific rate limit handling (e.g., Azure 429 responses)
647+
648+
**Integration Points**:
649+
- Parallel processing for per-worker retry
650+
- Sequential processing for progressive backoff
651+
- Provider implementations for rate limit detection
652+
653+
**Design Considerations**:
654+
- Provider-specific retry strategies (different providers have different rate limits)
655+
- Jitter to avoid synchronized retries across distributed clients
656+
- Circuit breaker patterns for sustained failures
657+
- Retry budget and quota management
658+
659+
**Target**: https://github.com/JaimeStill/go-agents-orchestration
660+
661+
### Configuration Patterns
662+
663+
**Goal**: Document validated configuration patterns for broader adoption.
664+
665+
**Patterns to Document**:
666+
- Pointer-based boolean config fields (a `nil` pointer distinguishes "unset" from `false`, which enables `true` defaults)
667+
- Optional vs required field handling
668+
- Default value merging strategies
669+
- Provider-specific vs generic configuration
670+
671+
**Lessons Learned**:
672+
- Configuration should only exist during initialization
673+
- Transform configuration into domain objects at system boundaries
674+
- Avoid passing configuration through multiple layers
675+
- Validation at point of use, not in configuration package
676+
677+
**Target**: Architecture documentation in go-agents repository
512678

513679
### What Remains Application-Specific
514-
- Classification logic and prompts
515-
- System prompt generation instructions
516-
- Result aggregation and formatting
517-
- CLI interface and configuration
518-
- Domain-specific error handling
519-
520-
### Open Questions for Library Design
521-
- Should caching be part of the library or application concern?
522-
- How to handle provider-specific constraints (Azure 20MB limit)?
523-
- What abstractions support both vision and text extraction use cases?
524-
- How to make format processors pluggable?
525-
- Should sequential processing be generalized beyond context strings?
680+
681+
The following components are domain-specific to classification and should remain in this prototype:
682+
683+
- **Classification prompt content** (moves to consolidated `pkg/prompts/`)
684+
- **System prompt generation logic** (refactored to use consolidated prompts)
685+
- **CLI interface and tooling** (`main.go`, `cmd/test-*`)
686+
- **Domain-specific caching strategies** (`.cache/system-prompt.json`)
687+
- **Test dataset and ground truth** (`_context/marked-documents/`)
688+
- **Classification result schemas** (`DocumentClassification` struct)
526689

527690
## References
528691
