
# Long Hansards Processing Guide

This guide explains how to handle very long parliament hansards (50,000+ words) using text chunking, multi-stage summarization, and PDF text extraction.

## Overview

Parliament hansards can be extremely long documents, often containing 50,000+ words across multiple sessions. A standard single-pass summarization would fail because the full text exceeds the model's token limits. This guide provides solutions for:

1. **Text Chunking**: Splitting long documents into manageable pieces
2. **Multi-stage Summarization**: Summarizing chunks, then combining results
3. **Database Integration**: Working with existing `speech_data` from your pipeline
4. **Cost Optimization**: Efficient processing to minimize API costs

## Architecture for Long Documents

### 1. Chunking Strategy

```
Long Hansard (50,000+ words)
    ↓
Split into chunks (6,000 chars each)
    ↓
Summarize each chunk individually
    ↓
Combine chunk summaries
    ↓
Generate final comprehensive summary
```
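
A minimal sketch of the splitter, assuming plain-text input and the default 6,000-character chunks with 500-character overlap (the `chunk_text` name and signature are illustrative, and sentence-boundary handling, covered under Best Practices, is omitted):

```python
def chunk_text(text: str, chunk_size: int = 6000, overlap: int = 500,
               max_chunks: int = 10) -> list[str]:
    """Split text into overlapping chunks, capped at max_chunks."""
    chunks: list[str] = []
    start = 0
    while start < len(text) and len(chunks) < max_chunks:
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step forward, carrying `overlap` characters of trailing context
        start = end - overlap
    return chunks
```

With these defaults the effective stride is 5,500 characters, which is why a 50,000-character hansard yields the 9 chunks used in the cost example later.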

### 2. Configuration Options

```bash
# Chunking configuration
OPENAI_CHUNK_SIZE=6000        # Characters per chunk
OPENAI_CHUNK_OVERLAP=500      # Overlap between chunks
OPENAI_MAX_CHUNKS=10          # Maximum chunks to process
```
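
These variables might be read at startup along these lines (a sketch; the actual settings wiring depends on your Django configuration):

```python
import os

# Defaults mirror the values shown above
CHUNK_SIZE = int(os.getenv("OPENAI_CHUNK_SIZE", "6000"))
CHUNK_OVERLAP = int(os.getenv("OPENAI_CHUNK_OVERLAP", "500"))
MAX_CHUNKS = int(os.getenv("OPENAI_MAX_CHUNKS", "10"))
```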

### 3. Processing Flow

1. **Text Extraction**: Extract text from the database or PDF
2. **Length Check**: Determine if chunking is needed
3. **Chunking**: Split text at sentence boundaries
4. **Chunk Summarization**: Summarize each chunk
5. **Final Summary**: Combine chunk summaries
6. **Translation**: Translate to Bahasa Malaysia (a sketch of the whole flow follows)
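
Put together, the flow might look like this sketch, reusing `chunk_text` from above; `extract_speech_text`, `summarize_chunk`, `combine_summaries`, and `translate_to_malay` are illustrative helpers, not the service's actual API:

```python
async def process_hansard(sitting) -> dict:
    # 1-2. Extract text and decide whether chunking is needed
    text = extract_speech_text(sitting)
    if len(text) <= CHUNK_SIZE:
        summary_en = await summarize_chunk(text, 1, 1)
    else:
        # 3-4. Split the text, then summarize each chunk sequentially
        chunks = chunk_text(text, CHUNK_SIZE, CHUNK_OVERLAP, MAX_CHUNKS)
        partials = [await summarize_chunk(c, i + 1, len(chunks))
                    for i, c in enumerate(chunks)]
        # 5. Combine chunk summaries into the final summary
        summary_en = await combine_summaries(partials)
    # 6. Translate to Bahasa Malaysia
    summary_ms = await translate_to_malay(summary_en)
    return {"summary_en": summary_en, "summary_ms": summary_ms}
```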

## Text Processing from Database

Since you already have a data pipeline that handles PDF text extraction and stores all text in your database, the AI summarization service works directly with the existing `speech_data` field in the `Sitting` model.

### Data Flow

```
PDF Files → Your Data Pipeline → Database (speech_data)
    ↓
AI Summarization Service
    ↓
Chunking (if needed)
    ↓
Summary Generation
```
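
Extraction from the database can then be as simple as this sketch, assuming `speech_data` holds a JSON list of speech entries with a `text` field (adjust to your pipeline's actual schema):

```python
def extract_speech_text(sitting) -> str:
    """Flatten stored speech_data into one plain-text transcript."""
    # Assumes speech_data is a list of dicts with a "text" key;
    # verify against your pipeline's actual format
    entries = sitting.speech_data or []
    return "\n\n".join(entry.get("text", "") for entry in entries)
```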

## Usage Examples

### 1. Process Long Hansards with Chunking

```bash
# Process a specific sitting with chunking
python src/manage.py process_long_hansards \
    --house dewan-rakyat \
    --date 2024-01-15 \
    --chunk-size 6000 \
    --max-chunks 10

# Process multiple sittings
python src/manage.py process_long_hansards \
    --house dewan-rakyat \
    --term 15 \
    --limit 20 \
    --chunk-size 8000 \
    --max-chunks 15
```

### 2. API Usage for Long Documents

```javascript
// Frontend integration for long hansards
const generateLongSummary = async (house, date) => {
    const response = await fetch('/api/summary/', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            house: house,
            date: date,
            force_regenerate: false
        })
    });

    const result = await response.json();

    if (result.success) {
        console.log('Summary generated:', result.summary_en);
        console.log('Chunks processed:', result.chunks_processed);
    }
};
```

## Cost Optimization Strategies

### 1. Chunk Size Optimization

- **Smaller chunks (4,000 chars)**: More accurate, higher cost
- **Larger chunks (8,000 chars)**: Less accurate, lower cost
- **Optimal balance (6,000 chars)**: Good accuracy, reasonable cost

### 2. Chunk Limit Management

- **Max 10 chunks**: Prevents runaway costs
- **Overlap reduction**: Minimize redundant processing
- **Smart truncation**: Stop at reasonable limits

### 3. Caching Strategy

- **Database storage**: Never regenerate the same summary (see the sketch below)
- **Status tracking**: Avoid duplicate processing
- **Error recovery**: Retry failed chunks only
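
A minimal cache check before any API call, assuming summaries live on an illustrative `Summary` model keyed by sitting (the model and field names are not the actual schema):

```python
def get_or_generate_summary(sitting):
    # Reuse a stored summary instead of calling the API again
    existing = Summary.objects.filter(sitting=sitting, status="completed").first()
    if existing:
        return existing
    return generate_summary(sitting)  # falls through to the chunked pipeline
```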

## Performance Considerations

### 1. Processing Time

| Hansard length | Approximate processing time |
| --- | --- |
| Short (< 6,000 chars) | ~30 seconds |
| Medium (6,000-30,000 chars) | ~2-3 minutes |
| Long (30,000+ chars) | ~5-10 minutes |

### 2. API Rate Limits

- **OpenAI rate limits**: 3,500 requests/minute
- **Low concurrency**: Chunking issues many small requests rather than one huge one, and they are not sent in parallel
- **Sequential processing**: Summarizing chunks one at a time avoids rate-limit errors

### 3. Memory Usage

- **Chunk processing**: Process one chunk at a time
- **Text extraction**: Stream large files
- **Database updates**: Use transactions (see the sketch below)
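
For the database side, wrapping the final write in Django's standard `transaction.atomic` keeps partial results out of the table (the field names here are illustrative):

```python
from django.db import transaction

def save_summary(sitting, summary_en, summary_ms):
    # All-or-nothing write: any failure rolls back both fields
    with transaction.atomic():
        sitting.summary_en = summary_en
        sitting.summary_ms = summary_ms
        sitting.save(update_fields=["summary_en", "summary_ms"])
```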

## Error Handling

### 1. Chunk Processing Errors

```python
import logging

logger = logging.getLogger(__name__)

# Handle individual chunk failures without aborting the whole run
chunk_summaries = []
for i, chunk in enumerate(chunks):
    try:
        chunk_summary = await summarize_chunk(chunk, i + 1, len(chunks))
        if chunk_summary:
            chunk_summaries.append(chunk_summary)
        else:
            logger.warning(f"Failed to summarize chunk {i + 1}")
    except Exception as e:
        logger.error(f"Error in chunk {i + 1}: {e}")
```

### 2. Recovery Strategies

- **Partial summaries**: Continue with successful chunks
- **Retry mechanisms**: Exponential backoff for failures (see the sketch below)
- **Database fallback**: Use existing `speech_data` if needed
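
Exponential backoff for a failing chunk might be sketched like this (the retry count and delays are illustrative):

```python
import asyncio

async def summarize_with_retry(chunk, num, total, max_retries=3):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return await summarize_chunk(chunk, num, total)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up; the caller keeps its partial summaries
            await asyncio.sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s
```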

## Monitoring and Logging

### 1. Key Metrics

```python
# Track processing metrics
logger.info(f"Extracted {len(speech_text)} characters of speech text")
logger.info(f"Split text into {len(chunks)} chunks")
logger.info(f"Successfully generated summary for chunk {chunk_num}/{total_chunks}")
logger.info("Successfully generated final summary")
```

### 2. Performance Monitoring

- **Processing time**: Track total processing duration (a timing sketch follows)
- **Chunk count**: Monitor the number of chunks processed
- **Success rate**: Track successful vs. failed chunks
- **Cost tracking**: Monitor API usage and costs
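
Processing time can be captured with a simple timer around the pipeline, as in this sketch (`process_hansard` is the illustrative flow function from earlier):

```python
import time

async def timed_process(sitting):
    start = time.monotonic()
    result = await process_hansard(sitting)
    logger.info(f"Processed sitting in {time.monotonic() - start:.1f}s")
    return result
```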

### 3. Error Tracking

- **Chunk failures**: Log specific chunk errors
- **API errors**: Monitor OpenAI API responses
- **Database errors**: Track transaction failures

## Best Practices

### 1. Document Length Assessment

```python
# Determine the processing strategy based on length
if len(speech_text) <= 6000:
    # Use simple single-pass summarization
    summary = await simple_summarize(speech_text)
else:
    # Use the chunking approach
    summary = await chunked_summarize(speech_text)
```

### 2. Chunk Boundary Optimization

```python
import re

# Break at sentence boundaries: find the last sentence end in the search window.
# re.finditer returns an iterator, so materialize it before indexing.
sentence_endings = list(re.finditer(r'[.!?]\s+', search_text))
if sentence_endings:
    last_ending = sentence_endings[-1]
    end = search_start + last_ending.end()
```

### 3. Quality Assurance

- **Summary coherence**: Ensure the final summary flows well
- **Content completeness**: Verify all key points are included
- **Language accuracy**: Check translation quality
- **Format consistency**: Maintain a parliamentary tone

## Troubleshooting

### Common Issues

1. **"Text too long for processing"**
   - Increase `OPENAI_MAX_CHUNKS`
   - Reduce `OPENAI_CHUNK_SIZE` if individual chunks exceed the model's context
   - Check for infinite loops in the chunking logic
2. **"Text extraction failed"**
   - Verify the `speech_data` format in the database
   - Check for empty or malformed JSON
   - Validate text content quality
3. **"Chunk summaries don't combine well"**
   - Adjust the chunk overlap
   - Improve the final summary prompt
   - Check for content duplication
4. **"Processing takes too long"**
   - Reduce the chunk size
   - Implement parallel processing
   - Use background tasks

### Debug Commands

```
# Test the chunking configuration
python src/manage.py shell
>>> from api.services import ai_service
>>> text = "long text here..."
>>> chunks = ai_service._chunk_text(text)
>>> print(f"Split into {len(chunks)} chunks")
```


## Future Enhancements

1. **Parallel Processing**: Process chunks concurrently
2. **Smart Chunking**: Use semantic boundaries
3. **Incremental Summarization**: Update summaries as new data arrives
4. **Quality Scoring**: Assess summary quality automatically
5. **Multi-language Support**: Support more languages
6. **Real-time Processing**: WebSocket-based updates
7. **Content Classification**: Categorize discussion topics

## Cost Estimation

### Example Cost Calculation

- Long hansard: 50,000 characters
- Chunk size: 6,000 characters
- Number of chunks: 9
- Cost per chunk: $0.02
- **Total cost: $0.18 per hansard**
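
The same arithmetic as a quick helper (a sketch; the $0.02 per-chunk figure is this example's assumption, not a quoted API price):

```python
import math

def estimate_cost(text_len: int, chunk_size: int = 6000,
                  overlap: int = 500, cost_per_chunk: float = 0.02) -> float:
    # Effective stride is the chunk size minus the overlap carried forward
    stride = chunk_size - overlap
    n_chunks = max(1, math.ceil((text_len - overlap) / stride))
    return n_chunks * cost_per_chunk

print(estimate_cost(50_000))  # 9 chunks -> $0.18
```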


### Cost Optimization Tips

1. **Batch processing**: Process multiple hansards together
2. **Caching**: Never regenerate existing summaries
3. **Smart truncation**: Stop at reasonable limits
4. **Model selection**: Use appropriate model for task
5. **Rate limiting**: Avoid API rate limit penalties

This comprehensive approach ensures that even the longest parliament hansards can be processed efficiently while maintaining quality and controlling costs.