This guide explains how to handle very long parliament hansards (50,000+ words) using advanced chunking techniques and PDF text extraction.
Parliament hansards can be extremely long documents, often exceeding 50,000 words across multiple sessions. A single-pass summarization call fails because the full text exceeds the model's token limit. This guide provides solutions for:
- Text Chunking: Splitting long documents into manageable pieces
- Multi-stage Summarization: Summarizing chunks, then combining results
- Database Integration: Working with existing speech_data from your pipeline
- Cost Optimization: Efficient processing to minimize API costs
```
Long Hansard (50,000+ words)
        ↓
Split into chunks (6,000 chars each)
        ↓
Summarize each chunk individually
        ↓
Combine chunk summaries
        ↓
Generate final comprehensive summary
```
```bash
# Chunking configuration
OPENAI_CHUNK_SIZE=6000    # Characters per chunk
OPENAI_CHUNK_OVERLAP=500  # Overlap between chunks
OPENAI_MAX_CHUNKS=10      # Maximum chunks to process
```

The pipeline runs through six stages:

- Text Extraction: Extract text from the database or PDF
- Length Check: Determine if chunking is needed
- Chunking: Split text at sentence boundaries
- Chunk Summarization: Summarize each chunk
- Final Summary: Combine chunk summaries
- Translation: Translate to Bahasa Malaysia
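The stages above can be sketched end to end in plain Python. This is a minimal sketch, not the project's actual service code: `summarize` is a stand-in for the OpenAI call, chunking here is naive fixed-size splitting, and the translation stage is omitted.

```python
from typing import Callable

CHUNK_SIZE = 6000  # characters per chunk, mirroring OPENAI_CHUNK_SIZE
MAX_CHUNKS = 10    # hard cap, mirroring OPENAI_MAX_CHUNKS


def multi_stage_summarize(text: str, summarize: Callable[[str], str]) -> str:
    """Summarize long text in two stages: per-chunk, then combined."""
    if len(text) <= CHUNK_SIZE:
        return summarize(text)  # short enough for a single pass
    # Stage 1: split into fixed-size chunks and summarize each one
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    chunks = chunks[:MAX_CHUNKS]  # smart truncation: stop at a reasonable limit
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    # Stage 2: combine the chunk summaries into one final summary
    return summarize("\n".join(chunk_summaries))
```

In the real service each `summarize` call is an API request, so the chunk count directly determines cost.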
Since you already have a data pipeline that handles PDF text extraction and stores all text in your database, the AI summarization service works directly with the existing speech_data field in the Sitting model.
```
PDF Files → Your Data Pipeline → Database (speech_data)
        ↓
AI Summarization Service
        ↓
Chunking (if needed)
        ↓
Summary Generation
```
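Before chunking, the service needs plain text out of `speech_data`. A minimal sketch, assuming `speech_data` is stored as a JSON array of entries with `speaker` and `text` keys; adjust the keys to your pipeline's actual schema:

```python
import json


def extract_speech_text(speech_data: str) -> str:
    """Flatten a speech_data JSON payload into plain text for summarization.

    Assumes a JSON array of objects with "speaker" and "text" keys
    (an assumption, not the pipeline's documented schema).
    """
    try:
        entries = json.loads(speech_data)
    except (json.JSONDecodeError, TypeError):
        return ""  # empty or malformed JSON: fall back to empty text
    lines = []
    for entry in entries:
        speaker = entry.get("speaker", "").strip()
        text = entry.get("text", "").strip()
        if text:
            lines.append(f"{speaker}: {text}" if speaker else text)
    return "\n".join(lines)
```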
```bash
# Process a specific sitting with chunking
python src/manage.py process_long_hansards \
    --house dewan-rakyat \
    --date 2024-01-15 \
    --chunk-size 6000 \
    --max-chunks 10

# Process multiple sittings
python src/manage.py process_long_hansards \
    --house dewan-rakyat \
    --term 15 \
    --limit 20 \
    --chunk-size 8000 \
    --max-chunks 15
```

```javascript
// Frontend integration for long hansards
const generateLongSummary = async (house, date) => {
  const response = await fetch('/api/summary/', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
      house: house,
      date: date,
      force_regenerate: false
    })
  });
  const result = await response.json();
  if (result.success) {
    console.log('Summary generated:', result.summary_en);
    console.log('Chunks processed:', result.chunks_processed);
  }
};
```

- Smaller chunks (4,000 chars): More accurate, higher cost
- Larger chunks (8,000 chars): Less accurate, lower cost
- Optimal balance (6,000 chars): Good accuracy, reasonable cost
- Max 10 chunks: Prevents runaway costs
- Overlap reduction: Minimize redundant processing
- Smart truncation: Stop at reasonable limits
- Database storage: Never regenerate same summary
- Status tracking: Avoid duplicate processing
- Error recovery: Retry failed chunks only
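The "never regenerate" rule reduces to a lookup before the expensive call. A sketch with a plain dict standing in for the database row; `get_or_generate_summary` is a hypothetical helper, not part of the codebase:

```python
from typing import Callable


def get_or_generate_summary(cache: dict, key: str,
                            generate: Callable[[], str]) -> str:
    """Return a cached summary if present; otherwise generate and store it.

    `cache` stands in for persistent storage (e.g. a summary column on the
    Sitting row); `generate` is the expensive chunked-summarization call.
    """
    if key in cache and cache[key]:
        return cache[key]  # cache hit: zero API cost
    summary = generate()   # cache miss: pay for the API call exactly once
    cache[key] = summary
    return summary
```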
- Short hansard (< 6,000 chars): ~30 seconds
- Medium hansard (6,000-30,000 chars): ~2-3 minutes
- Long hansard (30,000+ chars): ~5-10 minutes
- OpenAI rate limits: 3,500 requests/minute
- Chunking: reduces the number of concurrent requests
- Sequential processing: avoids rate-limit issues
- Chunk processing: Process one chunk at a time
- Text extraction: Stream large files
- Database updates: Use transactions
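One-chunk-at-a-time processing can be sketched as below; `summarize` and the delay value are placeholders, and in the real service the database write for each chunk would sit inside a transaction:

```python
import time
from typing import Callable, List


def process_chunks_sequentially(chunks: List[str],
                                summarize: Callable[[str], str],
                                delay_seconds: float = 1.0) -> List[str]:
    """Process one chunk at a time, pausing between API calls.

    Sequential processing keeps memory flat (one chunk in flight) and
    spaces out requests to stay under the API rate limit.
    """
    summaries = []
    for index, chunk in enumerate(chunks):
        summaries.append(summarize(chunk))
        if index < len(chunks) - 1:
            time.sleep(delay_seconds)  # back off between requests
    return summaries
```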
```python
# Handle individual chunk failures
chunk_summaries = []
for i, chunk in enumerate(chunks):
    try:
        chunk_summary = await summarize_chunk(chunk, i + 1, len(chunks))
        if chunk_summary:
            chunk_summaries.append(chunk_summary)
        else:
            logger.warning(f"Failed to summarize chunk {i + 1}")
    except Exception as e:
        logger.error(f"Error in chunk {i + 1}: {e}")
```

- Partial summaries: Continue with successful chunks
- Retry mechanisms: Exponential backoff for failures
- Database fallback: Use existing speech_data if needed
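Exponential backoff for failed chunks might look like the following; this is a generic sketch, not the service's actual retry code:

```python
import time
from typing import Callable


def retry_with_backoff(func: Callable[[], str],
                       max_attempts: int = 3,
                       base_delay: float = 1.0) -> str:
    """Retry a flaky call with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each chunk's `summarize_chunk` call in `retry_with_backoff` means only the failed chunk is retried, not the whole hansard.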
```python
# Track processing metrics
logger.info(f"Extracted {len(speech_text)} characters of speech text")
logger.info(f"Split text into {len(chunks)} chunks")
logger.info(f"Successfully generated summary for chunk {chunk_num}/{total_chunks}")
logger.info("Successfully generated final summary")
```

- Processing time: Track total processing duration
- Chunk count: Monitor number of chunks processed
- Success rate: Track successful vs failed chunks
- Cost tracking: Monitor API usage and costs
- Chunk failures: Log specific chunk errors
- API errors: Monitor OpenAI API responses
- Database errors: Track transaction failures
```python
# Determine processing strategy based on length
if len(speech_text) <= 6000:
    # Use simple summarization
    summary = await simple_summarize(speech_text)
else:
    # Use chunking approach
    summary = await chunked_summarize(speech_text)
```

```python
# Break at sentence boundaries (finditer returns an iterator,
# so materialize it before indexing)
sentence_endings = list(re.finditer(r'[.!?]\s+', search_text))
if sentence_endings:
    last_ending = sentence_endings[-1]
    end = search_start + last_ending.end()
```

- Summary coherence: Ensure the final summary flows well
- Content completeness: Verify all key points are included
- Language accuracy: Check translation quality
- Format consistency: Maintain parliamentary tone
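The boundary-detection snippet above can be fleshed out into a complete overlapping chunker. A sketch, assuming character-based sizing as configured by `OPENAI_CHUNK_SIZE` and `OPENAI_CHUNK_OVERLAP`; the project's real `_chunk_text` may differ:

```python
import re
from typing import List


def chunk_text(text: str, chunk_size: int = 6000, overlap: int = 500) -> List[str]:
    """Split text into overlapping chunks, breaking at sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer the last sentence ending inside the chunk window
            search_text = text[start:end]
            sentence_endings = list(re.finditer(r'[.!?]\s+', search_text))
            if sentence_endings:
                end = start + sentence_endings[-1].end()
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` so context carries into the next chunk
        start = max(end - overlap, start + 1)
    return chunks
```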
**"Text too long for processing"**
- Increase `OPENAI_MAX_CHUNKS`
- Reduce `OPENAI_CHUNK_SIZE`
- Check for infinite loops in the chunking logic

**"Text extraction failed"**
- Verify the speech_data format in the database
- Check for empty or malformed JSON
- Validate text content quality

**"Chunk summaries don't combine well"**
- Adjust the chunk overlap
- Improve the final summary prompt
- Check for content duplication

**"Processing takes too long"**
- Reduce the chunk size
- Implement parallel processing
- Use background tasks
```
# Test chunking configuration
python src/manage.py shell
>>> from api.services import ai_service
>>> text = "long text here..."
>>> chunks = ai_service._chunk_text(text)
>>> print(f"Split into {len(chunks)} chunks")
```
## Future Enhancements
1. **Parallel Processing**: Process chunks concurrently
2. **Smart Chunking**: Use semantic boundaries
3. **Incremental Summarization**: Update summaries as new data arrives
4. **Quality Scoring**: Assess summary quality automatically
5. **Multi-language Support**: Support more languages
6. **Real-time Processing**: WebSocket-based updates
7. **Content Classification**: Categorize discussion topics
## Cost Estimation
### Example Cost Calculation
- Long hansard: 50,000 characters
- Chunk size: 6,000 characters
- Number of chunks: 9
- Cost per chunk: $0.02
- Total cost: $0.18 per hansard
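The arithmetic behind the example is ceiling division over the chunk size (overlap ignored for simplicity; the $0.02 per-chunk figure is an assumed average, not a published price):

```python
# Rough per-hansard cost estimate using the example figures above
total_chars = 50_000
chunk_size = 6_000
cost_per_chunk = 0.02  # USD, assumed average per summarization call

num_chunks = -(-total_chars // chunk_size)  # ceiling division
total_cost = num_chunks * cost_per_chunk
print(f"{num_chunks} chunks, ${total_cost:.2f} per hansard")  # → 9 chunks, $0.18 per hansard
```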
### Cost Optimization Tips
1. **Batch processing**: Process multiple hansards together
2. **Caching**: Never regenerate existing summaries
3. **Smart truncation**: Stop at reasonable limits
4. **Model selection**: Use appropriate model for task
5. **Rate limiting**: Avoid API rate limit penalties
This comprehensive approach ensures that even the longest parliament hansards can be processed efficiently while maintaining quality and controlling costs.