AI Summarization for Parliament Hansards

This document describes the AI summarization feature for the Malaysian Parliament Hansard system.

Overview

The AI summarization feature automatically generates comprehensive summaries of parliament hansard sessions using OpenAI's language models. The system supports both English and Bahasa Malaysia summaries, with automatic translation between the two languages.

Features

  • Bilingual Summaries: Generate summaries in both English and Bahasa Malaysia
  • Cost Optimization: Store summaries in database to avoid repeated API calls
  • Async Processing: Non-blocking summary generation
  • Status Tracking: Monitor summary generation status
  • Error Handling: Comprehensive error handling and retry mechanisms
  • Bulk Operations: Process multiple sittings at once

Architecture

Database Schema

The Sitting model has been extended with the following fields:

# AI Summarization fields
summary_en = models.TextField(null=True, blank=True, help_text="AI-generated summary in English")
summary_bm = models.TextField(null=True, blank=True, help_text="AI-generated summary in Bahasa Malaysia")
summary_generated_at = models.DateTimeField(null=True, blank=True, help_text="Timestamp when summary was generated")
summary_status = models.CharField(
    max_length=20, 
    default="pending", 
    choices=[
        ("pending", "Pending"),
        ("processing", "Processing"),
        ("completed", "Completed"),
        ("failed", "Failed")
    ],
    help_text="Status of summary generation"
)
summary_error = models.TextField(null=True, blank=True, help_text="Error message if summary generation failed")

Components

  1. AISummarizationService (src/api/services.py)

    • Handles OpenAI API interactions
    • Extracts speech text from JSON data
    • Generates summaries and translations
    • Manages API configuration
  2. SummaryView (src/api/summary_views.py)

    • REST API endpoints for summary operations
    • GET: Retrieve existing summaries
    • POST: Generate new summaries
  3. Management Command (src/api/management/commands/generate_summaries.py)

    • Command-line tool for bulk summary generation
    • Background processing capabilities

API Endpoints

1. Get Summary

GET /api/summary/?house=dewan-rakyat&date=2024-01-15&language=en

Parameters:

  • house: House type (dewan-rakyat, dewan-negara, kamar-khas)
  • date: Sitting date (YYYY-MM-DD)
  • language: Preferred language (en, bm)

Response:

{
    "summary_exists": true,
    "summary": "The parliamentary session focused on...",
    "language": "English",
    "status": "completed",
    "generated_at": "2024-01-15T10:30:00Z",
    "sitting_id": 123,
    "date": "2024-01-15",
    "filename": "hansard_2024_01_15.pdf"
}
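
The same request can be made from a Python client. The sketch below is a minimal example using only the standard library. The base URL `http://localhost:8000` and the helper names `build_summary_url` and `fetch_summary` are assumptions for illustration and are not part of the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's host


def build_summary_url(house, date, language="en", base_url=BASE_URL):
    """Build the GET /api/summary/ URL from the documented query parameters."""
    query = urlencode({"house": house, "date": date, "language": language})
    return f"{base_url}/api/summary/?{query}"


def fetch_summary(house, date, language="en", base_url=BASE_URL):
    """Return the summary text, or None when no summary exists yet."""
    with urlopen(build_summary_url(house, date, language, base_url)) as resp:
        data = json.load(resp)
    return data["summary"] if data.get("summary_exists") else None
```

Calling `fetch_summary("dewan-rakyat", "2024-01-15")` performs the request shown above and returns the `summary` field when `summary_exists` is true.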

2. Generate Summary

POST /api/summary/
Content-Type: application/json

{
    "house": "dewan-rakyat",
    "date": "2024-01-15",
    "force_regenerate": false
}

Parameters:

  • house: House type
  • date: Sitting date
  • force_regenerate: Force regenerate existing summary (optional)

Response:

{
    "success": true,
    "message": "Summary generated successfully",
    "summary_en": "The parliamentary session focused on...",
    "summary_bm": "Sesi parlimen memberi tumpuan kepada...",
    "generated_at": "2024-01-15T10:30:00Z",
    "sitting_id": 123
}

3. Check Summary Status

GET /api/summary/status/?house=dewan-rakyat&date=2024-01-15

Response:

{
    "sitting_id": 123,
    "date": "2024-01-15",
    "filename": "hansard_2024_01_15.pdf",
    "summary_status": "completed",
    "summary_exists": true,
    "generated_at": "2024-01-15T10:30:00Z",
    "error": null
}
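
When a summary is generated in the background, a client can poll this endpoint until the status settles. The following is a rough sketch, not the project's own client: `wait_for_summary` is a hypothetical helper, and `get_status` stands for any callable that returns the decoded JSON body of the status endpoint.

```python
import time


def wait_for_summary(get_status, timeout=300.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until summary_status is 'completed' or 'failed', or until timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        # 'completed' and 'failed' are the two terminal states documented above.
        if status.get("summary_status") in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("summary still not finished after %.0f s" % timeout)
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without waiting.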

4. Bulk Summary Generation

POST /api/summary/bulk/
Content-Type: application/json

{
    "sittings": [
        {"house": "dewan-rakyat", "date": "2024-01-15"},
        {"house": "dewan-rakyat", "date": "2024-01-16"}
    ],
    "force_regenerate": false
}
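
For a contiguous date range, the request body can be built programmatically. This is a small sketch with a hypothetical helper name, `build_bulk_payload`. It emits one entry per calendar day, and how the endpoint treats dates without a sitting is not documented here, so treat that as an open assumption.

```python
from datetime import date, timedelta


def build_bulk_payload(house, start, end, force_regenerate=False):
    """Build the POST /api/summary/bulk/ body for every day from start to end inclusive."""
    days = (end - start).days
    sittings = [
        {"house": house, "date": (start + timedelta(days=offset)).isoformat()}
        for offset in range(days + 1)
    ]
    return {"sittings": sittings, "force_regenerate": force_regenerate}
```

For example, `build_bulk_payload("dewan-rakyat", date(2024, 1, 15), date(2024, 1, 16))` produces the request body shown above.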

Configuration

Environment Variables

Add the following environment variables to your .env file:

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here
# Optional, defaults to the OpenAI endpoint
OPENAI_BASE_URL=https://api.openai.com/v1
# Optional, defaults to gpt-4o-mini
OPENAI_MODEL=gpt-4o-mini
# Optional, defaults to 2000
OPENAI_MAX_TOKENS=2000
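
One way these variables might be read in Python is sketched below. The function name `load_openai_settings` is hypothetical and the project's actual loading code may differ. The defaults mirror the values documented above.

```python
import os


def load_openai_settings(env=None):
    """Read the OpenAI settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early instead of sending unauthenticated requests.
        raise RuntimeError("OpenAI API key not configured")
    return {
        "api_key": api_key,
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
        "max_tokens": int(env.get("OPENAI_MAX_TOKENS", "2000")),
    }
```

Passing a plain dictionary as `env` makes the function easy to test without touching the real environment.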

Database Migration

Run the migration to add summary fields to the database:

python src/manage.py migrate

Usage

1. API Usage

Frontend Integration:

// Get summary for a sitting
const response = await fetch('/api/summary/?house=dewan-rakyat&date=2024-01-15&language=en');
const data = await response.json();

if (data.summary_exists) {
    console.log('Summary:', data.summary);
} else {
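    // Generate the summary if it does not exist yet
    const generateResponse = await fetch('/api/summary/', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            house: 'dewan-rakyat',
            date: '2024-01-15'
        })
    });
    // Read the newly generated summary from the POST response
    const generated = await generateResponse.json();
    console.log('Summary:', generated.summary_en);
}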

2. Command Line Usage

Generate summaries for sittings that match specific criteria:

# Generate summaries for Dewan Rakyat, Term 15
python src/manage.py generate_summaries --house dewan-rakyat --term 15 --limit 50

# Generate summary for specific date
python src/manage.py generate_summaries --house dewan-rakyat --date 2024-01-15

# Force regenerate existing summaries
python src/manage.py generate_summaries --house dewan-rakyat --force --limit 10

# Dry run to see what would be processed
python src/manage.py generate_summaries --house dewan-rakyat --dry-run

3. Background Processing

For production environments, consider scheduling the management command with cron or running it from a task queue such as Celery:

# Using cron for scheduled processing ("." is used instead of "source", which cron's default /bin/sh may not provide)
0 2 * * * cd /path/to/hansards-back && . venv/bin/activate && python src/manage.py generate_summaries --house dewan-rakyat --limit 20

# Using Celery for async processing
celery -A hansards worker -l info

Cost Optimization

The system is designed to minimize OpenAI API costs:

  1. Caching: Summaries are stored in the database and reused
  2. Status Tracking: Prevents duplicate processing
  3. Text Truncation: Long speeches are truncated to stay within token limits
  4. Batch Processing: Process multiple sittings efficiently
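
The caching and truncation ideas above can be sketched as follows. Both helpers are hypothetical: `truncate_text` cuts by characters as a rough stand-in for a token budget, and `needs_generation` shows one possible decision rule. The service's actual limits and logic are not documented here.

```python
def truncate_text(text, max_chars=12000):
    """Trim text to at most max_chars, cutting at the last space when possible."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind(" ", 0, max_chars)
    return text[: cut if cut > 0 else max_chars]


def needs_generation(status, has_summary, force=False):
    """Decide whether a new API call is warranted for a sitting."""
    if status == "processing":
        return False  # another worker is already on it; avoid duplicate calls
    if force:
        return True
    # Reuse the stored summary when one was generated successfully.
    return not (status == "completed" and has_summary)
```

A character limit only approximates a token limit, so a production version would likely count tokens with the tokenizer that matches the configured model.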

Error Handling

The system handles the following categories of errors:

  1. API Failures: Network issues, rate limits, API errors
  2. Invalid Data: Malformed speech data, missing content
  3. Configuration Issues: Missing API keys, invalid settings
  4. Database Issues: Connection problems, constraint violations

All errors are logged and stored in the summary_error field for debugging.
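
The status transitions around a failure might look like the sketch below. `SittingStub` is a stand-in for the Django `Sitting` model so the example runs without a database, and `run_summary` is a hypothetical wrapper, not the project's actual service method.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class SittingStub:
    """Stand-in carrying only the summary fields used in this sketch."""
    summary_en: Optional[str] = None
    summary_status: str = "pending"
    summary_error: Optional[str] = None


def run_summary(sitting, summarize: Callable[[], str]) -> bool:
    """Run summarize() and record the outcome on the sitting."""
    sitting.summary_status = "processing"
    sitting.summary_error = None
    try:
        sitting.summary_en = summarize()
    except Exception as exc:
        # Log the full traceback and keep the message for later debugging.
        logger.exception("Summary generation failed")
        sitting.summary_status = "failed"
        sitting.summary_error = str(exc)
        return False
    sitting.summary_status = "completed"
    return True
```

With a real model instance, each status change would also need a `save()` call so that other processes can see it.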

Monitoring

Status Tracking

Monitor summary generation status:

  • pending: Summary not yet generated
  • processing: Currently being generated
  • completed: Successfully generated
  • failed: Generation failed (check summary_error)

Logging

The system logs its operations through Python's standard logging module:

import logging
logger = logging.getLogger(__name__)

Key log events:

  • Summary generation start/completion
  • API errors and retries
  • Translation failures
  • Database operations

Security Considerations

  1. API Key Management: Store OpenAI API keys securely
  2. Rate Limiting: Implement rate limiting for API endpoints
  3. Input Validation: Validate all input parameters
  4. Error Information: Don't expose sensitive error details to clients
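
Input validation for the documented query parameters could be sketched like this. The function name `validate_summary_params` and the error messages are illustrative assumptions. The allowed values come from the endpoint description above.

```python
import re
from datetime import date

VALID_HOUSES = {"dewan-rakyat", "dewan-negara", "kamar-khas"}
VALID_LANGUAGES = {"en", "bm"}
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def validate_summary_params(house, date_str, language="en"):
    """Return a dict of field errors; an empty dict means the input is valid."""
    errors = {}
    if house not in VALID_HOUSES:
        errors["house"] = "must be one of: " + ", ".join(sorted(VALID_HOUSES))
    if language not in VALID_LANGUAGES:
        errors["language"] = "must be 'en' or 'bm'"
    if not isinstance(date_str, str) or not DATE_PATTERN.fullmatch(date_str):
        errors["date"] = "must use the format YYYY-MM-DD"
    else:
        try:
            date.fromisoformat(date_str)  # rejects impossible dates such as 2024-02-30
        except ValueError:
            errors["date"] = "is not a valid calendar date"
    return errors
```

Returning generic messages like these, rather than raw exception text, also keeps internal details out of client responses.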

Performance Considerations

  1. Async Processing: Use async/await for API calls
  2. Database Indexing: Consider indexes on summary fields
  3. Caching: Implement Redis caching for frequently accessed summaries
  4. Batch Processing: Process multiple sittings in batches

Troubleshooting

Common Issues

  1. "OpenAI API key not configured"

    • Set OPENAI_API_KEY environment variable
    • Check API key validity
  2. "No speech text extracted"

    • Verify speech_data format
    • Check for empty or malformed JSON
  3. "Rate limit exceeded"

    • Implement exponential backoff
    • Reduce batch sizes
    • Check OpenAI rate limits
  4. "Summary generation failed"

    • Check summary_error field
    • Verify API configuration
    • Review speech data quality
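
The exponential backoff suggested for rate limit errors can be sketched generically. `call_with_backoff` is a hypothetical helper. In practice `retry_on` would be narrowed to the rate-limit exception of whichever client library is in use, and the delays below are assumptions rather than recommended values.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits are roughly 1, 2, 4 and 8 seconds before the fifth and final attempt.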

Debug Commands

# List failed sittings together with their error messages
python src/manage.py shell
>>> from api.models import Sitting
>>> Sitting.objects.filter(summary_status='failed').values('date', 'summary_error')

# Test OpenAI connection
python src/manage.py shell
>>> from api.services import ai_service
>>> print(ai_service.client is not None)

Future Enhancements

  1. Multiple AI Providers: Support for other AI services
  2. Custom Prompts: Configurable summarization prompts
  3. Quality Metrics: Summary quality assessment
  4. User Feedback: Allow users to rate summaries
  5. Advanced Filtering: Filter summaries by topics, speakers, etc.
  6. Export Features: Export summaries in various formats
  7. Real-time Processing: WebSocket-based real-time updates