Exercise: Build Document Ingestion and RAG System with AWS Services #6

@jwalsh

Overview

Build a document ingestion and retrieval system using AWS services that allows users to upload documents via API, process them, and query them using RAG (Retrieval Augmented Generation).

Architecture Components

1. Document Ingestion

  • API Gateway endpoint for document uploads
  • Authentication/authorization
  • Document validation (file type, size limits)
  • Store raw documents in S3
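
The validation step above can be sketched as follows. This is a minimal illustration, not a settled design: the allowed extensions are taken from the formats listed under the processing pipeline, the 50 MB limit comes from the performance requirements, and the names `ALLOWED_EXTENSIONS`, `MAX_SIZE_BYTES`, and `validate_upload` are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed limits: formats from the processing pipeline list, size from the
# performance requirements. Both are still listed as open questions.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return validation errors; an empty list means the upload is acceptable."""
    errors = []
    extension = PurePosixPath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes <= 0:
        errors.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        errors.append(f"file exceeds {MAX_SIZE_BYTES} bytes")
    return errors
```

One design point to weigh: API Gateway caps request payloads at 10 MB, so accepting documents up to 50 MB probably means the endpoint hands back a presigned S3 upload URL rather than receiving the file body itself. In that case a check like this would run on the declared name and size before the URL is issued.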

2. Document Processing Pipeline

  • S3 event triggers for new documents
  • Extract text from various formats (PDF, DOCX, TXT)
  • Generate embeddings for document chunks
  • Store processed data and metadata
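
As a rough sketch of the first and third pipeline steps, the code below reads bucket and key pairs out of an S3 event notification and splits extracted text into overlapping chunks ready for embedding. The function names, the 1000-character chunk size, and the 200-character overlap are assumptions chosen for illustration; text extraction and the embedding call are left out.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps sentences that straddle a boundary visible in both neighbouring chunks, at the cost of storing some text twice.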

3. Search & Retrieval

  • Vector store for embeddings (OpenSearch)
  • Knowledge Base API for queries
  • RAG integration for intelligent responses
  • Relevance scoring and ranking
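
To make the relevance scoring and ranking step concrete, here is a small in-memory sketch that ranks chunk embeddings by cosine similarity to a query embedding. In the architecture described here the vector store would do this work inside OpenSearch; the pure-Python version only illustrates the scoring idea, and `cosine_similarity` and `rank_chunks` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vector: list[float],
    chunk_vectors: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (chunk_id, score) pairs, highest score first."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The top-ranked chunks would then be passed to the language model as context for the RAG response.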

4. Storage Requirements

  • S3 bucket for raw documents
  • S3 bucket for processed documents
  • DynamoDB for document metadata
  • OpenSearch domain for vector storage
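
One possible shape for the document metadata record is sketched below. Every attribute name, the status values, and the key layout are assumptions for illustration; the issue does not specify a schema.

```python
from dataclasses import asdict, dataclass


@dataclass
class DocumentMetadata:
    """Hypothetical metadata record for one uploaded document."""

    document_id: str           # would serve as the table's partition key
    filename: str
    raw_s3_key: str            # location in the raw documents bucket
    size_bytes: int
    uploaded_at: str           # ISO 8601 timestamp
    status: str = "UPLOADED"   # assumed states: UPLOADED, PROCESSED, FAILED
    chunk_count: int = 0       # filled in once processing completes

    def to_item(self) -> dict:
        """Return a plain dict suitable for writing as a table item."""
        return asdict(self)
```

Keeping a `status` field on the record gives the query API a cheap way to tell whether a document has finished asynchronous processing.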

Non-Functional Requirements

Performance

  • < 5s document upload response time
  • < 2s query response time
  • Support 100 concurrent users
  • Handle documents up to 50MB
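
The latency targets above do not say whether they apply to the average, the worst case, or a percentile. The sketch below assumes a 95th-percentile reading and uses the nearest-rank method; both the percentile choice and the helper names are assumptions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct in the range (0, 100])."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def meets_target(samples: list[float], target_seconds: float, pct: float = 95.0) -> bool:
    """True if the chosen percentile of measured latencies is within the target."""
    return percentile(samples, pct) <= target_seconds
```

For example, `meets_target(query_latencies, 2.0)` would check the query target and `meets_target(upload_latencies, 5.0)` the upload target, where both lists hold measured durations in seconds.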

Security

  • API authentication (Cognito/IAM)
  • Encryption at rest (S3, OpenSearch)
  • Encryption in transit (HTTPS)
  • VPC isolation for processing
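
For the encryption-in-transit requirement, one common pattern is a bucket policy that denies any request not made over TLS. The sketch below only builds the policy document; the bucket name is a placeholder and `tls_only_bucket_policy` is a hypothetical helper.

```python
import json


def tls_only_bucket_policy(bucket_name: str) -> str:
    """Return a JSON bucket policy that denies all non-HTTPS access."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

The same policy would apply to both the raw and the processed documents buckets.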

Scalability

  • Auto-scaling for Lambda functions
  • OpenSearch cluster sizing
  • S3 lifecycle policies
  • CloudFront CDN for static assets
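
An S3 lifecycle policy for the raw documents bucket might look like the sketch below. Retention is still an open question, so the 30-day transition and 365-day expiration are placeholder values, and `build_lifecycle_configuration` is a hypothetical helper. The returned dict follows the shape S3 expects for a lifecycle configuration.

```python
def build_lifecycle_configuration(
    prefix: str,
    transition_days: int = 30,
    expiration_days: int = 365,
) -> dict:
    """Build a lifecycle configuration: move to infrequent access, then expire."""
    if transition_days < 30:
        # S3 does not transition objects to STANDARD_IA before 30 days.
        raise ValueError("transition_days must be at least 30 for STANDARD_IA")
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    rule_name = prefix.strip("/") or "all"
    return {
        "Rules": [
            {
                "ID": f"archive-and-expire-{rule_name}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expiration_days},
            }
        ]
    }
```

With boto3, a dict like this would be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.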

Monitoring

  • CloudWatch metrics for all services
  • X-Ray tracing for request flow
  • Error alerting via SNS
  • Dashboard for system health

Acceptance Criteria

  • Successfully upload and process test documents
  • Query returns relevant results from uploaded documents
  • System handles errors gracefully
  • All security requirements met
  • Performance benchmarks achieved

Out of Scope

  • User interface (API only)
  • Multi-language support (English only for v1)
  • Real-time processing (async is acceptable)
  • Document editing capabilities

Dependencies

  • Amazon Bedrock for embeddings
  • OpenSearch 2.x
  • Python 3.11+ for Lambda functions

Open Questions

  • Supported document formats?
  • Retention policy for documents?
  • Cost constraints/budget?
  • Specific embedding model preference?
  • Need for document versioning?

Metadata

    Labels

    • ai-ml: AI and Machine Learning topics
    • architecture: System or solution architecture
    • aws: Related to AWS services or infrastructure
    • exercise: Hands-on exercise or project
    • rag: Retrieval Augmented Generation related
