Watch the complete walkthrough on YouTube
Learn how to build an end-to-end RAG system with Google Cloud AI in under 15 minutes
A complete end-to-end Retrieval-Augmented Generation (RAG) system for processing invoice and contract documents using Google Cloud AI services. This system combines Document AI, BigQuery, Vector Search, and Gemini to create an intelligent question-answering assistant.
```
PDF Documents → GCS Upload → Document AI (OCR) → Gemini (JSON Extraction)
        ↓
BigQuery Storage → Text Chunking → Vector Embeddings → Vector Search Index
        ↓
User Query → Embedding → Similarity Search → Context Retrieval → Answer Generation
```
- Automated Document Processing: OCR and text extraction from non-searchable PDF invoices
- Intelligent Data Extraction: Structured JSON output with 95%+ field accuracy using Gemini 2.0 Flash
- Semantic Search: 768-dimensional vector embeddings for natural language querying
- Sub-second Response Times: Optimized chunking and retrieval pipeline
- Production-Ready: Error handling, deduplication, and scalable architecture
- Interactive Interface: Clean Streamlit web application with context transparency
- Google Cloud Platform: Document AI, BigQuery, Vector Search, Vertex AI
- AI/ML: Gemini 2.0 Flash, text-embedding-004 model
- Backend: Python, Google Cloud SDK
- Frontend: Streamlit
- Data Storage: Google Cloud Storage, BigQuery
- Vector Database: Vertex AI Vector Search
- Google Cloud Platform account with billing enabled
- Python 3.8+
- GCP project with the following APIs enabled:
- Document AI API
- BigQuery API
- Vertex AI API
- Cloud Storage API
1. **Clone the repository**

   ```bash
   git clone https://github.com/aks861999/google-cloud-chatbot.git
   cd invoice-rag-system
   ```

2. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up Google Cloud authentication**

   ```bash
   # Option 1: Using a service account key
   export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account-key.json"

   # Option 2: Using the gcloud CLI
   gcloud auth application-default login
   ```

4. **Configure project settings**

   Update the following variables in each script:

   ```python
   PROJECT_ID = "your-gcp-project-id"
   PROJECT_NUMBER = "your-project-number"
   LOCATION = "us-central1"  # or your preferred region
   ```
```bash
python upload_to_gcs.py
```

- Creates the GCS bucket if it doesn't exist
- Uploads all PDF files from the `./docs` directory
- Filters for PDF files only

Configuration:

- `GCS_BUCKET_NAME`: Your Cloud Storage bucket name
- `SOURCE_PDF_DIR`: Local directory containing PDF files
```bash
python gemini_process_document.py
```

This script:

- Sets up the BigQuery dataset and tables automatically
- Processes PDFs using Document AI for OCR
- Extracts structured data using Gemini 2.0 Flash
- Stores results in BigQuery with deduplication
Key Features:
- Smart JSON extraction with flexible prompting
- Automatic schema validation
- Error handling and retry logic
- Local JSON backup (optional)
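Gemini responses often wrap JSON in markdown fences or surrounding prose, so a tolerant parser is one common way to implement the "smart JSON extraction" noted above. The sketch below is illustrative, not the repository's actual code; `parse_model_json` is a hypothetical name.

```python
import json
import re


def parse_model_json(raw: str) -> dict:
    """Extract a JSON object from a Gemini text response, tolerating
    markdown code fences and surrounding prose."""
    # Prefer a ```json ... ``` fenced block if the model emitted one
    m = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    if m:
        candidate = m.group(1)
    else:
        # Fall back to the outermost {...} span in the raw text
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            raise ValueError("no JSON object found in model output")
        candidate = raw[start:end + 1]
    return json.loads(candidate)
```

A schema-validation pass (checking required invoice fields and types) would typically run on the returned dict before the row is written to BigQuery.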
```bash
python create_embeddings.py
```

Creates searchable embeddings:

- Chunks documents into 512-token segments
- Generates 768-dimensional vectors using text-embedding-004
- Creates and deploys the Vector Search index
- Optimizes for semantic similarity matching
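The chunking step can be sketched as below. The real pipeline counts model tokens; this sketch approximates a token as a whitespace-separated word, and `chunk_text` is an illustrative name rather than the repository's actual function.

```python
def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    """Split text into segments of at most chunk_size tokens.

    Tokens are approximated here by whitespace-separated words; a
    production pipeline would use the embedding model's tokenizer.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Each resulting chunk is then sent to text-embedding-004, which returns a 768-dimensional vector per chunk.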
```bash
streamlit run app.py
```

Access the web interface at http://localhost:8501
- "What is the invoice number?"
- "Who is the vendor for this invoice?"
- "What are all the line items and their costs?"
- "What is the total amount due?"
- "When is the payment due date?"
- "Show me all services with quantity greater than 5"
```python
from app import get_context_from_vector_search, generate_answer

# Query the system programmatically
query = "What is the total invoice amount?"
context = get_context_from_vector_search(query, num_neighbors=3)
response = generate_answer(query, context)
print(response['answer'])
```

**upload_to_gcs.py**

- Purpose: Centralized document storage
- Input: Local PDF files
- Output: Files stored in Google Cloud Storage
- Features: Automatic directory creation, PDF filtering
**gemini_process_document.py**

- Purpose: Extract structured data from unstructured documents
- Input: PDF files from GCS
- Output: JSON data stored in BigQuery
- Key Technologies: Document AI (OCR), Gemini 2.0 Flash (extraction)
**create_embeddings.py**

- Purpose: Create searchable vector representations
- Input: Structured data from BigQuery
- Output: Vector embeddings in a Vector Search index
- Algorithm: text-embedding-004 model, 512-token chunking
**app.py**

- Purpose: User-facing question-answering interface
- Input: Natural language queries
- Output: Contextually accurate answers
- Architecture: Query embedding → Vector search → Context retrieval → Answer generation
- Processing Speed: ~30 seconds per invoice (including OCR + extraction)
- Query Response Time: <2 seconds average
- Extraction Accuracy: 95%+ for standard invoice fields
- Context Relevance: Top-3 retrieval with 85%+ relevance scores
- Scalability: Handles 10K+ documents with sub-linear query time
```python
BQ_DATASET_ID = "contract_data"
BQ_TABLE_ID = "documents"
RECREATE_BQ_TABLE = False   # Set to True for schema updates

TEXT_CHUNK_SIZE = 512       # Token limit per chunk
NUM_NEIGHBORS = 3           # Retrieved context sections
EMBEDDING_MODEL = "text-embedding-004"

SAVE_JSON_LOCALLY = True    # Enable local JSON backup
UPDATE_IF_EXISTS = True     # Update existing records
```
1. **Authentication Errors**

   ```bash
   # Verify credentials
   gcloud auth list
   gcloud config set project YOUR_PROJECT_ID
   ```

2. **API Not Enabled**

   ```bash
   # Enable required APIs
   gcloud services enable documentai.googleapis.com
   gcloud services enable bigquery.googleapis.com
   gcloud services enable aiplatform.googleapis.com
   ```

3. **Vector Search Index Issues**

   - Ensure the index is properly deployed
   - Check the endpoint resource name format
   - Verify embedding dimensions match (768)

4. **Memory Issues with Large Documents**

   - Reduce `TEXT_CHUNK_SIZE` to 256 or 128
   - Process documents in smaller batches
   - Monitor BigQuery slot usage
Enable verbose logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

**Streamlit Cloud**

- Fork this repository
- Connect to Streamlit Cloud
- Add secrets in the Streamlit dashboard:

```toml
[secrets]
GOOGLE_APPLICATION_CREDENTIALS_JSON = "your-service-account-json"
```
**Google Cloud Run**

```bash
# Build and deploy
gcloud run deploy invoice-rag-system \
  --source . \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
```

**Docker**

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py"]
```

- Multi-modal Processing: Support for images and tables within documents
- Batch Processing: High-volume document processing with queue management
- Advanced Filtering: Metadata-based search and filtering capabilities
- Multi-language Support: International document processing
- API Endpoints: RESTful API for programmatic access
- Real-time Streaming: Token-by-token response streaming
- Analytics Dashboard: Processing metrics and usage analytics
- Caching Layer: Redis for frequent queries
- Parallel Processing: Concurrent document processing
- Smart Chunking: Content-aware segmentation
- Model Fine-tuning: Domain-specific embedding models
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add type hints for new functions
- Include unit tests for new features
- Update documentation for API changes
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Cloud Platform for AI services
- Streamlit for the web framework
- The open-source community for inspiration and tools
- Issues: GitHub Issues
- Email: biswas.2491@gmail.com