
System Architecture

Overview

This document describes the architecture of the Vertex AI MLOps Pipeline Demo, which demonstrates enterprise-grade machine learning workflows using Google Cloud Platform services.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        User Interface                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   Jupyter   │  │ Vertex AI   │  │   Cloud     │            │
│  │ Notebooks   │  │  Console    │  │   Console   │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    CI/CD Pipeline                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   Azure     │  │   GitHub    │  │   Cloud     │            │
│  │  DevOps     │  │   Actions   │  │   Build     │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Vertex AI Pipeline                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   BigQuery  │  │  Dataflow   │  │   Dataproc  │            │
│  │   Analysis  │  │ Processing  │  │ Processing  │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Infrastructure Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │     GCS     │  │     VPC     │  │     IAM     │            │
│  │   Storage   │  │  Network    │  │   Service   │            │
│  │             │  │             │  │  Accounts   │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Component Details

1. Infrastructure Layer

GCP Project

  • Project ID: prj-gft-vertexai-demo1
  • Region: europe-west2
  • Billing Account: 01A2F5-73127B-50AE5B

Networking

  • VPC: test-vpc-network
  • Subnet: my-subnet-123 (10.0.0.0/8)
  • Purpose: Isolated network for data processing

Storage

  • Artifact Bucket: vertex-ai-model-artifacts-bkt
  • Dataflow Templates: dataflow-templates-bkt
  • Dataflow Temp: dataflow-temp-bkt
  • Dataflow Artifacts: dataflow-artifacts-bkt

IAM Service Accounts

  • Vertex AI Executor: vertex-ai-executor@prj-gft-vertexai-demo1.iam.gserviceaccount.com
  • Dataproc: dataproc@prj-gft-vertexai-demo1.iam.gserviceaccount.com
  • Dataflow: dataflow@prj-gft-vertexai-demo1.iam.gserviceaccount.com
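Since the infrastructure is managed with Terraform/Terragrunt (see Infrastructure Recovery), the service accounts and buckets above are declared as code. A minimal, illustrative Terraform sketch — the project, account, bucket, and region values mirror this document, while attributes such as `uniform_bucket_level_access` are assumptions:

```hcl
resource "google_service_account" "vertex_ai_executor" {
  project      = "prj-gft-vertexai-demo1"
  account_id   = "vertex-ai-executor"
  display_name = "Vertex AI pipeline executor"
}

resource "google_storage_bucket" "model_artifacts" {
  project                     = "prj-gft-vertexai-demo1"
  name                        = "vertex-ai-model-artifacts-bkt"
  location                    = "europe-west2"
  uniform_bucket_level_access = true
}
```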

2. ML Pipeline Layer

BigQuery Component

  • Purpose: Data analysis and record counting
  • Table: bigquery-public-data.chicago_taxi_trips.taxi_trips
  • Outputs: Total records, 0.1% sample size
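The counting step reduces to simple arithmetic once the row count is known. A pure-Python sketch — in the pipeline the query would be submitted via the `google-cloud-bigquery` client and the real row count substituted; the SQL string here is illustrative:

```python
# Sketch of the BigQuery analysis step: count records, derive a 0.1% sample.
SAMPLE_FRACTION = 0.001  # the 0.1% figure described above

COUNT_QUERY = """
SELECT COUNT(*) AS total_records
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
"""

def sample_size(total_records: int, fraction: float = SAMPLE_FRACTION) -> int:
    """Return the number of rows in a `fraction`-sized sample."""
    if total_records < 0:
        raise ValueError("total_records must be non-negative")
    return int(total_records * fraction)
```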

Dataflow Component

  • Purpose: Apache Beam data processing
  • Template: chicago-taxi-avg-speed-csv.json
  • Output: Average taxi speeds by time period
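The core transform can be sketched in plain Python. In the real pipeline this logic runs inside an Apache Beam job launched from the `chicago-taxi-avg-speed-csv.json` template; the field names (`trip_miles`, `trip_seconds`, `period`) are assumptions for illustration:

```python
from collections import defaultdict

def average_speeds(trips):
    """trips: iterable of dicts with trip_miles, trip_seconds, period keys.
    Returns {period: average speed in mph}, skipping zero-duration trips."""
    totals = defaultdict(lambda: [0.0, 0])  # period -> [sum_mph, count]
    for trip in trips:
        seconds = trip["trip_seconds"]
        if seconds <= 0:
            continue  # guard against divide-by-zero on bad records
        mph = trip["trip_miles"] / (seconds / 3600.0)
        acc = totals[trip["period"]]
        acc[0] += mph
        acc[1] += 1
    return {period: s / n for period, (s, n) in totals.items()}
```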

Dataproc Component

  • Purpose: Spark batch processing
  • Output: Processed taxi data with aggregations
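The aggregation the Spark job performs can be sketched in pure Python. In PySpark this would be a `groupBy(...).agg(...)` over the taxi DataFrame; the grouping key and metrics below are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_by_company(records):
    """records: iterable of (company, fare) pairs.
    Returns {company: {"trips": count, "total_fare": sum}}."""
    out = defaultdict(lambda: {"trips": 0, "total_fare": 0.0})
    for company, fare in records:
        out[company]["trips"] += 1
        out[company]["total_fare"] += fare
    return dict(out)
```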

3. Data Flow

1. Data Source
   └── Chicago Taxi Trips (BigQuery Public Dataset)
       ├── 2. BigQuery Analysis
       │   ├── Count total records
       │   └── Calculate 0.1% sample
       ├── 3. Dataflow Processing
       │   ├── Read taxi trip data
       │   ├── Calculate average speeds
       │   └── Write results to GCS
       ├── 4. Dataproc Processing
       │   ├── Spark job execution
       │   ├── Data aggregation
       │   └── Store processed data
       └── 5. Results Aggregation
           ├── Combine all outputs
           └── Generate summary report
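The stages above can be sketched as a sequential driver. In the real system these are Vertex AI Pipelines components connected by dependency edges rather than plain function calls; the stage callables here are hypothetical stubs:

```python
def run_pipeline(bq_count, dataflow_run, dataproc_run):
    """Each argument is a callable for one stage; returns a summary dict."""
    total = bq_count()                  # 2. BigQuery analysis
    sample = int(total * 0.001)         #    0.1% sample
    speeds = dataflow_run(sample)       # 3. Dataflow processing
    aggregates = dataproc_run(sample)   # 4. Dataproc processing
    return {                            # 5. Results aggregation
        "total_records": total,
        "sample_size": sample,
        "avg_speeds": speeds,
        "aggregates": aggregates,
    }
```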

Security Architecture

Network Security

  • VPC with private subnets
  • Firewall rules for service-to-service communication
  • Cloud NAT for outbound internet access

Data Security

  • Encryption at rest (GCS, BigQuery)
  • Encryption in transit (TLS 1.2+)
  • IAM policies with least privilege access

Service Account Security

  • Dedicated service accounts per service
  • Minimal required permissions
  • Key rotation policies

Scalability Considerations

Horizontal Scaling

  • Dataflow auto-scaling based on data volume
  • Dataproc cluster scaling
  • BigQuery slot allocation
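For example, a Beam Python pipeline is typically launched on the Dataflow runner with explicit autoscaling options. The flag names below follow the Beam Python SDK; the script name and worker cap are illustrative:

```sh
python avg_speed_pipeline.py \
  --runner=DataflowRunner \
  --project=prj-gft-vertexai-demo1 \
  --region=europe-west2 \
  --autoscaling_algorithm=THROUGHPUT_BASED \
  --max_num_workers=20
```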

Vertical Scaling

  • Machine type selection for compute-intensive tasks
  • Memory optimization for large datasets

Monitoring & Observability

Logging

  • Cloud Logging for all services
  • Structured logging with correlation IDs
  • Log retention policies
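Structured logging with a correlation ID can be as simple as emitting one JSON object per line, which Cloud Logging parses into structured `LogEntry` payloads. A minimal sketch — the field names are illustrative, not a mandated schema:

```python
import json
import uuid

def log_event(message, correlation_id, severity="INFO", **fields):
    """Emit one structured log line; returns the record for inspection."""
    record = {"message": message, "severity": severity,
              "correlation_id": correlation_id, **fields}
    print(json.dumps(record))
    return record

correlation_id = str(uuid.uuid4())  # one ID shared across a pipeline run
log_event("dataflow step started", correlation_id, step="avg-speed")
```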

Metrics

  • Cloud Monitoring dashboards
  • Custom metrics for pipeline performance
  • Alerting on failures and performance degradation

Tracing

  • Distributed tracing across pipeline components
  • Performance bottleneck identification

Disaster Recovery

Data Backup

  • GCS versioning enabled
  • BigQuery table snapshots
  • Cross-region replication for critical data

Infrastructure Recovery

  • Terraform/Terragrunt for infrastructure as code
  • Automated deployment pipelines
  • Environment-specific configurations

Cost Optimization

Resource Management

  • Auto-scaling for compute resources
  • Spot instances for non-critical workloads
  • Resource scheduling for batch jobs

Storage Optimization

  • Lifecycle policies for GCS buckets
  • BigQuery partitioning and clustering
  • Data archival strategies
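A GCS lifecycle policy is a small JSON document applied per bucket (e.g. with `gsutil lifecycle set`). The sketch below moves objects to Nearline after 30 days and deletes them after a year; the age thresholds are illustrative, not the project's actual policy:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```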

Future Enhancements

Planned Improvements

  • Multi-region deployment
  • Advanced ML model training
  • Real-time streaming with Pub/Sub
  • Advanced monitoring with custom dashboards
  • Integration with external data sources

Technology Stack Evolution

  • Migration to newer GCP services
  • Adoption of new ML frameworks
  • Enhanced security features
  • Improved CI/CD practices