
System Architecture

Overview

This document describes the architecture of the Vertex AI MLOps Pipeline Demo, which demonstrates enterprise-grade machine learning workflows using Google Cloud Platform services.

High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        User Interface                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   Jupyter   │  │ Vertex AI   │  │   Cloud     │            │
│  │ Notebooks   │  │  Console    │  │   Console   │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    CI/CD Pipeline                               │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   Azure     │  │   GitHub    │  │   Cloud     │            │
│  │  DevOps     │  │   Actions   │  │   Build     │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Vertex AI Pipeline                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │   BigQuery  │  │  Dataflow   │  │   Dataproc  │            │
│  │   Analysis  │  │ Processing  │  │ Processing  │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Infrastructure Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐            │
│  │     GCS     │  │     VPC     │  │     IAM     │            │
│  │   Storage   │  │  Network    │  │   Service   │            │
│  │             │  │             │  │  Accounts   │            │
│  └─────────────┘  └─────────────┘  └─────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Component Details

1. Infrastructure Layer

GCP Project

  • Project ID: prj-gft-vertexai-demo1
  • Region: europe-west2
  • Billing Account: 01A2F5-73127B-50AE5B

Networking

  • VPC: test-vpc-network
  • Subnet: my-subnet-123 (10.0.0.0/8)
  • Purpose: Isolated network for data processing

Storage

  • Artifact Bucket: vertex-ai-model-artifacts-bkt
  • Dataflow Templates: dataflow-templates-bkt
  • Dataflow Temp: dataflow-temp-bkt
  • Dataflow Artifacts: dataflow-artifacts-bkt

IAM Service Accounts

  • Vertex AI Executor: vertex-ai-executor@prj-gft-vertexai-demo1.iam.gserviceaccount.com
  • Dataproc: dataproc@prj-gft-vertexai-demo1.iam.gserviceaccount.com
  • Dataflow: dataflow@prj-gft-vertexai-demo1.iam.gserviceaccount.com
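Since the infrastructure is managed with Terraform/Terragrunt (see Infrastructure Recovery), the service accounts and buckets above are declared as code. A minimal, illustrative Terraform sketch — the project, account, bucket, and region values mirror this document, while attributes such as `uniform_bucket_level_access` are assumptions:

```hcl
resource "google_service_account" "vertex_ai_executor" {
  project      = "prj-gft-vertexai-demo1"
  account_id   = "vertex-ai-executor"
  display_name = "Vertex AI pipeline executor"
}

resource "google_storage_bucket" "model_artifacts" {
  project                     = "prj-gft-vertexai-demo1"
  name                        = "vertex-ai-model-artifacts-bkt"
  location                    = "europe-west2"
  uniform_bucket_level_access = true
}
```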

2. ML Pipeline Layer

BigQuery Component

  • Purpose: Data analysis and record counting
  • Table: bigquery-public-data.chicago_taxi_trips.taxi_trips
  • Outputs: Total records, 0.1% sample size
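The counting step reduces to simple arithmetic once the row count is known. A pure-Python sketch — in the pipeline the query would be submitted via the `google-cloud-bigquery` client and the real row count substituted; the SQL string here is illustrative:

```python
# Sketch of the BigQuery analysis step: count records, derive a 0.1% sample.
SAMPLE_FRACTION = 0.001  # the 0.1% figure described above

COUNT_QUERY = """
SELECT COUNT(*) AS total_records
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
"""

def sample_size(total_records: int, fraction: float = SAMPLE_FRACTION) -> int:
    """Return the number of rows in a `fraction`-sized sample."""
    if total_records < 0:
        raise ValueError("total_records must be non-negative")
    return int(total_records * fraction)
```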

Dataflow Component

  • Purpose: Apache Beam data processing
  • Template: chicago-taxi-avg-speed-csv.json
  • Output: Average taxi speeds by time period
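The core transform can be sketched in plain Python. In the real pipeline this logic runs inside an Apache Beam job launched from the `chicago-taxi-avg-speed-csv.json` template; the field names (`trip_miles`, `trip_seconds`, `period`) are assumptions for illustration:

```python
from collections import defaultdict

def average_speeds(trips):
    """trips: iterable of dicts with trip_miles, trip_seconds, period keys.
    Returns {period: average speed in mph}, skipping zero-duration trips."""
    totals = defaultdict(lambda: [0.0, 0])  # period -> [sum_mph, count]
    for trip in trips:
        seconds = trip["trip_seconds"]
        if seconds <= 0:
            continue  # guard against divide-by-zero on bad records
        mph = trip["trip_miles"] / (seconds / 3600.0)
        acc = totals[trip["period"]]
        acc[0] += mph
        acc[1] += 1
    return {period: s / n for period, (s, n) in totals.items()}
```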

Dataproc Component

  • Purpose: Spark batch processing
  • Output: Processed taxi data with aggregations
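The aggregation the Spark job performs can be sketched in pure Python. In PySpark this would be a `groupBy(...).agg(...)` over the taxi DataFrame; the grouping key and metrics below are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_by_company(records):
    """records: iterable of (company, fare) pairs.
    Returns {company: {"trips": count, "total_fare": sum}}."""
    out = defaultdict(lambda: {"trips": 0, "total_fare": 0.0})
    for company, fare in records:
        out[company]["trips"] += 1
        out[company]["total_fare"] += fare
    return dict(out)
```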

3. Data Flow

1. Data Source
   └── Chicago Taxi Trips (BigQuery Public Dataset)
       ├── 2. BigQuery Analysis
       │   ├── Count total records
       │   └── Calculate 0.1% sample
       ├── 3. Dataflow Processing
       │   ├── Read taxi trip data
       │   ├── Calculate average speeds
       │   └── Write results to GCS
       ├── 4. Dataproc Processing
       │   ├── Spark job execution
       │   ├── Data aggregation
       │   └── Store processed data
       └── 5. Results Aggregation
           ├── Combine all outputs
           └── Generate summary report
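The stages above can be sketched as a sequential driver. In the real system these are Vertex AI Pipelines components connected by dependency edges rather than plain function calls; the stage callables here are hypothetical stubs:

```python
def run_pipeline(bq_count, dataflow_run, dataproc_run):
    """Each argument is a callable for one stage; returns a summary dict."""
    total = bq_count()                  # 2. BigQuery analysis
    sample = int(total * 0.001)         #    0.1% sample
    speeds = dataflow_run(sample)       # 3. Dataflow processing
    aggregates = dataproc_run(sample)   # 4. Dataproc processing
    return {                            # 5. Results aggregation
        "total_records": total,
        "sample_size": sample,
        "avg_speeds": speeds,
        "aggregates": aggregates,
    }
```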

Security Architecture

Network Security

  • VPC with private subnets
  • Firewall rules for service-to-service communication
  • Cloud NAT for outbound internet access

Data Security

  • Encryption at rest (GCS, BigQuery)
  • Encryption in transit (TLS 1.2+)
  • IAM policies with least privilege access

Service Account Security

  • Dedicated service accounts per service
  • Minimal required permissions
  • Key rotation policies

Scalability Considerations

Horizontal Scaling

  • Dataflow auto-scaling based on data volume
  • Dataproc cluster scaling
  • BigQuery slot allocation
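For example, a Beam Python pipeline is typically launched on the Dataflow runner with explicit autoscaling options. The flag names below follow the Beam Python SDK; the script name and worker cap are illustrative:

```sh
python avg_speed_pipeline.py \
  --runner=DataflowRunner \
  --project=prj-gft-vertexai-demo1 \
  --region=europe-west2 \
  --autoscaling_algorithm=THROUGHPUT_BASED \
  --max_num_workers=20
```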

Vertical Scaling

  • Machine type selection for compute-intensive tasks
  • Memory optimization for large datasets

Monitoring & Observability

Logging

  • Cloud Logging for all services
  • Structured logging with correlation IDs
  • Log retention policies
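Structured logging with a correlation ID can be as simple as emitting one JSON object per line, which Cloud Logging parses into structured `LogEntry` payloads. A minimal sketch — the field names are illustrative, not a mandated schema:

```python
import json
import uuid

def log_event(message, correlation_id, severity="INFO", **fields):
    """Emit one structured log line; returns the record for inspection."""
    record = {"message": message, "severity": severity,
              "correlation_id": correlation_id, **fields}
    print(json.dumps(record))
    return record

correlation_id = str(uuid.uuid4())  # one ID shared across a pipeline run
log_event("dataflow step started", correlation_id, step="avg-speed")
```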

Metrics

  • Cloud Monitoring dashboards
  • Custom metrics for pipeline performance
  • Alerting on failures and performance degradation

Tracing

  • Distributed tracing across pipeline components
  • Performance bottleneck identification

Disaster Recovery

Data Backup

  • GCS versioning enabled
  • BigQuery table snapshots
  • Cross-region replication for critical data

Infrastructure Recovery

  • Terraform/Terragrunt for infrastructure as code
  • Automated deployment pipelines
  • Environment-specific configurations

Cost Optimization

Resource Management

  • Auto-scaling for compute resources
  • Spot instances for non-critical workloads
  • Resource scheduling for batch jobs

Storage Optimization

  • Lifecycle policies for GCS buckets
  • BigQuery partitioning and clustering
  • Data archival strategies
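A GCS lifecycle policy is a small JSON document applied per bucket (e.g. with `gsutil lifecycle set`). The sketch below moves objects to Nearline after 30 days and deletes them after a year; the age thresholds are illustrative, not the project's actual policy:

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {"age": 30}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 365}
    }
  ]
}
```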

Future Enhancements

Planned Improvements

  • Multi-region deployment
  • Advanced ML model training
  • Real-time streaming with Pub/Sub
  • Advanced monitoring with custom dashboards
  • Integration with external data sources

Technology Stack Evolution

  • Migration to newer GCP services
  • Adoption of new ML frameworks
  • Enhanced security features
  • Improved CI/CD practices