"""
Machine Learning Operations Playbook Adoption Workshop – Phase 2:
Data Services Integration Architecture – Scavenger Hunt

File: custom_or_prebuilt_components.py

Purpose:
--------
This file provides scavenger hunt instructions for learners to explore
the existing AWS-based labs (Lab 6.1 and Lab 6.2) and plan their migration
to a Vertex AI architecture. The focus is on the Amazon S3 and Amazon
Redshift integration patterns and on preparing to convert them into
Vertex AI components using Kubeflow @component decorators.

Learners should use VSCode/PyCharm search to locate the TODO markers
listed below and record WHERE (line of code), WHAT (purpose), and WHY
(rationale for migration). This stage is planning-only: the code remains
in its AWS state, but learners should envision how it will map to
Vertex AI components.

Target Vertex Architecture Structure:
-------------------------------------
├── src/
│   ├── components/
│   │   ├── __init__.py
│   │   │
│   │   ├── custom_data_quality_components.py            # ✅ Custom
│   │   ├── custom_training_components.py                # ✅ Custom
│   │   ├── custom_evaluation_components.py              # ✅ Custom
│   │   ├── custom_registry_components.py                # ✅ Custom
│   │   ├── custom_monitoring_components.py              # ✅ Custom
│   │   ├── custom_audit_components.py                   # ✅ Custom
│   │   ├── custom_sysco_modelplaceholder_components.py  # ✅ Custom
│   │   │
│   │   └── prebuilt_bigquery_components.py              # ✅ Pre-built

Scavenger Hunt Instructions:
----------------------------

1. Lab 6.1 — Amazon S3 Integration with SageMaker Workflows
   - Search: "# TODO: Lab 6.1.1 - Line-by-Line Import Exploration"
     * WHERE: Top of the model.py imports
     * WHAT: Identify the boto3/joblib imports
     * WHY: These libraries enable artifact persistence in S3
     * Migration Planning: In Vertex AI, this logic would move into
       custom_training_components.py with @component decorators, using
       GCS (gs:// URIs) instead of S3.
   - Search: "# TODO: Lab 6.1.4 - S3 Data Loading Conversion"
     * WHERE: The _s3_persist() function in model.py
     * WHAT: Inspect the boto3 upload_file usage
     * WHY: Durable storage pattern in AWS
     * Migration Planning: Replace it with GCS client logic inside a
       @component in custom_registry_components.py.
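   The S3 → GCS replacement described above can be sketched as a
   component-style function. This is a minimal illustration, not code from
   the labs: the function names, bucket, and paths are hypothetical, and
   the google-cloud-storage import sits inside the function body, as the
   Kubeflow component convention requires.

```python
# Minimal sketch of the planned replacement for _s3_persist().
# In the final pipeline this function would carry a @kfp.dsl.component
# decorator; all names below are illustrative.

def gcs_artifact_uri(bucket_name: str, blob_path: str) -> str:
    # Build the gs:// URI that replaces the old s3:// artifact path.
    return f"gs://{bucket_name}/{blob_path}"

def persist_model_to_gcs(local_path: str, bucket_name: str, blob_path: str) -> str:
    # Replaces boto3's upload_file with the google-cloud-storage client.
    # KFP convention: third-party imports live inside the component body.
    from google.cloud import storage

    client = storage.Client()
    client.bucket(bucket_name).blob(blob_path).upload_from_filename(local_path)
    return gcs_artifact_uri(bucket_name, blob_path)
```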

   AWS Information to Gather for Migration:
   - S3 bucket name (e.g., `my-ml-artifacts-bucket`)
   - Bucket region (e.g., `us-east-1`)
   - IAM role or access keys with `AmazonS3FullAccess`
   - Artifact paths (prefixes such as `s3://bucket/models/`)
   - Current SageMaker registry integration points

   Equivalent in GCP:
   - GCS bucket name (e.g., `gs://my-ml-artifacts`)
   - GCP project ID and region
   - Service account with the `Storage Admin` role
   - Artifact paths in GCS (prefixes such as `gs://bucket/models/`)
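   One way to keep the gathered AWS values and their GCP equivalents side
   by side is a small planning dictionary. Every value below is a
   placeholder copied from the examples above; the GCP region is an
   assumption.

```python
# Planning aid only: placeholder values copied from the lists above.
S3_TO_GCS_SETTINGS = {
    "bucket": {"aws": "my-ml-artifacts-bucket", "gcp": "gs://my-ml-artifacts"},
    "region": {"aws": "us-east-1", "gcp": "us-central1"},  # GCP region assumed
    "access": {"aws": "AmazonS3FullAccess", "gcp": "Storage Admin"},
    "artifact_prefix": {"aws": "s3://bucket/models/", "gcp": "gs://bucket/models/"},
}
```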

2. Lab 6.2 — Amazon Redshift Data Pipeline and ML Integration
   - Search: "# TODO: Lab 6.2.1 - Data Access Pattern Conversion"
     * WHERE: _read_from_redshift() in ingest_model.py
     * WHAT: Inspect the select_sql_from_dict or pd.read_sql usage
     * WHY: Redshift → DataFrame conversion
     * Migration Planning: The equivalent logic would move into
       prebuilt_bigquery_components.py using BigQuery query components.
   - Search: "# TODO: Lab 6.2.4 - Data Movement and Performance Considerations"
     * WHERE: stage_table_to_s3() in ingest_model.py
     * WHAT: Inspect the UNLOAD vs client-side upload patterns
     * WHY: Efficiency vs cost trade-offs in Redshift
     * Migration Planning: Replace them with BigQuery export jobs inside
       prebuilt_bigquery_components.py or custom_data_quality_components.py.
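   The Redshift → DataFrame conversion above can likewise be sketched as a
   component-style function. Function and table names are illustrative,
   not taken from the labs.

```python
# Minimal sketch of the planned replacement for _read_from_redshift().

def bq_table_ref(project_id: str, dataset: str, table: str) -> str:
    # Fully qualified BigQuery name that replaces Redshift's schema.table.
    return f"{project_id}.{dataset}.{table}"

def read_training_frame(project_id: str, query: str):
    # Replaces pd.read_sql over a Redshift connection with the
    # google-cloud-bigquery client; returns a pandas DataFrame.
    # KFP convention: third-party imports live inside the component body.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return client.query(query).to_dataframe()
```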

   AWS Information to Gather for Migration:
   - Redshift cluster identifier (e.g., `redshift-cluster-1`)
   - Database name (e.g., `analytics_db`)
   - Schema names (e.g., `public`, `ml_features`)
   - User credentials or an IAM role with Redshift access
   - Connection endpoint (host, port)
   - Common SQL queries used for ETL (COPY, UNLOAD, CTAS)

   Equivalent in GCP:
   - BigQuery dataset name (e.g., `ml_features_dataset`)
   - BigQuery table names (e.g., `training_data`, `evaluation_data`)
   - GCP project ID and region
   - Service account with the `BigQuery Admin` role
   - SQL queries adapted to BigQuery syntax (SELECT, CREATE TABLE AS)
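   To make the "SQL queries adapted to BigQuery syntax" item concrete,
   here is a hypothetical CTAS pair; the project, table, and column names
   are invented for illustration.

```python
# Hypothetical CTAS pair; project, table, and column names are invented.
REDSHIFT_CTAS = (
    "CREATE TABLE ml_features.training_data AS "
    "SELECT * FROM public.raw_events WHERE label IS NOT NULL;"
)

# BigQuery uses backticked project.dataset.table references instead of
# Redshift's schema-qualified names.
BIGQUERY_CTAS = (
    "CREATE TABLE `my-project.ml_features_dataset.training_data` AS "
    "SELECT * FROM `my-project.ml_features_dataset.raw_events` "
    "WHERE label IS NOT NULL;"
)
```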

3. Planning Migration with Vertex Kubeflow @component Decorators
   - For S3 → GCS:
     * Wrap the artifact persistence logic in @component functions inside
       custom_training_components.py and custom_registry_components.py.
     * Replace boto3 calls with google-cloud-storage client calls.
   - For Redshift → BigQuery:
     * Wrap the ETL and query logic in @component functions inside
       prebuilt_bigquery_components.py.
     * Replace psycopg2/sqlalchemy calls with the google-cloud-bigquery
       client or prebuilt BigQuery components.
```
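   The decorator shape both migrations share can be drafted before kfp is
   installed by using a stand-in. The stand-in below only mimics the call
   shape of kfp.dsl.component and records the declared settings; it is not
   the real kfp API, and all names are illustrative.

```python
# Planning-phase stand-in that mimics the shape of kfp.dsl.component.
# It only records the declared settings so component signatures can be
# drafted before kfp is installed. All names are illustrative.
def component(base_image: str = "python:3.10", packages_to_install=None):
    def _wrap(fn):
        fn.component_spec = {
            "base_image": base_image,
            "packages_to_install": packages_to_install or [],
        }
        return fn
    return _wrap

@component(packages_to_install=["google-cloud-storage"])
def copy_model_to_gcs(local_path: str, bucket_name: str, blob_path: str) -> str:
    # The real body would import google.cloud.storage here and upload.
    return f"gs://{bucket_name}/{blob_path}"
```

   Swapping the stand-in for `from kfp.dsl import component` would turn
   each drafted function into a real pipeline component.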

Learner Deliverable:
--------------------
For each TODO marker found:
- Record WHERE: the file name and line of code
- Record WHAT: the code pattern or component
- Record WHY: the rationale for its use in AWS
- Record the Migration Plan: which Vertex component file it would map to
  (custom_* or prebuilt_bigquery_components.py) with a @component decorator
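A simple template for recording each finding, with one example entry whose
values are illustrative rather than answers:

```python
# Hypothetical template for one scavenger hunt finding.
def record_finding(where: str, what: str, why: str, migration_plan: str) -> dict:
    # Capture WHERE / WHAT / WHY / Migration Plan for one TODO marker.
    return {
        "where": where,
        "what": what,
        "why": why,
        "migration_plan": migration_plan,
    }

example = record_finding(
    where="model.py: _s3_persist()",
    what="boto3 upload_file call",
    why="Durable artifact storage pattern in AWS",
    migration_plan="custom_registry_components.py (@component)",
)
```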

This scavenger hunt prepares learners to design the component structure
shown above by understanding the migration path from AWS (S3, Redshift)
to Vertex AI (GCS, BigQuery, Pipeline components).
"""