Commit 104c0f5

Merge branch 'AIML-Developer-04/Feature-Branch' of https://github.com/CloudLearningSolution/targeted-training-repo-week5-vertexaipipeline into AIML-Developer-04/Feature-Branch
2 parents 59e4cd2 + bf3e1df

File tree: 4 files changed, +220 -123 lines changed
custom_or_prebuilt_components.py

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
"""
Machine Learning Operations Playbook Adoption Workshop – Phase 2:
Data Services Integration Architecture - Scavenger Hunt

File: custom_or_prebuilt_components.py

Purpose:
--------
This file provides scavenger hunt instructions for learners to explore the
existing AWS-based labs (Lab 6.1 and Lab 6.2) and plan their migration to a
Vertex AI architecture. The focus is on Amazon S3 and Amazon Redshift
integration patterns, and on preparing to convert them into Vertex AI
components using Kubeflow @component decorators.

Learners should use VSCode/PyCharm search to locate the TODO markers
listed below and record WHERE (line of code), WHAT (purpose), and WHY
(rationale for migration). This stage is planning-only: the code remains in
its AWS state, but learners should envision how it will map to Vertex AI
components.

Target Vertex Architecture Structure:
-------------------------------------
├── src/
│   ├── components/
│   │   ├── __init__.py
│   │   │
│   │   ├── custom_data_quality_components.py           # ✅ Custom
│   │   ├── custom_training_components.py               # ✅ Custom
│   │   ├── custom_evaluation_components.py             # ✅ Custom
│   │   ├── custom_registry_components.py               # ✅ Custom
│   │   ├── custom_monitoring_components.py             # ✅ Custom
│   │   ├── custom_audit_components.py                  # ✅ Custom
│   │   ├── custom_sysco_modelplaceholder_components.py # ✅ Custom
│   │   │
│   │   └── prebuilt_bigquery_components.py             # ✅ Pre-built

Scavenger Hunt Instructions:
----------------------------

1. Lab 6.1 — Amazon S3 Integration with SageMaker Workflows
   - Search: "# TODO: Lab 6.1.1 - Line-by-Line Import Exploration"
     * WHERE: Top of model.py imports
     * WHAT: Identify the boto3/joblib imports
     * WHY: These libraries enable artifact persistence in S3
     * Migration Planning: In Vertex AI, this logic would move into
       custom_training_components.py with @component decorators, using
       GCS (gs:// URIs) instead of S3.
   - Search: "# TODO: Lab 6.1.4 - S3 Data Loading Conversion"
     * WHERE: _s3_persist() function in model.py
     * WHAT: Inspect the boto3 upload_file usage
     * WHY: Durable storage pattern in AWS
     * Migration Planning: Replace with GCS client logic inside a
       @component in custom_registry_components.py.

   AWS Information to Gather for Migration:
   - S3 bucket name (e.g., `my-ml-artifacts-bucket`)
   - Bucket region (e.g., `us-east-1`)
   - IAM role or access keys with `AmazonS3FullAccess`
   - Artifact paths (prefixes such as `s3://bucket/models/`)
   - Current SageMaker registry integration points

   Equivalent in GCP:
   - GCS bucket name (e.g., `gs://my-ml-artifacts`)
   - GCP project ID and region
   - Service account with the `Storage Admin` role
   - Artifact paths in GCS (prefixes such as `gs://bucket/models/`)
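As a planning aid for the artifact paths listed above, the S3-to-GCS mapping can be sketched as a small helper. This is illustrative only: the helper is not part of the labs, and the bucket names mirror the examples given above.

```python
def s3_to_gcs_uri(s3_uri: str, gcs_bucket: str) -> str:
    """Translate an s3:// artifact path to its planned gs:// location.

    The object key (everything after the bucket name) is kept unchanged, so
    s3://bucket/models/model.joblib maps to gs://<gcs_bucket>/models/model.joblib.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_uri!r}")
    # Drop the scheme, then split off the S3 bucket name to recover the key.
    _, _, remainder = s3_uri.partition("s3://")
    _, _, key = remainder.partition("/")
    return f"gs://{gcs_bucket}/{key}"
```

For example, `s3_to_gcs_uri("s3://my-ml-artifacts-bucket/models/model.joblib", "my-ml-artifacts")` yields `gs://my-ml-artifacts/models/model.joblib`, which is the kind of WHERE/Migration Plan pair the deliverable below asks learners to record.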
2. Lab 6.2 — Amazon Redshift Data Pipeline and ML Integration
   - Search: "# TODO: Lab 6.2.1 - Data Access Pattern Conversion"
     * WHERE: _read_from_redshift() in ingest_model.py
     * WHAT: Inspect the select_sql_from_dict or pd.read_sql usage
     * WHY: Redshift → DataFrame conversion
     * Migration Planning: Equivalent logic would move into
       prebuilt_bigquery_components.py using BigQuery query components.
   - Search: "# TODO: Lab 6.2.4 - Data Movement and Performance Considerations"
     * WHERE: stage_table_to_s3() in ingest_model.py
     * WHAT: Inspect UNLOAD vs. client-side upload patterns
     * WHY: Efficiency vs. cost trade-offs in Redshift
     * Migration Planning: Replace with BigQuery export jobs inside
       prebuilt_bigquery_components.py or custom_data_quality_components.py.

   AWS Information to Gather for Migration:
   - Redshift cluster identifier (e.g., `redshift-cluster-1`)
   - Database name (e.g., `analytics_db`)
   - Schema names (e.g., `public`, `ml_features`)
   - User credentials or IAM role with Redshift access
   - Connection endpoint (host, port)
   - Common SQL queries used for ETL (COPY, UNLOAD, CTAS)

   Equivalent in GCP:
   - BigQuery dataset name (e.g., `ml_features_dataset`)
   - BigQuery table names (e.g., `training_data`, `evaluation_data`)
   - GCP project ID and region
   - Service account with the `BigQuery Admin` role
   - SQL queries adapted to BigQuery syntax (SELECT, CREATE TABLE AS)
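To make the Lab 6.2 mapping concrete, the planned BigQuery counterparts of `_read_from_redshift()` and `stage_table_to_s3()` might look roughly like this. The function names and the query shape are assumptions for illustration; only `build_bq_query` runs without GCP credentials, so the client imports are deferred on purpose.

```python
def build_bq_query(dataset: str, table: str, limit: int = 1000) -> str:
    """Assemble the BigQuery Standard SQL equivalent of the Redshift read."""
    return f"SELECT * FROM `{dataset}.{table}` LIMIT {limit}"


def read_from_bigquery(project: str, dataset: str, table: str):
    """Sketch of a _read_from_redshift() replacement: query -> DataFrame."""
    from google.cloud import bigquery  # deferred: needs credentials at call time

    client = bigquery.Client(project=project)
    # to_dataframe() relies on db-dtypes and pyarrow (pinned in requirements.txt).
    return client.query(build_bq_query(dataset, table)).to_dataframe()


def export_table_to_gcs(project: str, dataset: str, table: str, gcs_uri: str):
    """Sketch of a stage_table_to_s3() replacement: a server-side export job,
    analogous to Redshift UNLOAD rather than a client-side upload."""
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return client.extract_table(f"{project}.{dataset}.{table}", gcs_uri).result()
```

The export path mirrors the UNLOAD-vs-client-upload trade-off flagged in Lab 6.2.4: `extract_table` keeps the data movement inside BigQuery rather than round-tripping it through the caller.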
3. Planning Migration with Vertex Kubeflow @component Decorators
   - For S3 → GCS:
     * Wrap artifact persistence logic in @component functions inside
       custom_training_components.py and custom_registry_components.py.
     * Replace boto3 calls with google-cloud-storage client calls.
   - For Redshift → BigQuery:
     * Wrap ETL and query logic in @component functions inside
       prebuilt_bigquery_components.py.
     * Replace psycopg2/sqlalchemy calls with the google-cloud-bigquery
       client or prebuilt BigQuery components.

Learner Deliverable:
--------------------
For each TODO marker found:
- Record WHERE: file name and line of code
- Record WHAT: the code pattern or component
- Record WHY: the rationale for its use in AWS
- Record Migration Plan: which Vertex component file it would map to
  (custom_* or prebuilt_bigquery_components.py) with an @component decorator

This scavenger hunt prepares learners to design the component structure
shown above by understanding the migration path from AWS (S3, Redshift)
to Vertex AI (GCS, BigQuery, Pipeline components).
"""

learn/hands_on_exercise.py

Lines changed: 0 additions & 20 deletions
This file was deleted.

requirements.txt

Lines changed: 100 additions & 3 deletions
@@ -1,3 +1,100 @@
Removed (the root file previously delegated to src/requirements.txt):
# This file points to the actual requirements in src/
# All dependencies are managed in src/requirements.txt
-r src/requirements.txt

Added:
# ============================================================================
# Core ML & Data Science Libraries
# ============================================================================
# NumPy - Numerical computing (used by scikit-learn, pandas)
numpy==1.23.5

# Pandas - Data manipulation (used in all components for DataFrame operations)
pandas==1.5.3

# Scikit-learn - Machine learning (LogisticRegression, accuracy_score)
scikit-learn==1.2.2

# Joblib - Model serialization (save/load models as .joblib)
joblib==1.2.0

# ============================================================================
# Google Cloud Platform Libraries
# ============================================================================
# Google Cloud Storage - GCS bucket operations (pipeline artifacts)
google-cloud-storage==2.10.0

# Google Cloud AI Platform - Vertex AI SDK (pipeline execution, model registry)
google-cloud-aiplatform==1.56.0

# Google Cloud BigQuery - BigQuery client (data reading in components)
google-cloud-bigquery==3.11.4

# BigQuery data types - Required for BigQuery DataFrame compatibility
db-dtypes==1.1.1

# PyArrow - Efficient data serialization for BigQuery
pyarrow==12.0.1

# ============================================================================
# Kubeflow Pipelines (KFP) v2
# ============================================================================
# KFP SDK - Pipeline definition and component creation
kfp==2.7.0

# Google Cloud Pipeline Components - Pre-built GCP components (BigQuery Query Job)
google-cloud-pipeline-components==2.14.1

# ============================================================================
# Testing Libraries (for tests/unit/)
# ============================================================================
# Pytest - Test framework
pytest==7.4.3

# Pytest-cov - Code coverage reporting
pytest-cov==4.1.0

# Pytest-mock - Enhanced mocking capabilities
pytest-mock==3.12.0

# ============================================================================
# Code Quality & Linting (CI/CD)
# ============================================================================
# Flake8 - Python linting (enforces PEP 8, detects errors)
flake8==6.1.0

# Black - Code formatter (consistent code style)
black==23.12.1

# isort - Import sorting (organized imports)
isort==5.13.2

# ============================================================================
# YAML Processing (for governance/resource_auditor.py)
# ============================================================================
# PyYAML - YAML parsing (used in the resource auditor for pipeline YAML)
PyYAML==6.0.1

# ============================================================================
# Type Checking (Optional - for development)
# ============================================================================
# Mypy - Static type checking
# mypy==1.7.1

# ============================================================================
# Documentation (Optional - for future Sphinx docs)
# ============================================================================
# Sphinx - Documentation generation
# sphinx==7.2.6
# sphinx-rtd-theme==2.0.0

# ============================================================================
# Version Compatibility Notes
# ============================================================================
# - All versions tested with Python 3.9
# - Compatible with base_image="python:3.9" in KFP components
# - google-cloud-aiplatform 1.56.0 supports KFP 2.7.0
# - google-cloud-pipeline-components 2.14.1 requires KFP 2.7.0
# - BigQuery libraries (google-cloud-bigquery, db-dtypes, pyarrow) work together
# - Pytest versions compatible with Python 3.9
#
# Installation (src/requirements.txt was removed in this commit, so install
# from the repository root):
#   pip install -r requirements.txt
#
# Update all packages (use with caution):
#   pip install --upgrade -r requirements.txt
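The Version Compatibility Notes above can be spot-checked programmatically. A minimal sketch using only the standard library (the helper name is made up for illustration):

```python
from importlib import metadata


def check_pins(requirements_text: str) -> dict:
    """Map each name==version pin to a (pinned, installed-or-None) pair.

    Comment lines (including commented-out pins such as mypy) are skipped,
    so the result covers only the active requirements.
    """
    results = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, pinned = line.partition("==")
        try:
            installed = metadata.version(name.strip())
        except metadata.PackageNotFoundError:
            installed = None  # pinned but not installed in this environment
        results[name.strip()] = (pinned.strip(), installed)
    return results
```

Feeding this the requirements file contents would flag any package whose installed version drifts from its pin, which is the failure mode the compatibility notes (e.g., kfp 2.7.0 vs. google-cloud-pipeline-components 2.14.1) warn about.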

src/requirements.txt

Lines changed: 0 additions & 100 deletions
This file was deleted.
