Commit a07c073

Tutorial: Run SDK for pandas job on ray cluster. (#1616)
* adding tutorial
* updates per review
* Some nitpick changes
* removing unnecessary env var

Co-authored-by: Abdel Jaidi <[email protected]>
1 parent e9b76ed commit a07c073

2 files changed

+226
-0
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -143,6 +143,7 @@ FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
143143
- [031 - OpenSearch](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/031%20-%20OpenSearch.ipynb)
144144
- [032 - Lake Formation Governed Tables](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/032%20-%20Lake%20Formation%20Governed%20Tables.ipynb)
145145
- [033 - Amazon Neptune](https://github.com/aws/aws-sdk-pandas/blob/main/tutorials/033%20-%20Amazon%20Neptune.ipynb)
146+
- [034 - Distributing Calls on Ray Remote Cluster](https://github.com/aws/aws-sdk-pandas/blob/release-3.0.0/tutorials/034%20-%20Distributing%20Calls%20on%20Ray%20Remote%20Cluster.ipynb)
146147
- [**API Reference**](https://aws-sdk-pandas.readthedocs.io/en/3.0.0b1/api.html)
147148
- [Amazon S3](https://aws-sdk-pandas.readthedocs.io/en/3.0.0b1/api.html#amazon-s3)
148149
- [AWS Glue Catalog](https://aws-sdk-pandas.readthedocs.io/en/3.0.0b1/api.html#aws-glue-catalog)
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)\n",
8+
"\n",
9+
"# 34 - Distributing Calls on Ray Remote Cluster\n",
10+
"\n",
11+
"AWS SDK for pandas can distribute specific calls across a cluster of EC2 instances using [Ray](https://docs.ray.io/)."
12+
]
13+
},
14+
{
15+
"cell_type": "code",
16+
"execution_count": 1,
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"\n",
21+
"!pip install \"awswrangler[distributed]==3.0.0b1\""
22+
]
23+
},
24+
{
25+
"cell_type": "markdown",
26+
"metadata": {},
27+
"source": [
28+
"## Configure and Build Ray Cluster on AWS\n",
29+
"\n",
30+
"#### Build Prerequisite Infrastructure\n",
31+
"\n",
32+
"Build a security group and IAM instance profile for the Ray Cluster to use.\n",
33+
"\n",
34+
"[<img src=\"https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png\">](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=RayPrerequisiteInfra&templateURL=https://aws-data-wrangler-public-artifacts.s3.amazonaws.com/cloudformation/ray-prerequisite-infra.json)\n",
35+
"\n",
36+
"#### Configure the Ray Cluster\n",
37+
"Start with a cluster configuration file (YAML)."
38+
]
39+
},
40+
{
41+
"cell_type": "code",
42+
"execution_count": null,
43+
"metadata": {},
44+
"outputs": [],
45+
"source": [
46+
"!touch config.yml"
47+
]
48+
},
49+
{
50+
"cell_type": "markdown",
51+
"metadata": {},
52+
"source": [
53+
"Replace the values in the configuration to match your region, your account ID, and the names of the resources deployed by the CloudFormation stack above.\n",
54+
"\n",
55+
"[Click here](https://console.aws.amazon.com/ec2/home?region=us-east-1#Images:visibility=public-images;search=:ray-amzn-wheels_latest_amzn_ray-1.9.2-cp38;v=3;$case=tags:false%5C,client:false;$regex=tags:false%5C,client:false) to find the Ray AMI for your desired region. The example configuration below uses the AMI for `us-east-1`."
56+
]
57+
},
58+
{
59+
"cell_type": "code",
60+
"execution_count": null,
61+
"metadata": {},
62+
"outputs": [],
63+
"source": [
64+
"cluster_name: pandas-sdk-cluster\n",
65+
"\n",
66+
"initial_workers: 2\n",
67+
"min_workers: 2\n",
68+
"max_workers: 2\n",
69+
"\n",
70+
"provider:\n",
71+
"  type: aws\n",
72+
"  region: us-east-1 # Change AWS region as necessary\n",
73+
"  availability_zone: us-east-1a,us-east-1b,us-east-1c # Change as necessary\n",
74+
"  security_group:\n",
75+
"    GroupName: ray-cluster\n",
76+
"  cache_stopped_nodes: False\n",
77+
"\n",
78+
"available_node_types:\n",
79+
"  ray.head.default:\n",
80+
"    node_config:\n",
81+
"      InstanceType: m4.xlarge\n",
82+
"      IamInstanceProfile:\n",
83+
"        # Replace with your account id and profile name if you did not use the default value\n",
84+
"        Arn: arn:aws:iam::{ACCOUNT ID}:instance-profile/ray-cluster\n",
85+
"      # Replace ImageId if using a different region / python version\n",
86+
"      ImageId: ami-0ea510fcb67686b48\n",
87+
"\n",
88+
"  ray.worker.default:\n",
89+
"    min_workers: 2\n",
90+
"    max_workers: 2\n",
91+
"    node_config:\n",
92+
"      InstanceType: m4.xlarge\n",
93+
"      IamInstanceProfile:\n",
94+
"        # Replace with your account id and profile name if you did not use the default value\n",
95+
"        Arn: arn:aws:iam::{ACCOUNT ID}:instance-profile/ray-cluster\n",
96+
"      # Replace ImageId if using a different region / python version\n",
97+
"      ImageId: ami-0ea510fcb67686b48\n",
98+
"\n",
99+
"\n",
100+
"setup_commands:\n",
101+
"- pip install \"awswrangler[distributed]==3.0.0b1\""
102+
]
103+
},
104+
{
105+
"cell_type": "markdown",
106+
"metadata": {},
107+
"source": [
108+
"#### Provision Ray Cluster\n",
109+
"\n",
110+
"The command below creates a Ray cluster in your account based on the aforementioned config file. The cluster consists of one head node and two workers (m4.xlarge EC2 instances)."
111+
]
112+
},
113+
{
114+
"cell_type": "code",
115+
"execution_count": null,
116+
"metadata": {},
117+
"outputs": [],
118+
"source": [
119+
"!ray up -y config.yml"
120+
]
121+
},
122+
{
123+
"cell_type": "markdown",
124+
"metadata": {},
125+
"source": [
126+
"Once the cluster is up and running, set the `WR_ADDRESS` environment variable to the Ray address of the cluster's head node."
127+
]
128+
},
129+
{
130+
"cell_type": "code",
131+
"execution_count": null,
132+
"metadata": {},
133+
"outputs": [],
134+
"source": [
135+
"head_ip = !ray get-head-ip config.yml | tail -1\n",
"%env WR_ADDRESS=ray://{head_ip[0]}:10001"
136+
]
137+
},
138+
{
139+
"cell_type": "markdown",
140+
"metadata": {},
141+
"source": [
142+
"As a result, `awswrangler` API calls now run on the cluster, not on your local machine. The SDK detects the required dependencies for its `distributed` mode and parallelizes supported methods on the cluster."
143+
]
144+
},
145+
{
146+
"cell_type": "code",
147+
"execution_count": null,
148+
"metadata": {},
149+
"outputs": [],
150+
"source": [
151+
"import awswrangler as wr\n",
152+
"print(f\"Distributed Mode: {wr.config.distributed}\")"
153+
]
154+
},
155+
{
156+
"cell_type": "markdown",
157+
"metadata": {},
158+
"source": [
159+
"Enter the name of the S3 bucket to write to."
160+
]
161+
},
162+
{
163+
"cell_type": "code",
164+
"execution_count": null,
165+
"metadata": {},
166+
"outputs": [],
167+
"source": [
168+
"import getpass\n",
169+
"\n",
170+
"bucket = getpass.getpass()"
171+
]
172+
},
173+
{
174+
"cell_type": "markdown",
175+
"metadata": {},
176+
"source": [
177+
"Read and write some data at scale on the cluster."
178+
]
179+
},
180+
{
181+
"cell_type": "code",
182+
"execution_count": null,
183+
"metadata": {},
184+
"outputs": [],
185+
"source": [
186+
"df = wr.s3.read_parquet(path=\"s3://ursa-labs-taxi-data/2010/1*.parquet\", parallelism=1000)\n",
187+
"path = f\"s3://{bucket}/taxi-data/\"\n",
188+
"wr.s3.to_parquet(df, path=path)"
189+
]
190+
},
191+
{
192+
"cell_type": "markdown",
193+
"metadata": {},
194+
"source": [
195+
"##### [More Info on Ray Clusters on AWS](https://docs.ray.io/en/latest/cluster/vms/getting-started.html#launch-a-cluster-on-a-cloud-provider)"
196+
]
197+
}
198+
],
199+
"metadata": {
200+
"kernelspec": {
201+
"display_name": "Python 3.9.13 ('awswrangler-mo8sEp3D-py3.9')",
202+
"language": "python",
203+
"name": "python3"
204+
},
205+
"language_info": {
206+
"codemirror_mode": {
207+
"name": "ipython",
208+
"version": 3
209+
},
210+
"file_extension": ".py",
211+
"mimetype": "text/x-python",
212+
"name": "python",
213+
"nbconvert_exporter": "python",
214+
"pygments_lexer": "ipython3",
215+
"version": "3.9.13"
216+
},
217+
"vscode": {
218+
"interpreter": {
219+
"hash": "abf31c45c41a2718a2f25e3a2e428f2a986d4fe24d411f7f5e3ce0fef626968d"
220+
}
221+
}
222+
},
223+
"nbformat": 4,
224+
"nbformat_minor": 4
225+
}
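
The notebook asks the reader to replace the placeholder values in `config.yml` before running `ray up`. A small check like the one below could catch a leftover placeholder such as `{ACCOUNT ID}` before the cluster is provisioned. This is only a sketch: the helper name `find_placeholders` and the assumption that placeholders are written as upper-case words in curly braces are illustrative and not part of the tutorial.

```python
import re
from typing import List, Tuple

# Assumption: placeholders look like "{ACCOUNT ID}", i.e. upper-case words in braces.
PLACEHOLDER = re.compile(r"\{[A-Z][A-Z _]*\}")


def find_placeholders(config_text: str) -> List[Tuple[int, str]]:
    """Return (line number, placeholder) pairs still present in the config text."""
    found = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((number, match.group(0)))
    return found


sample = (
    "Arn: arn:aws:iam::{ACCOUNT ID}:instance-profile/ray-cluster\n"
    "ImageId: ami-0ea510fcb67686b48\n"
)
print(find_placeholders(sample))  # [(1, '{ACCOUNT ID}')]
```

In practice the text would come from reading `config.yml`, and an empty result would mean no placeholder of this shape remains.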

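Outside a notebook, the `WR_ADDRESS` step can also be done from plain Python. The sketch below assumes, as the `tail -1` in the notebook implies, that the head node IP is the last non-empty line printed by `ray get-head-ip`, and that the Ray client port is `10001` as in the tutorial. The function names are hypothetical, and setting the variable before `awswrangler` is imported is an assumption about when the library reads its configuration.

```python
import os
import subprocess


def ray_address_from_cli_output(output: str, port: int = 10001) -> str:
    """Build a Ray client address from the output of `ray get-head-ip`.

    Assumes the head node IP is the last non-empty line of the output.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no head node IP found in CLI output")
    return f"ray://{lines[-1]}:{port}"


def set_wr_address(config_path: str = "config.yml") -> str:
    """Query the head node IP and set WR_ADDRESS for the current process."""
    result = subprocess.run(
        ["ray", "get-head-ip", config_path],
        check=True,
        capture_output=True,
        text=True,
    )
    address = ray_address_from_cli_output(result.stdout)
    os.environ["WR_ADDRESS"] = address
    return address
```

Calling `set_wr_address()` requires the `ray` CLI and a running cluster. The parsing helper can be exercised on its own with any captured output.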