
Commit 65305c2

Ray Load Tests CDK Stack and Instructions for Load Testing (#1583)

* adding load test instructions and ray stack
* flake8
* black
* isort
* Tutorials updating paths (#1584)
* sync
* sync
* sync
* fixing pip install syntax
* updating region env var
* pylint

1 parent f877a0e · commit 65305c2
File tree

8 files changed: +184 −23 lines

CONTRIBUTING.md

Lines changed: 116 additions & 17 deletions
@@ -94,13 +94,6 @@ You can choose from three different environments to test your fixes/changes, bas
 * Pick up a Linux or MacOS.
 * Install Python 3.7, 3.8 or 3.9 with [poetry](https://github.com/python-poetry/poetry) for package management
 * Fork the AWS SDK for pandas repository and clone that into your development environment
-* Go to the project's directory create a Python's virtual environment for the project
-
-`python3 -m venv .venv && source .venv/bin/activate`
-
-or
-
-`python -m venv .venv && source .venv/bin/activate`
 
 * Install dependencies:
 
@@ -125,13 +118,6 @@ or
 * Pick up a Linux or MacOS.
 * Install Python 3.7, 3.8 or 3.9 with [poetry](https://github.com/python-poetry/poetry) for package management
 * Fork the AWS SDK for pandas repository and clone that into your development environment
-* Go to the project's directory create a Python's virtual environment for the project
-
-`python3 -m venv .venv && source .venv/bin/activate`
-
-or
-
-`python -m venv .venv && source .venv/bin/activate`
 
 * Install dependencies:
 
@@ -186,9 +172,6 @@ or
 * Pick up a Linux or MacOS.
 * Install Python 3.7, 3.8 or 3.9 with [poetry](https://github.com/python-poetry/poetry) for package management
 * Fork the AWS SDK for pandas repository and clone that into your development environment
-* Go to the project's directory create a Python's virtual environment for the project
-
-`python -m venv .venv && source .venv/bin/activate`
 
 * Then run the command below to install all dependencies:
 
@@ -262,6 +245,122 @@ or
 
 ``./test_infra/scripts/delete-stack.sh databases``
 
+## Ray Load Tests Environment
+**DISCLAIMER**: Make sure you know what you are doing. These steps will incur charges for some services on your AWS account and require minimum security skills to keep your environment safe.
+
+* Pick up a Linux or MacOS.
+* Install Python 3.7, 3.8 or 3.9 with [poetry](https://github.com/python-poetry/poetry) for package management
+* Fork the AWS SDK for pandas repository and clone that into your development environment
+
+* Then run the command below to install all dependencies:
+
+``poetry install``
+
+* Go to the ``test_infra`` directory
+
+``cd test_infra``
+
+* Install CDK dependencies:
+
+``poetry install``
+
+* [OPTIONAL] Set AWS_DEFAULT_REGION to define the region the Ray Test environment will deploy into. You may want to choose a region which you don't currently use:
+
+``export AWS_DEFAULT_REGION=ap-northeast-1``
+
+* Go to the ``scripts`` directory
+
+``cd scripts``
+
+* Deploy the `ray` CDK stack.
+
+``./deploy-stack.sh ray``
+
+* Configure Ray Cluster
+
+``vi ray-cluster-config.yaml``
+
+```
+# Update the following file to match your environment
+# The following is an example
+cluster_name: ray-cluster
+
+initial_workers: 2
+min_workers: 2
+max_workers: 2
+
+provider:
+  type: aws
+  region: us-east-1 # change region as required
+  availability_zone: us-east-1a,us-east-1b,us-east-1c # change azs as required
+  security_group:
+    GroupName: ray_client_security_group
+  cache_stopped_nodes: False
+
+available_node_types:
+  ray.head.default:
+    node_config:
+      InstanceType: r5n.2xlarge # change instance type as required
+      IamInstanceProfile:
+        Arn: arn:aws:iam::{UPDATE YOUR ACCOUNT ID HERE}:instance-profile/ray-cluster-instance-profile
+      ImageId: ami-0ea510fcb67686b48 # latest ray images -> https://github.com/amzn/amazon-ray#amazon-ray-images
+      NetworkInterfaces:
+        - AssociatePublicIpAddress: True
+          SubnetId: {replace with subnet within above AZs}
+          Groups: [{ID of group `ray_client_security_group` created by the step above}]
+          DeviceIndex: 0
+
+  ray.worker.default:
+    min_workers: 2
+    max_workers: 2
+    node_config:
+      InstanceType: r5n.2xlarge
+      IamInstanceProfile:
+        Arn: arn:aws:iam::{UPDATE YOUR ACCOUNT ID HERE}:instance-profile/ray-cluster-instance-profile
+      ImageId: ami-0ea510fcb67686b48 # latest ray images -> https://github.com/amzn/amazon-ray#amazon-ray-images
+      NetworkInterfaces:
+        - AssociatePublicIpAddress: True
+          SubnetId: {replace with subnet within above AZs}
+          Groups: [{ID of group `ray_client_security_group` created by the step above}]
+          DeviceIndex: 0
+
+setup_commands:
+  - pip install "awswrangler[distributed]==3.0.0a2"
+  - pip install pytest
+
+```
+
+* Create Ray Cluster
+
+``ray up -y ray-cluster-config.yaml``
+
+* Push Load Tests to Ray Cluster
+
+``ray rsync-up ray-cluster-config.yaml tests/load /home/ubuntu/``
+
+* Submit Pytest Run to Ray Cluster
+
+```
+echo '''
+import os
+
+import pytest
+
+args = "-v load/"
+
+if not os.getenv("AWS_DEFAULT_REGION"):
+    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # Set your region as necessary
+
+result = pytest.main(args.split(" "))
+
+print(f"result: {result}")
+''' > handler.py
+ray submit ray-cluster-config.yaml handler.py
+```
+
+* Teardown Cluster
+
+``ray down -y ray-cluster-config.yaml``
+
+[More on launching Ray Clusters on AWS](https://docs.ray.io/en/master/cluster/vms/user-guides/launching-clusters/aws.html#)
+
 
 ## Recommended Visual Studio Code Recommended setting
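Before submitting the full load-test suite, it can help to confirm the cluster accepts work at all. Below is a minimal sanity-check sketch, not part of this commit, assuming Ray Client is reachable on the head node's public IP (printed by ``ray up``) at its default port 10001; the placeholder address is yours to fill in.

```python
# Minimal connectivity check (assumed setup): connect via Ray Client and
# run one trivial remote task. Port 10001 is Ray Client's default.
import ray

ray.init(address="ray://<head-node-public-ip>:10001")  # placeholder address

@ray.remote
def ping() -> str:
    return "pong"

print(ray.get(ping.remote()))  # "pong" confirms tasks execute on the cluster
ray.shutdown()
```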

test_infra/app.py

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,7 @@
 from stacks.databases_stack import DatabasesStack
 from stacks.lakeformation_stack import LakeFormationStack
 from stacks.opensearch_stack import OpenSearchStack
+from stacks.ray_stack import RayStack
 
 app = App()
 
@@ -27,4 +28,7 @@
     base.get_key,
 )
 
+RayStack(app, "aws-sdk-pandas-ray")
+
+
 app.synth()
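A quick way to verify the stack registers and synthesizes cleanly is a synth-time smoke test. This is a sketch, not part of the commit, using aws-cdk-lib's assertions module against the resources the stack defines below.

```python
# Sketch: assert the synthesized RayStack template contains the expected
# IAM resources (one role, two inline policies, one instance profile).
from aws_cdk import App
from aws_cdk.assertions import Template

from stacks.ray_stack import RayStack

app = App()
template = Template.from_stack(RayStack(app, "aws-sdk-pandas-ray"))

template.resource_count_is("AWS::IAM::Role", 1)
template.resource_count_is("AWS::IAM::Policy", 2)
template.resource_count_is("AWS::IAM::InstanceProfile", 1)
```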

test_infra/stacks/ray_stack.py

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
+from aws_cdk import Stack
+from aws_cdk import aws_iam as iam
+from constructs import Construct
+
+
+class RayStack(Stack):  # type: ignore
+    def __init__(self, scope: Construct, construct_id: str, **kwargs: str) -> None:
+        """
+        Ray Cluster Infrastructure.
+
+        Includes IAM role and instance profile.
+        """
+        super().__init__(scope, construct_id, **kwargs)
+
+        # Ray execution role
+        ray_exec_role = iam.Role(
+            self,
+            "ray-execution-role",
+            assumed_by=iam.ServicePrincipal("ec2.amazonaws.com"),
+            managed_policies=[
+                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonEC2FullAccess"),
+                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3FullAccess"),
+                iam.ManagedPolicy.from_aws_managed_policy_name("CloudWatchFullAccess"),
+                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSSMFullAccess"),
+            ],
+        )
+
+        # Add IAM pass role for a head instance to launch worker nodes
+        # w/ an instance profile
+        iam.Policy(
+            self,
+            "ray-execution-role-policy-pass-role",
+            policy_name="IAMPassRole",
+            roles=[ray_exec_role],
+            statements=[
+                iam.PolicyStatement(
+                    effect=iam.Effect.ALLOW, actions=["iam:PassRole"], resources=[ray_exec_role.role_arn]
+                ),
+            ],
+        )
+
+        # Add additional permissions for Pandas SDK Load Tests
+        iam.Policy(
+            self,
+            "ray-load-test-permissions",
+            policy_name="AdditionalLoadTestPermissions",
+            roles=[ray_exec_role],
+            statements=[
+                iam.PolicyStatement(effect=iam.Effect.ALLOW, actions=["timestream:WriteRecords"], resources=["*"]),
+            ],
+        )
+
+        # Add instance profile
+        iam.CfnInstanceProfile(
+            self,
+            "ray-instance-profile",
+            roles=[ray_exec_role.role_name],
+            instance_profile_name="ray-cluster-instance-profile",
+        )
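The `timestream:WriteRecords` grant above exists so the cluster nodes can run Timestream load tests. For illustration, here is a hypothetical sketch of the kind of pytest case that could live under `tests/load`; the database and table names are invented, and the Timestream table must already exist.

```python
# Hypothetical load-test case in the style of tests/load (names are
# illustrative, not from the commit). Requires a pre-existing Timestream
# database/table and credentials with timestream:WriteRecords.
import awswrangler as wr
import pandas as pd


def test_timestream_write() -> None:
    df = pd.DataFrame(
        {
            # Distinct timestamps so Timestream does not reject duplicates
            "time": pd.date_range("2022-01-01", periods=1000, freq="s"),
            "measure": [float(i) for i in range(1000)],
            "region": "us-east-1",
        }
    )
    # wr.timestream.write returns the records rejected by the service
    rejected = wr.timestream.write(
        df=df,
        database="load_test_db",  # hypothetical names
        table="load_test_table",
        time_col="time",
        measure_col="measure",
        dimensions_cols=["region"],
    )
    assert len(rejected) == 0
```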

tutorials/006 - Amazon Athena.ipynb

Lines changed: 2 additions & 2 deletions
@@ -119,7 +119,7 @@
 "cols = [\"id\", \"dt\", \"element\", \"value\", \"m_flag\", \"q_flag\", \"s_flag\", \"obs_time\"]\n",
 "\n",
 "df = wr.s3.read_csv(\n",
-"    path=\"s3://noaa-ghcn-pds/csv/189\",\n",
+"    path=\"s3://noaa-ghcn-pds/csv/by_year/189\",\n",
 "    names=cols,\n",
 "    parse_dates=[\"dt\", \"obs_time\"]) # Read 10 files from the 1890 decade (~1GB)\n",
 "\n",
@@ -381,4 +381,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
-}
+}

tutorials/008 - Redshift - Copy & Unload.ipynb

Lines changed: 1 addition & 1 deletion
@@ -276,7 +276,7 @@
 "cols = [\"id\", \"dt\", \"element\", \"value\", \"m_flag\", \"q_flag\", \"s_flag\", \"obs_time\"]\n",
 "\n",
 "df = wr.s3.read_csv(\n",
-"    path=\"s3://noaa-ghcn-pds/csv/1897.csv\",\n",
+"    path=\"s3://noaa-ghcn-pds/csv/by_year/1897.csv\",\n",
 "    names=cols,\n",
 "    parse_dates=[\"dt\", \"obs_time\"]) # ~127MB, ~4MM rows\n",
 "\n",

tutorials/010 - Parquet Crawler.ipynb

Lines changed: 1 addition & 1 deletion
@@ -244,7 +244,7 @@
 "cols = [\"id\", \"dt\", \"element\", \"value\", \"m_flag\", \"q_flag\", \"s_flag\", \"obs_time\"]\n",
 "\n",
 "df = wr.s3.read_csv(\n",
-"    path=\"s3://noaa-ghcn-pds/csv/189\",\n",
+"    path=\"s3://noaa-ghcn-pds/csv/by_year/189\",\n",
 "    names=cols,\n",
 "    parse_dates=[\"dt\", \"obs_time\"]) # Read 10 files from the 1890 decade (~1GB)\n",
 "\n",

tutorials/019 - Athena Cache.ipynb

Lines changed: 1 addition & 1 deletion
@@ -272,7 +272,7 @@
 "cols = [\"id\", \"dt\", \"element\", \"value\", \"m_flag\", \"q_flag\", \"s_flag\", \"obs_time\"]\n",
 "\n",
 "df = wr.s3.read_csv(\n",
-"    path=\"s3://noaa-ghcn-pds/csv/189\",\n",
+"    path=\"s3://noaa-ghcn-pds/csv/by_year/189\",\n",
 "    names=cols,\n",
 "    parse_dates=[\"dt\", \"obs_time\"]) # Read 10 files from the 1890 decade (~1GB)\n",
 "\n",

tutorials/022 - Writing Partitions Concurrently.ipynb

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@
 }
 ],
 "source": [
-"noaa_path = \"s3://noaa-ghcn-pds/csv/193\"\n",
+"noaa_path = \"s3://noaa-ghcn-pds/csv/by_year/193\"\n",
 "\n",
 "cols = [\"id\", \"dt\", \"element\", \"value\", \"m_flag\", \"q_flag\", \"s_flag\", \"obs_time\"]\n",
 "dates = [\"dt\", \"obs_time\"]\n",
