
Commit 8c4d7ca

feat(gcloud): Enhance CUJ framework and add advanced use cases
This commit builds upon the foundational CUJ framework by ingesting battle-tested logic from numerous sources and implementing the initial set of comprehensive, production-like Critical User Journeys. The framework is now enhanced with a powerful, modular library and the first advanced CUJs, making it a robust tool for end-to-end testing.

Key Enhancements:

* **Modular Library (`lib/`)**: The monolithic `common.sh` is refactored into a modular library with components organized by function (`_core.sh`, `_network.sh`, `_dataproc.sh`, `_database.sh`, `_security.sh`). This incorporates advanced, parameterized, and idempotent functions for managing a wide range of GCP resources.
* **Advanced Onboarding (`onboarding/`)**: New scripts are added to provision persistent, shared infrastructure, including a High-Availability Cloud SQL instance with VPC Peering and a dual-NIC Squid Proxy VM, following GCP best practices.
* **New Critical User Journeys (`cuj/`)**:
  * `gce/standard`: This CUJ is enhanced to provision a full, NAT-based network environment.
  * `gce/proxy-egress`: A new CUJ is added to test Dataproc clusters that use a proxy for all outbound traffic.
  * `gke/standard`: A new CUJ is added for the standard Dataproc on GKE use case.
* **Enhanced CI/CD (`ci/`)**: `pristine_check.sh` is upgraded to use a robust, tag-based cleanup strategy, making it scalable to any number of CUJs without modification.
* **Finalized Configuration (`env.json`)**: The `env.json.sample` file is finalized with a simplified structure that defines the shared test environment and a `cuj_set` for test orchestration, abstracting implementation details from the user.
* **Comprehensive Documentation (`README.md`)**: The README is updated to be a complete guide for the new framework, explaining its philosophy and providing a clear "Getting Started" workflow for new users.
1 parent 1d2ffb5 commit 8c4d7ca

File tree: 17 files changed (+1408, -376 lines)


gcloud/README.md

Lines changed: 76 additions & 185 deletions
@@ -16,200 +16,91 @@ limitations under the License.

 -->


-## Introduction
+Of course. Here is a new `gcloud/README.md` file that explains the purpose and usage of the Critical User Journey (CUJ) framework we have designed together.

-This README file describes how to use this collection of gcloud bash examples to
-reproduce common Dataproc cluster creation problems relating to the GCE startup
-script, Dataproc startup script, and Dataproc initialization-actions scripts.
+It covers the framework's philosophy, a step-by-step guide for new users, and an overview of the available CUJs, incorporating all of our recent design decisions.

-## Clone the git repository
+---

-```
-$ git clone git@github.com:GoogleCloudDataproc/cloud-dataproc
-$ cd cloud-dataproc/gcloud
-$ cp env.json.sample env.json
-$ vi env.json
-```
+# Dataproc Critical User Journey (CUJ) Framework

-## Environment configuration
+This directory contains a collection of scripts that form a test framework for exercising Critical User Journeys (CUJs) on Google Cloud Dataproc. The goal of this framework is to provide a robust, maintainable, and automated way to reproduce and validate the common and complex use cases that are essential for our customers.

-First, copy `env.json.sample` to `env.json` and modify the environment
-variable names and their values in `env.json` to match your
-environment:
+This framework replaces the previous monolithic scripts with a modular, scalable, and self-documenting structure designed for both interactive use and CI/CD automation.

+## Framework Overview
+
+The framework is organized into several key directories, each with a distinct purpose:
+
+* **`onboarding/`**: Contains idempotent scripts to set up persistent, shared infrastructure that multiple CUJs might depend on. These are typically run once per project. Examples include setting up a shared Cloud SQL instance or a Squid proxy VM.
+
+* **`cuj/`**: The heart of the framework. This directory contains the individual, self-contained CUJs, grouped by the Dataproc platform (`gce`, `gke`, `s8s`). Each CUJ represents a specific, testable customer scenario.
+
+* **`lib/`**: A collection of modular bash script libraries (`_core.sh`, `_network.sh`, `_database.sh`, etc.). These files contain all the powerful, reusable functions for creating and managing GCP resources, forming a shared API for all `onboarding` and `cuj` scripts.
+
+* **`ci/`**: Includes scripts specifically for CI/CD automation. The `pristine_check.sh` script is designed to enforce a clean project state before and after test runs, preventing bitrot and ensuring reproducibility.
+
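Editorial aside: a minimal sketch of how a script under `cuj/` or `onboarding/` might consume the modular library described in the added lines above. The relative path and the choice of modules to source are assumptions, not the committed implementation.

```bash
#!/usr/bin/env bash
# Sketch only: pull in the shared lib/ modules before using their helpers.
set -euo pipefail

# Path assumed for a script living under gcloud/cuj/<platform>/<name>/.
LIB_DIR="$(cd "$(dirname "$0")/../../../lib" && pwd)"

source "${LIB_DIR}/_core.sh"
source "${LIB_DIR}/_network.sh"
source "${LIB_DIR}/_dataproc.sh"
```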
+## Getting Started
+
+Follow these steps to configure your environment and run your first CUJ.
+
+### 1. Prerequisites
+
+Ensure you have the following tools installed and configured:
+* `gcloud` CLI (authenticated to your Google account)
+* `jq`
+* A Google Cloud project with billing enabled.
+
+### 2. Configure Your Environment
+
+Copy the sample configuration file and edit it to match your environment.
+
+```bash
+cp gcloud/env.json.sample gcloud/env.json
+vi gcloud/env.json
 ```
-{
-"PROJECT_ID":"ldap-example-yyyy-nn",
-"ORG_NUMBER":"100000000001",
-"DOMAIN": "your-domain-goes-here.com",
-"BILLING_ACCOUNT":"100000-000000-000001",
-"FOLDER_NUMBER":"100000000001",
-"REGION":"us-west4",
-"RANGE":"10.00.01.0/24",
-"IDLE_TIMEOUT":"30m",
-"ASN_NUMBER":"65531",
-"IMAGE_VERSION":"2.2,
-"BIGTABLE_INSTANCE":"my-bigtable"
-}
-```

-The values that you enter here will be used to build reasonable defaults in
-`lib/env.sh` ; you can view and modify `lib/env.sh` to more finely tune your
-environment. The code in lib/env.sh is sourced and executed at the head of many
-scripts in this suite to ensure that the environment is tuned for use with this
-reproduction.
-
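For context, the pattern the removed lines describe (a script sourcing `lib/env.sh`, which derives defaults from `env.json`) would look roughly like this sketch. The relative path, variable names, and follow-on commands are assumptions, not the actual file contents.

```bash
#!/usr/bin/env bash
# Sketch of the removed workflow: env.json values become shell defaults.
set -euo pipefail

# lib/env.sh was sourced at the head of scripts in the old layout.
source "$(dirname "$0")/../lib/env.sh"

# After sourcing, values originating in env.json are available as variables.
echo "Reproducing in project ${PROJECT_ID}, region ${REGION}"
gcloud config set project "${PROJECT_ID}"
```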
-#### Dataproc on GCE
-
-To tune the reproduction environment for your (customer's) GCE use case, review
-the `create_dpgce_cluster` function in the `lib/shared-functions.sh` file. This
-is where you can select which arguments are passed to the `gcloud dataproc
-clusters create ${CLUSTER_NAME}` command. There exist many examples in the
-comments of common use cases below the call to gcloud itself.
-
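A minimal sketch of what a `create_dpgce_cluster`-style wrapper could look like; the function name comes from the removed text above, but the flag selection and variable names here are illustrative assumptions.

```bash
# Illustrative only: the real function in lib/shared-functions.sh carries many
# more (often commented-out) argument variations for specific use cases.
function create_dpgce_cluster() {
  gcloud dataproc clusters create "${CLUSTER_NAME}" \
    --project="${PROJECT_ID}" \
    --region="${REGION}" \
    --image-version="${IMAGE_VERSION}" \
    --bucket="${BUCKET}" \
    --num-workers=2
}
```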
-## creation phase
-
-When reviewing `lib/shared-functions.sh`, pay attention to the
-`--metadata startup-script="..."` and `--initialization-actions
-"${INIT_ACTIONS_ROOT}/<script-name>"` arguments. These can be used to
-execute arbitrary code during the creation of Dataproc clusters. Many
-Google Cloud Support cases relate to failures during either a)
-Dataproc's internal startup script, which runs after the `--metadata
-startup-script="..."`, or b) scripts passed using the
-`--initialization-actions` cluster creation argument.
-
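To illustrate the two hooks discussed in the removed lines above, a creation command can pass both a GCE startup script and an initialization action. The script contents and names below are placeholders; both flags are standard `gcloud dataproc clusters create` arguments.

```bash
# Placeholder names; both arguments run code during cluster creation.
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --region="${REGION}" \
  --metadata startup-script='#!/bin/bash
logger "GCE startup script ran before the Dataproc agent"' \
  --initialization-actions="${INIT_ACTIONS_ROOT}/my-init-action.sh"
```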
-## creating the environment and cluster
-
-Once you have altered `env.json` and have reviewed the function names in
-`lib/shared-functions.sh`, you can create your cluster environment and launch
-your cluster by running `bin/create-dpgce`. Although the function should be
-idempotent, users should not plan to run this more than once for a single
-reproduction, as it may configure the environment in a way which renders the
-environment non-functional.
-
-Running the `bin/create-dpgce` script will create the staging bucket, enable the
-required services, create a dedicated VPC network, router, NAT, subnet, firewall
-rules, and finally, the cluster itself.
-
-By default, your cluster will time out and be destroyed after 30 minutes of
-inactivity. Activity is defined by receipt of a job using the `gcloud dataproc
-jobs submit` command. You can change this default of 30 minutes by altering the
-value of IDLE_TIMEOUT in `env.json`. This saves your project and your org
-operating costs on reproduction clusters which are not being used to actively
-reproduce problems. It also gives you a half of an hour to do your work before
-worrying that your cluster will be brought down.
-
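The idle-timeout behaviour described in the removed lines above corresponds to the `--max-idle` flag on cluster creation; a minimal example, with a placeholder cluster name:

```bash
# Delete the cluster after 30 minutes without submitted jobs.
gcloud dataproc clusters create repro-cluster \
  --region="${REGION}" \
  --max-idle=30m
```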
-## recreating the cluster
-
-If your cluster has been destroyed either by timeout or manually calling
-`gcloud dataproc clusters delete` you can re-create it by running
-`bin/recreate-dpgce`. This script does not re-create any of the resources the
-cluster depends on such as network, router, staging bucket, etc. It only
-deletes and re-creates the cluster that's already been defined in `env.json` and
-previously provisioned using `bin/create-dpgce`
-
-## deleting the environment and cluster
-
-If you need to delete the entire environment, you can run `bin/destroy-dpgce` ;
-this will delete the cluster, remove the firewall rules, subnet, NAT, router,
-VPC network, and staging bucket. To re-create a deleted environment, you may
-run `bin/create-dpgce` after `bin/destroy-dpgce` completes successfully.
-
-### Metadata store
-
-All startup-scripts run on GCE instances, including Dataproc GCE cluster nodes,
-may make use of the `/usr/share/google/get_metadata_value` script to look up
-information in the metadata store. The information available in the metadata
-server includes some of the arguments passed when creating the cluster using the
-`--metadata` argument.
-
-For instance, if you were to call `gcloud dataproc clusters create
-${CLUSTER_NAME}` with the argument `--metadata
-init-actions-repo=${INIT_ACTIONS_ROOT}`, then you can find this value by running
-`/usr/share/google/get_metadata_value "attributes/init-actions-repo"`. By
-default, there are some attributes which are set for dataproc. Some important
-ones follow:
-
-* attributes/dataproc-role
-  - value: `Master` for master nodes
-  - value: `Worker` for primary and secondary worker nodes
-* attributes/dataproc-cluster-name
-* attributes/dataproc-bucket
-* attributes/dataproc-cluster-uuid
-* attributes/dataproc-region
-* hostname (FQDN)
-* name (short hostname)
-* machine-type
-
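A short sketch, runnable on a cluster node, that queries a few of the attributes listed above using the `get_metadata_value` helper mentioned in the removed text; the branching logic is purely illustrative.

```bash
#!/bin/bash
# Query Dataproc-provided metadata attributes from a node.
MD=/usr/share/google/get_metadata_value
ROLE="$("${MD}" attributes/dataproc-role)"             # Master or Worker
CLUSTER="$("${MD}" attributes/dataproc-cluster-name)"
BUCKET="$("${MD}" attributes/dataproc-bucket)"

if [[ "${ROLE}" == "Master" ]]; then
  echo "Master-only setup for ${CLUSTER} (staging bucket: ${BUCKET})"
fi
```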
-### GCE Startup script
-
-Before reading this section, please become familiar with the documentation in
-the GCE library for the
-[startup-script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux)
-metadata argument
-
-The content of the startup-script, if passed as a string, is stored as
-`attributes/startup-script` in the metadata store. If passed as a url, the url
-can be found as `attributes/startup-script-url`.
-
-The GCE startup script runs prior to the Dataproc Agent. This script can be
-used to make small modifications to the environment prior to starting Dataproc
-services on the host.
-
-### Dataproc Startup script
-
-The Dataproc agent is responsible for launching the [Dataproc startup
-script](https://cs/piper///depot/google3/cloud/hadoop/services/images/startup-script.sh)
-and the [initialization
-actions](https://github.com/GoogleCloudDataproc/initialization-actions) in order
-of specification.
-
-The Dataproc startup script runs before the initialization actions, and logs its
-output to `/var/log/dataproc-startup-script.log`. It is linked to by
-`/usr/local/share/google/dataproc/startup-script.sh` on all dataproc nodes. The
-tasks which the startup script run are influenced by the following arguments.
-This is not an exhaustive list. If you are troubleshooting startup errors,
-determine whether any arguments or properties are being supplied to the
-`clusters create` command, especially any similar to the following.
+You only need to edit the universal and onboarding settings. The `load_config` function in the library will dynamically generate a `PROJECT_ID` if the default value is present.
+
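A hedged sketch of what a `load_config`-style helper might do. The function name and behaviour come from the added sentence above, but the placeholder it checks for, the generated project-id format, and the config path are assumptions, not the committed implementation.

```bash
# Sketch only; the real implementation lives in the lib/ modules.
load_config() {
  local cfg="gcloud/env.json"
  PROJECT_ID="$(jq -r '.PROJECT_ID' "${cfg}")"
  REGION="$(jq -r '.REGION' "${cfg}")"

  # If the sample placeholder is still present, derive a unique project id.
  if [[ "${PROJECT_ID}" == "your-project-id" ]]; then
    PROJECT_ID="cuj-$(date +%Y%m%d)-${RANDOM}"
  fi
  export PROJECT_ID REGION
}
```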
+### 3. Run Onboarding Scripts
+
+Before running any CUJs, you must set up the shared infrastructure for your project. These scripts are idempotent and can be run multiple times safely.
+
+```bash
+# Set up the shared Cloud SQL instance with VPC Peering
+bash gcloud/onboarding/create_cloudsql_instance.sh

+# Set up the shared Squid Proxy VM and its networking
+bash gcloud/onboarding/create_squid_proxy.sh
 ```
-* `--optional-components`
-* `--enable-component-gateway`
-* `--properties 'dataproc:conda.*=...'`
-* `--properties 'dataproc:pip.*=...'`
-* `--properties 'dataproc:kerberos.*=...'`
-* `--properties 'dataproc:ranger.*=...'`
-* `--properties 'dataproc:druid.*=...'`
-* `--properties 'dataproc:kafka.*=...'`
-* `--properties 'dataproc:yarn.docker.*=...'`
-* `--properties 'dataproc:solr.*=...'`
-* `--properties 'dataproc:jupyter.*=...'`
-* `--properties 'dataproc:zeppelin.*=...'`
+
+### 4. Run a Critical User Journey
+
+Navigate to the directory of the CUJ you want to run and use its `manage.sh` script.
+
+**Example: Running the standard GCE cluster CUJ**
+
+```bash
+# Navigate to the CUJ directory
+cd gcloud/cuj/gce/standard/
+
+# Create all resources for this CUJ
+./manage.sh up
+
+# When finished, tear down all resources for this CUJ
+./manage.sh down
 ```

-On Dataproc images prior to 2.3, the Startup script is responsible for
-configuring the optional components which the customer has selected in the way
-that the customer has specified with properties. Errors indicating
-dataproc-startup-script.log often have to do with configuration of optional
-components and their services.
-
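For reference, this is the kind of creation command the removed text is describing: optional components selected and tuned with `dataproc:` properties. The cluster name, component choice, and property value are placeholders.

```bash
gcloud dataproc clusters create repro-cluster \
  --region="${REGION}" \
  --optional-components=JUPYTER,ZEPPELIN \
  --enable-component-gateway \
  --properties='dataproc:pip.packages=pandas==2.1.4'
```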
-### Dataproc Initialization Actions scripts
-
-Documentation for the
-[initialization-actions](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions)
-argument to the `gcloud dataproc clusters create` command can be found in the
-Dataproc library. You may also want to review the
-[README.md](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md)
-from the public initialization-actions repo on GitHub.
-
-Do note that you can specify multiple initialization actions scripts. They will
-be executed in the order of specification. The initialization-actions scripts
-are stored to
-`/etc/google-dataproc/startup-scripts/dataproc-initialization-script-${INDEX}`
-on the filesystem of each cluster node, where ${INDEX} is the script number,
-starting with 0, and incrementing for each additional script. The URL of the
-script can be found by querying the metadata server for
-`attributes/dataproc-initialization-action-script-${INDEX}`. From within the
-script itself, you can refer to `attributes/$0`.
-
-Logs for each initialization action script are created under /var/log
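Runnable on a cluster node, this sketch performs the lookups the removed text describes for the first initialization action; the exact log file name is an assumption.

```bash
# Inspect initialization action number 0 on a node.
INDEX=0
/usr/share/google/get_metadata_value \
  "attributes/dataproc-initialization-action-script-${INDEX}"   # source URL
ls -l "/etc/google-dataproc/startup-scripts/dataproc-initialization-script-${INDEX}"
ls -l /var/log/dataproc-initialization-script-"${INDEX}"*       # log file name assumed
```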
+Each `manage.sh` script supports several commands:
+* **`up`**: Creates all resources for the CUJ.
+* **`down`**: Deletes all resources created by this CUJ.
+* **`rebuild`**: Runs `down` and then `up` for a full cycle.
+* **`validate`**: Checks for prerequisites, such as required APIs or shared infrastructure.
+
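A hedged sketch of how such a `manage.sh` dispatcher could be structured; the helper names and library path are assumptions, not the committed implementation.

```bash
#!/usr/bin/env bash
# Sketch of a manage.sh command dispatcher for a single CUJ.
set -euo pipefail

source "$(dirname "$0")/../../../lib/_core.sh"   # path assumed

case "${1:-}" in
  up)       cuj_create_resources ;;    # hypothetical helpers
  down)     cuj_delete_resources ;;
  rebuild)  "$0" down && "$0" up ;;
  validate) cuj_check_prerequisites ;;
  *)        echo "Usage: $0 {up|down|rebuild|validate}" >&2; exit 1 ;;
esac
```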
+## Available CUJs
+
+This framework includes the following initial CUJs:
+
+* **`gce/standard`**: Creates a standard Dataproc on GCE cluster in a dedicated VPC with a Cloud NAT gateway for secure internet egress.
+* **`gce/proxy-egress`**: Creates a Dataproc on GCE cluster in a private network configured to use the shared Squid proxy for all outbound internet traffic.
+* **`gke/standard`**: Creates a standard Dataproc on GKE virtual cluster on a new GKE cluster.
