
Commit f7226a7

feat: Support for Dataproc on GKE (#6143) (#4474)
Signed-off-by: Modular Magician <[email protected]>
1 parent 7bc2ac6 commit f7226a7

File tree

3 files changed: +166 -1 lines changed

.changelog/6143.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+```release-note:enhancement
+dataproc: Added Support for Dataproc on GKE in `google_dataproc_cluster` (GA only)
+```

google-beta/resource_dataproc_cluster.go

Lines changed: 5 additions & 1 deletion
@@ -889,6 +889,7 @@ func resourceDataprocClusterCreate(d *schema.ResourceData, meta interface{}) err
 	}
 
 	cluster.Config, err = expandClusterConfig(d, config)
+
 	if err != nil {
 		return err
 	}
@@ -1433,7 +1434,10 @@ func resourceDataprocClusterRead(d *schema.ResourceData, meta interface{}) error
 		return fmt.Errorf("Error setting labels: %s", err)
 	}
 
-	cfg, err := flattenClusterConfig(d, cluster.Config)
+	var cfg []map[string]interface{}
+
+	cfg, err = flattenClusterConfig(d, cluster.Config)
+
 	if err != nil {
 		return err
 	}

website/docs/r/dataproc_cluster.html.markdown

Lines changed: 158 additions & 0 deletions
@@ -136,6 +136,9 @@ resource "google_dataproc_cluster" "accelerated_cluster" {
   instances in the cluster. GCP generates some itself including `goog-dataproc-cluster-name`
   which is the name of the cluster.
 
+* `virtual_cluster_config` - (Optional) Allows you to configure a virtual Dataproc on GKE cluster.
+  Structure [defined below](#nested_virtual_cluster_config).
+
 * `cluster_config` - (Optional) Allows you to configure various aspects of the cluster.
   Structure [defined below](#nested_cluster_config).
 
@@ -149,6 +152,161 @@ resource "google_dataproc_cluster" "accelerated_cluster" {
   For more context see the [docs](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/patch#query-parameters)
 - - -
 
+<a name="nested_virtual_cluster_config"></a>The `virtual_cluster_config` block supports:
+
+```hcl
+virtual_cluster_config {
+  auxiliary_services_config { ... }
+  kubernetes_cluster_config { ... }
+}
+```
+
+* `staging_bucket` - (Optional) The Cloud Storage staging bucket used to stage files,
+  such as Hadoop jars, between client machines and the cluster.
+  Note: If you don't explicitly specify a `staging_bucket`, GCP will auto-create or
+  assign one for you. However, the auto-generated bucket is not guaranteed to be
+  dedicated solely to your cluster; it may be shared with other clusters in the same
+  region/zone that also use the auto-generation option.
+
+* `auxiliary_services_config` - (Optional) Configuration of auxiliary services used by this cluster.
+  Structure [defined below](#nested_auxiliary_services_config).
+
+* `kubernetes_cluster_config` - (Required) The configuration for running the Dataproc cluster on Kubernetes.
+  Structure [defined below](#nested_kubernetes_cluster_config).
+- - -
+
+<a name="nested_auxiliary_services_config"></a>The `auxiliary_services_config` block supports:
+
+```hcl
+virtual_cluster_config {
+  auxiliary_services_config {
+    metastore_config {
+      dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
+    }
+
+    spark_history_server_config {
+      dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
+    }
+  }
+}
+```
+
+* `metastore_config` - (Optional) The Hive Metastore configuration for this workload.
+
+* `dataproc_metastore_service` - (Required) Resource name of an existing Dataproc Metastore service.
+
+* `spark_history_server_config` - (Optional) The Spark History Server configuration for the workload.
+
+* `dataproc_cluster` - (Optional) Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload.
+- - -
+
+<a name="nested_kubernetes_cluster_config"></a>The `kubernetes_cluster_config` block supports:
+
+```hcl
+virtual_cluster_config {
+  kubernetes_cluster_config {
+    kubernetes_namespace = "foobar"
+
+    kubernetes_software_config {
+      component_version = {
+        "SPARK" : "3.1-dataproc-7"
+      }
+
+      properties = {
+        "spark:spark.eventLog.enabled" : "true"
+      }
+    }
+
+    gke_cluster_config {
+      gke_cluster_target = google_container_cluster.primary.id
+
+      node_pool_target {
+        node_pool = "dpgke"
+        roles     = ["DEFAULT"]
+
+        node_pool_config {
+          autoscaling {
+            min_node_count = 1
+            max_node_count = 6
+          }
+
+          config {
+            machine_type     = "n1-standard-4"
+            preemptible      = true
+            local_ssd_count  = 1
+            min_cpu_platform = "Intel Sandy Bridge"
+          }
+
+          locations = ["us-central1-c"]
+        }
+      }
+    }
+  }
+}
+```
+
+* `kubernetes_namespace` - (Optional) A namespace within the Kubernetes cluster to deploy into.
+  If this namespace does not exist, it is created.
+  If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it.
+  If not specified, the name of the Dataproc cluster is used.
+
+* `kubernetes_software_config` - (Required) The software configuration for this Dataproc cluster running on Kubernetes.
+
+* `component_version` - (Required) The components that should be installed in this Dataproc cluster. The key must be a string from the
+  KubernetesComponent enumeration. The value is the version of the software to be installed. At least one entry must be specified.
+  **NOTE**: `component_version[SPARK]` must be set, or creation of the cluster will fail.
+
+* `properties` - (Optional) The properties to set on daemon config files. Property keys are specified in `prefix:property` format,
+  for example `spark:spark.kubernetes.container.image`.
+
+* `gke_cluster_config` - (Required) The configuration for running the Dataproc cluster on GKE.
+
+* `gke_cluster_target` - (Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster
+  (the GKE cluster can be zonal or regional).
+
+* `node_pool_target` - (Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the `DEFAULT`
+  GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a `DEFAULT` GkeNodePoolTarget.
+  Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.
+
+* `node_pool` - (Required) The target GKE node pool.
+
+* `roles` - (Required) The roles associated with the GKE node pool.
+  One of `"DEFAULT"`, `"CONTROLLER"`, `"SPARK_DRIVER"` or `"SPARK_EXECUTOR"`.
+
+* `node_pool_config` - (Input only) The configuration for the GKE node pool.
+  If specified, Dataproc attempts to create a node pool with the specified shape.
+  If one with the same name already exists, it is verified against all specified fields.
+  If a field differs, the virtual cluster creation will fail.
+
+* `autoscaling` - (Optional) The autoscaler configuration for this node pool.
+  The autoscaler is enabled only when a valid configuration is present.
+
+* `min_node_count` - (Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= `max_node_count`.
+
+* `max_node_count` - (Optional) The maximum number of nodes in the node pool. Must be >= `min_node_count` and > 0.
+
+* `config` - (Optional) The node pool configuration.
+
+* `machine_type` - (Optional) The name of a Compute Engine machine type.
+
+* `local_ssd_count` - (Optional) The number of local SSD disks to attach to the node,
+  which is limited by the maximum number of disks allowable per zone.
+
+* `preemptible` - (Optional) Whether the nodes are created as preemptible VM instances.
+  Preemptible nodes cannot be used in a node pool with the `CONTROLLER` role, or in the `DEFAULT` node pool if the
+  `CONTROLLER` role is not assigned (the `DEFAULT` node pool will assume the `CONTROLLER` role).
+
+* `min_cpu_platform` - (Optional) Minimum CPU platform to be used by this instance.
+  The instance may be scheduled on the specified or a newer CPU platform.
+  Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".
+
+* `spot` - (Optional) Spot flag for enabling Spot VMs, which is a rebrand of the existing preemptible flag.
+
+* `locations` - (Optional) The list of Compute Engine zones where node pool nodes associated
+  with a Dataproc on GKE virtual cluster will be located.
+- - -
+
 <a name="nested_cluster_config"></a>The `cluster_config` block supports:
 
 ```hcl
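
Pulling the documented snippets together, the following is a minimal end-to-end sketch of the new block. It is not part of this commit: the resource names (`dp-on-gke`, the staging bucket, `google_container_cluster.primary`, the `dpgke` node pool) are illustrative assumptions.

```hcl
# Sketch only: a minimal virtual Dataproc on GKE cluster assembled from the
# documentation above. Names and the GKE cluster reference are hypothetical.
resource "google_dataproc_cluster" "virtual_cluster" {
  name   = "dp-on-gke"   # hypothetical cluster name
  region = "us-central1" # must match the GKE cluster's region

  virtual_cluster_config {
    staging_bucket = "dp-on-gke-staging" # hypothetical pre-existing bucket

    kubernetes_cluster_config {
      kubernetes_namespace = "dp-on-gke" # created if it does not exist

      kubernetes_software_config {
        component_version = {
          "SPARK" : "3.1-dataproc-7" # component_version[SPARK] is mandatory
        }
      }

      gke_cluster_config {
        # Assumed to be an existing GKE cluster (zonal or regional) in the
        # same project and region as the Dataproc cluster.
        gke_cluster_target = google_container_cluster.primary.id

        node_pool_target {
          node_pool = "dpgke"
          roles     = ["DEFAULT"] # at least one pool must carry DEFAULT
        }
      }
    }
  }
}
```

As the docs above note, the target GKE cluster must be in the same project and region as the Dataproc cluster, and at least one `node_pool_target` must be assigned the `DEFAULT` role; per the Dataproc on GKE documentation, the GKE cluster also generally needs Workload Identity enabled, which is configured on the `google_container_cluster` resource itself.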
