---
layout: blog
title: "How to Handle Data Duplication in Data-Heavy Kubernetes Environments"
date: 2021-09-29
slug: how-to-handle-data-duplication-in-data-heavy-kubernetes-environments
---

**Authors:** Augustinas Stirbis (CAST AI)

## Why Duplicate Data?

It’s convenient to create a copy of your application with a copy of its state for each team.
For example, you might want a separate database copy to test significant schema changes
or to develop other disruptive operations like bulk insert/delete/update.

**Duplicating data takes a lot of time.** That’s because you first need to download
all the data from a source block storage provider to compute, and then send
it back to a storage provider again. A lot of network traffic, CPU, and RAM is used in this process.
Hardware acceleration, offloading certain expensive operations to dedicated hardware, is
**always a huge performance boost**. It reduces the time required to complete an operation by orders
of magnitude.

## Volume Snapshots to the rescue

Kubernetes introduced [VolumeSnapshots](/docs/concepts/storage/volume-snapshots/) as alpha in 1.12,
beta in 1.17, and the Generally Available version in 1.20.
VolumeSnapshots use specialized APIs from storage providers to duplicate a volume of data.

Since the data is already on the same storage device (or array of devices), duplicating data is usually
a metadata operation for storage providers with local snapshots (the majority of on-premises storage providers).
All you need to do is point a new disk to an immutable snapshot and only
save deltas (or let it do a full-disk copy). As an operation inside the storage back-end,
it’s much quicker and usually doesn’t involve sending traffic over the network.
Public cloud storage providers work a bit differently under the hood. They save snapshots
to object storage and then copy back from object storage to block storage when "duplicating" a disk.
Technically, a lot of compute and network resources are spent on the cloud provider’s side,
but from the Kubernetes user’s perspective, VolumeSnapshots work the same way whether the snapshot
storage is local or remote, and none of your compute and network resources are involved in the operation.
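
For context, an in-cluster snapshot is requested with a namespaced VolumeSnapshot object. A minimal sketch, assuming a VolumeSnapshotClass named `csi-snapclass` exists in the cluster and the installed CSI driver supports snapshots (both names here are illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap          # illustrative name
  namespace: production
spec:
  volumeSnapshotClassName: csi-snapclass   # assumes this class exists
  source:
    persistentVolumeClaimName: postgres-pv-claim
```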

## Sounds like we have our solution, right?

Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from
being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design
decision so that a Pod running in a different namespace can’t mount another application’s
[PersistentVolumeClaim](/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) (PVC).

One way around it would be to create multiple volumes with duplicate data in one namespace.
However, you could easily reference the wrong copy.

So the idea is to separate teams/initiatives by namespaces to avoid that, and generally to
limit access to the production namespace.

## Solution? Creating a Golden Snapshot externally

Another way around this design limitation is to create a snapshot externally (not through Kubernetes).
This is also called pre-provisioning a snapshot manually. Next, I will import it
as a multi-tenant golden snapshot that can be used for many namespaces. The illustration below
uses the AWS EBS (Elastic Block Store) and GCE PD (Persistent Disk) services.

### High-level plan for preparing the Golden Snapshot

1. Identify the disk (EBS/Persistent Disk) with the data you want to clone in the cloud provider
2. Make a disk snapshot (in the cloud provider console)
3. Get the disk snapshot ID

### High-level plan for cloning data for each team

1. Create the namespace “sandbox01”
2. Import the disk snapshot (ID) as a VolumeSnapshotContent into Kubernetes
3. Create a VolumeSnapshot in the namespace "sandbox01", mapped to the VolumeSnapshotContent
4. Create a PersistentVolumeClaim from the VolumeSnapshot
5. Install a Deployment or StatefulSet with the PVC

## Step 1: Identify Disk

First, you need to identify your golden source. In my case, it’s a PostgreSQL database
on the PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

```terminal
kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'
```

The output will look similar to:

```
pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9
```

## Step 2: Prepare your golden source

You need to do this once, or every time you want to refresh your golden data.

### Make a Disk Snapshot

Go to the AWS EC2 or GCP Compute Engine console and search for an EBS volume
(on AWS) or Persistent Disk (on GCP) that has a label matching the last output.
In this case, I saw: `pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9`.

Click on Create snapshot and give it a name. You can do it in the console manually,
in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the
terminal you must have the AWS CLI tool (`aws`) or Google's CLI (`gcloud`)
installed and configured.

Here’s the command to create a snapshot on GCP:

```terminal
gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
```

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/create-volume-snapshot-gcp.png" alt="Screenshot of a terminal showing volume snapshot creation on GCP" title="GCP snapshot creation" >}}

GCP identifies the disk by its PVC name, so it’s a direct mapping. On AWS, you first need to
find the volume by the `CSIVolumeName` AWS tag, whose value is the PVC name; that volume ID
will be used for snapshot creation.

{{< figure src="/images/blog/2021-09-07-data-duplication-in-data-heavy-k8s-env/identify-volume-aws.png" alt="Screenshot of AWS web console, showing EBS volume identification" title="Identify disk ID on AWS" >}}

Note down the volume ID (`vol-00c7ecd873c6fb3ec` in this case), and either create the EBS snapshot in the AWS Console, or use the `aws` CLI:

```terminal
aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot'
```

## Step 3: Get your Disk Snapshot ID

In AWS, the command above will output something similar to:

```terminal
"SnapshotId": "snap-09ed24a70bc19bbe4"
```
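
If you are scripting this step, the ID can be pulled out of that JSON with standard tools. A small sketch; the response below is hard-coded for illustration, whereas in practice it would come from the `aws ec2 create-snapshot` call:

```shell
# Illustrative response; in practice: response=$(aws ec2 create-snapshot ...)
response='{ "SnapshotId": "snap-09ed24a70bc19bbe4", "State": "pending" }'

# Extract the value of the "SnapshotId" key from the JSON text.
snapshot_id=$(printf '%s' "$response" | sed -n 's/.*"SnapshotId": "\([^"]*\)".*/\1/p')
echo "$snapshot_id"
```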

If you’re using GCP, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

```terminal
gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:
```

You should get output similar to:

```
id: 6645363163809389170
```

## Step 4: Create a development environment for each team

Now I have my Golden Snapshot, which is immutable data. Each team will get a copy
of this data, and team members can modify it as they see fit, given that a new EBS/persistent
disk will be created for each team.

Below I will define a manifest for each namespace. To save time, you can replace
the namespace name (such as changing “sandbox01” → “sandbox42”) using tools
such as `sed` or `yq`, with Kubernetes-aware templating tools like
[Kustomize](/docs/tasks/manage-kubernetes-objects/kustomization/),
or using variable substitution in a CI/CD pipeline.
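
Since the manifests differ only in the namespace string, the `sed` route can be as simple as the sketch below (the file names and the `sandbox42` target are illustrative, and the input here is a trimmed fragment of the manifest in this post):

```shell
# Illustrative input: a fragment of the per-team manifest.
cat > sandbox01-manifests.yaml <<'EOF'
kind: VolumeSnapshot
metadata:
  name: postgresql-orders-db-snap
  namespace: sandbox01
EOF

# Rewrite every occurrence of the namespace string for the next team.
# This also renames any objects that embed the namespace in their name.
sed 's/sandbox01/sandbox42/g' sandbox01-manifests.yaml > sandbox42-manifests.yaml
```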

Here's an example manifest:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: postgresql-orders-db-sandbox01
  namespace: sandbox01
spec:
  deletionPolicy: Retain
  driver: pd.csi.storage.gke.io
  source:
    snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
  volumeSnapshotRef:
    kind: VolumeSnapshot
    name: postgresql-orders-db-snap
    namespace: sandbox01
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgresql-orders-db-snap
  namespace: sandbox01
spec:
  source:
    volumeSnapshotContentName: postgresql-orders-db-sandbox01
```

In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced.
However, I need a separate VSC for each different namespace to use, so the
`metadata.name` of each VSC must also be different. To make that straightforward,
I used the target namespace as part of the name.

Now it’s time to replace the driver field with the CSI (Container Storage Interface) driver
installed in your K8s cluster. Major cloud providers have CSI drivers for block storage that
support VolumeSnapshots, but quite often CSI drivers are not installed by default; consult
your Kubernetes provider.

The manifest above defines a VSC that works on GCP.
On AWS, the driver and snapshotHandle values might look like:

```yaml
driver: ebs.csi.aws.com
source:
  snapshotHandle: "snap-07ff83d328c981c98"
```

At this point, I need to use the *Retain* policy, so that the CSI driver doesn’t try to
delete my manually created EBS disk snapshot.

For GCP, you will have to build this string by hand by adding the full project ID and snapshot ID.
For AWS, it’s just the plain snapshot ID.

A VSC also requires specifying which VolumeSnapshot (VS) will use it, so the VSC and VS
reference each other.
209+
Now I can create PersistentVolumeClaim from VS above. It’s important to set this first:
210+
211+
212+
```yaml
213+
---
214+
apiVersion: v1
215+
kind: PersistentVolumeClaim
216+
metadata:
217+
name: postgres-pv-claim
218+
namespace: sandbox01
219+
spec:
220+
dataSource:
221+
kind: VolumeSnapshot
222+
name: postgresql-orders-db-snap
223+
apiGroup: snapshot.storage.k8s.io
224+
accessModes:
225+
- ReadWriteOnce
226+
resources:
227+
requests:
228+
storage: 21Gi
229+
```

If the default StorageClass has the [WaitForFirstConsumer](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode) policy,
then the actual cloud disk will be created from the Golden Snapshot only when some Pod binds that PVC.
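
For reference, that binding mode is set on the StorageClass. A minimal sketch, with an illustrative class name and the GCP PD CSI driver as the provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-wffc          # illustrative name
provisioner: pd.csi.storage.gke.io
volumeBindingMode: WaitForFirstConsumer
```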

Now I assign that PVC to my Pod (in my case, it’s PostgreSQL) as I would with any other PVC.
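
A Pod template can reference the claim like any other; a trimmed sketch of a Deployment (the image tag and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: sandbox01
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:13       # illustrative image tag
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-pv-claim
```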

```terminal
kubectl -n <namespace> get volumesnapshotcontent,volumesnapshot,pvc,pod
```

Both the VS and VSC should show *READYTOUSE* as true, the PVC should be bound, and the Pod (from a Deployment or StatefulSet) should be running.

**To keep using data from my Golden Snapshot, I just need to repeat this for the
next namespace, and voilà! No need to waste time and compute resources on the duplication process.**