Commit 255c1db

oops - latest DESIGN and reversion of copy proof of concept iteration in creator job
1 parent 0771dda commit 255c1db

2 files changed: +445 −0 lines
DESIGN.md

Lines changed: 336 additions & 0 deletions
# Jobs

## v1/current

### ebs-snapshot-creator.py v1

- Creates primary snapshots (located in the same region as an instance).
- Runs once daily (but can be run more often if desired)
- A creator should be running in every region that has source instances
  that have EBS volumes and are (or may be) configured to get snapshots
- Features
  - Look for instances (in the same region the Lambda is running in) that have
    backups (snapshots) enabled
  - Find all connected EBS volumes on relevant instances
  - Tell AWS to create a snapshot of all found EBS volumes
  - Look for backup retention configuration on each relevant instance
    - If backup retention configuration is found, use it to determine and set
      the expiration date on the created snapshots
    - If no backup retention configuration is found, use a default of 7 days and
      set the expiration date on the created snapshots
  - For all created snapshots, set a tag indicating they were
    automatically created by this tool
- Differentiators
  - Designed to run 100% from AWS Lambdas
  - Automatically handles snapshot expiration, not just creation
  - Entirely configured on a per-instance basis using AWS's built-in tags
    - No external configuration files
    - No hardcoded configuration
    - No external database / file storage dependencies
  - Easy to customize per-instance configuration
  - Easy to manage per-instance configuration in orchestration tools
- Uses the following tags:
  - TBD

### ebs-snapshot-manager.py v1

- Deletes snapshots (located in the same region as the Lambda is running in)
  that have hit their expiration date
- Features
  - Looks for snapshots (located in the same region as the Lambda is running
    in) that have hit their expiration date and have a tag indicating
    they were created by this tool.
  - Pulls all needed information from the snapshots themselves (which are, in
    turn, automatically tagged/configured from instance configuration)

## v2

- Differentiators (beyond those already noted in v1)
  - No inter-region dependencies
    - All snapshot creation for instances in a given region is handled by a
      Lambda in the same region
    - Service problems in one region do not impact creation jobs in other
      regions
  - Multi-region snapshot creation (and replication, using the new copier
    job) enabled without centralizing job management in one region
  - All jobs operate autonomously within a region
  - All jobs operate autonomously across regions

### ebs-snapshot-creator.py v2

- No code changes over v1 other than tag adjustments to accommodate cleaner
  tag naming/design.
- Usage instructions added on how to use in situations where instances are
  running in multiple regions (hint: run a creator job in each relevant
  region)

### ebs-snapshot-manager.py v2

- No code changes over v1 other than tag adjustments to accommodate cleaner
  tag naming/design.
- Usage instructions added on how to use in situations where instances are
  running in multiple regions (hint: run a manager job in each relevant
  region)
- Usage instructions added on how to use in situations where snapshots are
  enabled for replication into additional regions (hint: run a manager job
  in each relevant region)

### ebs-snapshot-copier.py v2

- Copies any snapshots (located in the same region as the Lambda is running in)
  that are enabled for copying into additional regions, into those
  additional specified regions
- Runs once daily (but can be run more often if desired)
- A copier should be running in every region that has primary snapshots /
  source instances that are (or may be) configured for replication into
  additional regions
- Features:
  - Pulls all configuration from the snapshots themselves (which are, in turn,
    automatically tagged/configured from instance configuration)
  - Only copies snapshots that are in a completed state, to avoid creating
    unusable snapshots
  - Does not copy non-primary snapshots (i.e. those not located in the same
    region as an instance), to prevent copying snapshots that have already
    been copied (which would otherwise create infinite copying loops)
  - Replicates primary snapshot configuration (e.g. expiration dates)
  - Retains sufficient information to trace snapshot copies back to the
    original (primary) snapshots and source instances (because AWS snapshot
    copying processes do not retain this information and, worse, set bogus
    VolumeIds on out-of-region copied snapshots)

## v2 - original brainstorming

### ebs-snapshot-creator.py

- Creates primary snapshots (located in the same region as an instance).
- Runs once daily (but can be run more often if desired)
- Uses the following tags:
  - TBD

### ebs-snapshot-copier.py

Copies primary snapshots to an additional region.
Uses the following tags:

- Snapshots: { PrimarySnapshotID (set), Description (copied), DeleteOn (copied), Type (copied) }
- Instances: { SnapshotsEnabled, SnapshotsRetention, SnapshotsExtraRegion1 }

### ebs-snapshot-manager.py

----

# Tags - Tracking
122+
123+
## Snapshots
124+
125+
XXX Description instance_name - vol_id (dev_name) Set by creator
126+
Description instance_name (instance_id) - vol_id (dev_name) - source_region Set by creator
127+
DeleteOn YYYY-mm-dd Set by creator
128+
Type Automated Set by creator
129+
Source_SnapshotId Source SnapshotId in source region (if applicable) Set by copier
130+
131+
# Tags - Configuration
132+
133+
## Snapshots
134+
135+
KeepForever Yes Set by operator/user if snapshot is to be retained outside of retention schedule
136+
137+
## Instances
138+
139+
Backup Yes Whether to snapshot (backup) volumes attached to this instance
140+
Retention # of days (7 is the default if not specified) The number of days to retain snapshots for volumes attached to this instance
141+
Copy_Dest Name of an AWS region An additional region to copy snapshots to for volumes attached to this instance
142+
Name (whatever) Used in snapshot description
143+
144+
----
145+
146+
# Functionality
147+
148+
## Overview & Status
149+
150+
- Not requiring the running of a management server instance to host, run, or trigger jobs IMPLEMENTED
151+
- Simple configuration based 100% on tagging of instances and snapshots IMPLEMENTED, mostly
152+
- Creating snapshots IMPLEMENTED
153+
- Crash-consistent snapshots IMPLEMENTED
154+
- Copying snapshots to another region IMPLEMENTED, partially
155+
- Expiring snapshots after specified period passes IMPLEMENTED
156+
- Retaining select snapshots indefinitely IMPLEMENTED
157+
- Managing snapshots copied to another region in same way as original snapshots IN PROGRESS
158+
- Automatically running job that handles creating snapshots on a specified schedule IMPLEMENTED
159+
- Automatically running job that handles copying snapshots to another region on a specified schedule IMPLEMENTED, but refinement necessary
160+
- Automatically running job that handles expiring snapshots on a specified schedule IMPLEMENTED, but refinmement may be necessary
161+
- Simple provisioning of jobs, scheduling, and permissions TODO
162+
- Try to run after database dumps have completed TODO - EXTRA CREDIT
163+
- Try to run after quieting filesystem (not sure yet) TODO - EXTRA CREDIT
164+
- Application-consistent snapshots TODO - EXTRA CREDIT
165+
- Reporting (not sure yet) TODO
166+
- Errors (not sure yet) TODO
167+
168+
## Not requiring the running of a management server instance to host, run, or trigger jobs

This is implemented as a set of Python-based functions intended to run in AWS Lambda (which also
handles the job scheduling). This makes it self-contained and easier to set up, without any external
resources needed.

## Simple configuration based 100% on tagging of instances and snapshots

Easily readable (by humans and machines) AWS object tags are used to track, as well as configure, all
aspects of the jobs. This permits easy management and changes without adding anything outside of
AWS or relying on external configuration files or hardcoded configuration of any sort.

## Creating snapshots

The creator job (implemented in ebs-snapshot-creator.py) creates the primary snapshots (the ones in
the same region as the instance volumes they are associated with). Instances which should have their
volumes backed up are configured by adding a tag with the name "Backup" and the value "Yes" to the
instance. All EBS volumes attached to a backup-enabled instance will have snapshots taken of them.

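The tag lookup and expiration-date logic described above can be sketched in pure Python. The tag names (Backup, Retention) and the 7-day default come from this document; the helper names and dict shapes are illustrative stand-ins for live boto3 `describe_instances` data:

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 7  # default per the Retention tag section below


def tag_value(tags, name):
    """Return the value of a tag by name, or None (tags as AWS-style dicts)."""
    for tag in tags:
        if tag["Key"] == name:
            return tag["Value"]
    return None


def snapshot_plan(instance, today):
    """For a backup-enabled instance, return (volume_id, delete_on) pairs."""
    tags = instance.get("Tags", [])
    if tag_value(tags, "Backup") != "Yes":
        return []  # instance not enabled for backups
    try:
        retention = int(tag_value(tags, "Retention"))
    except (TypeError, ValueError):
        retention = DEFAULT_RETENTION_DAYS
    delete_on = (today + timedelta(days=retention)).strftime("%Y-%m-%d")
    return [(vol["VolumeId"], delete_on) for vol in instance.get("Volumes", [])]


# Hypothetical instance record (shape loosely mirrors a describe_instances entry)
inst = {
    "Tags": [{"Key": "Backup", "Value": "Yes"}, {"Key": "Retention", "Value": "14"}],
    "Volumes": [{"VolumeId": "vol-1234"}],
}
print(snapshot_plan(inst, date(2017, 1, 1)))  # [('vol-1234', '2017-01-15')]
```

A real creator would then call EC2's create-snapshot API for each planned volume and apply the DeleteOn/Type tags to the result.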
## Crash-consistent snapshots IMPLEMENTED

EBS snapshots are point-in-time backups of the data on an EBS volume. This means that the snapshots
are exact copies of the data frozen at a specific point in time (i.e. all data across the volume is
"snapshotted" at a single point in time for consistency across the filesystem).

HOWEVER, there is a gotcha: if the instance is running (and it generally is for folks relying on
snapshots as part of their backup strategy), EBS snapshots are crash-consistent. That is, whatever
is in memory on the instance is lost. It's as if someone pulled out the power cord of the computer,
pulled the volume out and copied it (the snapshot), then turned the computer back on. Of course,
in EC2 your instance will just keep running, so triggering EBS snapshots against running instances
is only *as if* someone did this.

For some situations this is good enough. For others it isn't.

Modern filesystems with journaling support attempt to recover and deal with issues such as what
happens when a "power cord is pulled". The same goes for database systems with recovery mechanisms
for similar situations. This, however, does not mean that data isn't lost. It simply means that
the filesystems and databases can return to a functioning, reliable, and probably good working state.
The cost of getting to this state is that a bit of data (for a narrow time window) may be tossed out
if it's suspect or incomplete.

Concerns can often be worked around by doing things like running database dumps via cron (these are
just static files on disk representing the contents of the database, which are then effectively
guaranteed to be in a quiet/frozen state when the EBS snapshot runs, whereas the raw live database
data files themselves are *not* likely to be).

Crash-consistent backups, and in turn EBS snapshots of running instances, are better than just copying
files (well, doing both is perhaps even better). The reason is that EBS snapshots are still
point-in-time based, ensuring consistency across the filesystem, whereas simply copying files has
time delays between when each file is copied.

Interesting links:

- [1](http://www.n2ws.com/blog/ebs-snapshots-crash-consistent-vs-application-consistent.html)
- [2](https://www.veeam.com/blog/how-to-create-a-consistent-vm-backup.html)
- [3](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html)

## Copying snapshots to another region

The copier job (implemented in ebs-snapshot-copier.py) copies completed primary snapshots (the ones
in the same region as the instance volumes they are associated with) to additional regions (only one
additional region is currently supported). Instances whose snapshots should be copied to another
region are configured by adding a tag named "Auto_Snapshot_Copy1" with the value "us-east-1"
or a similar region name. All volumes attached to a backup-enabled instance with this parameter set
to a valid region name will have their snapshots copied to this additional region.

Snapshots in the primary region are checked for counterparts in other regions (by looking for
snapshots in the other regions with a tag named "Source_SnapshotId" whose value equals the primary
snapshot's SnapshotId)[1].

Since we cannot copy primary snapshots that are in progress, we attempt to work around that issue in
four ways:

- the copier job is independent of the creator job
- we schedule the copier job to run a fair bit after the creator job
- we only copy snapshots in a completed state
- we pick up snapshots that were still not complete in the next job run

[1] We don't just compare snapshot IDs or even volume IDs from the snapshots because AWS's copy
snapshot function doesn't carry these over; the copied snapshot IDs are different from their
originals and the volume IDs are bogus (I believe this is because internally they are actually
copied to intermediate volumes for copying purposes only).

## Expiring snapshots after specified period passes

The retention period for snapshots of a given instance is specified by adding a tag to the instance
with the name "Retention" and a value of the number of days to retain snapshots for. The default is
7 days if not specified.

The creator job looks for this Retention tag and, based on its setting (or using the default of
7 days if it's not specified), sets a tag named "DeleteOn" on each snapshot at creation time, which
specifies the date the snapshot is to be kept until.

The manager job looks for snapshots tagged for deletion (per the "DeleteOn" tag) on the date it is
running. For safety, it also ignores snapshots not tagged as being automated (tag Type=Automated) as
well as those tagged with the tag "KeepForever".

The copier job doesn't do anything special for expiration, other than replicating the same tags
(e.g. DeleteOn) and their values on all snapshot copies as they were set on the original snapshots.

## Retaining select snapshots indefinitely

A user/operator (or third-party script) can add a tag with the name "KeepForever" to any automated
snapshot and it will be retained indefinitely, regardless of the expiration/retention configuration
for the instance that the snapshot is associated with.

The manager job looks for snapshots tagged "KeepForever" and ignores them entirely.

The copier job replicates the KeepForever tag (if set) on primary snapshots to their counterparts
in other regions (although this is not done in real time, so it is best for the user/operator to
set it manually on the copies too if they want to be certain).

## Managing snapshots copied to another region in same way as original snapshots IN PROGRESS

The copier job makes sure that copied snapshots indicate which primary snapshots they are duplicates
of by adding a tag named Source_SnapshotId with the relevant primary SnapshotId (we have to do this
because snapshot copies in other regions do not have relevant associated volume IDs nor any other
reference to the source snapshots, so we create our own reference).

The copier job also duplicates the other key tags (DeleteOn and KeepForever).

The manager job checks for instances that have additional regions configured for copying, then
processes snapshots in those regions in the same way as the primary instance region.

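The tag replication could be sketched as follows. The carried-tag set combines the tracking table (Description, DeleteOn, Type) with the KeepForever tag from this section; the helper itself is hypothetical:

```python
def tags_for_copy(primary_snapshot):
    """Build the tag set for an out-of-region copy: carry over the key tags
    and record the primary SnapshotId so the copy can be traced back."""
    carried = {"Description", "DeleteOn", "Type", "KeepForever"}
    tags = [t for t in primary_snapshot.get("Tags", []) if t["Key"] in carried]
    # Our own back-reference, since AWS copy-snapshot drops IDs (see above)
    tags.append({"Key": "Source_SnapshotId",
                 "Value": primary_snapshot["SnapshotId"]})
    return tags


snap = {"SnapshotId": "snap-123",
        "Tags": [{"Key": "DeleteOn", "Value": "2017-01-08"},
                 {"Key": "Name", "Value": "web-01"}]}
print(tags_for_copy(snap))
```

A real copier would pass these tags to a create-tags call against the snapshot copy in the destination region.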
## Automatically running job that handles creating snapshots on a specified schedule IMPLEMENTED, but refinement necessary

Currently we rely on AWS Lambda's built-in CloudWatch Events schedule of "rate(1 day)" to run daily.

To ensure the creator job runs before the copier, we may need to be more specific with a time of day.

## Automatically running job that handles copying snapshots to another region on a specified schedule IMPLEMENTED, but refinement necessary

Currently we rely on AWS Lambda's built-in CloudWatch Events schedule of "rate(1 day)" to run daily.

To ensure the copier job runs well after the creator job, we may need to be more specific with a time of day.

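For example, the ordering could be pinned down with cron-style schedule expressions (the times here are arbitrary illustrations, not part of the design; `rate(…)` and `cron(…)` are the two CloudWatch Events schedule expression forms, with cron fields in UTC):

```
ebs-snapshot-creator   cron(0 3 * * ? *)   # daily at 03:00 UTC
ebs-snapshot-copier    cron(0 5 * * ? *)   # daily at 05:00 UTC, well after the creator
ebs-snapshot-manager   rate(1 day)         # ordering relative to the others doesn't matter
```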
## Automatically running job that handles expiring snapshots on a specified schedule IMPLEMENTED, but refinement may be necessary

Currently we rely on AWS Lambda's built-in CloudWatch Events schedule of "rate(1 day)" to run daily.

It doesn't seem like it really matters whether the manager runs before, after, or even during the
creator or copier jobs.

## Simple provisioning of jobs, scheduling, and permissions TODO

- Phase 1: Permissions should be in JSON
- Phase 2: S3 source for loading jobs
- Phase 3: CloudFormation for IAM provisioning + provisioning jobs + loading function code from S3 + event scheduling

## Try to run after database dumps have completed TODO - EXTRA CREDIT

- Phase I: Job scheduling coordination; we'll at worst have the prior day's dump
- Phase II: webhook? some other trigger for the job?

## Try to run after quieting filesystem (not sure yet) TODO - EXTRA CREDIT

- Phase I: Ignore
- Phase II: fsfreeze + webhook/some other trigger for the job?

## Application-consistent snapshots TODO - EXTRA CREDIT

If the fsfreeze mentioned in the prior section is implemented, then in theory this is doable.
For Windows boxes, a VSS equivalent may be an option.
Since EBS snapshots are point-in-time, once the snapshot starts, the freeze operation can be
released, even prior to the completion of the snapshot.

[1](http://www.n2ws.com/blog/ebs-snapshots-crash-consistent-vs-application-consistent.html)

## Reporting (not sure yet) TODO

## Errors (not sure yet) TODO
