# Jobs

## v1/current

### ebs-snapshot-creator.py v1

- Creates primary snapshots (located in the same region as an instance).
- Runs once daily (but can be run more often if desired)
- A creator should be running in every region that has source instances
  that have EBS volumes and are (or may be) configured to get snapshots
- Features
  - Look for instances (in the same region the Lambda is running in) that have
    backups (snapshots) enabled
  - Find all connected EBS volumes on relevant instances
  - Tell AWS to create a snapshot of all found EBS volumes
  - Look for backup retention configuration on each relevant instance
  - If backup retention configuration is found, use it to determine and set
    the expiration date on the created snapshots
  - If no backup retention configuration is found, use a default of 7 days and
    set the expiration date on the created snapshots
  - For all created snapshots, set a tag indicating they were
    automatically created by this tool
- Differentiators
  - Designed to run 100% from AWS Lambdas
  - Automatically handles snapshot expiration, not just creation
  - Entirely configured on a per-instance basis using AWS's built-in tags
  - No external configuration files
  - No hardcoded configuration
  - No external database / file storage dependencies
  - Easy to customize per-instance configuration
  - Easy to manage per-instance configuration in orchestration tools
- Uses the following tags:
  - TBD
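The feature list above might look roughly like the following in code. This is a minimal sketch assuming boto3 and the tag names discussed later in this document (Backup, Retention, DeleteOn, Type); the function names and structure are illustrative, not the actual implementation:

```python
import datetime

DEFAULT_RETENTION_DAYS = 7  # default used when no Retention tag is found


def compute_delete_on(tags, today):
    """Pure helper: derive the DeleteOn date from an instance's tags."""
    days = int(tags.get("Retention", DEFAULT_RETENTION_DAYS))
    return (today + datetime.timedelta(days=days)).strftime("%Y-%m-%d")


def create_snapshots(region):
    """Snapshot every EBS volume attached to instances tagged Backup=Yes."""
    import boto3  # deferred so the pure helper above stays importable anywhere
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:Backup", "Values": ["Yes"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            delete_on = compute_delete_on(tags, datetime.date.today())
            for mapping in instance.get("BlockDeviceMappings", []):
                ebs = mapping.get("Ebs")
                if not ebs:
                    continue  # skip non-EBS (instance store) devices
                snap = ec2.create_snapshot(VolumeId=ebs["VolumeId"])
                ec2.create_tags(
                    Resources=[snap["SnapshotId"]],
                    Tags=[
                        {"Key": "DeleteOn", "Value": delete_on},
                        {"Key": "Type", "Value": "Automated"},
                    ],
                )
```

Keeping the expiration-date math in a pure helper makes it easy to test without an AWS account.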

### ebs-snapshot-manager.py v1

- Deletes snapshots (located in the same region the Lambda is running in)
  that have hit their expiration date
- Features
  - Looks for snapshots (located in the same region the Lambda is running
    in) that have hit their expiration date and have a tag indicating
    they were created by this tool.
  - Pulls all needed information from the snapshots themselves (which are, in
    turn, automatically tagged/configured from instance configuration)
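The manager's selection logic could be sketched as follows, again assuming boto3 and the tags described later (Type, DeleteOn, KeepForever); names are illustrative:

```python
import datetime


def is_expired(tags, today):
    """Pure helper: True if a snapshot's tags say it is due for deletion."""
    if tags.get("Type") != "Automated":
        return False  # only touch snapshots this tool created
    if "KeepForever" in tags:
        return False  # operator asked for indefinite retention
    delete_on = tags.get("DeleteOn")
    if not delete_on:
        return False
    return datetime.datetime.strptime(delete_on, "%Y-%m-%d").date() <= today


def delete_expired_snapshots(region):
    """Delete automated snapshots whose DeleteOn date has passed."""
    import boto3  # deferred so the pure helper above stays importable anywhere
    ec2 = boto3.client("ec2", region_name=region)
    snapshots = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "tag:Type", "Values": ["Automated"]}],
    )["Snapshots"]
    for snapshot in snapshots:
        tags = {t["Key"]: t["Value"] for t in snapshot.get("Tags", [])}
        if is_expired(tags, datetime.date.today()):
            ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
```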

## v2

- Differentiators (beyond those already noted in v1)
  - No inter-region dependencies
  - All snapshot creation for instances in a given region is handled by a
    Lambda in the same region
  - Service problems in one region do not impact creation jobs in other
    regions
  - Multi-region snapshot creation (and replication, using the new copier
    job) enabled without centralizing job management in one region
  - All jobs operate autonomously within a region
  - All jobs operate autonomously across regions

### ebs-snapshot-creator.py v2

- No code changes over v1 other than tag adjustments to accommodate cleaner
  tag naming/design.
- Usage instructions added on how to use in situations where instances are
  running in multiple regions (hint: run a creator job in each relevant
  region)

### ebs-snapshot-manager.py v2

- No code changes over v1 other than tag adjustments to accommodate cleaner
  tag naming/design.
- Usage instructions added on how to use in situations where instances are
  running in multiple regions (hint: run a manager job in each relevant
  region)
- Usage instructions added on how to use in situations where snapshots are
  enabled for replication into additional regions (hint: run a manager job
  in each relevant region)

### ebs-snapshot-copier.py v2

- Copies any snapshots (located in the same region the Lambda is running in)
  that are enabled for copying into additional regions, into those
  additional specified regions
- Runs once daily (but can be run more often if desired)
- A copier should be running in every region that has primary snapshots /
  source instances that are (or may be) configured for replication into
  additional regions
- Features:
  - Pulls all configuration from the snapshots themselves (which are, in turn,
    automatically tagged/configured from instance configuration)
  - Only copies snapshots that are in a completed state, to avoid creating
    unusable snapshots
  - Does not copy non-primary snapshots (i.e. those not located in the same
    region as an instance), to prevent copying snapshots that have already
    been copied (which would otherwise create infinite copying loops)
  - Replicates primary snapshot configuration (e.g. expiration dates)
  - Retains sufficient information to trace snapshot copies back to their
    original (primary) snapshots and source instances (because AWS snapshot
    copying processes do not retain this information and, worse, set bogus
    VolumeIds on out-of-region copied snapshots)
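The copier's selection rules above can be sketched roughly as follows, assuming boto3 and the Source_SnapshotId tag described later; function names and the snapshot-dict shape follow boto3's describe_snapshots output, but the overall structure is illustrative:

```python
def should_copy(snapshot, already_copied_ids):
    """Pure helper deciding whether a primary snapshot needs copying.

    snapshot: dict shaped like a boto3 describe_snapshots entry.
    already_copied_ids: set of Source_SnapshotId values seen in the target region.
    """
    tags = {t["Key"]: t["Value"] for t in snapshot.get("Tags", [])}
    if snapshot.get("State") != "completed":
        return False  # copying an in-progress snapshot yields an unusable copy
    if "Source_SnapshotId" in tags:
        return False  # already a copy itself; never re-copy (avoids copy loops)
    return snapshot["SnapshotId"] not in already_copied_ids


def copy_snapshots(source_region, dest_region, snapshots, already_copied_ids):
    """Copy eligible primary snapshots into the destination region."""
    import boto3  # deferred so the pure helper above stays importable anywhere
    ec2 = boto3.client("ec2", region_name=dest_region)
    for snapshot in snapshots:
        if not should_copy(snapshot, already_copied_ids):
            continue
        copy = ec2.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot["SnapshotId"],
            Description=snapshot.get("Description", ""),
        )
        # Replicate the original tags and record where the copy came from,
        # since AWS does not carry tags or source references over.
        tags = list(snapshot.get("Tags", []))
        tags.append({"Key": "Source_SnapshotId", "Value": snapshot["SnapshotId"]})
        ec2.create_tags(Resources=[copy["SnapshotId"]], Tags=tags)
```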

## v2 - original brainstorming

### ebs-snapshot-creator.py

- Creates primary snapshots (located in the same region as an instance).
- Runs once daily (but can be run more often if desired)
- Uses the following tags:
  - TBD

### ebs-snapshot-copier.py

Copies primary snapshots to an additional region.
Uses the following tags:

- Snapshots: { PrimarySnapshotID (set), Description (copied), DeleteOn (copied), Type (copied) }
- Instances: { SnapshotsEnabled, SnapshotsRetention, SnapshotsExtraRegion1 }

### ebs-snapshot-manager.py

----

# Tags - Tracking

## Snapshots

| Tag | Value | Set by |
| --- | --- | --- |
| Description | instance_name (instance_id) - vol_id (dev_name) - source_region | creator |
| DeleteOn | YYYY-mm-dd | creator |
| Type | Automated | creator |
| Source_SnapshotId | Source SnapshotId in source region (if applicable) | copier |

(An earlier draft Description format, instance_name - vol_id (dev_name), is superseded by the above.)

# Tags - Configuration

## Snapshots

| Tag | Value | Notes |
| --- | --- | --- |
| KeepForever | Yes | Set by operator/user if snapshot is to be retained outside of the retention schedule |

## Instances

| Tag | Value | Notes |
| --- | --- | --- |
| Backup | Yes | Whether to snapshot (back up) volumes attached to this instance |
| Retention | # of days (7 is the default if not specified) | The number of days to retain snapshots for volumes attached to this instance |
| Copy_Dest | Name of an AWS region | An additional region to copy snapshots to for volumes attached to this instance |
| Name | (whatever) | Used in snapshot description |

----

# Functionality

## Overview & Status

| Item | Status |
| --- | --- |
| Not requiring the running of a management server instance to host, run, or trigger jobs | IMPLEMENTED |
| Simple configuration based 100% on tagging of instances and snapshots | IMPLEMENTED, mostly |
| Creating snapshots | IMPLEMENTED |
| Crash-consistent snapshots | IMPLEMENTED |
| Copying snapshots to another region | IMPLEMENTED, partially |
| Expiring snapshots after specified period passes | IMPLEMENTED |
| Retaining select snapshots indefinitely | IMPLEMENTED |
| Managing snapshots copied to another region in same way as original snapshots | IN PROGRESS |
| Automatically running job that handles creating snapshots on a specified schedule | IMPLEMENTED |
| Automatically running job that handles copying snapshots to another region on a specified schedule | IMPLEMENTED, but refinement necessary |
| Automatically running job that handles expiring snapshots on a specified schedule | IMPLEMENTED, but refinement may be necessary |
| Simple provisioning of jobs, scheduling, and permissions | TODO |
| Try to run after database dumps have completed | TODO - EXTRA CREDIT |
| Try to run after quieting filesystem (not sure yet) | TODO - EXTRA CREDIT |
| Application-consistent snapshots | TODO - EXTRA CREDIT |
| Reporting (not sure yet) | TODO |
| Errors (not sure yet) | TODO |

## Not requiring the running of a management server instance to host, run, or trigger jobs

It is implemented as a set of Python-based functions intended to run in AWS Lambda (which also handles
the job scheduling). This makes it self-contained and easier to set up, without any external resources
needed.

## Simple configuration based 100% on tagging of instances and snapshots

Easily readable (by humans and machines) AWS object tags are used to track as well as configure all
aspects of the jobs. This permits easy management and changes without adding anything outside of
AWS or relying on external configuration files or hardcoded configuration of any sort.

## Creating snapshots

The creator job (implemented in ebs-snapshot-creator.py) creates the primary snapshots (the ones in
the same region as the instance volumes they are associated with). Instances which should have their
volumes backed up are configured by adding a tag with the name "Backup" and the value "Yes" to the
instance. All EBS volumes attached to a backup-enabled instance will have snapshots
taken of them.

## Crash-consistent snapshots

EBS snapshots are point-in-time backups of the data on an EBS volume. This means that the snapshots
are exact copies of the data frozen at a specific point in time (i.e. all data across the volume is
"snapshotted" at a single point in time for consistency across the filesystem).

HOWEVER, there is a gotcha: If the instance is running (and it generally is for folks relying on
snapshots as part of their backup strategy), EBS snapshots are crash-consistent. That is, whatever
is in memory on the instance is lost. It's as if someone pulled out the power cord of the computer,
pulled the volume out and copied it (the snapshot), then turned the computer back on. Of course,
in EC2 your instance will just keep running, so triggering EBS snapshots against running instances
is only *as if* someone did this.

For some situations this is good enough. For others, it isn't.

Modern filesystems with journaling support attempt to recover from and deal with issues such as what
happens when a "power cord is pulled". The same goes for database systems with recovery mechanisms
for similar situations. This, however, does not mean that data isn't lost. It simply means that
the filesystems and databases can return to a functioning, reliable, and probably good working state.
The cost of getting to this state is that a bit of data (for a narrow time window) may be tossed out
if it's suspect or incomplete.

Concerns can often be worked around by doing things like running database dumps via cron (these are
just static files on disk representing the contents of the database, which are then effectively
guaranteed to be in a quiet/frozen state when the EBS snapshot runs, whereas the raw live database
data files themselves are *not* likely to be).

Crash-consistent backups, and in turn EBS snapshots of running instances, are better than just copying
files (well, doing both is perhaps even better). The reason is that EBS snapshots are still
point-in-time based, ensuring consistency across the filesystem, whereas simply copying files has
time delays between when each file is copied.

Interesting links:

- [1](http://www.n2ws.com/blog/ebs-snapshots-crash-consistent-vs-application-consistent.html)
- [2](https://www.veeam.com/blog/how-to-create-a-consistent-vm-backup.html)
- [3](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html)

## Copying snapshots to another region

The copier job (implemented in ebs-snapshot-copier.py) copies completed primary snapshots (the ones
in the same region as the instance volumes they are associated with) to additional regions (only one
additional region is currently supported). Instances whose snapshots should be copied to another
region are configured by adding a tag named "Auto_Snapshot_Copy1" with the value "us-east-1"
or a similar region name. All volumes attached to a backup-enabled instance with this parameter set to
a valid region name will have their snapshots copied to this additional region.

Snapshots in the primary region are checked for counterparts in other regions (by looking for snapshots
in the other regions with a tag named "Source_SnapshotId" whose value equals the primary snapshot's
SnapshotId)[1].
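The counterpart check can be done with a small pure helper over the snapshot descriptions returned from the other region (a sketch; the function name is illustrative):

```python
def copied_source_ids(snapshots_in_dest_region):
    """Collect Source_SnapshotId tag values from snapshots in a destination
    region, so primary snapshots that already have counterparts can be skipped.

    snapshots_in_dest_region: list of dicts shaped like boto3
    describe_snapshots entries.
    """
    ids = set()
    for snapshot in snapshots_in_dest_region:
        for tag in snapshot.get("Tags", []):
            if tag["Key"] == "Source_SnapshotId":
                ids.add(tag["Value"])
    return ids
```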

Since we cannot copy primary snapshots that are in progress, we attempt to work around that issue in four
ways:

- the copier job is independent of the creator job
- we schedule the copier job to run a fair bit after the creator job
- we only copy snapshots in a completed state
- we pick up snapshots that were still not complete in the next job run

[1] We don't just compare snapshot IDs or even volume IDs from the snapshots because AWS's copy-snapshot
function doesn't carry these over; the copied snapshot IDs are different than their originals and the
volume IDs are bogus (I believe this is because internally they are actually copied to intermediate
volumes for copying purposes only).

## Expiring snapshots after specified period passes

The retention period for snapshots of a given instance is specified by adding a tag to the instance with
the name "Retention" and a value of the number of days to retain snapshots for. The default is 7 days if
not specified.

The creator job looks for this Retention tag and, based on its setting (or using the default of 7 days if
it's not specified), sets a tag named "DeleteOn" on each snapshot at creation time, which specifies
the date until which that snapshot is to be kept.

The manager job looks for snapshots tagged for deletion (per the "DeleteOn" tag) on the date it is
running. It also, for safety, ignores snapshots not tagged as being automated (tag Type=Automated) as
well as those tagged with the tag "KeepForever".

The copier job doesn't do anything special for expiration, other than replicating the same tags (e.g.
DeleteOn) and their values on all snapshot copies as they were set on the original snapshots.

## Retaining select snapshots indefinitely

A user/operator (or third-party script) can add a tag with the name "KeepForever" to any automated
snapshot and it will be retained indefinitely, regardless of the expiration/retention configuration
for the instance that the snapshot is associated with.

The manager job looks for snapshots tagged as "KeepForever" and ignores them for processing entirely.

The copier job replicates the KeepForever tag (if set) on primary snapshots to their counterparts
in other regions (although this is not done in real time, so it is best for the user/operator to
set it manually on the copies too if they want to be certain).

## Managing snapshots copied to another region in same way as original snapshots IN PROGRESS

The copier job makes sure that snapshots that are copied indicate which primary snapshots they are
duplicates of by way of adding a tag named Source_SnapshotId with the relevant primary SnapshotId (we
have to do this because snapshot copies in other regions do not have relevant associated volume IDs nor
any other reference to the source snapshots, so we create our own reference).

The copier job also duplicates the other key tags (DeleteOn and KeepForever).

The manager job checks for instances that have additional regions configured for copying, then
processes snapshots in those regions in the same way as the primary instance region.

## Automatically running job that handles creating snapshots on a specified schedule IMPLEMENTED, but refinement necessary

Currently we rely on a CloudWatch Events scheduled event of "rate(1 day)" to run daily.

To ensure the creator job runs before the copier, we may need to be more specific with a time of day.

## Automatically running job that handles copying snapshots to another region on a specified schedule IMPLEMENTED, but refinement necessary

Currently we rely on a CloudWatch Events scheduled event of "rate(1 day)" to run daily.

To ensure the copier job runs well after the creator job, we may need to be more specific with a time of day.
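If time-of-day staggering is needed, the rate expressions could be replaced with cron-style schedule expressions (evaluated in UTC). The specific times below are illustrative, not a recommendation:

```
creator: cron(0 1 * * ? *)   # 01:00 UTC daily
copier:  cron(0 3 * * ? *)   # 03:00 UTC daily, well after the creator
manager: cron(0 5 * * ? *)   # 05:00 UTC daily
```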

## Automatically running job that handles expiring snapshots on a specified schedule IMPLEMENTED, but refinement may be necessary

Currently we rely on a CloudWatch Events scheduled event of "rate(1 day)" to run daily.

It doesn't seem like it really matters whether the manager runs before, after, or even during the creator or
copier jobs.

## Simple provisioning of jobs, scheduling, and permissions TODO

- Phase 1: Permissions should be in JSON
- Phase 2: S3 source for loading jobs
- Phase 3: CloudFormation for IAM provisioning + provisioning jobs + loading function code from S3 + event scheduling
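For Phase 1, a minimal IAM policy for the jobs' execution role might look something like the following sketch (the exact action list would need to be verified against what each job actually calls, and could be tightened with resource constraints):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:CopySnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeTags"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```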

## Try to run after database dumps have completed TODO - EXTRA CREDIT

- Phase I: Job scheduling coordination + we'll at worst have the prior day's dump
- Phase II: webhook? some other trigger for the job?

## Try to run after quieting filesystem (not sure yet) TODO - EXTRA CREDIT

- Phase I: Ignore
- Phase II: fsfreeze + webhook/some other trigger for the job?

## Application-consistent snapshots TODO - EXTRA CREDIT

If the fsfreeze mentioned in the prior section is implemented, then in theory this is doable.
For Windows boxes, a VSS equivalent may be an option.
Since EBS snapshots are point-in-time, once the snapshot starts, the freeze operation can be released, even prior to the completion of the snapshot.

- [1](http://www.n2ws.com/blog/ebs-snapshots-crash-consistent-vs-application-consistent.html)
| 334 | + |
| 335 | +## Reporting (not sure yet) TODO |
| 336 | +## Errors (not sure yet) TODO |