Reliably keep AWS databases stopped when not needed, to save money
You can keep an EC2 compute instance stopped as long as you want, but it's not possible to stop an RDS or Aurora database for longer than 7 days. After AWS starts your database on the 7th day, this tool automatically stops it again. Use cases:
- testing
- development
- infrequent reference
- old databases kept just in case
- vacation or leave beyond one week
If it would cost too much to keep a database running but take too long to re-create it, this tool might save you money, time, or both. AWS does not charge for database instance hours while an RDS database instance is stopped or an Aurora database cluster is stopped. (Other charges, such as for storage and snapshots, continue.)
Jump to: Get Started • Multi-Account, Multi-Region • Terraform • Security
[NEW! github.com/sqlxpert/step-stay-stopped-aws-rds-aurora is a low-code, Step Function-based implementation of the same process.]
- You do not need to set any opt-in or opt-out tags. If a database has been running continuously, it will keep running. If it was stopped for 7 days, Stay-Stopped will stop it again. The tool responds to RDS-EVENT-0154 (RDS database instance) and RDS-EVENT-0153 (Aurora database cluster); an equivalent event subscription is sketched after this list.
- Before you start a database manually, wait until it has been stopped for 10 minutes.
- Stopping stuff is inherently idempotent: keep trying until it stops! Stay-Stopped tries every 9 minutes until the database is stopped, an unexpected error occurs, or 24 hours pass. Many alternatives (including AI-generated ones from Amazon Q Developer) introduce a latent bug (a race condition) by checking status before trying to stop a database, always expecting to catch the database while it's `available`, or not waiting long enough. To understand why this matters and what can go wrong, see Perspective, below.
- It's not enough to call `stop_db_instance` or `stop_db_cluster` and hope for the best. This tool handles error cases. Look for a queue message or a log entry, in case something unexpected prevented stopping your database. Budget alerts and cost anomaly detection are still essential.
- Once in a while it's still important to start a database before its maintenance window and leave it running until the window closes.
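For illustration only, here is a boto3 sketch of an event subscription equivalent to the one the CloudFormation template deploys. The rule name and queue ARN are placeholders, and the main queue's policy must also allow `events.amazonaws.com` to send messages; this is not the template's exact resource definition.

```python
import json

import boto3

events = boto3.client("events")

# Match only the two forced-start events that Stay-Stopped responds to
events.put_rule(
    Name="StayStoppedRdsAurora-Example",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.rds"],
        "detail-type": ["RDS DB Instance Event", "RDS DB Cluster Event"],
        "detail": {"EventID": ["RDS-EVENT-0154", "RDS-EVENT-0153"]},
    }),
)

# Deliver matching events to the main SQS queue (placeholder ARN)
events.put_targets(
    Rule="StayStoppedRdsAurora-Example",
    Targets=[{
        "Id": "MainQueue",
        "Arn": "arn:aws:sqs:us-east-1:111222333444:StayStoppedRdsAurora-MainQueue",
    }],
)
```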
Click to view the architecture diagram and flowchart:
- Log in to the AWS Console as an administrator. Choose an AWS account and a region where you have an RDS or Aurora database that is normally stopped, or that you can stop now and leave stopped for 8 days.
- Create a CloudFormation stack "With new resources (standard)". Select "Upload a template file", then select "Choose file" and navigate to a locally-saved copy of stay_stopped_aws_rds_aurora.yaml [right-click to save as...]. On the next page, set:
  - Stack name: `StayStoppedRdsAurora`
- Wait 8 days, then check that your RDS or Aurora database is stopped. After clicking the RDS database instance name or the Aurora database cluster name, open the "Logs & events" tab and scroll to "Recent events". At the right, click to change "Last 1 day" to "Last 2 weeks". The "System notes" column should include the following entries, listed here from newest to oldest. There might be other entries in between.
RDS | Aurora |
---|---|
DB instance stopped | DB cluster stopped |
DB instance started | DB cluster started |
DB instance is being started due to it exceeding the maximum allowed time being stopped. | DB cluster is being started due to it exceeding the maximum allowed time being stopped. |

If you don't want to wait 8 days, see Testing, below.
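If you prefer to check from a script rather than the console, RDS's `DescribeEvents` API returns roughly the last 14 days of events. A minimal boto3 sketch; the database identifier is a placeholder, and you would use `SourceType="db-cluster"` for an Aurora cluster:

```python
import boto3

rds = boto3.client("rds")

# List recent events for one RDS database instance
response = rds.describe_events(
    SourceIdentifier="name-of-your-rds-database-instance",
    SourceType="db-instance",
    Duration=14 * 24 * 60,  # minutes; RDS keeps roughly 14 days of events
)
for db_event in response["Events"]:
    print(db_event["Date"], db_event["Message"])
```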
For reliability, Stay-Stopped works independently in each region, in each AWS account. To deploy in multiple regions and/or multiple AWS accounts:

- Delete any standalone `StayStoppedRdsAurora` CloudFormation stacks in your target regions and/or AWS accounts.
- Complete the prerequisites for creating a StackSet with service-managed permissions.
- In the management AWS account (or a delegated administrator account), create a CloudFormation StackSet. Select "Upload a template file", then select "Choose file" and upload a locally-saved copy of stay_stopped_aws_rds_aurora.yaml [right-click to save as...]. On the next page, set:
  - StackSet name: `StayStoppedRdsAurora`
- Two pages later, under "Deployment targets", select "Deploy to Organizational Units". Enter your target `ou-` identifier. Stay-Stopped will be deployed in all AWS accounts in your target OU. Toward the bottom of the page, specify your target region(s).
Terraform users are often willing to wrap a CloudFormation stack in HashiCorp Configuration Language, because AWS supplies tools in the form of CloudFormation templates. See `aws_cloudformation_stack`.
Wrapping a CloudFormation StackSet in HCL is much easier than configuring and using Terraform to deploy and maintain identical resources in multiple regions and/or AWS accounts. See `aws_cloudformation_stack_set`.
In accordance with the software license, nothing in this document establishes indemnification, a warranty, assumption of liability, etc. Use this software entirely at your own risk. You are encouraged to review the source code.
Security details...
- A least-privilege role for the AWS Lambda function.
- Least-privilege queue policies. The main queue can only consume messages from EventBridge and produce messages for the Lambda function, or for the error (dead letter) queue if there is a problem. Encryption in transit is required.
- Optional encryption at rest with the AWS Key Management Service (KMS), for the queues and the log. This can protect EventBridge events containing database identifiers and metadata, such as tags. KMS keys housed in a different AWS account, and multi-region keys, are supported.
- No data storage other than in the queues and the log, both of which have configurable retention periods.
- A retry mechanism (every 9 minutes) and limit (160 total times, which is 24 hours), to increase the likelihood that a database will be stopped as intended but prevent endless retries.
- A concurrency limit, to prevent exhaustion of available Lambda resources.
- A 24-hour event date/time expiry check, to prevent processing of accumulated stale events, if any.
- Readable Identity and Access Management policies, formatted as CloudFormation YAML rather than JSON, and broken down into discrete statements by service, resource or principal.
- Prevent people from modifying components of this tool, most of which can be identified by `StayStoppedRdsAurora` in ARNs and in the automatic `aws:cloudformation:stack-name` tag.
- Log infrastructure changes using CloudTrail, and set up alerts.
- Prevent people from directly invoking the Lambda function and from passing the function role to arbitrary functions.
- Separate production workloads. Although this tool only stops databases that AWS is starting after they've been stopped for 7 days, the Lambda function could stop any database if invoked directly, with a contrived event as input. You might choose not to deploy this tool in AWS accounts used for production, or you might add a custom IAM policy to the function role, denying authority to stop certain production databases (`AttachLocalPolicy` in CloudFormation).
  - Tagging an RDS database instance or an Aurora database cluster with `StayStopped-Exclude` (see `ExcludeTagKey` in CloudFormation) prevents the Lambda function role from being misused to stop that database. ⚠ Do not rely on attribute-based access control unless you also prevent people and systems from adding, changing and deleting ABAC tags.
- Enable the test mode only in a non-critical AWS account and region, and turn the test mode off again as quickly as possible.
- Monitor the error (dead letter) queue, and monitor the log for `ERROR`-level entries.
- Configure budget alerts and use cost anomaly detection.
- Occasionally start a database before its maintenance window and leave it running, to catch up with RDS and Aurora security updates.
Check the:
- `StayStoppedRdsAurora-LambdaFn` CloudWatch log group
  - Scrutinize log entries at the `ERROR` level: `InvalidDBInstanceState` or `InvalidDBClusterStateFault`:
    - One time: A database could not be stopped because it was in an unexpected state.
    - Multiple times for the same database: The database was in an unexpected but potentially recoverable state. Stay-Stopped retries every 9 minutes, until 24 hours have passed.
  - Log entries are JSON objects.
    - Stay-Stopped includes `"level"`, `"type"` and `"value"` keys.
    - Other software components may use different keys.
  - For more data, change the `LogLevel` in CloudFormation.
- `StayStoppedRdsAurora-ErrorQueue` (dead letter) SQS queue
  - A message in this queue means that Stay-Stopped did not stop a database, usually after trying for 24 hours.
  - The message will usually be the original EventBridge event from when AWS started the database after it had been stopped for 7 days.
  - Rarely, a message in this queue indicates that the local security configuration is denying necessary access to SQS or Lambda.
- CloudTrail "Event history"
  - CloudTrail events with an "Error code" may indicate permissions problems, typically due to the local security configuration.
  - To see more events, change "Read-only" from `false` to `true`.
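For routine monitoring, both checks can be scripted. A boto3 sketch follows; the queue name and log group name are taken from this ReadMe, but substitute the physical names from your deployment (CloudFormation may append suffixes):

```python
import boto3

sqs = boto3.client("sqs")
logs = boto3.client("logs")

# Peek at the error (dead letter) queue without removing messages
queue_url = sqs.get_queue_url(QueueName="StayStoppedRdsAurora-ErrorQueue")["QueueUrl"]
for message in sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=5,
    VisibilityTimeout=0,  # leave the messages visible to other consumers
).get("Messages", []):
    print(message["Body"])

# Look for ERROR-level entries in the Lambda function's log group
for log_event in logs.filter_log_events(
    logGroupName="/aws/lambda/StayStoppedRdsAurora-LambdaFn",
    filterPattern='"ERROR"',
)["events"]:
    print(log_event["message"])
```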
Testing details...
An RDS database instance (`db.t4g.micro`, 20 GiB of gp3 storage, 0 days' worth of automated backups) is cheaper than a typical Aurora cluster, not to mention faster to create, stop, and start.
AWS starts RDS and Aurora databases that have been stopped for 7 days, but we need a faster mechanism for realistic, end-to-end testing. Temporarily change these parameters in CloudFormation:
Parameter | Normal | Test |
---|---|---|
`Test` | `false` | `true` |
`LogLevel` | `ERROR` | `INFO` |
`QueueDelaySecs` | `540` | `60` |
→ Equivalent in minutes | 9 minutes | 1 minute |
`QueueVisibilityTimeoutSecs` | `540` | `60` |
`QueueMaxReceiveCount` | `160` | `30` |
→ Equivalent time | 24 hours | 30 minutes |
⚠ Exit test mode as quickly as possible, given the operational and security risks explained below. If your test database is ready, several minutes should be sufficient.
In test mode, Stay-Stopped responds to user-initiated, non-forced database starts, too: RDS-EVENT-0088 (RDS database instance) and RDS-EVENT-0151 (Aurora database cluster). Although it won't stop databases that are already running and remain running, ⚠ while in test mode Stay-Stopped will stop databases that you start manually. To test, manually start a stopped RDS or Aurora database.
In test mode, Stay-Stopped also receives RDS-EVENT-0088 (Aurora database instance). Internally, the code ignores it in favor of the cluster-level event.
Test mode relaxes the queue policy for the main SQS queue, allowing sources
other than EventBridge, and targets other than the Lambda function or the
error (dead letter) queue. Test by using the AWS Console to send a simulated
EventBridge event message. In the list of
SQS queues,
select StayStoppedRdsAurora-MainQueue
and then select the "Send and receive
messages" button above the list. You can:
- "Send message", or
- "Poll for messages", select a message, read it and delete it, or
- "Purge" all messages.
Edit the database names and date/time strings (must be within the past
QueueMaxReceiveCount
× QueueVisibilityTimeoutSecs
and end in Z
for
UTC)
in these test messages:
{
"detail": {
"SourceIdentifier": "Name-Of-Your-RDS-Database-Instance",
"Date": "2025-06-06T04:30Z",
"SourceType": "DB_INSTANCE",
"EventID": "RDS-EVENT-0154"
},
"detail-type": "RDS DB Instance Event",
"source": "aws.rds",
"version": "0"
}
{
"detail": {
"SourceIdentifier": "Name-Of-Your-Aurora-Database-Cluster",
"Date": "2025-06-06T04:30Z",
"SourceType": "CLUSTER",
"EventID": "RDS-EVENT-0153"
},
"detail-type": "RDS DB Cluster Event",
"source": "aws.rds",
"version": "0"
}
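If you'd rather script this step than use the console (possible only while the relaxed test-mode queue policy is in effect), a sketch like the following could send the simulated RDS event. The queue name is an assumption based on this ReadMe, and the database named in the event really will be stopped:

```python
import json
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="StayStoppedRdsAurora-MainQueue")["QueueUrl"]

event = {
    "detail": {
        "SourceIdentifier": "Name-Of-Your-RDS-Database-Instance",
        "Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ"),  # recent, UTC
        "SourceType": "DB_INSTANCE",
        "EventID": "RDS-EVENT-0154",
    },
    "detail-type": "RDS DB Instance Event",
    "source": "aws.rds",
    "version": "0",
}
sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))
```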
Depending on locally-determined permissions, you may also be able to invoke
the
StayStopped Lambda function
manually. Edit the database names and date/time strings (must be within the
past QueueMaxReceiveCount
× QueueVisibilityTimeoutSecs
and end in Z
for
UTC)
in this Lambda test event:
{
"Records": [
{
"body": "{ \"detail\": { \"SourceIdentifier\": \"Name-Of-Your-RDS-Database-Instance\", \"Date\": \"2025-06-06T04:30Z\", \"SourceType\": \"DB_INSTANCE\", \"EventID\": \"RDS-EVENT-0154\" }, \"detail-type\": \"RDS DB Instance Event\", \"source\": \"aws.rds\", \"version\": \"0\"}",
"messageId": "test-message-1-rds"
},
{
"body": "{ \"detail\": { \"SourceIdentifier\": \"Name-Of-Your-Aurora-Database-Cluster\", \"Date\": \"2025-06-06T04:30Z\", \"SourceType\": \"CLUSTER\", \"EventID\": \"RDS-EVENT-0153\" }, \"detail-type\": \"RDS DB Cluster Event\", \"source\": \"aws.rds\", \"version\": \"0\"}",
"messageId": "test-message-2-aurora"
}
]
}
After following the troubleshooting steps and ruling out local issues such as permissions — especially hidden controls such as Service and Resource control policies (SCPs and RCPs) — please report bugs. Thank you!
As noted in the Design section, many alternative solutions (including
AI-generated ones from Amazon Q Developer) introduce a latent bug (a
race condition)
by checking status before trying to stop a database, always expecting to
catch the database while it's `available`, or not waiting long enough.
About idempotence, race conditions, and latent bugs...
Let's compare two thoughtful alternative solutions, described as of May, 2025, then Stay-Stopped, and finally, a series of AI-generated solutions from June, 2025...
Stop Amazon RDS/Aurora Whenever They Start
[code]
by shimo, from the AWS in Plain English blog on Medium, comprises a single
Lambda function, which checks that the database is available
before stopping
it
(L48-L51).
If not, the code waits
(L63-L65)
and checks again
(L76-L78).
What if the database takes a long time to start? Startup "can take minutes to
hours", according to the
RDS User Guide.
Lambda has a 15-minute maximum timeout.
The function might never get a chance to request that the database be stopped.
Waiting within the Lambda function might seem wasteful, but 15 minutes costs less than 2¢ — negligible for a function triggered once per database per week. Though Lambda's maximum timeout is too short for this application, I appreciate the author's instinct for minimal infrastructure.
Stopping an Automatically Started Database Instance
[code]
by Islam Ghanim, on AWS's own Architecture Blog, uses an AWS Step Function.
Before attempting to stop the database, the state machine waits as long as
necessary for the database to become available
; long maintenance
etc.
would be accommodated. After the database finishes starting
and becomes
available
, what if someone notices and stops it manually, putting it in
stopping
status before the next status check? Barring an error, available
is the only way out of the status check loop
(stop-rds-instance-state-machine.json L30-L40).
No
state machine timeout
is defined
(L1-L4).
The Step Function would keep checking every 5 minutes for a status that won't
recur until AWS starts the database again in 7 days or, worse yet, someone
starts the database manually with the intention of using it.
What I appreciate about this author's solution is that once the stop request is made, the state machine sees it through until the database's status changes from `stopping` to `stopped`.
[ Note: The Step Function solution discussed above is not related to my own github.com/sqlxpert/step-stay-stopped-aws-rds-aurora .]
Stay-Stopped requires only one Lambda function, but inserts an SQS queue between EventBridge and Lambda. Waiting occurs outside the Lambda function. SQS counts up toward a first-time message delivery delay. Later, SQS counts up toward a message [in]visibility timeout, making it possible to periodically retry the Lambda function, with the original EventBridge event message, until the return value indicates success. If maxReceiveCount is reached instead, SQS gives up and moves the message to a dead letter queue. Between the first-time delivery delay, the [in]visibility timeout, and the receive count, SQS maintains all the state that's needed.
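With the default parameter values from the CloudFormation template, the retry cadence and the overall retry budget follow directly from the queue settings; a quick sketch of the arithmetic:

```python
# Default values from the CloudFormation template (see the Testing table)
QUEUE_DELAY_SECS = 540               # initial wait before the first attempt: 9 minutes
QUEUE_VISIBILITY_TIMEOUT_SECS = 540  # gap between retries: 9 minutes
QUEUE_MAX_RECEIVE_COUNT = 160        # deliveries before the dead letter queue

retry_window_hours = QUEUE_VISIBILITY_TIMEOUT_SECS * QUEUE_MAX_RECEIVE_COUNT / 3600
print(retry_window_hours)  # 24.0 hours of retries, every 9 minutes
```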
Given that the Lambda function receives the original event message again
and again, how does Stay-Stopped track the database's progress from starting
to available
(the only status from which it can be stopped) and then to
stopped
(or another final status)? It doesn't. One idempotent Lambda
function does the same thing each time it's invoked, avoiding the need for a
Step Function state machine.
Interestingly, the RDS API is eventually consistent, not strongly consistent. After RDS emits a "DB cluster is being started" or "DB instance is being started" event, `stop_db_cluster` or `stop_db_instance` and `describe_db_instances` might still see the database's stale `stopped` status rather than its current `starting` status. The first-time message delivery delay has been added for this reason.
Each time the Lambda function is invoked, it tries to stop the database by
calling stop_db_cluster
(for an Aurora event) or stop_db_instance
(for
RDS). Unlike a request to stop an EC2 compute instance, which succeeds even if
the EC2 instance is stopping or already stopped, a request to stop an RDS
database instance or an Aurora database cluster fails if the database is
`stopping` or already `stopped`. More importantly, it also fails if the database is in `maintenance` or another similar status, and not ready to be stopped.
- Aurora mentions the offending database status in the error message:
An error occurred (InvalidDBClusterStateFault) when calling the StopDBCluster operation: DbCluster Name-Of-Your-Aurora-Database-Cluster is in stopping state but expected it to be one of available.
There is no point in checking the status of an Aurora database, separately and non-atomically, when the goal is to stop it. Keep trying to stop it, and the error message will reveal when it is finally stopped.
- RDS, on the other hand, omits the offending database status:
An error occurred (InvalidDBInstanceState) when calling the StopDBInstance operation: Instance Name-Of-Your-RDS-Database-Instance is not in available state.
After receiving this error, the Stay-Stopped Lambda function calls
describe_db_instances
to find out the status of the RDS database. Does the fact that the stop request and the status request are always separate, non-atomic operations (with no provision for locking control of the database in between) make a race condition inevitable with RDS? As long as we always stop first and ask questions later, we have done our best.
A success response from stop_db_cluster
or stop_db_instance
is not
success for the Lambda function. Unless the database is in a final status such
as stopped
, the Lambda function
returns a batch item failure.
Batches are unlikely in this application, but partial batch responses provide
a way to provoke retries, short of raising an exception or calling
sys.exit(1)
, either of which would needlessly
provoke the shutdown and re-initialization of the Lambda runtime environment.
If someone starts the database manually after it enters `stopped` status but before the next and final retry, Stay-Stopped will stop the database another time — a race condition, yes, but documented, and not one that prevents the tool from doing its job of stopping databases! Before manually starting a database, wait until it has been stopped for 10 minutes (based on the tool's default [in]visibility timeout, 9 minutes). Or, change `FollowUntilStopped` to `false` in CloudFormation.
After finishing Stay-Stopped, I decided to check whether Amazon Q Developer might have helped with its development. This section is so dispiriting that I'm folding it. Read on, if you dare!
Details of the AI-generated solutions...
Jump to:
- An Unnecessary Call for Every Database
- The Wrong Event
- Waiting within the Lambda
- "Fixing" a Race Condition by Adding Another
- Spaghetti Code and Meatballs
- Still Unnecessary Code
- Leaving a Bug for Later
- Would Amazon Q Developer Have Helped?
Amazon Q Developer's initial response to my prompt to write a Lambda function
that keeps RDS databases stopped longer than 7 days didn't handle events at
all. It drew a list of databases from describe_db_instances
and called
stop_db_instance
on available
ones that had been created more than 7 days
ago — disaster!
The code repeated mistakes found all over the Internet and in official AWS solutions on re:Post. The following exchange is telling:
Why does the Lambda function call `list_tags_for_resource` after calling `describe_db_instances`?
[...] because AWS RDS API doesn't include tags in the response of
describe_db_instances by default. [...]
• Tags are considered metadata that's managed separately from the core instance
properties
• It helps keep the primary describe call response size smaller and more focused
• It follows AWS's general pattern of separating resource information from tag
information across many services
I knew from my own work that this had changed years ago.
According to boto3 / RDS / Client / describe_db_instances,
describe_db_instances
does include tags in its response.
You're absolutely right, and I apologize for my error. Looking at the boto3
documentation for describe_db_instances, it does indeed include tags in its
response.
I asked about events. Amazon Q Developer drafted this:
Pattern:
source:
- aws.rds
detail-type:
- 'RDS DB Instance Event'
detail:
EventCategories:
- availability
Message:
- prefix: "DB instance restarted"
- prefix: "DB instance has been restarted"
- prefix: "DB instance is available"
The last two messages don't exist, which leaves one matching event, RDS-EVENT-0006, "DB instance restarted." It occurs in situations other than when AWS starts a database that has been stopped for 7 days.
I took pity on Amazon Q Developer and mentioned RDS-EVENT-0154, "DB instance is being started due to it exceeding the maximum allowed time being stopped."
But how long does the RDS documentation say that it can take to start a stopped DB instance?
According to the AWS RDS documentation, starting a stopped DB instance
typically takes several minutes. The exact time can vary depending on the
instance size, storage type, and database engine, but generally it's in the
range of 5-10 minutes for most standard configurations.
Nonsense!
RDS User Guide / Starting an Amazon RDS DB instance that was previously stopped says "The startup process can take minutes to hours."
Looking at the AWS documentation you've linked, you're absolutely right - the
RDS startup process can take anywhere from "minutes to hours" depending on
various factors.
[...]
Given Lambda's execution time limitations (maximum of 15 minutes), we can't
reliably wait for an RDS instance to complete startup within a single Lambda
invocation. Here are our best options:
### Option 1: Step Functions State Machine
[...]
### Option 2: SQS-based Retry with Dead Letter Queue
[...]
### Option 3: EventBridge Scheduled Rule
1. When RDS-EVENT-0154 is received, tag the instance and create a new
EventBridge rule
2. The rule runs every 5-10 minutes and targets a Lambda function
3. The Lambda checks if the tagged instance is available and stops it
4. Once successful, the Lambda deletes the EventBridge rule
[...]
### Option 4: DynamoDB-based Tracking
1. Store instance IDs in DynamoDB with timestamp and attempt count
2. Run a Lambda on a schedule to check all instances in the table
3. Update attempt count and last check time after each attempt
[...]
Option 3 requires risky self-modifying infrastructure, and Option 4 is like a
"distractor" response for a test question meant to distinguish system
administration from DevOps. If a cron
job is no longer right, why would
running "a Lambda on a schedule" make sense?
Does this Lambda function have a race condition?
Looking at the Lambda function code, there is a potential race condition in how
it handles RDS instance state changes.
[...]
1. Event Processing Delay: There might be a delay between when AWS restarts the
instance and when our Lambda processes the event
2. Concurrent Executions: If multiple events for the same instance are
triggered close together, we might have concurrent Lambda executions trying to
stop the same instance
3. No State Locking: The function doesn't implement any locking mechanism to
prevent concurrent operations on the same instance
Amazon Q Developer offered to add a second describe_db_instances
call,
unwittingly adding a second race condition!
try:
instance_response = rds.describe_db_instances(DBInstanceIdentifier=source_id)
# [...]
instance = instance_response['DBInstances'][0]
instance_id = instance['DBInstanceIdentifier']
instance_status = instance['DBInstanceStatus']
# [...]
# Check if instance is available (running)
if instance_status == 'available':
logger.info(f"Instance {instance_id} is available, stopping it")
# [...]
try:
# Double-check instance state before stopping
current_state_response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
current_state = current_state_response['DBInstances'][0]['DBInstanceStatus']
# [...]
if current_state != 'available':
I claim that this approach still has a fundamental race condition. Does this article give you a hint? Making retries safe with idempotent APIs
You're absolutely right to bring up idempotency concerns. Looking at the AWS
Builder's Library article on "Making retries safe with idempotent APIs," I can
see the race condition in our approach.
[...] we need to:
1. Use a client token or request ID that remains consistent for retries of the
same logical operation
2. Track the state of operations to avoid duplicate actions
3. Handle all possible states the instance could be in, not just "available"
For example, if we see the instance is in "stopping" state, we should recognize
this as a successful previous attempt and not try to stop it again.
Does a client token really help? What if, while SQS and Lambda are waiting between invocations (during the visibility timeout), a human being notices that the DB instance is available, and stops it manually?
You've raised an excellent point about the limitations of client tokens [...]
If a human manually stops the instance while our system is waiting between
invocations during the visibility timeout, a client token wouldn't detect this
external intervention. The client token only helps ensure that our specific API
call isn't processed multiple times by AWS, but it doesn't help us detect or
handle state changes caused by external actors.
The true idempotent approach here needs to be state-based rather than token-
based. What matters is the desired end state (the instance being stopped), not
whether our specific API call was the one that caused it.
Static analysis tools are excellent sources of free advice. I'd expect the free variant of a $19 per person per month artificial intelligence bot to be smart enough to use them. pylint said of Amazon Q Developer's code,
- Too many local variables (22/15) (`too-many-locals`)
- Too many nested blocks (8/5) (`too-many-nested-blocks`)
- Too many branches (28/12) (`too-many-branches`)
At least there are comments to help. I would never have guessed the purpose of
a stop_db_instance
call bracketed by "Stopping RDS instance" and
"Successfully initiated stop" log strings.
# Stop the instance
logger.info(f"Stopping RDS instance {instance_id}")
try:
rds.stop_db_instance(DBInstanceIdentifier=instance_id)
stopped_instances.append({"id": instance_id, "event": "auto-restart"})
logger.info(f"Successfully initiated stop for instance {instance_id}")
except rds.exceptions.InvalidDBInstanceStateFault as state_error:
# Handle the case where instance state changed after our check
logger.warning(f"Instance {instance_id} state changed before stop operation could complete: {str(state_error)}")
skipped_instances.append({"id": instance_id, "reason": f"State changed during stop operation: {str(state_error)}"})
except rds.exceptions.DBInstanceNotFoundFault as not_found_error:
logger.warning(f"Instance {instance_id} not found when attempting to stop: {str(not_found_error)}")
skipped_instances.append({"id": instance_id, "reason": f"Instance not found during stop operation"})
except Exception as e:
logger.error(f"Error stopping {instance_id}: {str(e)}")
skipped_instances.append({"id": instance_id, "reason": f"Error during stop: {str(e)}"})
except Exception as e:
logger.error(f"Error processing {instance_id}: {str(e)}")
skipped_instances.append({"id": instance_id, "reason": f"Processing error: {str(e)}"})
else:
logger.info(f"DRY RUN: Would have stopped RDS instance {instance_id}")
stopped_instances.append({"id": instance_id, "event": "auto-restart", "dry_run": True})
else:
logger.info(f"Instance {instance_id} is not in 'available' state (current: {instance_status}), skipping")
skipped_instances.append({"id": instance_id, "reason": f"Not in 'available' state (current: {instance_status})"})
except Exception as e:
logger.error(f"Error processing instance {source_id}: {str(e)}")
return {
"statusCode": 500,
"message": f"Error processing instance {source_id}: {str(e)}"
}
else:
logger.info(f"Event for {source_id} is not a restart event, skipping")
return {
"statusCode": 200,
"message": "Event is not a restart event, no action taken"
}
When the goal is to stop databases that had already been stopped for 7 days,
tags cannot add any information. A previously stopped database is included,
thanks to RDS-EVENT-0154
. A continuously running database is excluded,
because no event is generated for it. (The only benefit of tags is
attribute-based access control,
which is far beyond the level of solutions typically found on the Internet or
initially proposed by Amazon Q Developer.
github.com/sqlxpert/lights-off-aws uses ABAC
and I've added it to Stay-Stopped as well. It's moot unless you broadly
restrict the right to add, change and delete ABAC tags.)
According to Amazon Q Developer, "The final solution represents a robust, production-ready approach that properly handles the complexities of keeping RDS instances stopped even after AWS automatically restarts them." The term "final solution" is sensitive and should never be used by a code generation bot. Isn't awareness of context part of intelligence? In any case, the final version still included:
EXCLUDE_TAGS = os.environ.get('EXCLUDE_TAGS', 'AutoStop=false').split(',')
# [...]
def should_exclude(tags):
"""Check if instance should be excluded based on tags"""
for tag_filter in EXCLUDE_TAGS:
if '=' in tag_filter:
key, value = tag_filter.split('=')
tag_value = get_tag_value(tags, key)
if tag_value and tag_value.lower() == value.lower():
return True
return False
# [...]
# Get instance details to check tags
response = rds.describe_db_instances(DBInstanceIdentifier=source_id)
# [...]
instance = response['DBInstances'][0]
instance_id = instance['DBInstanceIdentifier']
instance_arn = instance['DBInstanceArn']
tags = instance.get('TagList', [])
# Check if instance should be excluded based on tags
if should_exclude(tags):
# [...]
An RDS-EVENT-0154
follows. I logged it while testing Stay-Stopped. The
EventBridge to SQS to Lambda architecture affords not one but two zero-code,
zero-effort opportunities to filter based on event properties, so long as
the criteria are static. Instead of excluding databases tagged
AutoStop=false
declaratively, by adding one line of CloudFormation YAML to
the existing
Events::Rule EventPattern,
or a few lines to a new
Lambda::EventSourceMapping FilterCriteria
entry, Amazon Q Developer proceeded imperatively, adding an environment
variable, a function, and a describe_db_instances
call, comprising 14+ extra
lines of executable Python code. My earlier complaint that
describe_db_instances
does indeed return tags seems to have biased the bot
against list_tags_for_resource
, which would be appropriate this time —
if it were necessary to fetch tags and if tags made sense for this
application.
{
"version": "0",
"id": "e2a1ff83-facf-130b-0a13-852949c50174",
"detail-type": "RDS DB Instance Event",
"source": "aws.rds",
"account": "111222333444",
"time": "2025-06-08T04:54:48Z",
"region": "us-west-2",
"resources": [
"arn:aws:rds:us-west-2:111222333444:db:Name-Of-Your-RDS-Database-Instance"
],
"detail": {
"EventCategories": [
"notification"
],
"SourceType": "DB_INSTANCE",
"SourceArn": "arn:aws:rds:us-west-2:111222333444:db:Name-Of-Your-RDS-Database-Instance",
"Date": "2025-06-08T04:54:48.420Z",
"Message": "DB instance is being started due to it exceeding the maximum allowed time being stopped.",
"SourceIdentifier": "Name-Of-Your-RDS-Database-Instance",
"EventID": "RDS-EVENT-0154",
"Tags": {
"test-tag-key": "test-tag-value"
}
}
}
When I added ABAC to the Stay-Stopped Lambda function role, I took the liberty of using the same declarative CloudFormation code to condition the event rule on database tags. I was able to add support for a parameterized exclusion tag, a parameterized inclusion tag, a mix of both (databases explicitly included, and some explicitly excluded), or no tags. There is no need to add or change Lambda function Python code to support tags.
One thing I liked about the generated code at first glance was the encoding scheme for a list of tag=value pairs — even though, as explained above, tags are a distraction in this application.
EXCLUDE_TAGS = os.environ.get('EXCLUDE_TAGS', 'AutoStop=false').split(',')
# [...]
key, value = tag_filter.split('=')
Unlike the rest of the generated code, this encoding is syntactically economical,
both for the user and for the programmer. Unfortunately, the choice of =
as
a delimiter hides a bug and demonstrates that Amazon Q Developer is ignoring
AWS documentation that was available during training and is germane and
queryable now.
Tag restrictions differ from one AWS service to another, and the restrictions
are not documented for all services. RDS tag keys and values have rather
limited character sets, which are documented in
Tagging Amazon RDS resources: Tag structure in Amazon RDS,
in the RDS User Guide. =
is allowed in RDS tag keys, where it wouldn't make
much sense to humans, and in RDS tag values, where it does make sense. The
original prompt was all about RDS.
My other project,
github.com/sqlxpert/lights-off-aws ,
processes schedule expressions in tag values. For clarity to the user as well
as to the programmer, I replaced cron
's positional system with label=value
pairs like d=01 H=07 M=30
, the labels corresponding to
strftime
fields. =
can also appear in the query part of a URI. Tagging AWS resources
with links to dashboards, documentation or Jira tickets is a good practice.
The simple AutoStop=false
example works, but an unexpected error would occur
if the user followed RDS's tag rules and included a tag key or value with =
when setting `EXCLUDE_TAGS`. Looking ahead, if the generated `split`s entered a larger codebase and were reused in a broader context, debugging would become quite difficult. `=` as a delimiter is incorrect for RDS because it's allowed inside RDS tag keys and tag values. `,` as a delimiter would be incorrect for
EC2, because it's allowed inside EC2 tag keys and tag values. See
Tag your Amazon EC2 resources: Tag restrictions
in the EC2 User Guide.
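For what it's worth, one way to sidestep the delimiter problem entirely (a sketch of an alternative, not code from Stay-Stopped or from Amazon Q Developer) is to avoid in-band delimiters and encode the filter as JSON, which has its own quoting rules. `EXCLUDE_TAGS_JSON` is a hypothetical environment variable:

```python
import json
import os

# Hypothetical environment variable, e.g. '{"AutoStop": "false"}'; any characters
# that RDS allows in tag keys and values, including "=" and ",", pass through intact.
EXCLUDE_TAGS = json.loads(os.environ.get("EXCLUDE_TAGS_JSON", '{"AutoStop": "false"}'))


def should_exclude(tag_list):
    """tag_list is the TagList returned by describe_db_instances:
    [{"Key": "...", "Value": "..."}, ...]"""
    tags = {tag["Key"]: tag["Value"] for tag in tag_list}
    return any(
        tags.get(key, "").lower() == value.lower()
        for key, value in EXCLUDE_TAGS.items()
    )
```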
When I knowingly introduce code that circumscribes a documented capability, it
is my responsibility to warn the user and the programmer. Amazon Q Developer
loves to generate comments, and this would be a good use for one, although the
generated comments are typically manipulations of tokens that speak to
the "what", not to the "why". If Amazon Q Developer preemptively provided, in
the parameter description and in a code comment, a link to the RDS tag
specification and a warning that =
is not allowed in tag keys or tag values
for this application, the bot would save me time. Better yet, how about
generating reusable code and preventing future bugs by choosing delimiters
that are consistent with the rules in AWS service documentation?
No. After multiple revision rounds in which I told Amazon Q Developer about problems and solutions it should have anticipated based on AWS's own documentation, the bot's lack of depth was clear. It could have helped with the form of resource definitions, but not with correct content. If you don't know the extent of the documentation for the AWS services you use, and haven't read it yourself, you will not be able to assess the accuracy of Amazon Q Developer's claims. If you don't know distributed systems programming practices, you will not be able to assess the reliability of the code that Amazon Q Developer generates. If you don't know general programming principles, you risk accepting generated code that is long, repetitive, and hard to maintain.
I have edited my prompts for brevity and reduced the indentation of the generated code excerpts for readability. Originals are available on request. Because the goal was to see whether artificial intelligence could develop a solution from scratch, replacing an experienced human developer or at least orienting a novice, I did not provide the Stay-Stopped code as context. "You can start an entirely new project...", according to the Amazon Q User Guide. I did not find attribution information while using Amazon Q Developer. If you claim credit for any part of the generated code and would like me to acknowledge your work, please get in touch.
- Making retries safe with idempotent APIs by Malcolm Featonby, in the Amazon Builder's Library
- Idempotence: Doing It More than Once, by yours truly
- "Constant work and self-healing" in Reliability, constant work, and a good cup of coffee by Colm MacCárthaigh (another Builder's Library article)
Stopping a cloud database is not so simple; it's a distributed computing problem. Each professional who tackles a complex problem contributes a piece of the puzzle. By publishing our work on an open-source basis, we can learn from each other. Please get in touch with ideas for improving Stay-Stopped!
Thank you to:
- Andrew, who asked the question that led me to develop Stay-Stopped. User feedback matters!
- shimo and Islam Ghanim, developers who open-sourced alternative solutions.
- Corey, who shared Stay-Stopped with the community in Last Week in AWS Newsletter issue 427 (June 16, 2025).
Scope | Link | Included Copy |
---|---|---|
Source code, and source code in documentation | GNU General Public License (GPL) 3.0 | LICENSE-CODE.md |
Documentation, including this ReadMe file | GNU Free Documentation License (FDL) 1.3 | LICENSE-DOC.md |
Copyright Paul Marcelin
Contact: marcelin at cmu.edu (replace "at" with `@`)