Commit ce68572

updated documentation (#143)
1 parent 63a409a commit ce68572

19 files changed

+228
-70
lines changed

documentation/DCP-documentation/_toc.yml

Lines changed: 7 additions & 2 deletions
@@ -1,8 +1,12 @@
11
# Table of contents
22

33
format: jb-book
4-
root: overview
54
parts:
5+
- caption: FAQ
6+
chapters:
7+
- file: overview
8+
- file: overview_2
9+
- file: costs
610
- caption: Running DCP
711
chapters:
812
- file: step_0_prep
@@ -15,7 +19,8 @@ parts:
1519
- file: step_4_monitor
1620
- caption:
1721
chapters:
22+
- file: dashboard
1823
- file: advanced_configuration
1924
- file: AWS_hygiene_scripts
2025
- file: troubleshooting_runs
21-
- file: versions
26+
- file: versions
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
# What does Distributed-CellProfiler cost?
2+
3+
Distributed-CellProfiler is run by a series of three commands, only one of which incurs costs at typical scale of usage:
4+
5+
[`setup`](step_1_configuration.md) creates a queue in SQS and a cluster, service, and task definition in ECS.
6+
ECS is entirely free.
7+
SQS queues are free to create and use up to 1 million requests/month.
8+
9+
[`submitJobs`](step_2_submit_jobs.md) places messages in the SQS queue, which is free (under 1 million requests/month).
10+
11+
[`startCluster`](step_3_start_cluster.md) is the only command that incurs costs: it initiates your spot fleet request, creates machine alarms, and optionally creates a run dashboard.
12+
13+
The spot fleet is the major cost of running Distributed-CellProfiler; its exact price depends on the number of machines, the type of machines, and the duration of use.
14+
Your bid is configured in the [config file](step_1_configuration.md).
15+
16+
Spot fleet costs can be minimized/stopped in multiple ways:
17+
1) We encourage the use of [`monitor`](step_4_monitor.md) during your job to help minimize the spot fleet cost: it automatically scales down your spot fleet request as your job queue empties and cancels the request when no jobs remain in the queue.
18+
Note that you can also have monitor downscale your fleet more aggressively by engaging Cheapest mode (see [`more information here`](step_4_monitor.md)).
19+
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to perform the same cleanup (without the automatic scaling).
20+
3) If you want to abort and clean up a run, you can purge your SQS queue in the [AWS SQS console](https://console.aws.amazon.com/sqs/) (by selecting your queue and pressing Actions => Purge) and then initiate [`monitor`](step_4_monitor.md) to perform the same cleanup.
21+
4) You can stop the spot fleet request directly in the [AWS EC2 console](https://console.aws.amazon.com/ec2/) by going to Instances => Spot Requests, selecting your spot request, and pressing Actions => Cancel Spot Request.
22+
23+
After the spot fleet has started, a Cloudwatch instance alarm is automatically placed on each instance in the fleet.
24+
Cloudwatch instance alarms [are currently $0.10/alarm/month](https://aws.amazon.com/cloudwatch/pricing/).
25+
Cloudwatch instance alarm costs can be minimized/stopped in multiple ways:
26+
1) If you run monitor during your job, then once an hour while it is running and again at the end of the run it will automatically delete Cloudwatch alarms for any instance that is no longer in use.
27+
2) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to delete Cloudwatch alarms for any instance that is no longer in use.
28+
3) In the [AWS Cloudwatch console](https://console.aws.amazon.com/cloudwatch/) you can select unused alarms by going to Alarms => All alarms. Change Any State to Insufficient Data, select all alarms, and then choose Actions => Delete.
29+
4) We provide a [hygiene script](hygiene.md) that will clean up old alarms for you.
30+
31+
Cloudwatch Dashboards [are currently free](https://aws.amazon.com/cloudwatch/pricing/) for 3 Dashboards with up to 50 metrics per month and are $3 per dashboard per month after that.
32+
Cloudwatch Dashboard costs can be minimized/prevented in multiple ways:
33+
1) You can choose not to have Distributed-CellProfiler create a Dashboard by setting `CREATE_DASHBOARD = 'False'` in your [config file](step_1_configuration.md).
34+
2) We encourage the use of [`monitor`](step_4_monitor.md) during your job: if you have set `CLEAN_DASHBOARD = 'True'` in your [config file](step_1_configuration.md), it will automatically delete your Dashboard when your job is done.
35+
3) If your job is finished, you can still initiate [`monitor`](step_4_monitor.md) to perform the same cleanup (without the automatic scaling).
36+
4) You can manually delete Dashboards in the [Cloudwatch Console](https://console.aws.amazon.com/cloudwatch/) by going to Dashboards, selecting your Dashboard, and selecting Delete.
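
As a rough illustration of the Cloudwatch prices quoted above, the sketch below estimates a monthly bill from a count of alarms and dashboards. The function and constant names are hypothetical and not part of Distributed-CellProfiler. It assumes every alarm and dashboard is left in place for a full month, ignores the 50-metric condition on the free dashboards, and does not include spot fleet costs.

```python
# Prices taken from the text above, in cents to avoid floating-point rounding.
ALARM_CENTS_PER_MONTH = 10        # $0.10 per alarm per month
DASHBOARD_CENTS_PER_MONTH = 300   # $3 per dashboard per month beyond the free ones
FREE_DASHBOARDS = 3               # first 3 dashboards are free


def estimate_cloudwatch_cents(n_alarms: int, n_dashboards: int) -> int:
    """Return an estimated monthly Cloudwatch cost in cents (hypothetical helper)."""
    if n_alarms < 0 or n_dashboards < 0:
        raise ValueError("counts must be non-negative")
    alarm_cost = n_alarms * ALARM_CENTS_PER_MONTH
    billable_dashboards = max(0, n_dashboards - FREE_DASHBOARDS)
    return alarm_cost + billable_dashboards * DASHBOARD_CENTS_PER_MONTH


if __name__ == "__main__":
    cents = estimate_cloudwatch_cents(n_alarms=100, n_dashboards=4)
    print(f"Estimated Cloudwatch cost: ${cents / 100:.2f} per month")
```

With these assumptions, 100 leftover alarms cost more than one extra dashboard, which is why cleaning up alarms after a run matters.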
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
1+
# AWS Cloudwatch Dashboard
2+
![Cloudwatch Dashboard Overview](images/dashboard_overview.png)
3+
4+
AWS Cloudwatch Dashboards are “customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view and create customized views of the metrics and alarms for your AWS resources.”
5+
A Dashboard is full of widgets, each of which you create and customize to report on a separate AWS metric.
6+
Distributed-CellProfiler has the option to auto-create a Cloudwatch Dashboard for each run and the option to clean it up when you are done.
7+
These options are set in [your config file](step_1_configuration.md).
8+
9+
The Dashboard setup that Distributed-CellProfiler auto-populates is helpful for monitoring a run as it is occurring or for a post-mortem to better understand a previous run.
10+
Some things you can see include: whether your machines are sized appropriately for your jobs, how stable your spot fleet is, and whether your jobs are failing and, if so, whether they’re failing in a consistent manner.
11+
All told, this can help you understand and optimize your resource usage, thus saving you time and money.
12+
13+
## FulfilledCapacity
14+
![Fulfilled Capacity widget](images/fulfilledcapacity.png)
15+
16+
This widget shows the number of machines in your spot fleet that are fulfilled, i.e. how many machines you actually have at any given point.
17+
After a short spin-up period at the start of a run, you hope to see a straight line at the number of machines requested in your fleet and then a steady decrease at the end of a run as monitor scales your fleet down to match the remaining jobs.
18+
19+
Some number of small dips are all but inevitable as machines crash and are replaced or AWS takes some of your capacity and gives it to a higher bidder.
20+
However, every time there is a dip, it means that a machine that was running a job is no longer running it and any progress on that job is lost.
21+
The job will hang out as “Not Visible” in your SQS queue until it reaches the amount of time set by SQS_MESSAGE_VISIBILITY in [your config file](step_1_configuration.md).
22+
For quick jobs, this doesn’t have much of an impact, but for jobs that take many hours, this can be frustrating and potentially expensive.
23+
24+
If you’re seeing lots of dips or very large dips, you may be able to prevent this in future runs by 1) requesting a different machine type, 2) bidding a larger amount for your machines, or 3) changing regions.
25+
You can also check whether dips coincide with AWS outages, in which case there’s nothing you can do; it’s just bad luck (that’s what happened with the large dip in the example above).
26+
27+
## NumberOfMessagesReceived/Deleted
28+
29+
![NumberofMessagesReceived/Deleted](images/messages_deleted_received.png)
30+
31+
This widget shows you in bulk whether your jobs are completing or erroring.
32+
NumberOfMessagesDeleted shows messages deleted from the queue after the job has successfully completed.
33+
NumberOfMessagesReceived shows both messages that are deleted from the queue and messages that are put back in the queue because they errored.
34+
You hope to see that the two lines track on top of each other because that means no messages are erroring.
35+
If there are often gaps between the lines then it means a fraction of your jobs are erroring and you’ll need to figure out why (see MemoryUtilization and Show Errors or look directly in your Cloudwatch Logs for insights).
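
To put a number on the gap between the two lines, the sketch below computes the share of received messages that were not deleted. The function name is hypothetical and not part of Distributed-CellProfiler. It assumes you have read totals for NumberOfMessagesReceived and NumberOfMessagesDeleted off the widget for a window long enough that most started jobs have also finished.

```python
def errored_fraction(received: int, deleted: int) -> float:
    """Estimate the fraction of received messages that went back to the queue.

    Hypothetical helper: `received` and `deleted` are totals read from the
    NumberOfMessagesReceived and NumberOfMessagesDeleted lines.
    """
    if received < 0 or deleted < 0:
        raise ValueError("counts must be non-negative")
    if received == 0:
        return 0.0
    # In a short window, deletions can outnumber receipts (jobs started
    # earlier finish now), so clamp the difference at zero.
    not_deleted = max(0, received - deleted)
    return not_deleted / received


if __name__ == "__main__":
    fraction = errored_fraction(received=200, deleted=150)
    print(f"{fraction:.0%} of received messages were not deleted")
```

A value near zero corresponds to the two lines tracking on top of each other. Messages still in progress at the end of the window also count as not deleted, so treat the result as an upper bound on the error rate.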
36+
37+
## MemoryUtilization
38+
39+
![Memory Utilization](images/memoryutilization.png)
40+
41+
Insufficient memory is the error that we most often encounter (as we try to use the smallest machines possible for economy’s sake) so we like to look at memory usage.
42+
Note that this is showing memory utilization in bulk for your cluster, not for individual machines.
43+
Because different machines reach memory intensive steps at different points in time, and because we’re looking at an average across 5 minute windows, the max percentage you see is likely to be much less than 100%, even if you are using all the memory in your machines at some points.
44+
45+
## MessagesVisible/NotVisible
46+
47+
![MessagesVisible/NotVisible](images/messages_change_slope.png)
48+
49+
Visible messages are messages waiting in your queue.
50+
Hidden messages (aka MessagesNotVisible) have been started and will remain hidden until either they are completed and therefore removed from the queue or they reach the time set in SQS_MESSAGE_VISIBILITY in your config file, whichever comes first.
51+
([Read more about Message Visibility](SQS_QUEUE_information.md).)
52+
After starting your fleet (and waiting long enough for at least one round of jobs to complete), you hope to see a linear decline in total messages with the number of hidden messages equal to the number of jobs being run (fleet size * tasks per machine * docker cores).
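
The expected number of hidden messages can be sketched as a one-line calculation. The function and parameter names below are hypothetical stand-ins for the fleet size, tasks per machine, and docker cores you set in your config file. The optional cap reflects the assumption that there cannot be more hidden messages than messages left in the queue.

```python
from typing import Optional


def expected_hidden_messages(
    fleet_size: int,
    tasks_per_machine: int,
    docker_cores: int,
    messages_remaining: Optional[int] = None,
) -> int:
    """Number of jobs a fully fulfilled fleet runs at once (hypothetical helper)."""
    capacity = fleet_size * tasks_per_machine * docker_cores
    if messages_remaining is not None:
        # Near the end of a run there are fewer jobs left than job slots.
        capacity = min(capacity, messages_remaining)
    return capacity


if __name__ == "__main__":
    # 10 machines, 2 tasks per machine, 4 docker cores per task
    print(expected_hidden_messages(10, 2, 4))
```

If the hidden-message line sits well below this value in the middle of a run, it is worth checking FulfilledCapacity for lost machines.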
53+
54+
![Blip in MessagesVisible/NotVisible](images/blip_in_messagesnotvisible.png)
55+
56+
Sometimes you’ll see a blip where there is a rapid increase in the number of hidden messages (as pictured above).
57+
This can happen if there is an error on a machine and the hard disk gets full: the machine rapidly pulls jobs and puts them back until the error is caught and the machine is rebooted.
58+
This type of error shows in this widget as it happens.
59+
60+
If your spot fleet loses capacity (see FulfilledCapacity), you may see a blip in MessagesVisible/NotVisible where the number of hidden messages rapidly decreases.
61+
This appears in the widget only after the amount of time set in SQS_MESSAGE_VISIBILITY in your config file has elapsed following the capacity loss, when jobs that were started (i.e. hidden) but not completed return to visible status.
62+
63+
The relative slope of your graph can also be informative.
64+
For the run pictured at top, we discovered that a fraction of our jobs were erroring because the machines were running out of memory.
65+
Midway through 7/12 we upped the memory of the machines in our fleet, and from that point on you can see a steeper slope as more jobs were finishing in the same amount of time (because fewer were failing with memory errors).
66+
67+
## Distinct Logs
68+
69+
![Logs comparison](images/logs_comparison.png)
70+
71+
This widget shows you the number of different specific jobs that start within your given time window by plotting the number of Cloudwatch logs that have your run command in them.
72+
In this example, our run command is "cellprofiler -c".
73+
It is not necessarily informative on its own, but very helpful when compared with the following widget.
74+
75+
## All logs
76+
This widget shows you the number of total times that jobs are started within your log group within the given time window.
77+
Ideally, you want this number to match the number in the previous widget as it means that each job is starting in your software only once.
78+
79+
If this number is consistently larger than the previous widget’s number, it could mean that some of your jobs are erroring and you’ll need to figure out why (see MemoryUtilization and Show Errors or look directly in your Cloudwatch Logs for insights).
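
The comparison between the two log widgets reduces to a subtraction. The sketch below uses a hypothetical helper name and assumes both counts were read from the widgets over the same time window.

```python
def repeated_job_starts(all_logs: int, distinct_logs: int) -> int:
    """Return how many job starts were repeats of an already-started job.

    Hypothetical helper: `all_logs` is the count from the All logs widget and
    `distinct_logs` is the count from the Distinct Logs widget.
    """
    if all_logs < 0 or distinct_logs < 0:
        raise ValueError("counts must be non-negative")
    if distinct_logs > all_logs:
        raise ValueError("distinct job starts cannot exceed total job starts")
    return all_logs - distinct_logs


if __name__ == "__main__":
    print(repeated_job_starts(all_logs=120, distinct_logs=100))
```

A result of zero means each job started only once; a consistently positive result suggests that some jobs are being retried.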
80+
81+
## Show Errors
82+
![Show errors](images/expand_error_log.png)
83+
84+
This widget shows you any log entry that contains “Error”.
85+
Ideally, this widget will remain empty.
86+
If it is logging errors, you can toggle each row for more information - it will show the job that errored in @logStream and the actual error message in @message.
87+
88+
## Interacting with a Dashboard
89+
90+
Once you have your Dashboard created and full of widgets, you can adjust the timescale for which the widget is reporting metrics.
91+
For any of the widgets you can set the absolute or relative time that the widget is showing by selecting the time scale from the upper right corner of the screen.
92+
Zoom in to a particular time selection on a visible widget by drawing a box around that time on the widget itself (note that zooming in doesn’t change what’s plotted, just what part of the plot you can see so metrics like Show Errors won’t update with a zoom).
93+
94+
Some widgets allow you to select/deselect certain metrics plotted in the widget.
95+
To hide a metric without permanently removing it from the widget, simply click the X on the box next to the name of the metric in the legend.
96+
97+
You can move the widgets around on your dashboard by hovering on the upper right or upper left corner of a widget until a 4-direction-arrow icon appears and then dragging and dropping the widget.
98+
You can change the size of a widget by hovering on the lower right corner of the widget until a diagonal arrow icon appears and then dragging the widget to the desired size.
99+
After making changes, make sure to select Save dashboard from the top menu so that they are maintained after refreshing the page.
