
Commit 78e0607

Merge pull request #1692 from cortexproject/aws-guide
Add AWS-specific points to docs
2 parents d7cfe81 + ad15002 commit 78e0607

2 files changed: +98 -0 lines changed

docs/aws-specific.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Running Cortex at AWS

[this is a work in progress]

See also the [Running in Production](running.md) document.

## Credentials

You can supply credentials to Cortex by setting the environment variables
`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (plus `AWS_SESSION_TOKEN`
if you use MFA), or use a short-term token solution such as
[kiam](https://github.com/uswitch/kiam).
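
A minimal sketch of supplying static credentials via the environment (the
key values are placeholders; prefer a role-based solution such as kiam over
long-lived keys where you can):

```
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE           # placeholder
export AWS_SECRET_ACCESS_KEY=exampleSecretKey  # placeholder
export AWS_SESSION_TOKEN=exampleSessionToken   # only needed with MFA / temporary credentials
```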

## Should I use S3 or DynamoDB?

Note that the choices are: "chunks" of timeseries data in S3 and the index
in DynamoDB, or everything in DynamoDB. Using just S3 is not an option.

Broadly, S3 is much more expensive to read and write, while DynamoDB is
much more expensive to store data in over months. The two services charge
on different dimensions, so the cross-over point depends on the size of
your chunks and how long you keep them. Very roughly: for 3KB chunks, if
you keep them longer than 8 months then S3 is cheaper.
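
As a rough illustration of where that figure comes from (prices are
approximate US-East list prices and change over time; DynamoDB
write-capacity and S3 read costs are ignored for simplicity), consider one
million 3KB chunks, roughly 3GB of data:

```
S3 writes:         1,000,000 / 1,000 * $0.005   =  $5.00  (one-off)
S3 storage:        3GB * $0.023 per GB-month    =  $0.07  per month
DynamoDB storage:  3GB * $0.25 per GB-month     =  $0.75  per month

Break-even:        $5.00 / ($0.75 - $0.07)      ≈  7-8 months
```

Beyond that point the cheaper S3 storage has repaid the higher cost of
writing the chunks, hence the rule of thumb above.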

## DynamoDB capacity provisioning

By default, the Cortex Tablemanager will provision tables with 1,000
units of write capacity and 300 units of read capacity. These numbers are
chosen to be high enough that most trial installations won't see a
bottleneck on storage, but do note that AWS will charge you approximately
$60 per day for this capacity.

To match your costs to requirements, observe the actual capacity
utilisation via CloudWatch or Prometheus metrics, then adjust the
Tablemanager provision via the command-line options
`-dynamodb.chunk-table.write-throughput` and
`-dynamodb.chunk-table.read-throughput`, plus the equivalent
`-dynamodb.periodic-table.*` options, which control the index table.
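
For instance, a minimal sketch that only sets static provisioned
throughput (the values are illustrative; the fuller example below adds
autoscaling on top):

```
-dynamodb.chunk-table.write-throughput=500
-dynamodb.chunk-table.read-throughput=100
-dynamodb.periodic-table.write-throughput=500
-dynamodb.periodic-table.read-throughput=100
```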

Tablemanager can even adjust the capacity dynamically, by watching
metrics for DynamoDB throttling and ingester queue length. Here is an
example set of command-line parameters from a fairly modest install:

```
-target=table-manager
-metrics.url=http://prometheus.monitoring.svc.cluster.local./api/prom/
-metrics.target-queue-length=100000
-dynamodb.url=dynamodb://us-east-1/
-dynamodb.use-periodic-tables=true

-dynamodb.periodic-table.prefix=cortex_index_
-dynamodb.periodic-table.from=2019-05-02
-dynamodb.periodic-table.write-throughput=1000
-dynamodb.periodic-table.write-throughput.scale.enabled=true
-dynamodb.periodic-table.write-throughput.scale.min-capacity=200
-dynamodb.periodic-table.write-throughput.scale.max-capacity=2000
-dynamodb.periodic-table.write-throughput.scale.out-cooldown=300 # 5 minutes between scale ups
-dynamodb.periodic-table.inactive-enable-ondemand-throughput-mode=true
-dynamodb.periodic-table.read-throughput=300
-dynamodb.periodic-table.tag=product_area=cortex

-dynamodb.chunk-table.from=2019-05-02
-dynamodb.chunk-table.prefix=cortex_data_
-dynamodb.chunk-table.write-throughput=800
-dynamodb.chunk-table.write-throughput.scale.enabled=true
-dynamodb.chunk-table.write-throughput.scale.min-capacity=200
-dynamodb.chunk-table.write-throughput.scale.max-capacity=1000
-dynamodb.chunk-table.write-throughput.scale.out-cooldown=300 # 5 minutes between scale ups
-dynamodb.chunk-table.inactive-enable-ondemand-throughput-mode=true
-dynamodb.chunk-table.read-throughput=300
-dynamodb.chunk-table.tag=product_area=cortex
```

Several things to note here:
- `-metrics.url` points at a Prometheus server running within the
  cluster, scraping Cortex. Currently it is not possible to use
  Cortex itself as the target here.
- `-metrics.target-queue-length`: when the ingester queue is below
  this level, Tablemanager will not scale up. When the queue is
  growing above this level, Tablemanager will scale up whichever
  table is being throttled.
- The plain `throughput` values are used when the tables are first
  created. Scale-up to any level up to this value will be very quick,
  but if you go higher than this initial value, AWS may take tens of
  minutes to finish scaling. In the config above they are set below
  the autoscaler `max-capacity` values.
- `ondemand-throughput-mode` tells AWS to charge for what you use, as
  opposed to continuous provisioning. This mode is cost-effective for
  older data, which is never written and only read sporadically.

docs/running.md

Lines changed: 8 additions & 0 deletions
@@ -3,6 +3,10 @@
This document assumes you have read the
[architecture](architecture.md) document.

In addition to the general advice in this document, please see these
platform-specific notes:
- [AWS](aws-specific.md)

## Planning

### Tenants

@@ -220,3 +224,7 @@ the same data:
-ingester.chunk-age-jitter=0

Add a chunk cache via `-memcached.hostname` to allow writes to be de-duplicated.

As recommended under [Chunk encoding](#chunk_encoding), use Bigchunk:

-ingester.chunk-encoding=3 # bigchunk
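
Putting these together, a minimal sketch of the ingester flags mentioned
here (the memcached address is illustrative and depends on your
deployment):

```
-target=ingester
-ingester.chunk-age-jitter=0
-ingester.chunk-encoding=3                               # bigchunk
-memcached.hostname=memcached.cortex.svc.cluster.local   # illustrative address
```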
