# Running Cortex at AWS

[this is a work in progress]

See also the [Running in Production](running.md) document.

## Credentials

You can supply credentials to Cortex by setting the environment variables
`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (and `AWS_SESSION_TOKEN`
if you use MFA), or use a short-term token solution such as
[kiam](https://github.com/uswitch/kiam).
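
For example, here is a minimal sketch of the environment-variable approach for a
process started from a shell; the values shown are placeholders, not real keys:

```
# Export credentials into the environment the Cortex process will run in.
# These values are placeholders - substitute your own.
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=example-secret-key
export AWS_SESSION_TOKEN=example-session-token   # only needed with MFA / temporary credentials
```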


## Should I use S3 or DynamoDB?

Note that the choices are: "chunks" of timeseries data in S3 and the index
in DynamoDB, or everything in DynamoDB. Using just S3 is not an option.

Broadly, S3 is much more expensive to read and write, while DynamoDB is
much more expensive to store over months. S3 charges differently, so
the cross-over point will depend on the size of your chunks and how long
you keep them. Very roughly: for 3KB chunks, if you keep them longer
than 8 months then S3 is cheaper.


## DynamoDB capacity provisioning

By default, the Cortex Tablemanager will provision tables with 1,000
units of write capacity and 300 units of read capacity - these numbers
are chosen to be high enough that most trial installations won't see a
bottleneck on storage, but do note that AWS will charge you approximately
$60 per day for this capacity.

To match your costs to requirements, observe the actual capacity
utilisation via CloudWatch or Prometheus metrics, then adjust the
Tablemanager provision via the command-line options
`-dynamodb.chunk-table.write-throughput` and
`-dynamodb.chunk-table.read-throughput`, plus the equivalent
`-dynamodb.periodic-table.*` options, which control the index table.
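
For example, consumed capacity for a table can be pulled from CloudWatch with
the AWS CLI - a sketch only, where the table name and time window are
placeholders:

```
# Consumed write capacity for one table over the last hour, summed per 5 minutes.
# Divide each Sum by the period (300s) to get average consumed units per second.
# "cortex_data_2576" is a placeholder table name; substitute one of your own.
# (GNU date syntax; adjust on macOS/BSD.)
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=cortex_data_2576 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum
```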

Tablemanager can even adjust the capacity dynamically, by watching
metrics for DynamoDB throttling and ingester queue length. Here is an
example set of command-line parameters from a fairly modest install:

```
 -target=table-manager
 -metrics.url=http://prometheus.monitoring.svc.cluster.local./api/prom/
 -metrics.target-queue-length=100000
 -dynamodb.url=dynamodb://us-east-1/
 -dynamodb.use-periodic-tables=true

 -dynamodb.periodic-table.prefix=cortex_index_
 -dynamodb.periodic-table.from=2019-05-02
 -dynamodb.periodic-table.write-throughput=1000
 -dynamodb.periodic-table.write-throughput.scale.enabled=true
 -dynamodb.periodic-table.write-throughput.scale.min-capacity=200
 -dynamodb.periodic-table.write-throughput.scale.max-capacity=2000
 -dynamodb.periodic-table.write-throughput.scale.out-cooldown=300 # 5 minutes between scale ups
 -dynamodb.periodic-table.inactive-enable-ondemand-throughput-mode=true
 -dynamodb.periodic-table.read-throughput=300
 -dynamodb.periodic-table.tag=product_area=cortex

 -dynamodb.chunk-table.from=2019-05-02
 -dynamodb.chunk-table.prefix=cortex_data_
 -dynamodb.chunk-table.write-throughput=800
 -dynamodb.chunk-table.write-throughput.scale.enabled=true
 -dynamodb.chunk-table.write-throughput.scale.min-capacity=200
 -dynamodb.chunk-table.write-throughput.scale.max-capacity=1000
 -dynamodb.chunk-table.write-throughput.scale.out-cooldown=300 # 5 minutes between scale ups
 -dynamodb.chunk-table.inactive-enable-ondemand-throughput-mode=true
 -dynamodb.chunk-table.read-throughput=300
 -dynamodb.chunk-table.tag=product_area=cortex
```

Several things to note here:
 - `-metrics.url` points at a Prometheus server running within the
   cluster, scraping Cortex. Currently it is not possible to use
   Cortex itself as the target here.
 - `-metrics.target-queue-length`: when the ingester queue is below
   this level, Tablemanager will not scale up. When the queue is
   growing above this level, Tablemanager will scale up whatever
   table is being throttled.
 - The plain `throughput` values are used when the tables are first
   created. Scale-up to any level up to this value will be very quick,
   but if you go higher than this initial value, AWS may take tens of
   minutes to finish scaling. In the config above they are set lower
   than the autoscaler's `max-capacity` values, so scaling all the way
   up to those maximums may be slow.
 - `ondemand-throughput-mode` tells AWS to charge for what you use, as
   opposed to continuous provisioning. This mode is cost-effective for
   older data, which is never written and only read sporadically.
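
As a quick sanity check that the flags did what you expected, you can inspect
a table directly with the AWS CLI - a sketch only, where `cortex_data_2576` is
a made-up table name (your real names are the configured prefix plus a period
number):

```
# List the tables Tablemanager has created, then inspect the throughput
# and billing mode of one of them.
aws dynamodb list-tables --region us-east-1
aws dynamodb describe-table --region us-east-1 --table-name cortex_data_2576 \
  --query 'Table.[ProvisionedThroughput,BillingModeSummary]'
```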