This repository was archived by the owner on Jun 23, 2021. It is now read-only.

Analytics

Nestor Carvantes edited this page Oct 12, 2019 · 8 revisions

The analytics component

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.

Data

Applications/year=2019/month=10/day=11/hour=16/realworld-serverless-application-analytic-Firehose-12J7YC29T8FAY-1-2019-10-11-16-58-58-c0068baf-ab5b-4a61-a9a7-1e100983c696.parquet

Data is partitioned by time. The year=2019 style of prefixes follows the Hive partition naming convention (name=value), so Athena can load each prefix as a partition. Partitioning reduces the amount of data that has to be scanned to execute Athena queries, thus reducing the cost.
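To make the layout concrete, here is a minimal Python sketch that reads the partition values out of the example key above and builds the prefix for a given hour. The helper names `parse_partitions` and `hour_prefix` are hypothetical, and the zero-padding of month, day, and hour is an assumption: the example key only shows two-digit values.

```python
# Example object key copied from the Data section above.
KEY = (
    "Applications/year=2019/month=10/day=11/hour=16/"
    "realworld-serverless-application-analytic-Firehose-12J7YC29T8FAY-1-"
    "2019-10-11-16-58-58-c0068baf-ab5b-4a61-a9a7-1e100983c696.parquet"
)


def parse_partitions(key):
    """Return the name=value path segments of an S3 key as a dict of strings."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep:  # keep only segments that contain "="
            partitions[name] = value
    return partitions


def hour_prefix(table, year, month, day, hour):
    """Build the key prefix for one hour of data.

    Assumption: month, day, and hour are zero-padded to two digits.
    """
    return f"{table}/year={year:04d}/month={month:02d}/day={day:02d}/hour={hour:02d}/"


if __name__ == "__main__":
    print(parse_partitions(KEY))
    print(hour_prefix("Applications", 2019, 10, 11, 16))
```

A query that filters on these partition columns only needs to read the objects under the matching prefixes, which is where the cost reduction comes from.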

Data is stored in .parquet files. Parquet is a columnar data storage format of the Apache Hadoop ecosystem. It provides efficient data compression and enhanced performance to handle complex data in bulk. Because of that compression, Parquet files are typically much smaller than the equivalent JSON text files, and a columnar layout lets Athena read only the columns a query references. Using Parquet therefore reduces both storage and query costs.

Athena

Run the following query first to load new partitions:

MSCK REPAIR TABLE applications;

Run a sample query:

SELECT detail.eventname,
         detail.dynamodb.keys.applicationid.s AS applicationid,
         detail.dynamodb.keys.userid.s AS userid,
         detail.dynamodb.newimage.author.s AS author,
         detail.dynamodb.newimage.description.s AS description
FROM applications;
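The dotted names in the query walk through nested fields of each record. The Python sketch below mirrors that projection on a plain dictionary, to show which nested value each output column comes from. The `SAMPLE` record is hypothetical (its values are made up for illustration, and real records will likely carry more fields), and `get_path` and `select_columns` are helper names invented here. Returning `None` for a missing field is a choice of this sketch, not a statement about Athena's behavior.

```python
# Output column -> dotted path, taken from the sample query above.
COLUMNS = {
    "eventname": "detail.eventname",
    "applicationid": "detail.dynamodb.keys.applicationid.s",
    "userid": "detail.dynamodb.keys.userid.s",
    "author": "detail.dynamodb.newimage.author.s",
    "description": "detail.dynamodb.newimage.description.s",
}

# Hypothetical record; only the fields used by the query are shown.
SAMPLE = {
    "detail": {
        "eventname": "INSERT",
        "dynamodb": {
            "keys": {
                "applicationid": {"s": "my-app"},
                "userid": {"s": "user-1"},
            },
            "newimage": {
                "author": {"s": "Jane Doe"},
                "description": {"s": "A sample application"},
            },
        },
    }
}


def get_path(record, path):
    """Follow a dotted path through nested dicts; None if any key is missing."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def select_columns(record):
    """Project one nested record onto the flat columns of the sample query."""
    return {alias: get_path(record, path) for alias, path in COLUMNS.items()}


if __name__ == "__main__":
    print(select_columns(SAMPLE))
```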