This project is a worker designed to process Parquet files and store the data in Elasticsearch, following the "ports and adapters" architecture.
Ensure you have the following software installed:
- Docker
- Docker Compose
The system uses a Makefile for easy management. Below are the commands you can use:
Before running the project, you need to build the Docker containers:
```
make build
```
After building, start the project with:
```
make start
```
To process all Parquet files:
```
make cur
```
To check the status of the running containers:
```
make status
```
To view the logs of the running containers:
```
make logs
```
To stop all running containers:
```
make stop
```
Place your `.parquet` files in a directory formatted as `BILLING_PERIOD=YYYY-MM`. After processing, the files will be renamed with the `.processed` extension. Make sure to organize your files accordingly to ensure proper processing.
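For local runs (`STORAGE_TYPE=LOCAL`), the expected layout can be sketched as below; note that the `data/` root and the file name are placeholders for this illustration, not paths defined by the project:

```shell
# Illustrative layout for local processing; "data/" and the file name are
# placeholders -- use whatever directory the worker is configured to scan.
mkdir -p "data/BILLING_PERIOD=2024-05"

# Drop the monthly CUR Parquet exports into the period directory
# (touch stands in for a real export here):
touch "data/BILLING_PERIOD=2024-05/report-00001.snappy.parquet"

# After a successful run the worker renames each file with the
# .processed extension so it is not picked up again.
ls "data/BILLING_PERIOD=2024-05"
```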
| Variable Name | Type | Default Value | Meaning |
|---|---|---|---|
| OUTPUT_ADAPTER | String | opensearch | Where data will be saved after being processed: `opensearch` or `elasticsearch` |
| OUTPUT_ADAPTER_HOST | String | localhost | The host address of the Elasticsearch server |
| OUTPUT_ADAPTER_PORT | Integer | 9200 | The port number of the Elasticsearch server |
| FILE_THREADS | Integer | 2 | How many Parquet files will be processed at the same time |
| WORKER_THREADS | Integer | 2 | How many workers will be used for the Elasticsearch bulk insert |
| STORAGE_TYPE | String | LOCAL | Data source to find Parquet files: `LOCAL` or `S3` |
| AWS_ACCESS_KEY_ID | String | | AWS access key ID to connect to S3 when STORAGE_TYPE is `S3` |
| AWS_SECRET_ACCESS_KEY | String | | AWS secret access key to connect to S3 when STORAGE_TYPE is `S3` |
| AWS_SESSION_TOKEN | String | | AWS session token, if necessary |
| AWS_REGION | String | us-east-1 | AWS region where the S3 bucket was created |
| AWS_BUCKET_NAME | String | | AWS S3 bucket where the Parquet files are created |
| AWS_BUCKET_PREFIX | String | | AWS S3 bucket prefix directory where the Parquet files are created |
| REPROCESS | Bool | False | If `True`, all processed files will be reprocessed |
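One way to wire these together is to export the variables before starting the containers. The values below sketch an S3-backed run; the bucket name and prefix are placeholders for this example, not project defaults:

```shell
# Placeholder configuration for an S3-backed run; the bucket and prefix
# values are illustrative, not defaults shipped with the project.
export OUTPUT_ADAPTER=elasticsearch
export OUTPUT_ADAPTER_HOST=localhost
export OUTPUT_ADAPTER_PORT=9200
export FILE_THREADS=2
export WORKER_THREADS=2
export STORAGE_TYPE=S3
export AWS_REGION=us-east-1
export AWS_BUCKET_NAME=my-cur-bucket   # placeholder
export AWS_BUCKET_PREFIX=reports/      # placeholder
export REPROCESS=False
```

Credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN`) should be supplied through your usual secrets mechanism rather than committed to a file.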
After processing the files, you can access Grafana at:
Use the following credentials:
- Username: admin
- Password: password
The CUR (Cost and Usage Report) processor reads only Parquet files under the AWS_BUCKET_PREFIX directory in the specified S3 bucket. After processing, the files will be renamed with the `.processed` extension.
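The selection rule can be illustrated with a plain listing: keys that already carry the `.processed` suffix no longer end in `.parquet`, so they are skipped on the next run. The keys below are made up for this example:

```shell
# Simulated bucket listing; only keys still ending in .parquet are work items.
# The already-renamed 2024-04 file is filtered out.
printf '%s\n' \
  'reports/BILLING_PERIOD=2024-04/part-0.parquet.processed' \
  'reports/BILLING_PERIOD=2024-05/part-0.parquet' \
  'reports/BILLING_PERIOD=2024-05/part-1.parquet' |
  grep '\.parquet$'
```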
To enable Cost and Usage Report (CUR) v2 in Parquet format and configure the data export to an S3 bucket, follow these steps:
1. Access the AWS Console:
   - Log in to your AWS account.
   - Navigate to the Billing and Cost Management service.
2. Create a Cost and Usage Report:
   - In the navigation pane, click on Data Exports.
   - Click the Create button and choose Standard data export.
3. Configure the Report:
   - Name your report, for example, `monthly-cost-usage-report`.
   - Check the Include resource IDs option to get detailed information.
   - On Time granularity, select Monthly.
   - On Format, select Parquet.
4. Configure Report Delivery:
   - Select an existing S3 bucket or create a new bucket where the report will be stored.
   - Ensure the directory structure in the S3 bucket is set to `reports/` for the data export configuration.
   - Click Next.
5. Set Report Path:
   - Set the S3 report path prefix to `reports/`.
   - Ensure the path is correctly set to store reports in the desired directory.
6. Review and Complete:
   - Review your settings.
   - Click Save and complete to finish the setup.
Once the CUR is configured and active, the processor will read the monthly Parquet files from the reports/ directory in the S3 bucket for processing.
Follow these instructions to set up, run, and manage the AWS CUR v2 Processor. Ensure your environment meets the prerequisites and that you follow the steps in the correct order.