
Commit 67dbce9

feature: support GCP cloud storage (#17)

* feature: support GCP cloud storage

  This change leverages the contribution by cklingspor, which added a storage
  factory and Azure support, to add support for GCP. This is a work in
  progress and needs more testing.

* Fix requirements-dev.txt

  Include the main requirements file instead of duplicating the versions
  inside requirements-dev.txt.

* Fix pylint errors

Signed-off-by: Leonardo Rodrigues de Mello <lrodriguesdemello@zendesk.com>

1 parent 52f8fe2 commit 67dbce9

File tree

7 files changed: +165 -34 lines changed


README.md

Lines changed: 26 additions & 20 deletions

@@ -3,43 +3,49 @@ Export OpenCost data in parquet format
 
 This script was created to export data from opencost in PARQUET format.
 
-It supports exporting the data to S3 and local directory.
+It supports exporting the data to S3, Azure Blob Storage, GCP Cloud Storage, and local directory.
 
 # Dependencies
-This script depends on boto3, pandas, numpy and python-dateutil.
+This script depends on boto3, pandas, numpy, python-dateutil, azure-identity, azure-storage-blob, and google-cloud-storage.
 
 The file requirements.txt has all the dependencies specified.
 
 # Configuration:
 The script supports the following environment variables:
-* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default it assume the opencost service is on localhost.
-* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service, by default it assume it is 9003
-* OPENCOST_PARQUET_WINDOW_START: Start window for the export, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
-* OPENCOST_PARQUET_WINDOW_END: End of export window, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
-* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket use s3://bucket-name and make sure there is an AWS Role with access to the s3 bucket attached to the container that is running the export. This also respect the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. see: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
-* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export, by default it is '/tmp'. The export is going to be saved inside this prefix, in the following structure: year=window_start.year/month=window_start.month/day=window_start.day , ex: tmp/year=2024/month=1/date=15
-* OPENCOST_PARQUET_AGGREGATE: This is the dimentions used to aggregate the data. by default we use "namespace,pod,container" which is the same dimensions used for the CSV native export.
-* OPENCOST_PARQUET_STEP: This is the Step for the export, by default we use 1h steps, which result in 24 steps in a day and make easier to match the exported data to AWS CUR, since cur also export on hourly base.
-* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e. higher resolutions) will provide better accuracy, but worse performance (i.e. slower query time, higher memory use). Larger values (i.e. lower resolutions) will perform better, but at the expense of lower accuracy for short-running workloads.
-* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
+* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default, it assumes the opencost service is on localhost.
+* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service, by default it assumes it is 9003.
+* OPENCOST_PARQUET_WINDOW_START: Start window for the export. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T00:00:00Z`.
+* OPENCOST_PARQUET_WINDOW_END: End of the export window. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T23:59:59Z`.
+* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket, use `s3://bucket-name` and make sure there is an AWS Role with access to the S3 bucket attached to the container running the export. This also respects the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. See: [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).
+* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export. By default it is `/tmp`. The export will be saved inside this prefix in the following structure: `year=window_start.year/month=window_start.month/day=window_start.day`, e.g., `tmp/year=2024/month=1/day=15`.
+* OPENCOST_PARQUET_AGGREGATE: Dimensions used to aggregate the data. By default, "namespace,pod,container", which is the same dimensions used for the CSV native export.
+* OPENCOST_PARQUET_STEP: Step size for the export. By default, we use 1h steps, which results in 24 steps in a day and makes it easier to match the exported data to AWS CUR since CUR also exports on an hourly basis.
+* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e., higher resolutions) will provide better accuracy, but worse performance (i.e., slower query time, higher memory use). Larger values (i.e., lower resolutions) will perform better but at the expense of lower accuracy for short-running workloads.
+* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
 * OPENCOST_PARQUET_INCLUDE_IDLE: Whether to return the calculated __idle__ field for the query. Default is `"false"`.
-* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per node basis. Which will result in different values when shared and more idle allocations when split. Default is `"false"`.
-* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`. See below for Azure specific variables.
-* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.
+* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per-node basis, which will result in different values when shared and more idle allocations when split. Default is `"false"`.
+* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`, `gcp`. See below for Azure and GCP-specific variables.
+* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.
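The JSON separator variable controls how pandas flattens the nested objects the OpenCost API returns. A minimal sketch of that behaviour, using the `json_normalize` method linked above; the record shape here is illustrative, not the actual OpenCost payload:

```python
import pandas as pd

# Illustrative nested record; the real OpenCost allocation response
# has a different (richer) shape.
record = {
    "name": "my-pod",
    "cpuCost": 0.42,
    "properties": {"namespace": "default", "node": "node-1"},
}

# json_normalize flattens nested keys, joining them with `sep`.
df = pd.json_normalize(record, sep="_")
print(list(df.columns))  # nested keys become e.g. 'properties_namespace'
```

With `sep="."` the same record would instead yield a `properties.namespace` column, which is why the exporter makes the separator configurable.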

 
 ## Azure Specific Environment Variables
 * OPENCOST_PARQUET_AZURE_STORAGE_ACCOUNT_NAME: Name of the Azure Storage Account you want to export the data to.
-* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container
-* OPENCOST_PARQUET_AZURE_TENANT: You Azure Tenant ID
-* OPENCOST_PARQUET_AZURE_APPLICATION_ID: ClientID of the Service Principal
-* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal
+* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container.
+* OPENCOST_PARQUET_AZURE_TENANT: Your Azure Tenant ID.
+* OPENCOST_PARQUET_AZURE_APPLICATION_ID: Client ID of the Service Principal.
+* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal.
+
+## GCP Specific Environment Variables
+* OPENCOST_PARQUET_GCP_BUCKET_NAME: Name of the GCP bucket you want to export the data to.
+* OPENCOST_PARQUET_GCP_CREDENTIALS_JSON: JSON-formatted string of your GCP credentials (optional, uses `GOOGLE_APPLICATION_CREDENTIALS` if not set).
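OPENCOST_PARQUET_GCP_CREDENTIALS_JSON holds the raw service-account JSON as a string. A minimal sketch of how such a value can be parsed, mirroring the `json.loads` call the exporter's `get_config` uses with `'{}'` as the fallback; the helper name is hypothetical, not part of the exporter:

```python
import json
import os

def parse_gcp_credentials() -> dict:
    """Hypothetical helper illustrating the parsing behaviour.

    An unset variable falls back to '{}', i.e. an empty dict, in which
    case the google-cloud-storage client's default credential chain
    (GOOGLE_APPLICATION_CREDENTIALS, metadata server, ...) applies.
    """
    return json.loads(os.environ.get("OPENCOST_PARQUET_GCP_CREDENTIALS_JSON", "{}"))

os.environ["OPENCOST_PARQUET_GCP_CREDENTIALS_JSON"] = '{"type": "service_account"}'
print(parse_gcp_credentials())  # {'type': 'service_account'}
```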

 # Prerequisites
 ## AWS IAM
 
 ## Azure RBAC
-The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) on the Azure Storage Account. Therefore, to use the Azure storage backend you need an existing service principal with according role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage-Blob-Data-Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows to write data to a Azure Storage Account container. A less permissivie custom role can be built and is encouraged!
+The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals) on the Azure Storage Account. Therefore, to use the Azure storage backend, you need an existing service principal with the appropriate role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage Blob Data Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows writing data to an Azure Storage Account container. A less permissive custom role can be built and is encouraged!
 
+## GCP IAM
+The current implementation allows for authentication using service account keys or Workload Identity. Ensure that the service account has the `Storage Object Creator` role or equivalent permissions to write data to the GCP bucket.
 
 # Usage:

requirements-dev.txt

Lines changed: 1 addition & 11 deletions

@@ -1,14 +1,4 @@
-numpy==1.26.3
-pandas==2.2.3
-boto3==1.35.16
-requests==2.32.0
-python-dateutil==2.8.2
-pytz==2023.3.post1
-six==1.16.0
-tzdata==2023.4
-pyarrow==14.0.1
-azure-storage-blob==12.19.1
-azure-identity==1.16.1
+-r requirements.txt
 # The dependencies bellow are only used for development.
 freezegun==1.4.0
 pylint==3.0.3

requirements.txt

Lines changed: 7 additions & 0 deletions

@@ -9,3 +9,10 @@ tzdata==2023.4
 pyarrow==14.0.1
 azure-storage-blob==12.19.1
 azure-identity==1.16.1
+google-api-core==2.19.2
+google-auth==2.34.0
+google-cloud-core==2.4.1
+google-cloud-storage==2.18.2
+google-crc32c==1.6.0
+google-resumable-media==2.7.2
+googleapis-common-protos==1.65.0

src/opencost_parquet_exporter.py

Lines changed: 6 additions & 0 deletions

@@ -145,6 +145,12 @@ def get_config(
         'azure_application_id': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_ID'),
         'azure_application_secret': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_SECRET'),
     })
+    if config['storage_backend'] == 'gcp':
+        config.update({
+            # pylint: disable=C0301
+            'gcp_bucket_name': os.environ.get('OPENCOST_PARQUET_GCP_BUCKET_NAME'),
+            'gcp_credentials': json.loads(os.environ.get('OPENCOST_PARQUET_GCP_CREDENTIALS_JSON', '{}')),
+        })
 
     # If window is not specified assume we want yesterday data.
     if window_start is None or window_end is None:
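The context line above notes that an unspecified window defaults to yesterday's data. A sketch of that default, producing RFC3339 strings like the README's examples; this is an illustrative computation, not the exporter's exact code:

```python
from datetime import datetime, timedelta, timezone

def yesterday_window(now: datetime) -> tuple[str, str]:
    """Return (window_start, window_end) covering yesterday in RFC3339."""
    yesterday = (now - timedelta(days=1)).date()
    start = f"{yesterday.isoformat()}T00:00:00Z"
    end = f"{yesterday.isoformat()}T23:59:59Z"
    return start, end

now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
print(yesterday_window(now))  # ('2024-01-30T00:00:00Z', '2024-01-30T23:59:59Z')
```

Working on the `date` level (rather than subtracting 24 hours from the current timestamp) keeps the window aligned to whole calendar days, which matches the daily partitioning scheme used for the export path.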

src/storage/gcp_storage.py

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@ (new file; all lines added)

"""
This module provides an implementation of the BaseStorage class for Google Cloud Storage.
"""

from io import BytesIO
import logging
from google.cloud import storage
from google.oauth2 import service_account
from google.api_core import exceptions as gcp_exceptions
import pandas as pd
from .base_storage import BaseStorage

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


# pylint: disable=R0903
class GCPStorage(BaseStorage):
    """
    A class to handle data storage in Google Cloud Storage.
    """

    def _get_client(self, config) -> storage.Client:
        """
        Returns a Google Cloud Storage client using credentials provided in the config.

        Parameters:
            config (dict): Configuration dictionary that may contain 'gcp_credentials'
                           for service account keys and other authentication-related keys.

        Returns:
            storage.Client: An authenticated Google Cloud Storage client.
        """
        if 'gcp_credentials' in config:
            credentials_info = config['gcp_credentials']
            credentials = service_account.Credentials.from_service_account_info(
                credentials_info)
            client = storage.Client(credentials=credentials)
        else:
            # Use default credentials
            client = storage.Client()

        return client

    def save_data(self, data: pd.core.frame.DataFrame, config) -> str | None:
        """
        Saves a DataFrame to Google Cloud Storage.

        Parameters:
            data (pd.core.frame.DataFrame): The DataFrame to be saved.
            config (dict): Configuration dictionary containing necessary information for storage.
                           Expected keys include 'gcp_bucket_name',
                           'file_key_prefix', and 'window_start'.

        Returns:
            str | None: The URL of the saved object if successful, None otherwise.
        """
        client = self._get_client(config)

        file_name = 'k8s_opencost.parquet'
        window = pd.to_datetime(config['window_start'])
        blob_prefix = f"{config['file_key_prefix']}/{window.year}/{window.month}/{window.day}"
        bucket_name = config['gcp_bucket_name']
        blob_name = f"{blob_prefix}/{file_name}"

        bucket = client.bucket(bucket_name)
        blob = bucket.blob(blob_name)
        parquet_file = BytesIO()
        data.to_parquet(parquet_file, engine='pyarrow', index=False)
        parquet_file.seek(0)

        try:
            blob.upload_from_file(
                parquet_file, content_type='application/octet-stream')
            return blob.public_url
        except gcp_exceptions.BadRequest as e:
            logger.error("Bad Request Error: %s", e)
        except gcp_exceptions.Forbidden as e:
            logger.error("Forbidden Error: %s", e)
        except gcp_exceptions.NotFound as e:
            logger.error("Not Found Error: %s", e)
        except gcp_exceptions.TooManyRequests as e:
            logger.error("Too Many Requests Error: %s", e)
        except gcp_exceptions.InternalServerError as e:
            logger.error("Internal Server Error: %s", e)
        except gcp_exceptions.GoogleAPIError as e:
            logger.error("Google API Error: %s", e)

        return None
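`save_data` derives the object path from `file_key_prefix` and `window_start`. The same partitioning can be sketched with the standard library (using `datetime` here instead of `pd.to_datetime`); the helper name is illustrative. Note that the module builds bare `{prefix}/{year}/{month}/{day}` segments, rather than the `year=...`/`month=...`/`day=...` form the README's OPENCOST_PARQUET_FILE_KEY_PREFIX example describes:

```python
from datetime import datetime

def gcs_blob_name(file_key_prefix: str, window_start: str,
                  file_name: str = "k8s_opencost.parquet") -> str:
    """Illustrative reconstruction of the blob path save_data builds."""
    # fromisoformat on older Python versions needs '+00:00' rather than
    # a trailing 'Z', so normalise the RFC3339 string first.
    window = datetime.fromisoformat(window_start.replace("Z", "+00:00"))
    # Month and day are rendered without zero-padding, matching the
    # f"{window.month}" style interpolation used in the module.
    return f"{file_key_prefix}/{window.year}/{window.month}/{window.day}/{file_name}"

print(gcs_blob_name("tmp", "2024-01-05T00:00:00Z"))
# tmp/2024/1/5/k8s_opencost.parquet
```

The unpadded month/day means path segments sort lexicographically out of order (e.g. `10` before `2`); consumers that rely on ordered listings would need to parse the segments numerically.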

src/storage_factory.py

Lines changed: 6 additions & 3 deletions

@@ -5,17 +5,18 @@
 
 from storage.aws_s3_storage import S3Storage
 from storage.azure_storage import AzureStorage
+from storage.gcp_storage import GCPStorage  # New import
 
 
 def get_storage(storage_backend):
     """
     Factory function to create and return a storage object based on the given backend.
 
-    This function abstracts the creation of storage objectss. It supports 'azure' for
-    Azure Storage and 's3' for AWS S3 Storage.
+    This function abstracts the creation of storage objects. It supports 'azure' for
+    Azure Storage, 's3' for AWS S3 Storage, and 'gcp' for Google Cloud Storage.
 
     Parameters:
-        storage_backend (str): The name of the storage backend. SUpported:'azure','s3'.
+        storage_backend (str): The name of the storage backend. Supported: 'azure', 's3', 'gcp'.
 
     Returns:
         An instance of the specified storage backend class.
@@ -27,5 +28,7 @@ def get_storage(storage_backend):
         return AzureStorage()
     if storage_backend in ['s3', 'aws']:
         return S3Storage()
+    if storage_backend == 'gcp':
+        return GCPStorage()
 
     raise ValueError("Unsupported storage backend")
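The factory's dispatch logic can be exercised without any cloud SDKs installed; a self-contained sketch with stub classes standing in for the real storage backends:

```python
# Stubs standing in for the real S3Storage/AzureStorage/GCPStorage classes,
# so the factory's dispatch can be shown without cloud dependencies.
class S3Storage: ...
class AzureStorage: ...
class GCPStorage: ...

def get_storage(storage_backend):
    """Factory: map a backend name to a storage instance (same logic as above)."""
    if storage_backend == 'azure':
        return AzureStorage()
    if storage_backend in ['s3', 'aws']:
        return S3Storage()
    if storage_backend == 'gcp':
        return GCPStorage()
    raise ValueError("Unsupported storage backend")

print(type(get_storage('gcp')).__name__)  # GCPStorage
```

Because callers only receive the shared BaseStorage interface, adding a backend is a matter of one import and one branch here, which is exactly what this commit does for `gcp`.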

src/test_opencost_parquet_exporter.py

Lines changed: 30 additions & 0 deletions

@@ -69,6 +69,36 @@ def test_get_azure_config_with_env_vars(self):
         self.assertEqual(config['params'][1][1], 'true')
         self.assertEqual(config['params'][2][1], 'true')
 
+    def test_get_gcp_config_with_env_vars(self):
+        """Test get_config returns correct configurations based on environment variables."""
+        with patch.dict(os.environ, {
+                'OPENCOST_PARQUET_SVC_HOSTNAME': 'testhost',
+                'OPENCOST_PARQUET_SVC_PORT': '8080',
+                'OPENCOST_PARQUET_WINDOW_START': '2020-01-01T00:00:00Z',
+                'OPENCOST_PARQUET_WINDOW_END': '2020-01-01T23:59:59Z',
+                'OPENCOST_PARQUET_S3_BUCKET': 's3://test-bucket',
+                'OPENCOST_PARQUET_FILE_KEY_PREFIX': 'test-prefix/',
+                'OPENCOST_PARQUET_AGGREGATE': 'namespace',
+                'OPENCOST_PARQUET_STEP': '1m',
+                'OPENCOST_PARQUET_STORAGE_BACKEND': 'gcp',
+                'OPENCOST_PARQUET_GCP_BUCKET_NAME': 'testbucket',
+                'OPENCOST_PARQUET_GCP_CREDENTIALS_JSON': '{"type": "service_account"}',
+                'OPENCOST_PARQUET_IDLE_BY_NODE': 'true',
+                'OPENCOST_PARQUET_INCLUDE_IDLE': 'true'}, clear=True):
+            config = get_config()
+
+        self.assertEqual(
+            config['url'], 'http://testhost:8080/allocation/compute')
+        self.assertEqual(config['params'][0][1],
+                         '2020-01-01T00:00:00Z,2020-01-01T23:59:59Z')
+        self.assertEqual(config['storage_backend'], 'gcp')
+        self.assertEqual(
+            config['gcp_bucket_name'], 'testbucket')
+        self.assertEqual(config['gcp_credentials'], {
+            'type': 'service_account'})
+        self.assertEqual(config['params'][1][1], 'true')
+        self.assertEqual(config['params'][2][1], 'true')
+
     @freeze_time("2024-01-31")
     def test_get_config_defaults_last_day_of_month(self):
         """Test get_config returns correct defaults when no env vars are set."""
