
Commit 67dbce9

feature: support GCP cloud storage (#17)

* feature: support GCP cloud storage

  This change leverages the contribution by cklingspor, which added a storage
  factory and Azure support, to add support for GCP. This is a work in
  progress and needs more testing.

* Fix requirements-dev.txt

  Include the main requirements file instead of duplicating the versions
  inside requirements-dev.txt.

* Fix pylint errors

Signed-off-by: Leonardo Rodrigues de Mello <lrodriguesdemello@zendesk.com>

1 parent 52f8fe2 commit 67dbce9

File tree

7 files changed: +165 -34 lines changed


README.md

Lines changed: 26 additions & 20 deletions

@@ -3,43 +3,49 @@ Export OpenCost data in parquet format
 
 This script was created to export data from opencost in PARQUET format.
 
-It supports exporting the data to S3 and local directory.
+It supports exporting the data to S3, Azure Blob Storage, GCP Cloud Storage, and local directory.
 
 # Dependencies
-This script depends on boto3, pandas, numpy and python-dateutil.
+This script depends on boto3, pandas, numpy, python-dateutil, azure-identity, azure-storage-blob, and google-cloud-storage.
 
 The file requirements.txt has all the dependencies specified.
 
 # Configuration:
 The script supports the following environment variables:
-* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default it assume the opencost service is on localhost.
-* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service, by default it assume it is 9003
-* OPENCOST_PARQUET_WINDOW_START: Start window for the export, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
-* OPENCOST_PARQUET_WINDOW_END: End of export window, by default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format. i.e `2024-05-27T00:00:00Z`.
-* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket use s3://bucket-name and make sure there is an AWS Role with access to the s3 bucket attached to the container that is running the export. This also respect the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. see: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
-* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export, by default it is '/tmp'. The export is going to be saved inside this prefix, in the following structure: year=window_start.year/month=window_start.month/day=window_start.day , ex: tmp/year=2024/month=1/date=15
-* OPENCOST_PARQUET_AGGREGATE: This is the dimentions used to aggregate the data. by default we use "namespace,pod,container" which is the same dimensions used for the CSV native export.
-* OPENCOST_PARQUET_STEP: This is the Step for the export, by default we use 1h steps, which result in 24 steps in a day and make easier to match the exported data to AWS CUR, since cur also export on hourly base.
-* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e. higher resolutions) will provide better accuracy, but worse performance (i.e. slower query time, higher memory use). Larger values (i.e. lower resolutions) will perform better, but at the expense of lower accuracy for short-running workloads.
-* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
+* OPENCOST_PARQUET_SVC_HOSTNAME: Hostname of the opencost service. By default, it assumes the opencost service is on localhost.
+* OPENCOST_PARQUET_SVC_PORT: Port of the opencost service, by default it assumes it is 9003.
+* OPENCOST_PARQUET_WINDOW_START: Start window for the export. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T00:00:00Z`.
+* OPENCOST_PARQUET_WINDOW_END: End of the export window. By default it is None, which results in exporting the data for yesterday. Date needs to be set in RFC3339 format, e.g., `2024-05-27T23:59:59Z`.
+* OPENCOST_PARQUET_S3_BUCKET: S3 bucket that will be used to store the export. By default this is None, and S3 export is not done. If set to a bucket, use `s3://bucket-name` and make sure there is an AWS Role with access to the S3 bucket attached to the container running the export. This also respects the environment variables AWS_PROFILE, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. See: [Boto3 Documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).
+* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export. By default it is `/tmp`. The export will be saved inside this prefix in the following structure: `year=window_start.year/month=window_start.month/day=window_start.day`, e.g., `tmp/year=2024/month=1/day=15`.
+* OPENCOST_PARQUET_AGGREGATE: Dimensions used to aggregate the data. By default, "namespace,pod,container", which is the same dimensions used for the CSV native export.
+* OPENCOST_PARQUET_STEP: Step size for the export. By default, we use 1h steps, which results in 24 steps in a day and makes it easier to match the exported data to AWS CUR since CUR also exports on an hourly basis.
+* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e., higher resolutions) will provide better accuracy, but worse performance (i.e., slower query time, higher memory use). Larger values (i.e., lower resolutions) will perform better but at the expense of lower accuracy for short-running workloads.
+* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
 * OPENCOST_PARQUET_INCLUDE_IDLE: Whether to return the calculated __idle__ field for the query. Default is `"false"`.
-* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per node basis. Which will result in different values when shared and more idle allocations when split. Default is `"false"`.
-* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`. See below for Azure specific variables.
-* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.
+* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per-node basis, which will result in different values when shared and more idle allocations when split. Default is `"false"`.
+* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`, `gcp`. See below for Azure and GCP-specific variables.
+* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The used [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) allows for a custom separator. Use this to specify the separator of your choice.
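The JSON separator variable controls how pandas flattens the nested objects the OpenCost API returns. A minimal sketch of that behaviour, using the `json_normalize` method linked above; the record shape here is illustrative, not the actual OpenCost payload:

```python
import pandas as pd

# Illustrative nested record; the real OpenCost allocation response
# has a different (richer) shape.
record = {
    "name": "my-pod",
    "cpuCost": 0.42,
    "properties": {"namespace": "default", "node": "node-1"},
}

# json_normalize flattens nested keys, joining them with `sep`.
df = pd.json_normalize(record, sep="_")
print(list(df.columns))  # nested keys become e.g. 'properties_namespace'
```

With `sep="."` the same record would instead yield a `properties.namespace` column, which is why the exporter makes the separator configurable.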

 
 ## Azure Specific Environment Variables
 * OPENCOST_PARQUET_AZURE_STORAGE_ACCOUNT_NAME: Name of the Azure Storage Account you want to export the data to.
-* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container
-* OPENCOST_PARQUET_AZURE_TENANT: You Azure Tenant ID
-* OPENCOST_PARQUET_AZURE_APPLICATION_ID: ClientID of the Service Principal
-* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal
+* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container.
+* OPENCOST_PARQUET_AZURE_TENANT: Your Azure Tenant ID.
+* OPENCOST_PARQUET_AZURE_APPLICATION_ID: Client ID of the Service Principal.
+* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal.
+
+## GCP Specific Environment Variables
+* OPENCOST_PARQUET_GCP_BUCKET_NAME: Name of the GCP bucket you want to export the data to.
+* OPENCOST_PARQUET_GCP_CREDENTIALS_JSON: JSON-formatted string of your GCP credentials (optional, uses `GOOGLE_APPLICATION_CREDENTIALS` if not set).
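OPENCOST_PARQUET_GCP_CREDENTIALS_JSON holds the raw service-account JSON as a string. A minimal sketch of how such a value can be parsed, mirroring the `json.loads` call the exporter's `get_config` uses with `'{}'` as the fallback; the helper name is hypothetical, not part of the exporter:

```python
import json
import os

def parse_gcp_credentials() -> dict:
    """Hypothetical helper illustrating the parsing behaviour.

    An unset variable falls back to '{}', i.e. an empty dict, in which
    case the google-cloud-storage client's default credential chain
    (GOOGLE_APPLICATION_CREDENTIALS, metadata server, ...) applies.
    """
    return json.loads(os.environ.get("OPENCOST_PARQUET_GCP_CREDENTIALS_JSON", "{}"))

os.environ["OPENCOST_PARQUET_GCP_CREDENTIALS_JSON"] = '{"type": "service_account"}'
print(parse_gcp_credentials())  # {'type': 'service_account'}
```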

 # Prerequisites
 ## AWS IAM
 
 ## Azure RBAC
-The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) on the Azure Storage Account. Therefore, to use the Azure storage backend you need an existing service principal with according role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage-Blob-Data-Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows to write data to a Azure Storage Account container. A less permissivie custom role can be built and is encouraged!
+The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/azure/active-directory/develop/app-objects-and-service-principals) on the Azure Storage Account. Therefore, to use the Azure storage backend, you need an existing service principal with the appropriate role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage Blob Data Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) allows writing data to an Azure Storage Account container. A less permissive custom role can be built and is encouraged!
 
+## GCP IAM
+The current implementation allows for authentication using service account keys or Workload Identity. Ensure that the service account has the `Storage Object Creator` role or equivalent permissions to write data to the GCP bucket.
 
 # Usage:

requirements-dev.txt

Lines changed: 1 addition & 11 deletions

@@ -1,14 +1,4 @@
-numpy==1.26.3
-pandas==2.2.3
-boto3==1.35.16
-requests==2.32.0
-python-dateutil==2.8.2
-pytz==2023.3.post1
-six==1.16.0
-tzdata==2023.4
-pyarrow==14.0.1
-azure-storage-blob==12.19.1
-azure-identity==1.16.1
+-r requirements.txt
 # The dependencies bellow are only used for development.
 freezegun==1.4.0
 pylint==3.0.3

requirements.txt

Lines changed: 7 additions & 0 deletions

@@ -9,3 +9,10 @@ tzdata==2023.4
 pyarrow==14.0.1
 azure-storage-blob==12.19.1
 azure-identity==1.16.1
+google-api-core==2.19.2
+google-auth==2.34.0
+google-cloud-core==2.4.1
+google-cloud-storage==2.18.2
+google-crc32c==1.6.0
+google-resumable-media==2.7.2
+googleapis-common-protos==1.65.0

src/opencost_parquet_exporter.py

Lines changed: 6 additions & 0 deletions

@@ -145,6 +145,12 @@ def get_config(
         'azure_application_id': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_ID'),
         'azure_application_secret': os.environ.get('OPENCOST_PARQUET_AZURE_APPLICATION_SECRET'),
     })
+    if config['storage_backend'] == 'gcp':
+        config.update({
+            # pylint: disable=C0301
+            'gcp_bucket_name': os.environ.get('OPENCOST_PARQUET_GCP_BUCKET_NAME'),
+            'gcp_credentials': json.loads(os.environ.get('OPENCOST_PARQUET_GCP_CREDENTIALS_JSON', '{}')),
+        })
 
     # If window is not specified assume we want yesterday data.
     if window_start is None or window_end is None:
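The context line above notes that an unspecified window defaults to yesterday's data. A sketch of that default, producing RFC3339 strings like the README's examples; this is an illustrative computation, not the exporter's exact code:

```python
from datetime import datetime, timedelta, timezone

def yesterday_window(now: datetime) -> tuple[str, str]:
    """Return (window_start, window_end) covering yesterday in RFC3339."""
    yesterday = (now - timedelta(days=1)).date()
    start = f"{yesterday.isoformat()}T00:00:00Z"
    end = f"{yesterday.isoformat()}T23:59:59Z"
    return start, end

now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
print(yesterday_window(now))  # ('2024-01-30T00:00:00Z', '2024-01-30T23:59:59Z')
```

Working on the `date` level (rather than subtracting 24 hours from the current timestamp) keeps the window aligned to whole calendar days, which matches the daily partitioning scheme used for the export path.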

src/storage/gcp_storage.py

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@ (new file; all lines added)

"""
This module provides an implementation of the BaseStorage class for Google Cloud Storage.
"""

from io import BytesIO
import logging
from google.cloud import storage
from google.oauth2 import service_account
from google.api_core import exceptions as gcp_exceptions
import pandas as pd
from .base_storage import BaseStorage

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


# pylint: disable=R0903
class GCPStorage(BaseStorage):
    """
    A class to handle data storage in Google Cloud Storage.
    """

    def _get_client(self, config) -> storage.Client:
        """
        Returns a Google Cloud Storage client using credentials provided in the config.

        Parameters:
            config (dict): Configuration dictionary that may contain 'gcp_credentials'
                           for service account keys and other authentication-related keys.

        Returns:
            storage.Client: An authenticated Google Cloud Storage client.
        """
        if 'gcp_credentials' in config:
            credentials_info = config['gcp_credentials']
            credentials = service_account.Credentials.from_service_account_info(
                credentials_info)
            client = storage.Client(credentials=credentials)
        else:
            # Use default credentials
            client = storage.Client()

        return client

    def save_data(self, data: pd.core.frame.DataFrame, config) -> str | None:
        """
        Saves a DataFrame to Google Cloud Storage.

        Parameters:
            data (pd.core.frame.DataFrame): The DataFrame to be saved.
            config (dict): Configuration dictionary containing necessary information for storage.
                           Expected keys include 'gcp_bucket_name',
                           'file_key_prefix', and 'window_start'.

        Returns:
            str | None: The URL of the saved object if successful, None otherwise.
        """
        client = self._get_client(config)

        file_name = 'k8s_opencost.parquet'
        window = pd.to_datetime(config['window_start'])
        blob_prefix = f"{config['file_key_prefix']}/{window.year}/{window.month}/{window.day}"
        bucket_name = config['gcp_bucket_name']
        blob_name = f"{blob_prefix}/{file_name}"

        bucket = client.bucket(bucket_name)
        blob = bucket.blob(blob_name)
        parquet_file = BytesIO()
        data.to_parquet(parquet_file, engine='pyarrow', index=False)
        parquet_file.seek(0)

        try:
            blob.upload_from_file(
                parquet_file, content_type='application/octet-stream')
            return blob.public_url
        except gcp_exceptions.BadRequest as e:
            logger.error("Bad Request Error: %s", e)
        except gcp_exceptions.Forbidden as e:
            logger.error("Forbidden Error: %s", e)
        except gcp_exceptions.NotFound as e:
            logger.error("Not Found Error: %s", e)
        except gcp_exceptions.TooManyRequests as e:
            logger.error("Too Many Requests Error: %s", e)
        except gcp_exceptions.InternalServerError as e:
            logger.error("Internal Server Error: %s", e)
        except gcp_exceptions.GoogleAPIError as e:
            logger.error("Google API Error: %s", e)

        return None
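`save_data` derives the object path from `file_key_prefix` and `window_start`. The same partitioning can be sketched with the standard library (using `datetime` here instead of `pd.to_datetime`); the helper name is illustrative. Note that the module builds bare `{prefix}/{year}/{month}/{day}` segments, rather than the `year=...`/`month=...`/`day=...` form the README's OPENCOST_PARQUET_FILE_KEY_PREFIX example describes:

```python
from datetime import datetime

def gcs_blob_name(file_key_prefix: str, window_start: str,
                  file_name: str = "k8s_opencost.parquet") -> str:
    """Illustrative reconstruction of the blob path save_data builds."""
    # fromisoformat on older Python versions needs '+00:00' rather than
    # a trailing 'Z', so normalise the RFC3339 string first.
    window = datetime.fromisoformat(window_start.replace("Z", "+00:00"))
    # Month and day are rendered without zero-padding, matching the
    # f"{window.month}" style interpolation used in the module.
    return f"{file_key_prefix}/{window.year}/{window.month}/{window.day}/{file_name}"

print(gcs_blob_name("tmp", "2024-01-05T00:00:00Z"))
# tmp/2024/1/5/k8s_opencost.parquet
```

The unpadded month/day means path segments sort lexicographically out of order (e.g. `10` before `2`); consumers that rely on ordered listings would need to parse the segments numerically.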

src/storage_factory.py

Lines changed: 6 additions & 3 deletions

@@ -5,17 +5,18 @@
 
 from storage.aws_s3_storage import S3Storage
 from storage.azure_storage import AzureStorage
+from storage.gcp_storage import GCPStorage  # New import
 
 
 def get_storage(storage_backend):
     """
     Factory function to create and return a storage object based on the given backend.
 
-    This function abstracts the creation of storage objectss. It supports 'azure' for
-    Azure Storage and 's3' for AWS S3 Storage.
+    This function abstracts the creation of storage objects. It supports 'azure' for
+    Azure Storage, 's3' for AWS S3 Storage, and 'gcp' for Google Cloud Storage.
 
     Parameters:
-        storage_backend (str): The name of the storage backend. SUpported:'azure','s3'.
+        storage_backend (str): The name of the storage backend. Supported: 'azure', 's3', 'gcp'.
 
     Returns:
         An instance of the specified storage backend class.
@@ -27,5 +28,7 @@ def get_storage(storage_backend):
         return AzureStorage()
     if storage_backend in ['s3', 'aws']:
         return S3Storage()
+    if storage_backend == 'gcp':
+        return GCPStorage()
 
     raise ValueError("Unsupported storage backend")
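The factory's dispatch logic can be exercised without any cloud SDKs installed; a self-contained sketch with stub classes standing in for the real storage backends:

```python
# Stubs standing in for the real S3Storage/AzureStorage/GCPStorage classes,
# so the factory's dispatch can be shown without cloud dependencies.
class S3Storage: ...
class AzureStorage: ...
class GCPStorage: ...

def get_storage(storage_backend):
    """Factory: map a backend name to a storage instance (same logic as above)."""
    if storage_backend == 'azure':
        return AzureStorage()
    if storage_backend in ['s3', 'aws']:
        return S3Storage()
    if storage_backend == 'gcp':
        return GCPStorage()
    raise ValueError("Unsupported storage backend")

print(type(get_storage('gcp')).__name__)  # GCPStorage
```

Because callers only receive the shared BaseStorage interface, adding a backend is a matter of one import and one branch here, which is exactly what this commit does for `gcp`.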

src/test_opencost_parquet_exporter.py

Lines changed: 30 additions & 0 deletions

@@ -69,6 +69,36 @@ def test_get_azure_config_with_env_vars(self):
         self.assertEqual(config['params'][1][1], 'true')
         self.assertEqual(config['params'][2][1], 'true')
 
+    def test_get_gcp_config_with_env_vars(self):
+        """Test get_config returns correct configurations based on environment variables."""
+        with patch.dict(os.environ, {
+                'OPENCOST_PARQUET_SVC_HOSTNAME': 'testhost',
+                'OPENCOST_PARQUET_SVC_PORT': '8080',
+                'OPENCOST_PARQUET_WINDOW_START': '2020-01-01T00:00:00Z',
+                'OPENCOST_PARQUET_WINDOW_END': '2020-01-01T23:59:59Z',
+                'OPENCOST_PARQUET_S3_BUCKET': 's3://test-bucket',
+                'OPENCOST_PARQUET_FILE_KEY_PREFIX': 'test-prefix/',
+                'OPENCOST_PARQUET_AGGREGATE': 'namespace',
+                'OPENCOST_PARQUET_STEP': '1m',
+                'OPENCOST_PARQUET_STORAGE_BACKEND': 'gcp',
+                'OPENCOST_PARQUET_GCP_BUCKET_NAME': 'testbucket',
+                'OPENCOST_PARQUET_GCP_CREDENTIALS_JSON': '{"type": "service_account"}',
+                'OPENCOST_PARQUET_IDLE_BY_NODE': 'true',
+                'OPENCOST_PARQUET_INCLUDE_IDLE': 'true'}, clear=True):
+            config = get_config()
+
+        self.assertEqual(
+            config['url'], 'http://testhost:8080/allocation/compute')
+        self.assertEqual(config['params'][0][1],
+                         '2020-01-01T00:00:00Z,2020-01-01T23:59:59Z')
+        self.assertEqual(config['storage_backend'], 'gcp')
+        self.assertEqual(
+            config['gcp_bucket_name'], 'testbucket')
+        self.assertEqual(config['gcp_credentials'], {
+            'type': 'service_account'})
+        self.assertEqual(config['params'][1][1], 'true')
+        self.assertEqual(config['params'][2][1], 'true')
+
     @freeze_time("2024-01-31")
     def test_get_config_defaults_last_day_of_month(self):
         """Test get_config returns correct defaults when no env vars are set."""
