CUR Ingestion Application

A Spark Java application that reads Cost and Usage Report (CUR) files from Google Cloud Storage (GCS) and ingests them into BigQuery tables with advanced tagging capabilities. The application is designed to run both locally and on Kubernetes using the Spark Operator.

Overview

This application provides a scalable solution for processing and loading CUR data into BigQuery for further analysis. It supports various file formats (CSV, JSON, Parquet) and includes data transformation capabilities.

Features

Read CUR files from Google Cloud Storage
Support for multiple file formats (CSV, JSON, Parquet)
Data transformation and cleaning
Data quality checks
Advanced tagging capabilities with multiple implementation options
Configurable GCP authentication
BigQuery table partitioning
Error handling and logging
Kubernetes deployment support via Spark Operator

Prerequisites

Java 11+
Apache Spark 3.x
Google Cloud SDK
Maven 3.6+
Kubernetes cluster (for K8s deployment)

Tagging Functionality

The application provides multiple tagging approaches to categorize and track AWS resources in your CUR data:

Tagging Modes

Simple Tagging (default)
- Adds a tags array column to each CUR row based on matching rules
- Tags are stored directly in the CUR data table
- No historical tracking of tag changes
SCD Type-2 Tagging
- Implements Slowly Changing Dimension Type-2 pattern for tracking tag history
- Adds columns for effective dates and current status
- Creates new versions of records when tags change
- All history is stored in the CUR data table
Direct Tagging with History
- Embeds tags directly in CUR data for efficient querying
- Maintains a separate cur_resource_tags table for historical tracking
- Uses resource signatures to minimize storage requirements
- Optimized for both query performance and historical analysis
None
- Skips tagging completely

Tagging Rules

Tagging rules are defined in a properties file with the following format:

rule.1.name=EC2Resources
rule.1.field=line_item_product_code
rule.1.operator==
rule.1.value=AmazonEC2
rule.1.tag=Compute

rule.2.name=S3Resources
rule.2.field=line_item_product_code
rule.2.operator==
rule.2.value=AmazonS3
rule.2.tag=Storage

Supported operators include:

== (equals)
!= (not equals)
> (greater than)
< (less than)
>= (greater than or equal)
<= (less than or equal)
contains
startsWith
endsWith
isNull
isNotNull

Code Flow

Main Application Flow

Initialization
- Parse command-line arguments
- Load configuration
- Create Spark session with GCP configuration
Data Ingestion
- Read CUR file from GCS
- Detect file format (CSV, JSON, Parquet)
- Apply schema if needed
Data Transformation
- Clean and normalize data
- Apply data quality checks
- Filter invalid records
Tagging
- Apply tagging based on selected mode:
  - Simple: Add tags column directly
  - SCD: Add tags with effective dates and current status
  - Direct: Add tags and maintain separate history table
  - None: Skip tagging
BigQuery Writing
- Add ingestion timestamp
- Write to BigQuery table
- Write resource tags history to separate table (if using Direct tagging)

Tagging Engine Components

CURTagRule

Represents a single tagging rule
Parses rule definition from properties
Generates Spark SQL expressions for rule conditions

CURTaggingEngine

Loads rules from configuration file
Applies rules to CUR data
Generates tags array column

SCDTaggingEngine

Extends tagging with SCD Type-2 functionality
Adds effective dates and current status
Handles tag history tracking

DirectTaggingEngine

Implements optimized tagging with separate history
Uses resource signatures to minimize storage
Maintains history in separate table
Provides methods for updating tags over time

Usage

Run the application with:

spark-submit --class com.example.CURIngestionApp \
  cur-java-spark.jar \
  <gcs-bucket> \
  <cur-file-path> \
  <project-id> \
  <dataset.table> \
  [config-file] \
  [rules-file] \
  [tagging-mode]

Where:

gcs-bucket: GCS bucket containing CUR files
cur-file-path: Path to CUR file within the bucket
project-id: GCP project ID
dataset.table: BigQuery destination table
config-file: (Optional) GCP configuration file (default: gcp-config.properties)
rules-file: (Optional) Tagging rules file (default: tagging-rules.properties)
tagging-mode: (Optional) One of: none, simple, scd, direct (default: simple)

Testing

The application includes comprehensive unit tests for all components:

CURDataTransformerTest: Tests data transformation and quality checks
CURIngestionAppTest: Tests file reading and configuration
CURFileFormatTest: Tests different file format handling
CURTaggingTest: Tests simple tagging functionality
SCDTaggingTest: Tests SCD Type-2 tagging
DirectTaggingTest: Tests direct tagging with history

Run tests with:

mvn test

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
gradle/wrapper		gradle/wrapper
helm/cur-ingestion-app		helm/cur-ingestion-app
k8s		k8s
src		src
Dockerfile		Dockerfile
README.md		README.md
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
pom.xml		pom.xml
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUR Ingestion Application

Overview

Features

Prerequisites

Tagging Functionality

Tagging Modes

Tagging Rules

Code Flow

Main Application Flow

Tagging Engine Components

CURTagRule

CURTaggingEngine

SCDTaggingEngine

DirectTaggingEngine

Usage

Testing

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

mihirgt/curfile

Folders and files

Latest commit

History

Repository files navigation

CUR Ingestion Application

Overview

Features

Prerequisites

Tagging Functionality

Tagging Modes

Tagging Rules

Code Flow

Main Application Flow

Tagging Engine Components

CURTagRule

CURTaggingEngine

SCDTaggingEngine

DirectTaggingEngine

Usage

Testing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages