Geotab/flink-connector-gcp-pubsub
Apache Flink Connector for Google Cloud Pub/Sub

A high-performance Apache Flink sink connector for publishing streaming data to Google Cloud Pub/Sub. This connector implements the Flink Sink V2 API and provides robust, scalable integration with GCP Pub/Sub topics.

Note: This connector is published under the com.geotab.dna group ID to avoid namespace collision with the official Apache Flink connector at org.apache.flink:flink-connector-gcp-pubsub. This is a Geotab-maintained implementation with additional features and optimizations.

Features

  • Flink Sink V2 API: Built on Apache Flink's modern Sink API (v2) for improved performance and reliability
  • Backpressure Management: Configurable in-flight request limits to control throughput and memory usage
  • Batch Publishing: Support for batching messages by size, count, and delay thresholds to optimize performance
  • Error Handling: Configurable error handling with retry logic and fatal exception classification
  • Metrics Integration: Built-in Flink metrics for monitoring bytes out, records out, and error counts
  • Compression Support: Optional message compression to reduce network bandwidth
  • Custom Serialization: Flexible serialization schema support for any data type
  • GCP Authentication: Full support for GCP credentials and authentication mechanisms

Prerequisites

  • Apache Flink 1.19.1 or later
  • Java 11 or later
  • Google Cloud Pub/Sub API access
  • GCP service account credentials

Installation

Building from Source

# Clone the repository
git clone https://github.com/Geotab/flink-connector-gcp-pubsub.git
cd flink-connector-gcp-pubsub

# Build with Gradle
./gradlew build

# Install to local Maven repository
./gradlew publishToMavenLocal

Maven Dependency

Add the following dependency to your pom.xml:

<dependency>
    <groupId>com.geotab.dna</groupId>
    <artifactId>flink-connector-gcp-pubsub</artifactId>
    <version>1.0.0</version>
</dependency>

Gradle Dependency

Add the following to your build.gradle:

dependencies {
    implementation 'com.geotab.dna:flink-connector-gcp-pubsub:1.0.0'
}

Usage

Basic Example

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.gcp.pubsub.sink.PubSubSinkV2;
import org.apache.flink.connector.gcp.pubsub.sink.PublisherConfig;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.google.auth.Credentials;
import com.google.auth.oauth2.ServiceAccountCredentials;
import java.io.FileInputStream;
import java.time.Duration;

// Create execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Create your data stream
DataStream<String> stream = env.fromElements("message1", "message2", "message3");

// Configure GCP credentials
Credentials credentials = ServiceAccountCredentials.fromStream(
    new FileInputStream("/path/to/service-account-key.json")
);

// Build publisher configuration
PublisherConfig publisherConfig = PublisherConfig.builder()
    .credentials(credentials)
    .batchRequestByteThreshold(1000000L)      // 1 MB
    .batchElementCountThreshold(100L)          // 100 messages
    .batchDelayThreshold(Duration.ofMillis(10)) // 10 ms
    .enableCompression(true)
    .build();

// Create the PubSub sink
PubSubSinkV2<String> pubSubSink = PubSubSinkV2.<String>builder()
    .projectId("your-gcp-project-id")
    .topicId("your-topic-id")
    .serializationSchema(new SimpleStringSchema())
    .publisherConfig(publisherConfig)
    .maxInFlightRequests(1000)
    .failOnError(false)
    .build();

// Add sink to the stream
stream.sinkTo(pubSubSink);

// Execute the Flink job
env.execute("Flink GCP Pub/Sub Sink Example");

Advanced Configuration

Custom Serialization

import org.apache.flink.api.common.serialization.SerializationSchema;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CustomSerializer<T> implements SerializationSchema<T> {
    // ObjectMapper is not Serializable, so mark it transient and create it
    // in open() on the task, rather than shipping it with the schema instance
    private transient ObjectMapper mapper;

    @Override
    public void open(InitializationContext context) {
        mapper = new ObjectMapper();
    }

    @Override
    public byte[] serialize(T element) {
        try {
            return mapper.writeValueAsBytes(element);
        } catch (Exception e) {
            throw new RuntimeException("Serialization failed", e);
        }
    }
}

// Use custom serializer
PubSubSinkV2<MyDataClass> sink = PubSubSinkV2.<MyDataClass>builder()
    .serializationSchema(new CustomSerializer<>())
    // ... other configurations
    .build();

Retry Configuration

import com.google.api.gax.retrying.RetrySettings;
import org.threeten.bp.Duration;

RetrySettings retrySettings = RetrySettings.newBuilder()
    .setInitialRetryDelay(Duration.ofMillis(100))
    .setRetryDelayMultiplier(2.0)
    .setMaxRetryDelay(Duration.ofSeconds(60))
    .setMaxAttempts(5)
    .build();

PublisherConfig config = PublisherConfig.builder()
    .credentials(credentials)
    .retrySettings(retrySettings)
    .build();
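With these settings, retry delays grow geometrically from 100 ms toward the 60 s cap over at most 5 attempts. Note that gax also applies jitter by default, so actual delays are randomized up to these values. A standalone sketch of the resulting schedule, using a hypothetical `delays` helper (not part of the connector):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper computing the delay schedule implied by the settings
// above (initial 100 ms, multiplier 2.0, 60 s cap, 5 attempts). Real gax
// retries also apply jitter, so actual delays are randomized up to these.
public class BackoffSchedule {
    // Returns the delay in ms before each retry: with maxAttempts = 5 there
    // is one initial attempt followed by up to four retries.
    static List<Long> delays(long initialMs, double multiplier, long maxMs, int maxAttempts) {
        List<Long> out = new ArrayList<>();
        double delay = initialMs;
        for (int attempt = 2; attempt <= maxAttempts; attempt++) {
            out.add((long) Math.min(delay, maxMs));
            delay *= multiplier;
        }
        return out;
    }
}
```

For the settings shown, the schedule is 100 ms, 200 ms, 400 ms, 800 ms before retries one through four.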

Configuration Options

PubSubSinkV2 Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| projectId | String | Yes | - | GCP project ID containing the Pub/Sub topic |
| topicId | String | Yes | - | Pub/Sub topic ID to publish messages to |
| serializationSchema | SerializationSchema | Yes | - | Schema to serialize elements into byte arrays |
| publisherConfig | PublisherConfig | Yes | - | Publisher configuration (credentials, batching, etc.) |
| maxInFlightRequests | int | Yes | - | Maximum number of concurrent in-flight requests (must be > 0) |
| failOnError | boolean | Yes | - | If true, fail immediately on errors; if false, retry non-fatal errors |

PublisherConfig Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| credentials | Credentials | Yes | - | GCP authentication credentials |
| batchRequestByteThreshold | Long | No | 1000 bytes | Maximum batch size in bytes before publishing |
| batchElementCountThreshold | Long | No | 100 | Maximum number of messages in a batch before publishing |
| batchDelayThreshold | Duration | No | 10 ms | Maximum delay before publishing a batch |
| enableCompression | Boolean | No | false | Enable message compression |
| retrySettings | RetrySettings | No | null | Custom retry configuration for failed publish attempts |
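The three batching thresholds act as "whichever trips first" triggers: a batch is published as soon as it reaches the byte limit, the element count, or the delay since its first message. A standalone sketch of that rule (illustrative only; actual batching happens inside GCP's Publisher, which also flushes on a timer when no new messages arrive):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative batcher showing the "whichever threshold trips first" rule.
// This is not the connector's internal code.
public class ThresholdBatcher {
    final long maxBytes;
    final long maxCount;
    final long maxDelayMs;
    final List<byte[]> buffer = new ArrayList<>();
    long bufferedBytes = 0;
    long firstElementAtMs = -1;

    ThresholdBatcher(long maxBytes, long maxCount, long maxDelayMs) {
        this.maxBytes = maxBytes;
        this.maxCount = maxCount;
        this.maxDelayMs = maxDelayMs;
    }

    // Buffers one message; returns the full batch if any threshold tripped,
    // or null if the batch should keep accumulating.
    List<byte[]> add(byte[] message, long nowMs) {
        if (buffer.isEmpty()) {
            firstElementAtMs = nowMs;
        }
        buffer.add(message);
        bufferedBytes += message.length;
        boolean tripped = bufferedBytes >= maxBytes
                || buffer.size() >= maxCount
                || nowMs - firstElementAtMs >= maxDelayMs;
        if (!tripped) {
            return null;
        }
        List<byte[]> batch = new ArrayList<>(buffer);
        buffer.clear();
        bufferedBytes = 0;
        return batch;
    }
}
```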

Backpressure and Flow Control

The connector implements backpressure through the maxInFlightRequests parameter. When the number of pending publish operations reaches this limit, the writer blocks until some requests complete. This prevents overwhelming the Pub/Sub service and controls memory usage.

// Configure for high throughput
.maxInFlightRequests(5000)

// Configure for lower memory footprint
.maxInFlightRequests(100)
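The limiting behavior described above can be sketched with a counting semaphore: acquire a permit before each publish, release it when the publish future completes. This is only an illustration of the mechanism; the connector's PubSubWriter internals may differ:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration of in-flight limiting with a counting semaphore; not the
// connector's actual code.
public class InFlightLimiter {
    final Semaphore permits;
    final AtomicInteger inFlight = new AtomicInteger();
    final AtomicInteger observedMax = new AtomicInteger();

    InFlightLimiter(int maxInFlightRequests) {
        permits = new Semaphore(maxInFlightRequests);
    }

    CompletableFuture<Void> publish(Runnable send, Executor executor) {
        permits.acquireUninterruptibly();            // blocks once the limit is hit
        observedMax.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        return CompletableFuture.runAsync(send, executor)
                .whenComplete((ok, err) -> {         // completion frees a slot
                    inFlight.decrementAndGet();
                    permits.release();
                });
    }

    public static void main(String[] args) {
        InFlightLimiter limiter = new InFlightLimiter(4);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        CompletableFuture<?>[] pending = new CompletableFuture<?>[32];
        for (int i = 0; i < 32; i++) {
            pending[i] = limiter.publish(() -> {
                try { Thread.sleep(5); } catch (InterruptedException ignored) { }
            }, pool);
        }
        CompletableFuture.allOf(pending).join();
        pool.shutdown();
        System.out.println("peak in-flight: " + limiter.observedMax.get());
    }
}
```

Even with 32 submissions racing on 8 threads, the peak in-flight count never exceeds the configured limit of 4.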

Error Handling

The connector classifies exceptions into two categories:

  1. Fatal Exceptions: Unrecoverable errors that fail the job immediately
     • Topic not found (NOT_FOUND status)
     • Project not found (NOT_FOUND: Requested project not found)
  2. Non-Fatal Exceptions: Transient errors that can be retried
     • Network timeouts
     • Temporary service unavailability
     • Rate limiting

Configure error behavior with the failOnError parameter:

  • failOnError = true: Fail the job on any error (recommended for critical pipelines)
  • failOnError = false: Retry non-fatal errors indefinitely (may cause message duplication)
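The classification above amounts to a status check. A simplified sketch, using a stand-in status enum rather than the gRPC status that the real PubSubExceptionClassifiers inspect:

```java
// Simplified sketch of the fatal vs. non-fatal split described above. The
// Status enum here is a stand-in for the gRPC status that the connector's
// PubSubExceptionClassifiers would actually inspect.
public class ErrorClassifier {
    enum Status { NOT_FOUND, DEADLINE_EXCEEDED, UNAVAILABLE, RESOURCE_EXHAUSTED }

    // NOT_FOUND (missing topic or project) cannot succeed on retry; timeouts,
    // temporary unavailability, and rate limiting can.
    static boolean isFatal(Status status) {
        return status == Status.NOT_FOUND;
    }
}
```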

Metrics

The connector exposes the following Flink metrics:

  • numBytesOut: Total bytes successfully published to Pub/Sub
  • numRecordsOut: Total number of records successfully published
  • numRecordsOutErrors: Total number of records that failed to publish

Access metrics through Flink's metric system or monitoring dashboards.

Architecture

The connector consists of the following key components:

  • PubSubSinkV2: Main sink implementation following Flink's Sink V2 API
  • PubSubWriter: Writer implementation handling message publishing and backpressure
  • PubSubWriterClient: Wrapper around GCP's Publisher client
  • PublisherConfig: Configuration object for Pub/Sub publisher settings
  • PubSubExceptionClassifiers: Exception classification for error handling

Best Practices

  1. Choose appropriate batch settings: Balance latency vs. throughput by tuning batchDelayThreshold, batchElementCountThreshold, and batchRequestByteThreshold

  2. Monitor in-flight requests: Set maxInFlightRequests based on your memory constraints and desired throughput

  3. Use compression for large messages: Enable enableCompression when publishing large payloads to reduce network costs

  4. Implement proper serialization: Ensure your serialization schema is efficient and handles null values appropriately

  5. Handle credentials securely: Use GCP service accounts and avoid hardcoding credentials

  6. Enable checkpointing: Configure Flink checkpointing to ensure exactly-once or at-least-once delivery semantics
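Checkpointing is standard Flink configuration; a minimal fragment, assuming a StreamExecutionEnvironment named env (the 60 s interval and AT_LEAST_ONCE mode are illustrative choices, not recommendations):

```java
// Fragment: enable checkpointing on an existing StreamExecutionEnvironment.
// Interval and mode below are illustrative, not prescriptive.
env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
```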

Troubleshooting

Authentication Errors

Ensure your service account has the pubsub.publisher role on the target topic.

Topic Not Found

Verify the projectId and topicId are correct and the topic exists in GCP.

Backpressure Issues

Increase maxInFlightRequests or tune batching parameters to improve throughput.

Memory Issues

Reduce maxInFlightRequests or enable compression to lower memory usage.

Contributors

Geotabbers: Che-Wei Chou / Mark Ma
