NYPL/recap-harvester


ReCAP Harvester

This application is a poller in the Data/Search pipeline of NYPL's Library Services Platform. It polls an S3 bucket (formerly an SFTP server) managed by SCSB for additions, updates, and deletions of partner records. It processes SCSBXML and JSON files and writes BibPostRequest and ItemPostRequest documents to the BibPostRequest-[env] and ItemPostRequest-[env] streams, just as our SierraUpdatePoller does for our own records. This app is represented by the "RecapHarvester" component in our Data/Search Architecture diagram.

Technology Stack

This is a Java application built with:

  • Gradle for build automation and dependency management
  • Spring Boot for application framework
  • Apache Camel for integration and routing
  • Java 17+ (LTS version required)

Prerequisites

Java Installation

You need Java 17 or later. To install on macOS using Homebrew:

brew install openjdk@17
echo 'export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version

This should report something like: openjdk version "17.0.x"

Gradle

Gradle Wrapper is included in the project, so no separate Gradle installation is required.

Setup

Environment Configuration

Copy the sample environment file and configure it for your environment:

cp .env-local-export.sample .env-local-export

Edit .env-local-export and fill in the missing values. The required missing environment variables are:

# Creds for connecting to the SCSB S3 bucket containing updates
s3Bucket=[aws s3 bucket name for scsb]
s3AccessKey=[aws access key for scsb bucket]
s3SecretKey=[aws secret key for scsb bucket]

# Creds for connecting to the SchemaService:
NyplOAuthKey=[platform api client id]
NyplOAuthSecret=[platform api client secret]

# Creds to write to Kinesis streams:
AWS_ACCESS_KEY_ID=[aws key]
AWS_SECRET_ACCESS_KEY=[aws secret]

For different options to set AWS credentials, please refer to the AWS SDK for Java documentation.

Building and Running

Build the Application

To compile and run all tests:

./gradlew clean build

Running Locally

Source your environment file and run the application:

source .env-local-export
./gradlew bootRun

The app can also be run locally in a container with Docker Compose, which automatically sources .env-local-export:

docker-compose up --build

See Using Docker Compose for details on local Docker builds.

Application Modes

This app can be run in two different modes:

1. Nightly Updates (Primary Mode)

On a nightly basis, it downloads HTC-generated files that represent bibliographic additions, updates, and deletions.

2. Bulk Mode

For initial seeding of databases and indexes from a massive zip of partner items generated by HTC. Note: this was done at product launch and has not been retested since. We're unlikely to need this again unless we have a very large influx of partner records to update.

How It Works

The app connects to an S3 bucket managed by SCSB to retrieve partner bib and item metadata updates. It uses Apache Camel, an integration and routing framework, to manage the following pipeline:

Additions & Updates:

  1. Fetch zipped SCSBXml documents from S3 bucket: data-exports/NYPL/SCSBXml/Incremental/*.zip
  2. Move processed zips to data-exports/NYPL/SCSBXml/Incremental.processed (prevents reprocessing)
  3. Unzip the SCSBXML documents and write them locally
  4. Iterate over local SCSBXML documents, translate them into Bib and Item documents
  5. Broadcast the documents over BibPostRequest-[env] and ItemPostRequest-[env] Kinesis streams (Avro encoded)
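Step 4 of the pipeline above — translating a parsed SCSBXML record into a Bib document — can be sketched in plain Java. The field names below are illustrative assumptions, not the app's actual schema; only the "recap-" institution prefix is grounded in the log format described under Monitoring.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BibTranslationSketch {
    // Hypothetical shape of a Bib document built from a parsed SCSBXML
    // record. Field names are illustrative, not the app's real schema.
    static Map<String, Object> toBibDocument(String owningInstitution,
                                             String bibId, String title) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Partner records are namespaced by owning institution, e.g. "recap-pul"
        doc.put("nyplSource", "recap-" + owningInstitution.toLowerCase());
        doc.put("id", bibId);
        doc.put("title", title);
        doc.put("deleted", false); // additions/updates are live records
        return doc;
    }
}
```

The resulting map would then be Avro-encoded before being put on the BibPostRequest-[env] stream.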

Deletions:

  1. Fetch zipped JSON documents from S3 bucket: data-exports/NYPL/Json/*.zip
  2. Move processed zips to data-exports/NYPL/Json.processed (prevents reprocessing)
  3. Unzip the JSON documents and write them locally
  4. Translate them into Bib and Item documents with "deleted": true
  5. Broadcast the "deleted" documents over Kinesis streams (Avro encoded)
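Step 3 in both pipelines — unzipping the downloaded exports into a local working directory — needs nothing beyond the JDK's java.util.zip. A minimal sketch (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch {
    // Unzip one downloaded export, writing each file entry under workDir
    // and returning the extracted paths for the translation step.
    static List<Path> unzipTo(Path zipFile, Path workDir) throws IOException {
        Path root = workDir.normalize();
        List<Path> extracted = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(zipFile))) {
            for (ZipEntry entry; (entry = zis.getNextEntry()) != null; ) {
                Path target = root.resolve(entry.getName()).normalize();
                if (!target.startsWith(root)) continue; // skip zip-slip entries
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zis, target, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(target);
                }
            }
        }
        return extracted;
    }
}
```

In the real app this stage sits inside a Camel route; the sketch only shows the unzip-and-write-locally step itself.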

Monitoring

The app is monitored by a CloudWatch alarm. The alarm fires when the RecapHarvesterBibProcessed-Production CloudWatch metric is <= 0 for 1 day.

How Monitoring Works:

  1. App logs "Processing bib - recap-" for each processed record
  2. A metric filter converts log entries to metrics
  3. An alarm fires when metrics aren't written for 1 day
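In code terms, step 1 is just a log line with a stable prefix. The message layout below is an assumption; only the "Processing bib - recap-" prefix comes from the actual metric filter:

```java
public class MonitoringLogSketch {
    // The CloudWatch metric filter matches the "Processing bib - recap-"
    // prefix; one such line per record keeps the metric above zero.
    static String processedMessage(String nyplSource, String bibId) {
        return "Processing bib - " + nyplSource + " " + bibId;
    }
}
```

Anything that changes this prefix will silently break the metric filter, so treat it as part of the monitoring contract.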

False Positives:

  • Timing variations: SCSB drops incrementals at different times, causing temporary alarm states
  • Weekend gaps: Sometimes no updates occur for 1-2 days, particularly weekends

Development

Logging Configuration

Logging level is currently fixed at INFO. To modify logging levels, update the logback configuration or application properties.
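For example, a logback override to raise one package's level might look like the following fragment (appenders omitted; which package you tune is up to you):

```xml
<!-- logback.xml: raise one package's level above the fixed INFO default -->
<configuration>
  <logger name="org.apache.camel" level="DEBUG"/>
</configuration>
```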

IDE Setup

The project includes standard Gradle project structure:

src/
├── main/
│   ├── java/           # Application source code
│   └── resources/      # Configuration files, schemas
└── test/
    └── java/           # Test source code

Deployment

This app is deployed automatically by CI/CD when updates are made to the qa or production branches, using containerized deployment to AWS ECS.

Container Deployment

The application runs as a Docker container on AWS ECS (Elastic Container Service) with images stored in ECR (Elastic Container Registry).

ECR Push

# Login to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 946183545209.dkr.ecr.us-east-1.amazonaws.com

# Build and tag local Docker image
docker build -t recap-harvester:local .

# Use current commit sha as image tag
export IMAGE_TAG=$(git rev-parse --short HEAD)

# Tag and push for ECR
docker tag recap-harvester:local 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG
docker push 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG

Using Docker Compose (Local Testing)

# Create docker-compose.yml with environment variables
docker-compose up -d

# View logs
docker-compose logs -f recap-harvester

# Stop service
docker-compose down

CI/CD Pipeline

The deployment pipeline typically follows these steps:

  1. Build Stage: ./gradlew clean build
  2. Test Stage: ./gradlew test
  3. Docker Build: Build and tag container image
  4. ECR Push: Push image to ECR repository
  5. ECS Deploy: Update ECS service with new task definition
  6. Health Check: Monitor service deployment and health

Troubleshooting

Common Issues

Java Version Problems

If you encounter build errors, ensure you're using Java 17+:

java -version
./gradlew -version

JAXB Issues

If you see JAXB-related errors, ensure the jaxb.index files are present in the model packages.
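A jaxb.index file is a plain-text resource placed in the same package as the bound model classes, listing their simple names one per line. For example (class names and package path hypothetical), a src/main/resources/org/nypl/recap/models/jaxb.index might contain:

```text
BibRecord
ItemRecord
Holdings
```

JAXBContext.newInstance(String contextPath) discovers bound classes via this file (or an ObjectFactory class) when scanning the package, so a missing jaxb.index typically surfaces as a context-creation error at startup.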

S3 Connection Issues

  • 301 Moved Permanently: Wrong S3 region configuration
  • 400 Bad Request: Check credentials and bucket permissions
  • 403 Forbidden: Insufficient IAM permissions

AWS Credentials

Test your credentials work:

aws s3 ls s3://your-bucket-name/ --region us-east-2

# For ECS deployments, also test ECR access
aws ecr describe-repositories --region us-east-1

# Test ECS permissions
aws ecs list-clusters
aws ecs list-services --cluster your-cluster-name

Debug Logging

To enable detailed logging:

In ECS Task Definition (environment variables):

"environment": [
  { "name": "LOGGING_LEVEL_ORG_APACHE_CAMEL", "value": "DEBUG" },
  { "name": "LOGGING_LEVEL_SOFTWARE_AMAZON_AWSSDK", "value": "DEBUG" }
]

In application.properties (for local development):

logging.level.org.apache.camel=DEBUG
logging.level.software.amazon.awssdk=DEBUG

Git Workflow

This repo uses the Main-QA-Production Git Workflow.

Migration Notes

This application has been migrated from Maven to Gradle, upgraded to Java 17, and modernized from Elastic Beanstalk to containerized ECS deployment. Key changes:

  • Build System: Maven → Gradle
  • Java Version: Java 8 → Java 17
  • Deployment: Elastic Beanstalk → Docker + ECS
  • Container Registry: ECR for image storage
  • JAXB: Now external dependency (removed from JDK in Java 11+)
  • Dependencies: Updated to Jakarta EE specifications where applicable
  • AWS SDK: Updated to AWS SDK v2
