NYPL/recap-harvester


ReCAP Harvester

This application is a poller in the Data/Search pipeline of NYPL's Library Services Platform. It polls an S3 bucket (formerly an SFTP server) managed by SCSB for additions, updates, and deletions of partner records. It processes SCSBXML and JSON files and writes BibPostRequest and ItemPostRequest documents to the BibPostRequest-[env] and ItemPostRequest-[env] streams, just as our SierraUpdatePoller does for our own records. This app is represented by the "RecapHarvester" component in our Data/Search Architecture diagram.

Technology Stack

This is a Java application built with:

  • Gradle for build automation and dependency management
  • Spring Boot for application framework
  • Apache Camel for integration and routing
  • Java 17+ (LTS version required)

Prerequisites

Java Installation

You need Java 17 or later. To install on macOS using Homebrew:

brew install openjdk@17
echo 'export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version

This should report something like: openjdk version "17.0.x"

Gradle

Gradle Wrapper is included in the project, so no separate Gradle installation is required.

Setup

Environment Configuration

Copy the sample environment file and configure it for your environment:

cp .env-local-export.sample .env-local-export

Edit .env-local-export and fill in the missing values. The required missing environment variables are:

# Creds for connecting to the SCSB S3 bucket containing updates
s3Bucket=[aws s3 bucket name for scsb]
s3AccessKey=[aws access key for scsb bucket]
s3SecretKey=[aws secret key for scsb bucket]

# Creds for connecting to the SchemaService:
NyplOAuthKey=[platform api client id]
NyplOAuthSecret=[platform api client secret]

# Creds to write to Kinesis streams:
AWS_ACCESS_KEY_ID=[aws key]
AWS_SECRET_ACCESS_KEY=[aws secret]

For different options to set AWS credentials, please refer to the AWS SDK for Java documentation.

Building and Running

Build the Application

To compile and run all tests:

./gradlew clean build

Running Locally

Source your environment file and run the application:

source .env-local-export
./gradlew bootRun

The app can also be run locally in a container with Docker Compose, which automatically sources .env-local-export:

docker-compose up --build

See Using Docker Compose for details on local Docker builds.

Application Modes

This app can be run in two different modes:

1. Nightly Updates (Primary Mode)

On a nightly basis, it downloads HTC-generated files that represent bibliographic additions, updates, and deletions.

2. Bulk Mode

For initial seeding of databases and indexes from a massive zip of partner items generated by HTC. Note: this was done at product launch and has not been retested since. We're unlikely to need this again unless we have a very large influx of partner records to update.

How It Works

The app connects to an S3 bucket managed by SCSB to retrieve partner bib and item metadata updates. It uses Apache Camel, an integration and routing framework, to manage the following pipeline:

Additions & Updates:

  1. Fetch zipped SCSBXml documents from S3 bucket: data-exports/NYPL/SCSBXml/Incremental/*.zip
  2. Move processed zips to data-exports/NYPL/SCSBXml/Incremental.processed (prevents reprocessing)
  3. Unzip the SCSBXML documents and write them locally
  4. Iterate over local SCSBXML documents, translate them into Bib and Item documents
  5. Broadcast the documents over BibPostRequest-[env] and ItemPostRequest-[env] Kinesis streams (Avro encoded)
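Step 4 of the pipeline above — translating a parsed SCSBXML record into a Bib document — can be sketched in plain Java. The field names below are illustrative assumptions, not the app's actual schema; only the "recap-" institution prefix is grounded in the log format described under Monitoring.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BibTranslationSketch {
    // Hypothetical shape of a Bib document built from a parsed SCSBXML
    // record. Field names are illustrative, not the app's real schema.
    static Map<String, Object> toBibDocument(String owningInstitution,
                                             String bibId, String title) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Partner records are namespaced by owning institution, e.g. "recap-pul"
        doc.put("nyplSource", "recap-" + owningInstitution.toLowerCase());
        doc.put("id", bibId);
        doc.put("title", title);
        doc.put("deleted", false); // additions/updates are live records
        return doc;
    }
}
```

The resulting map would then be Avro-encoded before being put on the BibPostRequest-[env] stream.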

Deletions:

  1. Fetch zipped JSON documents from S3 bucket: data-exports/NYPL/Json/*.zip
  2. Move processed zips to data-exports/NYPL/Json.processed (prevents reprocessing)
  3. Unzip the JSON documents and write them locally
  4. Translate them into Bib and Item documents with "deleted": true
  5. Broadcast the "deleted" documents over Kinesis streams (Avro encoded)
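Step 3 in both pipelines — unzipping the downloaded exports into a local working directory — needs nothing beyond the JDK's java.util.zip. A minimal sketch (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch {
    // Unzip one downloaded export, writing each file entry under workDir
    // and returning the extracted paths for the translation step.
    static List<Path> unzipTo(Path zipFile, Path workDir) throws IOException {
        Path root = workDir.normalize();
        List<Path> extracted = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(zipFile))) {
            for (ZipEntry entry; (entry = zis.getNextEntry()) != null; ) {
                Path target = root.resolve(entry.getName()).normalize();
                if (!target.startsWith(root)) continue; // skip zip-slip entries
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zis, target, StandardCopyOption.REPLACE_EXISTING);
                    extracted.add(target);
                }
            }
        }
        return extracted;
    }
}
```

In the real app this stage sits inside a Camel route; the sketch only shows the unzip-and-write-locally step itself.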

Monitoring

The app is monitored by a CloudWatch alarm. The alarm fires when the RecapHarvesterBibProcessed-Production CloudWatch metric is <= 0 for 1 day.

How Monitoring Works:

  1. App logs "Processing bib - recap-" for each processed record
  2. A metric filter converts log entries to metrics
  3. An alarm fires when metrics aren't written for 1 day
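In code terms, step 1 is just a log line with a stable prefix. The message layout below is an assumption; only the "Processing bib - recap-" prefix comes from the actual metric filter:

```java
public class MonitoringLogSketch {
    // The CloudWatch metric filter matches the "Processing bib - recap-"
    // prefix; one such line per record keeps the metric above zero.
    static String processedMessage(String nyplSource, String bibId) {
        return "Processing bib - " + nyplSource + " " + bibId;
    }
}
```

Anything that changes this prefix will silently break the metric filter, so treat it as part of the monitoring contract.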

False Positives:

  • Timing variations: SCSB drops incrementals at different times, causing temporary alarm states
  • Weekend gaps: Sometimes no updates occur for 1-2 days, particularly weekends

Development

Logging Configuration

Logging level is currently fixed at INFO. To modify logging levels, update the logback configuration or application properties.
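For example, a logback override to raise one package's level might look like the following fragment (appenders omitted; which package you tune is up to you):

```xml
<!-- logback.xml: raise one package's level above the fixed INFO default -->
<configuration>
  <logger name="org.apache.camel" level="DEBUG"/>
</configuration>
```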

IDE Setup

The project includes standard Gradle project structure:

src/
├── main/
│   ├── java/           # Application source code
│   └── resources/      # Configuration files, schemas
└── test/
    └── java/           # Test source code

Deployment

This app is deployed automatically by CI/CD when updates are made to the qa or production branches, using containerized deployment to AWS ECS.

Container Deployment

The application runs as a Docker container on AWS ECS (Elastic Container Service) with images stored in ECR (Elastic Container Registry).

ECR Push

# Login to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 946183545209.dkr.ecr.us-east-1.amazonaws.com

# Build and tag local Docker image
docker build -t recap-harvester:local .

# Use current commit sha as image tag
export IMAGE_TAG=$(git rev-parse --short HEAD)

# Tag and push for ECR
docker tag recap-harvester:local 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG
docker push 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG

Using Docker Compose (Local Testing)

# Create docker-compose.yml with environment variables
docker-compose up -d

# View logs
docker-compose logs -f recap-harvester

# Stop service
docker-compose down

CI/CD Pipeline

The deployment pipeline typically follows these steps:

  1. Build Stage: ./gradlew clean build
  2. Test Stage: ./gradlew test
  3. Docker Build: Build and tag container image
  4. ECR Push: Push image to ECR repository
  5. ECS Deploy: Update ECS service with new task definition
  6. Health Check: Monitor service deployment and health

Troubleshooting

Common Issues

Java Version Problems

If you encounter build errors, ensure you're using Java 17+:

java -version
./gradlew -version

JAXB Issues

If you see JAXB-related errors, ensure the jaxb.index files are present in the model packages.
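A jaxb.index file is a plain-text resource placed in the same package as the bound model classes, listing their simple names one per line. For example (class names and package path hypothetical), a src/main/resources/org/nypl/recap/models/jaxb.index might contain:

```text
BibRecord
ItemRecord
Holdings
```

JAXBContext.newInstance(String contextPath) discovers bound classes via this file (or an ObjectFactory class) when scanning the package, so a missing jaxb.index typically surfaces as a context-creation error at startup.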

S3 Connection Issues

  • 301 Moved Permanently: Wrong S3 region configuration
  • 400 Bad Request: Check credentials and bucket permissions
  • 403 Forbidden: Insufficient IAM permissions

AWS Credentials

Test your credentials work:

aws s3 ls s3://your-bucket-name/ --region us-east-2

# For ECS deployments, also test ECR access
aws ecr describe-repositories --region us-east-1

# Test ECS permissions
aws ecs list-clusters
aws ecs list-services --cluster your-cluster-name

Debug Logging

To enable detailed logging:

In ECS Task Definition (environment variables):

"environment": [
  { "name": "LOGGING_LEVEL_ORG_APACHE_CAMEL", "value": "DEBUG" },
  { "name": "LOGGING_LEVEL_SOFTWARE_AMAZON_AWSSDK", "value": "DEBUG" }
]

In application.properties (for local development):

logging.level.org.apache.camel=DEBUG
logging.level.software.amazon.awssdk=DEBUG

Git Workflow

This repo uses the Main-QA-Production Git Workflow.

Migration Notes

This application has been migrated from Maven to Gradle, upgraded to Java 17, and modernized from Elastic Beanstalk to containerized ECS deployment. Key changes:

  • Build System: Maven → Gradle
  • Java Version: Java 8 → Java 17
  • Deployment: Elastic Beanstalk → Docker + ECS
  • Container Registry: ECR for image storage
  • JAXB: Now external dependency (removed from JDK in Java 11+)
  • Dependencies: Updated to Jakarta EE specifications where applicable
  • AWS SDK: Updated to AWS SDK v2
