This application is a poller in the Data/Search pipeline of NYPL's Library Services Platform. It polls an S3 bucket (formerly an SFTP server) managed by SCSB for additions, updates, and deletions of partner records. It processes SCSBXML and JSON files and writes BibPostRequest and ItemPostRequest documents into the `BibPostRequest-[env]` and `ItemPostRequest-[env]` streams, just as our SierraUpdatePoller does for our own records. This app is represented by the "RecapHarvester" component in our Data/Search Architecture diagram.
This is a Java application built with:
- Gradle for build automation and dependency management
- Spring Boot for application framework
- Apache Camel for integration and routing
- Java 17+ (LTS version required)
You need Java 17 or later. To install on macOS using Homebrew:

```shell
brew install openjdk@17
echo 'export PATH="/opt/homebrew/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
java -version
```

`java -version` should report something like `openjdk version "17.0.x"`.
Gradle Wrapper is included in the project, so no separate Gradle installation is required.
Copy the sample environment file and configure it for your environment:

```shell
cp .env-local-export.sample .env-local-export
```

Edit `.env-local-export` and fill in the missing values. The required environment variables are:
```shell
# Creds for connecting to the SCSB S3 bucket containing updates
s3Bucket=[aws s3 bucket name for scsb]
s3AccessKey=[aws access key for scsb bucket]
s3SecretKey=[aws secret key for scsb bucket]

# Creds for connecting to the SchemaService:
NyplOAuthKey=[platform api client id]
NyplOAuthSecret=[platform api client secret]

# Creds to write to Kinesis streams:
AWS_ACCESS_KEY_ID=[aws key]
AWS_SECRET_ACCESS_KEY=[aws secret]
```

For other options for setting AWS credentials, please refer to the AWS SDK for Java documentation.
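The app fails at runtime if these variables are absent, so it can help to check them up front. A minimal sketch of such a startup check — the `EnvCheck` class and its `missingVars` helper are illustrative, not part of this app:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EnvCheck {
    // Returns the names of any required variables missing (or blank) in the
    // given environment map.
    static List<String> missingVars(Map<String, String> env, List<String> required) {
        List<String> missing = new ArrayList<>();
        for (String name : required) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> required = List.of(
                "s3Bucket", "s3AccessKey", "s3SecretKey",
                "NyplOAuthKey", "NyplOAuthSecret",
                "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY");
        List<String> missing = missingVars(System.getenv(), required);
        if (!missing.isEmpty()) {
            System.err.println("Missing required environment variables: " + missing);
            System.exit(1);
        }
    }
}
```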
To compile and run all tests:

```shell
./gradlew clean build
```

Source your environment file and run the application:

```shell
source .env-local-export
./gradlew bootRun
```

The app can also be run locally in a container with Docker Compose, which automatically sources `.env-local-export`:

```shell
docker-compose up --build
```

See Using Docker Compose for details on local Docker builds.
This app can be run in two different modes:

- Nightly incremental: downloads HTC-generated files that represent bibliographic additions, updates, and deletions.
- Bulk seeding: for initial seeding of databases and the index from a massive zip of partner items generated by HTC. Note: this was done at product launch and has not been retested since; we're unlikely to need it again unless we have a very large influx of partner records to update.
The app connects to an S3 bucket managed by SCSB to retrieve partner bib and item metadata updates. It uses Apache Camel, a streaming framework, to manage the following pipeline:

- Fetch zipped SCSBXML documents from the S3 bucket: `data-exports/NYPL/SCSBXml/Incremental/*.zip`
- Move processed zips to `data-exports/NYPL/SCSBXml/Incremental.processed` (prevents reprocessing)
- Unzip the SCSBXML documents and write them locally
- Iterate over the local SCSBXML documents, translating them into Bib and Item documents
- Broadcast the documents over the `BibPostRequest-[env]` and `ItemPostRequest-[env]` Kinesis streams (Avro encoded)
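The unzip step above can be sketched with the JDK's `java.util.zip`. This is a simplified illustration, not the app's actual Camel route; the `ZipSketch` class and its method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSketch {
    // Collects the names of XML entries in a zipped export, mirroring the
    // "unzip the SCSBXML documents" step (a real run would write each entry
    // to a local file instead of just collecting names).
    static List<String> xmlEntryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().endsWith(".xml")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Helper for demonstration: builds a one-entry zip in memory.
    static byte[] zipOf(String entryName, String content) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content.getBytes());
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```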
- Fetch zipped JSON documents from the S3 bucket: `data-exports/NYPL/Json/*.zip`
- Move processed zips to `data-exports/NYPL/Json.processed` (prevents reprocessing)
- Unzip the JSON documents and write them locally
- Translate them into Bib and Item documents with `"deleted": true`
- Broadcast the "deleted" documents over the Kinesis streams (Avro encoded)
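The deletion path amounts to emitting the same document shape with a deleted flag set. A minimal sketch — only the `deleted` flag comes from this README; `DeleteSketch` and the other field names are placeholders, not the actual Avro schema:

```java
import java.util.HashMap;
import java.util.Map;

public class DeleteSketch {
    // Builds a minimal "deleted" bib document of the kind broadcast on the
    // BibPostRequest stream. The id and nyplSource fields are illustrative.
    static Map<String, Object> deletedBib(String id, String nyplSource) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("nyplSource", nyplSource);
        doc.put("deleted", true);
        return doc;
    }
}
```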
The app is monitored by a CloudWatch alarm, which fires when the `RecapHarvesterBibProcessed-Production` CloudWatch metric is <= 0 for 1 day:

- The app logs "Processing bib - recap-" for each processed record
- A metric filter converts those log entries into metric data points
- The alarm fires when no metric data is written for 1 day
- Timing variations: SCSB drops incrementals at different times, causing temporary alarm states
- Weekend gaps: Sometimes no updates occur for 1-2 days, particularly weekends
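Because the metric filter matches the literal log prefix, each processed record must emit it exactly. A sketch of the log line shape — the `LogSketch` class and the idea of appending a bib id after the prefix are illustrative, not taken from the app's source:

```java
public class LogSketch {
    // Formats the log message that the CloudWatch metric filter matches on.
    // "Processing bib - recap-" is the literal prefix the filter looks for.
    static String processedBibMessage(String partnerBibId) {
        return "Processing bib - recap-" + partnerBibId;
    }
}
```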
Logging level is currently fixed at INFO. To modify logging levels, update the logback configuration or application properties.
The project follows the standard Gradle project structure:

```
src/
├── main/
│   ├── java/        # Application source code
│   └── resources/   # Configuration files, schemas
└── test/
    └── java/        # Test source code
```
This app is deployed automatically by CI/CD when updates are made to the `qa` or `production` branch, using containerized deployment to AWS ECS.
The application runs as a Docker container on AWS ECS (Elastic Container Service) with images stored in ECR (Elastic Container Registry).
```shell
# Log in to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 946183545209.dkr.ecr.us-east-1.amazonaws.com

# Build and tag a local Docker image
docker build -t recap-harvester:local .

# Use the current commit sha as the image tag
export IMAGE_TAG=$(git rev-parse --short HEAD)

# Tag and push to ECR
docker tag recap-harvester:local 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG
docker push 946183545209.dkr.ecr.us-east-1.amazonaws.com/recap-harvester:$IMAGE_TAG
```

```shell
# Create docker-compose.yml with environment variables
docker-compose up -d

# View logs
docker-compose logs -f recap-harvester

# Stop service
docker-compose down
```

The deployment pipeline typically follows these steps:
- Build Stage: `./gradlew clean build`
- Test Stage: `./gradlew test`
- Docker Build: build and tag the container image
- ECR Push: push the image to the ECR repository
- ECS Deploy: update the ECS service with a new task definition
- Health Check: monitor service deployment and health
If you encounter build errors, ensure you're using Java 17+:

```shell
java -version
./gradlew -version
```

If you see JAXB-related errors, ensure the `jaxb.index` files are present in the model packages.
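For reference, a `jaxb.index` file is a plain-text resource placed in a package directory, listing the simple names of that package's JAXB-bound classes, one per line. For example (these class names are illustrative, not this app's actual model classes):

```
Bib
Item
BibRecords
```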
- 301 Moved Permanently: Wrong S3 region configuration
- 400 Bad Request: Check credentials and bucket permissions
- 403 Forbidden: Insufficient IAM permissions
Test that your credentials work:

```shell
aws s3 ls s3://your-bucket-name/ --region us-east-2

# For ECS deployments, also test ECR access
aws ecr describe-repositories --region us-east-1

# Test ECS permissions
aws ecs list-clusters
aws ecs list-services --cluster your-cluster-name
```

To enable detailed logging:
In the ECS Task Definition (environment variables):

```json
{
  "LOGGING_LEVEL_ORG_APACHE_CAMEL": "DEBUG",
  "LOGGING_LEVEL_SOFTWARE_AMAZON_AWSSDK": "DEBUG"
}
```

In `application.properties` (for local development):

```properties
logging.level.org.apache.camel=DEBUG
logging.level.software.amazon.awssdk=DEBUG
```

This repo uses the Main-QA-Production Git Workflow.
This application was migrated from Maven to Gradle, upgraded to Java 17, and modernized from Elastic Beanstalk to a containerized ECS deployment. Key changes:
- Build System: Maven → Gradle
- Java Version: Java 8 → Java 17
- Deployment: Elastic Beanstalk → Docker + ECS
- Container Registry: ECR for image storage
- JAXB: Now external dependency (removed from JDK in Java 11+)
- Dependencies: Updated to Jakarta EE specifications where applicable
- AWS SDK: Updated to AWS SDK v2