
Commit 7c3e152

Merge pull request #5 from NeotomaDB/feature/singlecontainer
Containerized and set up to deploy to AWS.
2 parents 4ba7b13 + 6bd0e06 commit 7c3e152

26 files changed: +798 additions, −260 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 
 .env
+infrastructure/parameters.json
 
 *.gz

CITATION.cff

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+authors:
+- family-names: "Goring"
+  given-names: "Simon James"
+  orcid: "https://orcid.org/0000-0002-2700-4605"
+title: "Neotoma Snapshot Sanitizer"
+version: 1.0.0
+date-released: 2025-07-28
+url: "https://github.com/NeotomaDB/clean_backup"

CONTRIBUTING.md

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Introduction
+
+Thank you for considering contributing to Neotoma's development. It's people like you that make Neotoma such a great community.
+
+Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open source project. In return, they should reciprocate that respect in addressing your issue, assessing changes, and helping you finalize your pull requests.
+
+Improving documentation, bug triaging, or writing tutorials are all examples of helpful contributions that mean less work for the developers. If you wish to make any of these contributions, please read on.
+
+## Submitting changes
+
+Send GitHub Pull Requests to [NeotomaDB/clean_backup](https://github.com/NeotomaDB/clean_backup), adding a clear list of the contributions of the pull request. In particular we'd appreciate test coverage where possible, and more documentation.
+
+Always write a clear log message for your commits. One-line messages are fine for small changes, but bigger changes should look like this:
+
+    $ git commit -m "A brief summary of the commit
+    >
+    > A paragraph describing what changed and its impact."
+
+Please try to use some form of linting or code style with your contributions. This includes proper indentation (generally, 2 spaces), proper spacing around commas and operators, no trailing whitespace, and clear variable naming conventions.
+
+## Asking for Help
+
+Please feel free to contact us through this repository's Issues, by contacting us directly, or through our Slack channel, to ask any questions.

README.md

Lines changed: 43 additions & 33 deletions
@@ -1,59 +1,69 @@
+[![NSF-1948926](https://img.shields.io/badge/NSF-1948926-blue.svg)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=1948926)
+[![NSF-2410961](https://img.shields.io/badge/NSF-2410961-blue.svg)](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2410961)
+
 [![lifecycle](https://img.shields.io/badge/lifecycle-stable-green.svg)](https://www.tidyverse.org/lifecycle/#stable)
 
 
 # Neotoma Anonymized Backups
 
-This repository generates a container service for Neotoma that copies the [Neotoma database](https://neotomadb.org) into a container and overwrites sensitive data using a random `md5` hash. The container then uploads the data to a Neotoma [AWS S3 bucket]() where the snapshot is made publically available.
+This repository generates a container service for Neotoma that copies the [Neotoma Paleoecology Database](https://neotomadb.org) into a Docker container and overwrites sensitive data using a random `md5` hash. The bash script running in the container then uploads the data to a Neotoma AWS S3 bucket, where the snapshot is made publicly available through a URL shared on the Neotoma website.
+
+The compressed file (`neotoma_clean_{DATETIME}.tar.gz`) includes a [bash script](archives/regenbash.sh) that will re-build the database in a user's local Postgres instance. Currently the bash script only runs on Mac and Linux. There is an experimental [Windows batch script](archives/experimental_win_restore.bat) that can be used with caution.
+
+We welcome user contributions; see the [contributors guide](CONTRIBUTING.md).
+
+## Restoring the Database
 
-The compressed file (XXXX) includes a small README and a script to re-build the database in a local Postgres instance.
+The most recent snapshot of the Neotoma Database will always be tagged as `neotoma_clean_latest` in the compressed file, but the actual SQL file used to restore the database will be named with the date the snapshot was taken. Generally, snapshots will be taken every month. If there is a need for a more recent snapshot, please contact the database administrators to request one.
 
-The following installation instructions were tested on PostgreSQL version 16, using script regenbash.sh (mac and linux).
-Alternatively, the commands can be entered directly in the command line. PostgreSQL must already be installed.
+### Postgres Extensions Used
 
-## Postgres Extensions Used
+The Docker container uses Postgres 15, and the current RDS database version is PostgreSQL v15.14. The local database requires the following extensions to be installed before you can restore Neotoma locally:
 
-* [pg_trgm](https://www.postgresql.org/docs/current/pgtrgm.html)
+* [pg_trgm](https://www.postgresql.org/docs/current/pgtrgm.html): Helps with full-text searching of publications.
 * [intarray](https://www.postgresql.org/docs/9.1/intarray.html)
-* [unaccent](https://www.postgresql.org/docs/current/unaccent.html)
-* External: [postgis](https://postgis.net/)
-* External: [vector/pgvector](https://github.com/pgvector/pgvector)
+* [unaccent](https://www.postgresql.org/docs/current/unaccent.html): Helps with searches for terms that may include accents (sitenames, contact names).
+* External: [postgis](https://postgis.net/): Helps manage spatial data.
 
-These extensions are used to improve functionality within the Neotoma Database. External tools such as `postgis` and `pgvector` must be installed prior to creation within the Postgres server. We include the bash script in an effort to help users make the restoration process as simple as possible.
+These extensions are used to improve functionality within the Neotoma Database. The `pg_trgm`, `intarray`, and `unaccent` extensions are included with PostgreSQL. External tools such as `postgis` must be installed prior to creation within the Postgres server.
 
-## Restoring the Database
+The [regenbash.sh](archives/regenbash.sh) script automates some of the creation of the extensions within the restored database.
+
+### Restoring from the Cloud
+
+The *most recent* version of the clean database is always uploaded as a `.tar.gz` file to Neotoma S3 cloud storage. You can download it directly by clicking the badge below. Note that this download is over 2 GB in size.
+
+[![Download Snapshot](https://img.shields.io/badge/Download-Neotoma--Snapshot-orange.svg)](https://neotoma-remote-store.s3.us-east-2.amazonaws.com/neotoma_clean_latest.tar.gz)
+
+Once the file is downloaded, you can extract it locally. The archive contains the following files (the date in the SQL filename may differ):
 
-1. If you haven't already, [download the backup](https://neotomaprimarybackup.s3.us-east-2.amazonaws.com/clean_dump.tar.gz) to your local drive.
+* dbsetup.sql
+* experimental_win_restore.bat
+* regenbash.sh
+* neotoma_clean_2025-07-01.sql
 
-2. Unzip the snapshot file (with commandline, or a tool):
+Once you execute `regenbash.sh` (Mac/Linux) or `experimental_win_restore.bat` (Windows), the database will be restored from the text file into a local database named `neotoma`, at which point you can work with it from whichever database management tool you prefer.
 
-`gunzip clean_dump.tar.gz`
+## AWS Infrastructure
 
-2. Enter the folder and restore database using the command `bash regenbash.sh`. For help, use: `bash regenbash.sh --help`
+The backup itself is generated through AWS. There are two steps: first, packaging the Docker image and sending it to ECR; second, initiating the Batch job, which runs the scripts in the Docker container.
 
-`bash regenbash.sh`
+![AWS Configuration](/assets/AWS_scrub_database_infrastructure.svg)
 
-The script performs the following actions. A password prompt will appear at each step:
-
-The database "neotoma" is first dropped if it exists;
-The new "neotoma" is created;
-Extenesions are installled;
-The snapshot file (neotoma_ndb_only_2024-03-18.sql) is loaded into the new database.
+All files (with the exception of files that directly expose secrets) are available in this repository. All secrets are contained in a `parameters.json` file in the `./infrastructure` folder. We provide a [`parameters-template.json`](./infrastructure/parameters-template.json) file for convenience, so that users can see which key-value pairs are needed for full implementation of the workflow.
 
-3. Alternatively, instead of using the script, the commands can be entered directly via command line:
+### Docker Configuration
 
-dropdb neotoma -h localhost -U username
-createdb neotoma -h localhost -U username
-psql -h localhost -d neotoma -U username -c "CREATE EXTENSION postgis;"
-psql -h localhost -d neotoma -U username -c "CREATE EXTENSION pg_trgm;"
-psql -h localhost -d neotoma -U username -f neotoma_ndb_only_2024-03-18.sql
+The Docker [configuration file](batch.Dockerfile) sets up a container with PostgreSQL 15 and PostGIS. The Docker container sets up the system, creates a connection to a containerized Postgres database, and then uses `pg_dump` to create a plaintext SQL dump of the remote Neotoma database, which is restored within the container. To sanitize the database of sensitive data we execute the script [`app/scrubbed_database.sh`](app/scrubbed_database.sh). The SQL statements overwrite rows in the Data Stewards tables as well as the Contacts tables.
 
+The Docker container is built and deployed to AWS ECR using the script [`build-and-push.sh`](build-and-push.sh). For this script to work, the user must have the AWS CLI installed and have permissions to access Neotoma AWS services.
 
+### AWS Infrastructure Builder
 
-4. To view database using command line interactive terminal:
+The scripts [`deploy.sh`](deploy.sh) and [`update.sh`](update.sh) are used to deploy the [Batch Infrastructure](infrastructure/batch-infrastructure.yaml) configuration to CloudFormation, which is then used to define the AWS Batch run when a job is submitted.
 
-psql neotoma username
+Within the infrastructure file there is a defined `ScheduleRule`, which uses the EventBridge [`cron()`](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-scheduled-rule-pattern.html) scheduler to execute the backup snapshot at 2am on the first day of each month. Single instances of the job can also be executed using [`test_job.sh`](test_job.sh).
 
-Meta-command \d ("describe") will list all the tables in the publice schema. To view the schema (ndb) and tables in the database,
-expand the search path by entering the command:
+## Final Overview
 
-SET search_path TO 'ndb', public;
+With this repository, we implement a monthly backup system using AWS infrastructure to provide Neotoma users with a sanitized version of the database for local use on their personal systems.
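As the README notes, snapshot files are named with the date the snapshot was taken (e.g. `neotoma_clean_2025-07-01.sql`). A minimal shell sketch of that naming scheme, mirroring the `DATESTAMP` convention used in the backup script:

```shell
#!/bin/bash
# Build the dated snapshot filename the same way the backup script does.
DATESTAMP=$(date +"%Y-%m-%d")
ARCHIVE="neotoma_clean_${DATESTAMP}.tar.gz"
echo "${ARCHIVE}"
```

The `neotoma_clean_latest.tar.gz` object in S3 is simply a copy of the most recent dated archive, so the dated name is the one you will see after extraction.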

README.txt

Lines changed: 0 additions & 60 deletions
This file was deleted.

app/app.Dockerfile

Lines changed: 0 additions & 19 deletions
This file was deleted.

app/connect_cloud.sh

Lines changed: 0 additions & 10 deletions
This file was deleted.

app/connect_database.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+#!/bin/bash
+# connect_database.sh - Simple database connection
+
+set -e
+
+echo "Setting up database connection..."
+
+# Use direct RDS connection
+echo "Using direct RDS connection"
+export DB_HOST=${RDS_ENDPOINT:-"neotomaprivate.cxkwxkjpj8zi.us-east-2.rds.amazonaws.com"}
+export DB_PORT=${RDS_PORT:-"5432"}
+
+echo "Database connection configured: ${DB_HOST}:${DB_PORT}"
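The connect script above relies on shell default-value expansion, `${VAR:-default}`, which substitutes the default only when the variable is unset or empty. A quick, self-contained illustration of the behavior:

```shell
#!/bin/bash
# ${VAR:-default} falls back to the default when VAR is unset or empty...
unset RDS_PORT
DB_PORT=${RDS_PORT:-"5432"}
echo "DB_PORT=${DB_PORT}"    # falls back to 5432

# ...but keeps the environment's value when one is provided.
RDS_PORT=5433
DB_PORT=${RDS_PORT:-"5432"}
echo "DB_PORT=${DB_PORT}"    # keeps 5433
```

This lets the Batch job override `RDS_ENDPOINT` and `RDS_PORT` through the container environment while keeping sensible defaults in the script.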

app/scrubbed_database.sh

Lines changed: 45 additions & 17 deletions
@@ -1,31 +1,59 @@
 #!/bin/bash
-
 set -e
 
 # by: Simon Goring
 # Documentation for pg_dump is available at https://www.postgresql.org/docs/current/app-pgdump.html
 # This script uses pg_dump to duplicate the whole database to a local version.
 
-bash /home/app/connect_cloud.sh
+# We're logging out to a proper log file here.
+exec > >(tee -a /var/log/db-sanitize.log)
+exec 2>&1
+
+echo "💡 Starting database sanitization at $(date)"
+DATESTAMP=$(date +"%Y-%m-%d")
+
+echo "🔌 Connecting to the primary Neotoma database..."
+source /home/app/connect_database.sh
 
-echo Dumping:
+echo "Dumping the primary database from ${DB_HOST}:${DB_PORT}:"
 export PGPASSWORD=$REMOTE_PASSWORD
+pg_dump -v -O -C -c --no-owner -x -U $REMOTE_USER -h ${DB_HOST} -p ${DB_PORT} \
+  --no-subscriptions -T ap.globalmammals -T ap.icesheets -N cron -Fp -d neotoma > /home/archives/tempdump.dump
 
-pg_dump -C -v -O --no-owner -x -U $REMOTE_USER -h localhost -p 5454 -T ap.globalmammals -N cron -Fc -d neotoma > /home/archives/tempdump.dump
+echo "Checking to ensure the dump is stable:"
+pg_restore --list /home/archives/tempdump.dump | head -20
 
+echo "🛠 Restoring the database locally"
 export PGPASSWORD=$POSTGRES_PASSWORD
-psql -U postgres -h pgneotoma -d postgres -c "DROP DATABASE IF EXISTS neotoma;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE DATABASE neotoma;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE EXTENSION IF NOT EXISTS postgis;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE EXTENSION IF NOT EXISTS vector;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE EXTENSION IF NOT EXISTS intarray;"
-psql -U postgres -h pgneotoma -d postgres -c "CREATE EXTENSION IF NOT EXISTS unaccent;"
-pg_restore -U postgres -h pgneotoma -c --no-owner --no-privileges -d neotoma /home/archives/tempdump.dump
-psql -U postgres -h pgneotoma -d neotoma -c "UPDATE ti.stewards SET username=SUBSTRING(md5(random()::text) from 1 for 10), pwd=SUBSTRING(md5(random()::text) from 1 for 10);"
-psql -U postgres -h pgneotoma -d neotoma -c "UPDATE ndb.contacts SET address=SUBSTRING(md5(random()::text) from 1 for 10), phone=SUBSTRING(md5(random()::text) from 1 for 10), fax=SUBSTRING(md5(random()::text) from 1 for 10), email=SUBSTRING(md5(random()::text) from 1 for 10);"
-PGPASSWORD=postgres pg_dump -C -v -O --no-owner -x -Fc -p 5432 -h pgneotoma -d neotoma -U postgres > /home/archives/neotoma_clean.dump
+psql -U postgres -h localhost -d postgres -c "DROP DATABASE IF EXISTS neotoma;"
+psql -U postgres -h localhost -d postgres -c "CREATE DATABASE neotoma;"
+psql -U postgres -h localhost -d postgres -c "CREATE EXTENSION IF NOT EXISTS postgis;"
+psql -U postgres -h localhost -d postgres -c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
+psql -U postgres -h localhost -d postgres -c "CREATE EXTENSION IF NOT EXISTS intarray;"
+psql -U postgres -h localhost -d postgres -c "CREATE EXTENSION IF NOT EXISTS unaccent;"
+
+echo "Restoring database in container. Checking for errors."
+
+psql -U postgres -h localhost -d neotoma < /home/archives/tempdump.dump
+
+echo "🧹 Cleaning up all sensitive data from the database."
+psql -U postgres -h localhost -d neotoma -c "UPDATE ti.stewards SET username=SUBSTRING(md5(random()::text) from 1 for 10), pwd=SUBSTRING(md5(random()::text) from 1 for 10);"
+psql -U postgres -h localhost -d neotoma -c "UPDATE ndb.contacts SET address=SUBSTRING(md5(random()::text) from 1 for 10), phone=SUBSTRING(md5(random()::text) from 1 for 10), fax=SUBSTRING(md5(random()::text) from 1 for 10), email=SUBSTRING(md5(random()::text) from 1 for 10);"
+
+echo "✍🏼 Creating the final cleaned dump."
+PGPASSWORD=postgres pg_dump -C -v -O --no-owner -x -Fc -p 5432 -h localhost -d neotoma -U postgres > /home/archives/neotoma_clean_${DATESTAMP}.dump
+
+echo "📦 Compressing the dumped database."
+tar -zcvf /home/archives/neotoma_clean_${DATESTAMP}.tar.gz -C /home/archives/ .
+
+echo "💾 Uploading the archive to S3."
+aws s3 cp /home/archives/neotoma_clean_${DATESTAMP}.tar.gz s3://neotoma-remote-store/ --content-encoding "application/x-compressed-tar"
+aws s3 cp s3://neotoma-remote-store/neotoma_clean_${DATESTAMP}.tar.gz s3://neotoma-remote-store/neotoma_clean_latest.tar.gz --content-encoding "application/x-compressed-tar"
+
+echo "🗑️ Removing temporary files..."
 rm /home/archives/tempdump.dump
-tar -zcvf /home/archives/clean_dump.tar.gz /home/archives/
+rm /home/archives/neotoma_clean_${DATESTAMP}.dump
+rm /home/archives/neotoma_clean_${DATESTAMP}.tar.gz
 
-aws s3 cp --content-encoding "application/x-compressed-tar" /home/archives/clean_dump.tar.gz s3://neotomaprimarybackup
+echo "✔ Database sanitization completed successfully at $(date)"
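The sanitization step in the script above replaces each sensitive value with the first ten characters of an md5 hash of a random value, via `SUBSTRING(md5(random()::text) FROM 1 FOR 10)`. The same idea can be sketched in plain shell (assuming `md5sum` is available, as it is in the container's Linux base image):

```shell
#!/bin/bash
# Mimic SUBSTRING(md5(random()::text) FROM 1 FOR 10) outside Postgres:
# hash a pseudo-random value and keep the first 10 hex characters.
token=$(printf '%s' "${RANDOM}${RANDOM}$(date +%s)" | md5sum | cut -c1-10)
echo "replacement token: ${token}"
```

Each run yields a different 10-character hex string, so the overwritten steward usernames, passwords, and contact details carry no recoverable information.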

archives/dbsetup.sql

Lines changed: 1 addition & 3 deletions
@@ -1,7 +1,5 @@
 DROP DATABASE IF EXISTS neotoma;
-CREATE DATABASE neotoma;
 CREATE EXTENSION IF NOT EXISTS postgis;
 CREATE EXTENSION IF NOT EXISTS pg_trgm;
-CREATE EXTENSION IF NOT EXISTS vector;
 CREATE EXTENSION IF NOT EXISTS intarray;
-CREATE EXTENSION IF NOT EXISTS unaccent;
+CREATE EXTENSION IF NOT EXISTS unaccent;
