Commit 0ffe3c0

Merge pull request #44 from bclasky1539/feature/dual-storage
feat(ingestion): implement dual-storage for NOAA weather data (v1.14.0-SNAPSHOT)
2 parents 2b2e116 + 861f656 commit 0ffe3c0

17 files changed, +2002 -332 lines changed

CHANGELOG.md

Lines changed: 110 additions & 0 deletions
@@ -7,6 +7,116 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Version 1.14.0-SNAPSHOT - February 03, 2026

#### Weather Ingestion Module - Dual Storage Implementation for NOAA Data

**Added:**
- **Dual Storage System** - Simultaneous storage of raw text and JSON formats
  - Raw text files stored in `raw-data/{source}/{type}/{year}/{month}/{day}/` structure
  - JSON files stored in `speed-layer/{source}/{type}/{year}/{month}/{day}/` structure
  - Consistent date partitioning across both storage types
  - File naming: `{station}_{timestamp}.{ext}` where timestamp is `yyyyMMdd_HHmm`

- **S3UploadService Enhancements** (weather-ingestion)
  - `uploadWeatherDataDual()` - Recommended method for NOAA data ingestion with dual storage
  - `uploadRawDataWithPartitioning()` - Enhanced raw data upload with date partitioning
  - `DualStorageResult` record - Immutable result containing both S3 keys (raw text + JSON)
    - Compact constructor with validation
    - Ensures both keys are non-null and non-empty
    - Accessor methods: `rawTextKey()`, `jsonKey()`
  - Enhanced metadata tagging for both raw text and JSON uploads
  - Comprehensive parameter validation in all upload methods
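
For readers unfamiliar with Java records, the `DualStorageResult` entry above could look roughly like the following. This is a minimal sketch, not the project's source: the type is named `DualStorageResultSketch` to make that clear, and the exception types and messages are assumptions. Only the two accessor names and the non-null, non-empty rule come from the changelog.

```java
import java.util.Objects;

/**
 * Illustrative sketch of a record holding both S3 keys.
 * The name, exception types, and messages are assumptions.
 */
record DualStorageResultSketch(String rawTextKey, String jsonKey) {

    // Compact constructor: runs before the fields are assigned.
    DualStorageResultSketch {
        Objects.requireNonNull(rawTextKey, "rawTextKey must not be null");
        Objects.requireNonNull(jsonKey, "jsonKey must not be null");
        if (rawTextKey.isEmpty() || jsonKey.isEmpty()) {
            throw new IllegalArgumentException("S3 keys must not be empty");
        }
    }
}
```

A caller would read the keys through the generated accessors, for example `result.rawTextKey()` and `result.jsonKey()`. Because the validation lives in the compact constructor, an instance with a missing key cannot be created.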

- **Enhanced Metadata Tracking** (weather-ingestion)
  - `s3_raw_key` - S3 key for raw text file location
  - `s3_json_key` - S3 key for JSON file location
  - `s3_key` - Legacy field maintained for backward compatibility (points to JSON)
  - `storage_format` - Set to "dual" to indicate both formats stored
  - `processor_version` - Updated to "2.1"

- **Documentation** (weather-ingestion)
  - `S3_BUCKET_SETUP.md` - Comprehensive S3 bucket configuration guide
    - AWS CLI and Console setup instructions
    - Lifecycle policies for cost optimization
    - Bucket structure and partitioning examples
    - Security best practices
    - Troubleshooting guide
  - `SINGLE_STATION_INTEGRATION_TEST.md` - Step-by-step integration testing procedures
    - Pre-flight checklist
    - Test execution instructions
    - Validation commands
    - Success criteria
    - Troubleshooting scenarios

**Changed:**
- **S3UploadService** (weather-ingestion)
  - `uploadWeatherDataDual()` is now the recommended method for NOAA data ingestion
  - Partitioning structure is now identical for the `raw-data/` and `speed-layer/` paths
  - Improved metadata tagging with source, station-id, data-type, and ingestion-time
  - Added comprehensive validation in `uploadRawDataWithPartitioning()` for all parameters
  - Updated S3 content types: `text/plain` for raw text, `application/json` for JSON

- **SpeedLayerProcessor** (weather-ingestion)
  - Updated to use dual storage by default via `uploadWeatherDataDual()`
  - Processor version incremented from "2.0" to "2.1"
  - Enhanced metadata enrichment with `storage_format` field
  - Both S3 keys now stored in processed WeatherData metadata
  - Updated statistics output to indicate dual storage enabled
  - Improved logging with both raw and JSON file paths

- **S3 Bucket Structure** (weather-ingestion)
  - Standardized date partitioning: `{year}/{month}/{day}/` for both storage types
  - File naming convention: `{station}_{timestamp}.{ext}`
  - Timestamp format: `yyyyMMdd_HHmm` (UTC timezone)
  - Consistent metadata across both raw and JSON uploads
  - Example raw path: `raw-data/noaa/metar/2026/02/03/KCLT_20260203_1430.txt`
  - Example JSON path: `speed-layer/noaa/metar/2026/02/03/KCLT_20260203_1430.json`
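
The partitioning and naming rules above can be sketched as a small key builder. `WeatherKeyBuilder` and `buildKey` are hypothetical names used only for illustration and are not part of `S3UploadService`. The sketch assumes the key is derived from a single ingestion `Instant` formatted in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that assembles S3 keys following the documented layout. */
final class WeatherKeyBuilder {

    // {year}/{month}/{day} partition, always rendered in UTC.
    private static final DateTimeFormatter PARTITION =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // File timestamp in the documented yyyyMMdd_HHmm format, also UTC.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmm").withZone(ZoneOffset.UTC);

    private WeatherKeyBuilder() {}

    /** Builds {prefix}/{source}/{type}/{year}/{month}/{day}/{station}_{timestamp}.{ext}. */
    static String buildKey(String prefix, String source, String type,
                           String station, Instant ingestionTime, String ext) {
        return String.format("%s/%s/%s/%s/%s_%s.%s",
                prefix, source, type,
                PARTITION.format(ingestionTime),
                station, STAMP.format(ingestionTime), ext);
    }
}
```

Calling `buildKey("raw-data", "noaa", "metar", "KCLT", Instant.parse("2026-02-03T14:30:00Z"), "txt")` yields the example raw path listed above. Using the same `Instant` with the `speed-layer` prefix and `json` extension yields the matching JSON path, which keeps the two keys aligned.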

**Technical Details:**
- **Dual Storage Benefits:**
  - Raw text enables long-term archival and reprocessing
  - JSON enables fast querying and analysis
  - Both formats are stored together by a single `uploadWeatherDataDual()` call
  - Date partitioning optimizes query performance and cost

- **File Format Specifications:**
  - Raw text files: `.txt` extension with `text/plain` content type
  - JSON files: `.json` extension with `application/json` content type
  - Both include comprehensive S3 metadata for tracking and filtering

- **Time Handling:**
  - All timestamps in UTC for consistency
  - Date partitioning uses ingestion time (not observation time)
  - Handles month and year boundary transitions correctly

- **Backward Compatibility:**
  - Existing single-storage deployments continue to work
  - Legacy `s3_key` metadata field maintained (points to JSON)
  - New deployments should use `uploadWeatherDataDual()` method
  - Graceful handling of missing dual storage fields
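
One way a consumer might handle records written before dual storage is to fall back from `s3_json_key` to the legacy `s3_key`. The sketch below assumes the metadata is available as a `Map<String, String>`. That representation and the `StorageMetadataReader` class are illustrative assumptions; only the field names and the "dual" value come from the changelog.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical reader for the storage-related metadata fields. */
final class StorageMetadataReader {

    private StorageMetadataReader() {}

    /** Returns true when the record declares that both formats were stored. */
    static boolean isDualStorage(Map<String, String> metadata) {
        return "dual".equals(metadata.get("storage_format"));
    }

    /** Prefers s3_json_key; falls back to the legacy s3_key when it is absent or empty. */
    static Optional<String> jsonKey(Map<String, String> metadata) {
        String key = metadata.get("s3_json_key");
        if (key == null || key.isEmpty()) {
            key = metadata.get("s3_key");
        }
        return Optional.ofNullable(key).filter(k -> !k.isEmpty());
    }

    /** The raw text key exists only for dual-storage records. */
    static Optional<String> rawKey(Map<String, String> metadata) {
        return Optional.ofNullable(metadata.get("s3_raw_key")).filter(k -> !k.isEmpty());
    }
}
```

With this shape, a legacy record that carries only `s3_key` still resolves a JSON location, while `rawKey` returns an empty `Optional` instead of failing.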

**Migration Notes:**
- New deployments should use `uploadWeatherDataDual()` for NOAA data
- Existing code using `uploadWeatherData()` (JSON-only) continues to work
- Legacy `s3_key` field maintained for backward compatibility
- Update lifecycle policies to handle both `raw-data/` and `speed-layer/` prefixes
- Recommended lifecycle:
  - Speed layer JSON: Delete after 30 days (recent data only)
  - Raw data text: Archive to Glacier after 90 days (long-term storage)

**Build & Quality:**
- All existing tests passing (0 failures, 0 errors)
- No breaking changes to public APIs
- Requires Java 16+ for record types (`DualStorageResult`)
- AWS SDK S3 client configuration unchanged

**Notes:**
- Dual storage implementation complete and production-ready
- Comprehensive documentation enables smooth deployment
- Integration test guide validates end-to-end functionality
- Ready for production deployment with monitoring and lifecycle policies

### Version 1.13.0-SNAPSHOT - January 28, 2026

#### Weather Storage Module - Phase 4 GSI Implementation & DynamoDB Integration Testing

README.md

Lines changed: 14 additions & 0 deletions
@@ -130,6 +130,20 @@ mvn clean verify sonar:sonar \
  - AWS credentials file configuration
  - Security best practices and troubleshooting

- **[S3 Bucket Setup](https://github.com/bclasky1539/noakweather-engineering-pipeline/blob/main/docs/S3_BUCKET_SETUP.md)** - Comprehensive guide for configuring S3 buckets for dual-storage weather data
  - AWS CLI and Console bucket creation
  - Lifecycle policies for cost optimization (30-day retention, Glacier archival)
  - Bucket structure and date partitioning examples
  - Security best practices (encryption, public access blocking)
  - Environment variable configuration and troubleshooting

- **[Single Station Integration Test](https://github.com/bclasky1539/noakweather-engineering-pipeline/blob/main/docs/SINGLE_STATION_INTEGRATION_TEST.md)** - Step-by-step guide for testing dual-storage NOAA data ingestion
  - Pre-flight checklist (AWS credentials, S3 access, Maven build)
  - Test execution for KCLT (Charlotte Douglas International)
  - Validation commands for raw text and JSON files
  - Success criteria and verification steps
  - Troubleshooting common issues

- **[Logging Configuration Setup](https://github.com/bclasky1539/noakweather-engineering-pipeline/blob/main/docs/LOGGING_SETUP.md)** - Centralized logging configuration for multi-module projects
  - Log4j2 master configuration
  - Maven resources plugin setup
