Author: MonkDB Engineering
Last Updated: 2025-06-13
We constructed a large-scale synthetic geospatial tile index using real metadata extracted from the Sentinel-2 MSI L2A product specifically from a representative granule with tile ID 30UVA and orbit details like:
Field | Value |
---|---|
Absolute orbit number | 4011 |
Beginning date time | 2025-06-12T11:21:31.025000Z |
Cloud cover | 97.414082 |
Datastrip id | S2C_OPER_MSI_L2A_DS_2CPS_20250612T150504_S20250612T112401_N05.11 |
Ending date time | 2025-06-12T11:21:31.025000Z |
Granule identifier | S2C_OPER_MSI_L2A_TL_2CPS_20250612T150504_A004011_T30UVA_N05.11 |
Modification date | 2025-06-12T16:09:26.704058Z |
Operational mode | INS-NOBS |
Origin | ESA |
Origin date | 2025-06-12T15:57:26.000000Z |
Processing date | 2025-06-12T15:05:04.000000Z |
Processing level | S2MSI2A |
Processor version | 05.11 |
Product group id | GS2C_20250612T112131_004011_N05.11 |
Product type | S2MSI2A |
Publication date | 2025-06-12T16:09:26.704058Z |
Relative orbit number | 37 |
S3Path | /eodata/Sentinel-2/MSI/L2A/2025/06/12/S2C_MSIL2A_20250612T112131_N0511_R037_T30UVA_20250612T150504.SAFE |
Source product | S2C_OPER_MSI_L2A_TL_2CPS_20250612T150504_A004011_T30UVA_N05.11, S2C_OPER_MSI_L2A_DS_2CPS_20250612T150504_S20250612T112401_N05.11 |
Source product origin date | 2025-06-12T15:57:26Z, 2025-06-12T15:55:59Z |
Tile id | 30UVA |
This seed tile was accompanied by associated raster *.tif
paths and footprint polygons.
Sentinel-2 L2A products nominally contain 13 spectral bands:
- 10m resolution: B02 (Blue), B03 (Green), B04 (Red), B08 (NIR)
- 20m resolution: B05, B06, B07, B8A, B11, B12
- 60m resolution: B01, B09, B10
Note: For most official ESA L2A products, B10 (Cirrus) is not included in the output because it does not contain surface information or is not atmospherically corrected. The folder structure may reference B10, but the actual image is typically absent.[1][2][6]
Sen2Cor (the processor for L2A) generates several additional raster layers per tile:
- AOT (Aerosol Optical Thickness)
- WVP (Water Vapour)
- SCL (Scene Classification Layer)
- SNW (Snow Probability)
- CLD (Cloud Probability)
- CLP (Cloud Probability from s2cloudless, if enabled)
- CLM (Cloud Mask)
- dataMask
- Sun and View Angles (per-pixel or per-tile, sometimes as separate files)
โน๏ธ Angle layers are typically stored in the QI_DATA/ subdirectory and may be tile-wide or pixel-wise, depending on the processing configuration. While not always required for general ingestion, they are critical for precise reflectance modeling or BRDF correction.
- Spectral bands: 12 (B01โB09, B11, B12; B10 usually excluded)
- Auxiliary layers: 7 or more, depending on processor version and configuration
So, a typical L2A tile directory contains:
- 12 spectral band images (B01โB12, usually excluding B10)
- 7+ auxiliary raster layers
Total: Usually 19โ21 raster layers per tile
Layer Type | Example Names | Count |
---|---|---|
Spectral Bands | B01โB12 (B10 usually excluded in L2A) | 12 |
Auxiliary Layers | AOT, WVP, SCL, SNW, CLD, CLP, CLM, dataMask, angles | 7+ |
Total | 19โ21+ |
Since the original dataset contained only 36 tile footprints, we amplified it to 100,000 entries for scalability and stress testing purposes. The process involved:
- Spatial translations and perturbations of original WKT polygons.
- Systematic duplication across varying
layer
,resolution
, andpath
values. - Generation of realistic metadata fields such as
centroid
,area_km
, and derivedgeohash3
Each row in the resulting dataset simulates a unique raster tile footprint suitable for spatial querying, aggregation, and indexing.
- Real-world provenance (based on actual Sentinel-2 tile metadata)
- Highly scalable: 100K records mimic high-throughput satellite ingestion pipelines
- Rich geospatial structure:
area
: WKTGEO_SHAPE
(polygon)centroid
: derivedGEO_POINT
geohash3
: spatial region hashing for clustering and diversity checks
- Used for:
- intersects, within, distance, and area benchmarks
- Advanced queries like percentile distributions, bounding box coverage, resolution-layer analysis, and geohash diversity
- Evaluating MonkDB's performance for Earth observation-style workloads
Tool | Purpose |
---|---|
GDAL |
To extract tile metadata and geometries |
Shapely |
Geometry parsing and validation |
PyProj |
Accurate area calculation on curved surfaces |
MonkDB |
AI-Native database for geospatial storage |
Dask |
Parallel processing of large sets |
Python3 |
Scripting, data transformation, ETL |
You need to create a config ini file at the root with the below structure. Replace tile_dir
paths according to your system's directory layout. Also replace DB_HOST
value with the IP address of MonkDB.
[paths]
output_csv_v3 = sentinel_v3_tile_index.csv # this is the file name (where the indexed tiles would be written to)
[sentinel]
sentinel_data_dir_v2 = /home/ubuntu/v3_geo # this is where the derived *.tiff are stored.
[metadata]
export_format = csv
[database]
DB_HOST = xx.xx.xxx.xxx
DB_PORT = 4200
DB_USER = testuser
DB_PASSWORD = testpassword
DB_SCHEMA = monkdb
RASTER_GEO_SHAPE_TABLE_V2 = sentinel
The data from Sentinel Hub is open-source and typically provided as a .SAFE.zip
archive.
Once extracted, the archive contains a structured directory hierarchy. To locate the imagery files, navigate to:
GRANULE/L<...>{TILE_ID}{TIMESTAMP}/IMG_DATA
Inside the IMG_DATA
folder, you will find multiple .jp2
(JPEG2000) files representing different spectral bands and resolutions.
To convert .jp2
files to .tif
format, use the provided to_tiff.sh
script:
chmod +x to_tiff.sh
./to_tiff.sh <SOURCE_IMG_DATA_FOLDER> <DESTINATION_FOLDER_FOR_TIFFS>
This will recursively process all .jp2
files in the source folder and save the converted .tif
files in the specified destination directory.
tif
files need to be validated to ensure the presence of spatial reference system (CRS). To validate run the below command.
We used:
gdalinfo -json /path/to/file_name.tif
to extract:
- Bounding box
- Coordinate reference system (CRS)
- Tile name and path
- Layer metadata (e.g., 'hypso_relief')
This metadata is used to compute:
- Polygon for the bounding box (
GEO_SHAPE
) - Centroid of the tile (
GEO_POINT
) - Area using
pyproj.Geod
for Earth-accurate results
MonkDB requires all
GEO_SHAPE
column entries to be spatially referenced.
- Reads the metadata from pre-tiled
.tif
files using a CSV. - Computes and adjusts:
- Bounding polygon (as
GEO_SHAPE
) - Centroid (as
GEO_POINT
) - Area (via geodesic area in kmยฒ using WGS84 ellipsoid)
- Bounding polygon (as
- Stores all metadata into a
raster_geo_shapes
table in MonkDB.
Includes safety checks (e.g. bounding limits -180/180, -90/90), duplicate handling, and invalid polygon skips.
Feature | Description |
---|---|
Centroid | Computed from polygon center using shapely |
Area (kmยฒ) | Computed using geodesic area on ellipsoid using pyproj.Geod.geometry_area_perimeter() |
Safe Polygon | Ensures lat/lon does not exceed [-180, 180] and [-90, 90] |
Geohash | Encodes centroid into geohash regions for grouping |
The queries are designed for both data integrity and spatial intelligence:
# | Query Name | SQL Snippet | Purpose / Use Case |
---|---|---|---|
1 | Centroids within Bounding Box | WHERE within(centroid, 'POLYGON ((100 -10, 120 -10, 120 10, 100 10, 100 -10))' ) |
Extract tiles in Southeast Asia for regional spatial analytics. |
2 | Tiles with Zero/Near-Zero Area | WHERE area_km < 0.01 |
Flags corrupt raster tiles or invalid polygons with bad area values. |
3 | Top 10 Largest Tiles | ORDER BY area_km DESC LIMIT 10 |
Identify top-coverage tiles, helpful in understanding dominant geographies. |
4 | Layer-based Filtering | WHERE layer = 'hypso_relief' |
Isolates thematic content within specific layers (e.g., elevation, water, slope). |
5 | Centroids within Radius | WHERE distance(centroid, [85.0, 20.0]) < 1000000 |
Fetch tiles within 1000km of a reference point (e.g., disaster/event zone). |
6 | Geohash Grouping | GROUP BY substr(geohash(centroid), 1, 3) |
Spatial binning for regional heatmaps and distributed ingestion partitioning. |
7 | Southern & Eastern Hemisphere | WHERE latitude(centroid) < 0 AND longitude(centroid) > 0 |
Filters tiles falling in the southeastern quadrant (e.g., Oceania, Indonesia). |
8 | Total Area Coverage | SELECT SUM(area_km) FROM ... |
Measures total geospatial coverage (in kmยฒ) across all tiles. |
# | Query Name | Purpose | SQL Snippet |
---|---|---|---|
1 | Tiles with Multiple Layer Versions | Detects duplicate tile_ids across layers (e.g., tile123__hypso, tile123__slope) | SELECT tile_id, COUNT(*) AS layer_versions FROM schema.table GROUP BY tile_id HAVING COUNT(*) > 1; |
2 | Compare Area Across Layers for Same Tile ID | Cross-checks area_km across versions of the same tile across different layers | SELECT t.tile_id, t.layer, t.area_km FROM schema.table t JOIN (SELECT tile_id FROM schema.table GROUP BY tile_id HAVING COUNT(*) > 1) d ON t.tile_id = d.tile_id; |
3 | Tile Count by Layer (Distribution) | Shows how many tiles exist in each raster layer (e.g., land cover, slope, water bodies) | SELECT layer, COUNT(*) AS tile_count FROM schema.table GROUP BY layer ORDER BY tile_count DESC; |
4 | Average Tile Area per Layer | Highlights granularity differences between low-res and high-res layers | SELECT layer, ROUND(AVG(area_km), 2) AS avg_area_km FROM schema.table GROUP BY layer ORDER BY avg_area_km DESC; |
5 | Top 5 Tiles by Area per Layer | Finds the largest tiles in each layer for QA or tile curation | SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY layer ORDER BY area_km DESC) AS rank FROM schema.table ) t WHERE rank <= 5; |
6 | Tiles Missing Area Data (NULL area_km) | Ensures all tiles have proper geodesic area populated | SELECT tile_id, layer FROM schema.table WHERE area_km IS NULL; |
7 | Resolution Distribution by Layer | Groups tiles by resolution level (e.g., "high", "low") within each layer | SELECT layer, resolution, COUNT(*) AS count FROM schema.table GROUP BY layer, resolution ORDER BY layer, count DESC; |
8 | Average Distance to Reference Point per Layer | Measures how far on average each layerโs tiles are from a focal point (e.g., disaster zone) | SELECT layer, ROUND(AVG(distance(centroid, [85.0, 20.0])), 2) AS avg_dist_m FROM schema.table GROUP BY layer ORDER BY avg_dist_m; |
Replace
schema.table
with{DB_SCHEMA}.{RASTER_TABLE}
.
# | Query/Output Name | Purpose | SQL Snippet / Method | Benefits | AI Insight Use Case |
---|---|---|---|---|---|
1 | Layer Statistics Summary | Gives min, max, mean, stddev, and count of tile areas per layer | SELECT layer, COUNT(*), MIN(area_km), MAX(area_km), ROUND(AVG(area_km),2), ROUND(STDDEV(area_km),2) FROM schema.table GROUP BY layer; |
Identifies data anomalies, standardizes expected tile sizes | Detect layers with abnormal fragmentation; guide AI models to handle edge-case tiles |
2 | Layer Percentile Distribution | Understands spread of tile area across 25th, 50th, 75th, and 95th percentile | SELECT layer, percentile(area_km, 0.25), 0.5, 0.75, 0.95 FROM schema.table GROUP BY layer; |
Helps define thresholds for area-based clustering or pruning | Feed percentile curves into spatial anomaly detection or adaptive tiling AI models |
3 | WKT-based Intersection Query | Finds tiles intersecting a target WKT polygon (e.g., state boundary) | SELECT tile_id, layer, area_km, centroid FROM schema.table WHERE intersects(area, ?) ORDER BY area_km DESC LIMIT 100; |
Quick lookup for all coverage tiles of a region | Enable AI to infer content availability, perform boundary-aware inference (e.g., floods in Telangana) |
4 | Boundary Union + Bounding Box (Client) | Computes bounding geometry of entire tile set | unary_union + .bounds in Python |
Geo-referencing all tiles as one boundary block | Train AI to operate over the entire coverage zone; use unified WKT as input for global pattern learning |
In this solution, we pass SQL commands to MonkDB and leverage the TinyLlama model to perform summarization, extract key insights, and more. The stack utilizes several key tools to enable seamless interaction between natural language, databases, and advanced NLP techniques.
Tool | Description |
---|---|
mcp-monkdb server | AI gateway that bridges MonkDB with large language models, enabling SQL command execution and response handling. |
TinyLlama | Lightweight LLM (1.1B parameters) fine-tuned for tasks such as text-to-SQL, summarization, and extraction of insights from database responses. Efficient for environments with limited resources, but can be swapped for larger models for improved output quality[2][4]. |
Pandas | Used to structure and format the responses retrieved from MonkDB collections for downstream processing and display. |
HF transformers pipeline | Provides advanced NLP capabilities and model inference, supporting a wide range of language processing tasks. |
Below is a short demonstration of the workflow:
[!CAUTION]
MonkDB's MCP server requires Python >=3.13. Run only
SELECT *
statements. Other SQL statements won't work with our MCP as it is not recommended. For more information, please go through this GitHub page.
[!NOTE]
TinyLlama is a compact model (1.1B parameters) designed for efficiency. For higher quality outputs, consider replacing TinyLlama with larger models such as OpenAI, Anthropic, Grok, Gemini, Llama, or others as needed.
-
Final output CSV contains:
tile_id
area
(asGEO_SHAPE
)centroid
(asGEO_POINT
)area_km
(float)layer
path
-
Outputs have been captured in the
results/
folder for core, advanced and spatial insight queries. -
The column names in spatial intelligence query results which were captured in the csv are:
layer_statistics
layer
,tile_count
,min_area
,max_area
,mean_area
,stddev_area
.
layer_percentiles
layer
,p25
,median
,p75
,p95
wkt_intersection_results
tile_id
,layer
,area_km
,centroid
.
- Migrated to Sentinel-2 Level-2A imagery, using metadata-rich
.SAFE
archives from Copernicus Open Access Hub - Generated 100,000+ synthetic raster tile entries from an original sample of 36 real Sentinel tiles, ensuring varied spatial coverage and realistic duplication for benchmarking
- Resolution values (e.g.,
10m
,20m
,60m
) were extracted directly from the dataset structure, ensuring fidelity to Sentinel band characteristics - Bounding boxes and centroids were computed accurately using
WGS84
ellipsoid geometry for each tile polygon, and all values were normalized toGEO_SHAPE
andGEO_POINT
- Calculated
area_km
per tile with high precision; uniformity across layers reflects consistent tiling and reprojection logic - Introduced geohash3 regions to enable proximity and clustering analysis in downstream geospatial queries
- All ingestion steps included validation and fault tolerance โ only 5โ10 entries had invalid or null geometries and were gracefully skipped
- End-to-end queries such as percentile distribution, bounding box unions, region diversity, and intersection tests were executed on MonkDB with sub-second response times
- Results show MonkDBโs spatial indexing, layer-wise analytics, and vector-tile handling are production-ready, even under synthetic scale and multiplexed conditions.
Why MonkDB is relevant for these kind of advanced use cases vis-a-vis PostgreSQL + PostGIS has been explained in this document.
The updated results are stored in
results/v3
directory.
- Edge-aware tile indexing for 5G tower planning or multi-resolution coverage heatmaps (e.g., high-res elevation + low-res land use).
- Disaster zone geofencing using
within()
+ radius search on multi-layered raster datasets (e.g., flood risk + slope map + urban density). - Terrain & surface analytics for smart cities, drone/autonomous routingโpowered by layer-aware queries.
- Resolution multiplexing enables context-sensitive insights at different zoom levels (e.g., fine-grained slope maps vs coarse vegetation layers).
- Dynamic data partitioning via geohash for efficient spatial sharding and parallel ingest, scalable to national and continental datasets.
- Enable streaming tile insertions via MonkDBโs ingestion APIs
- Explore real-time raster transformations and AI model overlays