docs/notebooks/its-live.ipynb: 209 additions & 0 deletions
@@ -441,6 +441,215 @@
"data_frame = GeoDataFrame.from_arrow(table)\n",
"data_frame.plot()"
]
},
{
"cell_type": "markdown",
"id": "aff144a0",
"metadata": {},
"source": [
"## Performance\n",
"\n",
"Let's do a small investigation into the performance characteristics of the two (partitioned, non-partitioned) datasets.\n",
"We've uploaded them to the bucket `stac-fastapi-geoparquet-labs-375`, which is public via [requester pays](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html).\n",
"In all these examples, we've limited the returned item count to `10`."
]
},
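{
"cell_type": "markdown",
"id": "b3f9a1c2",
"metadata": {},
"source": [
"Before timing any searches, here's a quick look at how the two datasets are laid out on S3.\n",
"This is a hypothetical sketch rather than part of the original benchmark: it assumes [s3fs](https://s3fs.readthedocs.io/) is installed and AWS credentials are configured, and that the partitioned dataset uses hive-style `year=` prefixes (as the `year` filter later in this notebook suggests).\n",
"Because the bucket is requester pays, the listing requests are billed to the caller."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4e8d2b1",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: compare file counts and total sizes of the two datasets.\n",
"# Assumes s3fs is installed and AWS credentials are available; the bucket is\n",
"# requester pays, so these listing calls are billed to you.\n",
"import s3fs\n",
"\n",
"fs = s3fs.S3FileSystem(requester_pays=True)\n",
"for prefix in [\"its-live\", \"its-live-partitioned\"]:\n",
"    files = fs.glob(f\"stac-fastapi-geoparquet-labs-375/{prefix}/**/*.parquet\")\n",
"    total_bytes = sum(fs.size(path) for path in files)\n",
"    print(f\"{prefix}: {len(files)} files, {total_bytes / 1e9:.2f} GB\")"
]
},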
{
"cell_type": "code",
"execution_count": 1,
"id": "e6da363e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"building \"rustac\"\n",
"rebuilt and loaded package \"rustac\" in 8.977s\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Getting the first ten items\n",
"Got 10 items from the non-partitioned dataset in 7.33 seconds\n",
"Got 10 items from the partitioned dataset in 1.34 seconds\n"
]
}
],
"source": [
"import time\n",
"from rustac import DuckdbClient\n",
"\n",
"client = DuckdbClient()\n",
"\n",
"href = \"s3://stac-fastapi-geoparquet-labs-375/its-live/**/*.parquet\"\n",
"href_partitioned = (\n",
" \"s3://stac-fastapi-geoparquet-labs-375/its-live-partitioned/**/*.parquet\"\n",
")\n",
"\n",
"print(\"Getting the first ten items\")\n",
"start = time.time()\n",
"items = client.search(href, limit=10)\n",
"print(\n",
" f\"Got {len(items)} items from the non-partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")\n",
"\n",
"start = time.time()\n",
"items = client.search(href_partitioned, limit=10)\n",
"print(\n",
" f\"Got {len(items)} items from the partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4e631b6d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching by year\n",
"Got 10 items from 2024 from the non-partitioned dataset in 19.33 seconds\n",
"Got 10 items from 2024 the partitioned dataset in 62.54 seconds\n"
]
}
],
"source": [
"print(\"Searching by year\")\n",
"start = time.time()\n",
"items = client.search(\n",
" href, limit=10, datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\"\n",
")\n",
"print(\n",
" f\"Got {len(items)} items from 2024 from the non-partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")\n",
"\n",
"start = time.time()\n",
"items = client.search(\n",
" href_partitioned, limit=10, datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\"\n",
")\n",
"print(\n",
" f\"Got {len(items)} items from 2024 the partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e0d8965a",
"metadata": {},
"source": [
"The non-partitioned dataset has much smaller files, so the search for the first ten items in 2024 didn't take as long because it didn't have to read in large datasets across the network.\n",
"Let's use the `year` partitioning filter to speed things up."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "28b83009",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Got 10 items from 2024 the partitioned dataset, using `year`, in 1.09 seconds\n"
]
}
],
"source": [
"start = time.time()\n",
"items = client.search(\n",
" href_partitioned,\n",
" limit=10,\n",
" datetime=\"2024-01-01T00:00:00Z/2024-12-31T23:59:59Z\",\n",
" filter=\"year=2024\",\n",
")\n",
"print(\n",
" f\"Got {len(items)} items from 2024 the partitioned dataset, using `year`, in {time.time() - start:.2f} seconds\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e54bdca1",
"metadata": {},
"source": [
"Much better.\n",
"Now let's try a spatial search.\n",
"During local testing, we determined that it wasn't even worth it to try against the non-partitioned dataset, as it takes too long."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a9fad4df",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Got 10 items over Helheim Glacier from the partitioned dataset in 9.33 seconds\n"
]
}
],
"source": [
"helheim = {\"type\": \"Point\", \"coordinates\": [-38.2, 66.65]}\n",
"\n",
"start = time.time()\n",
"items = client.search(href_partitioned, limit=10, intersects=helheim)\n",
"print(\n",
" f\"Got {len(items)} items over Helheim Glacier from the partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")"
]
},
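{
"cell_type": "markdown",
"id": "d5f0a3c2",
"metadata": {},
"source": [
"If you'd like to reproduce the slow non-partitioned spatial search yourself, the call is the same as above, just pointed at the non-partitioned href.\n",
"This sketch is left unexecuted here; expect it to take much longer than the partitioned search."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6a1b4d3",
"metadata": {},
"outputs": [],
"source": [
"# Unexecuted sketch: the same spatial search against the non-partitioned dataset.\n",
"# During local testing this took long enough that we didn't include a timing here.\n",
"start = time.time()\n",
"items = client.search(href, limit=10, intersects=helheim)\n",
"print(\n",
"    f\"Got {len(items)} items over Helheim Glacier from the non-partitioned dataset in {time.time() - start:.2f} seconds\"\n",
")"
]
},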
{
"cell_type": "markdown",
"id": "34cf6b59",
"metadata": {},
"source": [
"For experimentation, we've also got a [stac-fastapi-geoparquet](https://github.com/stac-utils/stac-fastapi-geoparquet/) server pointing to the same partitioned dataset.\n",
"Since spatial queries take a lot of data transfer from the DuckDB client to blob storage, is it any faster to query using the **stac-fastapi-geoparquet** lambda?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "000e1cd9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Got 10 items over Helheim Glacier from the stac-fastapi-geoparquet server in 2.25 seconds\n"
]
}
],
"source": [
"import rustac\n",
"import requests\n",
"\n",
"# Make sure the lambda is started\n",
"response = requests.get(\"https://stac-geoparquet.labs.eoapi.dev\")\n",
"response.raise_for_status()\n",
"\n",
"start = time.time()\n",
"items = await rustac.search(\n",
" \"https://stac-geoparquet.labs.eoapi.dev\",\n",
" collections=[\"its-live-partitioned\"],\n",
" intersects=helheim,\n",
" max_items=10,\n",
")\n",
"print(\n",
" f\"Got {len(items)} items over Helheim Glacier from the stac-fastapi-geoparquet server in {time.time() - start:.2f} seconds\"\n",
")"
]
}
],
"metadata": {