Commit ff158fa

Merge pull request #41 from LandscapeGeoinformatics/support_queries_with_coarser_refinement_level
Zone query / Zone data retrieval with coarser refinement level than collections
2 parents d632730 + c08e029

29 files changed (+483 additions, −434 deletions)

dggs_api_config_example.json

Lines changed: 32 additions & 35 deletions

@@ -4,24 +4,13 @@
     "suitability_hytruck": {
       "title": "Suitability Modelling for Hytruck",
       "description": "Desc",
-      "bonds": [5.86307954788208, 47.31793212890625, 31.61196517944336, 70.0753173828125],
+      "extent": {"spatial": {"bbox": [[5.86307954788208, 47.31793212890625, 31.61196517944336, 70.0753173828125]]}},
       "collection_provider": {
         "providerId": "clickhouse",
         "dggrsId": "igeo7",
-        "maxzonelevel": 9,
-        "getdata_params": {
-          "table": "<table name>",
-          "zoneId_cols": {
-            "9": "column of refinement level 9",
-            "8": "column of refinement level 8",
-            "7": "column of refinement level 7",
-            "6": "column of refinement level 6",
-            "5": "column of refinement level 5"
-          },
-          "data_cols": [
-            "data column names"
-          ]
-        }
+        "min_refinement_level": 5,
+        "max_refinement_level": 9,
+        "datasource_id": "hytruck_clickhouse"
       }
     }
   },
@@ -32,10 +21,9 @@
       "collection_provider": {
         "providerId": "zarr",
         "dggrsId": "igeo7",
-        "maxzonelevel": 9,
-        "getdata_params": {
-          "datasource_id": "zarr_hytruck"
-        }
+        "min_refinement_level": 5,
+        "max_refinement_level": 8,
+        "datasource_id": "zarr_hytruck"
       }
     }
   },
@@ -46,10 +34,9 @@
       "collection_provider": {
         "providerId": "parquet",
         "dggrsId": "igeo7",
-        "maxzonelevel": 9,
-        "getdata_params": {
-          "datasource_id": "hytruck"
-        }
+        "min_refinement_level": 5,
+        "max_refinement_level": 7,
+        "datasource_id": "hytruck"
       }
     }
   }
@@ -82,36 +69,47 @@
   "1": {
     "clickhouse": {
       "classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
-      "initial_params": {
-        "host": "127.0.0.1",
-        "user": null,
-        "password": null,
-        "port": 9000,
-        "database": "DevelopmentTesting"
-      }
+      "datasources": {
+        "connection": {
+          "host": "127.0.0.1",
+          "user": null,
+          "password": null,
+          "port": 9000,
+          "database": "DevelopmentTesting"
+        },
+        "hytruck_clickhouse": {
+          "table": "<table name>",
+          "zoneId_cols": {
+            "9": "column of refinement level 9",
+            "8": "column of refinement level 8",
+            "7": "column of refinement level 7",
+            "6": "column of refinement level 6",
+            "5": "column of refinement level 5"
+          },
+          "data_cols": [
+            "data column names"
+          ]
+        }
+      }
     }
   },
   "2": {
     "zarr": {
       "classname": "zarr_collection_provider.ZarrCollectionProvider",
-      "initial_params": {
       "datasources": {
         "zarr_hytruck": {
           "filepath": "./aggregated_tree.zarr",
-          "zones_grps": {
+          "zone_groups": {
            "4": "res4",
            "5": "res5",
            "6": "res6"
          }
        }
      }
    }
-    }
  },
  "3": {
    "parquet": {
      "classname": "parquet_collection_provider.ParquetCollectionProvider",
-      "initial_params": {
      "datasources": {
        "hytruck": {
          "filepath": "<local file path or path of a cloud bucket>",
@@ -121,6 +119,5 @@
        }
      }
    }
-    }
  }
}
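
Under the new layout, a collection declares its own ``min_refinement_level``/``max_refinement_level`` and points at a provider datasource via ``datasource_id`` instead of carrying ``getdata_params``. A minimal sketch of validating one collection entry against this shape (the ``check_collection`` helper is hypothetical, not part of the repository):

```python
def check_collection(collection: dict) -> None:
    """Sanity-check one collection entry against the new config layout (sketch)."""
    cp = collection["collection_provider"]
    # collections now carry their own refinement-level range ...
    lo, hi = cp["min_refinement_level"], cp["max_refinement_level"]
    if lo > hi:
        raise ValueError("min_refinement_level must not exceed max_refinement_level")
    # ... and reference the provider's datasource by ID, not via getdata_params
    if "datasource_id" not in cp or "getdata_params" in cp:
        raise ValueError("expected datasource_id instead of getdata_params")

example = {
    "title": "Suitability Modelling for Hytruck",
    "collection_provider": {
        "providerId": "clickhouse",
        "dggrsId": "igeo7",
        "min_refinement_level": 5,
        "max_refinement_level": 9,
        "datasource_id": "hytruck_clickhouse",
    },
}
check_collection(example)  # passes silently
```
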

docs/source/providers/collection_providers/implementations/clickhouse_collection_provider.rst

Lines changed: 43 additions & 58 deletions

@@ -2,76 +2,61 @@ Clickhouse Collection Provider
 ==============================
 The implementation uses `clickhouse_driver <https://clickhouse-driver.readthedocs.io/en/latest/>`_ to connect to the ClickHouse DB. The provider serves multiple tables on the same database, with each table acting as a data source. It creates an instance of ``clickhouse_driver.Client`` at initialisation and assigns it to ``self.db``; this reference is used in ``get_data`` for data queries.

+ClickhouseDatasourceInfo
+==============================
+- ``table``: A string naming the table to query.
+- ``aggregation``: A string indicating which aggregation function to use. Currently only ``mode`` is supported.
+
 Note on Clickhouse query
 -------------------------
 ClickHouse restricts the query size to 200 KB by default, controlled by the setting `max_query_size <https://clickhouse.com/docs/operations/settings/settings#max_query_size>`_. The default is too small when the number of zone IDs in a query is large: an IGEO7 zone ID at refinement level 8 consumes about 10 bytes in string format, so a single query is limited to roughly 20,000 zones before accounting for other overheads.


-Constructor parameters
-----------------------
+Class initialisation
+--------------------

-For ``initla_param`` uses in :ref:`collection_providers <collection_providers>`
+The Clickhouse provider needs an extra ``connection`` entry in ``datasources`` to define the DB connection:

-* ``host``
-* ``user``
-* ``password``
-* ``port``
-* ``compression (default: False)``
-* ``database (default: 'default')``
+.. code-block:: json
+
+    "connection": {
+        "host": "127.0.0.1",
+        "user": "default",
+        "password": "default",
+        "port": 9000,
+        "compression": false,
+        "database": "default"
+    }

 An example to define a Clickhouse collection provider:

 .. code-block:: json

-    "collection_providers": {"1":
-        {"clickhouse":
-            {"classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
-             "initial_params":
-                {"host": "127.0.0.1",
-                 "user": "user",
-                 "password": "password",
-                 "port": 9000,
-                 "database": "DevelopmentTesting"}
-            }
-        }
-    }
+    "collection_providers": {
+        "1": {
+            "clickhouse": {
+                "classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
+                "datasources": {
+                    "connection": {
+                        "host": "127.0.0.1",
+                        "user": "default",
+                        "password": "user",
+                        "port": 9000,
+                        "database": "DevelopmentTesting"
+                    },
+                    "hytruck_clickhouse": {
+                        "table": "testing_suitability_IGEO7",
+                        "zone_groups": {
+                            "9": "res_9_id",
+                            "8": "res_8_id",
+                            "7": "res_7_id",
+                            "6": "res_6_id",
+                            "5": "res_5_id"
+                        },
+                        "data_cols": ["data_1", "data_2"]
+                    }
+                }
+            }
+        }
+    }
-
-
-get_data parameters
-----------------------
-
-For ``getdata_params`` uses in :ref:`collections <collections>`
-
-* ``table`` : table's name
-* ``zoneId_cols`` : a dictionary that maps refinement levels to columns that store the corresponding zone ID.
-* ``data_cols`` : a list of column names to control which columns should be selected for data queries.
-* ``aggregation`` : default is 'mode'
-* ``max_query_size`` : to be implemented
-
-A collection example of using clickhouse collection provider:
-
-.. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck":
-            {"title": "Suitability Modelling for Hytruck",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "clickhouse",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params":
-                    { "table": "testing_suitability_IGEO7",
-                      "zoneId_cols": {"9":"res_9_id", "8":"res_8_id", "7":"res_7_id", "6":"res_6_id", "5":"res_5_id"},
-                      "data_cols" : ["modelled_fuel_stations","modelled_seashore","modelled_solar_wind",
-                                     "modelled_urban_nodes", "modelled_water_bodies", "modelled_gas_pipelines",
-                                     "modelled_hydrogen_pipelines", "modelled_corridor_points", "modelled_powerlines",
-                                     "modelled_transport_nodes", "modelled_residential_areas", "modelled_rest_areas",
-                                     "modelled_slope"]
-                    }
-                }
-            }
-        }
-    }

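The 200 KB ``max_query_size`` limit noted in the ClickHouse docs above implies batching when querying many zones. A minimal sketch of how a caller might split zone IDs so each ``IN (...)`` list stays under the byte budget (the ``chunk_zone_ids`` helper and its overhead constant are assumptions, not part of the provider):

```python
def chunk_zone_ids(zone_ids, max_query_bytes=200_000, overhead=1_000):
    """Split zone IDs into batches whose serialized IN-list stays under budget."""
    budget = max_query_bytes - overhead  # reserve bytes for the rest of the SQL
    batches, current, size = [], [], 0
    for zid in zone_ids:
        cost = len(str(zid)) + 4  # quotes, comma, and a space per ID (rough)
        if current and size + cost > budget:
            batches.append(current)
            current, size = [], 0
        current.append(zid)
        size += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be issued as its own query, keeping every statement under the default server limit without raising ``max_query_size``.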
docs/source/providers/collection_providers/implementations/parquet_collection_provider.rst

Lines changed: 10 additions & 47 deletions

@@ -4,18 +4,24 @@ Parquet Collection Provider
 The implementation uses `duckdb <https://duckdb.org/>`_ as the driver to access data in parquet format. duckdb starts in `in-memory` mode and uses the ``httpfs`` extension for cloud storage access. Each data source has its own ``duckdb.DuckDBPyConnection`` object from the ``duckdb.connect()`` function, and sets up a secret if the connection needs one.
 Therefore, the same parquet provider can support multiple cloud providers with different bucket credentials. All data source info is stored as a dictionary in ``self.datasources``; the key of the dictionary is the ID of the data source.

+ParquetDatasourceInfo
+=====================
+- ``filepath``: String. The file path of the data source. Supports local paths as well as gcs and s3 cloud storage.
+- ``id_col``: String. The column name of the zone IDs.
+- ``credential``: A string in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
+- ``conn``: The duckdb object storing the connection.
+
 Organisation of the dataset with multiple refinement levels
 -----------------------------------------------------------
 The user must arrange all zone IDs at different refinement levels into a single column (e.g. `cell_id`) and ensure that the data at coarser refinement levels is already aggregated; the provider doesn't perform any aggregation on the fly. An example screenshot of a parquet dataset with multiple refinement levels is shown below.

 |parquet_data_example|


-Constructor parameters
-----------------------
-For ``initial_params`` uses in :ref:`collection_providers <collection_providers>`
+Class initialisation
+---------------------

-It is a nested dictionary. At the root level, the dictionary ``datasources`` contains information about the parquet data sources. User defines the parquet data sources as child dictionaries under ``datasources``. The key of the child dictionary represents the unique ID for the parquet data.
+The dictionary ``datasources`` contains information about the parquet data sources. The user defines each parquet data source as a child dictionary under ``datasources``; the key of the child dictionary is the unique ID of the parquet data.

 An example to define a Parquet collection provider:

@@ -25,7 +31,6 @@ An example to define a Parquet collection provider:
     {
         "parquet": {
             "classname": "parquet_collection_provider.ParquetCollectionProvider",
-            "initial_params": {
             "datasources": {
                 "hytruck": {
                     "filepath": "gcs://<path to parquet file>",
@@ -35,51 +40,9 @@ An example to define a Parquet collection provider:
                     "credential": "TYPE gcs, KEY_ID 'myKEY', SECRET 'secretKEY'"
                 }
             }
-            }
         }
     }

-To define a parquet data source, two mandatory parameters are required:
-
-* ``filepath``: the file path of the parquet file.
-* ``id_col``: a string to indicate the column name of zone ID.
-
-Optional parameters:
-
-* ``data_cols``: A list of strings specifying a set of column names from the dataset used in the data query. In the case of all columns, the user can use the short form: ``["*"]``. Default to ``["*"]``
-* ``exclude_data_cols``: A list of strings specifying a set of column names from the dataset that are excluded from the data query. Default to ``[]``
-* ``credential``: a string that is in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
-
-
-get_data parameters
-----------------------
-
-For ``getdata_params`` uses in :ref:`collections <collections>`
-
-* ``datasource_id`` : the unique ID defines for a parquet data source under ``initial_params``
-
-A collection example of using parquet collection provider:
-
-.. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck_parquet":
-            {
-             "title": "Suitability Modelling for Hytruck in parquet data format",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "parquet",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params": {
-                    "datasource_id": "hytruck"
-                }
-             }
-            }
-        }
-    }
-
 .. |parquet_data_example| image:: ../../../images/parquet_multiple_refinement_levels_in_one_column.png
    :width: 600
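
A sketch of the SQL such a provider might compose for a zone query over a parquet datasource. The ``build_zone_query`` helper is hypothetical (not the provider's actual code); ``read_parquet`` and ``SELECT * EXCLUDE (...)`` are standard duckdb SQL features:

```python
def build_zone_query(filepath, id_col, zone_ids,
                     data_cols=("*",), exclude_data_cols=()):
    """Compose a duckdb SQL query for a set of zone IDs (illustrative only)."""
    cols = ", ".join(data_cols)
    # duckdb supports SELECT * EXCLUDE (...) to drop columns from a star select
    if exclude_data_cols and cols == "*":
        cols += f" EXCLUDE ({', '.join(exclude_data_cols)})"
    id_list = ", ".join(f"'{z}'" for z in zone_ids)
    return (f"SELECT {cols} FROM read_parquet('{filepath}') "
            f"WHERE {id_col} IN ({id_list})")

sql = build_zone_query("hytruck.parquet", "cell_id", ["A1", "A2"],
                       exclude_data_cols=["geometry"])
```

Such a string would then be executed on the datasource's own ``duckdb.DuckDBPyConnection``; for large zone lists the IN-list could be batched the same way as for the ClickHouse provider.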
