
Commit c08e029

- update documentation - corrected the zone query/zone data retrieval to not aggregate data.

1 parent b7953a8 commit c08e029

File tree

12 files changed: +204 -254 lines changed


docs/source/providers/collection_providers/implementations/clickhouse_collection_provider.rst

Lines changed: 43 additions & 58 deletions
@@ -2,76 +2,61 @@ Clickhouse Collection Provider
  ==============================
  The implementation uses `clickhouse_driver <https://clickhouse-driver.readthedocs.io/en/latest/>`_ to connect to Clickhouse DB. The provider serves multiple tables on the same database, with each table as a data source. It creates an instance of ``clickhouse_driver.Client`` at initialisation and assigns it to ``self.db``. The reference is used in ``get_data`` for data queries.

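A minimal sketch of this initialisation pattern, assuming the ``connection`` settings shown later on this page (the class body is illustrative, not the actual pydggsapi implementation):

.. code-block:: python

   from clickhouse_driver import Client

   class ClickhouseCollectionProvider:
       def __init__(self, datasources: dict):
           # Hypothetical shape: the "connection" entry supplies Client kwargs.
           conn = datasources["connection"]
           self.db = Client(
               host=conn["host"],
               port=conn.get("port", 9000),
               user=conn.get("user", "default"),
               password=conn.get("password", ""),
               database=conn.get("database", "default"),
               compression=conn.get("compression", False),
           )

       def get_data(self, query: str):
           # clickhouse-driver executes plain SQL strings.
           return self.db.execute(query)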
+ ClickhouseDatasourceInfo
+ ==============================
+ - ``table``: A string indicating the table to query.
+ - ``aggregation``: A string indicating which aggregation to use. Currently only ``mode`` is supported.
+
  Note on Clickhouse query
  -------------------------
  Clickhouse restricts the query size to 200KB by default. It is controlled by the setting `max_query_size <https://clickhouse.com/docs/operations/settings/settings#max_query_size>`_. The default size is too small when the number of zone IDs in a query is large. For instance, if each zone ID consumes 10 bytes (IGEO7 z7 in string format at refinement level 8), the query is limited to about 20,000 zones before considering other overheads.

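If larger batches of zone IDs are needed, clickhouse-driver lets the client raise this limit per session through its ``settings`` argument; a hedged example (the 1 MB value is arbitrary):

.. code-block:: python

   from clickhouse_driver import Client

   # Raise max_query_size so longer IN (...) zone ID lists fit in one query;
   # 1 MB here is an illustrative value, not a recommendation.
   client = Client(host="127.0.0.1", settings={"max_query_size": 1048576})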
- Constructor parameters
- ----------------------
+ Class initialisation
+ --------------------

- For ``initla_param`` uses in :ref:`collection_providers <collection_providers>`
+ The Clickhouse provider needs an extra "connection" setting in ``datasources`` to define the DB connection:

- * ``host``
- * ``user``
- * ``password``
- * ``port``
- * ``compression (default: False)``
- * ``database (default: 'default')``
+ .. code-block:: json
+
+    "connection": {
+        "host": "127.0.0.1",
+        "user": "default",
+        "password": "default",
+        "port": 9000,
+        "compression": false,
+        "database": "default"
+    }

  An example to define a Clickhouse collection provider:

  .. code-block:: json

-    "collection_providers": {"1":
-        {"clickhouse":
-            {"classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
-             "initial_params":
-                {"host": "127.0.0.1",
-                 "user": "user",
-                 "password": "password",
-                 "port": 9000,
-                 "database": "DevelopmentTesting"}
-            }
+    "collection_providers": {
+        "1": {
+            "clickhouse": {
+                "classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
+                "datasources": {
+                    "connection": {
+                        "host": "127.0.0.1",
+                        "user": "default",
+                        "password": "user",
+                        "port": 9000,
+                        "database": "DevelopmentTesting"
+                    },
+                    "hytruck_clickhouse": {
+                        "table": "testing_suitability_IGEO7",
+                        "zone_groups": {
+                            "9": "res_9_id",
+                            "8": "res_8_id",
+                            "7": "res_7_id",
+                            "6": "res_6_id",
+                            "5": "res_5_id"
+                        },
+                        "data_cols": ["data_1", "data_2"]
+                    }
+                }
+            }
         }
+    }
     }
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``table`` : table's name
- * ``zoneId_cols`` : a dictionary that maps refinement levels to columns that store the corresponding zone ID.
- * ``data_cols`` : a list of column names to control which columns should be selected for data queries.
- * ``aggregation`` : default is 'mode'
- * ``max_query_size`` : to be implemented
-
- A collection example of using clickhouse collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck":
-            {"title": "Suitability Modelling for Hytruck",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "clickhouse",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params":
-                   { "table": "testing_suitability_IGEO7",
-                     "zoneId_cols": {"9":"res_9_id", "8":"res_8_id", "7":"res_7_id", "6":"res_6_id", "5":"res_5_id"},
-                     "data_cols" : ["modelled_fuel_stations","modelled_seashore","modelled_solar_wind",
-                                    "modelled_urban_nodes", "modelled_water_bodies", "modelled_gas_pipelines",
-                                    "modelled_hydrogen_pipelines", "modelled_corridor_points", "modelled_powerlines",
-                                    "modelled_transport_nodes", "modelled_residential_areas", "modelled_rest_areas",
-                                    "modelled_slope"]
-                   }
-            }
-        }
-    }
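Given the ``hytruck_clickhouse`` data source above, a query composed from ``table``, ``zone_groups`` and ``data_cols`` might look like the sketch below (the composition is illustrative, not the provider's literal SQL, and the zone IDs are hypothetical):

.. code-block:: python

   # Zone query at refinement level 8, using the zone_groups lookup.
   zone_ids = ["0702345", "0702346"]  # hypothetical IGEO7 zone IDs
   zone_col = {"9": "res_9_id", "8": "res_8_id"}["8"]
   data_cols = ["data_1", "data_2"]

   query = (
       f"SELECT {zone_col}, {', '.join(data_cols)} "
       f"FROM testing_suitability_IGEO7 "
       f"WHERE {zone_col} IN ({', '.join(repr(z) for z in zone_ids)})"
   )
   rows = client.execute(query)  # client from the earlier clickhouse-driver sketch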

docs/source/providers/collection_providers/implementations/parquet_collection_provider.rst

Lines changed: 10 additions & 47 deletions
@@ -4,18 +4,24 @@ Parquet Collection Provider
  The implementation uses `duckdb <https://duckdb.org/>`_ as the driver to access the data in parquet format. DuckDB starts in `in-memory` mode and uses the ``httpfs`` extension for cloud storage access. Each data source has its own ``duckdb.DuckDBPyConnection`` object from the ``duckdb.connect()`` function, and sets up the secret if needed for the connection.
  Therefore, multiple cloud providers can be supported by the same parquet provider with different bucket credentials. All data source info is stored as a dictionary in ``self.datasources`` for retrieval. The key of the dictionary represents the ID of the data source.

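A minimal sketch of that per-datasource setup, assuming a GCS-hosted parquet file and the credential string format documented below (the bucket path and keys are illustrative):

.. code-block:: python

   import duckdb

   # One in-memory connection per data source.
   conn = duckdb.connect()
   conn.execute("INSTALL httpfs;")
   conn.execute("LOAD httpfs;")

   # The credential string is spliced into a temporary DuckDB secret.
   credential = "TYPE gcs, KEY_ID 'myKEY', SECRET 'secretKEY'"
   conn.execute(f"CREATE SECRET ({credential});")

   # Query the parquet file directly through httpfs.
   rows = conn.execute(
       "SELECT * FROM read_parquet('gcs://bucket/data.parquet') LIMIT 5"
   ).fetchall()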
+ ParquetDatasourceInfo
+ =====================
+ - ``filepath``: String. The file path of the data source. Supports local paths as well as gcs and s3 cloud storage.
+ - ``id_col``: String. The column name of the zone IDs.
+ - ``credential``: String in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
+ - ``conn``: duckdb object that stores the connection.

  Organisation of the dataset with multiple refinement levels
  -----------------------------------------------------------
  The user must arrange all zone IDs at different refinement levels into a single column (e.g. `cell_id`), and ensure that the data at coarser refinement levels is aggregated; the provider doesn't perform any aggregation on the fly. An example screenshot of a parquet dataset with multiple refinement levels is shown below.

  |parquet_data_example|

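One hedged way to produce such a pre-aggregated coarser level offline with duckdb (the ``parent_cell_id`` column, the ``avg`` aggregate and the file names are all assumptions; the actual aggregation choice belongs to the data producer):

.. code-block:: python

   import duckdb

   conn = duckdb.connect()
   # Stack fine-level rows and their aggregated parents into one cell_id
   # column, since the provider performs no aggregation on the fly.
   conn.execute("""
       COPY (
           SELECT cell_id, value FROM read_parquet('level9.parquet')
           UNION ALL
           SELECT parent_cell_id AS cell_id, avg(value) AS value
           FROM read_parquet('level9.parquet')
           GROUP BY parent_cell_id
       ) TO 'combined.parquet' (FORMAT parquet)
   """)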
- Constructor parameters
- ----------------------
- For ``initial_params`` uses in :ref:`collection_providers <collection_providers>`
+ Class initialisation
+ ---------------------

- It is a nested dictionary. At the root level, the dictionary ``datasources`` contains information about the parquet data sources. User defines the parquet data sources as child dictionaries under ``datasources``. The key of the child dictionary represents the unique ID for the parquet data.
+ The dictionary ``datasources`` contains information about the parquet data sources. The user defines the parquet data sources as child dictionaries under ``datasources``. The key of each child dictionary represents the unique ID for the parquet data.

  An example to define a Parquet collection provider:

@@ -25,7 +31,6 @@ An example to define a Parquet collection provider:
     {
         "parquet": {
             "classname": "parquet_collection_provider.ParquetCollectionProvider",
-            "initial_params": {
             "datasources": {
                 "hytruck": {
                     "filepath": "gcs://<path to parquet file>",
@@ -35,51 +40,9 @@ An example to define a Parquet collection provider:
                     "credential": "TYPE gcs, KEY_ID 'myKEY', SECRET 'secretKEY'"
                 }
             }
-        }
         }
     }

- To define a parquet data source, two mandatory parameters are required:
-
- * ``filepath``: the file path of the parquet file.
- * ``id_col``: a string to indicate the column name of zone ID.
-
- Optional parameters:
-
- * ``data_cols``: A list of strings specifying a set of column names from the dataset used in the data query. In the case of all columns, the user can use the short form: ``["*"]``. Default to ``["*"]``
- * ``exclude_data_cols``: A list of strings specifying a set of column names from the dataset that are excluded from the data query. Default to ``[]``
- * ``credential``: a string that is in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``datasource_id`` : the unique ID defines for a parquet data source under ``initial_params``
-
- A collection example of using parquet collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck_parquet":
-            {
-             "title": "Suitability Modelling for Hytruck in parquet data format",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "parquet",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params": {
-                    "datasource_id" : "hytruck"
-                }
-             }
-            }
-        }
-    }
-
  .. |parquet_data_example| image:: ../../../images/parquet_multiple_refinement_levels_in_one_column.png
     :width: 600

docs/source/providers/collection_providers/implementations/zarr_collection_provider.rst

Lines changed: 17 additions & 51 deletions
@@ -1,19 +1,24 @@
  Zarr Collection Provider
  ==============================

- The implementation uses `xarray.Datatree <https://docs.xarray.dev/en/latest/generated/xarray.DataTree.html>`_ as the driver to access Zarr data. The provider serves multiple Zarr data sources. At the initialisation stage, it loads the ``datasources`` setting from the ``initial_params`` to get each Zarr data configuration, then it creates an xarray datatree handler for each of them and stores it under ``self.datasources`` with the id as the key.
+ The implementation uses `xarray.DataTree <https://docs.xarray.dev/en/latest/generated/xarray.DataTree.html>`_ as the driver to access Zarr data. The provider serves multiple Zarr data sources. At the initialisation stage, it loads ``datasources`` to get each Zarr data configuration, then creates an xarray datatree handler for each of them and stores it under ``self.datasources`` with the id as the key.

  Each group of the Zarr data source represents data from the same refinement level, with zone IDs as the index. Here is an example of how Zarr data is organised.

  |zarr_data_example|

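A minimal sketch of that handler creation, assuming a local Zarr store with one group per refinement level (the path, group names and zone IDs are illustrative):

.. code-block:: python

   import xarray as xr

   # Open the whole Zarr hierarchy as a tree; each child group holds one
   # refinement level (e.g. "res4", "res5"), indexed by zone ID.
   tree = xr.open_datatree("<path to zarr folder>", engine="zarr")

   # Select a handful of zones at refinement level 5.
   subset = tree["res5"].ds.sel(zoneId=["zone_a", "zone_b"])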
+ ZarrDatasourceInfo
+ ==================
+ - ``filepath``: String. The file path of the data source. Supports local, gcs and s3 cloud storage.
+ - ``id_col``: String. The coordinate name of the zone IDs; default is "".
+ - ``filehandle``: xarray datatree object that stores the connection.

- Constructor parameters
- ----------------------
- For ``initial_params`` uses in :ref:`collection_providers <collection_providers>`
-
- It is a nested dictionary. At the root level, the dictionary ``datasources`` contains information about one or more Zarr data sources in the form of a child dictionary. The key of the child dictionary represents the unique ID for the Zarr data. Currently, only local storage is supported.
+ Class initialisation
+ --------------------
+
+ The dictionary ``datasources`` contains information about one or more Zarr data sources in the form of a child dictionary. The key of each child dictionary represents the unique ID for the Zarr data. Currently, only local storage is supported.

  An example to define a Zarr collection provider:

@@ -22,57 +27,18 @@ An example to define a Zarr collection provider:
     "collection_providers": {"1":
         {"zarr":
             {"classname": "zarr_collection_provider.ZarrCollectionProvider",
-             "initial_params":
-                { "datasources": {
-                    "my_zarr_data": {
-                        "filepath": "<path to zarr folder>",
-                        "id_col": "zoneId",
-                        "zones_grps" : { "4": "res4", "5": "res5"}
-                    }
-                }
+             "datasources": {
+                 "my_zarr_data": {
+                     "filepath": "<path to zarr folder>",
+                     "id_col": "zoneId",
+                     "zones_grps" : { "4": "res4", "5": "res5"}
+                 }
+             }
             }
         }
     }

- For each Zarr data, two parameters are required:
-
- * ``filepath`` : the local directory path of the data.
- * ``zones_grps`` : a dictionary that maps refinement level to group name of the data.
- * ``id_col`` : the coordinate name of the zone IDs, assume that all groups share the same coordinate name. If not supplied, the ``zones_grps`` value is used.
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``datasource_id`` : the unique ID defines for a Zarr data under ``initial_params``
-
- A collection example of using Zarr collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck_zarr":
-            {
-             "title": "Suitability Modelling for Hytruck in Zarr Data format",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "zarr",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 5,
-                "getdata_params": {
-                    "datasource_id" : "my_zarr_data"
-                }
-             }
-            }
-        }
-    }
-
  .. |zarr_data_example| image:: ../../../images/zarr_data_example.png
     :width: 600


docs/source/providers/collection_providers/index.rst

Lines changed: 43 additions & 19 deletions
@@ -1,40 +1,64 @@
+ Abstract Datasource Info
+ ========================
+
+ Each Collection Provider must have its own DatasourceInfo class that extends AbstractDatasourceInfo. The abstract class holds the standardised data source info. The API doesn't interact with the data source info directly; it is mainly used by the ``get_data`` function of the collection provider.
+
+ Each data source defined under ``collection_providers`` is instantiated as the corresponding data source info class when loaded. All data sources are stored in the ``datasources`` dictionary of the collection provider.
+
+ The attributes of the Abstract Datasource Info class are:
+
+ - ``data_cols``: a list of column names (as strings) used by ``get_data``; defaults to ``['*']``, which means all columns.
+ - ``exclude_data_cols``: a list of column names (as strings) that are excluded from ``get_data``; defaults to ``[]``.
+ - ``zone_groups``: a dictionary mapping each refinement level to the column name that stores the zone IDs.
+
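A hedged sketch of what this base class could look like, inferred from the attribute list above and the pydantic-style ``ParquetDatasourceInfo`` shown later on this page (the use of pydantic is an assumption, not taken from the source):

.. code-block:: python

   from pydantic import BaseModel

   class AbstractDatasourceInfo(BaseModel):
       # Columns served by get_data; ['*'] means all columns.
       data_cols: list[str] = ['*']
       # Columns withheld from get_data results.
       exclude_data_cols: list[str] = []
       # Refinement level -> column holding that level's zone IDs.
       zone_groups: dict[str, str] = {}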
  Abstract Collection providers
  =============================

- To implement a collection provider, users need to provide the implementation of the interface listed below:
+ To implement a collection provider, users need to provide initialisation of the ``datasources`` variable and the implementation of the interfaces listed below:
+
+ Variable:
+
+ - ``datasources``: a dictionary whose keys are the ``datasource_id`` values, each mapping to the corresponding datasource info class.
+
+ Interfaces:

  - ``get_data``: implementation of the data query from the dataset
  - ``get_datadictionary``: implementation of getting the data dictionary (column names and data types) from the dataset, for the tiles JSON response.

- Class constructor
- -----------------
+ Class initialisation
+ --------------------
+
+ The :ref:`collection_providers <collection_providers>` must initialise the ``datasources`` dictionary of the class with the ``datasources`` configuration from the ``collection_providers`` table. Users can reference the full example :ref:`here <_collection_provider_config_example>`.
+
+ For example, the ``ParquetDatasourceInfo`` class and a matching ``datasources`` configuration entry:
+
+ .. code-block:: python
+
+    class ParquetDatasourceInfo(AbstractDatasourceInfo):
+        filepath: str = ""
+        id_col: str = ""
+        conn: duckdb.DuckDBPyConnection = None

- The :ref:`collection_providers <collection_providers>` configuration provides a parameters dictionary with the key ``inital_params`` to supply necessary info when initialising the collection provider. Users can reference the full example :ref:`here <_collection_provider_config_example>`.

  .. code-block:: json

-    "initial_params":
-        {"host": "127.0.0.1",
-         "user": "user",
-         "password": "password",
-         "port": 9000,
-         "database": "DevelopmentTesting"}
+    "hytruck_local": {
+        "filepath": "~/file_path/igeo7_4-10.parquet",
+        "id_col": "cell_ids",
+        "data_cols": ["stations_band_1", "pipelines_band_1", "pipelines_band_2"],
+        "exclude_data_cols": ["geometry"]
+    }

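A hedged sketch of how such a configuration entry could be turned into the info object at load time (keyword unpacking into the pydantic-style model is an assumption):

.. code-block:: python

   import duckdb

   config = {
       "hytruck_local": {
           "filepath": "~/file_path/igeo7_4-10.parquet",
           "id_col": "cell_ids",
           "data_cols": ["stations_band_1", "pipelines_band_1", "pipelines_band_2"],
           "exclude_data_cols": ["geometry"],
       }
   }

   datasources = {}
   for datasource_id, params in config.items():
       info = ParquetDatasourceInfo(**params)   # fields match the config keys
       info.conn = duckdb.connect()             # one connection per data source
       datasources[datasource_id] = info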
  .. _parameters_for_get_data:

  Parameters for get_data
  -----------------------
- The pydggsapi creates collection provider objects at the beginning, and data sources that share the same provider will use the same object instance. Thus, in addition to the standard parameters of the interface ``get_data``, pydggsapi will pass in a parameters dictionary ``getdata_params`` defined in the :ref:`collections <collections>` setting. The extra parameters provide flexibility for the get_data interface if needed.
+ pydggsapi creates collection provider objects at startup, and data sources that share the same provider use the same object instance. The ``get_data`` function accepts the parameter ``datasource_id``, defined in the :ref:`collections <collections>` setting, to retrieve the corresponding data source info class, which is used to perform queries.

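A hedged sketch of that lookup pattern for the parquet case (everything beyond the ``datasource_id`` lookup is assumed, not taken from the source):

.. code-block:: python

   def get_data(self, zone_ids: list[str], datasource_id: str):
       # Resolve the datasource info object registered at initialisation.
       info = self.datasources[datasource_id]
       cols = ", ".join(info.data_cols)            # ['*'] joins to "*"
       placeholders = ", ".join(["?"] * len(zone_ids))
       query = (
           f"SELECT {cols} FROM read_parquet('{info.filepath}') "
           f"WHERE {info.id_col} IN ({placeholders})"
       )
       return info.conn.execute(query, zone_ids).fetchall()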
- .. code-block:: json
-
-    "getdata_params":
-        { "table": "testing_suitability_IGEO7",
-          "zoneId_cols": {"9":"res_9_id", "8":"res_8_id", "7":"res_7_id", "6":"res_6_id", "5":"res_5_id"},
-          "data_cols" : ["modelled_fuel_stations","modelled_seashore","modelled_solar_wind"]
-        }

  Parameters for get_datadictionary
  ---------------------------------
