
Commit c08e029

- update documentation - corrected the zone query/zone data retrieval to not aggregate data.

1 parent b7953a8 commit c08e029

File tree

12 files changed: +204 -254 lines changed


docs/source/providers/collection_providers/implementations/clickhouse_collection_provider.rst

Lines changed: 43 additions & 58 deletions
@@ -2,76 +2,61 @@ Clickhouse Collection Provider
  ==============================
  The implementation uses `clickhouse_driver <https://clickhouse-driver.readthedocs.io/en/latest/>`_ to connect to Clickhouse DB. The provider serves multiple tables on the same database, with each table as a data source. It creates an instance of ``clickhouse_driver.Client`` at initialisation and assigns it to ``self.db``. The reference is used in ``get_data`` for data queries.

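A minimal sketch of this initialisation pattern, assuming the ``connection`` settings shown later on this page (the class body is illustrative, not the actual pydggsapi implementation):

.. code-block:: python

   from clickhouse_driver import Client

   class ClickhouseCollectionProvider:
       def __init__(self, datasources: dict):
           # Hypothetical shape: the "connection" entry supplies Client kwargs.
           conn = datasources["connection"]
           self.db = Client(
               host=conn["host"],
               port=conn.get("port", 9000),
               user=conn.get("user", "default"),
               password=conn.get("password", ""),
               database=conn.get("database", "default"),
               compression=conn.get("compression", False),
           )

       def get_data(self, query: str):
           # clickhouse-driver executes plain SQL strings.
           return self.db.execute(query)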
+ ClickhouseDatasourceInfo
+ ==============================
+ - ``table``: A string indicating the table to query.
+ - ``aggregation``: A string indicating which aggregation to use. Currently only ``mode`` is supported.
+
  Note on Clickhouse query
  -------------------------
  Clickhouse restricts the query size to 200KB by default. It is controlled by the setting `max_query_size <https://clickhouse.com/docs/operations/settings/settings#max_query_size>`_. The default size is too small when the number of zone IDs in a query is large. For instance, if each zone ID consumes 10 bytes (IGEO7 z7 in string format at refinement level 8), the query is limited to about 20,000 zones before considering other overheads.

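If larger batches of zone IDs are needed, clickhouse-driver lets the client raise this limit per session through its ``settings`` argument; a hedged example (the 1 MB value is arbitrary):

.. code-block:: python

   from clickhouse_driver import Client

   # Raise max_query_size so longer IN (...) zone ID lists fit in one query;
   # 1 MB here is an illustrative value, not a recommendation.
   client = Client(host="127.0.0.1", settings={"max_query_size": 1048576})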
- Constructor parameters
- ----------------------
+ Class initialisation
+ --------------------

- For ``initla_param`` uses in :ref:`collection_providers <collection_providers>`
+ The Clickhouse provider needs an extra "connection" setting in ``datasources`` to define the DB connection:

- * ``host``
- * ``user``
- * ``password``
- * ``port``
- * ``compression (default: False)``
- * ``database (default: 'default')``
+ .. code-block:: json
+
+    "connection": {
+        "host": "127.0.0.1",
+        "user": "default",
+        "password": "default",
+        "port": 9000,
+        "compression": false,
+        "database": "default"
+    }

  An example to define a Clickhouse collection provider:

  .. code-block:: json

-    "collection_providers": {"1":
-        {"clickhouse":
-            {"classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
-             "initial_params":
-                {"host": "127.0.0.1",
-                 "user": "user",
-                 "password": "password",
-                 "port": 9000,
-                 "database": "DevelopmentTesting"}
-            }
+    "collection_providers": {
+        "1": {
+            "clickhouse": {
+                "classname": "clickhouse_collection_provider.ClickhouseCollectionProvider",
+                "datasources": {
+                    "connection": {
+                        "host": "127.0.0.1",
+                        "user": "default",
+                        "password": "user",
+                        "port": 9000,
+                        "database": "DevelopmentTesting"
+                    },
+                    "hytruck_clickhouse": {
+                        "table": "testing_suitability_IGEO7",
+                        "zone_groups": {
+                            "9": "res_9_id",
+                            "8": "res_8_id",
+                            "7": "res_7_id",
+                            "6": "res_6_id",
+                            "5": "res_5_id"
+                        },
+                        "data_cols": ["data_1", "data_2"]
+                    }
+                }
+            }
         }
+    }
     }
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``table`` : table's name
- * ``zoneId_cols`` : a dictionary that maps refinement levels to columns that store the corresponding zone ID.
- * ``data_cols`` : a list of column names to control which columns should be selected for data queries.
- * ``aggregation`` : default is 'mode'
- * ``max_query_size`` : to be implemented
-
- A collection example of using clickhouse collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck":
-            {"title": "Suitability Modelling for Hytruck",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "clickhouse",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params":
-                   { "table": "testing_suitability_IGEO7",
-                     "zoneId_cols": {"9":"res_9_id", "8":"res_8_id", "7":"res_7_id", "6":"res_6_id", "5":"res_5_id"},
-                     "data_cols" : ["modelled_fuel_stations","modelled_seashore","modelled_solar_wind",
-                                    "modelled_urban_nodes", "modelled_water_bodies", "modelled_gas_pipelines",
-                                    "modelled_hydrogen_pipelines", "modelled_corridor_points", "modelled_powerlines",
-                                    "modelled_transport_nodes", "modelled_residential_areas", "modelled_rest_areas",
-                                    "modelled_slope"]
-                   }
-            }
-        }
-    }
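Given the ``hytruck_clickhouse`` data source above, a query composed from ``table``, ``zone_groups`` and ``data_cols`` might look like the sketch below (the composition is illustrative, not the provider's literal SQL, and the zone IDs are hypothetical):

.. code-block:: python

   # Zone query at refinement level 8, using the zone_groups lookup.
   zone_ids = ["0702345", "0702346"]  # hypothetical IGEO7 zone IDs
   zone_col = {"9": "res_9_id", "8": "res_8_id"}["8"]
   data_cols = ["data_1", "data_2"]

   query = (
       f"SELECT {zone_col}, {', '.join(data_cols)} "
       f"FROM testing_suitability_IGEO7 "
       f"WHERE {zone_col} IN ({', '.join(repr(z) for z in zone_ids)})"
   )
   rows = client.execute(query)  # client from the earlier clickhouse-driver sketch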

docs/source/providers/collection_providers/implementations/parquet_collection_provider.rst

Lines changed: 10 additions & 47 deletions
@@ -4,18 +4,24 @@ Parquet Collection Provider
  The implementation uses `duckdb <https://duckdb.org/>`_ as the driver to access the data in parquet format. DuckDB starts in `in-memory` mode and uses the ``httpfs`` extension for cloud storage access. Each data source has its own ``duckdb.DuckDBPyConnection`` object from the ``duckdb.connect()`` function, and sets up the secret if needed for the connection.
  Therefore, multiple cloud providers can be supported by the same parquet provider with different bucket credentials. All data source info is stored as a dictionary in ``self.datasources`` for retrieval. The key of the dictionary represents the ID of the data source.

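A minimal sketch of that per-datasource setup, assuming a GCS-hosted parquet file and the credential string format documented below (the bucket path and keys are illustrative):

.. code-block:: python

   import duckdb

   # One in-memory connection per data source.
   conn = duckdb.connect()
   conn.execute("INSTALL httpfs;")
   conn.execute("LOAD httpfs;")

   # The credential string is spliced into a temporary DuckDB secret.
   credential = "TYPE gcs, KEY_ID 'myKEY', SECRET 'secretKEY'"
   conn.execute(f"CREATE SECRET ({credential});")

   # Query the parquet file directly through httpfs.
   rows = conn.execute(
       "SELECT * FROM read_parquet('gcs://bucket/data.parquet') LIMIT 5"
   ).fetchall()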
+ ParquetDatasourceInfo
+ =====================
+ - ``filepath``: String. The file path of the data source. Supports local paths as well as gcs and s3 cloud storage.
+ - ``id_col``: String. The column name of the zone IDs.
+ - ``credential``: String in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
+ - ``conn``: duckdb object that stores the connection.

  Organisation of the dataset with multiple refinement levels
  -----------------------------------------------------------
  The user must arrange all zone IDs at different refinement levels into a single column (e.g. `cell_id`), and ensure that the data at coarser refinement levels is aggregated; the provider doesn't perform any aggregation on the fly. An example screenshot of a parquet dataset with multiple refinement levels is shown below.

  |parquet_data_example|

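One hedged way to produce such a pre-aggregated coarser level offline with duckdb (the ``parent_cell_id`` column, the ``avg`` aggregate and the file names are all assumptions; the actual aggregation choice belongs to the data producer):

.. code-block:: python

   import duckdb

   conn = duckdb.connect()
   # Stack fine-level rows and their aggregated parents into one cell_id
   # column, since the provider performs no aggregation on the fly.
   conn.execute("""
       COPY (
           SELECT cell_id, value FROM read_parquet('level9.parquet')
           UNION ALL
           SELECT parent_cell_id AS cell_id, avg(value) AS value
           FROM read_parquet('level9.parquet')
           GROUP BY parent_cell_id
       ) TO 'combined.parquet' (FORMAT parquet)
   """)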
- Constructor parameters
- ----------------------
- For ``initial_params`` uses in :ref:`collection_providers <collection_providers>`
+ Class initialisation
+ ---------------------

- It is a nested dictionary. At the root level, the dictionary ``datasources`` contains information about the parquet data sources. User defines the parquet data sources as child dictionaries under ``datasources``. The key of the child dictionary represents the unique ID for the parquet data.
+ The dictionary ``datasources`` contains information about the parquet data sources. The user defines the parquet data sources as child dictionaries under ``datasources``. The key of each child dictionary represents the unique ID for the parquet data.

  An example to define a Parquet collection provider:

@@ -25,7 +31,6 @@ An example to define a Parquet collection provider:
     {
         "parquet": {
             "classname": "parquet_collection_provider.ParquetCollectionProvider",
-            "initial_params": {
             "datasources": {
                 "hytruck": {
                     "filepath": "gcs://<path to parquet file>",
@@ -35,51 +40,9 @@ An example to define a Parquet collection provider:
                     "credential": "TYPE gcs, KEY_ID 'myKEY', SECRET 'secretKEY'"
                 }
             }
-        }
         }
     }

- To define a parquet data source, two mandatory parameters are required:
-
- * ``filepath``: the file path of the parquet file.
- * ``id_col``: a string to indicate the column name of zone ID.
-
- Optional parameters:
-
- * ``data_cols``: A list of strings specifying a set of column names from the dataset used in the data query. In the case of all columns, the user can use the short form: ``["*"]``. Default to ``["*"]``
- * ``exclude_data_cols``: A list of strings specifying a set of column names from the dataset that are excluded from the data query. Default to ``[]``
- * ``credential``: a string that is in the form of `temporary secrets from duckdb <https://duckdb.org/docs/stable/configuration/secrets_manager.html>`_. To specify a custom s3 endpoint, please refer `here <https://duckdb.org/docs/stable/core_extensions/httpfs/s3api.html>`_.
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``datasource_id`` : the unique ID defines for a parquet data source under ``initial_params``
-
- A collection example of using parquet collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck_parquet":
-            {
-             "title": "Suitability Modelling for Hytruck in parquet data format",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "parquet",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 9,
-                "getdata_params": {
-                    "datasource_id" : "hytruck"
-                }
-             }
-            }
-        }
-    }
-
  .. |parquet_data_example| image:: ../../../images/parquet_multiple_refinement_levels_in_one_column.png
     :width: 600

docs/source/providers/collection_providers/implementations/zarr_collection_provider.rst

Lines changed: 17 additions & 51 deletions
@@ -1,19 +1,24 @@
  Zarr Collection Provider
  ==============================

- The implementation uses `xarray.Datatree <https://docs.xarray.dev/en/latest/generated/xarray.DataTree.html>`_ as the driver to access Zarr data. The provider serves multiple Zarr data sources. At the initialisation stage, it loads the ``datasources`` setting from the ``initial_params`` to get each Zarr data configuration, then it creates an xarray datatree handler for each of them and stores it under ``self.datasources`` with the id as the key.
+ The implementation uses `xarray.DataTree <https://docs.xarray.dev/en/latest/generated/xarray.DataTree.html>`_ as the driver to access Zarr data. The provider serves multiple Zarr data sources. At the initialisation stage, it loads ``datasources`` to get each Zarr data configuration, then creates an xarray datatree handler for each of them and stores it under ``self.datasources`` with the id as the key.

  Each group of the Zarr data source represents data from the same refinement level, with zone IDs as the index. Here is an example of how Zarr data is organised.

  |zarr_data_example|

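A minimal sketch of that handler creation, assuming a local Zarr store with one group per refinement level (the path, group names and zone IDs are illustrative):

.. code-block:: python

   import xarray as xr

   # Open the whole Zarr hierarchy as a tree; each child group holds one
   # refinement level (e.g. "res4", "res5"), indexed by zone ID.
   tree = xr.open_datatree("<path to zarr folder>", engine="zarr")

   # Select a handful of zones at refinement level 5.
   subset = tree["res5"].ds.sel(zoneId=["zone_a", "zone_b"])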
+ ZarrDatasourceInfo
+ ==================
+ - ``filepath``: String. The file path of the data source. Supports local, gcs and s3 cloud storage.
+ - ``id_col``: String. The coordinate name of the zone IDs; default is "".
+ - ``filehandle``: xarray datatree object that stores the connection.

- Constructor parameters
- ----------------------
- For ``initial_params`` uses in :ref:`collection_providers <collection_providers>`
-
- It is a nested dictionary. At the root level, the dictionary ``datasources`` contains information about one or more Zarr data sources in the form of a child dictionary. The key of the child dictionary represents the unique ID for the Zarr data. Currently, only local storage is supported.
+ Class initialisation
+ --------------------
+
+ The dictionary ``datasources`` contains information about one or more Zarr data sources in the form of a child dictionary. The key of each child dictionary represents the unique ID for the Zarr data. Currently, only local storage is supported.

  An example to define a Zarr collection provider:

@@ -22,57 +27,18 @@ An example to define a Zarr collection provider:
     "collection_providers": {"1":
         {"zarr":
             {"classname": "zarr_collection_provider.ZarrCollectionProvider",
-             "initial_params":
-                { "datasources": {
-                    "my_zarr_data": {
-                        "filepath": "<path to zarr folder>",
-                        "id_col": "zoneId",
-                        "zones_grps" : { "4": "res4", "5": "res5"}
-                    }
-                }
+             "datasources": {
+                 "my_zarr_data": {
+                     "filepath": "<path to zarr folder>",
+                     "id_col": "zoneId",
+                     "zones_grps" : { "4": "res4", "5": "res5"}
+                 }
+             }
             }
         }
     }

- For each Zarr data, two parameters are required:
-
- * ``filepath`` : the local directory path of the data.
- * ``zones_grps`` : a dictionary that maps refinement level to group name of the data.
- * ``id_col`` : the coordinate name of the zone IDs, assume that all groups share the same coordinate name. If not supplied, the ``zones_grps`` value is used.
-
- get_data parameters
- ----------------------
-
- For ``getdata_params`` uses in :ref:`collections <collections>`
-
- * ``datasource_id`` : the unique ID defines for a Zarr data under ``initial_params``
-
- A collection example of using Zarr collection provider :
-
- .. code-block:: json
-
-    "collections": {"1":
-        {"suitability_hytruck_zarr":
-            {
-             "title": "Suitability Modelling for Hytruck in Zarr Data format",
-             "description": "Desc",
-             "collection_provider": {
-                "providerId": "zarr",
-                "dggrsId": "igeo7",
-                "maxzonelevel": 5,
-                "getdata_params": {
-                    "datasource_id" : "my_zarr_data"
-                }
-             }
-            }
-        }
-    }
-
  .. |zarr_data_example| image:: ../../../images/zarr_data_example.png
     :width: 600


docs/source/providers/collection_providers/index.rst

Lines changed: 43 additions & 19 deletions
@@ -1,40 +1,64 @@
+ Abstract Datasource Info
+ ========================
+
+ Each Collection Provider must have its own DatasourceInfo class that extends AbstractDatasourceInfo. The abstract class holds the standardised data source info. The API doesn't interact with the data source info directly; it is mainly used by the ``get_data`` function of the collection provider.
+
+ Each data source defined under ``collection_providers`` is instantiated as the corresponding data source info class when loaded. All data sources are stored in the ``datasources`` dictionary of the collection provider.
+
+ The attributes of the Abstract Datasource Info class are:
+
+ - ``data_cols``: a list of column names (as strings) used by ``get_data``; defaults to ``['*']``, which means all columns.
+ - ``exclude_data_cols``: a list of column names (as strings) that are excluded from ``get_data``; defaults to ``[]``.
+ - ``zone_groups``: a dictionary mapping each refinement level to the column name that stores the zone IDs.
+
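A hedged sketch of what this base class could look like, inferred from the attribute list above and the pydantic-style ``ParquetDatasourceInfo`` shown later on this page (the use of pydantic is an assumption, not taken from the source):

.. code-block:: python

   from pydantic import BaseModel

   class AbstractDatasourceInfo(BaseModel):
       # Columns served by get_data; ['*'] means all columns.
       data_cols: list[str] = ['*']
       # Columns withheld from get_data results.
       exclude_data_cols: list[str] = []
       # Refinement level -> column holding that level's zone IDs.
       zone_groups: dict[str, str] = {}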
  Abstract Collection providers
  =============================

- To implement a collection provider, users need to provide the implementation of the interface listed below:
+ To implement a collection provider, users need to provide initialisation of the ``datasources`` variable and the implementation of the interfaces listed below:
+
+ Variable:
+
+ - ``datasources``: a dictionary whose keys are the ``datasource_id`` values, each mapping to the corresponding datasource info class.
+
+ Interfaces:

  - ``get_data``: implementation of the data query from the dataset
  - ``get_datadictionary``: implementation of getting the data dictionary (column names and data types) from the dataset, for the tiles JSON response.

- Class constructor
- -----------------
+ Class initialisation
+ --------------------
+
+ The :ref:`collection_providers <collection_providers>` must initialise the ``datasources`` dictionary of the class with the ``datasources`` configuration from the ``collection_providers`` table. Users can reference the full example :ref:`here <_collection_provider_config_example>`.
+
+ For example, the ``ParquetDatasourceInfo`` class and a matching ``datasources`` configuration entry:
+
+ .. code-block:: python
+
+    class ParquetDatasourceInfo(AbstractDatasourceInfo):
+        filepath: str = ""
+        id_col: str = ""
+        conn: duckdb.DuckDBPyConnection = None

- The :ref:`collection_providers <collection_providers>` configuration provides a parameters dictionary with the key ``inital_params`` to supply necessary info when initialising the collection provider. Users can reference the full example :ref:`here <_collection_provider_config_example>`.

  .. code-block:: json

-    "initial_params":
-        {"host": "127.0.0.1",
-         "user": "user",
-         "password": "password",
-         "port": 9000,
-         "database": "DevelopmentTesting"}
+    "hytruck_local": {
+        "filepath": "~/file_path/igeo7_4-10.parquet",
+        "id_col": "cell_ids",
+        "data_cols": ["stations_band_1", "pipelines_band_1", "pipelines_band_2"],
+        "exclude_data_cols": ["geometry"]
+    }

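A hedged sketch of how such a configuration entry could be turned into the info object at load time (keyword unpacking into the pydantic-style model is an assumption):

.. code-block:: python

   import duckdb

   config = {
       "hytruck_local": {
           "filepath": "~/file_path/igeo7_4-10.parquet",
           "id_col": "cell_ids",
           "data_cols": ["stations_band_1", "pipelines_band_1", "pipelines_band_2"],
           "exclude_data_cols": ["geometry"],
       }
   }

   datasources = {}
   for datasource_id, params in config.items():
       info = ParquetDatasourceInfo(**params)   # fields match the config keys
       info.conn = duckdb.connect()             # one connection per data source
       datasources[datasource_id] = info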
  .. _parameters_for_get_data:

  Parameters for get_data
  -----------------------
- The pydggsapi creates collection provider objects at the beginning, and data sources that share the same provider will use the same object instance. Thus, in addition to the standard parameters of the interface ``get_data``, pydggsapi will pass in a parameters dictionary ``getdata_params`` defined in the :ref:`collections <collections>` setting. The extra parameters provide flexibility for the get_data interface if needed.
+ pydggsapi creates collection provider objects at startup, and data sources that share the same provider use the same object instance. The ``get_data`` function accepts the parameter ``datasource_id``, defined in the :ref:`collections <collections>` setting, to retrieve the corresponding data source info class, which is used to perform queries.

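A hedged sketch of that lookup pattern for the parquet case (everything beyond the ``datasource_id`` lookup is assumed, not taken from the source):

.. code-block:: python

   def get_data(self, zone_ids: list[str], datasource_id: str):
       # Resolve the datasource info object registered at initialisation.
       info = self.datasources[datasource_id]
       cols = ", ".join(info.data_cols)            # ['*'] joins to "*"
       placeholders = ", ".join(["?"] * len(zone_ids))
       query = (
           f"SELECT {cols} FROM read_parquet('{info.filepath}') "
           f"WHERE {info.id_col} IN ({placeholders})"
       )
       return info.conn.execute(query, zone_ids).fetchall()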
- .. code-block:: json
-
-    "getdata_params":
-        { "table": "testing_suitability_IGEO7",
-          "zoneId_cols": {"9":"res_9_id", "8":"res_8_id", "7":"res_7_id", "6":"res_6_id", "5":"res_5_id"},
-          "data_cols" : ["modelled_fuel_stations","modelled_seashore","modelled_solar_wind"]
-        }

  Parameters for get_datadictionary
  ---------------------------------
