# Docker Cluster for Integration Testing

## Introduction

The docker cluster in `docker/integ-test` is designed to be used for integration testing. It supports the following
use cases:
1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension
   for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
   S3/Glue datasources. A local container is run rather than using the AWS EMR service.

The cluster consists of several containers, and the compose setup handles configuring them. No tables are created.
| 14 | + |
## Overview

All containers run in a dedicated docker network.
| 20 | + |
### OpenSearch Dashboards

An OpenSearch Dashboards server that is connected to the OpenSearch server. It is exposed to the host OS,
so it can be accessed with a browser.
| 25 | + |
### OpenSearch

An OpenSearch server. It is running in standalone mode. It is exposed to the host OS. It is configured to have
an S3/Glue datasource with the name `mys3`. System indices and system indices permissions are disabled.

This container also has a docker volume used to persist data such as local indices.
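Once the cluster is up, the OpenSearch server and the `mys3` datasource can be checked from the host OS. This is a hypothetical sketch: it assumes the default port mapping (9200), that `OPENSEARCH_ADMIN_PASSWORD` is exported in your shell, and the datasources endpoint documented for the OpenSearch SQL plugin.

```shell
# Check that the OpenSearch server is up
# (the security plugin requires HTTPS and basic auth)
curl -k -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" https://localhost:9200

# List the configured datasources; `mys3` should appear once the
# configuration container has finished running
curl -k -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" \
  https://localhost:9200/_plugins/_query/_datasources
```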
| 32 | + |
### Spark

The Spark master node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark master also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

Spark Connect also runs in this container and can be used to easily issue queries. The port for
Spark Connect is exposed to the host OS.
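As an example of using Spark Connect from the host OS, the sketch below assumes a local PySpark installation (3.4 or later, which ships the Spark Connect client) and the default port mapping of 15002:

```shell
# Install the PySpark client with Spark Connect support
pip install "pyspark[connect]"

# Open an interactive shell against the Spark Connect server in the cluster
pyspark --remote "sc://localhost:15002"
```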
| 41 | + |
### Spark Worker

The Spark worker node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark worker also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.
| 47 | + |
### Spark Submit

A temporary container that runs queries for an Async API session. It is started by the OpenSearch container. It
does not connect to the Spark cluster and instead runs the queries locally. It will keep looking for more
queries to run until it reaches its timeout (3 minutes by default).

The Spark submit container is configured to use an external Hive metastore in the container `metastore`. The
Flint and PPL extensions are installed. When building the docker image, locally built Jar files can be used.
| 56 | + |
### Metastore (Hive)

A Hive server that is used as a metastore for the Spark containers. It is configured to use the `integ-test`
bucket on the Minio container.

This container also has a docker volume used to persist the metastore.
| 63 | + |
### Minio (S3)

A Minio server that acts as an S3 server. It is used as part of the workflow of executing an S3/Glue query.
It will contain the data for the S3 tables.

This container also has a docker volume used to persist the S3 data.
| 70 | + |
### Configuration Updater

A temporary container that is used to configure the OpenSearch and Minio containers. It runs after both
of those have started up. For Minio, it will add the `integ-test` bucket and create an access key. For
OpenSearch, it will create the S3/Glue datasource and apply a cluster configuration.
| 76 | + |
## Running the Cluster

To start the cluster, go to the directory `docker/integ-test` and use docker compose. When
starting the cluster, wait for the `spark-worker` container to finish starting up. It is the last container
to start.

Start the cluster in the foreground:
```shell
docker compose up
```

Start the cluster in the background:
```shell
docker compose up -d
```

Stop the cluster:
```shell
docker compose down
```
| 97 | + |
## Creating Tables in S3

Tables need to be created in Spark as external tables. Their location must be set to a path under `s3a://integ-test/`.
You can use `spark-shell` on the Spark master container to do this:
```shell
docker exec -it spark spark-shell
```

Example of creating a table and adding data:
```scala
spark.sql("CREATE EXTERNAL TABLE foo (id int, name varchar(100)) LOCATION 's3a://integ-test/foo'")
spark.sql("INSERT INTO foo (id, name) VALUES (1, 'Foo')")
```
| 111 | + |
## Configuration of the Cluster

There are several settings that can be adjusted for the cluster.

* SPARK_VERSION - the tag of the `bitnami/spark` docker image to use
* OPENSEARCH_VERSION - the tag of the `opensearchproject/opensearch` docker image to use
* DASHBOARDS_VERSION - the tag of the `opensearchproject/opensearch-dashboards` docker image to use
* MASTER_UI_PORT - port on the host OS to map to the master UI port (8080) of the Spark master
* MASTER_PORT - port on the host OS to map to the master port (7077) on the Spark master
* UI_PORT - port on the host OS to map to the UI port (4040) on the Spark master
* SPARK_CONNECT_PORT - port on the host OS to map to the Spark Connect port (15002) on the Spark master
* PPL_JAR - the relative path to the PPL extension Jar file. Must be within the base directory of this repository
* FLINT_JAR - the relative path to the Flint extension Jar file. Must be within the base directory of this
  repository
* SQL_APP_JAR - the relative path to the SQL application Jar file. Must be within the base directory of this
  repository
* OPENSEARCH_NODE_MEMORY - amount of memory to allocate for the OpenSearch server
* OPENSEARCH_ADMIN_PASSWORD - password for the admin user of the OpenSearch server
* OPENSEARCH_PORT - port on the host OS to map to port 9200 on the OpenSearch server
* OPENSEARCH_PA_PORT - port on the host OS to map to the performance analyzer port (9600) on the OpenSearch
  server
* OPENSEARCH_DASHBOARDS_PORT - port on the host OS to map to the OpenSearch Dashboards server
* S3_ACCESS_KEY - access key to create on the Minio container
* S3_SECRET_KEY - secret key to create on the Minio container
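For example, a hypothetical `.env` file in `docker/integ-test` could override a few of these settings before starting the cluster (the values shown are illustrative assumptions, not documented defaults):

```shell
# docker/integ-test/.env -- picked up automatically by docker compose
SPARK_VERSION=3.5.3
OPENSEARCH_VERSION=2.18.0
OPENSEARCH_PORT=9201
OPENSEARCH_ADMIN_PASSWORD=C0rrectHorsebattery!
```

docker compose reads `.env` from the project directory, so a subsequent `docker compose up -d` will use these values.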
| 136 | + |
## Async API Overview

[Async API Interfaces](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst)

[Async API Documentation](https://opensearch.org/docs/latest/search-plugins/async/index/)

The Async API is able to query S3/Glue datasources. It does this by calling the AWS EMR service to run the
query in a docker container. The docker container uses Spark and is able to access the Glue catalog and
retrieve data from S3.

For the docker cluster, Minio is used in place of S3 and Docker itself is used in place of AWS EMR.
| 148 | + |
The request flow is as follows:

1. The client submits a request to the async query API endpoint.
2. The OpenSearch server creates a special index (if it doesn't exist). This index is used to store async API
   requests along with some state information.
3. The OpenSearch server checks if the query is for an S3/Glue datasource. If it is not, then OpenSearch can
   handle the request on its own.
4. OpenSearch uses docker to start a new container to process queries for the current async API session.
5. OpenSearch returns the queryId and sessionId to the client.
6. The spark-submit docker container starts up.
7. The spark-submit docker container searches the index from step 2 for a query in the current session to run.
8. The spark-submit docker container creates a special OpenSearch index (if it doesn't exist). This index is
   used to store the results of the async API queries.
9. The spark-submit docker container looks up the table metadata from the `metastore` container.
10. The spark-submit docker container retrieves the data from the Minio container.
11. The spark-submit docker container writes the results to the OpenSearch index from step 8.
12. The client submits a request to the async query results API endpoint using the queryId from step 5.
13. OpenSearch returns the results to the client.
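The workflow above can be exercised from the host with curl. This is a hypothetical sketch: it assumes the async query endpoints documented for the OpenSearch SQL plugin (`_plugins/_async_query`), the `mys3` datasource, a `foo` table created as described earlier, and `OPENSEARCH_ADMIN_PASSWORD` exported in your shell.

```shell
# Submit an async query against the mys3 datasource and capture the queryId
QUERY_ID=$(curl -sk -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" \
  -X POST https://localhost:9200/_plugins/_async_query \
  -H 'Content-Type: application/json' \
  -d '{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo LIMIT 5"}' \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["queryId"])')

# Fetch the results once the spark-submit container has written them
curl -sk -u "admin:${OPENSEARCH_ADMIN_PASSWORD}" \
  "https://localhost:9200/_plugins/_async_query/${QUERY_ID}"
```

Note that the first call for a session starts the spark-submit container, so the results may take a little while to appear.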