Commit 9aae9bb

Added some documentation for the integration test docker cluster
Signed-off-by: Norman Jordan <norman.jordan@improving.com>
1 parent 5974486 commit 9aae9bb

File tree

8 files changed: +174 -4 lines changed

docker/integ-test/.env

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ SQL_APP_JAR=./spark-sql-application/target/scala-2.12/sql-job-assembly-0.7.0-SNA
 OPENSEARCH_NODE_MEMORY=512m
 OPENSEARCH_ADMIN_PASSWORD=C0rrecthorsebatterystaple.
 OPENSEARCH_PORT=9200
+OPENSEARCH_PA_PORT=9600
 OPENSEARCH_DASHBOARDS_PORT=5601
 S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
 S3_SECRET_KEY=5NK3StGvoGCLUWvbaGN0LBUf9N6sjE94PEzLdqwO

docker/integ-test/docker-compose.yml

Lines changed: 1 addition & 1 deletion

@@ -137,7 +137,7 @@ services:
         target: /var/run/docker.sock
     ports:
       - ${OPENSEARCH_PORT:-9200}:9200
-      - 9600:9600
+      - ${OPENSEARCH_PA_PORT:-9600}:9600
     expose:
       - "${OPENSEARCH_PORT:-9200}"
      - "9300"

docker/integ-test/spark/spark-master-entrypoint.sh

Lines changed: 3 additions & 0 deletions

@@ -1,5 +1,8 @@
 #!/bin/bash
 
+# Copyright OpenSearch Contributors
+# SPDX-License-Identifier: Apache-2.0
+
 function start_spark_connect() {
   sc_version=$(ls -1 /opt/bitnami/spark/jars/spark-core_*.jar | sed -e 's/^.*\/spark-core_//' -e 's/\.jar$//' -e 's/-/:/')

docs/docker/integ-test/README.md

Lines changed: 166 additions & 0 deletions (new file)

# Docker Cluster for Integration Testing

## Introduction

The docker cluster in `docker/integ-test` is designed to be used for integration testing. It supports the following
use cases:
1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint extension
   for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and querying
   S3/Glue datasources. A local container is run rather than using the AWS EMR service.

The cluster consists of several containers and handles configuring them. No tables are created.

## Overview

![Docker Containers](images/integ-test-containers.png "Docker Containers")

All containers run in a dedicated docker network.

### OpenSearch Dashboards

An OpenSearch Dashboards server that is connected to the OpenSearch server. It is exposed to the host OS,
so it can be accessed with a browser.

### OpenSearch

An OpenSearch server running in standalone mode. It is exposed to the host OS. It is configured with
an S3/Glue datasource named `mys3`. System indices and system indices permissions are disabled.

This container also has a docker volume used to persist data such as local indices.

### Spark

The Spark master node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark master also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

Spark Connect also runs in this container and can be used to easily issue queries. The port for
Spark Connect is exposed to the host OS.

### Spark Worker

The Spark worker node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark worker also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

### Spark Submit

A temporary container that runs queries for an Async API session. It is started by the OpenSearch container. It
does not connect to the Spark cluster and instead runs the queries locally. It keeps looking for more
queries to run until it reaches its timeout (3 minutes by default).

The Spark submit container is configured to use an external Hive metastore in the container `metastore`. The
Flint and PPL extensions are installed. When building the docker image, locally built Jar files can be used.

### Metastore (Hive)

A Hive server that is used as a metastore for the Spark containers. It is configured to use the Minio
container with the bucket `integ-test`.

This container also has a docker volume used to persist the metastore.

### Minio (S3)

A Minio server that acts as an S3 server. It is used as part of the workflow of executing an S3/Glue query
and contains the S3 tables' data.

This container also has a docker volume used to persist the S3 data.
### Configuration Updater

A temporary container that is used to configure the OpenSearch and Minio containers. It runs after both
of those have started up. For Minio, it adds the `integ-test` bucket and creates an access key. For
OpenSearch, it creates the S3/Glue datasource and applies a cluster configuration.
## Running the Cluster

To start the cluster, go to the directory `docker/integ-test` and use docker compose. When
starting the cluster, wait for the `spark-worker` container to finish starting up. It is the last container
to start.

Start the cluster in the foreground:
```shell
docker compose up
```

Start the cluster in the background:
```shell
docker compose up -d
```

Stop the cluster:
```shell
docker compose down
```

## Creating Tables in S3

Tables need to be created in Spark as external tables. Their location must be set to a path under `s3a://integ-test/`.
You can use `spark-shell` on the Spark master container to do this:
```shell
docker exec -it spark spark-shell
```

Example of creating a table and adding data:
```scala
spark.sql("CREATE EXTERNAL TABLE foo (id int, name varchar(100)) location 's3a://integ-test/foo'")
spark.sql("INSERT INTO foo (id, name) VALUES(1, 'Foo')")
```

## Configuration of the Cluster

There are several settings that can be adjusted for the cluster:

* SPARK_VERSION - the tag of the `bitnami/spark` docker image to use
* OPENSEARCH_VERSION - the tag of the `opensearchproject/opensearch` docker image to use
* DASHBOARDS_VERSION - the tag of the `opensearchproject/opensearch-dashboards` docker image to use
* MASTER_UI_PORT - port on the host OS to map to the master UI port (8080) of the Spark master
* MASTER_PORT - port on the host OS to map to the master port (7077) on the Spark master
* UI_PORT - port on the host OS to map to the UI port (4040) on the Spark master
* SPARK_CONNECT_PORT - port on the host OS to map to the Spark Connect port (15002) on the Spark master
* PPL_JAR - the relative path to the PPL extension Jar file. Must be within the base directory of this repository
* FLINT_JAR - the relative path to the Flint extension Jar file. Must be within the base directory of this
  repository
* SQL_APP_JAR - the relative path to the SQL application Jar file. Must be within the base directory of this
  repository
* OPENSEARCH_NODE_MEMORY - amount of memory to allocate for the OpenSearch server
* OPENSEARCH_ADMIN_PASSWORD - password for the admin user of the OpenSearch server
* OPENSEARCH_PORT - port on the host OS to map to port 9200 on the OpenSearch server
* OPENSEARCH_PA_PORT - port on the host OS to map to the performance analyzer port (9600) on the OpenSearch
  server
* OPENSEARCH_DASHBOARDS_PORT - port on the host OS to map to the OpenSearch Dashboards server
* S3_ACCESS_KEY - access key to create on the Minio container
* S3_SECRET_KEY - secret key to create on the Minio container
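Docker compose reads these variables from the `.env` file in `docker/integ-test`, so host-specific values can be kept there. A sketch with illustrative values (the ports shown are examples, not defaults):

```shell
# Illustrative .env overrides; any of the variables listed above can be set
# this way. Remapped host ports avoid clashing with a local OpenSearch/Spark.
MASTER_UI_PORT=18080
OPENSEARCH_PORT=19200
OPENSEARCH_PA_PORT=19600
OPENSEARCH_NODE_MEMORY=512m
```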
## Async API Overview

[Async API Interfaces](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst)

[Async API Documentation](https://opensearch.org/docs/latest/search-plugins/async/index/)

The Async API is able to query S3/Glue datasources. It does this by calling the AWS EMR service to run the
query in a docker container. The docker container uses Spark and is able to access the Glue catalog and
retrieve data from S3.

For the docker cluster, Minio is used in place of S3, and docker itself is used in place of AWS EMR.

![OpenSearch Async API Sequence Diagram](images/OpenSearch_Async_API.png "Sequence Diagram")

1. The client submits a request to the async_search API endpoint.
2. The OpenSearch server creates a special index (if it doesn't exist). This index is used to store async API requests
   along with some state information.
3. The OpenSearch server checks if the query is for an S3/Glue datasource. If it is not, then OpenSearch can handle
   the request on its own.
4. OpenSearch uses docker to start a new container to process queries for the current async API session.
5. OpenSearch returns the queryId and sessionId to the client.
6. The Spark submit docker container starts up.
7. The Spark submit docker container searches the index from step 2 for a query in the current session to run.
8. The Spark submit docker container creates a special OpenSearch index (if it doesn't exist). This index is used to
   store the results of the async API queries.
9. The Spark submit docker container looks up the table metadata from the `metastore` container.
10. The Spark submit docker container retrieves the data from the Minio container.
11. The Spark submit docker container writes the results to the OpenSearch index from step 8.
12. The client submits a request to the async_search results API endpoint using the queryId from step 5.
13. OpenSearch returns the results to the client.
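Steps 1 and 12 above can be exercised from the host OS. A hedged sketch: the `_plugins/_async_query` path follows the linked Async API documentation, the admin password is the one from the cluster's `.env`, and the query itself is illustrative:

```shell
# Request body for step 1; the datasource name mys3 matches the one created by
# the configuration container, the query and table are illustrative.
payload='{"datasource": "mys3", "lang": "sql", "query": "SELECT * FROM mys3.default.foo"}'

# Step 1 - submit the query (the response carries a queryId and sessionId):
#   curl -k -u admin:C0rrecthorsebatterystaple. -H 'Content-Type: application/json' \
#     -X POST 'https://localhost:9200/_plugins/_async_query' -d "$payload"
#
# Step 12 - fetch the results using the returned queryId:
#   curl -k -u admin:C0rrecthorsebatterystaple. \
#     'https://localhost:9200/_plugins/_async_query/<queryId>'
echo "$payload"
```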
docs/docker/integ-test/images: 2 binary image files added (79 KB and 47.2 KB)
Lines changed: 2 additions & 2 deletions

@@ -19,7 +19,7 @@ sbt clean
 sbt assembly
 ```
 
-Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
+Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.
 
 ## Using Docker Compose
 
@@ -65,7 +65,7 @@ spark.sql("INSERT INTO test_table (id, name) VALUES(2, 'Bar')")
 spark.sql("source=test_table | eval x = id + 5 | fields x, name").show()
 ```
 
-For further information, see the [Spark PPL Test Instructions](ppl-lang/local-spark-ppl-test-instruction.md)
+For further information, see the [Spark PPL Test Instructions](../ppl-lang/local-spark-ppl-test-instruction.md)
 
 ## Manual Setup
 
Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@ sbt clean
 sbt assembly
 ```
 
-Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
+Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.
 
 ## Using Docker Compose
 